
Tech Mahindra represents the connected world, offering innovative and customer-centric information technology experiences, enabling Enterprises, Associates, and the Society to Rise. It has 150,000+ professionals working for 1000+ Global Customers (including Fortune 500 companies) in 90 Countries. We're part of the esteemed Mahindra group, headquartered in India. Under a new CEO, Tech Mahindra is committed to a transformative journey with Scale @ Speed as our guiding principle.
Job description:
Site Reliability Engineering (SRE) combines software and systems engineering with the art of
machine learning to build and run large-scale, massively distributed, and fault-tolerant systems. You will have
the opportunity to sharpen your expertise in coding, performance analysis, and large-scale system design
while making a tangible impact on the future of Infrastructure services and AML systems.
Responsibilities
• Design, build, and maintain highly available, scalable, and fault-tolerant systems. Collaborate with
software engineering teams to ensure applications are designed with reliability and performance in
mind.
• Develop and maintain automation procedures to maximize system efficiency, minimize human
intervention, and optimize routine tasks.
• Monitor and analyze system performance to identify and address bottlenecks before they impact
users. Ensure the infrastructure can handle rapid growth in web traffic and ML data processing.
• Participate in 24/7 on-call rotations (including scheduled shifts and holidays). Practice sustainable on-call
response, conduct root-cause analysis, and lead blameless post-mortems to prevent recurrence.
• Implement monitoring based on SLIs, SLOs, and SLAs, and set up automated alerting and metrics to track
system health and performance.
• Implement and maintain security best practices and ensure all systems meet regulatory requirements.
Job Requirements
Minimum Qualifications:
• Education: Bachelor's or Master's degree in Computer Science, Information Technology, Computer
Engineering, or a related field.
• Experience: 3+ years of experience as a Site Reliability Engineer, Systems Engineer, or Software
Engineer.
• Coding: Proficient in at least one high-level programming language (e.g., Python, Go, C++, or Java)
and shell scripting. Strong understanding of data structures and algorithms.
• Systems: Strong understanding of Linux operating systems and open-source technologies, and a
solid understanding of network architecture.
• Databases: Competent knowledge of relational database systems and database modeling.
Preferred Qualifications:
• Experience with containers and container orchestration platforms such as Docker and Kubernetes.
• Proficiency in or exposure to machine learning frameworks such as TensorFlow, PyTorch, MXNet, or
PaddlePaddle.
• Hands-on experience with monitoring tools and methodologies (e.g., Prometheus, Grafana).
• Soft Skills: Strategic thinking, exceptional communication, and the ability to collaborate effectively with
cross-functional teams in a fast-paced environment.
Job ID: 147182831