Serve as a Subject Matter Expert (SME) for large-scale infrastructure operations, sharing expertise, documenting best practices, and conducting root-cause analyses for high-impact or recurring incidents.
Lead incident management, response coordination, troubleshooting, and proactive customer communication during system outages and production incidents.
Facilitate regular sync-up meetings with stakeholders to communicate updates, clarify issues, and gather customer feedback.
Analyze and report operational metrics to drive informed decision-making and continuous process improvements.
Develop and enhance operational tools and automated solutions to increase efficiency and reduce operational overhead.
Document comprehensive operational procedures, configurations, and environment setups.
Identify and eliminate operational toil by automating repetitive tasks and optimizing processes.
Train and mentor junior engineers across your areas of expertise.
Participate in a 24x7 shift rotation.
Your Qualifications
Bachelor's degree in Information Technology, Engineering, or a related field.
Minimum of 5 years of experience supporting critical, high-availability production environments with a strong focus on automation and operational improvements.
Minimum of 5 years of experience with at least 1-2 tools per domain:
Linux Systems Administration: RHEL, CentOS, Ubuntu, or similar Unix-based OS
Relevant certifications in key skills (e.g., CKA, CKAD, AWS Certified)
Experience working in collaborative, cross-functional teams within structured processes that follow modern DevOps practices and workflows.
Proven ability to drive operational efficiency through automation, using languages such as Bash and Python to streamline workflows, reduce manual toil, and improve system reliability.