Serve as Subject Matter Expert (SME) for distributed applications on hybrid cloud platforms, documenting best practices and providing guidance to peers.
Champion continuous operational improvements informed by metrics analysis and customer feedback.
Lead incident management, troubleshooting, and response coordination, and conduct comprehensive post-incident reviews.
Clearly communicate complex technical issues to development teams, document root causes, and collaborate internally to create robust solutions.
Manage, deploy, and maintain enterprise applications and cloud-based systems using secure, scalable, and reliable frameworks.
Proactively monitor, troubleshoot, and optimize the health, performance, and reliability of applications and platforms.
Perform detailed log analysis and utilize stack traces to debug and resolve issues reported by partners and end-users.
Develop comprehensive documentation covering operational procedures, system configurations, and environment setups.
Continuously identify and implement automation opportunities to reduce manual tasks and operational overhead.
Train and mentor junior engineers across your areas of expertise.
Participate in a 24x7 shift rotation.
Your Qualifications
Bachelor's degree in Information Technology, Engineering, or a related technical field.
Minimum 5 years of experience supporting critical, high-availability production systems with a focus on automation, reliability, and operational excellence.
At least 5 years of hands-on experience with at least 12 tools per domain:
Linux Administration & Troubleshooting: RHEL, CentOS, Ubuntu, or similar Unix-based OS.
Distributed Applications: Microservices architecture and distributed application support.