Your Role
People & Team Leadership
- Lead, coach, and mentor IT engineers to build strong technical and leadership capabilities.
- Set clear performance goals aligned with our Beliefs, Vision, Mission, and Methods (BVMM).
- Conduct 1:1s, performance reviews, and career growth discussions.
- Foster a culture of ownership, collaboration, and continuous learning.
- Maintain balanced workloads, shift coverage, and clear succession plans to sustain healthy 24/7 operations.
Service Operations & Reliability
- Oversee daily service health, capacity, and reliability across all supported environments.
- Ensure compliance with operational KPIs through proactive planning and improvement.
- Balance demand vs. capacity and manage shift coverage to prevent burnout.
- Partner with engineering teams to maintain runbooks, knowledge bases, and escalation paths.
- Drive automation and workflow optimization to reduce manual overhead.
- Use data insights to guide decisions and improvements.
Incident & Problem Management
- Lead end-to-end incident response in real time, from triage and communication through resolution.
- Act as Incident Commander for high-impact events across a global environment.
- Track and improve metrics such as MTTD, MTTM, and MTTR.
- Champion blameless Post-Incident Reviews (PIRs) and translate learnings into long-term system and process improvements.
Strategic & Cross-Functional Impact
- Represent the team in customer reviews, operational syncs, and briefings.
- Collaborate with SREs, product owners, and partner engineers to align priorities and reliability goals.
- Contribute to frameworks and governance initiatives.
- Lead service onboarding and offboarding, and strengthen operational readiness checkpoints.
- Identify and close systemic operational gaps through process and tool improvements.
Your Qualifications
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related discipline.
- 3+ years of experience in Service Delivery, Incident Response, or Operations Leadership within enterprise-scale, 24/7 environments.
- Proven experience managing technical teams, driving performance, and leading through critical situations.
- Strong grounding in ITSM/ITIL principles (Incident & Problem Management).
- Familiarity with cloud, distributed systems, or enterprise infrastructure.
- Skilled in monitoring, alerting, and ticketing tools (e.g., PagerDuty, Datadog, Grafana, Splunk, ServiceNow).
Core Competencies
- People and Performance Leadership
- Incident Command and Escalation Management
- Analytical and Problem-Solving Skills
- Communication and Decision-Making Under Pressure
- Root Cause and Post-Incident Analysis
- Operational Planning and Service Governance
- Stakeholder and Partner Management
- IT Service Management (Incident & Problem Management)
- Observability, Monitoring, and Automation Tools
- Passion for People Development, Operational Discipline, and Continuous Improvement
Plus points if you have:
- ITIL v3 or ITIL 4 certification
- AWS Certified SysOps Administrator
- SRE Foundation or Crisis/Incident Management certifications
- Background in SRE practices and operational frameworks that promote reliability and automation