Head of Site Reliability Engineering

acquire intelligence

Taguig, Philippines

Fresher

Posted 3 hours ago

Job Description

We're an award-winning global outsourcer providing contact center and back office services on behalf of our global clients. Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Role objective

The Head of Site Reliability Engineering is a hybrid technical‑leadership role. You will:

Own reliability of production services running on AWS while steering the roadmap for platform resilience and building out the SRE team.
Lead and grow a remote team of SREs—coaching, hiring, performance‑managing, and fostering a blameless culture.
Set and enforce Service Level Objectives (SLOs), error budgets, and incident response processes.
Drive automation via Infrastructure‑as‑Code (Pulumi / TypeScript), CI/CD, and observability pipelines.
Represent the SRE discipline to product, engineering, and senior leadership across our global business.
Take a hands-on part in monitoring and incident response, which will be critical as the team grows.

This role offers the opportunity to build reliability engineering from the ground up for a mission-critical IoT platform.

Key Responsibilities

Leadership & People Management

Build an SRE team, initially of 3-6 engineers, and manage it through goal setting, career development, regular 1:1s, and annual performance reviews.
Ensure operational system knowledge is documented and that the team stays current on operating and troubleshooting procedures.
Recruit, onboard, and mentor new engineers; scale the team to meet business growth.
Maintain an inclusive, psychologically‑safe culture centred on learning and continuous improvement.
Own, and participate in, the on‑call roster for the team, ensuring equitable rotations and sustainable workloads.

Service Level Management & Reliability

Define, monitor, and enforce SLOs and error budgets across all production systems.
Continuously analyse error‑budget burn to halt risky deployments and guide capacity decisions.
Champion a data‑driven reliability mindset throughout engineering and product teams.

Infrastructure Automation & Management

Architect and implement Infrastructure‑as‑Code in Pulumi/TypeScript for AWS‑hosted infrastructure (EKS, MSK, SingleStore, MongoDB, S3, etc.).
Lead large‑scale migration or modernisation projects (e.g., Kubernetes upgrades, multi‑AZ resilience).
Eliminate toil: any manual task that takes more than 2 engineer‑days per quarter, or that is frequently repeated, becomes an automation candidate.

Incident Response & Post‑Mortem Leadership

Participate in the on-call monitoring and response roster.
Serve as an escalation point and incident commander.
Ensure post‑mortems are published within 48 hours, with actionable "never again" tasks tracked to closure.
Improve runbooks and game‑day exercises; train engineers on incident command principles.

Security & Compliance

Enforce least‑privilege IAM policies and champion DevSecOps practices.
Contribute to SOC 2 & ISO 27001 evidence collection and continuous control monitoring.
Oversee security patch pipelines, vulnerability management, and secrets hygiene.

Operational Excellence & Continuous Improvement