Site Reliability Engineer (SRE)

Acquire Intelligence

Taguig, Philippines

Fresher

Posted 20 hours ago

Job Description

We're an award-winning global outsourcer providing contact center and back office services on behalf of our global clients. Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Role objective

The Site Reliability Engineer serves as the guardian of our production systems, ensuring the reliability, scalability, and performance of our IoT telemetry platform. You will define and enforce Service Level Objectives (SLOs), automate operational processes, and build the infrastructure and tooling that enables our engineering teams to deploy with confidence. By implementing comprehensive monitoring, incident response procedures, and reliability practices, you will play a pivotal role in maintaining the uptime and data freshness that our customers depend on for their critical fleet operations.

The Role Will Focus On The Following Key Areas

SLO Management

Infrastructure Automation

Incident Response

Security & Compliance

Key Responsibilities

Responsibilities of the Site Reliability Engineer will include but are not limited to:

Service Level Management & Reliability

Define, monitor, and enforce Service Level Objectives (SLOs) and error budgets across all production systems
Track error budget burn rates and make data-driven decisions to halt risky deployments when thresholds are exceeded
Implement comprehensive monitoring and alerting strategies using Prometheus, Grafana, and PagerDuty
Establish and maintain reliability standards that support business-critical uptime
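
For candidates unfamiliar with the term, the burn-rate check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the team's actual tooling: the names `SloWindow`, `burnRate`, and `shouldHaltDeployments`, the 99.9% target, and the threshold of 2 are all assumptions made for the example.

```typescript
// Hypothetical types and helpers for illustration only.
interface SloWindow {
  sloTarget: number;      // assumed SLO, e.g. 0.999 for 99.9% success
  totalRequests: number;  // requests observed in the window
  failedRequests: number; // requests that counted against the SLO
}

// Burn rate = observed error rate / error rate the SLO allows.
// A value of 1 means the error budget is being spent at exactly the
// pace that would exhaust it at the end of the SLO period.
function burnRate(w: SloWindow): number {
  if (w.sloTarget <= 0 || w.sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  if (w.totalRequests === 0) {
    return 0; // no traffic, so no budget consumed
  }
  const allowedErrorRate = 1 - w.sloTarget;
  const observedErrorRate = w.failedRequests / w.totalRequests;
  return observedErrorRate / allowedErrorRate;
}

// Deployments are halted when the burn rate exceeds the chosen threshold.
function shouldHaltDeployments(w: SloWindow, threshold: number): boolean {
  return burnRate(w) > threshold;
}

// Example: 50 failures out of 10,000 requests against a 99.9% SLO
// gives a burn rate of roughly 5, which is above the assumed threshold of 2.
const lastHour: SloWindow = {
  sloTarget: 0.999,
  totalRequests: 10_000,
  failedRequests: 50,
};
console.log(burnRate(lastHour).toFixed(2), shouldHaltDeployments(lastHour, 2));
```

In practice the request counts would likely come from a monitoring system such as Prometheus, and the threshold would depend on the alerting window chosen.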

Requirements

Infrastructure Automation & Management

Design and implement Infrastructure as Code (IaC) solutions using Pulumi with TypeScript
Manage and optimize AWS services including EKS (Elastic Kubernetes Service), MSK (Managed Streaming for Apache Kafka), SingleStore, MongoDB, and S3
Automate operational processes to eliminate toil, targeting any task that consumes more than 2 engineer-days per quarter

Incident Response & Post-Mortem Leadership

Serve as incident commander during production outages and service degradations
Lead comprehensive post-mortem processes within 48 hours of incidents
Drive never-again corrective actions to completion, ensuring systemic improvements
Maintain and improve incident response procedures and runbooks

Security & Compliance

Implement and enforce least-privilege IAM policies across all AWS resources
Manage security patch pipelines and vulnerability remediation processes
Support compliance initiatives including SOC2 and ISO 27001 certification requirements
Ensure security best practices are embedded in all infrastructure and operational procedures

On-Call & Operational Excellence