
acquire intelligence

Site Reliability Engineer

  • Posted a day ago

Job Description

A successful Site Reliability Engineer will have:

Experience

• At least 3 years of hands-on experience running AWS production systems at scale
• Proven expertise with AWS EKS (Elastic Kubernetes Service) or similar and MSK (Managed Streaming for Apache Kafka) in production environments, as well as database performance diagnostics (MySQL, Postgres, MongoDB) on multi-TB databases
• Strong background in Infrastructure as Code, preferably with Pulumi using TypeScript, or equivalent Terraform experience
• Demonstrated experience participating in incident management (ideally as an incident commander with a track record of leading post-mortem processes)
• Experience with high-volume data processing systems, ideally IoT telemetry or streaming pipelines processing ≥50k messages per second
• Background in implementing and maintaining observability solutions using Prometheus, Grafana, PagerDuty, or similar tools
• Experience with CI/CD pipeline management and deployment automation using GitLab or similar platforms
• Exposure to hypervisors (VMware, Hyper-V), the Microsoft Server stack, SAN/NAS, L2/L3 networking, firewalls (Palo Alto), and switching (Aruba, Juniper) is considered advantageous

Technical Skills & Qualifications

• Bachelor's degree in computer science, engineering, or a related technical field, or equivalent practical experience
• Expert-level proficiency in TypeScript for production systems, including Node.js services, AWS Lambda functions, and operational tooling
• Deep understanding of the AWS services ecosystem, with particular expertise in container orchestration, messaging systems, and content delivery
• Strong networking fundamentals, including TCP/IP, DNS, TLS, HTTP, and container networking (CNI)
• Proficiency with monitoring and observability tools, including Prometheus, Grafana, and incident management platforms
• Experience with Infrastructure as Code tools, particularly Pulumi with TypeScript for comprehensive AWS resource management
• Understanding of security best practices, including least-privilege access, IAM policy management, and compliance frameworks

Behaviours

• Systems thinking – Able to understand complex distributed systems and identify potential failure points and optimization opportunities
• Automation-first mindset – Consistently seeks to eliminate manual processes and build scalable, repeatable solutions
• Incident leadership – Calm under pressure with strong communication skills during high-stress situations and post-incident analysis
• Collaborative approach – Works effectively with development teams to build reliability into systems from the ground up
• Continuous improvement focus – Proactively identifies opportunities for operational enhancement and drives them to completion
• Detail-oriented execution – Maintains high standards for documentation, monitoring, and operational procedures

Motivation and interests

• Founding influence – Join as one of the first SREs, where your tooling choices and operational processes become the organizational standard
• Protected focus time – Error-budget policy guarantees at least 10% of time dedicated to "make tomorrow better" reliability work
• Sustainable on-call – True follow-the-sun rotation with no permanent night shifts, only regional time zone coverage
• Professional growth – Opportunity to shape reliability engineering practices in a high-growth technology company

• Global impact – Your work directly enables thousands of

Job ID: 147726429
