
A successful Site Reliability Engineer will have:
Experience
• At least 3 years of hands-on experience running AWS production systems at
scale
• Proven expertise with AWS EKS (Elastic Kubernetes Service) or similar and MSK
(Managed Streaming for Apache Kafka) in production environments, as well as database
performance diagnostics (MySQL, Postgres, MongoDB) on multi-TB databases
• Strong background in Infrastructure as Code, preferably with Pulumi using
TypeScript or equivalent Terraform experience
• Demonstrated experience participating in incident management (ideally as an
incident commander with a track record of leading post-mortem processes)
• Experience with high-volume data processing systems, ideally IoT telemetry or
streaming pipelines processing ≥50k messages per second
• Background in implementing and maintaining observability solutions using
Prometheus, Grafana, PagerDuty, or similar tools
• Experience with CI/CD pipeline management and deployment automation using
GitLab or similar platforms
• Exposure to hypervisors (VMware, Hyper-V), the Microsoft Server stack, SAN/NAS, L2/L3
networking, firewalls (Palo Alto), and switching (Aruba, Juniper) is considered
advantageous
Technical Skills & Qualifications
• Bachelor's degree in computer science, engineering, or related technical field, or
equivalent practical experience
• Expert-level proficiency in TypeScript for production systems, including Node.js
services, AWS Lambda functions, and operational tooling
• Deep understanding of AWS services ecosystem, with particular expertise in
container orchestration, messaging systems, and content delivery
• Strong networking fundamentals including TCP/IP, DNS, TLS, HTTP protocols, and
container networking (CNI)
• Proficiency with monitoring and observability tools including Prometheus, Grafana,
and incident management platforms
• Experience with Infrastructure as Code tools, particularly Pulumi with TypeScript for
comprehensive AWS resource management
• Understanding of security best practices including least-privilege access, IAM policy
management, and compliance frameworks
Behaviours
• Systems thinking – Able to understand complex distributed systems and identify
potential failure points and optimization opportunities
• Automation-first mindset – Consistently seeks to eliminate manual processes and
build scalable, repeatable solutions
• Incident leadership – Calm under pressure with strong communication skills during
high-stress situations and post-incident analysis
• Collaborative approach – Works effectively with development teams to build
reliability into systems from the ground up
• Continuous improvement focus – Proactively identifies opportunities for operational
enhancement and drives them to completion
• Detail-oriented execution – Maintains high standards for documentation,
monitoring, and operational procedures.
Motivation and interests
• Founding influence – Join as one of the first SREs, where your tooling choices and
operational processes become the organizational standard
• Protected focus time – Error-budget policy guarantees at least 10% of time
dedicated to "make tomorrow better" reliability work
• Sustainable on-call – True follow-the-sun rotation with no permanent night shifts,
only regional time zone coverage
• Professional growth – Opportunity to shape reliability engineering practices in a
high-growth technology company
• Global impact – Your work directly enables thousands of
Job ID: 147726429
Skills:
Logging, Networking, Datadog, EC2, ECS, IAM, Azure, AWS, cloud security principles, Linux systems administration, Monitoring, alerting, EKS, cloud cost optimization, observability platforms