Senior Site Reliability Engineer

4-6 Years

Job Description

Role Description

Apply SRE principles to ensure the reliability, availability, scalability, and performance of production systems
Design, implement, and maintain automation and Infrastructure as Code to reduce operational toil and manual intervention
Operate and optimize services in AWS and containerized environments (EKS/ECS)
Ensure the platform aligns with compliance requirements
Build and operate CI/CD pipelines using GitLab
Define and implement Service Level Objectives (SLOs) and error budgets
Implement and maintain observability solutions including metrics, logs, and traces to proactively detect and diagnose system issues
Contribute to incident response, including triage, mitigation, root cause analysis (RCA), and post-incident reviews
Identify systemic reliability risks, performance bottlenecks, and capacity constraints; collaborate with the team to address them
Work closely with developers to ensure systems are designed for operability, resilience, and maintainability
Perform performance testing, capacity planning, and availability analysis to support system growth and scaling
Continuously evaluate and improve tooling related to reliability, monitoring, alerting, and cost efficiency
Document operational knowledge, runbooks, and best practices to improve operational readiness

Qualifications