This is a remote position.
Core Expertise
- SRE Foundations & Practices
  - Deep understanding of SRE principles (SLIs, SLOs, error budgets, toil reduction, reliability vs. velocity trade-offs).
  - Proven experience driving SRE adoption and culture change across teams and applications.
  - Strong knowledge of incident management, on-call practices, and blameless postmortems.
- Cloud & Infrastructure
  - 5+ years of experience with Google Cloud Platform (GCP) services.
  - Solid expertise with Kubernetes, including scaling, workload optimization, network policies, service mesh, and troubleshooting.
  - Experience with infrastructure as code.
- Reliability & Observability
  - Strong knowledge of monitoring, logging, and tracing.
  - Proven ability to design and implement alerting strategies aligned with SLOs/SLIs.
  - Hands-on experience optimizing application performance, resiliency, and cost efficiency in cloud-native environments.
- Automation & Tooling
  - Proficiency in at least one modern programming language (preferably Python) for automation, reliability tooling, and operational improvements.
  - Familiarity with CI/CD pipelines and release engineering best practices.
  - Expertise in automating reliability tasks, reducing toil, and scaling best practices across multiple applications.
Leadership & Collaboration
- Ability to evangelize SRE best practices and influence engineering/product teams in adopting them.
- Experience mentoring engineers and establishing communities of practice around reliability.
- Strong stakeholder management skills to balance product delivery goals with reliability requirements.
- Excellent communication skills.
Requirements
Preferred Qualifications
- Hands-on experience migrating applications to SRE operating models in multi-team/multi-application settings.
- Certification(s): Google Cloud Professional DevOps Engineer, Kubernetes CKA/CKS, or equivalent.
Benefits
Full-time employment with competitive salary and benefits
Medical, dental, and vision insurance coverage