Maya

Senior Site Reliability Engineer

  • Posted 14 hours ago

Job Description

NATURE OF WORK

  • Lead architectural design and implementation of fault-tolerant, self-healing infrastructure across cloud and hybrid environments
  • Drive organization-wide automation initiatives, eliminating manual operations through advanced IaC and CI/CD frameworks
  • Own technical program leadership for reliability initiatives spanning multiple teams and services
  • Manage OPEX and CAPEX budgets strategically, with accountability for cost optimization
  • Apply deep expertise in compliance frameworks (CIS, PCI-DSS, BSP) to architect compliant solutions
  • Establish and enforce cloud governance policies, account structures, and organizational standards across AWS/Azure/GCP environments

DISPLAYED SKILL MASTERY

  • Architect and implement advanced CI/CD pipelines with progressive delivery patterns (canary, blue-green)
  • Design and maintain enterprise-grade Infrastructure-as-Code modules with reusability and governance
  • Lead complex incident resolution and conduct deep-dive root cause analysis
  • Drive adoption of emerging technologies and reliability patterns across engineering teams
  • Mentor Senior SREs on architectural decisions and reliability best practices
  • Design and implement cloud landing zones, multi-account strategies, and policy-as-code frameworks
  • Build comprehensive SLI/SLO frameworks with automated alerting, error budget tracking, and burn rate analysis
  • Correlate metrics across distributed systems using APM, distributed tracing, and custom dashboards

REQUIRED QUALIFICATIONS

  • Expert-level proficiency in Kubernetes (CRDs, Operators, multi-tenancy, advanced scheduling)
  • Advanced Terraform expertise (custom providers, module design, automated testing)
  • Deep Service Mesh knowledge (Istio traffic management, circuit breaking, rate limiting, mTLS)
  • Proven experience building Internal Developer Platforms (IDP) with self-service workflows
  • Advanced GitLab CI/CD and GitOps implementation (ArgoCD/FluxCD, multi-project pipelines)
  • Expert-level implementation of WAF, API gateways (Kong, Apigee, AWS API Gateway), and network security
  • Strong software development skills in Go, Python, or Java with ability to review code for reliability impact
  • Experience leading technical programs and cross-functional reliability initiatives
  • Deep understanding of observability platforms (Dynatrace, Prometheus, OpenTelemetry) with custom integration experience
  • Proven track record architecting microservices with high-availability and resiliency patterns
  • Experience implementing AWS Organizations, Control Tower, Service Control Policies, and multi-account governance frameworks
  • Proficiency in cloud policy-as-code tools (AWS Config, OPA, Sentinel) and compliance automation
  • Knowledge of cloud security standards (CIS Benchmarks, AWS Well-Architected Framework, Azure/GCP best practices)
  • Advanced expertise in Dynatrace, Datadog, or Grafana for building enterprise observability solutions
  • Experience implementing SLO-based alerting, error budgets, and burn rate monitoring using Prometheus, Grafana, or commercial APM tools
  • Proficiency in distributed tracing (Jaeger, Zipkin, OpenTelemetry) and log aggregation (ELK, Loki)
  • Ability to design custom metrics, synthetic monitoring, and real user monitoring (RUM) strategies


Job ID: 146621243
