Senior Site Reliability Engineer

Maya

Philippines

Fresher

Posted 14 hours ago

Job Description

NATURE OF WORK

Lead the architectural design and implementation of fault-tolerant, self-healing infrastructure across cloud and hybrid environments
Drive organization-wide automation initiatives, eliminating manual operations through advanced IaC and CI/CD frameworks
Own technical program leadership for reliability initiatives spanning multiple teams and services
Manage OPEX and CAPEX budgets strategically, with accountability for cost optimization
Apply deep expertise in compliance frameworks (CIS, PCI DSS, BSP) to architect compliant solutions
Establish and enforce cloud governance policies, account structures, and organizational standards across AWS/Azure/GCP environments

DISPLAYED SKILL MASTERY

Architect and implement advanced CI/CD pipelines with progressive delivery patterns (canary, blue-green)
Design and maintain enterprise-grade Infrastructure-as-Code modules with reusability and governance
Lead complex incident resolution and conduct deep-dive root cause analysis
Drive adoption of emerging technologies and reliability patterns across engineering teams
Mentor Senior SREs on architectural decisions and reliability best practices
Design and implement cloud landing zones, multi-account strategies, and policy-as-code frameworks
Build comprehensive SLI/SLO frameworks with automated alerting, error budget tracking, and burn rate analysis
Correlate metrics across distributed systems using APM, distributed tracing, and custom dashboards

REQUIRED QUALIFICATIONS

Expert-level proficiency in Kubernetes (CRDs, Operators, multi-tenancy, advanced scheduling)
Advanced Terraform expertise (custom providers, module design, automated testing)
Deep Service Mesh knowledge (Istio traffic management, circuit breaking, rate limiting, mTLS)
Proven experience building Internal Developer Platforms (IDP) with self-service workflows
Advanced GitLab CI/CD and GitOps implementation (ArgoCD/FluxCD, multi-project pipelines)
Expert-level implementation of WAF, API gateways (Kong, Apigee, AWS API Gateway), and network security
Strong software development skills in Go, Python, or Java, with the ability to review code for reliability impact
Experience leading technical programs and cross-functional reliability initiatives
Deep understanding of observability platforms (Dynatrace, Prometheus, OpenTelemetry) with custom integration experience
Proven track record architecting microservices with high-availability and resiliency patterns
Experience implementing AWS Organizations, Control Tower, Service Control Policies, and multi-account governance frameworks
Proficiency in cloud policy-as-code tools (AWS Config, OPA, Sentinel) and compliance automation
Knowledge of cloud security standards (CIS Benchmarks, AWS Well-Architected Framework, Azure/GCP best practices)
Advanced expertise in Dynatrace, Datadog, or Grafana for building enterprise observability solutions
Experience implementing SLO-based alerting, error budgets, and burn rate monitoring using Prometheus, Grafana, or commercial APM tools
Proficiency in distributed tracing (Jaeger, Zipkin, OpenTelemetry) and log aggregation (ELK, Loki)
Ability to design custom metrics, synthetic monitoring, and real user monitoring (RUM) strategies