  • Posted 10 days ago

Job Description

Job Title: Site Reliability Engineering (SRE) Subject Matter Expert (SME)

Overview

We're looking for an experienced SRE Subject Matter Expert (SME) to lead our reliability, performance, and automation initiatives. This role will design and drive best-in-class observability, performance engineering, AIOps, and reliability practices to ensure our systems are stable, scalable, and efficient.

The ideal candidate is both hands-on and strategic: able to solve technical problems, mentor teams, and influence company-wide engineering decisions.

Key Responsibilities

1. Observability & Monitoring

  • Build and manage observability frameworks across logs, metrics, traces, and events.
  • Design and maintain monitoring tools (e.g., Prometheus, Grafana, ELK, Splunk, Datadog, Dynatrace, New Relic) for better system insights.
  • Define and track SLOs, SLIs, and error budgets with product and engineering teams.
  • Enable proactive incident detection and root cause analysis.

2. Performance Engineering

  • Lead load, stress, and scalability testing for applications and infrastructure.
  • Create performance models and capacity plans for critical systems.
  • Work closely with developers to find and fix performance bottlenecks.

3. Reliability Engineering

  • Automate incident response, disaster recovery, and self-healing systems.
  • Lead Chaos Engineering and resilience testing.
  • Promote a blameless postmortem culture and drive reliability reviews.
  • Ensure all systems follow best practices for fault tolerance and high availability.

4. AIOps & Automation

  • Define and implement the AIOps strategy using ML/AI to improve observability and response.
  • Use anomaly detection, event correlation, and predictive analytics for proactive issue resolution.
  • Integrate AIOps tools with ITSM systems for smarter alerting and automated remediation.

5. Leadership & Enablement

  • Act as a thought leader and mentor for SRE practices across teams.
  • Collaborate with engineering, infrastructure, and business units to embed SRE principles company-wide.
  • Champion a continuous improvement culture focused on availability, scalability, and operational excellence.

Required Qualifications

  • 10+ years in IT Operations, Reliability, or Performance Engineering.
  • Deep expertise in observability and monitoring tools (Prometheus, Grafana, Splunk, Datadog, Dynatrace, ELK, etc.).
  • Strong experience with performance testing tools (JMeter, LoadRunner, Gatling, k6, etc.) and capacity planning.
  • Hands-on experience with AWS, Azure, or GCP and container platforms (Kubernetes, Docker, OpenShift).
  • Skilled in automation (Terraform, Ansible, Python, Go, Shell scripting).
  • Familiar with AIOps tools (Moogsoft, BigPanda, Dynatrace Davis AI, ServiceNow AIOps).
  • Strong understanding of distributed systems, networking, CI/CD, and DevOps.

Preferred Qualifications

  • Experience leading enterprise-wide SRE or observability transformations.
  • Knowledge of Chaos Engineering tools (Gremlin, Chaos Mesh, Litmus).
  • Familiarity with ITSM/ITIL and modern incident management.
  • Excellent communication and stakeholder management, including executive-level influence.
  • Certifications in Google SRE, AWS DevOps, Azure SRE, or Datadog/Dynatrace (a plus).

More Info

Job ID: 134910891