Site Reliability Engineer

2-5 Years

Job Description

Handle service monitoring, incident response, and drive technical support efficiency
Responsible for managing and maintaining network monitoring tools, systems, and processes that ensure the availability, scalability, and performance of our production environments.

Responsible for incident handling, service monitoring, and technical support efficiency.
Work closely with developers, DevOps, infrastructure teams, and other stakeholders to achieve proactive incident prevention, issue resolution, and incident documentation.

Key Responsibilities

Ensure that all tickets are updated and handled according to set KPIs and SLAs.
Manage monitoring, alerting, and logging tools to ensure system health and service uptime.

Ensure early detection, triage, and escalation of service degradation based on the defined service level agreement.

Trigger L2 ticket handling and on-call rotations for critical incidents.
Execute triage, diagnosis, and resolution of incidents required for L3 escalations, for both internal and third-party support teams.

Support major incident response, contribute to root cause analysis (RCA), and help document postmortems.

Track, analyze, and act on incident trends and recurring technical issues.
Use data from ticketing systems (Jira, ServiceNow, etc.) to improve team responsiveness and resolution quality.

documentation of known issues, fixes, and playbooks to improve mean time to resolution.

reliability

Participate in technical competency mapping to ensure coverage and reduce unnecessary escalations.

Minimum Qualifications

Qualifications and Experience:

Science, Management Information Systems, or equivalent.

2-5 years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
A minimum of 3 years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles is required.
Hands-on experience with monitoring tools (e.g., Prometheus, Grafana, ELK, or Datadog).
Familiarity with incident response and troubleshooting in production systems.
Experience with at least one cloud platform (AWS, GCP, or Azure).
Knowledgeable in scripting (e.g., Python, Bash) and Linux systems.
Exposure to ITIL-based processes, especially Incident and Problem Management.
Experience working in fintech, banking, or SaaS with high availability SLAs.
Familiarity with DevOps practices, CI/CD pipelines, and cloud-based monitoring tools.
Experience with automation platforms
Knowledge of BSP regulatory frameworks, policies, and guidelines.