Job Description
- Handle service monitoring and incident response, and drive technical support efficiency.
- Responsible for managing and maintaining network monitoring tools, systems, and
processes that ensure the availability, scalability, and performance of our production
environments.
- Responsible for incident handling, service monitoring, and technical support efficiency.
- Work closely with developers, DevOps, infrastructure teams, and other stakeholders
to achieve proactive incident prevention, issue resolution, and incident documentation.
Key Responsibilities
- Ensure that all tickets are updated and handled based on set KPIs and SLAs.
- Manage monitoring, alerting, and logging tools to ensure system health and service
uptime.
- Ensure early detection, triage, and escalation of service degradation based on the defined
service level agreements.
- Trigger L2 ticket handling and on-call rotations for critical incidents.
- Execute the triage, diagnosis, and resolution of incidents required for L3 escalations to both
internal and third-party support teams.
- Support major incident response, contribute to root cause analysis (RCA), and help
document postmortems.
- Track, analyze, and act on incident trends and recurring technical issues.
- Use data from ticketing systems (Jira, ServiceNow, etc.) to improve team responsiveness
and resolution quality.
- Update and maintain SOPs, runbooks, and knowledge base articles including the
documentation of known issues, fixes, and playbooks to improve mean time to resolution.
- Collaborate with development and QA teams to improve deployment readiness and
reliability.
- Participate in technical competency mapping to ensure coverage and reduce unnecessary
escalations.
Minimum Qualifications
Qualifications and Experience:
- Bachelor's degree in Electronics Engineering, Information Technology, Computer
Science, Management Information Systems, or equivalent.
- Minimum of 3 years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
- Hands-on experience with monitoring tools (e.g., Prometheus, Grafana, ELK, or Datadog).
- Familiarity with incident response and troubleshooting in production systems.
- Experience with at least one cloud platform (AWS, GCP, or Azure).
- Knowledgeable in scripting (e.g., Python, Bash) and Linux systems.
- Exposure to ITIL-based processes, especially Incident and Problem Management.
- Experience working in fintech, banking, or SaaS with high availability SLAs.
- Familiarity with DevOps practices, CI/CD pipelines, and cloud-based monitoring tools.
- Experience with automation platforms.
- Knowledge of BSP regulatory frameworks, policies, and guidelines.