Yondu, Inc.

Site Reliability Engineer

  • Posted 10 days ago

Job Description

  • Handle service monitoring, incident response, and drive technical support efficiency
  • Responsible for managing and maintaining network monitoring tools, systems, and processes that ensure the availability, scalability, and performance of our production environments.

  • Responsible for incident handling, service monitoring, and technical support efficiency.
  • Work closely with developers, DevOps, infrastructure teams, and other stakeholders to achieve proactive incident prevention, issue resolution, and incident documentation.

Key Responsibilities

  • Ensure that all tickets are updated and handled according to the set KPIs and SLAs.
  • Manage monitoring, alerting, and logging tools to ensure system health and service uptime.

  • Ensure early detection, triage, and escalation of service degradation based on defined service level agreements.

  • Trigger L2 ticket handling and on-call rotations for critical incidents.
  • Execute triage, diagnosis, and resolution of incidents required for L3 escalations, with both internal and third-party support teams.

  • Support major incident response, contribute to root cause analysis (RCA), and help document postmortems.

  • Track, analyze, and act on incident trends and recurring technical issues.
  • Use data from ticketing systems (Jira, ServiceNow, etc.) to improve team responsiveness and resolution quality.

  • Update and maintain SOPs, runbooks, and knowledge base articles, including the documentation of known issues, fixes, and playbooks, to improve mean time to resolution.

  • Collaborate with development and QA teams to improve deployment readiness and reliability.
  • Participate in technical competency mapping to ensure coverage and reduce unnecessary escalations.

Minimum Qualifications

Qualifications and Experience:

  • Bachelor's degree in Electronics Engineering, Information Technology, Computer Science, Management Information Systems, or equivalent.

  • Minimum of 3 years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles is required.
  • Hands-on experience with monitoring tools (e.g., Prometheus, Grafana, ELK, or Datadog).
  • Familiarity with incident response and troubleshooting in production systems.
  • Experience with at least one cloud platform (AWS, GCP, or Azure).
  • Knowledgeable in scripting (e.g., Python, Bash) and Linux systems.
  • Exposure to ITIL-based processes, especially Incident and Problem Management.
  • Experience working in fintech, banking, or SaaS with high availability SLAs.
  • Familiarity with DevOps practices, CI/CD pipelines, and cloud-based monitoring tools.
  • Experience with automation platforms.
  • Knowledge of BSP regulatory frameworks, policies, and guidelines.

Job ID: 134805711