Cebu Pacific Air

Senior Site Reliability Engineer

4-8 Years
  • Posted 6 hours ago

Job Description

Department

Digital & Technology Office

Employee Type

Probationary

The Senior Site Reliability Engineer will serve as the first line of defense for our 24/7 operations. You will act as the guardian of our production environment, utilizing Dynatrace to maintain a holistic view of both Infrastructure and Application health.

You will not just monitor uptime; you will actively test system resilience, manage major incidents, and facilitate stability reporting. You will be the primary notification point for all P1/P2 incidents, responsible for deep-dive triage, quick remediation, and coordinating Major Incident Management (MIM).

Key Responsibilities

24/7 Incident Command & Alerting

  • 24/7 Availability: Participate in a shift rotation or on-call schedule to ensure continuous coverage. You are the eyes on glass for the organization.
  • Unified Alerting: Manage the notification workflow. Ensure that Critical Alerts for both Infrastructure failures and Application failures trigger immediate notifications to the 24/7 team.
  • Major Incident Management (MIM): Lead the technical response during critical outages. Coordinate cross-functional teams to restore service rapidly.

Observability Strategy (Dynatrace Focus)

  • Dynatrace Administration: Act as the Subject Matter Expert (SME) for our Dynatrace implementation.
      • Configure Management Zones, Alerting Profiles, and Dashboards to provide a Single Pane of Glass.
      • Utilize Dynatrace PurePath for distributed tracing to identify bottlenecks in microservices.
      • Leverage Davis AI to automatically detect anomalies and reduce alert noise.
  • Comprehensive Monitoring Scope:
      • Network Health: Monitor VPN Tunnel status, Load Balancer (ALB/NLB) health, and DNS latency. Trigger: Alert on packet loss or high latency.
      • Infrastructure Health: Monitor Disk/Volume usage, CPU/Memory saturation, and SSL Certificate expiry.
      • Security: Monitor for DDoS attack patterns and WAF spikes.

Resilience & Chaos Engineering

  • Chaos Engineering: Plan and execute Chaos Engineering exercises (e.g., simulating pod failures, network latency, zone outages) to test the system's resilience and verify that failover mechanisms work as expected.
  • Reliability Recommendations: Proactively analyze trends and provide architectural recommendations to development and infrastructure teams to improve system stability.
  • First Line Troubleshooting: Serve as the L1/L2 troubleshooter for Kubernetes (EKS), AWS, and Linux issues. Execute Quick Fix runbooks to mitigate impact before escalating to platform engineering.

Application Triage & Analysis

  • Deep-Dive Triage: Go beyond system checks to perform deep analysis using Dynatrace. Analyze stack traces and exception logs to pinpoint the exact line of code causing the failure.
  • Root Cause Differentiation: Rapidly differentiate between an Infrastructure Issue (e.g., a network timeout) and an Application Logic Error (e.g., a NullPointerException caused by bad data).
  • Blameless RCA: Facilitate Root Cause Analysis sessions to ensure permanent fixes are applied to recurring problems.

Governance & Reporting (Stability Cadence)

  • Stability Calls: Facilitate and lead the Weekly/Bi-Weekly Stability Call. Present the health status of all technical towers to leadership and stakeholders.
  • Reporting: Generate regular reports on system uptime, error budgets, incident trends, and MTTR (Mean Time To Recovery).
  • Cross-Tower Visibility: Ensure that the dashboards and reports provide value to all teams (Network, App, Cloud), leaving no siloed blind spots in production.
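
As an illustration of the reporting metrics named above, here is a minimal Python sketch of how MTTR and remaining error budget might be computed. It assumes each incident is recorded as a (detected, resolved) pair of timestamps and that the SLO is an availability target over a fixed window; the function names `mttr` and `error_budget_remaining` and the sample figures are hypothetical and are not taken from this posting.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean Time To Recovery: average of (resolved - detected) per incident."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def error_budget_remaining(slo_target, window, downtime):
    """Fraction of the error budget still unspent in the window.

    slo_target is an availability ratio (e.g. 0.999); window and downtime
    are timedeltas. A negative result means the budget is overspent.
    """
    budget = window * (1 - slo_target)  # allowed downtime for the window
    return 1 - downtime / budget


# Hypothetical sample data: two incidents on the same day.
sample_incidents = [
    (datetime(2026, 3, 1, 10, 0), datetime(2026, 3, 1, 10, 30)),
    (datetime(2026, 3, 1, 14, 0), datetime(2026, 3, 1, 15, 30)),
]
sample_mttr = mttr(sample_incidents)
sample_budget = error_budget_remaining(
    0.999, timedelta(days=30), timedelta(minutes=21.6)
)
```

In practice the incident timestamps would come from whatever incident tracker is in use; the calculation itself is the same regardless of the source.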

Automation & Toil Reduction

  • Remediation Scripting: Develop scripts (Python/Bash) to Auto-Heal common issues (e.g., clearing logs when disk is full, restarting stuck services).
  • Process Improvement: Identify manual checks and convert them into automated Dynatrace alerts or synthetic tests.
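
To make the "clearing logs when disk is full" example concrete, the following is a rough Python sketch of one possible auto-heal script. Everything in it is an assumption for illustration: the 90% threshold, the 7-day retention, the `*.log*` file pattern, and the helper names `disk_used_fraction`, `stale_logs`, and `auto_heal` are hypothetical. It defaults to a dry run so nothing is deleted unless explicitly requested.

```python
import shutil
import time
from pathlib import Path


def disk_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def stale_logs(log_dir, max_age_days):
    """List log files in log_dir last modified more than max_age_days ago."""
    cutoff = time.time() - max_age_days * 86400
    return sorted(
        p for p in Path(log_dir).glob("*.log*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def auto_heal(log_dir, threshold=0.90, max_age_days=7, dry_run=True,
              used_fraction=None):
    """Delete stale logs when disk usage is at or above the threshold.

    used_fraction can be passed in to override the measured value (useful
    for testing). Returns the files that were, or would be, removed.
    """
    used = disk_used_fraction(log_dir) if used_fraction is None else used_fraction
    if used < threshold:
        return []  # disk is healthy, nothing to do
    victims = stale_logs(log_dir, max_age_days)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

A call such as `auto_heal("/var/log/myapp")` (a hypothetical path) would only report candidates; passing `dry_run=False` performs the deletion. A real runbook would likely also log each action and escalate if usage stays above the threshold.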

Required Qualifications

  • Shift Availability: Must be willing to work in a 24/7 shift environment or a strictly defined on-call rotation.
  • Dynatrace Expertise: Deep experience administering and using Dynatrace in a production environment (Dashboards, OneAgent, PurePaths).
  • Troubleshooting Expertise:
      • Network: Understanding of DNS, TCP/IP, Load Balancing, and Firewalls.
      • Compute/Storage: Understanding of block vs. object storage, CPU steal time, and memory management.
  • Governance: Experience facilitating technical management calls and producing executive-level reliability reports.
  • Application Debugging: Ability to read application logs (Java, Node, Python) to understand why a service failed.
  • Cloud (AWS) & K8s: Solid understanding of EKS, EC2, and other AWS services.

Experience Range (Years)

4 - 8 years

Job posted on

2026-03-12

Job ID: 144506379