Senior Site Reliability Engineer

Cebu Pacific Air

Philippines

4-8 Years

Job Description

Department

Digital & Technology Office

Employee Type

Probationary

The Senior Site Reliability Engineer will serve as the first line of defense for our 24/7 operations. You will act as the guardian of our production environment, utilizing Dynatrace to maintain a holistic view of both Infrastructure and Application health.

You will not just monitor uptime; you will actively test system resilience, manage major incidents, and facilitate stability reporting. You will be the primary notification point for all P1/P2 incidents, responsible for deep-dive triage, quick remediation, and coordinating Major Incident Management (MIM).

Key Responsibilities

24/7 Incident Command & Alerting

24/7 Availability: Participate in a shift rotation or on-call schedule to ensure continuous coverage. You are the eyes on glass for the organization.
Unified Alerting: Manage the notification workflow. Ensure that Critical Alerts for both Infrastructure failures and Application failures trigger immediate notifications to the 24/7 team.
Major Incident Management (MIM): Lead the technical response during critical outages. Coordinate cross-functional teams to restore service rapidly.

Observability Strategy (Dynatrace Focus)

Dynatrace Administration: Act as the Subject Matter Expert (SME) for our Dynatrace implementation.
Configure Management Zones, Alerting Profiles, and Dashboards to provide a Single Pane of Glass.
Utilize Dynatrace PurePath for distributed tracing to identify bottlenecks in microservices.
Leverage Davis AI to automatically detect anomalies and reduce alert noise.
Comprehensive Monitoring Scope:
Network Health: Monitor VPN Tunnel status, Load Balancer (ALB/NLB) health, and DNS latency. Trigger: Alert on packet loss or high latency.
Infrastructure Health: Monitor Disk/Volume usage, CPU/Memory saturation, and SSL Certificate expiry.
Security: Monitor for DDoS attack patterns and WAF spikes.

Resilience & Chaos Engineering

Chaos Engineering: Plan and execute Chaos Engineering exercises (e.g., simulating pod failures, network latency, zone outages) to test the system's resilience and verify that failover mechanisms work as expected.
Reliability Recommendations: Proactively analyze trends and provide architectural recommendations to development and infrastructure teams to improve system stability.
First Line Troubleshooting: Serve as the L1/L2 troubleshooter for Kubernetes (EKS), AWS, and Linux issues. Execute Quick Fix runbooks to mitigate impact before escalating to platform engineering.

Application Triage & Analysis

Deep-Dive Triage: Go beyond basic system checks to perform deep analysis using Dynatrace. Analyze stack traces and exception logs to pinpoint the exact line of code causing the failure.
Root Cause Differentiation: Rapidly differentiate between an Infrastructure Issue (e.g., a network timeout) and an Application Logic Error (e.g., a NullPointerException caused by bad data).
Blameless RCA: Facilitate Root Cause Analysis sessions to ensure permanent fixes are applied to recurring problems.

Governance & Reporting (Stability Cadence)

Stability Calls: Facilitate and lead the Weekly/Bi-Weekly Stability Call. Present the health status of all technical towers to leadership and stakeholders.
Reporting: Generate regular reports on system uptime, error budgets, incident trends, and MTTR (Mean Time To Recovery).
Cross-Tower Visibility: Ensure that dashboards and reports provide value to all teams (Network, App, Cloud), leaving no siloed blind spots in production.

Automation & Toil Reduction

Remediation Scripting: Develop scripts (Python/Bash) to Auto-Heal common issues (e.g., clearing logs when disk is full, restarting stuck services).
Process Improvement: Identify manual checks and convert them into automated Dynatrace alerts or synthetic tests.

Required Qualifications

Shift Availability: Must be willing to work in a 24/7 shift environment or a strictly defined on-call rotation.
Dynatrace Expertise: Deep experience administering and using Dynatrace in a production environment (Dashboards, OneAgent, PurePaths).
Troubleshooting Expertise:
Network: Understanding of DNS, TCP/IP, Load Balancing, and Firewalls.
Compute/Storage: Understanding of block vs. object storage, CPU steal time, and memory management.
Governance: Experience facilitating technical management calls and producing executive-level reliability reports.
Application Debugging: Ability to read application logs (Java, Node, Python) to understand why a service failed.
Cloud (AWS) & K8s: Solid understanding of EKS, EC2, and other AWS services.

Experience Range (Years)

4 - 8 years

Job posted on

2026-03-12