Company: Union Bank of the Philippines
Position: Journey Site Reliability Engineer
Office Location: UnionBank Plaza - Ortigas, Pasig City
Job Summary: A Journey Site Reliability Engineer is assigned to a specific journey/team within Union Bank, reporting to the Journey SRE Head/Lead and partnering closely with Product, Engineering, DevOps, and Business stakeholders for key lines of business (e.g., UBO, Portal, Payments). This role drives reliability, observability, performance, and operational excellence across critical customer journeys, with accountability for measurable outcomes.
Duties and Responsibilities
- Resilience and Reliability Engineering
- Assist in defining and tracking SLOs for core banking and payment services, focusing on key metrics such as availability, throughput, and latency.
- Support the implementation of reliability practices, such as chaos engineering experiments and fault injection testing, under the guidance of senior engineers.
- Collaborate with development teams to reduce mean time to detection (MTTD) and mean time to recovery (MTTR) during incidents.
- Observability and Monitoring
- Help the observability team set up the right observability tooling, using Dynatrace to enhance system monitoring with real-time metrics, distributed traces, and logs.
- Maintain and update dashboards and alerts to proactively identify potential system issues.
- Work with senior engineers to evaluate and improve observability tools and practices.
- Performance and Scalability
- Assist in load testing and performance tuning exercises to ensure applications handle high transaction volumes effectively.
- Identify basic performance bottlenecks and escalate findings to senior engineers for resolution.
- Collaborate with DevOps teams to support CI/CD pipelines with performance insights.
- Incident Management and Root Cause Analysis
- Participate in incident response efforts, assisting in troubleshooting and timely communication.
- Document and track root cause analysis (RCA) findings, contributing to the knowledge base for continuous improvement.
- Support the development of playbooks to improve response to common incidents.
- Automation and Tooling
- Continuous Improvement and Team Collaboration
- Work closely with individual application development teams to embed reliability principles into their workflows.
- Participate in knowledge-sharing sessions to learn and promote best practices for system resilience.
- Support senior engineers in fostering a culture of observability and proactive risk management across teams.
Required Skills
- Bachelor's degree in Computer Science, Engineering, Mathematics, or related field (or equivalent experience).
- 1-5 years of experience in Software Engineering, DevOps, or SRE roles, with exposure to monitoring or incident management.
- Basic understanding of observability tools like Dynatrace, Prometheus, or similar platforms.
- Foundational programming or scripting skills in one or more languages (e.g., Java, Python, or Bash).
- Familiarity with containerization (e.g., Docker) and CI/CD pipelines is a plus.
- Strong problem-solving skills, a collaborative mindset, and eagerness to learn from senior engineers.