Run managed services, not just systems. Operate multi-tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.
Be the face of reliability. Lead incidents end-to-end, own customer comms and post-incident reviews (RCA with actions customers can see and feel).
Design for customer experience. Help data scientists and customers reduce failed or slow jobs, improve time-to-data, and optimize costs so customers notice faster pipelines and fewer surprises.
Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.
Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.
Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cut noise and improve detection and response times.
Govern change. Plan and execute upgrades/migrations within change windows; champion safe deploys and rollback strategies.
Partner & mentor. Guide junior engineers; collaborate with customer dev/data teams to unblock delivery and raise the reliability bar.
Participate in on-call. Join a 24x7 rotation with crisp handoffs and playbooks.
Your Qualifications
Hands-on experience supporting ETL/ELT, SQL, and production pipelines/workflows.
Strong experience in at least one of Spark, Airflow, Flink, or Jupyter (plus the ecosystem around it).
Solid working knowledge of at least one language (Python, Java, or Scala) for automation, data manipulation, and orchestration.
Real-world experience with AWS or GCP in production environments, as a user or an administrator.
Kubernetes (or Docker) for scheduling and scaling workloads.
Incident management, post-incident reviews, change management, and service reporting.