Senior AI Platform Engineer

  • Posted 9 days ago

Job Description

OpsWerks is a technical consulting company specializing in operational services for the high-tech industry. We help platform and infrastructure teams operate multi-cloud environments, execute complex migrations, and enable seamless app deployments.

Your Role

As a Senior AI Platform Engineer, you will be responsible for operating, maintaining, and continuously improving the company's AI platforms running on Kubernetes (on-premises and/or on AWS/GCP), similar to the AI on EKS (AIoEKS) deployment frameworks and Kubeflow's machine learning toolkit.

Platform Ownership & Operations

  • Deploy new releases and configuration changes through GitOps/DevOps
  • Monitor platform and service health using logs, metrics, and observability tools
  • Improve platform observability, operational tooling/automations, self-service capabilities and reliability practices to reduce recurring issues
  • Participate in incident response, root cause analysis and 24x7 operational rotations

User & Developer Experience

  • Investigate and troubleshoot user concerns by correlating them with system-related issues, broken integrations, or user-specific errors and misconfigurations, then recommend or execute resolutions
  • Advocate for platform standards, security best practices, and operational excellence

Collaboration and Leadership

  • Provide structured Python mentorship to junior engineers, focusing on strong fundamentals and bridging foundational Python knowledge toward MLOps competencies
  • Lead the adoption of MLOps best practices for the team
  • Influence the team roadmap by identifying gaps in tooling, skills, and processes required to support production-grade AI systems

Your Qualifications

  • 3+ years of experience supporting production workloads/platforms (Ray.IO, Jupyter Notebooks, AWS SageMaker, Kubeflow AI Tools or an AI-related equivalent)
  • 5+ years of hands-on experience with the AI/ML lifecycle (development/deployment, DevOps/MLOps)
  • 5+ years of Python experience developing and supporting AI/ML workflows and data engineering pipelines
  • Practical skills in Kubernetes environments, including cloud-provider managed Kubernetes (AWS EKS, GCP GKE)
  • Knowledge of microservice architectures and service communication patterns
  • Strong fundamentals in troubleshooting issues such as application crashes, resource contention, service latency, and scaling behavior
  • Well-rounded competency in analyzing logs, metrics, monitoring systems, and service KPIs

Plus Points If You Have

  • Exposure to other Data/AI platforms such as Flyte, HuggingFace, and AI agent platforms (Vertex AI, Claude Code, LangChain, etc.)
  • Hands-on experience with automation or scripting (Bash, Python)
  • Kubernetes or cloud certifications (CKAD, AWS)

Ready to start your awesome journey and be part of OpsWerks?

Job ID: 149199733