Lead and manage end‑to‑end infrastructure for enterprise Gen AI applications hosted on OpenShift (OCP) platforms
Own capacity planning and sizing for OpenShift clusters, including OCP pods, Oracle databases, Redis caches, Dell ECS storage, Elastic DB, Postgres, Red Hat, Ubuntu (optional), and related infrastructure components
Design and operationalize Disaster Recovery (DR) infrastructure for Gen AI platforms, ensuring high availability and resilience
Lead end‑to‑end DR setup, including replication, failover, testing, and documentation, in collaboration with infrastructure and network teams
Manage certificate lifecycle (TLS/SSL), key management, and secrets handling across Gen AI applications and platforms
Implement and oversee vulnerability management, patching, and remediation across containers, Kubernetes, and underlying infrastructure
Support and coordinate penetration testing activities, addressing infrastructure‑related findings and security gaps
Good understanding of AWS services (EC2, VPC, CloudWatch, Lambda, Bedrock) and tools (Terraform, CloudFormation) alongside on‑premises OpenShift environments
Operate and support Control‑M schedulers, logging, monitoring, and alerting tools for platform observability
Bonus: experience with or knowledge of open‑weight LLMs for text and vision use cases
Requirements
10+ years of engineering experience, or demonstrated in‑depth experience, setting up infrastructure and tools that enable scalable, secure, and efficient software development and deployment. You'll work closely with the Technical Delivery Manager and the Development, DevOps, and Security teams to ensure platform reliability and performance
Proven expertise managing OpenShift (OCP) environments in enterprise‑scale production deployments
Hands‑on experience with infrastructure setup and sizing, performance tuning, and capacity assessment for AI workloads
Experience supporting Oracle Database from an application infrastructure perspective
Practical knowledge of certificate management, secrets management, and key handling
Experience implementing CI/CD pipelines and infrastructure automation
Strong background in security, vulnerability management, and compliance controls
Proven experience designing and implementing DR infrastructure for mission‑critical platforms
Experience working with AWS cloud services and hybrid cloud integrations
Strong coordination and leadership skills to work across Infrastructure, Network, Security, and Application teams
Experience with containerization and orchestration tools (Docker, Kubernetes)