Job Overview:
- The AI / DevOps Engineer is responsible for building, deploying, and operating AI-powered software and the infrastructure that runs it. The role combines hands-on software development with DevOps and platform engineering, with a strong focus on Large Language Model (LLM) applications, agentic systems, and workflow automation.
- The individual in this role designs and develops AI-enabled applications and automations, integrates LLM APIs and self-hosted models, and builds the pipelines, infrastructure, and observability needed to ship and run these systems reliably across cloud, on-premise, private cloud, and GPU environments.
- This position bridges development and operations across the full lifecycle, from requirements and system design through CI/CD, deployment, monitoring, security, and post-production support, while continuously evaluating emerging AI tools and practices to improve efficiency and quality.
- Continuous learning is essential, as the AI / DevOps Engineer must stay current with a fast-moving AI and infrastructure landscape, including new models, agentic coding tools, orchestration frameworks, and automation techniques.
Responsibilities:
AI & LLM Application Development
- Design, develop, and maintain AI-powered applications and services that integrate Large Language Models (LLMs) and other machine learning models into business workflows.
- Integrate LLM APIs (such as Anthropic Claude and other providers) as well as self-hosted and open-source models running on private GPU infrastructure.
- Build retrieval-augmented generation (RAG) pipelines using vector databases, embeddings, and semantic search to ground model outputs in enterprise data.
- Design and implement agentic systems and tool-use workflows, including integrations through the Model Context Protocol (MCP) and connections to internal and third-party services.
- Apply prompt engineering, evaluation, and guardrail techniques to improve the accuracy, safety, reliability, and cost-efficiency of AI features.
- Write clean, efficient, reusable, and well-tested code following established standards and secure coding practices.
Software Development & Integration
- Design and develop scalable, secure backend services, APIs, and integrations that support AI and automation use cases.
- Perform application and data integration with internal and external systems using RESTful APIs, web services, webhooks, and message queues.
- Translate business requirements into functional and technical specifications, and participate in architecture and design discussions.
- Ensure solutions are compatible across multiple platforms and environments, including cloud, on-premise, and private cloud deployments.
Workflow & Process Automation
- Design, develop, and deploy automation workflows that combine traditional automation with AI-driven decision-making.
- Build automations with modern tools and platforms such as Power Automate, n8n, Zapier-style iPaaS platforms, and custom scripts to replace manual and repetitive processes.
- Develop and operate desktop and agentic automation, including AI desktop agents (for example, Cowork / Open Claw-style assistants) that perform tasks across applications.
- Implement web automation and data extraction where required using tools such as Playwright, Puppeteer, or Selenium.
- Use agentic coding tools such as Claude Code to accelerate development, automate engineering tasks, and build internal tooling.
- Automate IT and operational workflows such as provisioning, monitoring, alerting, ticketing, and incident response.
Infrastructure, Cloud & DevOps
- Build, maintain, and optimize infrastructure across cloud (AWS, GCP), on-premise, and private cloud environments for efficiency, scalability, and reliability.
- Provision and manage GPU compute for model inference and AI workloads, optimizing for performance and cost.
- Design and maintain CI/CD pipelines to automate building, testing, and deployment of applications, models, and automations.
- Manage source control and Git-based workflows on platforms such as GitHub, GitLab, or Bitbucket, including branching strategies, pull/merge requests, and code review processes.
- Containerize and orchestrate workloads using Docker and Kubernetes, and manage infrastructure as code (e.g., Terraform).
- Handle deployment, release, and configuration management, and support the smooth promotion of changes from development to production.
- Administer Linux/Unix and Windows environments supporting development and production systems.
Monitoring, Reliability & Performance
- Implement monitoring, logging, alerting, and observability for applications, infrastructure, and AI/LLM workloads using tools such as Prometheus, Grafana, the ELK stack or Loki, Datadog, or cloud-native services (e.g., AWS CloudWatch).
- Track AI-specific metrics such as latency, token usage, cost, accuracy, and output quality, and act on the results.
- Proactively identify, troubleshoot, and resolve performance issues and production incidents within agreed timelines.
- Participate in root cause analysis and drive preventive improvements to system reliability and stability.
Security & Compliance
- Apply security best practices across development, automation, and operations, including secrets management, access control, and network security.
- Address AI-specific security and governance concerns such as data privacy, prompt injection, safe handling of sensitive data, and responsible use of models.
Collaboration & Continuous Improvement
- Work closely with developers, data and AI engineers, operations staff, business analysts, and other stakeholders to deliver end-to-end solutions.
- Participate in Agile/Scrum ceremonies including sprint planning, daily stand-ups, reviews, and retrospectives.
- Act as a liaison between technical teams and stakeholders, and communicate solutions, trade-offs, and results clearly.
- Prepare and maintain technical documentation, including system designs, runbooks, and operational procedures.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, Software Engineering, or a related field (or equivalent practical experience).
- At least 2 years (typically 2–4 years) of combined experience across software development, DevOps, or automation; experience with AI/LLM-based solutions is strongly preferred.
- Demonstrated experience building and deploying applications in cloud, on-premise, or private cloud environments.
- Experience integrating APIs and third-party services, and building automated workflows.
- Working knowledge of Agile/Scrum methodologies and collaboration tools.
- Relevant certifications in cloud (AWS, GCP), DevOps, or AI/ML are an advantage.
Technical Skills
- Programming Languages & Core Stack:
- Python (required, primary): main language for AI/LLM development, automation, data work, and scripting; experience with frameworks and libraries such as FastAPI or Flask, plus LangChain, LlamaIndex, or the Anthropic and OpenAI SDKs.
- TypeScript / JavaScript (required): for backend services (Node.js) and front-end or full-stack work (React or similar), API integrations, and building agentic and MCP-based tooling.
- Bash / Shell scripting (required): for automation, CI/CD, and Linux/Unix system administration.
- SQL (required): for querying and managing relational databases such as PostgreSQL, MySQL, or SQL Server.
- PowerShell (preferred): for Windows administration and automation in mixed environments.
- Go and/or C# (advantageous): for performant backend services, infrastructure tooling (Go), or .NET-based enterprise integrations (C#).
- Configuration and infrastructure-as-code languages: YAML and JSON for pipelines and config, and HCL (Terraform) for provisioning infrastructure.
- Strong proficiency in the required languages above, with the ability to pick up additional languages as project needs evolve.
- Hands-on experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI) and building AI features such as RAG, agents, and tool use.
- Familiarity with AI frameworks and libraries such as LangChain, LlamaIndex, or similar, and with vector databases (e.g., Pinecone, Weaviate, pgvector, or FAISS).
- Experience with the Model Context Protocol (MCP) and agentic coding tools such as Claude Code is an advantage.
- Strong knowledge of cloud platforms, especially AWS and GCP, plus on-premise and private cloud deployment.
- Experience provisioning and using GPU compute for model inference and training.
- Solid DevOps skills: CI/CD pipelines (e.g., GitHub Actions, GitLab CI/CD, Jenkins), Docker, Kubernetes, and infrastructure as code (e.g., Terraform).
- Experience with Linux/Unix and Windows administration and scripting (e.g., Bash, PowerShell, Python).
- Knowledge of RESTful APIs, JSON, webhooks, and system integration concepts.
- Experience with databases such as PostgreSQL, MySQL, MongoDB, or SQL Server.
- Familiarity with workflow automation tools (e.g., Power Automate, n8n) and web automation (e.g., Playwright, Puppeteer, Selenium).
- Strong experience with Git and Git-based platforms such as GitHub, GitLab, and Bitbucket, including branching strategies, pull/merge requests, code review, and repository management, along with modern software development life cycle practices.
- Hands-on experience with monitoring and observability tooling such as Prometheus, Grafana, the ELK stack or Loki, Datadog, or cloud-native services (e.g., AWS CloudWatch), including metrics, logs, dashboards, and alerting for applications, infrastructure, and AI workloads.
- Awareness of AI safety, security, and governance considerations, including data privacy and prompt-injection risks.
Soft Skills
- Strong analytical and problem-solving abilities.
- Excellent verbal and written communication skills.
- Ability to work both independently and collaboratively within a team.
- Strong organizational and time-management skills.
- Ability to manage multiple priorities and meet deadlines in a fast-paced environment.
- Attention to detail and a commitment to quality.
- Curiosity and a strong willingness to learn and adapt to rapidly evolving AI and infrastructure technologies.