

AI / DevOps Engineer
Job Overview:
· The AI / DevOps Engineer is responsible for building, deploying, and operating AI-powered software and the infrastructure that runs it. The role combines hands-on software development with DevOps and platform engineering, with a strong focus on Large Language Model (LLM) applications, agentic systems, and workflow automation.
· The individual in this role designs and develops AI-enabled applications and automations, integrates LLM APIs and self-hosted models, and builds the pipelines, infrastructure, and observability needed to ship and run these systems reliably across cloud, on-premise, private cloud, and GPU environments.
· This position bridges development and operations across the full lifecycle, from requirements and system design through CI/CD, deployment, monitoring, security, and post-production support, while continuously evaluating emerging AI tools and practices to improve efficiency and quality.
· Continuous learning is essential, as the AI / DevOps Engineer must stay current with a fast-moving AI and infrastructure landscape, including new models, agentic coding tools, orchestration frameworks, and automation techniques.
Responsibilities:
AI & LLM Application Development
· Design, develop, and maintain AI-powered applications and services that integrate Large Language Models (LLMs) and other machine learning models into business workflows.
· Integrate LLM APIs (such as Anthropic Claude and other providers) as well as self-hosted and open-source models running on private GPU infrastructure.
· Build retrieval-augmented generation (RAG) pipelines using vector databases, embeddings, and semantic search to ground model outputs in enterprise data.
· Design and implement agentic systems and tool-use workflows, including integrations through the Model Context Protocol (MCP) and connections to internal and third-party services.
· Apply prompt engineering, evaluation, and guardrail techniques to improve accuracy, safety, reliability, and cost-efficiency of AI features.
· Write clean, efficient, reusable, and well-tested code following established standards and secure coding practices.
Software Development & Integration
· Design and develop scalable, secure backend services, APIs, and integrations that support AI and automation use cases.
· Perform application and data integration with internal and external systems using RESTful APIs, web services, webhooks, and message queues.
· Translate business requirements into functional and technical specifications, and participate in architecture and design discussions.
· Ensure solutions are compatible across multiple platforms and environments, including cloud, on-premise, and private cloud deployments.
Workflow & Process Automation
· Design, develop, and deploy automation workflows that combine traditional automation with AI-driven decision-making.
· Build automations using modern tools and platforms such as Power Automate, n8n, Zapier-style iPaaS platforms, and custom scripts to replace manual and repetitive processes.
· Develop and operate desktop and agentic automation, including AI desktop agents (for example, agent-based assistants such as Cowork or Open Claw-style tools) that perform tasks across applications.
· Implement web automation and data extraction where required using tools such as Playwright, Puppeteer, or Selenium.
· Use agentic coding tools such as Claude Code to accelerate development, automate engineering tasks, and build internal tooling.
· Automate IT and operational workflows such as provisioning, monitoring, alerting, ticketing, and incident response.
Infrastructure, Cloud & DevOps
· Build, maintain, and optimize infrastructure across cloud (AWS, GCP), on-premise, and private cloud environments for efficiency, scalability, and reliability.
· Provision and manage GPU compute for model inference and AI workloads, optimizing for performance and cost.
· Design and maintain CI/CD pipelines to automate building, testing, and deployment of applications, models, and automations.
· Manage source control and Git-based workflows on platforms such as GitHub, GitLab, or Bitbucket, including branching strategies, pull/merge requests, and code review processes.
· Containerize and orchestrate workloads using Docker and Kubernetes, and manage infrastructure as code (e.g., Terraform).
· Handle deployment, release, and configuration management, and support the smooth promotion of changes from development to production.
· Administer Linux/Unix and Windows environments supporting development and production systems.
Monitoring, Reliability & Performance
· Implement monitoring, logging, alerting, and observability for applications, infrastructure, and AI/LLM workloads using tools such as Prometheus, Grafana, the ELK or Loki stacks, Datadog, or cloud-native services (e.g., AWS CloudWatch).
· Track AI-specific metrics such as latency, token usage, cost, accuracy, and quality, and act on the results.
· Proactively identify, troubleshoot, and resolve performance issues and production incidents within agreed timelines.
· Participate in root cause analysis and drive preventive improvements to system reliability and stability.
Security & Compliance
· Apply security best practices across development, automation, and operations, including secrets management, access control, and network security.
· Address AI-specific security and governance concerns such as data privacy, prompt injection, safe handling of sensitive data, and responsible use of models.
· Ensure activities comply with organizational policies, security standards, and audit requirements, and maintain proper version control and documentation.
Collaboration & Continuous Improvement
· Work closely with developers, data and AI engineers, operations staff, business analysts, and other stakeholders to deliver end-to-end solutions.
· Participate in Agile/Scrum ceremonies including sprint planning, daily stand-ups, reviews, and retrospectives.
· Act as a liaison between technical teams and stakeholders, and communicate solutions, trade-offs, and results clearly.
Qualifications:
· Bachelor's degree in Computer Science, Information Technology, Software Engineering, or a related field (or equivalent practical experience).
· 2–4 years of combined experience in software development, DevOps, or automation; experience with AI/LLM-based solutions is strongly preferred.
· Demonstrated experience building and deploying applications in cloud, on-premise, or private cloud environments.
· Experience integrating APIs and third-party services, and building automated workflows.
· Working knowledge of Agile/Scrum methodologies and collaboration tools.
· Relevant certifications in cloud (AWS, GCP), DevOps, or AI/ML are an advantage.
Technical Skills
Programming Languages & Core Stack
· Python (required, primary): main language for AI/LLM development, automation, data work, and scripting; experience with frameworks and libraries such as FastAPI or Flask, plus LangChain, LlamaIndex, or the Anthropic and OpenAI SDKs.
· TypeScript / JavaScript (required): for backend services (Node.js) and front-end or full-stack work (React or similar), API integrations, and building agentic and MCP-based tooling.
· Bash / Shell scripting (required): for automation, CI/CD, and Linux/Unix system administration.
· SQL (required): for querying and managing relational databases such as PostgreSQL, MySQL, or SQL Server.
· PowerShell (preferred): for Windows administration and automation in mixed environments.
· Go and/or C# (advantageous): for performant backend services, infrastructure tooling (Go), or .NET-based enterprise integrations (C#).
· Configuration and infrastructure-as-code languages: YAML and JSON for pipelines and config, and HCL (Terraform) for provisioning infrastructure.
· Strong proficiency in the core languages above, with the ability to pick up additional languages as project needs evolve.
· Hands-on experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI) and building AI features such as RAG, agents, and tool use.
· Familiarity with AI frameworks and libraries such as LangChain, LlamaIndex, or similar, and with vector databases (e.g., Pinecone, Weaviate, pgvector, or FAISS).
· Experience with the Model Context Protocol (MCP) and agentic coding tools such as Claude Code is an advantage.
· Strong knowledge of cloud platforms, especially AWS and GCP, plus on-premise and private cloud deployment.
· Experience provisioning and using GPU compute for model inference and training.
· Solid DevOps skills: CI/CD pipelines (e.g., GitHub Actions, GitLab CI/CD, Jenkins), Docker, Kubernetes, and infrastructure as code (e.g., Terraform).
· Experience with Linux/Unix and Windows administration and scripting (e.g., Bash, Python).
· Knowledge of RESTful APIs, JSON, webhooks, and system integration concepts.
· Experience with databases including PostgreSQL, MySQL, MongoDB, or SQL Server.
· Familiarity with workflow automation tools (e.g., Power Automate, n8n) and web automation (e.g., Playwright, Puppeteer, Selenium).
· Strong experience with Git and Git-based platforms such as GitHub, GitLab, and Bitbucket, including branching strategies, pull/merge requests, code review, and repository management, along with modern software development life cycle practices.
· Hands-on experience with monitoring and observability tooling such as Prometheus, Grafana, the ELK or Loki stacks, Datadog, or cloud-native services (e.g., AWS CloudWatch), including metrics, logs, dashboards, and alerting for applications, infrastructure, and AI workloads.
· Awareness of AI safety, security, and governance considerations, including data privacy and prompt-injection risks.
Job ID: 149107885