Job Description
Key Responsibilities
Build AI Automation: Develop the core LLM module that automates incident correlation, resolution, and remediation using log and ticket data.
Design LLM Architecture: Implement Retrieval-Augmented Generation (RAG) pipelines with advanced prompt engineering, context management, and workflow orchestration.
Manage Data Retrieval: Configure and deploy vector databases (e.g., Milvus, Azure AI Search) and optimize RAG with well-designed chunking strategies for fast, accurate information retrieval.
Tooling: Use Hugging Face Transformers, the OpenAI API, and LangChain to build, fine-tune, and deploy models.
Anomaly Detection (ML): (Secondary) Develop and deploy conventional machine learning models, using techniques such as Isolation Forest or LSTM-based detectors, to identify anomalies in log data.
MLOps: Deploy and monitor all models using tools such as Docker, Kubernetes, MLflow, and Grafana.
Minimum Qualifications
Required Skills:
Primary: Expert-level experience building and deploying production-grade LLM applications, including RAG implementation and prompt engineering, plus deep knowledge of vector databases.
Data Science: Strong proficiency in Python, pandas, NumPy, and standard ML libraries such as scikit-learn for feature engineering and model selection.
Deployment: Experience with containerization and orchestration (Docker, Kubernetes) and MLOps platforms (MLflow, Azure ML).