We are looking for an AI Engineer to lead our efforts to train, align, and optimize large language models. RLHF and reinforcement learning are at the core of this role: you will own the full post-training pipeline, from supervised fine-tuning through reward modeling and RL optimization, while also ensuring models run efficiently in production. The role bridges alignment research and systems engineering.
What You'll Do
- Own and drive the full RLHF pipeline: data collection, reward model training, and RL fine-tuning using PPO, DPO, GRPO, and RLAIF (a minimal DPO sketch follows this list)
- Design and run Supervised Fine-Tuning (SFT) pipelines on open-weight models (LLaMA, Mistral, Qwen) as the foundation for RLHF
- Build and train reward models that accurately capture human preferences from annotation data
- Design human feedback collection pipelines: labeling rubrics, annotator calibration, and preference dataset curation
- Implement Constitutional AI and RLAIF techniques to reduce reliance on costly human annotation
- Red-team models post-training — probing for jailbreaks, regressions, unsafe outputs, and alignment failures
- Design and maintain evaluation benchmarks to measure alignment, safety, and capability before and after RL training
- Optimize inference pipelines and runtimes (llama.cpp, vLLM, TensorRT) to serve aligned models efficiently at scale
- Implement quantization and parameter-efficient fine-tuning strategies (INT4/INT8/FP8 quantization, LoRA, QLoRA) to deploy fine-tuned models on target hardware (see the 4-bit loading sketch after this list)
- Write and tune low-level C/C++ and Rust code for inference performance where Python falls short
- Diagnose and resolve training instabilities, reward hacking, and production inference bugs under pressure
- Stay at the frontier — read alignment and RL papers weekly and translate findings into working experiments
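To give candidates a concrete sense of this work, here is a minimal sketch of the DPO objective named above (assuming PyTorch and precomputed per-sequence log-probabilities; the function and argument names are illustrative, not from our codebase):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each argument is a batch of summed per-sequence log-probabilities
    log pi(y | x) for the chosen and rejected completions, under the
    policy being trained and under a frozen reference model.
    """
    # Implicit reward margins: how much more the policy prefers each
    # completion than the reference model does, scaled by beta.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the probability that chosen beats rejected under a
    # Bradley-Terry preference model.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```

The same pairwise Bradley-Terry structure underlies reward model training, which is why the reward modeling and preference optimization responsibilities above draw on the same fundamentals.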
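Similarly, a sketch of the 4-bit loading path behind the QLoRA-style deployment work (assuming the Hugging Face transformers, bitsandbytes, and accelerate packages; the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 weight quantization with bf16 compute, the configuration
# popularized by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder open-weight model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate
)
```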
Core Requirements and Technical Skills
- Hands-on experience implementing RLHF end-to-end — not just using libraries, but understanding the mechanics
- Deep familiarity with policy gradient methods: PPO stability, KL divergence constraints, reward shaping (see the KL-penalty sketch after this list)
- Experience with Direct Preference Optimization (DPO) and its variants as an RLHF alternative
- Understanding of reward hacking, Goodhart's Law, and mitigation strategies in RL training
- Familiarity with RLAIF (RL from AI Feedback) and Constitutional AI approaches
- Ability to design preference datasets and annotation rubrics that produce a high-quality reward signal
- Experience diagnosing training instabilities: reward collapse, mode collapse, KL divergence blowup
- Python as the primary language for all training, fine-tuning, and evaluation pipelines
- Strong mathematical foundation: RL theory, probability, linear algebra, optimization — deep enough to derive loss functions and debug training dynamics
- C and C++ for systems-level inference work, runtime contributions, and performance-critical paths
- Experience with Rust for ML tooling
- Familiarity with transformer architecture, attention, tokenization, and how post-training interacts with pretraining
- Experience with distributed training frameworks (e.g., DeepSpeed, FSDP, Megatron-LM) for large-scale fine-tuning
- Experience with vector search libraries and databases such as FAISS or Milvus
- Familiarity with retrieval-augmented generation (RAG) pipelines
- Experience integrating LLMs with external tools, APIs, and agent-based systems
- Exposure to Rapid Application Development (RAD) approaches for building and iterating on AI solutions efficiently
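To make the PPO requirement above concrete, here is a minimal sketch of the per-token KL-penalized reward shaping standard in PPO-based RLHF (PyTorch; the function and tensor names are illustrative, not from our stack):

```python
import torch

def kl_shaped_rewards(policy_logps: torch.Tensor,
                      ref_logps: torch.Tensor,
                      sequence_reward: torch.Tensor,
                      kl_coef: float = 0.05) -> torch.Tensor:
    """Per-token rewards for PPO-based RLHF.

    policy_logps, ref_logps: (batch, seq_len) log-probs of the sampled
    tokens under the current policy and a frozen reference model.
    sequence_reward: (batch,) scalar scores from the reward model.
    """
    # Penalize per-token divergence from the reference model to keep
    # the policy close to its initialization and blunt reward hacking.
    rewards = -kl_coef * (policy_logps - ref_logps)
    # Add the reward-model score at the final token of each sequence
    # (padding handling omitted for brevity).
    rewards[:, -1] += sequence_reward
    return rewards
```

Tuning kl_coef against KL blowup on one side and reward collapse on the other is exactly the kind of training-dynamics debugging this role involves.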