We are seeking a Principal Systems Architect to lead the engineering of a core infrastructure platform for one of our Swiss clients. You will deploy to the cloud and orchestrate the bare metal that powers the next generation of scientific discovery. You will design the systems that allow top-tier researchers and institutions to deploy autonomous Computational Co-Scientists with the ease of a SaaS platform, but with the rigorous control and security of an air-gapped data center. Your mandate is to eliminate the friction between Sovereign Control and Agentic Agility.
Compensation:
Competitive salary commensurate with the candidate's experience and skills
Key Responsibilities
- Architect Agentic AI Infrastructure: Design the environment where non-technical Subject-Matter Experts (SMEs) can orchestrate autonomous AI agents. Optimize the Manager Surface to support asynchronous, long-running research tasks.
- Implement the Model Context Protocol (MCP): Build and govern the secure nervous system that allows agents to query proprietary Scientific Data Warehouses (SDW) or control lab equipment in-place, ensuring data never leaves the sovereign perimeter.
- Design the Replication Layer: Extend CRIU-based checkpointing to capture complete Agent State (model version, system prompts, reasoning traces, and generated artifacts) to guarantee bit-perfect reproducibility of complex scientific workflows a decade into the future.
- Orchestrate Sovereign Bare Metal & Hybrid Bursting: Operate and extend a Cozystack infrastructure (Talos Linux, KubeVirt, Kubernetes) to ensure High-Performance Computing (HPC) workloads access raw hardware with Zero Virtualization Tax, while maintaining the flexibility to burst to public clouds.
- Implement Matrixed Billing & Governance: Translate complex institutional finance into infrastructure logic by enforcing strict, kernel-level resource quotas, using dynamic Resource Pool mapping to tie each quota to a specific financial cost center.
- Optimize Domain-Specific Reasoning: Tune the infrastructure stack for diverse System 2 reasoning loops, optimizing memory bandwidth for large-context agents and minimizing latency for high-speed analytical agents.
Technical Requirements
Must-Have Skills:
- Polyglot Engineering: Master-level proficiency in JavaScript (essential for deep integration with the stack, frontend interfaces, and agentic orchestration), alongside strong skills in Python and systems languages (Go / Rust).
- Systems Architecture: 8+ years in Systems Engineering, SRE, or Infrastructure Architecture, with a focus on HPC or large-scale distributed systems.
- The Metal Mindset: Proven experience managing Bare Metal environments and modern platform builders like Cozystack, Talos Linux, or KubeVirt.
- Containerization & Orchestration: Deep expertise in Kubernetes, Docker, and infrastructure sandboxing (Firecracker, gVisor, Kata Containers).
- Infrastructure as Code & Ops: Strong experience with Terraform, Ansible, Prometheus/Grafana, and NVIDIA MIG (Multi-Instance GPU).
Nice-to-Have Skills:
- Storage Expertise: Experience with Petabyte-scale storage solutions like Ceph/CephFS, ZFS/Btrfs, and Snowflake (SDW).
- Regulated Environments: Previous experience working in highly regulated sectors (FinTech, MedTech, GovTech) or with Air-Gapped networks.
- AI/LLM Integration: Prior experience sandboxing untrusted, AI-generated code or connecting LLMs to secure databases.