Staff Software Engineer, Platform & Distributed Systems

aifa labs

Hyderabad, India

8-10 Years

Job Description

Staff Software Engineer, Platform & Distributed Systems

Location: Hyderabad

Experience: 7+ Years

Type: Full-time | Hybrid after 3 months (3 days WFO, 2 days WFH per week)

Notice Period: Immediate to 15 Days

Level: Staff Engineer (L6 / Consultant / Architect equivalent)

The Problem

We have a working AI platform. Now we need to make it learn.

Every AI output, every user interaction, every accept/reject/modify decision contains signal. Right now, that signal is lost. Your job: instrument a production system to capture everything, stream it to a learning layer, and close the feedback loop — without breaking what's already working for customers.

This is surgery on a moving train. You'll touch frontend, backend, event pipelines, and data flows. You'll design schemas that capture context without bloat. You'll refactor services to emit events at scale. You'll do it incrementally, behind feature flags, with zero downtime.

Not a rewrite. A transformation for scale, efficiency, reliability, and continual learning.

What You'll Own

Event-Driven Architecture
- Design and implement event emission across 50+ API endpoints
- Define schemas that capture full context (not just what happened, but why)
- Build reliable event pipelines (Kafka) with exactly-once semantics where it matters
- Handle backpressure, failures, and replay
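To make "what happened, but why" concrete, here is a minimal sketch of what one such event envelope could look like in Python. Every name in it (`FeedbackEvent`, its fields, the event type strings) is a hypothetical illustration, not the platform's actual schema.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FeedbackEvent:
    """Hypothetical event envelope; all field names are illustrative."""
    event_type: str                 # e.g. "ai_output.modified"
    actor_id: str                   # who acted
    output_id: str                  # which AI output the action refers to
    decision: str                   # "accept", "reject", or "modify"
    reason: Optional[str] = None    # the "why", when the user supplies one
    trace_id: Optional[str] = None  # correlates UI, API, and event bus hops
    schema_version: int = 1         # lets consumers handle schema evolution
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Reject malformed events at the producer, before they reach the bus.
        if self.decision not in {"accept", "reject", "modify"}:
            raise ValueError(f"unknown decision: {self.decision!r}")

    def to_json(self) -> str:
        # Stable key order keeps serialized payloads easy to diff and replay.
        return json.dumps(asdict(self), sort_keys=True)
```

A unique `event_id` is one common way to let downstream consumers deduplicate on replay, which matters wherever exactly-once behavior is needed.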

Full-Stack Instrumentation
- Instrument React frontend to capture user behavior (actions, timing, implicit signals)
- Build low-friction feedback components that users actually use
- Ensure end-to-end event flow: UI → API → Event Bus → Storage

System Transformation
- Apply strangler pattern to extract services without disruption
- Implement feature flags for incremental rollouts
- Design for observability from day one (traces, metrics, structured logs)
- Migrate historical data to new event schemas
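As a rough illustration of how feature flags and strangler-style routing can work together, the sketch below assigns each user to a stable bucket so that a rollout percentage can be raised gradually. The flag name, function names, and return values are assumptions made for the example, not part of any existing codebase.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Return True if user_id falls inside the first `percent` of 100 buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hashing flag + user gives a stable bucket per user, independent per flag.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def handle_request(user_id: str, rollout_percent: int) -> str:
    # Strangler-style routing: flagged users take the new path, the rest
    # stay on the legacy path until the rollout reaches 100 percent.
    if in_rollout("emit_feedback_events", user_id, rollout_percent):
        return "new_event_emitting_path"
    return "legacy_path"
```

Because the bucket is derived from a hash instead of a random draw, a user who is in the rollout at 10 percent stays in it at 50 percent, so raising the percentage only ever adds users.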

Technical Leadership
- Define patterns that the rest of engineering will follow
- Review designs and code for event-driven consistency
- Document decisions and trade-offs for future maintainers

Why This Role

→ Transform, don't build from scratch — Harder than greenfield. You'll make a production system smarter while it's serving customers.

→ Full ownership — You'll make architectural decisions, not just implement tickets. Event schemas, service boundaries, instrumentation strategy — yours to design.

→ AI platform scale — Your instrumentation directly impacts how the platform learns. Bad events = bad AI. Good events = compounding intelligence.

→ Staff-level scope — Cross-team influence, technical direction, patterns that scale. This isn't a senior role with a fancy title.

→ Modern stack, real problems — Kafka, Kubernetes, event sourcing, distributed systems. Not legacy maintenance — system evolution.

You Are

Required:
- 8+ years of software engineering: Staff-level depth. You've made architectural mistakes and learned from them.
- Distributed systems experience: You understand CAP theorem trade-offs, eventual consistency, and when strong consistency actually matters.
- Event-driven architecture: You've built or significantly contributed to event-driven systems. Kafka, RabbitMQ, or similar at scale.
- Full-stack capability: Fluent in backend (Python) AND frontend (Next.js/React). Can move between layers without context-switching pain.
- Production transformation experience: You've refactored, migrated, or instrumented systems while they were serving real traffic. Strangler pattern, feature flags, incremental rollouts.

Strong Plus:
- Observability implementation (OpenTelemetry, Datadog, Prometheus)
- AI/ML platform experience (LLM applications, inference pipelines)
- Event sourcing or CQRS patterns
- High-scale analytics instrumentation (Segment, Amplitude, custom pipelines)
- Microservices decomposition from monoliths

Tech Stack (Important)

Python