Developer for the Taxi Reliability Platform Team
Projects you will work on:
Real-time load balancing mechanism: We manage mobile app traffic in real time to keep the service stable even during network failures. You will develop a system that can withstand DNS resolver failures and loss of connectivity between telecom operators and Yandex data centers. This is a critical component that keeps the service stable for millions of users and helps retain them.
SRE GPT: We are building an intelligent system that instantly recognizes anomalies and potential incidents. SRE GPT automatically localizes the problem to a specific service or component, analyzes root causes based on historical data and logs, performs standard recovery actions, and escalates complex cases to the relevant specialists. You will develop a multi-agent RAG architecture, integrated with Yandex infrastructure via MCP servers, making SRE automation smarter and more reliable.
Chaos engineering: We create controlled failures to test system resilience and uncover hidden problems. You will automate chaos drills, add new failure types, and develop observability tools so the system behaves predictably under load.
Virtual orders: We simulate Taxi operations under peak loads, where virtual drivers transport virtual passengers along real routes. You will develop the simulator, analyze performance, and identify bottlenecks that affect system stability and scalability.
Observability tools: We combine key metrics, logs, and tracing mechanisms into a single interface that helps engineers quickly understand the system's current state and coordinate actions during incidents. You will develop this ecosystem: improve data collection, visualization, and interaction scenarios to make investigations faster and more effective.
Anomaly detection: We analyze service behavior to detect performance degradation and errors in advance. You will improve the analysis algorithms, increase signal accuracy, and integrate detection with other automation systems.
Graceful degradation: We develop mechanisms that temporarily reduce load and disable non-critical features while preserving core functionality. You will design and implement degradation scenarios so the service remains available even during partial failures.
Auto-recovery: We create automation that responds to failures, reduces load, and rolls back potentially dangerous changes. You will develop this system, add new response scenarios, and increase the predictability of service behavior during incidents.
Learn more about our work in the video "Anthology of Yandex Taxi Technologies. Service Reliability", the talk "Taxi Reliability Tools", and the video "How Yandex Taxi Reliability Is Built", available in Russian and in English.
Development: Your tasks will include improving the real-time load balancing system, developing SRE GPT (tools for intelligent analysis and automatic incident recovery), creating a flexible emulator of client actions, automating chaos scenarios and analyzing their impact, and developing tools for latency degradation analysis.
Architecture: You will design and develop services for the reliability platform, choose optimal solutions, conduct technical experiments, and assess the impact on the stability and reliability of key Taxi components.
Research: You will study the system and identify areas to improve fault tolerance, scaling successful practices across dozens of teams and hundreds of microservices.
Experience: 3 years
Employment: Full-time
Work Format: Hybrid, Onsite
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation