Developer for the Taxi Reliability Platform Team

Projects you will work on:

Real-time load balancing mechanism
We manage mobile app traffic in real time to keep the service stable even during network failures. You will develop a system that can withstand DNS resolver failures and loss of connectivity between telecom operators and Yandex data centers. This is a critical component that ensures the stability and retention of millions of users.

SRE GPT
We are building an intelligent system that instantly recognizes anomalies and potential incidents. SRE GPT automatically localizes the problem to a specific service or component, analyzes root causes based on historical data and logs, performs standard recovery actions, and escalates complex cases to the relevant specialists. You will develop a multi-agent RAG architecture, integrated with Yandex infrastructure via MCP servers, making SRE automation smarter and more reliable.

Chaos engineering
We create controlled failures to test system resilience and uncover hidden problems. You will automate chaos drills, add new failure types, and develop observability tools so the system behaves predictably under load.

Virtual orders
We simulate Taxi operations under peak loads, where virtual drivers transport virtual passengers along real routes. You will develop the simulator, analyze performance, and identify bottlenecks that affect system stability and scalability.

Observability tools
We combine key metrics, logs, and tracing mechanisms into a single interface that helps engineers quickly understand the system's current state and coordinate actions during incidents. You will develop this ecosystem: improve data collection, visualization, and interaction scenarios to make investigations faster and more effective.

Anomaly detection
We analyze service behavior to detect performance degradation and errors in advance. You will improve the analysis algorithms, increase signal accuracy, and integrate detection with other automation systems.
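
To give a feel for this kind of task, here is a minimal sketch of one common detection technique, a rolling z-score over a latency series. It is an illustration only, not a description of the team's actual algorithms; the class name, window size, threshold, and sample values are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flags a value that deviates strongly from a recent window (hypothetical helper)."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_history: int = 5):
        self.values = deque(maxlen=window)  # most recent observations
        self.threshold = threshold          # z-score above which a value is flagged
        self.min_history = min_history      # observations needed before flagging

    def observe(self, value: float) -> bool:
        """Record a value and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.values) >= self.min_history:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            # A zero spread means all recent values are identical; skip the check.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.values.append(value)
        return is_anomaly


# Made-up latency samples in milliseconds; the last one is a spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250]
detector = RollingZScoreDetector(window=10, threshold=3.0)
flags = [detector.observe(v) for v in latencies_ms]
```

Only the final sample is flagged here. A production detector would also have to handle seasonality, noisy signals, and alert routing, which is where the accuracy and integration work comes in.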

Graceful degradation
We develop mechanisms that temporarily reduce load by disabling non-critical features while preserving core functionality. You will design and implement degradation scenarios so the service remains available even during partial failures.
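
As a rough illustration of the idea, the sketch below sheds non-critical features as load rises and never touches core ones. It assumes a single normalized load signal between 0.0 and 1.0; the feature names, thresholds, and the `DegradationPolicy` class are hypothetical and do not describe the real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DegradationPolicy:
    # Maps a non-critical feature to the load level (0.0 to 1.0) at which
    # it is switched off. Core features are not listed, so they always stay on.
    shed_thresholds: dict = field(default_factory=dict)

    def enabled_features(self, all_features: list, load: float) -> list:
        """Return the features that remain enabled at the given load level."""
        return [
            name
            for name in all_features
            if name not in self.shed_thresholds or load < self.shed_thresholds[name]
        ]


# Hypothetical features: ordering a ride is core, the other two can be shed.
policy = DegradationPolicy(shed_thresholds={"promo_banners": 0.7, "ride_history": 0.9})
features = ["order_ride", "promo_banners", "ride_history"]

normal = policy.enabled_features(features, load=0.5)
elevated = policy.enabled_features(features, load=0.8)
critical = policy.enabled_features(features, load=0.95)
```

At normal load everything is on, at elevated load the banners are dropped, and at critical load only ride ordering remains. Real scenarios would typically key off several signals rather than one number.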

Auto-recovery
We create automation that responds to failures, reduces load, and rolls back potentially dangerous changes. You will develop this system, add new response scenarios, and increase the predictability of service behavior during incidents.

Learn more about our work in the video "Anthology of Yandex Taxi Technologies. Service Reliability", the talk "Taxi Reliability Tools", and the videos "How Yandex Taxi Reliability is Built" (in Russian and in English).

What tasks await you

Development
Your tasks will include improving the real-time load balancing system; developing SRE GPT, a set of tools for intelligent analysis and automatic incident recovery; creating a flexible emulator of client actions; automating chaos scenarios and analyzing their impact; and developing tools for latency degradation analysis.

Architecture
You will design and develop services for the reliability platform, choose optimal solutions, conduct technical experiments, and assess the impact on the stability and reliability of key Taxi components.

Research
You will study the system, identify areas where fault tolerance can be improved, and scale successful practices across dozens of teams and hundreds of microservices.

We expect you to

Write, or be ready to write, in Go or Python
Understand distributed systems architecture
Be able to analyze complex technical tasks and propose solutions

It will be a plus if you

Are interested in fault tolerance, observability, and AI tools in SRE
Want to improve the reliability of a product used daily by millions of people
