Warden is an internal platform for incident management that ensures the stability of various Yandex services, from Kinopoisk and Advertising to Taxi, Alice, and others. We are actively expanding, with our DAU growing by 25% in the last six months. Our goal is to enter the external market as an independent product.
We automate incident management during the hot phase and create tools for post-incident review. We influence how quickly an incident is resolved, how quickly responsible parties are notified, and how downtime and losses are calculated. Warden consists of 10 microservices, including a chatbot used daily by over 3000 Yandex employees. Join us if you are interested in working on a project that impacts the entire company. In our team, you will be the first to know about all technical failures at Yandex.
Build a reliable distributed service that can survive the failure of two data centers
You will develop a system used by Yandex services with millions of instances. We have tens of thousands of real-time graphs for processing each incident scenario.
Work with consensus algorithms
The core of our distributed system uses the Raft algorithm. We have tasks to improve the reliability of our services.
Work not only on infrastructure tasks
We provide a unified stability-management tool for very different classes of services: Search, Music, Taxi. We help all services stay available 99.9999% of the time and help developers learn about a problem within 30 seconds of its occurrence. We help monitor both the overall availability of services and the operation of specific user scenarios: for example, that radio works in Music.
Experience: 3-5 years
Employment: Full-time
Work Format: Hybrid, Onsite
Grade: Middle
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation