Experience: 3-5 years
Employment: Full-time
Work Format: Onsite, Hybrid, Remote
Grade: Middle
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation
Backend Developer in ApplicationTeam (Observability)
Yandex's Observability Platform is the key tool that ensures the reliability of the company's infrastructure. It is a centralized monitoring platform that processes 2.4 billion write requests and 2 billion read requests every second in real time. This scale places special demands on the performance, fault tolerance, and quality of the solutions we build.
Almost all Yandex teams, as well as thousands of external Yandex Cloud clients, use our tools daily to monitor the operation of their systems and prevent incidents.
ApplicationTeam is the entry point to the Observability platform. We are responsible for user interaction with the system: we process all incoming requests to the monitoring system, design scalable APIs, support SDKs, and build integrations with services of various scales. We work closely with other teams and provide access to metrics, logs, traces, and alerts.
Currently, ApplicationTeam consists of six experienced backend developers who build the ways users interact with the Observability platform. Our main stack is Java, but we use C++ and Go for some tasks. We build the platform together with other backend teams and UI developers.
How we work:
Subscribe to the Telegram channel Inside Yandex Cloud to learn more about our team and technologies!
**Out-of-the-Box Monitoring**
Today, developers who want to start monitoring a new service have to spend a long time working through configurations, integrations, and manual metric setup. Our goal is for as much information as possible about the state of services to be collected and visualized automatically, with no additional effort on the user's part. You will design and implement technologies for service auto-discovery, dynamic metric generation, and automatic dashboard creation that work "out of the box" even in the most complex scenarios, from bare metal to distributed cloud systems. You will face an exciting challenge: how do you cover 90% of service types without writing a single line of additional code? How do you guarantee full monitoring even when the infrastructure changes suddenly?
**Instant Drilldown During Incidents and Gaining Insights**
In modern distributed systems, localizing an incident can be harder than fixing it. Your task is to create tools that, in a "firefighting" situation, let a user open the platform and immediately understand what happened, where exactly the failure occurred, how it affected the system, and what to do next, instead of getting lost in a huge volume of raw data. The user sees not fragmented data but a connected story of what is happening: the system suggests possible root causes and offers ways to quickly drill into the details to localize the failure.
**Interfaces and Protocols for Interacting with the Platform**
In our ecosystem, standard solutions are often insufficient, so we have to build a lot from scratch or adapt existing solutions to our goals. To process millions of metrics per second, we created our own binary format, Spack. Unlike Protobuf, it supports dynamic metric sets without a schema, compresses data efficiently (LZ4, ZSTD), and decodes it quickly, which is critical for stable operation under extreme loads.
You will design API architecture, develop SDKs and gRPC interfaces, and evolve internal data exchange protocols and formats to ensure unified standards and high fault tolerance for all Yandex services.
**AI/ML Integration**
We want to move from reactive to proactive monitoring: a system that can identify the root causes of incidents and anticipate problems that have not yet clearly manifested but can already be predicted. You will develop an intelligent layer for our platform, from creating and deploying anomaly detection models (based on metrics, logs, and traces) to building alerting mechanisms and diagnosing complex failures. You will solve hard problems: how do you learn from incidents that don't repeat? How do you reduce the time it takes to detect and localize problems in the infrastructure?
See other vacancies for the Yandex Cloud Observability Platform direction via the link.