Backend Developer in ApplicationTeam (Observability)

Yandex's Observability Platform is the key tool for ensuring the reliability of the company's infrastructure. It is a centralized monitoring platform that processes 2.4 billion write requests and 2 billion read requests every second in real time. This scale places strict demands on the performance, fault tolerance, and quality of the solutions we build.

Almost all Yandex teams, as well as thousands of external Yandex Cloud clients, use our tools daily to monitor the operation of their systems and prevent incidents.

ApplicationTeam is the entry point to the Observability platform. We are responsible for user interaction with the system: we process all incoming requests to the monitoring system, design scalable APIs, support SDKs, and build integrations with services of various scales. We work closely with other teams and provide access to metrics, logs, traces, and alerts.

About the Team

ApplicationTeam currently consists of six experienced backend developers who build the ways users and services interact with the Observability platform. Our main stack is Java, but we use C++ and Go for some tasks. We work closely with other backend teams and UI developers to build the platform together.

How we work:

* We use Scrum: we break work down into sprints and plan long-term goals several months ahead
* Newcomers go through onboarding: you will get a mentor, access to internal documentation, and a series of workshops on how Observability is structured
* We are a distributed team: members live in different cities and countries, and we regularly meet offline to discuss plans and simply spend time together

Subscribe to the Telegram channel Inside Yandex Cloud to learn more about our team and technologies!

What tasks await you

**Out-of-the-Box Monitoring**

Today, before developers can start monitoring a new service, they have to spend a long time figuring out configurations, integrations, and manual metric setup. Our goal is for as much information as possible about the state of services to be collected and visualized automatically, with no extra effort from the user. You will design and implement technologies for service auto-discovery, dynamic metric generation, and automatic dashboard creation that work "out of the box" even in the most complex scenarios, from bare metal to distributed cloud systems. The challenges are substantial: how do you cover 90% of service types without writing a single line of additional code? How do you guarantee full monitoring even when the infrastructure changes suddenly?

**Instant Drilldown During Incidents and Gaining Insights**

In modern distributed systems, localizing an incident can be harder than fixing it. Your task is to create tools that, in a "firefighting" situation, let a user open the platform and immediately understand what happened, where exactly the failure occurred, how it affected the system, and what to do next, instead of getting lost in a huge volume of raw data. The user sees a connected story of what is happening rather than fragmented data: the system suggests possible root causes and offers ways to drill into the details quickly to localize the failure.

**Interfaces and Protocols for Interacting with the Platform**

In our ecosystem, classical solutions are often insufficient, so we frequently have to build things from scratch or adapt existing solutions to our needs. To process millions of metrics per second, we created our own binary format, Spack: unlike Protobuf, it supports dynamic metric sets without a schema, compresses data efficiently (LZ4, ZSTD), and decodes it quickly, which is critical for stable operation under extreme loads.

You will design API architecture, develop SDKs and gRPC interfaces, and evolve internal data exchange protocols and formats to ensure unified standards and high fault tolerance for all Yandex services.

**AI/ML Integration**

We want to move from reactive to proactive monitoring: a system that can identify the root causes of incidents and anticipate problems that have not yet clearly manifested but can already be predicted. You will develop an intelligent layer for our platform, from creating and deploying anomaly detection models (based on metrics, logs, and traces) to building mechanisms for alerting and for diagnosing complex failures. You will tackle hard questions: how do you learn from incidents that do not repeat? How do you reduce the time needed to detect and localize problems in the infrastructure?

We expect you to

Understand how distributed and high-load systems are structured, including their architecture specifics and fault tolerance requirements
Have experience in production Java development, including work with large codebases
Be familiar with basic algorithms and data structures and know how to apply them in your work
Know the basics of working in Unix systems and use their tools to diagnose and analyze services
Have developed APIs, SDKs, or libraries for developers
Know how to design convenient, scalable, and secure REST/gRPC interfaces

It will be a plus if you

Have worked with monitoring systems: Prometheus, Grafana, ELK, Jaeger, DataDog, or similar
Are familiar with Terraform or other IaC tools

Other vacancies in the Yandex Cloud Observability Platform area are available via the link.
