Java Developer for the Alerting & Notifications Team

Our services are part of the Observability platform: they let users create alerts and receive notifications about state changes through notification channels (Telegram, phone calls, cloud functions, etc.). Our goal is to give users an easy and fast way to get a clear answer about the state of their systems at any moment. Almost all Yandex teams use the platform to monitor the state of their services, both external and internal. The service is also available to Yandex Cloud users.

Our team is responsible for the development and support of four main services:

Alerting, a service that evaluates user alerts on top of metrics. We support several types of alerts: PromQL-like expressions, SLO alerts, anomaly alerts based on ML algorithms, and aggregates over other alerts.

Alerting in numbers:
* 24M alerts calculated every minute
* 3M+ RPS for reading time series and serving user requests
* 900+ servers
* 29 TB RAM
* 7K+ CPU

Notification Service, which lets users define a template and, through a single API, send both simple notifications and complex notification sequences by SMS, messengers, and phone ("Send a message, and if there's no response within 10 minutes, make a call").
Event Monitoring System, which builds a high-level (aggregated) service state from health data coming from various sources (tens of millions of unique events), according to user-defined rules.
Synthetic Monitoring System, a new platform service that lets users configure ping-style checks, such as host liveness checks and certificate checks.
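
To make the notification sequence quoted above concrete ("send a message, then call if nobody responds within 10 minutes"), here is a minimal sketch of how such a sequence could be modeled. It is an illustration only, not the Notification Service's API: the class `EscalationSketch`, the type `Step`, and the method `channelsFired` are hypothetical names, and the assumption is that a sequence is an ordered list of steps, each with its own acknowledgement window.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class EscalationSketch {
    /** Hypothetical step: a channel plus how long to wait for an acknowledgement. */
    static final class Step {
        final String channel;
        final Duration waitForAck;

        Step(String channel, Duration waitForAck) {
            this.channel = channel;
            this.waitForAck = waitForAck;
        }
    }

    /**
     * Returns the channels that would fire, in order.
     * ackAfter is the time from the start of the sequence at which the user
     * acknowledges; null means the user never acknowledges.
     */
    static List<String> channelsFired(List<Step> steps, Duration ackAfter) {
        List<String> fired = new ArrayList<>();
        Duration elapsed = Duration.ZERO;
        for (Step step : steps) {
            fired.add(step.channel);
            elapsed = elapsed.plus(step.waitForAck);
            // Stop escalating if the acknowledgement arrived before this step's window closed.
            if (ackAfter != null && ackAfter.compareTo(elapsed) <= 0) {
                break;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Step> steps = List.of(
                new Step("message", Duration.ofMinutes(10)),
                new Step("call", Duration.ZERO));
        // Acknowledged after 3 minutes: only the message is sent.
        System.out.println(channelsFired(steps, Duration.ofMinutes(3)));
        // Never acknowledged: the sequence escalates to a call.
        System.out.println(channelsFired(steps, null));
    }
}
```

Modeling the sequence as plain data keeps the escalation rule independent of any particular channel, which is one plausible way to serve SMS, messengers, and phone calls through a single API.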

What tasks await you

Scaling systems in accordance with load growth
Load grows steadily (around 30% per year), so the systems need regular scaling: finding bottlenecks, researching possible solutions, and implementing horizontal scaling.

Implementing fault tolerance
If alerting is not working, users lose visibility into their production and might miss a problem that could lead to a serious incident. Alerting therefore has to evaluate every alert on time and be fault-tolerant. You will implement a hot-standby mode in the load balancer, designed so that a rack or module failure in the data center never takes out both replicas. In addition, projects must be isolated from one another so that problems in one do not affect the others.
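
The requirement that a rack or module failure must not take out both replicas can be read as a placement constraint. Below is a minimal sketch of that constraint, not our load balancer's actual logic: the names `ReplicaPlacementSketch`, `Host`, and `pickStandby` are hypothetical, and it assumes every host is labeled with the rack and the data center module it belongs to.

```java
import java.util.List;
import java.util.Optional;

public class ReplicaPlacementSketch {
    /** Hypothetical host description: name plus its failure domains. */
    static final class Host {
        final String name;
        final String rack;
        final String module;

        Host(String name, String rack, String module) {
            this.name = name;
            this.rack = rack;
            this.module = module;
        }
    }

    /**
     * Picks the first candidate that shares neither a rack nor a module with
     * the primary, so that a single rack or module failure cannot affect both.
     * Returns an empty Optional if no such host exists.
     */
    static Optional<Host> pickStandby(Host primary, List<Host> candidates) {
        return candidates.stream()
                .filter(h -> !h.name.equals(primary.name))
                .filter(h -> !h.rack.equals(primary.rack))
                .filter(h -> !h.module.equals(primary.module))
                .findFirst();
    }

    public static void main(String[] args) {
        Host primary = new Host("a", "rack-1", "module-1");
        List<Host> hosts = List.of(
                primary,
                new Host("b", "rack-1", "module-1"),  // same rack: rejected
                new Host("c", "rack-2", "module-1"),  // same module: rejected
                new Host("d", "rack-3", "module-2")); // independent failure domains
        System.out.println(pickStandby(primary, hosts).map(h -> h.name).orElse("none"));
    }
}
```

A real placement would also weigh load and capacity; the sketch only shows the failure-domain check.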

Designing technical and product solutions for user scenarios
Our users are developers just like us, and the problems they face are relevant to us too. Developers on the team therefore always take part in discussing and designing both technical and product solutions. For example: what SLO alerts should look like, how to make it clear to the user what counts as a good event and a bad event, and how to create an SLO alert for timings within this paradigm.

Simplifying user scenarios for working with alerts
The first thing a user faces is how to set up an alert for a specific scenario. Here we want to offer typical alerts: for deviation, trend, sudden spikes and drops, and SLO. Next, the user needs to be confident that the alert actually catches problems. It must be possible to evaluate an alert against historical data, so that the user can tune it to catch real problems without being overly sensitive. And once the alert is created, it is important to understand why it triggered: was it a single replica not responding, a data delivery lag, or a real problem?

Developing the common platform
You will adapt existing functionality for deployment in Yandex Cloud with access for external users, and help build a unified observability platform for the company's other employees.

More about backend development at Yandex: see the Yandex for Backend channel.

We expect that you

Can write and understand multithreaded code: all of Alerting runs asynchronously on the actor model
Are ready to write code in Java and Go (an 80% to 20% split)
Understand key aspects of building fault-tolerant distributed systems
