Java Developer for the Alerting & Notifications Team
Our services are part of the Observability platform — they allow creating alerts and notifying about state changes via notification channels (Telegram, phone calls, cloud functions, etc.). Our goal is to provide the user with an easy and fast way to get a clear answer about the state of their systems at any moment. Almost all Yandex teams use the platform's capabilities to monitor the state of their services — both external and internal. In addition, the service is available to Yandex Cloud users.
Our team is responsible for the development and support of four main services:
Alerting, in numbers:
* 24M alerts calculated every minute
* 3M+ RPS for reading time series and serving user requests
* 900+ servers
* 29 TB RAM
* 7K+ CPU
Notification Service, which lets you define a template and, through a single API, send notifications by SMS, messenger, and phone call: both simple notifications and complex sequences ("Send a message, and if there's no response within 10 minutes, make a call").
Event Monitoring System, which builds a high-level (aggregated) service state from health data arriving from various sources (tens of millions of unique events), according to user-defined rules.
Synthetic Monitoring System, a new platform service that lets you configure ping-style checks, such as host liveness checks and certificate checks.
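To make the "message first, then call" sequence from the Notification Service description concrete, here is a minimal sketch in Java. It is not the service's actual API: the `Step` record, the `channelsFired` method, and the channel names are all hypothetical, chosen only to illustrate how a sequence with a response timeout might be modeled.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class EscalationSketch {
    /** One step of a sequence: a channel plus how long to wait for a response (hypothetical model). */
    public record Step(String channel, Duration waitForAck) {}

    /**
     * Returns the channels that would fire for a given acknowledgement time.
     * ackAfter == null means the user never responded.
     */
    public static List<String> channelsFired(List<Step> steps, Duration ackAfter) {
        List<String> fired = new ArrayList<>();
        Duration elapsed = Duration.ZERO;
        for (Step step : steps) {
            // Stop escalating once the acknowledgement has arrived.
            if (ackAfter != null && ackAfter.compareTo(elapsed) <= 0) {
                break;
            }
            fired.add(step.channel());
            elapsed = elapsed.plus(step.waitForAck());
        }
        return fired;
    }

    public static void main(String[] args) {
        // "Send a message, and if there's no response within 10 minutes, make a call."
        List<Step> steps = List.of(
                new Step("telegram", Duration.ofMinutes(10)),
                new Step("phone", Duration.ZERO));
        System.out.println(channelsFired(steps, Duration.ofMinutes(3)));
        System.out.println(channelsFired(steps, null));
    }
}
```

In this sketch a response after 3 minutes stops the sequence at the message, while no response at all lets it escalate to the phone call.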
Scaling systems as load grows
Load grows steadily (around 30% per year), which requires regular scaling: finding bottlenecks, researching solution options (R&D), and implementing horizontal scaling.
Implementing fault tolerance
If alerting is down, users lose visibility into their production and may miss a problem that could lead to a serious incident. Alerting therefore has to evaluate every alert on time and be fault-tolerant. You will implement a hot-standby mode in the load balancer so that a rack or module failure in the data center never takes out both replicas. In addition, projects must be isolated so that problems in one do not affect the others.
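The replica requirement can be illustrated with a small placement check. This is only a sketch under assumptions: the `Host` and `Pair` records, the `module` and `rack` labels, and the brute-force search are hypothetical and say nothing about how the real load balancer works. The point is just the constraint itself: the primary and the hot standby must share neither a rack nor a module.

```java
import java.util.List;
import java.util.Optional;

public class ReplicaPlacementSketch {
    /** A host with its failure domains; field names are illustrative assumptions. */
    public record Host(String name, String module, String rack) {}

    /** A primary and its hot standby. */
    public record Pair(Host primary, Host standby) {}

    /** True if no single rack or module failure can take out both hosts. */
    public static boolean isolated(Host a, Host b) {
        return !a.module().equals(b.module()) && !a.rack().equals(b.rack());
    }

    /** Picks the first pair of hosts that share neither a rack nor a module. */
    public static Optional<Pair> place(List<Host> hosts) {
        for (int i = 0; i < hosts.size(); i++) {
            for (int j = i + 1; j < hosts.size(); j++) {
                if (isolated(hosts.get(i), hosts.get(j))) {
                    return Optional.of(new Pair(hosts.get(i), hosts.get(j)));
                }
            }
        }
        // No safe pair exists in this host set.
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Host> hosts = List.of(
                new Host("h1", "module-a", "rack-1"),
                new Host("h2", "module-a", "rack-2"),
                new Host("h3", "module-b", "rack-7"));
        System.out.println(place(hosts));
    }
}
```

With these sample hosts, h1 and h2 are rejected as a pair because they sit in the same module, so the sketch pairs h1 with h3.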
Working out technical and product solutions for user scenarios
Our users are developers just like us, and the problems they face are relevant to us too. That is why developers always take part in discussing and working out both technical and product solutions. For example: what SLO alerts should look like; how to make it clear to the user what counts as a good event and a bad event; how to create an SLO alert for timings in this paradigm.
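One common way to frame an SLO for timings in terms of good and bad events is to call a request "good" if it finished within a latency threshold, and then compare the share of good events with a target. The sketch below assumes that framing; the method names, the threshold, and the target are hypothetical and do not describe how the platform defines SLO alerts.

```java
import java.util.List;

public class SloSketch {
    /** Share of good events, where a request is "good" if its latency is within the threshold. */
    public static double goodRatio(List<Long> latenciesMs, long thresholdMs) {
        if (latenciesMs.isEmpty()) {
            return 1.0; // Assumption: no traffic counts as compliant.
        }
        long good = latenciesMs.stream().filter(l -> l <= thresholdMs).count();
        return (double) good / latenciesMs.size();
    }

    /** The alert fires when the share of good events drops below the target. */
    public static boolean sloViolated(List<Long> latenciesMs, long thresholdMs, double target) {
        return goodRatio(latenciesMs, thresholdMs) < target;
    }

    public static void main(String[] args) {
        List<Long> latenciesMs = List.of(100L, 120L, 900L, 150L);
        // Three of four requests finished within 300 ms.
        System.out.println(goodRatio(latenciesMs, 300));
        System.out.println(sloViolated(latenciesMs, 300, 0.9));
    }
}
```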
Simplifying user scenarios for working with alerts
The first thing a user runs into is how to set up an alert for a specific scenario. Here we want to offer typical alerts: for deviation, trend, sudden spikes and drops, and SLO. Next, the user needs to be confident that the alert catches problems. Alerts must be evaluable against historical data, so that the user can tune an alert that is not overly sensitive yet still catches real problems. And once the alert exists, it is important to understand why it fired: one replica not responding, a data delivery lag, or a real problem.
Developing the common platform
You will adapt existing functionality for deployment in Yandex Cloud with access for external users, and for building a unified observability platform for other company employees.
More about backend at Yandex: see the Yandex for Backend channel.
Experience: 3-5 years
Employment: Full-time
Work Format: Hybrid, Remote, Onsite
Grade: Middle
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation