Experience: 3-5 years
Employment: Full-time
Work Format: Onsite, Hybrid, Remote
Grade: Middle
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation
Developer for Unified Agent
Unified Agent is a key component of Yandex's infrastructure for data collection and processing. It operates as a universal collector, integrating with various services and providing a complete information processing cycle: from collecting logs and metrics to transmitting them to centralized monitoring systems.
Key features of the agent:
Technical aspects:
The scale of usage is impressive: Unified Agent is deployed in 2 million containers within Yandex's infrastructure with a total traffic of about 1 TB/s and is the primary tool for collecting metrics in Yandex Cloud. This solution competes with popular tools such as Fluentd, Logstash, Amazon CloudWatch Agent, and Datadog Agent.
The project is actively developing, and we plan to release it as open source. For you, this is an excellent opportunity to join the team at an early stage and contribute to shaping the architecture and functionality!
Here is a small part of the challenges facing the Unified Agent development team:
Yandex's infrastructure hosts many distributed servers, and gaining access to the console takes significant time. The agent therefore has to make it possible to monitor logs from many servers quickly and in real time.
Monitoring 2 million hosts manually is impossible, and so is identifying issues with traditional methods. Because collecting additional metrics and logs significantly increases network traffic, the agent needs an intelligent diagnostics system that not only detects server issues in advance but also suggests how to fix them.
When things don't go according to plan, Unified Agent becomes a key diagnostic tool. Therefore, it is critically important that the agent can continue to operate even during system failures, ensuring the delivery of logs and metrics regardless of circumstances. Join the development of a tool that plays a vital role in managing and monitoring the infrastructure of one of the largest technology companies!
Subscribe to the Inside Yandex Cloud Telegram channel to learn more about our team and technologies!
Developing high-load components in C++
The agent functions as a multi-module system, where each module is responsible for network interaction and file system operations. To optimize resource usage, all operations are performed through shared components, which the development team creates and improves.
Optimizing performance when working with large volumes of data
The system continuously receives large data streams as input, processes them efficiently, and reliably transmits them to metric collection servers. You will create efficient processing scenarios, closely monitor the throughput and performance of the entire system, and profile and optimize critical sections of code.
Developing reliable network interaction between system components
The agent interacts with numerous clients and exchanges data with servers. At the same time, the system must efficiently handle incoming data streams that exceed its processing capacity and must adhere to server quotas to prevent infrastructure overload. Network protocols therefore need load control and traffic balancing mechanisms so that the system operates stably under various operating conditions.
Participating in system architecture design and implementing new scenarios for collecting metrics and logs
You will design and develop the agent's architecture, which must be flexible and scalable to match Yandex's growing infrastructure. At the same time, it is important to ensure support for various data protocols and formats, as well as create mechanisms for quickly adapting the system to changes in the company's infrastructure.
Proposing and implementing innovative solutions to improve the product
You will actively contribute to the system's development by creating new optimization methods and effective ways to collect and process metrics and logs. You will be able to implement your own ideas for improving the system and propose solutions that adapt the agent to changes proactively.
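To give a flavor of the load control mentioned above (keeping outgoing traffic within server quotas), here is a minimal sketch of a token-bucket limiter. It is an illustration only, not Unified Agent's actual code: the class name `TokenBucket`, its methods, and the byte-based units are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical token-bucket limiter (illustrative; not the agent's real API).
// Tokens are bytes; they refill at a fixed rate up to a burst capacity.
class TokenBucket {
public:
    TokenBucket(double rate_bytes_per_sec, double burst_bytes)
        : rate_(rate_bytes_per_sec), capacity_(burst_bytes), tokens_(burst_bytes) {
        assert(rate_ > 0.0 && capacity_ > 0.0);
    }

    // Returns true and spends tokens if `bytes` fits in the quota at `now_sec`
    // (a monotonic timestamp in seconds); otherwise leaves the bucket unchanged.
    bool TryConsume(double bytes, double now_sec) {
        Refill(now_sec);
        if (bytes > tokens_) {
            return false;  // Caller should buffer, delay, or drop the data.
        }
        tokens_ -= bytes;
        return true;
    }

    // Tokens currently available at `now_sec`.
    double Available(double now_sec) {
        Refill(now_sec);
        return tokens_;
    }

private:
    // Adds tokens for the time elapsed since the last refill, capped at capacity.
    void Refill(double now_sec) {
        if (now_sec > last_sec_) {
            tokens_ = std::min(capacity_, tokens_ + (now_sec - last_sec_) * rate_);
            last_sec_ = now_sec;
        }
    }

    double rate_;
    double capacity_;
    double tokens_;
    double last_sec_ = 0.0;
};
```

In a real sender, the timestamp would likely come from a monotonic clock such as `std::chrono::steady_clock`, and a rejected `TryConsume` call would typically feed a bounded buffer or signal backpressure to the client rather than silently discarding data.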