Experience: 3 years
Employment: Full-time
Grade: Middle
Specialization: AI Engineering
Industry: IT & Tech
Company Type: Corporation
Developer of RL Training Infrastructure for LLMs
We build infrastructure for training and fine-tuning large language models (LLMs) and vision-language models (VLMs), which are used in Alice, Search, Advertising, and other Yandex services. Training such models today runs on a complex system involving tens of thousands of servers, millions of compute cores, and multi-level interconnects between them. Our task is to make this system as efficient as possible: use compute resources wisely and minimize the risk of failures.
Model training has become a problem of managing complex distributed systems. We need to ensure fault tolerance, deliver data efficiently, and minimize communication delays. The more complex the system, the more points of failure it has; and the more resources training requires, the higher the overhead of each launch. Our team works at the intersection of ML mathematics and hardware infrastructure: we must understand both the specifics of the hardware (GPUs, networks, data buses, disks, memory) and the nuances of the training process itself (its components, how they interact, and where the bottlenecks are).
One popular approach to training LLMs is reinforcement learning (RL). As the method gains popularity, increasingly complex techniques are emerging and the demand for compute is growing, and with it the need for specialized infrastructure.
Optimization of RL training infrastructure: You will improve key components by optimizing data delivery and storage, optimizing communication between training blocks, and improving efficiency within blocks.
Development of diagnostic tools: You will create and improve tools that allow infrastructure problems to be identified and resolved quickly.
Enhancing infrastructure fault tolerance: You will implement approaches that make the training infrastructure resilient to various errors and failures.
Research and implementation of modern solutions: You will study the latest approaches to organizing RL training infrastructure, evaluate their effectiveness, and implement them in real projects.