RL Training Infrastructure Developer for LLMs

We build infrastructure for training and fine-tuning large language models (LLMs) and vision-language models (VLMs) used in Alice, Search, Advertising, and other Yandex services. Modern training of such models runs on a complex system of tens of thousands of servers, millions of compute cores, and multi-level interconnects between them. Our task is to make this system as efficient as possible: to use compute resources wisely and minimize the risk of failures.

Model training has become a problem of managing complex distributed systems. We need to ensure fault tolerance, deliver data efficiently, and minimize communication latency. The more complex the system, the more points of failure it has; the more resources training needs, the higher the overhead of launching a run. Our team works at the intersection of ML mathematics and hardware infrastructure: we have to understand both the hardware (GPUs, networks, data buses, disks, memory) and the training process itself, including its components, how they interact, and where the bottlenecks are.

Reinforcement learning (RL) is one of the popular approaches to training LLMs. As the method grows in popularity, increasingly complex approaches are emerging and the demand for compute is rising, and with it the need to build specialized infrastructure.

What tasks await you

Optimization of RL training infrastructure: You will improve key components: optimize data delivery and storage, optimize communication between training blocks, and improve efficiency within blocks.

Development of diagnostic tools: You will create and improve tools that allow for quick identification and resolution of infrastructure problems.

Enhancing infrastructure fault tolerance: You will implement approaches that make the training infrastructure resilient to various errors and failures.

Research and implementation of modern solutions: You will study the latest approaches to organizing RL training infrastructure, evaluate their effectiveness, and implement them in real projects.

We expect you to

Know Python and have experience in systems programming or in library or framework development
Know the PyTorch framework well and have hands-on experience with distributed training via torch.distributed
Have a solid grasp of parallelization approaches: data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism for distributed inference or training
Be interested in LLMs and MLOps: understand the tasks and challenges of operating large models in production
Be able to work effectively in a team and share knowledge

It will be a plus if you

Have participated in creating infrastructure for ML model training
Have implemented and optimized RL solutions
Have worked with RL training libraries for LLMs (veRL, slime, NeMo-RL, SkyRL, and others) and with inference libraries (vLLM, SGLang, and TRTLLM)
Know C++ and have experience with low-level programming and optimization
Have experience with NVIDIA GPUs: understand GPU architecture and have developed or optimized algorithms using CUDA or Triton
