Description
Our team is responsible for the efficiency of training GigaChat models, from pretraining from scratch to large-scale online RL / RLHF experiments. We build the infrastructure that enables large-scale training of large MoE models with maximum cluster utilization. We work at every level of the stack, from CUDA/Triton kernels and low-level optimizations to distributed training and inference acceleration.
The goal is to reduce the cost and duration of training, shorten the time-to-feedback for experiments, increase the stability and performance of the pipeline, and make the training of new/experimental architectures as efficient and predictable as possible.
Responsibilities
- develop ML infrastructure and design a distributed LLM training framework with 5D parallelism support that covers every training stage: pretraining, SFT, PEFT, multimodal training, and RL (RLHF/RLVR)
- achieve maximum resource utilization and close-to-linear scaling for large-scale pretraining and online RL runs
- profile and identify bottlenecks in training, formulate and implement acceleration initiatives, integrate and optimize modern distributed training technologies
- optimize training speed for various H200/B200 clusters and system/software stacks (CUDA, NCCL, drivers).
Requirements
- 2+ years of experience in ML/DL engineering, preferably in training LLMs or building and improving ML infrastructure
- deep understanding of PyTorch: DDP/FSDP, autograd, custom ops, torch.compile, torch.autograd.Function
- knowledge of distributed training and efficient deep learning: 5D parallelism (DP/TP/PP/EP/SP), mixed precision, checkpointing, offloading, profiling, and training optimization
- understanding of LLM architecture: Transformer, attention (MHA/GQA/MLA), RoPE/positional embeddings, long context, MoE
- production-level proficiency in Python (asyncio, multiprocessing, profiling, debugging large systems).
Conditions
- largest DS&AI community — over 600 bank DS specialists
- digest of the latest developments in the field of DS&AI and reports from the world's largest conferences
- opportunity to choose a convenient work format: hybrid or office
- comfortable modern office at Kutuzovsky Prospekt, 32 (Kutuzovskaya metro station)
- annual salary review, annual bonus
- corporate gym and relaxation areas
- more than 400 educational programs from SberUniversity for professional and career development
- extended voluntary health insurance (DMS), discounted insurance for family members, and a corporate pension program
- mortgage on preferential terms, up to 7% more favorable, for every employee
- free SberPrime+ subscription, discounts on products from partner companies
- referral bonus for recommending friends to join the Sber team.