Description

Our team is responsible for the efficiency of GigaChat model training, from pretraining from scratch to large-scale online RL / RLHF experiments. We build the infrastructure for training large MoE models at scale with the highest possible cluster utilization. We work at every level of the stack, from CUDA/Triton kernels and low-level optimizations to distributed training and inference acceleration.

The goal is to reduce the cost and duration of training, shorten the time-to-feedback for experiments, increase the stability and performance of the pipeline, and make the training of new/experimental architectures as efficient and predictable as possible.

Responsibilities

develop ML infrastructure and design a distributed LLM training framework with 5D parallelism support that covers every training stage: pretraining, SFT, PEFT, multimodal, and RL (RLHF/RLVR)
achieve maximum resource utilization and close-to-linear scaling for large-scale pretraining and online-RL runs
profile and identify bottlenecks in training, formulate and implement acceleration initiatives, integrate and optimize modern distributed training technologies
optimize training speed for various H200/B200 clusters and system/software stacks (CUDA, NCCL, drivers).

Requirements

2+ years of experience in ML/DL engineering, preferably in training LLMs or building/improving ML infrastructure
deep understanding of PyTorch: DDP/FSDP, autograd, custom ops, torch.compile, torch.autograd.Function
knowledge of distributed training and efficient deep learning: 5D (DP/TP/PP/EP/SP) parallelism, mixed precision, checkpointing, offloading, profiling, and training optimization
understanding of LLM architecture: Transformer, attention (MHA/GQA/MLA), RoPE/positional embeddings, long context, MoE
production-level proficiency in Python (asyncio, multiprocessing, profiling, debugging large systems).

Conditions

the largest DS&AI community: over 600 data science specialists at the bank
digest of the latest DS&AI developments and reports from the world's largest conferences
opportunity to choose a convenient work format: hybrid or office
comfortable modern office: Kutuzovskaya metro station, Kutuzovsky Prospekt, 32
annual salary review, annual bonus
corporate gym and relaxation areas
more than 400 educational programs from SberUniversity for professional and career development
extended voluntary health insurance (DMS), discounted insurance for family members, and a corporate pension program
mortgage on terms up to 7% more favorable for every employee
free SberPrime+ subscription, discounts on products from partner companies
referral bonus for recommending friends to the Sber team.
