Description

We develop high-performance CUDA operators for PyTorch that enable training and inference of multimodal models with maximum GPU utilization. The focus is on low-level optimization, custom kernels, memory management, and efficient use of new GPU architectures.

Responsibilities

Development and optimization of custom CUDA operators and extensions for PyTorch (C++/CUDA).
Profiling and eliminating bottlenecks in computational kernels (Nsight Compute, nvprof).
Optimization of memory usage (shared memory, registers, coalesced access, persistent kernels).
Implementation of parallel computing algorithms that account for the architectural features of modern GPUs (Ampere, Hopper, and newer).
Integration of CUDA optimizations into distributed training and inference pipelines.
Close collaboration with Research and Distributed Learning teams to support custom models and operators.

Requirements

Expert-level C++ and CUDA.
Experience in performance optimization for NVIDIA GPUs.
Knowledge of PyTorch internals (ATen, dispatcher, TensorIterator).
Skill in GPU profiling and in finding and eliminating bottlenecks in neural network operator implementations.
Experience with mixed precision and custom kernels.

Nice to have: experience with Triton, CUTLASS, cuBLASLt, NCCL; contributions to PyTorch open-source projects.

Conditions

Comfortable modern office near Kutuzovskaya metro station
Hybrid work format
Annual salary review, quarterly and annual bonuses
Corporate gym and recreation areas
More than 400 educational programs from SberUniversity for professional and career development
Onboarding program and supervisor support at the start
Extended voluntary health insurance, preferential insurance for family, and corporate pension program
Preferential mortgage rates, up to 7% lower, for every employee
Free SberPrime+ subscription, discounts on products from partner companies
Referral bonus for recommending friends to the Sber team
