Description
We develop high-performance CUDA operators for PyTorch that enable training and inference of multimodal models with maximum GPU resource utilization. The focus is on low-level optimization, custom kernels, memory management, and efficient use of new GPU architectures.
Responsibilities
- Development and optimization of custom CUDA operators and extensions for PyTorch (C++/CUDA).
- Profiling and elimination of bottlenecks in computational kernels (Nsight Compute, Nsight Systems; nvprof on older architectures).
- Optimization of memory usage (shared memory, registers, coalesced access, persistent kernels).
- Implementation of parallel computing algorithms that account for the architectural features of modern GPUs (Ampere, Hopper, and newer).
- Integration of CUDA optimizations into distributed training and inference pipelines.
- Close collaboration with Research and Distributed Learning teams to support custom models and operators.
Requirements
- Expert-level C++ and CUDA.
- Experience in performance optimization for NVIDIA GPUs.
- Knowledge of PyTorch internals (ATen, dispatcher, TensorIterator).
- Skills in GPU profiling and in identifying and eliminating bottlenecks in neural network operator implementations.
- Experience with mixed precision and custom kernels.
Bonus: Experience with Triton, CUTLASS, cuBLASLt, NCCL; participation in PyTorch open-source projects.
Conditions
- Comfortable modern office near Kutuzovskaya metro station
- Hybrid work format
- Annual salary review; quarterly and annual bonuses
- Corporate gym and recreation areas
- More than 400 educational programs from SberUniversity for professional and career development
- Onboarding program and support from your manager when you start
- Extended voluntary health insurance, discounted insurance for family members, and a corporate pension program
- Mortgage with benefits up to 7% for every employee
- Free SberPrime+ subscription and discounts on products from partner companies
- Referral bonus for recommending friends to join the Sber team