Description
We develop high-performance CUDA operators for PyTorch that enable training and inference of multimodal models with maximum GPU utilization. The focus is on low-level optimization, custom kernels, memory management, and efficient use of new GPU architectures.
Responsibilities
- Development and optimization of custom CUDA operators and extensions for PyTorch (C++/CUDA).
- Profiling and eliminating bottlenecks in computational kernels (Nsight Compute, nvprof).
- Optimization of memory usage (shared memory, registers, coalesced access, persistent kernels).
- Implementation of parallel computing algorithms that take into account the architectural features of modern GPUs (Ampere, Hopper, and newer).
- Integration of CUDA optimizations into distributed training and inference pipelines.
- Close collaboration with Research and Distributed Learning teams to support custom models and operators.
Requirements
- Expert-level C++ and CUDA.
- Experience in performance optimization for NVIDIA GPUs.
- Knowledge of PyTorch internals (ATen, dispatcher, TensorIterator).
- Skill in GPU profiling and in finding and eliminating bottlenecks in neural network operator implementations.
- Experience with mixed precision and custom kernels.
Nice to have: experience with Triton, CUTLASS, cuBLASLt, NCCL; contributions to PyTorch open-source projects.
Conditions
- Comfortable modern office near Kutuzovskaya metro station
- Hybrid work format
- Annual salary review, quarterly and annual bonuses
- Corporate gym and recreation areas
- More than 400 educational programs from SberUniversity for professional and career development
- Onboarding program and supervisor support when you start
- Extended voluntary health insurance, discounted insurance for family members, and a corporate pension program
- Mortgage rate discount of up to 7% for every employee
- Free SberPrime+ subscription, discounts on products from partner companies
- Referral bonus for recommending friends to the Sber team