GPU Performance Engineer

We manage one of the company's scarcest and most expensive resources: graphics processing units (GPUs). Using them efficiently is critical to Yandex's key services: Search, Advertising, Alice, Taxi, Music, and other AI-based products. Our mission is to get the maximum output from every GPU. This is not just resource administration but a strategic role at the intersection of technology and business.

We are looking for a GPU Performance Engineer who will help raise GPU utilization, squeeze maximum performance out of GPU computations, and make our systems fast, scalable, and resilient under high load.

The team works with 150+ products where GPUs are the foundation for AI models. You will become the link between engineering teams and top management, turning technical solutions into direct financial benefit.

About the team

You will join a team that directly influences the efficiency of Yandex's key products. We have no bureaucracy: decisions are made quickly and initiative is welcomed. Ideas for improving GPU usage efficiency are especially valued right now.

We combine technical expertise with a business focus. For example, we recently launched a system for redistributing GPUs between teams that takes into account both the development strategy of each service and the overall company strategy. This initiative saved the company hundreds of millions of rubles and gave a boost to its focus areas.

We plan to create a single GPU usage standard for all Yandex services, focused on increasing usage efficiency and maximizing the resulting profit.

What tasks await you

Improving GPU utilization efficiency
You will formulate hypotheses and research ways to improve GPU utilization, and take part in implementing and deploying the most profitable solutions. You will also write recommendations and best practices for improving performance, to get the most out of the GPU infrastructure.

Optimization and profiling
You will find performance bottlenecks with profilers and eliminate them, and optimize memory access, kernels, latency, and throughput.

Developing diagnostic tools
You will create and improve tools for quickly identifying and fixing infrastructure problems that affect the utilization efficiency, stability, and speed of GPU computations (for both training and inference).

Research and implementation of modern solutions
You will explore the latest approaches to organizing infrastructure for training and inference, evaluate their effectiveness, and implement them in real projects.

Architecture analysis, testing, integration
You will work closely with developers, ML engineers, and system architects. You will take part in evaluating hardware solutions and propose improvements for future GPU generations, and you will develop test plans, build benchmarks, and analyze performance regressions.

More about ML at Yandex: see the Yandex for ML channel.

We expect that you

Know Python and have done systems programming, developed libraries or frameworks
Have worked with the PyTorch framework and distributed training via torch.distributed
Apply parallelization approaches (data, tensor, pipeline, and expert parallelism) to distributed training or inference
Are interested in LLMs and MLOps, and understand the tasks and challenges of operating large models in production
Have worked with NVIDIA GPUs and CUDA: understand GPU architecture, have developed or optimized CUDA algorithms, and have used Nsight, nvprof, or equivalent tools
Know how to optimize GPU application performance and improve GPU utilization efficiency
Can analyze performance profiles and metrics
Can read and optimize complex code
Know how to work effectively in a team and are willing to share knowledge

It will be a plus if you

Are proficient in C/C++ or similar low-level languages
Have worked with RL training libraries for LLMs: veRL, slime, NeMo-RL, SkyRL, and others
Have worked with inference libraries: vLLM, SGLang, and TensorRT-LLM
Have experience optimizing for real production workloads, or have worked with low-latency or real-time systems
