Senior Developer for the GPU Infrastructure Team

We are developing an internal container cloud (Runtime Cloud – RTC), which runs all services created by thousands of Yandex developers. Our goal is to create a convenient cloud for services of various scales, from a few to tens of thousands of containers per service, while efficiently utilizing all hardware resources at our disposal. Currently, the internal cloud manages over 100,000 physical servers. They host more than 50,000 services, and the container count is in the millions. Our cloud also hosts InfiniBand clusters for distributed training, which are included in the TOP500 supercomputer ranking.

We not only allow configuring and launching services but also provide users with everything needed to operate them: we set up load balancing, provide monitoring for launched services, collect logs, support CI/CD integration, offer convenient ways for ad-hoc and fleet-wide profiling, and more. We aim to turn the cloud into a single, tightly integrated platform (PaaS) that ensures convenience and reliability for service development and operation, allowing developers to use both standard API/UI mechanisms and an Infrastructure as Code approach.

In addition, we develop internal tools for managing hardware, certificates, and access to minimize the operational load on the cloud and our user support.

The GPU Infrastructure team is responsible for developing services to ensure the operability of Yandex's entire GPU infrastructure, and ML/HPC components for distributed inference and training, which enable efficient use of modern accelerators and RDMA networks. We are actively involved not only in developing system software and distributed computing frameworks for training and inference of ML models but also in the design of our RDMA clusters, their configuration, monitoring, and optimization throughout their entire lifecycle.

Our internal developments, used by thousands of Yandex ML engineers:
* Operator for distributing GPU and RDMA devices to containers for inference and training services. The service works both in Yandex's internal cloud with the Porto container runtime and in K8s with CRI runtimes
* Mechanisms for hardware checks of GPU and RDMA devices
* Fleet-wide GPU profiler for analyzing the performance of both training and inference services
* GPU cluster monitoring service for the internal cloud
* Automated testing service for GPU clusters using our own and well-known open-source benchmarks

We actively participate in the development and improvement of open-source solutions that we use extensively in-house. Here are just a few of them:
* UCX: an efficient P2P communication framework over InfiniBand, RoCE, TCP/IP, CUDA IPC, GDR, etc.
* NCCL and UCC: collective operation frameworks for CPU/GPU memory
* SGLang/vLLM/TRT-LLM: LLM inference frameworks
* Dynamo: service for running inference frameworks in disaggregated mode
* PyTorch Kineto: GPU profiling service using CUPTI

What tasks await you

Maintain and develop system software responsible for configuring, monitoring, and allocating GPU and RDMA devices to user containers on servers

The internal cloud consists of servers with different models of GPU and RDMA devices, which must be allocated on request to services running in a Porto container within the YP cluster or in other CRI-compatible container runtimes in a K8s cluster. The devices must be configured to meet user requirements, and the necessary driver-dependent libraries must be delivered into the container for the convenience of the services. While running in the user container, services should receive metrics on utilization, potential problems, and so on. These and other tasks are solved by our service, which must work reliably and be constantly improved to address new challenges. For example, one of the recent tasks was "coloring" InfiniBand traffic using eBPF to implement guarantees.

Maintain and develop fleet-wide GPU profiling for all Yandex services

Modern servers, especially those equipped with accelerators and high-speed RDMA networks, are very expensive. This challenges clouds and the services running in them to use the provided hardware as efficiently as possible. One optimization method is profiling applications while they are running in the cloud. We have implemented and integrated into our cloud a CUPTI-based profiler that profiles applications across the entire fleet continuously and with minimal overhead, giving services up-to-date information on utilization problems.

Develop services for automated management of GPU infrastructure

There are many GPU servers in our cloud, and they all require efficient management without human intervention: before entering production after repairs or other scheduled maintenance, they must undergo the necessary testing of GPU devices, RDMA networks, and other components. We solve this task by integrating and developing modern benchmarks, load tests, and regression tests. New technologies integrated into our cloud, such as IBGDA, one of the latest, must be covered by regression tests. Our services also monitor the fleet's state to find servers with various problems and to guarantee high availability of hardware resources, comparable to or exceeding the level of other companies. These and other methods allow us to detect problems early, before services are deployed on these servers.

Develop and optimize infrastructure for distributed disaggregated inference and training

We believe ML engineers should focus on organizing training and deploying new LLMs into production. Our cloud provides the basic components for efficient distributed inference and training, refined and tested with our specific features in mind. We take part in researching, developing, evolving, and operating cutting-edge technologies: for example, we provide distributed disaggregated inference technology that any service can deploy literally with one click.

Participate in the design and implementation of new hardware in our cloud

Modern cloud solutions must be efficient and high-performance in terms of hardware utilization. This starts with the design, configuration, and reliable monitoring of the hardware. We introduce modern hardware into our cloud, refining all levels of system software, from user libraries to the container runtime and vendor drivers, so that our users can use new hardware without any changes to their applications. We constantly face new tasks, for example, implementing new RDMA networks, new accelerators, and ARM support.

We expect you to

Know Go, C/C++, Python (not necessarily all at once)
Be able to write maintainable and efficient code
Have a good understanding of computer networks, operating systems, containerization, and virtualization principles
Be able to work with K8s
Be interested in R&D work and able to solve atypical tasks

It will be a plus if you

Know Rust
Have worked on projects related to GPU distributed computing
Have developed or used CUDA, OpenCL, SYCL, ROCm, or other runtimes for parallel computing
Have developed or used Verbs, UCX, OFI, NCCL, UCC, MPI, or other runtimes for P2P or collective network communications
Have developed or used inference frameworks in your work: SGLang, vLLM, TRT-LLM, Mooncake, Dynamo, and others
Have developed the Linux kernel or its modules
Know the architecture of x86 and AArch64 hardware and its specifics
