C++ Inference Server Developer for the ML Infrastructure Department

Our team develops services that make it fast to deploy neural network inference and run it in production. The model may be a lightweight CPU-based network or a large transformer with billions of parameters that must handle hundreds of thousands of RPS with a 99th-percentile (Q99) latency of 30 ms. In addition, new services must be easy to deploy and come with multi-level caching, monitoring, model delivery and fine-tuning, and much more built in from the start.

We aim to ensure that:

* ML engineers of any level and from any part of Yandex can use this service in their project literally "at the click of a button."
* Research of new models and their delivery to experimentation and production are as simple and fast as possible.
* Requests are executed quickly, and CPU/GPU resources are utilized with maximum efficiency.

About the Team

Our team is a department of 10 people, currently scaling to meet ambitious goals. Part of the team works on core technology and is responsible for rolling out the service company-wide — ensuring the service works efficiently and conveniently for different teams. The other part of the team works on inference services in advertising, which involve massive loads (hundreds of thousands of RPS), significant hardware resources (hundreds of thousands of cores, hundreds of GPUs), and have a direct impact on revenue generation.

All team members are from top universities; many have graduated from the Yandex School of Data Analysis (ШАД) or are currently studying there. Most of the team works from the Moscow office, as we enjoy not only solving tasks but also being among like-minded and engaged people.

We like to go to a bar in the evening, play board games, or simply have pizza after a successful launch. We occasionally visit campus offices to discuss technology. If you enjoy working on complex, high-responsibility projects in the company of strong and passionate people, come join us :)

What Tasks Await You

Developing a Turnkey Inference Solution

We currently have the core part of the service implemented, but to make the solution truly convenient, we need to implement numerous ideas and developments such as dynamic load balancing, multi-level in-memory/disk/remote caching, and dynamic configurations. You will also be involved in developing tools for deploying the service in the cloud.

Assisting with Company-wide Solution Adoption

Across the company, no fewer than 20 teams operate ML models. To make the solution convenient for everyone and enable rapid experimentation, we need to stay in constant contact with our customers: implementing the features they require, such as new backends for running neural networks, and consulting on the deployment of new installations.

Benchmarking Against Global Analogs

To build a good, competitive solution, we must keep looking around and adopting best practices and ideas. To this end, we perform qualitative analysis of comparable solutions, both for inference code (Triton Inference Server, KServe) and for service deployment platforms (Seldon Core, Kubeflow). We also need to monitor inference trends and prepare the infrastructure in advance for new model sizes and types.

We Expect You To Have

At least two years of programming experience
Strong proficiency in C++ or willingness to learn it quickly
Knowledge of concurrency in C++ or on Linux

Will Be a Plus

Experience developing high-load services in C++
Experience deploying and operating services for ML Inference on CPU/GPU
Familiarity with Triton, TRT-LLM
Understanding of neural network architectures and a habit of keeping up with the latest developments in the field in your free time
Knowledge of Unix/Linux systems (process structure, file system, system calls, etc.)
