Lead Inference Server Developer in the ML Infrastructure Department

Our team develops services that enable rapid deployment of neural model inference for production use. The model might be a lightweight network running on CPU or a large transformer with billions of parameters that must be served at hundreds of thousands of RPS with a 30 ms 99th-percentile (Q99) latency. It is also crucial that new services are easy to deploy and come with multi-level caching, monitoring, model delivery, retraining, and much more out of the box.

Our mission is to ensure that:

ML engineers of any level and from any part of Yandex can use our service in their project literally with the push of a button;
researching new models and deploying them for experiments and production is as simple and fast as possible;
requests are executed quickly, and CPU/GPU resources are utilized with maximum efficiency.

About the Team

Our team is a unit of 10 people that is currently scaling up to achieve ambitious goals. Some of our colleagues work on core technology and are responsible for company-wide adoption of the service—ensuring it works efficiently and conveniently for different types of teams. Another part works on inference services for advertising, where there are enormous loads (hundreds of thousands of RPS), vast hardware resources (hundreds of thousands of cores, hundreds of GPUs), and a direct impact on financial results.

All team members are from top universities; many have graduated from or are currently studying at the Yandex School of Data Analysis. Most of the team works from the Moscow office, as we enjoy not only solving problems but also being around like-minded, engaged people.

We like to go to a bar together in the evening, play board games, or just have pizza after a successful launch. We periodically visit campuses to discuss technology.

If you love working on complex, impactful projects in a company of strong, motivated people—come join us!

What Tasks Await You

Assisting with Company-wide Solution Adoption

There are at least 20 teams at Yandex operating ML models. To make the solution convenient for all of them and let them run experiments quickly, we need to interact with these customers constantly and implement the features they need, for example by adding new backends for neural network deployment or consulting on the setup of new installations.

Developing the Boxed Inference Solution

The core part of the service is ready, but to make the solution truly convenient, many ideas and improvements still need to be implemented: dynamic load balancing, multi-level caching (in-memory, disk, and remote), dynamic configs, and tools for deploying the service in the cloud.

Benchmarking Against Global Counterparts

To create a strong, competitive solution, we must always look around and adopt best practices and ideas. To this end, we analyze similar solutions, both for inference code (Triton Inference Server, KServe) and for service deployment in orchestration systems (Seldon Core, Kubeflow). We also need to keep an eye on inference trends and proactively prepare the infrastructure for new model sizes and types.

Integrating Inference Services with the ML Platform

The unified ML platform already supports training models and comparing model runs, but we want the entire model lifecycle, including deployment and operation, to be accessible from it. This requires designing inference usage scenarios for the ML platform and implementing the integration with it.

We Expect You To

Have at least five years of programming experience
Be proficient in C++ or be ready to get up to speed quickly
Have been a technical lead of a service, responsible for roadmap planning
Have expertise in concurrency in C++ or on Linux
Prioritize business value when making decisions

It Will Be a Plus If You

Have developed and architected high-load services in C++
Are familiar with the architecture of neural models and follow the latest developments in the field in your spare time
Have designed, deployed, and operated services for ML Inference on CPU/GPU
Have a solid understanding of Unix and Linux systems: process structure, file system, system calls, etc.
Have knowledge of Triton and TensorRT-LLM
