ML Developer for Inference Acceleration Team

YandexGPT is increasingly integrated into the company's services, where it solves a wide variety of tasks and delivers value to people. Each integration presents developers with its own challenges around model quality and speed. One thing stays constant across deployments, however: model inference in production is very expensive. Depending on audience and load, a service may require anywhere from tens to thousands of the latest GPUs. At this scale, saving even tens of percent of resources represents significant value.

You can read more about the general approach to inference acceleration, as well as the methods used, in the post on Habr "Accelerating LLM Inference".

We are looking for a research engineer who has experience reading and implementing research papers and is ready to experiment with inference acceleration methods and apply them to modern, rapidly evolving LLM architectures.

What tasks await you

Continuous analysis of research papers
First and foremost, you will need to study a body of papers on the topic in depth (more than 20 publications), systematize them, and identify the most promising ones.

Applying methods to YandexGPT
You will need to run many iterations of experiments to test hypotheses on YandexGPT, then move on to proposing and implementing new approaches. You will also need to confirm that the methods are applicable in practice by measuring both quality and speedup.

Developing universal tools
Finally, you will need to build a common solution that ML engineers across all of Yandex can reuse.

We expect you to

Have worked with modern LLMs and understand their architecture
Write in Python and have development experience with Torch
Have deep knowledge of NLP
Be familiar with the inference pipeline of generative models and know optimizations such as KV-caching
Understand how computations change when batch_size changes
Understand user requirements for model APIs: RPS, latency per token/sample, GPU VRAM, SM utilization
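As an illustration of the reasoning behind the batch_size and KV-caching items above, here is a minimal back-of-the-envelope sketch. The function names, the 2-byte (fp16) element size, and the example dimensions are illustrative assumptions, not YandexGPT parameters or part of any real library.

```python
def linear_arithmetic_intensity(batch_size, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for one decode step through a d_in x d_out linear layer.

    Hypothetical helper: a simple roofline-style estimate that ignores caches
    and kernel details.
    """
    # One multiply and one add per weight for every sample in the batch.
    flops = 2 * batch_size * d_in * d_out
    # Weights are read once per step regardless of batch size.
    weight_bytes = d_in * d_out * bytes_per_elem
    # Input and output activations scale with batch size.
    activation_bytes = batch_size * (d_in + d_out) * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Memory needed to keep keys and values for every cached token."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    for bs in (1, 8, 64):
        ai = linear_arithmetic_intensity(bs, d_in=4096, d_out=4096)
        print(f"batch_size={bs:>3}: ~{ai:.1f} FLOPs per byte")
    cache = kv_cache_bytes(batch_size=8, seq_len=1024, n_layers=32,
                           n_kv_heads=8, head_dim=128)
    print(f"KV cache: {cache / 2**20:.0f} MiB")
```

With a batch of one, the layer does roughly one FLOP per byte of weights read, so decoding is limited by memory bandwidth. Larger batches reuse the same weights across samples and raise the ratio, while the KV cache grows linearly with both batch size and sequence length and competes for the same GPU VRAM.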

Nice to have

Proficiency in C++ and familiarity with CUDA programming
