LLM Infrastructure Developer

Large-scale LLM inference is a complex infrastructure challenge: GPUs operate at their limits, network delays occur, and hardware failures are possible. We build solutions to minimize the impact of these events on the availability and latency of our inference service.

What you will do

Optimizing inference engines: you will be responsible for improving efficiency and reducing latency during LLM inference on GPUs.

Developing diagnostic tools: you will create and enhance tools for quickly identifying and resolving infrastructure issues that affect inference stability and speed.

Research and implementation: you will work with inference optimization methods (quantization, pruning) and modern approaches to parallelization.

We expect you to

Be proficient in C++ and Python: have strong low-level programming and optimization skills
Have experience with GPUs (NVIDIA) and CUDA: understand GPU architecture, have developed or optimized algorithms using CUDA
Have a deep understanding of Transformer architecture: be familiar with internal mechanisms (attention, FFN, normalization) and their implementations
Know parallelization approaches: understand Data Parallel, Tensor Parallel, Pipeline Parallel (preferably also Expert Parallel) for distributed inference or training
Have an interest in LLMs and MLOps: understand the tasks and challenges associated with operating large models in production
Be able to work effectively in a team and share knowledge

It will be a plus if you

Have experience with modern inference optimization solutions: vLLM, TensorRT-LLM (TRT-LLM), or SGLang

Similar vacancies

C++ Developer (VLLM, SGlang, TensorRT)

LLM RL Training Infrastructure Developer

LLM Platform Engineer (ML Engineer)

ML Developer for Inference Acceleration Team

Senior LLM Developer for the Neuro Team

Senior DL/LLM Engineer (Pretrain/RL Efficiency)

LLM Engineer / Inference Engineer (Center for Applied AI)

DL Developer for the YandexGPT Architecture Research Team

C++ Developer for YandexGPT (Neuro)

Team Lead of DL Development for the International Direction Neuro (LLM)

ML Engineer LLM GigaChat

GPU Performance Engineer
