Experience: 5 years
Employment: Full-time
Work Format: Hybrid, Onsite
Grade: Lead
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation
Lead Inference Server Developer in the ML Infrastructure Department
Our team develops services that enable rapid deployment of neural model inference for production use. The model may be a lightweight network running on CPU or a large transformer with billions of parameters that must be served at hundreds of thousands of RPS with a Q99 (99th-percentile) latency of 30 ms. It is equally important that new services are easy to deploy and come with multi-level caching, monitoring, model delivery and retraining, and much more out of the box.
Our mission is to ensure that:
We are a team of 10 people and are currently growing to meet ambitious goals. Some of us work on the core technology and are responsible for company-wide adoption of the service, making sure it works efficiently and conveniently for different types of teams. Others work on inference services for advertising, which involve enormous loads (hundreds of thousands of RPS), vast hardware resources (hundreds of thousands of cores, hundreds of GPUs), and a direct impact on financial results.
All team members come from top universities, and many have graduated from or are currently studying at the Yandex School of Data Analysis. Most of the team works from the Moscow office, as we enjoy not only solving problems but also being around like-minded, engaged people.
We like to go to a bar together in the evening, play board games, or just have pizza after a successful launch. We periodically visit campuses to discuss technology.
If you love working on complex, impactful projects in a company of strong, motivated people—come join us!
Assisting with Company-wide Solution Adoption
At least 20 teams at Yandex operate ML models. To make the solution convenient for all of them and let them run experiments quickly, we need to work with these customers constantly and implement the features they need, such as new backends for neural network deployment, as well as consult them on setting up new installations.
Developing the Boxed Inference Solution
The core part of the service is ready, but making the solution truly convenient requires many more ideas and improvements: dynamic load balancing, a multi-level cache (in-memory, disk, and remote), dynamic configs, and tools for deploying the service in the cloud.
Benchmarking Against Global Counterparts
To create a strong competitive solution, we must always look around and adopt best practices and ideas. To this end, we analyze similar solutions: both for inference code (Triton Inference Server, KServe) and for service deployment in orchestration systems (Seldon Core, Kubeflow). We also need to keep an eye on inference trends and proactively prepare the infrastructure for new model sizes and types.
Integrating Inference Services with the ML Platform
The unified ML platform already supports training and comparing model runs, but we want the entire model lifecycle, including deployment and operation, to be accessible from it. This requires designing scenarios for using inference through the ML platform and implementing the integration.