Experience: 5 years
Employment: Full-time
Work Format: Hybrid, Onsite
Grade: Middle
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation
C++ Inference Server Developer for the ML Infrastructure Department
Our team develops services that make it possible to deploy neural network inference quickly and run it in production. The model may be a lightweight CPU-based network or a large transformer with billions of parameters that must handle hundreds of thousands of RPS with a 99th-percentile latency of 30 ms. In addition, new services must be easy to deploy and come with multi-level caching, monitoring, model delivery, fine-tuning, and much more built in from the start.
We aim to ensure that:
* ML engineers of any level and from any part of Yandex can use this service in their project literally "at the click of a button."
* Research of new models and their delivery to experimentation and production are as simple and fast as possible.
* Requests are executed quickly, and CPU/GPU resources are utilized with maximum efficiency.
Our team is a department of 10 people, currently scaling to meet ambitious goals. Part of the team works on the core technology and is responsible for rolling out the service company-wide, making sure it works efficiently and conveniently for different teams. The other part of the team works on inference services in advertising. These services handle massive loads (hundreds of thousands of RPS), use significant hardware resources (hundreds of thousands of cores, hundreds of GPUs), and have a direct impact on revenue.
All team members come from top universities, and many have graduated from the Yandex School of Data Analysis (ШАД) or are currently studying there. Most of the team works from the Moscow office, as we enjoy not only solving tasks but also being among like-minded and engaged people.
We like to go to a bar in the evening, play board games, or simply have pizza after a successful launch. We occasionally visit campus offices to discuss technology. If you enjoy working on complex, high-responsibility projects in the company of strong and passionate people, come join us :)
Developing a Turnkey Inference Solution
The core of the service is already implemented, but making the solution truly convenient requires implementing many more ideas, such as dynamic load balancing, multi-level in-memory/disk/remote caching, and dynamic configurations. You will also be involved in developing tools for deploying the service in the cloud.
Assisting with Company-wide Solution Adoption
No fewer than 20 teams across the company operate ML models. To make the solution convenient for everyone and enable rapid experimentation, we need to interact with our customers constantly: implementing the features they require, such as new backends for running neural networks, and advising them on deploying new installations.
Benchmarking Against Global Analogs
To create a good and competitive solution, we must always look around and adopt best practices and ideas. To this end, we perform qualitative analysis of comparable solutions, both for inference code (Triton Inference Server, KServe) and for service deployment platforms (Seldon Core, Kubeflow). We also need to monitor inference trends and prepare the infrastructure in advance for new model sizes and types.