Developer for the YT API Development Group
We are actively developing ML infrastructure at Yandex. Our goal is to make that infrastructure efficient and convenient for the thousands of ML engineers working at the company. One of the key systems they use is YT: both training runs and the data preparation for them happen on YT clusters. Yandex's supercomputers are connected to YT clusters and are actively used to train advanced models such as YandexGPT 3, Neuro, and others.
Training efficiently on thousands of GPUs requires convenient and reliable infrastructure. For example, training runs must be able to survive host failures, which means regularly writing checkpoints that can reach tens of TB in size. Data must also be streamed efficiently from distributed storage: a large distributed training run can consume up to 100 GB/s. Any inefficiency or delay leads to downtime and underutilization of expensive GPUs.
We are looking for an engineer who will help us build a convenient infrastructure for reading from and writing to YT for use in ML training.
We are genuinely passionate about large distributed systems and complex technical problems. Many of us have an academic background and still actively teach at MIPT, HSE, the Yandex School of Data Analysis (YSDA), and other universities. Several team members have won prizes in competitive programming contests.
The team keeps a startup spirit: we stay on friendly terms both during and outside working hours, tackle tasks together, experiment, and take part in CTFs.
Tasks for optimizing the read/write pipeline
Before reaching the GPU, data travels a long path. It is read from HDDs and SSDs on YT machines and encoded into a stable format (for example, JSON or Arrow) for transmission to the client. On the client side, it is decoded by C++ code and wrapped in Python objects for use in machine learning libraries.
You will need to understand this entire path, remove unnecessary conversions, switch to more efficient formats, and then, armed with a profiler, find bottlenecks and optimize them.
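As a rough illustration of the kind of measurement this involves, here is a minimal sketch that times decoding of the same synthetic rows from JSON and from a simple columnar binary layout. It uses only the Python standard library and does not touch the YT client or Arrow; the row schema and all function names (`make_rows`, `encode_columnar`, and so on) are hypothetical, chosen only for this example.

```python
import array
import json
import time


def make_rows(n):
    """Build n synthetic rows; a stand-in for table data (hypothetical schema)."""
    return [{"id": i, "value": i * 0.5} for i in range(n)]


def encode_json(rows):
    """Row-oriented text encoding."""
    return json.dumps(rows).encode("utf-8")


def decode_json(payload):
    return json.loads(payload.decode("utf-8"))


def encode_columnar(rows):
    """Pack each column into a contiguous typed buffer (int64 ids, float64 values)."""
    ids = array.array("q", (r["id"] for r in rows))
    values = array.array("d", (r["value"] for r in rows))
    return ids.tobytes(), values.tobytes()


def decode_columnar(payload):
    """Rebuild typed arrays from raw bytes without per-row Python objects."""
    ids = array.array("q")
    ids.frombytes(payload[0])
    values = array.array("d")
    values.frombytes(payload[1])
    return ids, values


def timed(fn, arg):
    """Return (result, elapsed seconds) for a single call of fn(arg)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


rows = make_rows(100_000)
json_rows, json_seconds = timed(decode_json, encode_json(rows))
(ids, values), columnar_seconds = timed(decode_columnar, encode_columnar(rows))
print(f"JSON decode:     {json_seconds:.4f} s")
print(f"columnar decode: {columnar_seconds:.4f} s")
```

The absolute numbers depend on the machine and the data, so the sketch prints timings rather than asserting a particular speedup; in a real pipeline a profiler would be used to see where the time actually goes.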
New algorithms for parallel reads
To fully utilize expensive GPUs, the training process usually reads data from different YT machines in multiple threads. Such reads currently put increased load on YT master servers. You will need to develop a new parallel read protocol that is free of this problem and implement it across all components: on the master, on the nodes where the data itself is stored, on the proxies that serve as the user's entry point, and in the user libraries themselves.
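To make the client-side idea concrete, here is a minimal sketch of reading disjoint row ranges in several threads and reassembling them in order. It is not the YT protocol and does not address master load; `read_range` is a hypothetical stand-in that slices an in-memory list, and `split_ranges` and `parallel_read` are names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_rows, parts):
    """Split [0, total_rows) into at most `parts` contiguous, non-overlapping ranges."""
    base, extra = divmod(total_rows, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` ranges take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size))
        start += size
    return ranges


def read_range(table, start, end):
    """Hypothetical reader: a real client would fetch rows over the network."""
    return table[start:end]


def parallel_read(table, workers):
    """Read all ranges concurrently and concatenate the chunks in range order."""
    ranges = split_ranges(len(table), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so chunks come back in range order.
        chunks = pool.map(lambda r: read_range(table, *r), ranges)
        return [row for chunk in chunks for row in chunk]


table = list(range(10))
print(split_ranges(len(table), 3))       # [(0, 4), (4, 7), (7, 10)]
print(parallel_read(table, 3) == table)  # True
```

In a real system each range request may also involve metadata lookups, which is where the coordination cost mentioned above would show up; this sketch leaves that part out entirely.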
Writing a convenient library for working with YT from ML training code
ML engineers write their code using popular frameworks such as PyTorch or JAX. Our task is to provide tools that make working with YT as simple and native as possible from the perspective of these libraries.
More about backend development at Yandex: see the Yandex for Backend channel.
Experience: 3 years
Employment: Full-time
Work Format: Hybrid
Grade: Middle
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation