Developer for the YT API Development Group

We are actively developing Yandex's ML infrastructure. Our task is to make it efficient and convenient for the thousands of ML engineers working at the company. One of the key systems they use is YT: both training runs and the data preparation for them happen on YT clusters. Yandex's supercomputers are connected to YT clusters and are actively used to train advanced models such as YandexGPT 3, Neuro, and others.

Training efficiently on thousands of GPUs requires convenient and reliable infrastructure. For example, training runs must be able to survive host failures. This means regularly writing checkpoints, which can reach tens of TB in size. It must also be possible to stream data efficiently from distributed storage: a large distributed training run can consume up to 100 GB/s. Any inefficiency or delay leaves expensive GPUs idle and underutilized.

We are looking for an engineer who will help us build a convenient infrastructure for reading from and writing to YT for use in ML training.

About the team

We are genuinely passionate about large distributed systems and complex technical tasks. Many of us have an academic background and still actively teach at MIPT, HSE, the Yandex School of Data Analysis (YSDA), and other universities. Several of us have won prizes in programming competitions.

The team maintains a startup spirit: we are on friendly terms both in and out of working hours, tackle tasks together, experiment, and take part in CTFs.

What tasks await you

Optimizing the read/write pipeline

Before reaching the GPU, data travels a long path. It is read from HDDs and SSDs on YT machines and encoded into a stable format (for example, JSON or Arrow) for transmission to the client. On the client side, it is decoded by C++ code and wrapped in Python objects for use in machine learning libraries.

You will need to understand this entire path, remove unnecessary conversions, switch to more efficient formats, and then, armed with a profiler, find bottlenecks and optimize them.

New algorithms for parallel reads

To fully utilize expensive GPUs, the training process usually reads data from different YT machines in multiple threads. Such reads currently put increased load on the YT master servers. You will need to design a new parallel read protocol that avoids this problem and implement it across all components: on the master, on the nodes where the data itself is stored, on the proxies that serve as the user's entry point, and in the user libraries themselves.

Writing a convenient library for working with YT from ML training code

ML engineers write their code using popular frameworks such as PyTorch or JAX. Our task is to provide tools that make working with YT from these libraries as simple and native as possible.

Read more about backend development at Yandex in the Yandex for Backend channel.

We expect that you

Have developed complex systems or libraries in C++
Love working on optimization tasks
Know Python and are willing to develop the Python part of our technology stack
Are ready to dive into the specifics of tasks that arise for ML engineers and create convenient tools for users
