NLP Developer for the YandexGPT Pretrain Team

Our team works on pretraining YandexGPT, the first and most resource-intensive stage of training large language models (LLMs). We select data, run experiments, choose training methods, and train the models themselves. Our work underpins many Yandex services: it powers Alice and Neuro in Search, and is also used in Browser, Market, Advertising, and Translator. The quality of these products depends directly on our models.

One of the key characteristics of a pretrained model is its "intelligence." This means knowing the facts and concepts found in texts and being able to generalize information. We strive to make YandexGPT the most intelligent model on the market so that products based on our neural network are the best.

What tasks await you

Building a training corpus for the model

Modern LLMs require trillions of tokens. Assembling such datasets is non-trivial: from the trillions of documents on the internet, we have to select and process those that will benefit model training the most. A full update of the pretraining corpus allowed YandexGPT 5 Lite to reach parity with the global SOTA on a number of key benchmarks for pretrained models and to surpass it on many others. You can read more about YandexGPT 5 in an article on Habr.

You will go through the entire process of building a dataset for SOTA models: from training classifiers that find useful documents and searching for new data sources to processing the data and running experiments. Using advanced methods based on scaling laws, you will choose the dataset that affects the quality of all models at Yandex.

Creating the foundation for intelligent agents

Agents are the next step in AI evolution. The latest high-profile model releases, both commercial and open-source, place particular emphasis on building systems that can operate autonomously in a digital environment. The capabilities that will make future YandexGPT-based agents many times more powerful can be built in at the pretraining stage. You will determine exactly how to do this. The task involves exploring everything, from building the agent environment to determining the optimal training scheme.

Exploring new directions

Training LLMs is a rapidly developing field in which new research and competitor releases appear constantly. It is important to pick out from this stream the results most likely to help us achieve our goals. You will not just follow trends but be the first to test and implement the most promising ideas.

Learn more about ML at Yandex in the Yandex for ML channel.

We expect you to

Understand how LLMs work and have experience in NLP or other areas of DL
Know how to spot, in the stream of published papers and research, the results worth trying
Generate new ideas that lead to improved results
Be ready for the high pace of work required to compete with the world's leading AI players
