Reach out directly about this role
NLP Developer for the YandexGPT Pretrain Team
Our team works on the pretraining of YandexGPT – the first and most resource-intensive stage of training large language models (LLMs). We select data, run experiments, choose training methods, and train the models themselves. Our developments are the foundation for many Yandex services, such as Alice, Neuro in Search, and are also used in Browser, Market, Advertising, and Translator. The quality of these products directly depends on our models.
One of the key characteristics of pretrain models is their "intelligence." This implies knowledge of all facts and concepts from texts, as well as the ability to generalize information. We strive to make YandexGPT the most intelligent model on the market so that products based on our neural network are the best.
Building a training corpus for the model Modern LLMs require trillions of tokens. Assembling such datasets is a non-trivial task: from trillions of documents on the internet, it is necessary to select and process those that will bring maximum benefit to model training. A full update of the pretrain model corpus allowed YandexGPT 5 Lite to achieve parity with global SOTA on a number of key benchmarks for pretrain models, and surpass them on many others. You can read more about YandexGPT 5 in an article on Habr.
You will go through the entire process of building a dataset for SOTA models: from training classifiers to find useful documents and searching for new data sources to processing this data and conducting experiments. You will choose a dataset that affects the quality of all models at Yandex using advanced methods based on scaling laws.
Creating the foundation for intelligent agents Agents are the next step in AI evolution. The latest high-profile model releases, both commercial and open-source, separately focus on creating systems capable of operating autonomously in a digital environment. It is at the pretrain stage that the capabilities that will make future agents based on YandexGPT many times more powerful can be embedded. You will determine how exactly to do this. The task involves exploring everything: from building the agent environment to determining the optimal training scheme.
Exploring new directions Training LLMs is a rapidly developing field where new research and releases from competitors are constantly emerging. It is important to filter from this stream the results that have a high probability of helping us achieve our goals. You will not just follow trends, but be the first to test and implement the most promising ideas.
More about ML at Yandex – in the channel Yandex for ML
3-5 years
Experience
Full-time
Employment
Hybrid, Remote, Onsite
Work Format
Middle
Grade
Data Science & ML
Specialization
AI
Industry
Corporation
Company Type
Data Science & ML
Specialization
AI
Industry
Corporation
Company Type