Description
We are the GigaChat Pretrain Data team, and we prepare pretraining data for GigaChat and GigaChat Vision. Pretraining data is the foundation of any modern LLM and the factor that most strongly affects its final quality. We have over 40 PB of raw data, and our main task is to turn this chaos into the dataset that will be used to train the best LLM in Russia.
Responsibilities
- generate synthetic data: mathematics, code, arbitrary synthetic data seeded with documents from the Web
- research tokenization and its impact on model quality (possibly writing articles)
- solve clustering tasks for billions of documents
- research various factors inherent in text data
- generate vision data to improve the VLM
- develop new HTML parsing algorithms and research their impact on model quality
- research dependencies between pretrain data and the agentic capabilities of the final model
- develop stable infrastructure that supports running hundreds or thousands of experiments on the data.
Requirements
- at least two years of relevant commercial experience in NLP or in building data infrastructure.
Nice to have
- skills in working with generative AI models; experience in creating AI agents and using them in work will be an advantage
- experience using GigaChat, Kandinsky, and similar products
- hands-on proficiency with AI tools for analysis, generation, and automation
- a degree from ShAD, the HSE Faculty of Computer Science, or the MIPT School of Applied Mathematics and Computer Science, and/or experience with MapReduce systems, e.g., YT.
Conditions
- comfortable modern office near Kutuzovskaya metro station
- hybrid work format (2 days in the office, 3 days remotely)
- annual salary review, annual bonus
- corporate gym and recreation areas
- learning system for professional and career development
- extended voluntary health insurance from the first day of work, plus family insurance
- preferential mortgage program for employees
- free SberPrime+ subscription, discounts on products from partner companies
- referral bonus for recommending friends to the Sber team.