Description
Kandinsky is a family of models for generating images and videos from text descriptions. Our team trains and develops the models, runs analytics, and builds metrics to evaluate their performance. We specialize in innovative solutions in artificial intelligence and neural networks: models that improve interaction between humans and AI, automate the analysis of large volumes of data, and handle image recognition and natural language processing, as well as creative tools for automatically generating high-quality visual content.
Responsibilities
- designing and developing ETL/ELT pipelines for processing image and video data, both within the Apache Airflow ecosystem and in the form of standalone Python scripts.
- automating data loading, preprocessing, and analysis: loading images and videos, processing the received data, identifying technical artifacts (e.g., black bars), and converting data into the required formats.
- designing and maintaining high-load pipelines capable of scaling to distributed processing.
- developing high-load processes for cutting, compressing, and converting large video files using optimized tools (ffmpeg, multiprocessing, async approaches).
- implementing mechanisms for tracking data state and history: keeping track of already processed files, scheduling reload tasks, and maintaining service tables.
- supporting the data platform: creating and optimizing DDL/DML scripts, configuring tables for analytical and operational loads.
- preparing datasets according to the requirements of internal and external customers, ensuring data quality and completeness.
- supporting CI/CD processes and standardizing the codebase in accordance with engineering practices and design patterns.
Requirements
- solid practical experience in developing ETL processes using Apache Airflow or similar orchestration systems.
- experience with S3 or compatible object storage systems, understanding of data-lake structure and organization principles.
- understanding of distributed data processing principles and of how PySpark works.
- strong Python development skills, including asynchronous tools, multiprocessing, and working with large files and media data.
- experience writing Bash scripts to automate routine processes.
- deep understanding of clean architecture design principles, design patterns, and building easily maintainable modular systems.
- experience with PostgreSQL and ClickHouse, skills in writing optimized queries and designing tables.
- experience with Docker and Kubernetes, understanding of data pipeline containerization.
Conditions
- the largest DS&AI community: over 600 DS specialists at the bank.
- digest of the latest developments in the DS&AI field and reports from the world's largest conferences.
- opportunity to co-author research papers and articles for international conferences.
- opportunity to choose a convenient work format: hybrid or office.
- annual salary review, annual bonus.
- corporate gym and relaxation areas.
- over 400 educational programs from SberUniversity for professional and career development.
- extended voluntary health insurance (VHI), discounted insurance for family members, and a corporate pension program.
- preferential mortgage terms for every employee, up to 7% more favorable.
- free SberPrime+ subscription, discounts on products from partner companies.
- referral bonus for recommending friends to join the Sber team.