Description
We are looking for a mid-level ML Engineer/Researcher to join a research team developing omnimodal solutions in the audio domain, as part of a large-scale project to build next-generation artificial intelligence systems.
Responsibilities
- Researching technologies for working with different audio modalities: speech, noise, music, sound effects
- Integrating audio, text, and visual modalities into a unified architecture
- Working on multimodal reasoning and stream synchronization (audio–text–vision)
- Researching and implementing state-of-the-art approaches (end-to-end models, transformers, multimodal LLMs, diffusion models)
Requirements
- Excellent command of Python 3; experience with PyTorch, Bash, Git, Docker, DVC, and Hugging Face Transformers
- Good understanding of ASR, TTS, DSP, and ML for speech and audio processing
- Understanding of transformers, attention mechanisms, KV caching, and diffusion models
- Experience working with large audio datasets
- Understanding of MLOps practices: model monitoring, data drift, CI/CD
Nice to have:
- Experience in the speech or music domains, or with voice assistants
- Experience with diffusion and autoregressive architectures for audio/music
- Experience with streaming / real-time systems
- Knowledge of multimodal LLMs, VLMs, and audio LMs
- Publications or research background in relevant fields
Conditions
- Comfortable modern office near Kutuzovskaya metro station
- Hybrid work format
- Annual salary review; annual bonus starting at three monthly salaries
- Large gym and recreation areas
- Training system for professional and career development
- Extended voluntary health insurance from the first day of work, plus insurance for family members
- Employee mortgage program with a discount of 1/3 off the current rate
- Free SberPrime+ subscription, discounts on products from partner companies
- Referral bonus for recommending friends to the Sber team