Description
We research and implement state-of-the-art methods for instruction-based image/video editing, audio integration into video generation, and quality enhancement via RLHF (Reinforcement Learning from Human Feedback).
Responsibilities
- developing and training diffusion models for instruction-based video and image editing
- researching architectures for joint video and synchronized audio generation from a text prompt
- building an RLHF pipeline for model fine-tuning: training multimodal reward models (video/audio/text) and integrating RL algorithms (PPO, DPO, GRPO) into the diffusion pipeline
- designing experiments and analyzing results
- collaborating closely with Distributed Engineers to implement ideas efficiently.
Requirements
- strong background in CV, generative models (Diffusion, GANs), multimodal ML
- experience working with diffusion models (Stable Diffusion/FLUX, Wan 2.X, etc.) and frameworks (Diffusers)
- practical knowledge of Reinforcement Learning, especially RLHF
- strong proficiency in PyTorch and experience with distributed training (DDP/FSDP)
- ability to rapidly prototype and research SOTA methods
- bonus: experience with audio generation (AudioLDM, MusicGen), publications at NeurIPS/ICML/CVPR
- hands-on experience with generative AI models; experience building AI agents and using them in day-to-day work is an advantage.
Benefits
- annual salary review, annual bonus
- corporate gym and recreation areas
- Sber's unique training system for professional development
- extended voluntary health insurance and discounted insurance for family members
- free SberPrime+ subscription, discounts on products from partner companies
- referral bonus for recommending friends to the Sber team
- corporate pension program.