Description
We are the GigaChat ML team. We handle the full model training cycle, from pre-training to alignment. We need a manager who will take full-time, end-to-end ownership of the GigaChat quality metrics system:
- Measure quality on benchmarks and real logs.
- Identify weaknesses and the causes of degradations.
- Develop metrics and processes, and accelerate the adoption of new benchmarks and measurement systems.
This role is about fundamental model quality and how to measure it. It is not about product metrics such as DAU or retention, and data collection is not its primary focus.
Responsibilities
End-to-end ownership of the quality and metrics system
- Define and maintain a "quality scorecard" for GigaChat: what counts as quality, which metrics are key, and which are not.
- Keep the measurement system efficient and reduce measurement costs.
Benchmarks and regression testing
- Continuously update the benchmark suite for key scenarios to keep up with the rapidly evolving LLM field.
- Set up regular comparative testing of model versions and competitor models within a unified framework.
Log analytics and weakness diagnosis
- Analyze logs and user feedback from a quality perspective: problem clustering, thematic slices, frequency, severity.
- Link problems found in logs to benchmarks so that every problem becomes measurable.
Development and implementation of quality metrics
- Develop new metrics and proxy metrics (automatic and semi-automatic) and calibrate them against reference assessments.
- Determine where human evaluation is needed, where automation is sufficient, and how to reduce measurement costs without losing reliability.
- Integrate metrics into processes: CI/release checks, quality monitoring, alerts.
Experiments and decision making
- Design and analyze quality A/B experiments (online and/or in controlled tests) and draw conclusions: what improved or degraded, why, what to do next, and whether the change can be rolled out to production.
Requirements
- Strong Python (pandas, NumPy), solid data analytics skills, and the ability to quickly turn raw logs into insights.
- Good understanding of LLM quality evaluation: what types of metrics exist, where they break, how to validate a metric, and how to prevent metric gaming.
- Understanding of statistics and experiments: confidence intervals, statistical tests, multiple comparisons, A/B design, and interpretation of results.
- Practical experience working with LLMs (open-source and/or proprietary): understanding of instruction-following behavior, hallucinations, and safety constraints.
- A product and engineering mindset: the ability to formulate quality criteria so that they work as a management tool.
Nice to have
- Experience building evaluation frameworks and evaluation harnesses (internal or external tools) and integrating evaluations into CI/CD.
- Experience with LLM-based evaluation (LLM-as-a-judge) and with methods for calibrating judges and controlling their bias.
- Knowledge of analytics systems and data warehouses (SQL, ClickHouse/BigQuery/Spark/S3) and of monitoring and dashboard tools (Grafana, Superset, Looker, or equivalents).
Working conditions and benefits
- Remote within Russia.
- Option of employment through an accredited IT company.
- Annual performance bonus of up to six months' salary.
- Regular salary reviews.
- Corporate gym and recreation areas.
- Over 400 programs at SberUniversity for professional growth.
- Onboarding program and manager support at the start.
- The largest DS&AI community: over 600 DS specialists from the bank; regular sharing of knowledge, experience, and best practices; interactive lectures and master classes from leading universities and tech-company experts; digests of the latest DS&AI developments and reports from the world's largest conferences; regular internal meetups.
- Extended voluntary health insurance (VHI), discounted insurance for family members, and a corporate pension program.
- Discounted mortgage program for employees.
- SberPrime+ and partner discounts.
- Referral bonus for recommending candidates to the team.