Site Reliability Engineer
The Infrastructure Platform team provides internal tools and cloud services to all product teams in the company, giving them a scalable and reliable foundation for development. We don't just "fix alerts": we design the platform, automate processes, and directly shape engineering practices across the company.
Responsibilities within the team are split into four focus areas: cloud resources, Kubernetes, databases, and monitoring. We are currently looking for a colleague to own cloud resources.
Technology Stack and Processes
- Core Technologies: Kubernetes, Terraform, Prometheus, Grafana Stack (Mimir, Alloy, Loki, Tempo, Pyroscope), GitOps/PaaS/IDP (Internal Developer Platform), AI-driven tooling.
- Cloud Platforms: Azure and Yandex Cloud; SberCloud will be added in the future.
- Development: Writing clean code (Go, Bash, Python) is a mandatory part of the job; this goes beyond writing scripts.
- Processes: Agile methodology, weekly team demos and meetups. Mandatory participation in the engineering on-call rotation roughly once every 1.5 months (a one-week shift as primary incident responder). Night shifts are compensated with additional pay and a day off afterward.
We Expect
- 5+ years of commercial experience as an SRE/DevOps engineer in product companies.
- Ability to see the big picture, analyze complex distributed systems, and design reliable solutions.
- Ability to independently set goals, make decisions, and bring tasks to completion.
- Ability to clearly explain technical concepts, engage in dialogue with different teams, and make a reasoned case for your point of view.
- Practical experience with Kubernetes, CI/CD, monitoring principles, and infrastructure as code.
- Willingness and ability to write quality code for tools and automation, rather than only using ready-made configurations.
Your Responsibilities Will Include
- Operational work (30-40% of time): taking part in on-call duty, resolving incidents promptly, and advising and supporting other teams on infrastructure issues.
- Engineering and platform tasks (60-70% of time): developing and maintaining tools for managing cloud infrastructure, automating routine processes, and improving observability (monitoring, logging, tracing) and service reliability.
- Communication: active interaction with development, security, IT, and other departments to design and implement solutions.
We Offer
- A corporate culture where people solve complex problems, make their own decisions, and take responsibility for them.
- Conditions that let you focus on your work: a salary that matches the level of responsibility and healthcare support (full reimbursement of voluntary health insurance (VHI) with dental coverage from day one, reimbursement for sessions with specialists on the Alter platform, sick leave paid at up to 100% for 7 days a year, and international travel insurance).
- Perks from partner companies: co-funded English lessons with Skyeng and access to the Best Benefits website.
- External training at the company's expense: specialized conferences and courses.
- Fast professional growth: we constantly take on tasks that no one has tackled before us, and we don't plan to stop.
- A large team of like-minded people.
Stages
- Interview with a recruiter
- Technical interview (assessment of hard skills, possibly a test task)
- Product section (if necessary)
- Final interview
- Offer