Description
We are looking for an experienced Site Reliability Engineer (SRE) to support and develop a distributed cloud infrastructure built on an OpenStack-like ecosystem and running on a custom RPM-based Linux distribution.
You will be responsible for the operational reliability of the platform, automation, observability, release processes, and incident investigation in the production environment.
Responsibilities
- operation and development of the production infrastructure of the cloud platform (control plane + compute/network/storage)
- defining and maintaining SLOs/SLIs, participating in incident response and postmortems (RCA)
- automation of operational tasks (deployment, updates, migrations, configuration audits)
- development and maintenance of infrastructure tools (scripts, services, operators, utilities)
- diagnosing complex issues in Linux/networks/storage/virtualization, reducing MTTR
- supporting observability: metrics, logs, traces, alerts, dashboards
- working with CI/CD and release processes: testing, canary deployments, rollbacks, version control
Requirements
- in-depth knowledge of Linux at the operations and diagnostics level: systemd, journalctl, cgroups, namespaces, network stack (iptables/nftables, routing, MTU, TCP/UDP), file systems
- containerization: Docker and/or Podman, working with registries, networking, volumes
- virtualization: QEMU/KVM, understanding of how it is managed via libvirt (CLI/API), bridge and overlay networking
- experience with CI/CD (Git, GitLab CI or similar), release automation
- experience with configuration management (Ansible or similar)
- basic experience with RPM build and package publishing systems (rpmbuild/mock/koji or similar)
- experience using GigaChat, Kandinsky, and similar AI tools in products; skills in building and using AI agents
Nice to have
- practical experience operating OpenStack (or its components/analogs)
- experience with Ceph (or other distributed storage systems)
- experience with Prometheus/Grafana/Alertmanager (or a similar stack)
- experience building centralized logging (Loki/ELK/OpenSearch)
- understanding of service architectures: REST/RPC, message-bus patterns (RabbitMQ/Kafka)
- hardening experience and a basic security mindset (TLS, secrets, access policies)
- experience supporting a custom Linux distribution and internal repositories
What we offer
- working with a large modular cloud infrastructure and real production tasks
- opportunity to influence the operational architecture, release process, and platform reliability
- technically challenging tasks at the intersection of Linux, virtualization, networks, and distributed systems
- annual bonus and yearly salary review
- accredited IT company status with all the associated benefits
- extended health insurance from day one and insurance for family members on preferential terms
- Sber Corporate University, internal educational platform, participation in IT conferences
- mortgage on preferential terms from Sber, SberPrime+ subscription, discounts from partners and services of the group of companies