About the role
We are looking for a technical leader to manage a team of six L1 engineers and a multi-server architecture with a large fleet of equipment and interconnected services, covering the full cycle from deployment to monitoring and incident resolution.
You will personally handle complex, rare, and non-standard incidents. The strategic mission is broader, though: to build a system in which typical problems are resolved at the L1 level by following the runbook, without your intervention. You will get there through documentation, mentoring, and continuous process improvement.
You will maintain a complete picture of the architecture: understand the dependencies between services, participate in releases, and assess the risks of changes.
Tasks
- Handling complex incidents that go beyond the runbook (Manual Cases)
- Writing, updating, and reviewing runbooks for the L1 team
- Mentoring: assisting on-duty engineers in incident resolution and skill development
- Participating in Change & Release processes: risk assessment, deployment support
- Maintaining and updating the Service List: describing services, dependencies, criticality
- Preparing Root Cause Analysis for significant incidents
- Working with Development and Product teams during escalations
Mandatory requirements
- Linux — deep knowledge: network stack, performance diagnostics, system tuning
- Docker / Docker Compose — confident configuration, debugging, optimization
- NGINX, HAProxy — configuration, load balancing, SSL/TLS, upstream management
- MySQL — replication, cluster configurations, backup/restore, query and schema optimization
- Redis — architecture, diagnostics, failover and persistence configuration
- RabbitMQ — understanding of queue models, diagnostics, recovery from failures
- Memcached — configuration, diagnostics, load optimization
- ClickHouse — basic operation, diagnostics, reading query profiles
- PHP — understanding at the operational level: interpreter, configuration (php-fpm, php.ini), logs, basic debugging
- Monitoring and alerting — configuration of Nagios (NRPE/NCPA), Loki, Sentry; writing checks and alerting rules
- Git / GitLab / SVN — understanding of VCS, working with pipelines, participation in the release process
- RAID — understanding of Software RAID and Hardware RAID, diagnostics of array degradation
- LLM assistants (Claude, Cursor, etc.) — confident use for analyzing complex problems, writing runbooks, automating documentation
- Experience in writing technical documentation and runbooks
- English sufficient for reading technical documentation and alerts
- 5+ years of experience as a system administrator or DevOps engineer