On-Duty Engineer
We are looking for an on-duty engineer to monitor and support the service. This is a first-line position: you will work strictly according to runbooks, record everything that happens, and escalate non-standard situations to the Senior Operations Engineer. The most important qualities for this role are attentiveness, discipline, and the ability to describe a problem clearly.
Tasks
- Continuous monitoring of services and infrastructure
- Responding to alerts strictly according to runbooks
- Initial incident diagnosis: checking availability, logs, and service status
- Escalating to the Senior Operations Engineer when a situation falls outside the scope of the runbooks
- Maintaining an event and incident log; posting timely status updates
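To illustrate the kind of initial availability check listed above, here is a minimal Python sketch. It is not part of any runbook: the helper names `port_is_open` and `describe_check` and the 3-second default timeout are assumptions made for illustration, and in practice the runbook would specify which hosts, ports, and tools (such as nc or curl) to use.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def describe_check(host: str, port: int, timeout: float = 3.0) -> str:
    """Produce a one-line result suitable for pasting into an incident log."""
    state = "open" if port_is_open(host, port, timeout) else "unreachable"
    return f"{host}:{port} is {state}"
```

A successful TCP connection only shows that something is listening on the port; it does not prove that the service behind it is healthy, so a check like this is a first step before looking at logs and service status.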
Mandatory Requirements
- Linux command line: SSH, log navigation (journalctl, tail, grep), service management (systemctl), basic load and disk space diagnostics (top/htop, df, du)
- Basic networking: checking host and port availability (ping, curl, nc/telnet), understanding DNS, assessing whether a service is "up"
- Basic infrastructure: understanding the difference between a physical host and a VM; familiarity with out-of-band access (IPMI/BMC); basic navigation of a cloud console (instance status, metrics)
- Monitoring and dashboards: reading metrics and graphs (Grafana or equivalent); understanding what an alert means, its severity, and its thresholds; ability to distinguish a real incident from a false positive
- NGINX: reading configs, working with logs, restarting
- MySQL: basic read-only queries, checking replication, reading slow logs
- Docker / Docker Compose: container status, reading logs, restarting, basic reading of compose files
- Working with LLM assistants (Claude, Cursor, etc.): using them for diagnostics, finding solutions, and documentation
- English sufficient for reading technical documentation and alerts
- Ability to describe a problem in writing clearly and in a structured way
- At least 1 year of experience in a sysadmin, support, or similar role
Nice to Have
- Physical server administration: IPMI / iDRAC / iLO (remote reset, console access, hardware checks)
- Hypervisors: KVM / Proxmox / VMware or equivalent — VM lifecycle management
- Cloud platforms (GCP, AWS, Azure, Yandex Cloud): instances, disks, networks, metrics, and logs in the console
- On-call systems: PagerDuty, OpsGenie, or equivalent
- Understanding of Prometheus-style monitoring (probe, metric, alert rule)