As a Senior SRE at Salla, you will lead reliability initiatives, handle complex incidents, improve platform performance, and guide engineering teams toward building resilient systems. You will also participate in the on-call rotation as part of our commitment to platform reliability.
Responsibilities
Reliability & Incident Management
- Lead high-severity incident response and drive post-incident reviews
- Troubleshoot complex issues across applications, infrastructure, and networks
- Improve MTTR through better monitoring, alerts, and diagnostic tooling
- Participate in the on-call rotation supporting production systems
Performance & Scalability
- Identify and resolve performance bottlenecks and scaling challenges
- Conduct load testing and capacity planning for high-traffic scenarios
Infrastructure & Operations
- Enhance cloud-native infrastructure, deployment processes, and automation
- Improve resilience, fault tolerance, and recovery mechanisms across systems
Observability
- Build and refine dashboards, alerts, metrics, logs, and traces
- Define SLIs/SLOs and improve visibility into system behavior
Tooling & Automation
- Develop tools that reduce operational toil and increase reliability
- Contribute to infrastructure-as-code, CI/CD pipelines, and GitOps workflows
Collaboration
- Work closely with engineering teams to ensure services are robust and production-ready
- Mentor engineers on reliability, debugging, and operational best practices
Required Skills
- Strong experience with Kubernetes, service mesh technologies, and cloud platforms (AWS/GCP/Azure)
- Deep understanding of Linux, networking, distributed systems, and load balancers
- Hands-on experience with Terraform or similar IaC tools
- Experience with Prometheus, Grafana, Loki, Mimir, Elastic, or similar observability tools
- Proficiency in scripting/programming (Bash, Python, Go)
- Experience with CI/CD and GitOps
- Strong debugging, incident response, and performance analysis skills
Bonus Skills
- Background in large-scale, high-traffic systems
- Experience with fault-tolerant design, DR, and HA patterns
- Familiarity with SLOs, SLIs, and error budgets
Location Preference
- Candidates located in time zones from GMT+0 to GMT+6 are preferred, to align with team collaboration and on-call coverage