Salla

Senior Site Reliability Engineer (SRE)

  • Posted 8 hours ago

Job Description

As a Senior SRE at Salla, you will lead reliability initiatives, handle complex incidents, improve platform performance, and guide engineering teams toward building resilient systems. You will also participate in the on-call rotation as part of our commitment to platform reliability.

Responsibilities

Reliability & Incident Management

  • Lead high-severity incident response and drive post-incident reviews
  • Troubleshoot complex issues across applications, infrastructure, and networks
  • Reduce MTTR through better monitoring, alerting, and diagnostic tooling
  • Participate in the on-call rotation supporting production systems

Performance & Scalability

  • Identify and resolve performance bottlenecks and scaling challenges
  • Conduct load testing and capacity planning for high-traffic scenarios

Infrastructure & Operations

  • Enhance cloud-native infrastructure, deployment processes, and automation
  • Improve resilience, fault-tolerance, and recovery mechanisms across systems

Observability

  • Build and refine dashboards, alerts, metrics, logs, and traces
  • Define SLIs/SLOs and improve visibility into system behavior

Tooling & Automation

  • Develop tools that reduce operational toil and increase reliability
  • Contribute to infrastructure-as-code, CI/CD pipelines, and GitOps workflows

Collaboration

  • Work closely with engineering teams to ensure services are robust and production-ready
  • Mentor engineers on reliability, debugging, and operational best practices

Required Skills

  • Strong experience with Kubernetes, service mesh technologies, and cloud platforms (AWS/GCP/Azure)
  • Deep understanding of Linux, networking, distributed systems, and load balancers
  • Hands-on experience with Terraform or similar IaC tools
  • Experience with Prometheus, Grafana, Loki, Mimir, Elastic, or similar observability tools
  • Proficiency in scripting/programming (Bash, Python, Go)
  • Experience with CI/CD and GitOps
  • Strong debugging, incident response, and performance analysis skills

Bonus Skills

  • Background in large-scale, high-traffic systems
  • Experience with fault-tolerant design, DR, and HA patterns
  • Familiarity with SLOs, SLIs, and error budgets

Location Preference

  • Candidates in time zones GMT+0 to GMT+6 are preferred, to align with team collaboration and on-call coverage

Job ID: 136406383