  • Posted 3 hours ago

Job Description

About the Role:

We are looking for a Site Reliability Engineer (SRE) with solid experience running production systems and working closely with development teams. The ideal candidate is comfortable with Linux, containers, Kubernetes, and CI/CD pipelines, and has a strong focus on reliability, monitoring, and incident handling. You will help keep our services stable, observable, and scalable while collaborating with engineers across the stack.

Responsibilities:

  • Operate and maintain production systems with a focus on reliability, availability, and performance.
  • Work with Docker and Kubernetes to deploy, update, and troubleshoot services.
  • Configure and optimize Kubernetes resources (pods, deployments, services, ingress, config maps, secrets, etc.).
  • Implement and maintain monitoring, logging, and alerting for applications and infrastructure.
  • Build and improve CI/CD pipelines in collaboration with development and DevOps teams.
  • Create and maintain dashboards for key service metrics (latency, error rate, throughput, resource usage).
  • Participate in incident response: investigate issues, identify root cause, and propose fixes and improvements.
  • Work closely with backend developers to improve service reliability, resilience, and observability.
  • Contribute to capacity planning and performance tuning of services and infrastructure.
  • Automate repetitive operational tasks using scripts or small tools.
  • Document runbooks, procedures, and best practices for operating services in production.

Must-Have Qualifications:

  • 3–5 years of professional experience in an SRE, DevOps, or infrastructure-focused engineering role.
  • Strong understanding of Linux systems (shell, processes, networking, permissions, logs).
  • Hands-on experience with Docker and Kubernetes in real environments.
  • Practical experience with:
      o Kubernetes deployments, services, ingress, config maps, and secrets
      o Basic troubleshooting inside a cluster (pods failing, crashes, restarts, resource issues)
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK/EFK, Application Insights, or similar).
  • Experience with CI/CD pipelines (Azure DevOps, GitHub Actions, GitLab CI, Jenkins, or similar).
  • Ability to read and modify pipeline definitions and understand build → test → deploy flows.
  • Basic programming/scripting skills in at least one language (e.g., Python, Bash, PowerShell, Go, etc.).
  • Understanding of core reliability concepts such as SLIs, SLOs, uptime, latency, and availability.
  • Experience troubleshooting production issues using logs, metrics, and dashboards.
  • Good communication skills and ability to collaborate with developers, QA, and product teams.

Nice-to-Have:

  • Experience with at least one major cloud platform (Azure, AWS, Alibaba Cloud, or GCP).
  • Experience with infrastructure as code (Terraform, Bicep, Pulumi, Helm, etc.).
  • Experience with ingress controllers, API gateways, or service mesh.
  • Familiarity with security best practices (secrets management, TLS/certificates, RBAC on Kubernetes or cloud).
  • Experience participating in on-call rotations and using incident management tools (PagerDuty, Opsgenie, etc.).
  • Experience contributing to post-incident reviews and implementing follow-up improvements.

Experience:

3–5 years

Job ID: 145518281
