About the Role:
We are looking for a Site Reliability Engineer (SRE) with solid experience running production systems and working closely with development teams. The ideal candidate is comfortable with Linux, containers, Kubernetes, and CI/CD pipelines, and has a strong focus on reliability, monitoring, and incident handling. You will help keep our services stable, observable, and scalable while collaborating with engineers across the stack.
Responsibilities:
- Operate and maintain production systems with a focus on reliability, availability, and performance.
- Work with Docker and Kubernetes to deploy, update, and troubleshoot services.
- Configure and optimize Kubernetes resources (pods, deployments, services, ingress, config maps, secrets, etc.).
- Implement and maintain monitoring, logging, and alerting for applications and infrastructure.
- Build and improve CI/CD pipelines in collaboration with development and DevOps teams.
- Create and maintain dashboards for key service metrics (latency, error rate, throughput, resource usage).
- Participate in incident response: investigate issues, identify root causes, and propose fixes and improvements.
- Work closely with backend developers to improve service reliability, resilience, and observability.
- Contribute to capacity planning and performance tuning of services and infrastructure.
- Automate repetitive operational tasks using scripts or small tools.
- Document runbooks, procedures, and best practices for operating services in production.
Must-Have Qualifications:
- 3–5 years of professional experience in an SRE, DevOps, or infrastructure-focused engineering role.
- Strong understanding of Linux systems (shell, processes, networking, permissions, logs).
- Hands-on experience with Docker and Kubernetes in real environments.
- Practical experience with:
  - Kubernetes deployments, services, ingress, config maps, and secrets
  - Basic troubleshooting inside a cluster (pods failing, crashes, restarts, resource issues)
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK/EFK, Application Insights, or similar).
- Experience with CI/CD pipelines (Azure DevOps, GitHub Actions, GitLab CI, Jenkins, or similar).
- Ability to read and modify pipeline definitions and understand build → test → deploy flows.
- Basic programming/scripting skills in at least one language (e.g., Python, Bash, PowerShell, or Go).
- Understanding of core reliability concepts such as SLIs, SLOs, uptime, latency, and availability.
- Experience troubleshooting production issues using logs, metrics, and dashboards.
- Good communication skills and ability to collaborate with developers, QA, and product teams.
Nice-to-Have:
- Experience with at least one major cloud platform (Azure, AWS, Alibaba Cloud, or GCP).
- Experience with infrastructure as code (Terraform, Bicep, Pulumi, Helm, etc.).
- Experience with ingress controllers, API gateways, or service mesh.
- Familiarity with security best practices (secrets management, TLS/certificates, RBAC on Kubernetes or cloud).
- Experience participating in on-call rotations and using incident management tools (PagerDuty, Opsgenie, etc.).
- Experience contributing to post-incident reviews and implementing follow-up improvements.
Experience:
3–5 years