Role Overview
We are seeking a Senior DevOps / Site Reliability Engineer (SRE) to design, build, and maintain scalable infrastructure systems. The ideal candidate has deep expertise in Linux administration, container orchestration, CI/CD, and modern DevOps practices, with the ability to mentor junior team members and drive automation across environments.
Key Responsibilities
- Lead the design, deployment, and administration of Linux-based infrastructure.
- Architect, maintain, and optimize CI/CD pipelines for development and production workloads.
- Build and manage containerized workloads using Docker and Kubernetes (including HA setups, storage, and networking).
- Troubleshoot complex system, networking, and DNS-related issues across distributed systems.
- Implement monitoring, logging, and alerting solutions to ensure system reliability and performance.
- Automate operational tasks using scripting and Infrastructure as Code (Terraform, Ansible, etc.).
- Collaborate with development and security teams to ensure systems follow best practices for reliability and compliance.
- Mentor and guide junior engineers in Linux administration, DevOps, and automation best practices.
Required Skills & Knowledge
- 6+ years of experience in Linux systems engineering/DevOps.
- Strong expertise in Linux administration (performance tuning, kernel-level debugging, storage systems).
- Advanced understanding of networking, DNS, load balancing, and firewalls.
- Proven experience managing Docker and Kubernetes clusters (including upgrades, scaling, and troubleshooting).
- Hands-on experience with CI/CD tools (Jenkins, GitLab CI, Argo CD, etc.).
- Strong automation skills (Bash, Python, Ansible, Terraform).
- Knowledge of security best practices in systems, containers, and networks.
- Ability to design resilient, highly available infrastructure systems.
- Deep expertise in administration, tuning, backup, recovery, clustering, and replica sets for relational and document databases (MySQL, PostgreSQL, and MongoDB).