
Role Summary
We are looking for a Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of botim's real-time communication and open platform infrastructure, supporting millions of active users globally. In this role, you will lead automation initiatives, operate and optimize large-scale Kubernetes clusters, and maintain highly available services across botim's cloud-native, microservices-based ecosystem.
You will work closely with platform, VoIP, and backend engineering teams to strengthen observability using Prometheus, improve CI/CD pipelines, implement Infrastructure as Code, and optimize cloud costs. This role is ideal for an experienced SRE who thrives in high-availability environments, enjoys solving complex production issues, and is passionate about building resilient systems that power real-time messaging and calling at scale.
Responsibilities
Automate routine operational tasks using Shell scripting, ensuring efficiency in log analysis, batch management, and system optimization.
Maintain and optimize middleware components supporting infrastructure operations, ensuring stability and performance.
Administer and optimize Kubernetes clusters, ensuring scalability, security, and performance.
Maintain and optimize monitoring and alerting systems based on Prometheus, ensuring high availability of services.
Contribute to the development of CI/CD pipelines.
Manage cloud resources efficiently, implementing cost optimization strategies to reduce cloud expenditure.
Improve operational processes, develop automation tools, troubleshoot incidents, and enhance system stability and reliability.
Requirements
Proficiency in Shell scripting for automating operational workflows and system management tasks.
Experience in Python or Go, preferably for system automation, tooling, or backend services.
At least 2 years of hands-on Kubernetes administration experience, including expertise in CSI and CNI and experience managing production clusters with 20+ nodes.
Experience with Prometheus for monitoring and alerting in an enterprise environment.
Familiarity with CI/CD deployment processes and knowledge of GitOps principles. Hands-on GitOps experience is a plus.
Experience managing cloud platforms using Infrastructure as Code (IaC) tools like Terraform/OpenTofu. Azure experience is a plus.
Strong problem-solving skills, a proactive approach to troubleshooting, and a commitment to improving operational efficiency and system reliability.
Bonus Points: Experience managing large-scale distributed systems and microservices architectures. Background in Site Reliability Engineering (SRE) best practices.
Job ID: 138830553