Astra Tech

Senior Site Reliability Engineer

  • Posted 25 days ago

Job Description

Role Summary

We are looking for a Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of botim's real-time communication and open platform infrastructure, supporting millions of active users globally. In this role, you will lead automation initiatives, operate and optimize large-scale Kubernetes clusters, and maintain highly available services across botim's cloud-native, microservices-based ecosystem.

You will work closely with platform, VoIP, and backend engineering teams to strengthen observability using Prometheus, improve CI/CD pipelines, implement Infrastructure as Code, and optimize cloud costs. This role is ideal for an experienced SRE who thrives in high-availability environments, enjoys solving complex production issues, and is passionate about building resilient systems that power real-time messaging and calling at scale.

Responsibilities

  • Automate routine operational tasks using Shell scripting, ensuring efficiency in log analysis, batch management, and system optimization.

  • Maintain and optimize middleware components supporting infrastructure operations, ensuring stability and performance.

  • Administer and optimize Kubernetes clusters, ensuring scalability, security, and performance.

  • Maintain and optimize monitoring and alerting systems based on Prometheus, ensuring high availability of services.

  • Contribute to the development of CI/CD pipelines.

  • Manage cloud resources efficiently, implementing cost optimization strategies to reduce cloud expenditure.

  • Improve operational processes, develop automation tools, troubleshoot incidents, and enhance system stability and reliability.

Requirements

  • Proficiency in Shell scripting for automating operational workflows and system management tasks.

  • Experience in Python or Go, preferably for system automation, tooling, or backend services.

  • At least 2 years of hands-on Kubernetes administration experience, including expertise in CSI, CNI, and managing clusters with 20+ nodes in production.

  • Experience with Prometheus for monitoring and alerting in an enterprise environment.

  • Familiarity with CI/CD deployment processes, with knowledge of GitOps principles. Hands-on experience with GitOps is a plus.

  • Experience managing cloud platforms using Infrastructure as Code (IaC) tools like Terraform/OpenTofu. Azure experience is a plus.

  • Strong problem-solving skills, a proactive approach to troubleshooting, and a commitment to improving operational efficiency and system reliability.

  • Bonus points: Experience managing large-scale distributed systems and microservices architectures, and a background in Site Reliability Engineering (SRE) best practices.

Job ID: 138830553