

Role: Cluster DevOps / SRE Engineer
Function: DevOps / Site Reliability Engineering / Infrastructure
Location: Bangalore or Mumbai, India
Type: Full-time
Industry: Artificial Intelligence, Cloud Infrastructure, Telecommunications
About Company
The company is the dedicated AI research and innovation arm of a large-scale Indian telecom and technology conglomerate. It is building foundational AI technologies designed specifically for India's languages and digital economy.
Focus areas include multilingual speech recognition, voice synthesis, real-time conversational AI, and multimodal foundation models. The company collaborates with global AI leaders including OpenAI, Anthropic, Google, and Meta.
Its systems are built to serve hundreds of millions of users across India. It is investing across the full AI value chain — from next-generation data centers to consumer AI platforms and an agentic marketplace for SMEs.
Position Overview
The company is seeking a Cluster DevOps / SRE Engineer to design, manage, and optimize large-scale GPU/CPU compute clusters on GCP that power its AI research and model training workloads. The role owns the reliability, performance, and scalability of infrastructure critical to training and serving frontier AI models, sitting at the heart of the company's compute backbone.
Role & Responsibilities
Must Have Criteria
Nice to Have
What We Offer
Job ID: 149194689
Skills:
PowerShell, Prometheus, Elk Stack, Bash, Grafana, Jenkins, ARM templates, Terraform, Docker, Helm, Kubernetes, Python, Azure DevOps, Go, CRI-O, containerd, GitHub Actions, Azure Monitor