recrew ai

DevOps Engineer

  • Posted a day ago

Job Description

Role: Cluster DevOps / SRE Engineer

Function: DevOps / Site Reliability Engineering / Infrastructure

Location: Bangalore or Mumbai, India

Type: Full-time

Industry: Artificial Intelligence, Cloud Infrastructure, Telecommunications

About Company

The company is the dedicated AI research and innovation arm of a large-scale Indian telecom and technology conglomerate. It is building foundational AI technologies designed specifically for India's languages and digital economy.

Focus areas include multilingual speech recognition, voice synthesis, real-time conversational AI, and multimodal foundation models. The company collaborates with global AI leaders including OpenAI, Anthropic, Google, and Meta.

Its systems are built to serve hundreds of millions of users across India. It is investing across the full AI value chain, from next-generation data centers to consumer AI platforms and an agentic marketplace for SMEs.

Position Overview

The company is seeking a Cluster DevOps / SRE Engineer to design, manage, and optimize large-scale GPU/CPU compute clusters on GCP that power its AI research and model training workloads. The role sits at the heart of the company's compute backbone and owns the reliability, performance, and scalability of the infrastructure used to train and serve frontier AI models.

Role & Responsibilities

  • Design, deploy, and manage large-scale GKE (Google Kubernetes Engine) clusters for AI/ML training and inference workloads
  • Build and maintain CI/CD pipelines using Cloud Build and ArgoCD to automate model training, evaluation, and deployment workflows
  • Monitor cluster health, performance, and reliability using Google Cloud Monitoring and Grafana; drive incident response and root cause analysis (RCA)
  • Optimize GPU utilization across multi-node training jobs on GCP, managing resource scheduling and job queuing systems
  • Collaborate with AI researchers to provision GCP infrastructure for large-scale distributed training experiments
  • Implement and enforce security, IAM access control, and compliance standards across GCP compute infrastructure
  • Automate infrastructure provisioning and configuration management using Terraform on GCP

Must Have Criteria

  • 4–10 years of experience in DevOps, SRE, or infrastructure engineering roles
  • Strong hands-on experience managing GKE (Google Kubernetes Engine) clusters at scale (100+ nodes)
  • Proficiency in Python or Bash scripting for automation and infrastructure tooling
  • Experience with CI/CD pipelines using ArgoCD or Cloud Build on GCP
  • Hands-on experience with Infrastructure-as-Code using Terraform on GCP
  • Experience with observability and monitoring using Google Cloud Monitoring, Prometheus, and Grafana
  • Familiarity with GPU cluster management and with job schedulers or distributed computing frameworks such as Slurm or Ray on GCP

Nice to Have

  • Experience managing AI/ML infrastructure for large-scale model training (LLMs, speech models) on GCP
  • Prior work at an AI research lab, cloud provider, or high-scale tech company
  • Experience with GCP-native MLOps tooling such as Vertex AI Pipelines or Kubeflow on GKE
  • Google Cloud Professional certifications (Cloud DevOps Engineer or Cloud Architect)
  • Contributions to open-source infrastructure or DevOps tooling projects

What We Offer

  • Opportunity to build AI infrastructure that will serve hundreds of millions of users across India
  • Work alongside world-class AI researchers and engineers on frontier model development
  • Access to cutting-edge GPU compute infrastructure powered by GCP at scale
  • Competitive compensation with the backing and scale of a major Indian conglomerate
  • Fast-paced, innovation-first culture with direct impact on India's AI ecosystem

Job ID: 149194689
