Knowledge of vector databases (e.g., Milvus, Qdrant, Pinecone) in RAG pipelines.
Responsibilities
Design and deploy GenAI workloads at scale, including LLMs and multimodal models, using orchestration and serving platforms such as Kubernetes, Ray, or KServe.
Own the development of the AIOps pipeline, including data ingestion, feature engineering, model training, validation, deployment, and monitoring.
Implement LLMOps practices, including prompt management, routing, guardrails, and fallback logic.
Size and allocate GPU/TPU resources for use cases such as RAG, chatbots, image/video generation, and LLM fine-tuning or inference.
Collaborate with research teams to automate LLM pipelines from training to deployment using MLflow, and manage large-scale training jobs with schedulers such as Kubeflow or Slurm.
Engineer highly available and scalable infrastructure across cloud and hybrid platforms (AWS, GCP, Azure).