Astek Middle East

AI/ML Ops Engineer

This job is no longer accepting applications

  • Posted 5 months ago

Job Description

Required Qualifications

  • 6+ years of experience in AIOps, SRE, or infrastructure engineering, including at least 2 years working on AI/ML/GenAI workloads.
  • Proven experience building and deploying GenAI or LLM-based solutions at scale.
  • Deep expertise in container orchestration (Kubernetes, KubeRay), GPU provisioning, and cloud-native infrastructure.
  • Strong programming skills in Python or Go, and experience working with ML workflows.
  • Experience deploying open-source LLMs (e.g., LLaMA, Qwen) or custom fine-tuned models.
  • Knowledge of vector databases (e.g., Milvus, Qdrant, Pinecone) and their use in RAG pipelines.

Responsibilities

  • Design and deploy GenAI workloads at scale, including LLMs and multimodal models, using orchestration and serving platforms such as Kubernetes, Ray, or KServe.
  • Own the development of the AIOps pipeline, including data ingestion, feature engineering, model training, validation, deployment, and monitoring.
  • Implement LLMOps practices, including prompt management, routing, guardrails, and fallback logic.
  • Size and allocate GPU/TPU resources for use cases such as RAG, chatbots, image/video generation, and LLM fine-tuning or inference.
  • Collaborate with research teams to automate LLM pipelines from training to deployment using MLflow, and manage large-scale training jobs using platforms and schedulers such as Kubeflow or Slurm.
  • Engineer highly available and scalable infrastructure across cloud and hybrid platforms (AWS, GCP, Azure).

Good to Have

  • Experience with LLM pretraining and fine-tuning.

Job ID: 129809137