Uplift People Consulting

Machine Learning Ops Engineer

5-7 Years
  • Posted 16 hours ago

Job Description

We are looking for an MLOps Engineer to join a fast-growing, AI-driven technology company working on large-scale, production-grade machine learning systems, including LLMs, speech, and streaming models.

This role is fully remote and open to candidates based in Egypt or Syria, working closely with distributed teams across MENA and the US.

Key Responsibilities

Model Deployment & Serving

  • Deploy and host machine learning models, including LLMs (vLLM experience is mandatory), ASR, transcription, and streaming models
  • Serve models in production environments, ensuring high availability, scalability, and performance
  • Optimize throughput, GPU utilization, and resource efficiency
  • Manage and operate multiple models concurrently in production
  • Benchmark and profile model performance to ensure continuous optimization

Containerization & Orchestration

  • Build and manage containerized workloads using Docker and Docker Compose
  • Deploy and operate production systems on Kubernetes (EKS and on-prem using kubeadm)
  • Use Helm to deploy services, metrics, and supporting infrastructure
  • Implement continuous deployment workflows using Flux CD
  • Design auto-scaling and serverless serving solutions using KEDA, KServe, and Knative
  • Ensure network security, traffic routing, and load balancing within Kubernetes environments

Infrastructure & Cloud

  • Work extensively with AWS services (EKS, S3, Load Balancers, and core infrastructure components)
  • Manage infrastructure and Kubernetes clusters using Terraform
  • Configure and maintain GPU-enabled nodes, including NVIDIA drivers, Fabric Manager, and GPU exposure to containers
  • Handle certificate management and system-level security
  • Operate and troubleshoot Linux-based systems

Data, Messaging & Async Architectures

  • Design and manage asynchronous data pipelines using Kafka
  • Work with Redis and SQL-based data stores for caching and persistence

Monitoring, CI/CD & Observability

  • Implement monitoring and observability using Datadog
  • Maintain and extend internal services and pipelines using Python
  • Build and manage CI/CD pipelines with GitHub Actions
  • Use workflow orchestration tools such as Flyte for model fine-tuning and ML workflows
  • Continuously improve system reliability, observability, and performance metrics

Requirements

  • 5-7 years of experience in MLOps, DevOps, or Platform Engineering roles
  • Strong hands-on experience deploying and operating ML models in production
  • Mandatory experience with vLLM
  • Deep knowledge of Kubernetes, Docker, and cloud-native architectures
  • Solid experience with AWS and Terraform
  • Hands-on experience with GPU-based workloads
  • Strong understanding of distributed systems and async architectures
  • Proficiency in Python
  • Fluent English (written and spoken)
  • Ability to work independently in a remote, distributed environment

Nice to Have

  • Experience with serverless ML workloads
  • Prior exposure to large-scale LLM platforms
  • Experience working with global or multi-time-zone teams

Location & Work Model

  • Fully remote
  • Candidates must be based in Egypt or Syria
  • Occasional schedule flexibility needed for cross-time-zone collaboration

More Info

Job ID: 139396837