Machine Learning Ops Engineer

Uplift People Consulting

Egypt

5-7 Years

Posted 16 hours ago

Job Description

We are looking for an MLOps Engineer to join a fast-growing, AI-driven technology company working on large-scale, production-grade machine learning systems, including LLMs, speech, and streaming models.

This role is fully remote and open to candidates based in Egypt or Syria, working closely with distributed teams across MENA and the US.

Key Responsibilities

Model Deployment & Serving

Deploy and host machine learning models, including LLMs (vLLM experience is mandatory), ASR, transcription, and streaming models
Serve models in production environments, ensuring high availability, scalability, and performance
Optimize throughput, GPU utilization, and resource efficiency
Manage and operate multiple models concurrently in production
Benchmark and profile model performance to ensure continuous optimization

Containerization & Orchestration

Build and manage containerized workloads using Docker and Docker Compose
Deploy and operate production systems on Kubernetes (EKS and on-prem using kubeadm)
Use Helm to deploy services, metrics, and supporting infrastructure
Implement continuous deployment workflows using Flux CD
Design auto-scaling and serverless serving solutions using KEDA, KServe, and Knative
Ensure network security, traffic routing, and load balancing within Kubernetes environments

Infrastructure & Cloud

Work extensively with AWS services (EKS, S3, Load Balancers, and core infrastructure components)
Manage infrastructure and Kubernetes clusters using Terraform
Configure and maintain GPU-enabled nodes, including NVIDIA drivers, Fabric Manager, and GPU exposure to containers
Handle certificate management and system-level security
Operate and troubleshoot Linux-based systems

Data, Messaging & Async Architectures

Design and manage asynchronous data pipelines using Kafka
Work with Redis and SQL-based data stores for caching and persistence

Monitoring, CI/CD & Observability

Implement monitoring and observability using Datadog
Maintain and extend internal services and pipelines using Python
Build and manage CI/CD pipelines with GitHub Actions
Use workflow orchestration tools such as Flyte for model fine-tuning and ML workflows
Continuously improve system reliability, observability, and performance metrics

Requirements