Senior AI Platform Resident Engineer

Astek

Saudi Arabia, Riyadh

5-7 Years

Posted 2 months ago

Job Description

Our client is seeking a Senior AI Platform Resident Engineer, L2/L3 Operations (VMs & OpenShift), to oversee the operation of AI platform components on virtual machines and OpenShift.

Role Overview:

In this critical role, you'll lead L2/L3 operational practices, close knowledge gaps, and ensure stable, secure, and observable deployments of services including model serving, vector search, messaging, and runtime operations.

Key Responsibilities:

AI & Vector Systems: Operate and support vLLM, LLM inference, Qdrant, Kafka, and Rasa on VMs and OpenShift, focusing on observability, security hardening, and performance optimization.
Messaging & Caching: Manage Kafka and Redis, including high-availability configuration, performance tuning, and backup/restore procedures.
Platform Operations: Deploy, manage, and enhance services across VM-based environments and OpenShift clusters while applying best practices for security and resource management.
Reliability & Observability: Establish metrics, logs, alerts, and monitoring dashboards; lead incident response and root cause analysis.
Knowledge Transfer: Identify L2 skill gaps and deliver structured training to ensure operational readiness.

Profile Requirements:

Advanced degree (MS/PhD) in Computer Science, AI, or a related field.
5+ years of experience operating distributed systems in production; 2+ years with VM-based environments and OpenShift/Kubernetes.
Strong proficiency in Linux, networking, observability, and security hardening.
Hands-on experience with Kafka, Qdrant, Rasa, or LLM inference frameworks.
Familiarity with CI/CD practices and the ability to standardize release processes.

Core Competencies: