Our client is seeking a Senior AI Platform Resident Engineer L2/L3 Operations (VMs & OpenShift) to ensure the operational excellence of AI platform components running on virtual machines and OpenShift.
Role Overview:
In this critical role, you'll lead L2/L3 operational practices, close knowledge gaps, and ensure stable, secure, and observable deployments across services including model serving, vector search, messaging, and runtime operations.
Key Responsibilities:
- AI & Vector Systems: Operate and support vLLM, LLM inference, Qdrant, Kafka, and Rasa on VMs and OpenShift, focusing on observability, security hardening, and performance optimization.
- Messaging & Caching: Manage Kafka and Redis, including high-availability configuration, performance tuning, and backup/restore procedures.
- Platform Operations: Deploy, manage, and enhance services across VM-based environments and OpenShift clusters while applying best practices for security and resource management.
- Reliability & Observability: Establish metrics, logs, alerts, and monitoring dashboards; lead incident response and root cause analysis.
- Knowledge Transfer: Identify L2 skill gaps and deliver structured training to ensure operational readiness.
Profile Requirements:
- Advanced degree (MS/PhD) in Computer Science, AI, or a related field.
- 5+ years of experience operating distributed systems in production; 2+ years with VM-based environments and OpenShift/Kubernetes.
- Strong proficiency in Linux, networking, observability, and security hardening.
- Hands-on experience with Kafka, Qdrant, Rasa, or LLM inference frameworks.
- Familiarity with CI/CD practices and the ability to standardize release processes.
Core Competencies:
- Excellent problem-solving and critical-thinking skills.
- Strong communication and collaboration abilities.
- Ability to lead and mentor junior team members effectively.