About AI Factory
The AI Factory is the product and technology arm of GovAI, responsible for:
- Building and operating sovereign AI platforms, models, and services across Abu Dhabi Government entities
- Scaling AI services from pilot to production
- Strengthening AI Operations to ensure:
  - Reliability
  - Governance
  - High-quality support for AI-powered workloads
Role Overview
Employment Type: Full-time
Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)
The AI Support Engineer Is
- The first line of operational support for AI platforms, AI models, APIs, and AI-enabled solutions
- Focused on supporting:
  - AI workloads
  - Model consumption
  - RAG pipelines
  - AI-driven applications running in production
- An AI-focused L1 operations role, aligned with AIOps practices (not a traditional IT helpdesk)
- Responsible for:
  - Triaging AI-related incidents
  - Monitoring model behavior and API performance
  - Supporting AI service integrations
  - Escalating issues to engineering, vendors, or platform teams
Key Responsibilities
AI Model & API Operations Support
- Support production AI model consumption (LLMs, embeddings, OCR, STT/TTS, inference APIs)
- Troubleshoot inference failures, latency spikes, malformed payloads, and API errors
- Diagnose authentication failures (OAuth, tokens, API keys, quota limits)
- Validate request structures and integration configurations
- Monitor token consumption trends and detect abnormal usage spikes
- Support quota management and controlled usage increases
RAG & AI Pipeline Operations
- Monitor RAG pipelines and retrieval workflows
- Troubleshoot embedding generation failures and indexing issues
- Identify ingestion failures affecting vector databases
- Validate document connector and data pipeline integrity
- Diagnose relevance or response degradation caused by configuration issues
- Escalate data-layer or infrastructure-level issues to DevOps support
AI Governance & Guardrail Monitoring
- Ensure AI service consumption complies with defined access controls and governance policies
- Validate rate limiting, usage policies, and guardrail configurations
- Detect abnormal usage patterns or policy violations
- Support enforcement of entity-level quotas and access restrictions
- Escalate governance breaches to appropriate stakeholders
Incident Triage & SLA Management
- Act as first responder for AI-layer incidents (P0–P3)
- Perform structured triage using logs, API traces, and monitoring dashboards
- Classify incidents based on severity and business impact
- Contain and mitigate AI service disruptions and coordinate with vendors when needed
- Escalate complex issues to L2/L3 engineering with complete diagnostic context
- Track incidents through full lifecycle and ensure SLA adherence
- Participate in Root Cause Analysis (RCA) for major AI service failures
Release Validation & Change Support
- Perform smoke validation after AI model updates or API releases
- Monitor regression risks following deployments
- Identify post-release anomalies and escalate early
- Support controlled rollout monitoring for new AI capabilities
Enterprise Integration & Connector Support
- Support integrations with enterprise systems (Microsoft 365, SharePoint, Teams, Oracle, Jira, etc.)
- Troubleshoot API integration failures, webhook errors, and data exchange issues
- Validate secure connectivity and authentication configurations
- Coordinate with DevOps support for infrastructure-related integration failures
Observability & Operational Monitoring
- Monitor AI API performance metrics (latency, error rates, throughput)
- Track token usage, consumption trends, and service availability
- Identify recurring failure patterns and propose preventive actions
- Maintain visibility dashboards for AI service health
Documentation & Knowledge Management
- Maintain AI troubleshooting runbooks and support playbooks
- Update known-issue repositories and FAQs
- Document recurring AI API and RAG-related issues
- Capture structured RCA documentation for major incidents
- Contribute to operational documentation for new AI services
- Manage incident and request tickets in the ITSM system
Required Technical Skills
- Experience supporting REST APIs and API-based platforms
- Understanding of LLM consumption patterns (RAG, embeddings, inference APIs)
- Familiarity with authentication mechanisms (OAuth2, API keys, token-based access)
- Ability to troubleshoot using logs, traces, and monitoring dashboards
- Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
- Basic understanding of cloud environments (Azure preferred)
- Familiarity with enterprise system integrations
- Understanding of rate limiting, quotas, and API governance
Experience
- 3–8 years in:
  - AI platform support
  - API support
  - SaaS support
  - Application operations
- Experience supporting AI/ML services or developer platforms preferred
- Exposure to regulated or government environments advantageous
- Experience working with external vendors and enterprise stakeholders
- Arabic language skills are a plus