AI Support Engineer

NorthBay - Pakistan

Abu Dhabi, United Arab Emirates

3-8 Years

Posted a day ago

Job Description

About AI Factory

The AI Factory is the product and technology arm of GovAI, responsible for:

Building and operating sovereign AI platforms, models, and services across Abu Dhabi Government entities
Scaling AI services from pilot to production
Strengthening AI Operations to ensure:

Reliability
Governance
High-quality support for AI-powered workloads

Role Overview

Employment Type: Full-time

Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)

The AI Support Engineer is:

The first line of operational support for AI platforms, AI models, APIs, and AI-enabled solutions
Focused on supporting:

AI workloads
Model consumption
RAG pipelines
AI-driven applications running in production

An AI-focused L1 operations role, aligned with AIOps practices (not a traditional IT helpdesk)
Responsible for:

Triaging AI-related incidents
Monitoring model behavior and API performance
Supporting AI service integrations
Escalating issues to engineering, vendors, or platform teams

Key Responsibilities

AI Model & API Operations Support

Support production AI model consumption (LLMs, embeddings, OCR, STT/TTS, inference APIs)
Troubleshoot inference failures, latency spikes, malformed payloads, and API errors
Diagnose authentication failures (OAuth, tokens, API keys, quota limits)
Validate request structures and integration configurations
Monitor token consumption trends and detect abnormal usage spikes
Support quota management and controlled usage increases

RAG & AI Pipeline Operations

Monitor RAG pipelines and retrieval workflows
Troubleshoot embedding generation failures and indexing issues
Identify ingestion failures affecting vector databases
Validate document connector and data pipeline integrity
Diagnose relevance or response degradation caused by configuration issues
Escalate data-layer or infrastructure-level issues to DevOps support

AI Governance & Guardrail Monitoring

Ensure AI service consumption complies with defined access controls and governance policies
Validate rate limiting, usage policies, and guardrail configurations
Detect abnormal usage patterns or policy violations
Support enforcement of entity-level quotas and access restrictions
Escalate governance breaches to appropriate stakeholders

Incident Triage & SLA Management

Act as first responder for AI-layer incidents (P0-P3)
Perform structured triage using logs, API traces, and monitoring dashboards
Classify incidents based on severity and business impact
Contain and mitigate AI service disruptions and coordinate with vendors when needed
Escalate complex issues to L2/L3 engineering with complete diagnostic context
Track incidents through their full lifecycle and ensure SLA adherence
Participate in Root Cause Analysis (RCA) for major AI service failures

Release Validation & Change Support

Perform smoke validation after AI model updates or API releases
Monitor regression risks following deployments
Identify post-release anomalies and escalate early
Support controlled rollout monitoring for new AI capabilities

Enterprise Integration & Connector Support

Support integrations with enterprise systems (Microsoft 365, SharePoint, Teams, Oracle, Jira, etc.)
Troubleshoot API integration failures, webhook errors, and data exchange issues
Validate secure connectivity and authentication configurations
Coordinate with DevOps support for infrastructure-related integration failures

Observability & Operational Monitoring

Monitor AI API performance metrics (latency, error rates, throughput)
Track token usage, consumption trends, and service availability
Identify recurring failure patterns and propose preventive actions
Maintain visibility dashboards for AI service health

Documentation & Knowledge Management

Maintain AI troubleshooting runbooks and support playbooks
Update known-issue repositories and FAQs
Document recurring AI API and RAG-related issues
Capture structured RCA documentation for major incidents
Contribute to operational documentation for new AI services
Log and manage tickets in the ITSM/ticketing system

Required Technical Skills

Experience supporting REST APIs and API-based platforms
Understanding of LLM consumption patterns (RAG, embeddings, inference APIs)
Familiarity with authentication mechanisms (OAuth2, API keys, token-based access)
Ability to troubleshoot using logs, traces, and monitoring dashboards
Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
Basic understanding of cloud environments (Azure preferred)
Familiarity with enterprise system integrations
Understanding of rate limiting, quotas, and API governance

Experience