Job Description

About AI Factory

The AI Factory is the product and technology arm of GovAI, responsible for:

  • Building and operating sovereign AI platforms, models, and services across Abu Dhabi Government entities
  • Scaling AI services from pilot to production
  • Strengthening AI Operations to ensure:
    • Reliability
    • Governance
    • High-quality support for AI-powered workloads

Role Overview

Employment Type: Full-time

Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)

The AI Support Engineer is:

  • The first line of operational support for AI platforms, AI models, APIs, and AI-enabled solutions
  • Focused on supporting:
    • AI workloads
    • Model consumption
    • RAG pipelines
    • AI-driven applications running in production
  • An AI-focused L1 operations role, aligned with AIOps practices (not a traditional IT helpdesk)
  • Responsible for:
    • Triaging AI-related incidents
    • Monitoring model behavior and API performance
    • Supporting AI service integrations
    • Escalating issues to engineering, vendors, or platform teams

Key Responsibilities

AI Model & API Operations Support

  • Support production AI model consumption (LLMs, embeddings, OCR, STT/TTS, inference APIs)
  • Troubleshoot inference failures, latency spikes, malformed payloads, and API errors
  • Diagnose authentication failures (OAuth, tokens, API keys, quota limits)
  • Validate request structures and integration configurations
  • Monitor token consumption trends and detect abnormal usage spikes
  • Support quota management and controlled usage increases

RAG & AI Pipeline Operations

  • Monitor RAG pipelines and retrieval workflows
  • Troubleshoot embedding generation failures and indexing issues
  • Identify ingestion failures affecting vector databases
  • Validate document connector and data pipeline integrity
  • Diagnose relevance or response degradation caused by configuration issues
  • Escalate data-layer or infrastructure-level issues to DevOps support

AI Governance & Guardrail Monitoring

  • Ensure AI service consumption complies with defined access controls and governance policies
  • Validate rate limiting, usage policies, and guardrail configurations
  • Detect abnormal usage patterns or policy violations
  • Support enforcement of entity-level quotas and access restrictions
  • Escalate governance breaches to appropriate stakeholders

Incident Triage & SLA Management

  • Act as first responder for AI-layer incidents (P0-P3)
  • Perform structured triage using logs, API traces, and monitoring dashboards
  • Classify incidents based on severity and business impact
  • Contain and mitigate AI service disruptions and coordinate with vendors when needed
  • Escalate complex issues to L2/L3 engineering with complete diagnostic context
  • Track incidents through their full lifecycle and ensure SLA adherence
  • Participate in Root Cause Analysis (RCA) for major AI service failures

Release Validation & Change Support

  • Perform smoke validation after AI model updates or API releases
  • Monitor regression risks following deployments
  • Identify post-release anomalies and escalate early
  • Support controlled rollout monitoring for new AI capabilities

Enterprise Integration & Connector Support

  • Support integrations with enterprise systems (Microsoft 365, SharePoint, Teams, Oracle, Jira, etc.)
  • Troubleshoot API integration failures, webhook errors, and data exchange issues
  • Validate secure connectivity and authentication configurations
  • Coordinate with DevOps support for infrastructure-related integration failures

Observability & Operational Monitoring

  • Monitor AI API performance metrics (latency, error rates, throughput)
  • Track token usage, consumption trends, and service availability
  • Identify recurring failure patterns and propose preventive actions
  • Maintain visibility dashboards for AI service health

Documentation & Knowledge Management

  • Maintain AI troubleshooting runbooks and support playbooks
  • Update known-issue repositories and FAQs
  • Document recurring AI API and RAG-related issues
  • Capture structured RCA documentation for major incidents
  • Contribute to operational documentation for new AI services
  • Log and manage support tickets in the ITSM/ticketing system

Required Technical Skills

  • Experience supporting REST APIs and API-based platforms
  • Understanding of LLM consumption patterns (RAG, embeddings, inference APIs)
  • Familiarity with authentication mechanisms (OAuth2, API keys, token-based access)
  • Ability to troubleshoot using logs, traces, and monitoring dashboards
  • Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
  • Basic understanding of cloud environments (Azure preferred)
  • Familiarity with enterprise system integrations
  • Understanding of rate limiting, quotas, and API governance

Experience

  • 3-8 years in:
    • AI platform support
    • API support
    • SaaS support
    • Application operations
  • Experience supporting AI/ML services or developer platforms preferred
  • Exposure to regulated or government environments advantageous
  • Experience working with external vendors and enterprise stakeholders
  • Arabic language skills are a plus

Job ID: 143927393