
We are seeking a highly skilled Lead Data Engineer to architect and build scalable, cloud-native data and AI pipelines that power enterprise LLM, RAG, and retrieval systems.
Key Responsibilities
Design, build, and optimize scalable data pipelines for AI/LLM workloads, including vectorization and embedding processing.
Develop and maintain ETL/ELT workflows for structured, unstructured, and streaming data.
Create and manage vector database indexing and similarity search pipelines using tools such as FAISS, Pinecone, Weaviate, Qdrant, or Chroma (a minimal similarity-search sketch follows this list).
Build retrieval systems for RAG, semantic search, and enterprise knowledge retrieval.
Develop robust, reusable data orchestration pipelines using Airflow, Spark, or similar tools.
Architect and manage data pipelines across Azure (primary), AWS, and GCP environments.
Integrate and optimize storage and processing across SQL, NoSQL, and vector databases.
Contribute to the design and implementation of event-driven architectures.
Collaborate with AI teams to enable embedding generation, LLM integration, and model-serving pipelines.
Ensure end-to-end data quality, monitoring, reliability, and observability.
Lead or participate in system design for large-scale, distributed data and AI systems.
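
To illustrate the kind of similarity-search pipeline described above, here is a minimal sketch using FAISS. It assumes the faiss-cpu and numpy packages are installed; the corpus, embedding dimension, and query vector are random placeholders standing in for a real embedding pipeline.

import numpy as np
import faiss

dim = 384  # placeholder: must match the embedding model's output dimension
index = faiss.IndexFlatIP(dim)  # exact inner-product index; cosine similarity on normalized vectors

corpus = np.random.rand(1000, dim).astype("float32")  # stand-in for real document embeddings
faiss.normalize_L2(corpus)  # normalize in place so inner product equals cosine similarity
index.add(corpus)

query = np.random.rand(1, dim).astype("float32")  # stand-in for an embedded user query
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest neighbors
print(ids[0], scores[0])

At scale, the exact IndexFlatIP would typically be swapped for an approximate index such as IndexHNSWFlat or IndexIVFFlat once the corpus grows into the millions of vectors.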
Required Skills
Programming & Data
Strong expertise in Python for data processing, APIs, automation, or distributed workloads.
Strong proficiency in SQL and knowledge of NoSQL databases (MongoDB, DynamoDB, Cosmos DB, etc.).
Experience with vector databases such as FAISS, Pinecone, Weaviate, Qdrant, or Chroma.
Strong knowledge of data modeling, pipeline development, and ETL/ELT frameworks.
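
As a concrete (if deliberately tiny) illustration of the ETL pattern above, the following sketch extracts rows from a CSV, applies a light transform, and loads them into SQLite using only the Python standard library; the file paths and column names (id, kind, ts) are hypothetical.

import csv
import sqlite3

def load_events(csv_path: str, db_path: str) -> None:
    # Extract rows from a CSV, normalize one column, and load into SQLite.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id TEXT PRIMARY KEY, kind TEXT, ts TEXT)"
    )
    with open(csv_path, newline="") as f:
        rows = [
            (r["id"], r["kind"].strip().lower(), r["ts"])  # transform: normalize kind
            for r in csv.DictReader(f)
        ]
    conn.executemany("INSERT OR REPLACE INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

Real pipelines would target Spark or a warehouse rather than SQLite, but the extract-transform-load shape is the same.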
AI/LLM Infrastructure
Solid understanding of vectorization, embeddings, and similarity search techniques.
Familiarity with LLMs, embedding models, and RAG pipeline concepts.
Experience integrating embedding-generation pipelines via Hugging Face, OpenAI, or other model providers.
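
For embedding generation, a minimal sketch with the sentence-transformers library (one common route onto Hugging Face models) might look like the following; the model name and documents are illustrative choices, not a prescribed stack.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model
docs = [
    "Quarterly revenue grew 12%.",        # placeholder enterprise documents
    "The data platform runs on Azure.",
]
embeddings = model.encode(docs, normalize_embeddings=True)  # L2-normalized for cosine search
print(embeddings.shape)  # (2, 384) for this particular model

The same vectors would then feed a FAISS, Pinecone, Weaviate, Qdrant, or Chroma index like the one sketched earlier.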
Cloud & Distributed Systems
Proficiency with Azure (primary) and familiarity with AWS and GCP.
Experience with Docker and containerized development.
Understanding of Kubernetes is a strong plus.
Orchestration & Big Data
Expertise in Apache Airflow for scheduling and orchestration (a minimal DAG sketch follows this list).
Experience with Apache Spark or equivalent distributed processing frameworks.
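
A minimal Airflow DAG of the kind implied above might look like this sketch; the DAG id, schedule, and task bodies are hypothetical placeholders, and the schedule keyword assumes Airflow 2.4 or later.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull source data")  # placeholder for a real extract step

def embed():
    print("generate embeddings")  # placeholder for a real embedding step

with DAG(
    dag_id="embedding_refresh",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    embed_task = PythonOperator(task_id="embed", python_callable=embed)
    extract_task >> embed_task  # embed runs only after extract succeeds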
Architecture & Engineering Fundamentals
Strong system design fundamentals for scalable and distributed systems.
Knowledge of event-driven architecture and modern data platforms (illustrated in the sketch after this list).
Strong understanding of DevOps, CI/CD, version control, and observability best practices.
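
Event-driven design, mentioned above, can be illustrated with a toy in-process publish/subscribe bus; real systems would use Kafka, Azure Event Hubs, or similar, and the topic name and handler here are invented for the example.

from collections import defaultdict
from typing import Callable

class EventBus:
    # Toy in-process event bus: producers and consumers are decoupled by topic.
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._handlers[topic]:
            handler(event)

bus = EventBus()
bus.subscribe("doc.ingested", lambda e: print("re-embed", e["doc_id"]))
bus.publish("doc.ingested", {"doc_id": "a1"})  # prints: re-embed a1

The point is the decoupling: the ingestion producer needs no knowledge of which downstream consumers (re-embedding, indexing, monitoring) react to each event.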
Job ID: 135681169