
We are seeking a highly skilled Lead Data Engineer to architect and build scalable, cloud-native data and AI pipelines that power enterprise LLM, RAG, and retrieval systems.
Key Responsibilities
Design, build, and optimize scalable data pipelines for AI/LLM workloads, including vectorization and embedding processing.
Develop and maintain ETL/ELT workflows for structured, unstructured, and streaming data.
Create and manage vector database indexing and similarity search pipelines using tools such as FAISS, Pinecone, Weaviate, Qdrant, or Chroma (a minimal similarity-search sketch follows this list).
Build retrieval systems for RAG, semantic search, and enterprise knowledge retrieval.
Develop robust, reusable data orchestration pipelines using Airflow, Spark, or similar tools.
Architect and manage data pipelines across Azure (primary), AWS, and GCP environments.
Integrate and optimize storage and processing across SQL, NoSQL, and vector databases.
Contribute to the design and implementation of event-driven architectures.
Collaborate with AI teams to enable embedding generation, LLM integration, and model-serving pipelines.
Ensure end-to-end data quality, monitoring, reliability, and observability.
Lead or participate in system design for large-scale, distributed data and AI systems.
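
To illustrate the kind of similarity-search pipeline described above, here is a minimal sketch using FAISS. It assumes the faiss-cpu and numpy packages are installed; the corpus, embedding dimension, and query vector are random placeholders standing in for a real embedding pipeline.

import numpy as np
import faiss

dim = 384  # placeholder: must match the embedding model's output dimension
index = faiss.IndexFlatIP(dim)  # exact inner-product index; cosine similarity on normalized vectors

corpus = np.random.rand(1000, dim).astype("float32")  # stand-in for real document embeddings
faiss.normalize_L2(corpus)  # normalize in place so inner product equals cosine similarity
index.add(corpus)

query = np.random.rand(1, dim).astype("float32")  # stand-in for an embedded user query
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest neighbors
print(ids[0], scores[0])

At scale, the exact IndexFlatIP would typically be swapped for an approximate index such as IndexHNSWFlat or IndexIVFFlat once the corpus grows into the millions of vectors.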
Required Skills
Programming & Data
Strong expertise in Python for data processing, APIs, automation, or distributed workloads.
Strong proficiency in SQL and knowledge of NoSQL databases (MongoDB, DynamoDB, Cosmos DB, etc.).
Experience with vector databases such as FAISS, Pinecone, Weaviate, Qdrant, or Chroma.
Strong knowledge of data modeling, pipeline development, and ETL/ELT frameworks.
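
As a concrete (if deliberately tiny) illustration of the ETL pattern above, the following sketch extracts rows from a CSV, applies a light transform, and loads them into SQLite using only the Python standard library; the file paths and column names (id, kind, ts) are hypothetical.

import csv
import sqlite3

def load_events(csv_path: str, db_path: str) -> None:
    # Extract rows from a CSV, normalize one column, and load into SQLite.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id TEXT PRIMARY KEY, kind TEXT, ts TEXT)"
    )
    with open(csv_path, newline="") as f:
        rows = [
            (r["id"], r["kind"].strip().lower(), r["ts"])  # transform: normalize kind
            for r in csv.DictReader(f)
        ]
    conn.executemany("INSERT OR REPLACE INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

Real pipelines would target Spark or a warehouse rather than SQLite, but the extract-transform-load shape is the same.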
AI/LLM Infrastructure
Solid understanding of vectorization, embeddings, and similarity search techniques.
Familiarity with LLMs, embedding models, and RAG pipeline concepts.
Experience integrating embedding-generation pipelines via Hugging Face, OpenAI, or other model providers.
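
For embedding generation, a minimal sketch with the sentence-transformers library (one common route onto Hugging Face models) might look like the following; the model name and documents are illustrative choices, not a prescribed stack.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model
docs = [
    "Quarterly revenue grew 12%.",        # placeholder enterprise documents
    "The data platform runs on Azure.",
]
embeddings = model.encode(docs, normalize_embeddings=True)  # L2-normalized for cosine search
print(embeddings.shape)  # (2, 384) for this particular model

The same vectors would then feed a FAISS, Pinecone, Weaviate, Qdrant, or Chroma index like the one sketched earlier.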
Cloud & Distributed Systems
Proficiency with Azure (primary) and familiarity with AWS and GCP.
Experience with Docker and containerized development.
Understanding of Kubernetes is a strong plus.
Orchestration & Big Data
Expertise in Apache Airflow for scheduling and orchestration (a minimal DAG sketch follows this list).
Experience with Apache Spark or equivalent distributed processing frameworks.
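
A minimal Airflow DAG of the kind implied above might look like this sketch; the DAG id, schedule, and task bodies are hypothetical placeholders, and the schedule keyword assumes Airflow 2.4 or later.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull source data")  # placeholder for a real extract step

def embed():
    print("generate embeddings")  # placeholder for a real embedding step

with DAG(
    dag_id="embedding_refresh",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    embed_task = PythonOperator(task_id="embed", python_callable=embed)
    extract_task >> embed_task  # embed runs only after extract succeeds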
Architecture & Engineering Fundamentals
Strong system design fundamentals for scalable and distributed systems.
Knowledge of event-driven architecture and modern data platforms (illustrated in the sketch after this list).
Strong understanding of DevOps, CI/CD, version control, and observability best practices.
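
Event-driven design, mentioned above, can be illustrated with a toy in-process publish/subscribe bus; real systems would use Kafka, Azure Event Hubs, or similar, and the topic name and handler here are invented for the example.

from collections import defaultdict
from typing import Callable

class EventBus:
    # Toy in-process event bus: producers and consumers are decoupled by topic.
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._handlers[topic]:
            handler(event)

bus = EventBus()
bus.subscribe("doc.ingested", lambda e: print("re-embed", e["doc_id"]))
bus.publish("doc.ingested", {"doc_id": "a1"})  # prints: re-embed a1

The point is the decoupling: the ingestion producer needs no knowledge of which downstream consumers (re-embedding, indexing, monitoring) react to each event.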
Job ID: 135681169