As a Senior AI Data Engineer, you will bridge the gap between raw enterprise data and our Generative AI ecosystem. You will architect the nervous system of our AI models, ensuring our LLMs have access to high-quality, real-time, and contextually relevant data to power the next generation of Valeo's Software Defined Vehicle (SDV) tools and internal processes.
Key Responsibilities
Advanced RAG Architecture: Design and implement end-to-end pipelines for Retrieval-Augmented Generation (RAG), including advanced retrieval techniques, Graph RAG, Agentic RAG, and multimodal RAG.
Vector Database Management: Architect and optimize vector stores (e.g., Qdrant, Pinecone) to handle high-dimensional embeddings and ensure low-latency similarity searches.
Data Connectivity: Design, develop, and maintain secure data connectors to pull information from various external tools, SaaS platforms, and internal databases.
Data Pre-processing for Gen-AI: Develop cleaning and normalization workflows specifically for unstructured data (PDFs, HTML, Markdown) so that the text reaching chunking, embedding, and LLM stages is clean and consistent.
Orchestration: Use tools like LangChain, LlamaIndex, or Haystack to orchestrate complex data flows between storage, embedding models, and LLM endpoints.
Monitoring & Evaluation: Implement RAG-as-a-service monitoring to track retrieval and answer quality (e.g., context relevancy, faithfulness) and to detect data drift in production.
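For candidates less familiar with the terminology, the core retrieval step behind the responsibilities above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not a description of Valeo's actual stack: the function names (chunk_text, embed, cosine, retrieve) are hypothetical, and embed is a bag-of-words stand-in for a real embedding model. In production, the embedding and similarity search would typically be handled by an embedding model and a vector store such as those named in this posting.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (a simple fixed-size chunking strategy)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical sample documents, for demonstration only.
    docs = [
        "The brake controller firmware is updated over the air every quarter.",
        "Vector databases store embeddings and support similarity search.",
        "Cafeteria opening hours are 11:30 to 14:00 on weekdays.",
    ]
    for score, chunk in retrieve("How do vector databases search embeddings?", docs, k=1):
        print(f"{score:.2f}  {chunk}")
```

The retrieved chunks would then be placed into the LLM prompt as context. Monitoring retrieval quality amounts to checking, over time, whether the chunks returned here are actually relevant to the queries users send.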
Candidate Profile
Education: B.Sc. or M.Sc. in Computer Science, Data Engineering, or a related field.
Experience: 5+ years of data engineering experience, with a recent focus on AI/ML pipelines.
Technical Skills:
Languages: Expert proficiency in Python and SQL; knowledge of Java or Go is a plus.
Data Tools: Experience with Spark, Kafka, Airflow/Prefect, and dbt.
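AI Frameworks: Hands-on experience with LangChain, LlamaIndex, the OpenAI API, and Hugging Face, as well as agent protocols and toolkits such as Model Context Protocol (MCP), Agent2Agent (A2A), and Agent Development Kit (ADK).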
Vector DBs: Proficiency with Qdrant, Pinecone, Chroma, Weaviate, or pgvector.
Cloud/DevOps: Knowledge of GCP/AWS, Docker, and Kubernetes.
Preferred: Deep understanding of embedding models, chunking strategies, and unstructured data extraction tools (e.g., Unstructured.io, PyMuPDF).