We are seeking a skilled Senior QA Engineer with 2–5 years of experience and a strong foundation in backend API testing, AI system evaluation, and production-quality test automation. The ideal candidate has at least 2 years of hands-on backend/API testing, strong coding skills in Python, TypeScript, or Java, and a passion for building eval infrastructure for AI systems.
Key Responsibilities
- Design, develop, and execute eval datasets and regression harnesses for production AI systems - voice agents and enterprise chat platforms.
- Collaborate with AI engineering teams to embed quality gates into PR workflows - eval scores before merge, not after.
- Build and own LLM-as-judge harnesses, golden datasets, and prompt regression suites.
- Write and maintain automated test frameworks using Pytest, REST Assured, or equivalent coded frameworks.
- Perform API and backend testing across microservices and async LLM pipelines.
- Design observability dashboards so anyone can answer "Did the AI get worse this week?" with a chart, not gut feel.
- Partner with engineering on red-teaming - adversarial datasets covering PII, jailbreaks, and prompt injection.
- Continuously research and recommend new eval tooling and testing strategies to improve AI system quality.
Key Requirements
- 4–7 years of experience in QA / SDET / Quality Engineering.
- At least 1.5–2 years in backend / API / systems testing.
- 2+ years of strong coding in Python, TypeScript, or Java.
- 2+ years with modern test frameworks - Pytest / REST Assured / JUnit / Vitest / Jest.
- Hands-on with microservices, async pipelines, and event-driven architecture.
- Experience with CI/CD integration and test infrastructure.
- Proven ability to build automation frameworks from scratch, not just use existing tools.
- Exposure to AI/LLM eval tooling: Langfuse, LangSmith, RAGAS, DeepEval, or equivalent (preferred).
Preferred Qualifications
- Strong systems thinking - reasons about contracts, retries, latency, and failure modes, not just UI surfaces.
- Experience with observability tooling - OpenTelemetry, Datadog, or Honeycomb.
- Familiarity with voice/telephony testing, ASR/TTS evaluation, or regulated-domain QA (PII, audit trails, compliance).
- Excellent communication and collaboration skills.
- Ability to work independently and take full ownership of quality engineering.
Expectations rise with seniority, bringing greater responsibility and more opportunities for growth.
Why Join Us
- Greenfield eval infrastructure - build quality systems for production AI, not maintain legacy test suites.
- Real stakes: regulated industries, real customers, real money flows. Hallucinations are not acceptable.
- Embedded in design from day one - eval scores in PR descriptions before merge, not a downstream gate.
- Work alongside modern AI coding tools (Claude Code, Codex) as part of normal development.
- Collaborative team with a strong emphasis on engineering rigor and continuous improvement.
Skills: Automated testing, Python, Pytest, unit testing, API testing, REST Assured, microservices, object-oriented programming (OOP), RESTful APIs, Robot Framework, and test automation (QA)