AI Engineer (LLM Quality Analyst)
Code.Hub
Αττική
Hybrid
Information Technology
Permanent
Full-time
Job Description
Our Client is seeking a detail-oriented LLM Quality Analyst to join their AI team. You will be responsible for designing, implementing, and managing comprehensive testing and evaluation frameworks for their generative AI products. This role is critical to ensuring their AI systems meet the highest standards of quality, accuracy, safety, and ethical compliance before reaching production.
Key Responsibilities
Testing Framework Development
Design and implement comprehensive testing frameworks for LLM and generative AI applications
Create curated benchmark suites using industry-standard datasets (TruthfulQA, ARC, TriviaQA, MMLU); a minimal sketch of such a check follows this list
Develop custom evaluation datasets tailored to specific use cases and domains
Build synthetic data generation pipelines for edge case testing
Define clear evaluation criteria, rubrics, and quality metrics
Establish testing protocols for different AI model types and applications
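For flavor only, a benchmark-style check of the kind described in this list might look like the minimal pytest sketch below; the questions, expected keywords, and the generate_answer stand-in are illustrative assumptions rather than the client's actual stack.

```python
# Minimal sketch of a benchmark-style regression check with pytest.
# The items and the generate_answer stand-in are hypothetical placeholders.
import pytest

BENCHMARK_ITEMS = [
    # (question, keywords expected in a factually correct answer)
    ("What is the capital of France?", ["paris"]),
    ("How many planets are in the Solar System?", ["eight", "8"]),
]

def generate_answer(question: str) -> str:
    """Stand-in for the model under test; replace with a real model or API call."""
    if "France" in question:
        return "The capital of France is Paris."
    return "There are eight planets in the Solar System."

@pytest.mark.parametrize("question,expected_keywords", BENCHMARK_ITEMS)
def test_answer_contains_expected_fact(question, expected_keywords):
    answer = generate_answer(question).lower()
    # Pass if at least one accepted phrasing of the fact appears in the answer.
    assert any(keyword in answer for keyword in expected_keywords)
```

Keyword checks like this are deliberately loose; in practice they would sit alongside LLM-based or human grading for open-ended outputs.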
Quality Evaluation & Testing
Execute automated and manual evaluations of AI model outputs
Measure and track key quality metrics: relevance, factual consistency, coherence, hallucination rate (see the aggregation sketch after this list)
Assess model performance across dimensions: accuracy, latency, fairness, toxicity, bias
Perform functional correctness testing for code generation and structured outputs
Conduct A/B testing and shadow testing for model comparisons
Validate prompt engineering strategies and RAG pipeline effectiveness
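As a rough illustration of tracking such metrics, per-example evaluation results could be rolled up with pandas as in the sketch below; the file name and column layout are assumptions made for the example, not a prescribed schema.

```python
# Illustrative sketch: aggregating per-example evaluation results into the
# quality metrics named above. The CSV name and columns are assumptions.
import pandas as pd

# Each row: one evaluated model output with its per-example scores/labels.
results = pd.read_csv("eval_run.csv")  # columns: model_version, relevance, consistent, hallucinated

summary = results.groupby("model_version").agg(
    avg_relevance=("relevance", "mean"),          # mean of 1-5 relevance ratings
    factual_consistency=("consistent", "mean"),   # share of outputs judged consistent
    hallucination_rate=("hallucinated", "mean"),  # share of outputs flagged as hallucinations
    n_examples=("relevance", "size"),
)
print(summary.sort_values("hallucination_rate"))
```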
Feedback Management & Issue Tracking
Manage intake of user-reported issues and feedback
Translate user feedback into reproducible test cases
Replicate and document bugs, edge cases, and failure modes
Log and track defects in issue-tracking systems (Jira, Linear, GitHub Issues)
Prioritize issues based on severity, frequency, and business impact
Collaborate with engineering teams on root cause analysis and resolution
Human Annotation & LLM-Based Evaluation
Coordinate human annotation efforts with clear guidelines and rubrics
Implement overlapping review processes to ensure annotation consistency
Integrate LLM-based evaluators (GPT-4, Claude) for automated quality assessment
Design evaluation prompts that provide structured scores and reasoning (sketched after this list)
Validate LLM evaluator outputs against human judgments
Continuously refine evaluation methodologies based on findings
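As a sketch of the LLM-based evaluation described above, a judge call that returns a structured score and a one-sentence rationale could look roughly like the following; the model name, rubric, and JSON shape are illustrative assumptions rather than a prescribed implementation.

```python
# Rough sketch of an LLM-as-judge call that returns a structured score.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment;
# the rubric, model name, and response schema here are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the ANSWER for factual consistency with the SOURCE on a 1-5 scale.
Respond with JSON: {{"score": <int>, "reasoning": "<one sentence>"}}.

SOURCE: {source}
ANSWER: {answer}"""

def judge_factual_consistency(source: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(source=source, answer=answer)}],
        response_format={"type": "json_object"},  # request machine-readable output
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example: verdict = judge_factual_consistency(source_doc, model_answer)
```

Structured judge outputs like this are typically spot-checked against human annotations before being trusted at scale, as noted in the list above.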
Monitoring & Reporting
Monitor live-traffic metrics through observability dashboards
Track model performance trends and identify quality regressions
Generate comprehensive quality reports for product and engineering teams
Present findings and recommendations to stakeholders
Maintain documentation of testing procedures and evaluation results
Drive continuous improvement of AI system quality
Required Qualifications
Education
Bachelor’s degree in Computer Science, Data Science, Linguistics, Cognitive Science, or related field
Master’s degree in AI/ML, NLP, or related field (preferred)
Experience
3+ years of experience in quality assurance, testing, or evaluation roles
2+ years working with AI/ML systems, preferably LLMs or NLP applications
Experience designing and executing test plans for software or AI products
Proven track record of identifying and documenting complex technical issues
Technical Skills
Programming: Proficiency in Python for test automation and data analysis
LLM Knowledge: Understanding of how LLMs work, including their capabilities and limitations
Evaluation Frameworks: Familiarity with LLM evaluation tools (Langfuse, DeepEval, RAGAS, Phoenix)
Data Analysis: Experience with pandas, numpy, and data visualization tools
Testing Tools: Knowledge of pytest, unittest, or similar testing frameworks
Issue Tracking: Proficiency with Jira, Linear, GitHub Issues, or similar platforms
APIs: Ability to work with REST APIs and LLM provider APIs (OpenAI, Anthropic)
SQL: Basic SQL skills for querying databases and analyzing results
Quality Assurance Expertise
Strong understanding of QA methodologies and best practices
Experience with test case design and test coverage analysis
Knowledge of different testing types: functional, regression, integration, performance
Familiarity with CI/CD pipelines and automated testing integration
Understanding of metrics and KPIs for quality measurement
AI/ML Evaluation Knowledge
Understanding of common LLM evaluation metrics (BLEU, ROUGE, BERTScore, perplexity)
Knowledge of bias detection and fairness evaluation techniques
Familiarity with hallucination detection and factual consistency checking
Understanding of prompt engineering and its impact on model outputs
Awareness of AI safety, ethics, and responsible AI principles
Core Competencies
Exceptional attention to detail and analytical thinking
Strong problem-solving and critical reasoning abilities
Excellent written and verbal communication skills
Ability to work independently and manage multiple priorities
Collaborative mindset for cross-functional teamwork
Curiosity and willingness to learn new AI technologies
Preferred Qualifications
Experience with specific LLM evaluation platforms (Langfuse, Weights & Biases, Arize)
Knowledge of human-in-the-loop evaluation workflows
Familiarity with RAG systems and vector database evaluation
Experience with adversarial testing and red-teaming for AI systems
Understanding of model fine-tuning and its quality implications
Background in linguistics, cognitive science, or human-computer interaction
Experience with statistical analysis and hypothesis testing
Knowledge of regulatory requirements (GDPR, AI Act) for AI systems
Contributions to AI evaluation research or open-source projects
Experience with multimodal AI evaluation (text, image, audio)
What They Offer
Competitive salary and benefits package
Comprehensive health, dental, and vision insurance
Professional development opportunities in AI/ML
Flexible work arrangements (remote/hybrid options)
Access to cutting-edge AI technologies and tools
Collaborative team environment with AI experts
Opportunity to shape quality standards for innovative AI products
Conference attendance and learning budget
Company Description
Code.Hub is a Recruitment Agency, a Learning & Development partner, and an Extended Team and Project Delivery expert in the Tech Industry.
We source and train the best candidates in the Tech Industry.
With a team of 150+:
- Tech-specific Recruiters
- Certified Trainers
- Software Development specialists
we are prepared for every challenge.
Using exclusively Agile methodologies, we help companies of all levels and technologies extend their vision many steps forward.