Evaluating Reliability in a Multimodal Medical QA System with RAG
Overview
This project investigates how retrieval-augmented generation (RAG) affects the reliability of a multimodal medical question-answering system. I led a team in designing and evaluating a conversational agent grounded in NHS Inform Scotland data.
The focus was not only on building the system, but on analysing how grounding changes model behaviour, particularly in a safety-critical domain where incorrect responses can have real consequences.
The system integrates text, speech, and a proof-of-concept visual pipeline to explore how multimodal interaction and retrieval affect both accuracy and usability.
Problem
Large language models can produce fluent medical responses, but these may be inaccurate, unsupported, or misleading.
A key question is whether retrieval-augmented generation improves factual reliability, and what trade-offs it introduces in system behaviour, such as latency, formatting, and usability.
This project evaluates how grounding affects both accuracy and practical deployment characteristics in a multimodal setting.
Methodology
Core QA System
- Fine-tuned Llama-3.1-8B for medical QA using MedQA
- Built a FAISS-based retrieval system over NHS Inform Scotland
- Generated embeddings with MiniLM
- Compared model behaviour with and without RAG
Multimodal Integration
- Integrated Whisper for speech input
- Developed a tri-modal pipeline including visual input via a vision-language model
Evaluation Design
- Compared RAG vs non-RAG across intrinsic metrics (BLEU, ROUGE, BERTScore)
- Analysed semantic similarity to reference NHS responses
- Conducted qualitative evaluation of factual correctness and usability issues
Key Results
- RAG improved semantic alignment with reference answers:
- BERTScore: 0.7892 → 0.8333 (+5.6%)
- BLEU and ROUGE improved only marginally, indicating limited gains in surface-level fluency
- Without RAG, the system produced factually incorrect or unsupported responses
- With RAG, factual reliability improved, but new issues emerged:
- formatting inconsistencies
- repetitive hyperlinks
- increased latency
These results show that grounding improves factual correctness but introduces new system-level trade-offs, particularly in usability and response quality.
Limitations & Trade-offs
- RAG improved semantic accuracy but had limited impact on fluency
- Responses could remain difficult for non-expert users due to medical terminology
- Retrieval introduced latency and formatting inconsistencies
- Multimodal pipeline was not formally evaluated due to lack of suitable datasets
- Knowledge base was limited to NHS Inform Scotland content
Why This Matters
This project highlights a broader challenge in deploying AI systems in real-world settings: improving one aspect of performance (factual grounding) can introduce new failure modes in usability and system behaviour.
In safety-critical domains such as healthcare, evaluating reliability requires going beyond accuracy metrics to consider how systems behave under realistic interaction conditions.
Technologies
Llama 3.1, FAISS, MiniLM, Whisper, Qwen2-VL, PyTorch, Gradio
