Hamza Mooraj

Evaluating Reliability in a Multimodal Medical QA System with RAG

Overview

This project investigates how retrieval-augmented generation (RAG) affects the reliability of a multimodal medical question-answering system. I led a team in designing and evaluating a conversational agent grounded in NHS Inform Scotland data.

The focus was not only on building the system, but on analysing how grounding changes model behaviour, particularly in a safety-critical domain where incorrect responses can have real consequences.

The system integrates text, speech, and a proof-of-concept visual pipeline to explore how multimodal interaction and retrieval affect both accuracy and usability.


Problem

Large language models can produce fluent medical responses, but these may be inaccurate, unsupported, or misleading.

A key question is whether retrieval-augmented generation improves factual reliability, and what trade-offs it introduces in system behaviour, such as latency, formatting, and usability.

This project evaluates how grounding affects both accuracy and practical deployment characteristics in a multimodal setting.


Methodology

Core QA System

  • Fine-tuned Llama-3.1-8B for medical QA on the MedQA dataset
  • Built a FAISS-based retrieval index over NHS Inform Scotland content
  • Generated passage embeddings with MiniLM
  • Compared model behaviour with and without RAG grounding

Multimodal Integration

  • Integrated Whisper for speech input
  • Developed a tri-modal pipeline including visual input via a vision-language model

Evaluation Design

  • Compared RAG vs non-RAG across intrinsic metrics (BLEU, ROUGE, BERTScore)
  • Analysed semantic similarity to reference NHS responses
  • Conducted qualitative evaluation of factual correctness and usability issues

Key Results

  • RAG improved semantic alignment with reference answers:
    • BERTScore: 0.7892 → 0.8333 (a 5.6% relative improvement)
  • BLEU and ROUGE improved only marginally, indicating limited gains in surface-level fluency
  • Without RAG, the system produced factually incorrect or unsupported responses
  • With RAG, factual reliability improved, but new issues emerged:
    • formatting inconsistencies
    • repetitive hyperlinks
    • increased latency

These results show that grounding improves factual correctness but introduces new system-level trade-offs, particularly in usability and response quality.


Limitations & Trade-offs

  • RAG improved semantic accuracy but had limited impact on fluency
  • Responses could remain difficult for non-expert users due to medical terminology
  • Retrieval introduced latency and formatting inconsistencies
  • Multimodal pipeline was not formally evaluated due to lack of suitable datasets
  • Knowledge base was limited to NHS Inform Scotland content

Why This Matters

This project highlights a broader challenge in deploying AI systems in real-world settings: improving one aspect of performance (factual grounding) can introduce new failure modes in usability and system behaviour.

In safety-critical domains such as healthcare, evaluating reliability requires going beyond accuracy metrics to consider how systems behave under realistic interaction conditions.


Technologies

Llama 3.1, FAISS, MiniLM, Whisper, Qwen2-VL, PyTorch, Gradio