⚕️ Cardiology AI Assistant (ESC 2024)

⚡ Powered by Microsoft Phi-3-Mini · ZeroGPU H200

Ask questions grounded in the 2024 ESC (European Society of Cardiology) guidelines. The app uses retrieval-augmented generation (RAG): MedCPT embeddings for retrieval, Cross-Encoder reranking, Phi-3 generation, and live evaluation metrics.
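The retrieve → rerank → generate flow can be sketched as follows. This is a toy illustration only: bag-of-words vectors stand in for MedCPT embeddings, a cosine score stands in for the Cross-Encoder's relevance logits, and the function names and sample passages are invented for the example.

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; the real app uses MedCPT dense vectors.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, passages, k=3):
    # Stage 1: score every passage against the query, keep the top-k.
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def rerank(query, candidates, k=1):
    # Stage 2: a cross-encoder would jointly encode (query, passage) pairs;
    # here the same cosine score stands in for its relevance scores.
    q = embed(query)
    return sorted(candidates, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

passages = [
    "Beta blockers reduce mortality after myocardial infarction.",
    "Statins lower LDL cholesterol in chronic coronary syndromes.",
    "Aspirin is used for antiplatelet therapy.",
]
query = "Which drugs reduce mortality after myocardial infarction?"
context = rerank(query, retrieve(query, passages))
# The top passage would then be packed into the Phi-3 prompt for generation.
print(context[0])
```

In the real pipeline the reranked passages (not just the single best one) are concatenated into the generation prompt; the two-stage split exists because cross-encoders are too slow to score the whole corpus but more accurate than bi-encoder retrieval alone.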


How each metric is computed

| Metric | Method | Interpretation |
|---|---|---|
| BERTScore F1 | Sentence-level cosine-sim F1 between answer sentences and top-60 context sentences using all-MiniLM-L6-v2 (forced CPU) | Measures how semantically similar the answer is to the source context |
| ROUGE-1 | Precision: fraction of answer unigrams that appear in the retrieved context | Are the words the model used actually in the retrieved passages? |
| ROUGE-2 | Precision: fraction of answer bigrams that appear in the retrieved context | Are the phrases the model used actually in the retrieved passages? |
| Semantic Similarity | Cosine similarity of full answer ↔ question embeddings | Does the answer embed in the same semantic space as the question? |
| Faithfulness | Fraction of answer sentences with cosine-sim ≥ 0.35 to any context sentence | Are answer claims grounded in retrieved text? |
| Answer Relevance | Cosine similarity of answer ↔ question embeddings | How directly does the answer respond to the question? |
| Context Recall | Fraction of top-60 context sentences with cosine-sim ≥ 0.35 to any answer sentence | How much of the retrieved evidence is used in the answer? |
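Faithfulness and Context Recall are mirror images of the same sentence-level matching: one asks whether each answer sentence finds a context sentence above the threshold, the other asks whether each context sentence finds an answer sentence. A minimal sketch, assuming toy bag-of-words vectors in place of the real all-MiniLM-L6-v2 embeddings (the 0.35 threshold comes from the table above; everything else is invented for illustration):

```python
import re
from collections import Counter
from math import sqrt

def embed(s):
    # Toy bag-of-words stand-in for sentence-transformer embeddings.
    return Counter(re.findall(r"[a-z]+", s.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def faithfulness(answer, context_sentences, thr=0.35):
    # Fraction of answer sentences whose best context match clears the threshold.
    sents = split_sentences(answer)
    grounded = sum(
        max(cosine(embed(s), embed(c)) for c in context_sentences) >= thr
        for s in sents
    )
    return grounded / len(sents)

def context_recall(answer, context_sentences, thr=0.35):
    # Fraction of context sentences matched by at least one answer sentence.
    answer_vecs = [embed(s) for s in split_sentences(answer)]
    used = sum(
        max(cosine(embed(c), a) for a in answer_vecs) >= thr
        for c in context_sentences
    )
    return used / len(context_sentences)

ctx = ["Beta blockers reduce mortality.", "Statins lower LDL cholesterol."]
ans = "Beta blockers reduce mortality."
print(faithfulness(ans, ctx))    # 1.0: the single answer sentence is grounded
print(context_recall(ans, ctx))  # 0.5: one of two context sentences is used
```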

Why precision for ROUGE? The retrieved context is ~6,000 tokens; a correct ~60-token answer has only ~4% unigram recall against that pool — even if every word came from the context. Precision asks the right question: "Did the model use words that actually appear in the retrieved passages?"
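The precision/recall asymmetry is easy to reproduce with synthetic tokens. A minimal sketch (the helper names are invented; the 6,000/60 counts mirror the sizes above, and the exact percentages depend on tokenization and deduplication):

```python
def unigram_precision(answer, context):
    # Fraction of answer tokens that appear anywhere in the context.
    answer_tokens = answer.split()
    context_vocab = set(context.split())
    return sum(t in context_vocab for t in answer_tokens) / len(answer_tokens)

def unigram_recall(answer, context):
    # Fraction of context tokens that appear anywhere in the answer.
    answer_vocab = set(answer.split())
    context_tokens = context.split()
    return sum(t in answer_vocab for t in context_tokens) / len(context_tokens)

context = " ".join(f"tok{i}" for i in range(6000))  # ~6,000-token retrieved pool
answer = " ".join(f"tok{i}" for i in range(60))     # 60-token answer drawn entirely from it

print(unigram_precision(answer, context))  # 1.0: every answer word is in the context
print(unigram_recall(answer, context))     # 0.01: the answer covers only 1% of the pool
```

Even a perfectly grounded answer scores near zero on recall against a large context, which is why the app reports the precision direction.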

All metrics are reference-free — they use the retrieved context and original query as the reference signal, so no annotated ground-truth is needed.