Evaluating medical note generation
Ruben Stern
Machine Learning Engineer
Marianne de Vriendt
Machine Learning Engineer
Grégoire Retourné
Machine Learning Engineer
Sam Humeau
Machine Learning Engineer
The application of generative AI in medical note generation has stood out in the past year for its transformative role in saving clinicians time and allowing them to focus on caregiving.
For months now, thousands of doctors have reached out to express their appreciation for Nabla and for the quality of the notes it produces. To emphasize our models’ precision, it's worth noting that clinicians make edits to only 5% of the notes generated by Nabla.
To generate high-quality notes and improve our generation process, we must first be able to measure the quality of a note. That is what this post covers.
What is a medical note?
A medical note is a summary of an interaction between a doctor and their patient. Doctors are usually most familiar with a note template called SOAP (Subjective, Objective, Assessment, Plan). By default, Nabla outputs a more detailed style of note, which our users report being very comfortable with. Here is a (fictional) example:
## Chief complaint
Persistent foot pain

## History of present illness
- Foot pain started about four weeks ago
- Pain was intense at first, lasted for about two and a half weeks at that level
- Pain has been less intense in the last week or so, but still persistent
- Pain is triggered when weight is put on the foot
- Pain is located from the pad of the foot to a specific area
- Pain was initially managed with rest and Metatarsal Inserts, Metatarsal Pads
- Acute moment of pain occurred while playing soccer
- Pain is associated with swelling in the foot

## Past medical history
Joint replacement surgery after the initial phase of the pandemic in 2020

## Past surgical history
Joint replacement surgery

## Social history
Plays soccer

## Imaging results
X-rays of the foot were taken which show no fracture or dislocation

## Physical exam
Swelling observed in the foot.

## Assessment & Plan
Foot pain
- Capsulitis of the foot, possibly due to transfer pain from joint replacement surgery. The joint replacement has shortened the foot, causing more pressure on a smaller bone. The capsule around the joint may have a tear or be stretched.
- Short-term fix is a splint to take the slack off of the joint. Avoid running and activities that involve bending the foot. If pain persists after a month, consider steroid shots. If the problem continues, consider removing the implant, putting a bone graft in, and fusing the joint. This would also involve shortening a bone to balance the foot.

## Prescription
Splint for the foot

## Appointments
Follow-up appointment in six weeks if pain is not better.
What is a good medical note?
By working closely with doctors and learning their preferences and requirements for medical notes, we have identified three key criteria:
- A note that contains the useful information
- A note with an enjoyable style (no redundancy, no repetitions, conciseness, etc.)
- A note that does not make up facts
The following sections aim to quantify these criteria.
Measuring note quality
For each of the three criteria above, our methodology is to express the evaluation as a large set of clear questions that we put to a Large Language Model (in this case GPT-4), which acts as a judge.
Data
By default, Nabla does not record any data. However, doctors using our product can share the transcript and the note of an encounter with us, if they consent to it and have the consent of their patients. Our evaluation is based on a dataset of 86 such encounters, all in general medicine.
Recall (getting the useful information)
For each encounter, a group of Nabla engineers and doctors extracted from the transcript several pieces of information that should or could be in the corresponding note. The resulting 1,088 pieces of information are then expressed as questions, such as these:
Question: Does the note mention the patient taking children's Tylenol?
Expected answer: Yes

Question: Does the note mention the patient vomiting?
Expected answer: Yes
For each encounter, we then ask GPT-4 to judge whether the note generated by Nabla contains these facts, using the following prompt:
You are a doctor that has hired a student to take notes during your encounters.
You want to review a note and answer some questions about it.

## Note
{note}

## Questions

### Question 1
{question, e.g. "Does the note mention the patient taking children's Tylenol?"}
Think step by step and give a short explanation before giving an answer.
Conclude your answer with one of the following possibilities, depending on what you find most appropriate:
 response A: Yes
 response B: No
You must choose one answer and it must be written EXACTLY like above. Do not invent new responses. Do not repeat the content of the question. Start your response by "### Question 1".
Your answer must finish by a new sentence being: "Response X: Y.", X being what you chose, Y being its value.

### Question 2
[...]
This prompt allows us to judge several aspects in a single query. As you can see, we used several tricks to constrain this zero-shot prompt to an expected format. The recall score is the proportion of "tests" that "pass", that is, questions for which GPT-4's answer matches the expected answer.
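As a rough illustration of this workflow, here is a minimal Python sketch that assembles a multi-question prompt, parses the closing "Response X: Y." lines, and computes the pass rate. The function names (`build_prompt`, `parse_answers`, `pass_rate`) are hypothetical, the per-question instructions are abridged, and the call to the judge model is deliberately left out; this is an assumption about how such a pipeline could look, not Nabla's actual code.

```python
import re


def build_prompt(note: str, questions: list[str]) -> str:
    """Assemble one judge prompt covering several questions (wording abridged)."""
    parts = [
        "You are a doctor that has hired a student to take notes during your encounters.",
        "You want to review a note and answer some questions about it.",
        "",
        "## Note",
        note,
        "",
        "## Questions",
    ]
    for i, question in enumerate(questions, start=1):
        parts += [
            "",
            f"### Question {i}",
            question,
            "Think step by step and give a short explanation before giving an answer.",
            f'Start your response by "### Question {i}".',
            'Your answer must finish by a new sentence being: "Response X: Y.", '
            "X being what you chose, Y being its value.",
        ]
    return "\n".join(parts)


# Matches the letter in a closing line such as "Response A: Yes."
RESPONSE_RE = re.compile(r"Response ([A-Z]):")


def parse_answers(reply: str) -> dict[int, str]:
    """Map each question number to the letter chosen by the judge."""
    answers = {}
    # With one capture group, re.split returns [preamble, n1, text1, n2, text2, ...].
    chunks = re.split(r"^### Question (\d+)", reply, flags=re.MULTILINE)
    for number, text in zip(chunks[1::2], chunks[2::2]):
        letters = RESPONSE_RE.findall(text)
        if letters:
            # Keep the last match: the final sentence carries the verdict.
            answers[int(number)] = letters[-1]
    return answers


def pass_rate(answers: dict[int, str], expected: dict[int, str]) -> float:
    """Proportion of questions whose parsed answer equals the expected one."""
    passed = sum(answers.get(i) == letter for i, letter in expected.items())
    return passed / len(expected)
```

In this sketch a question whose answer cannot be parsed counts as a failed test, which is a conservative choice; a real pipeline might retry the query instead.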
Style (showing the information in a clear and concise manner)
We use a similar methodology to evaluate the style of the note, only this time, the same questions are asked for every note. Here are some examples of questions:
Question: Does the note contains several sentences with repeated structure?
Expected answer: No

Question: Does the family history section contains information about the patient past medical history?
Expected answer: No
We send a prompt to GPT-4 very similar to the one used for recall above, and the style score is given as the proportion of "tests" that "pass". The total number of style questions is 516.
Veracity (not making up facts)
Evaluating the veracity of a note is trickier. The idea is to measure the percentage of "atomic facts" from the note that appear in the transcript.
Extracting facts from the note
For each note, we first use GPT-4 to split it into atomic facts. This is a non-trivial task, and we use an LLM because some sentences contain many atomic facts. We use the following prompt:
This is a section of a medical report.
"""
### {section_title}
{section_content}
"""
Re-express it into atomic facts. Atomic facts are very small sentences such as:
Patient has been prescribed XXX.
Patient has this symptom: XXX.
Patient is taking XXX.
Patient's weight is XXX.
Doctor diagnoses XXX.
Doctor planned a new appointment with XXX the XXX.

Each sentence starts either with "doctor" or "patient". All facts must be understood independently, so do NOT output facts like "No other symptoms reported".
Do not mention facts about things "not discussed" or "not known".

Answer directly with one fact per line. If there is no facts, just answer None.
Here is an example output:
Patient reports experiencing chronic fatigue.
Patient has difficulty focusing at work.
Patient has been experiencing these symptoms for the past four months.
Patient describes her energy levels as consistently low.
Patient has a strong desire to rest after work.
Patient has been experiencing headaches.
Patient's headaches are localized on the right side of her head.
...
The total number of facts varies depending on the system that generates the note (cf. Results section).
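The bookkeeping around this step can be sketched in a few lines of Python. The sketch below assumes the note uses `## ` headings as in the example note earlier in this post, and that the model replies with one fact per line or `None`; the helper names (`split_sections`, `parse_facts`) are hypothetical, and the model call itself is omitted.

```python
def split_sections(note: str) -> list[tuple[str, str]]:
    """Split a markdown note into (title, content) pairs on '## ' headings."""
    sections = []
    title, lines = None, []
    for line in note.splitlines():
        if line.startswith("## "):
            if title is not None:
                sections.append((title, "\n".join(lines).strip()))
            title, lines = line[3:].strip(), []
        elif title is not None:
            lines.append(line)
    if title is not None:
        sections.append((title, "\n".join(lines).strip()))
    return sections


def parse_facts(reply: str) -> list[str]:
    """Turn a one-fact-per-line reply into a list; 'None' means no facts."""
    facts = [line.strip() for line in reply.splitlines() if line.strip()]
    if len(facts) == 1 and facts[0].rstrip(".").lower() == "none":
        return []
    return facts
```

Each `(title, content)` pair would fill the `{section_title}` and `{section_content}` placeholders of the prompt, and the facts parsed from all sections would be pooled before the next step.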
Evaluating their presence in the transcript
We then use the following prompt to check whether each of these atomic facts is present in the transcript. In this case, though, instead of a binary answer we offer three possible responses, as illustrated by this example:
You are a doctor who is reviewing its encounter notes, with access to the transcript of the encounter.

## Transcript
{transcript}

## Questions

### Question 1
Is the following fact present in the transcript ? {fact, e.g. "Patient reports experiencing chronic fatigue."}
Think step by step and give a short explanation before giving an answer.
Conclude your answer with one of the following possibilities, depending on what you find most appropriate:
 response A: Yes
 response B: It’s unclear. Some elements are in the transcript but there is room for interpretation
 response C: Definitely not. There is no mention at all in the transcript.
You must choose one answer and it must be written EXACTLY like above. Do not invent new responses. Do not repeat the content of the question. Start your response by "### Question 1".
Your answer must finish by a new sentence being: "Response X: Y.", X being what you chose, Y being its value.
We use three possible responses because the poor quality of the transcript often makes it ambiguous whether a fact is present or not. Distinguishing obvious hallucinations from non-obvious ones is helpful because penalizing the model for the latter may be unfair.
Here are some examples of non-obvious cases, providing the full response by GPT-4:
Is the following fact present in the transcript ? "Doctor advised to rest.".
The doctor advises the patient to "take it easy" if they have a stomach flu, which could be interpreted as advising the patient to rest. However, the term "rest" is not specifically used. Therefore, there is room for interpretation.
Response B: It's unclear. Some elements are in the transcript but there is room for interpretation.
Is the following fact present in the transcript ? "Patient started experiencing insomnia in their early 20s.".
The patient mentions that their issue probably started in their early 20s. However, the specific issue is not explicitly stated in that part of the transcript. It is only later in the transcript that the patient discusses having trouble switching off and not being able to sleep, which suggests insomnia.
Response B: It's unclear. Some elements are in the transcript but there is room for interpretation.
The veracity score is computed as the proportion of facts that receive response A or B, that is, the share of A responses plus the share of B responses.
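In code, this score reduces to a simple count. The following Python sketch assumes each atomic fact has already been reduced to the letter chosen by the judge (A, B or C); the function names are hypothetical.

```python
from collections import Counter

# Letters follow the ternary prompt: A = yes, B = unclear, C = definitely not.
SUPPORTED = ("A", "B")


def veracity_score(responses: list[str]) -> float:
    """Share of atomic facts judged A or B (share of A plus share of B)."""
    if not responses:
        raise ValueError("at least one judged fact is required")
    counts = Counter(responses)
    return sum(counts[letter] for letter in SUPPORTED) / len(responses)


def breakdown(responses: list[str]) -> dict[str, float]:
    """Per-letter shares, useful to see how much of the score comes from B."""
    counts = Counter(responses)
    return {letter: counts[letter] / len(responses) for letter in "ABC"}
```

Reporting the breakdown next to the score keeps the "unclear" cases visible, since they count as supported here.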
Use case: comparison of backbone models
Nabla's note generation relies on multiple queries sent to a backbone model. Choosing which backbone model to use is an ideal use case for the evaluation metrics described in this post. Here we compare four models: Hugging Face's Zephyr 7B Beta, and OpenAI's GPT-3.5 Turbo (0613), GPT-4 (0613) and GPT-4 Turbo (1106-preview).
Results
| | Zephyr | GPT-3.5 Turbo | GPT-4 | GPT-4 Turbo |
|---|---|---|---|---|
| Recall (1,088 questions) | 46% | 58% | 70% | 78% |
| Style (516 questions) | 72% | 73% | 80% | 81% |
| Veracity | 73% (3,126 facts) | 94% (2,759 facts) | 98% (2,950 facts) | 97% (3,633 facts) |
Discussion
- Regarding the recall scores, there is some subjectivity about which facts should or should not be in the note, which makes a 100% score virtually unattainable. Lifestyle facts, for instance ("Patient plays golf"), may or may not be considered mandatory. Further work will distinguish between these types of facts.
- GPT-4 Turbo scores highest on recall, which is consistent with it outputting more facts (3,633).
- GPT-4 0613 seems to be the most reliable in terms of veracity.
- For our specific task, we find that Zephyr, the open-source state of the art among 7B-parameter models, lags behind. We acknowledge, however, that our prompts have been iterated on over time to work well with GPT-4, and other ways of formulating the input may work better with Zephyr. We also acknowledge that Zephyr has only 7B parameters, which makes this a very unfair comparison.
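To make the trade-off between GPT-4 and GPT-4 Turbo more concrete, the figures from the Results table can be turned into approximate counts of unsupported facts. This is only a back-of-the-envelope Python sketch: the published percentages are rounded, so the counts are estimates, and the helper name is hypothetical.

```python
def unsupported_facts(veracity: float, total_facts: int) -> int:
    """Estimate how many facts were judged absent from the transcript."""
    return round((1 - veracity) * total_facts)


# (veracity, number of extracted facts) as reported in the Results table.
reported = {
    "GPT-4": (0.98, 2950),
    "GPT-4 Turbo": (0.97, 3633),
}

for model, (veracity, total) in reported.items():
    estimate = unsupported_facts(veracity, total)
    print(f"{model}: about {estimate} unsupported facts out of {total}")
```

With these rounded inputs the estimate is roughly 59 unsupported facts for GPT-4 against roughly 109 for GPT-4 Turbo, which illustrates why a model that outputs more facts can gain recall while losing a little veracity.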
Conclusion
In this post we've shown how we evaluate the quality of the notes we generate at Nabla, and how we use this evaluation to compare different backbone models. Although further work will follow, this is a critical step in the development of our medical assistant. We hope that this post will be useful to other companies and engineering teams dedicated to improving healthcare.
If you liked this post, you might be interested in joining our team as an ML engineer!