
Improving Automated Punctuation of Transcribed Medical Reports

What do the following sentence pairs have in common?

“Let’s eat Grandma!” vs “Let’s eat, Grandma!”

“A woman, without her man, is nothing.” vs “A woman: without her, man is nothing.”

“I love cooking my family and my pets.” vs “I love cooking, my family, and my pets.”

Indeed, a few changes in punctuation completely alter the meaning of a sentence. While holding different roles and taking on different forms across languages over the past millennia, punctuation is commonly believed to find its roots in spoken language, as guidance for orators on how best to pause, break down messages and map written text to its aural form. Over time these marks found value not only in directing the appropriate reading of texts, but also in adding precision and clarity to writing, conveying a clearer, more standardised meaning. It is estimated that punctuation marks make up about 10–15% of all characters in most Latin-script texts, including Dutch and English. These marks, or their absence, have a big effect on the meaning of a text, leading to anything from minor confusion to a complete change in purpose.

Punctuation in Transcription

In automatic speech recognition (ASR), most of the attention is often given to the actual transcription of spoken words as dictated by the user, with most industry research focusing on improving the coverage of languages and accents/dialects from the perspective of word error rate (WER). Punctuation is frequently overlooked as an integral part of the transcription process, with its impact on final results often underrated.

Attendi aims to deliver a fast, effective and easy-to-use solution for the reporting of client visits by healthcare professionals through electronic health record (EHR) systems, streamlining the documentation of information while allowing them to focus on what matters most — caring for the client. Proper punctuation is an integral part of clear and concise documenting in this flow.

Users are able to add punctuation manually by literally saying “period”, “comma”, “colon”, etc. We refer to this as spoken punctuation. While this functionality gives the users full control, it breaks the natural flow of speech, and requires users to take more time to complete reports. For this reason, Attendi has tuned its own Auto Punctuation service for the automated interpunction of transcribed medical reports, reducing the burden on users to manually punctuate their reports.

Currently, Attendi’s ASR pipeline involves two primary stages of processing: (i) automated transcription of audio to text via the Kaldi ASR pipeline, and (ii) post-processing and formatting of the transcript for the final output. The ASR pipeline has been trained on a combination of open source and in-domain audio and medical reports in Dutch, achieving, to the best of our knowledge, the best transcription WER for the Dutch healthcare domain. Meanwhile, the post-processing module is responsible for transforming, cleaning and filtering the final transcript by applying a series of hand-crafted and general-purpose rules.

Unlike more modern end-to-end models like Whisper, Kaldi is not inherently built to add punctuation to its transcriptions. For this reason, one of the primary tasks of the post-processing module is to punctuate these texts such that the appropriate meaning is conveyed, grammatical rules are observed and the highest possible value is delivered to the healthcare professional.

In this article we will explore the underlying architecture of this punctuation system, the steps taken to adapt it to our unique use-case and the improvements this initiative has brought about.

Baseline Performance

Of the 15 or so most commonly used punctuation marks, periods (.), commas (,), question marks (?), dashes (-) and colons (:) account for the highest frequency of usage and are widely considered to carry the most influential information within a text. For this reason, Attendi leveraged the open source Full Stop Deep Punctuation Prediction pipeline to infer these five most relevant marks in Dutch.

Vandeghinste et al. [1] outlined their steps in producing this first publicly available Auto Punctuation system for spoken Dutch by leveraging the power of transformer-based encoders for token classification. The motivation for this approach is that punctuating text can be tackled as a classification problem, whereby every word is classified as being followed by one of the five considered punctuation marks, or by no punctuation at all. Given a large enough dataset of punctuated samples, the model can learn the underlying patterns of sentence structure and word context to understand how and when to punctuate each word. This approach had proven effective for other languages [2], [3], which inspired the authors to extend it to Dutch.

Under the hood, the Auto Punctuator model has a RoBERTa Transformer encoder architecture. A sliding window is applied over the tokenized inputs, with the model predicting a punctuation label or 0 (no punctuation) for each word in each window, resulting in 200 punctuation predictions per word, each produced with slightly varying context. Following this prediction, the tally of punctuation labels predicted for each word is thresholded to produce a final result at the word level.
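The voting scheme can be sketched roughly as follows. This is an illustrative simplification, not the production code: `predict_window` stands in for the RoBERTa token classifier, and the window size, stride and vote threshold shown here are arbitrary example values.

```python
from collections import Counter

def punctuate(words, predict_window, window=4, stride=1, min_votes=2):
    """Aggregate per-window label predictions into one label per word."""
    votes = [Counter() for _ in words]
    # Slide a window over the words; each window yields one label per word.
    for start in range(0, max(1, len(words) - window + 1), stride):
        chunk = words[start:start + window]
        for offset, label in enumerate(predict_window(chunk)):
            votes[start + offset][label] += 1
    result = []
    for tally in votes:
        label, count = tally.most_common(1)[0] if tally else ("0", 0)
        # Threshold: fall back to "no punctuation" below the vote cutoff.
        result.append(label if label != "0" and count >= min_votes else "0")
    return result
```

Because each word is scored under several slightly different contexts, a single spurious prediction is outvoted, which makes the final placement more stable than a single-pass prediction.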

This model was trained on three large scale punctuation-labelled Dutch datasets:

  • The SoNaR dataset containing over 500M words, covering a mixture of Dutch texts across various genres and domains.
  • Europarl v8 dataset, containing transcribed plenary sessions of the European Parliament totalling 2.3M sentences and about 53M words.
  • OpenSubtitles dataset, containing media subtitles in Dutch to capture spoken language.

The outcome of this initiative is a relatively lightweight and accurate automated punctuation model that achieves decent performance on general-purpose Dutch, serving as a good starting point for Attendi’s Auto Punctuation needs.

The use of this model represented a definite improvement from the previously handcrafted punctuation rule (placing a period at a pause), but it also raised users’ expectations of how well punctuation should be placed. After some time, the amount of customer feedback coming in regarding the quality of punctuation made us realise that the off-the-shelf Auto Punctuation model was not good enough for our purposes. While it achieves decent performance on more general-purpose Dutch text, it is not well-tailored to the specific structure and style of medical reporting:

  1. Medical reporting often involves complex terminologies and references, including medicines, diagnoses, anatomical nomenclature, measurement readings and more. These word and sentence compositions would be very rare or not present at all within the open source datasets, and so are considered to be partly out-of-distribution.
  2. Our target domain is transcripts of spoken language, which often differ in length, formality and form of expression from written text. These differences represent a shift in the underlying distribution, negatively impacting the model’s performance.
  3. Transcripts can contain transcription errors: a word that sounds similar can have a completely different meaning. These errors frequently confused the model, leading to erratic placement of punctuation.

These problems motivated the work to fine-tune and better adapt the Auto Punctuator for application on transcribed Dutch medical reports.

Improvements

Datasets

The first requirement for improving performance is gathering and processing in-domain datasets containing accurate punctuation. For this we leveraged two main sources of data:

  • In-house annotated transcripts sourced from production data:
    The annotated transcription dataset consists of around 15K production reports transcribed by Attendi and then annotated by our in-house team of medical labellers. These reports are a direct reflection of the target data the Auto Punctuation model will process, and have been punctuated following standardised annotation rules. They are prepared for training by splitting each sentence into words and pairing each word with the appropriate label in {. , : - ? 0}.
  • In-domain written report data volunteered by clients:
    This dataset consists of ~7.7M sentences extracted from user reporting on an EHR system. This dataset is valuable in capturing the type of domain-specific terminology and structure at scale, but is not fully indicative of the final reports received by Attendi.
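The word/label pairing described above can be illustrated with a simplified sketch (our own simplification; the production tokenisation and label extraction are more involved). Each punctuation mark becomes the label of the word preceding it, and all other words receive the "no punctuation" label:

```python
import re

PUNCT = {".", ",", ":", "-", "?"}

def to_word_labels(text):
    """Split punctuated text into (word, label) training pairs."""
    tokens = re.findall(r"[\w']+|[.,:?-]", text.lower())
    pairs = []
    for tok in tokens:
        if tok in PUNCT and pairs:
            pairs[-1] = (pairs[-1][0], tok)   # attach mark to previous word
        elif tok not in PUNCT:
            pairs.append((tok, "0"))          # default: nothing follows
    return pairs
```

For example, `to_word_labels("slept well, no pain.")` yields `[("slept", "0"), ("well", ","), ("no", "0"), ("pain", ".")]`, which is exactly the per-word supervision signal the token classifier trains on.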

Mapping Annotation Punctuations to Transcripts

Attendi’s annotators are tasked with labelling and enriching production samples to maintain and improve our services. This involves correcting transcribed outputs to their flawless form, fixing any transcription or formatting errors to produce results representative of what the user expects. Meanwhile, the Auto Punctuation model sits in the middle of the post-processing module (see Figure 1 above), and therefore expects uncapitalised, partially formatted text with some potential transcription errors. For this reason, the annotated samples deviate slightly from the type of data the model will see in production.

In order to better frame the learning objective and train the model on representative data, we must leverage transcribed reports which have only been partially post-processed up until the position of the Auto Punctuator — which we refer to here as the hypotheses. In order to acquire labels for this data, we map the punctuation labels from the annotated samples to the appropriate locations within the hypotheses such that the original content of the hypotheses is maintained. For example:

Hypothesis (verbatim):
“…better then last time err sixty seven beats per minute i recommended he…”

Annotation (cleaned, processed and annotated text):
“… better than last time: 67BPM. I recommended he…”

Desired training sample:
“… better then last time: err sixty seven beats per minute. i recommended he…”

Notice how we are able to leave the content of the hypothesis untouched, while injecting the punctuation labels in the correct locations.

For a large part of the dataset this task is trivial, as the two texts differ only in punctuation and capitalisation. For the remaining subset, a series of hand-crafted rules maps each punctuation label to its correct position. This step, though unglamorous, is instrumental in matching the training data to what the model sees in production, and thus in achieving top performance.
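The core of the label transfer can be sketched with a standard sequence alignment from Python's `difflib`. This is a hedged simplification: the hand-crafted production rules cover the cases that simple alignment misses, such as number spell-outs, hesitations and substituted words.

```python
import difflib

def transfer_labels(hyp_words, ann_pairs):
    """Copy punctuation labels from annotated (word, label) pairs onto
    the raw hypothesis words, wherever the two word sequences align."""
    ann_words = [w for w, _ in ann_pairs]
    labels = ["0"] * len(hyp_words)
    matcher = difflib.SequenceMatcher(a=hyp_words, b=ann_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            labels[block.a + k] = ann_pairs[block.b + k][1]
    return list(zip(hyp_words, labels))
```

For the running example above, "time" would receive the ":" label from the aligned annotation, while unmatched spans like the spelled-out heart rate keep the default "0" and are left to the rule-based pass.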

Training

Before fine-tuning the encoder on the punctuation task directly, we adapt the model to our target data distribution by running masked language model pre-training on the written medical report dataset mentioned above. Sunkara et al. [4] found that this improves performance on a similar punctuation task.

We customise the data collator for the masked language modelling task to bias 50% of all random masking to punctuations within the text. The motivation behind this is that the model will inherently learn the presence of punctuation within the structured text, better preparing it for the downstream task of punctuation inference via token classification. We use the best performing model on the masked language modelling objective for fine-tuning on the Auto Punctuation task.
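The biased mask selection can be sketched as follows. This is an illustrative stand-in for the actual data collator, which operates on tokenizer output rather than raw token strings; the masking probability shown is the conventional 15% MLM default, not necessarily our production value.

```python
import random

PUNCT = {".", ",", ":", "-", "?"}

def choose_mask_positions(tokens, mask_prob=0.15, punct_bias=0.5, rng=random):
    """Pick positions to mask, forcing ~punct_bias of them onto punctuation."""
    n_masks = max(1, round(mask_prob * len(tokens)))
    punct_pos = [i for i, t in enumerate(tokens) if t in PUNCT]
    other_pos = [i for i, t in enumerate(tokens) if t not in PUNCT]
    # Reserve a share of the mask budget for punctuation tokens.
    n_punct = min(len(punct_pos), round(punct_bias * n_masks))
    chosen = rng.sample(punct_pos, n_punct)
    chosen += rng.sample(other_pos, min(len(other_pos), n_masks - n_punct))
    return sorted(chosen)
```

Because punctuation tokens are rare relative to words, uniform masking would seldom ask the model to reconstruct them; reserving half the mask budget for them forces the model to repeatedly predict punctuation from context during pre-training.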

Evaluation

In order to validate the improvements in performance, we conduct two forms of benchmarking: (i) Model-only and (ii) End-to-end pipeline evaluation.

Model-only Evaluation

Here we simply run the validation procedure on the model with the labelled dataset, aggregating statistics at the word-level. The following is the breakdown of results showing improvements from the off-the-shelf to the fine-tuned model on the test set:

Table 1. Word-level performance improvements with fine-tuning

We observe a significant increase in performance across the board. Note that the macro average (the unweighted mean over classes) is the more informative statistic here, since the micro average (computed over all individual predictions) is dominated by the most common class, whitespace, which interests us far less than the actual punctuation marks.
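A toy example (with made-up numbers) of why the two averages diverge: with a dominant "no punctuation" class, the micro average stays high even when a rare class is missed entirely.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    hits, totals = Counter(), Counter(y_true)
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}

# 90 "no punctuation" labels, 6 periods, 4 commas; the model misses
# every period but gets everything else right.
y_true = ["0"] * 90 + ["."] * 6 + [","] * 4
y_pred = ["0"] * 96 + [","] * 4

recalls = per_class_recall(y_true, y_pred)
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.94
macro = sum(recalls.values()) / len(recalls)                       # ~0.67
```

The micro average barely registers that periods are never predicted, while the macro average drops by a full third — exactly the failure mode that matters for punctuation.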

End-to-end Evaluation

Here we compare the performance of the pipeline as a whole, measuring the impact that improved punctuation has on downstream features and the final output. This provides a measure of accuracy at the report level, where a report is one full recording that may consist of multiple sentences.

When considering the report as a whole, we see a jump from 36.4% to 69.8% in flawless reports, i.e. reports where the punctuation is exactly as expected.
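The report-level metric itself is straightforward; a sketch, assuming each report is represented as its sequence of word-level punctuation labels:

```python
def flawless_rate(predicted, annotated):
    """Fraction of reports whose full label sequence matches exactly."""
    exact = sum(p == a for p, a in zip(predicted, annotated))
    return exact / len(annotated)

# Two reports: the first matches exactly, the second differs in one label.
rate = flawless_rate(
    [["0", "."], ["0", "0"]],
    [["0", "."], [".", "0"]],
)  # 0.5
```

Exact-match at the report level is deliberately strict: a single misplaced comma disqualifies the whole report, which is why the qualitative analysis below of the remaining mismatches matters.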

A qualitative analysis of the remaining ~30% of reports, which contain some degree of mismatch with the annotation, shows that although some prediction errors persist, the vast majority of mismatches stem from subjective punctuation choices in the annotated report: the output produced by the Auto Punctuator is not necessarily wrong, it simply differs. In fact, in almost 50% of the manually validated cases, the Auto Punctuator was found to produce more favourable results than the annotated sample.

This points to an ongoing struggle when working with annotated data — while annotators can be instructed to follow rules and best practices, both human error and subjectivity in labelling lead to variations in ground-truth labels. This subjectivity is especially pronounced for the punctuation task, as indicated by the many discussions online arguing whether one should place a comma at some point in a sentence, or a period, or neither. Placement of commas is also generally quite dependent on one’s individual writing style.

This highlights the importance of continuously monitoring annotation consistency and quality. It is a strong indication that our labelling instructions should be more clearly defined; for example by giving explicit guidance on when to add a comma and when not to. It is also a reminder for us as machine learning practitioners that the quality of data is one of the most important aspects of training a good model.

In the case of this initiative, internal benchmarking together with manual analysis validated the clear improvements and green-lit the new model for deployment to production.

Production Monitoring

Following deployment to production, the performance of the model and of the post-processing pipeline as a whole was closely monitored to validate the expected improvements. One indirect consequence of improved automatic punctuation is a reduction of spoken punctuation in reports.

Figure 2. Relative frequency of spoken punctuation in transcribed reports

Figure 2 demonstrates a continuous reduction in our users’ use of spoken punctuation. Improvements in automated punctuation build trust in our system’s ability to capture the user’s intentions, reducing the need to manually dictate punctuation marks. Furthermore, we see the types of spoken punctuation shift heavily away from the five covered by the Auto Punctuator, towards more specific marks such as “bullet”, “open/close brackets” and “new line”. These are more nuanced and harder to infer automatically, but are the focus of future efforts in this space.

Final Thoughts

Post-release monitoring and customer feedback validate the improvements brought about by fine-tuning on in-domain data, with transcribed results now more legible, accurate, and in line with users’ expectations. This demonstrates the importance of leveraging in-domain data when adapting applications to their target use-case. With no changes to the underlying architecture, and with in-domain data at a fraction of the scale of the base datasets, we still achieved an end-to-end error reduction of more than 50%.

An interesting point of comparison is the reported performance of the original model on predicting punctuation in Dutch text versus its actual performance on our in-domain dataset (without fine-tuning):

Table 2. Comparison of off-the-shelf model on general Dutch texts vs our in-domain datasets

The large discrepancy between the reported and actual scores highlights that any reported metric should be interpreted carefully, taking into account context such as the difference between the source and target data distributions.

As discussed earlier, error analysis shows that a lot of potential for future model improvement lies in improving the annotation label quality and consistency at scale. Future label quality can be improved by making labelling instructions clearer and using inter-annotator agreement to discover inconsistencies between annotators’ work.

Another area for model improvement is incorporating the auditory information into the prediction. The current model operates solely on text. This disregards auditory information like pauses or intonation that can be very useful for predicting punctuation marks.

Finally, we’re also in the process of adding “new paragraph” as spoken punctuation. This will be followed by the release of an automated paragraphing solution, capable of formatting blocks of text into clear, neatly grouped paragraphs reflective of standard reporting conventions, further reducing the workload for healthcare professionals. These, together with other ongoing developments, will ensure Attendi maintains state-of-the-art performance in automated transcription of Dutch medical reports, continuing our support of the Dutch medical community.