Structuring conversations through automated speaker diarization models (part 2)

Current-generation electronic health records, and clinical documentation systems more specifically, suffer from a number of problems that make them inefficient and are associated with poor clinician satisfaction. Today, clinicians in the Netherlands spend nearly 40% of their available time working in such systems. This time comes at the direct expense of the time and attention available for patient care.

Recent progress in the fields of Automatic Speech Recognition and Natural Language Processing has allowed for the development of digital scribes: intelligent systems that automatically transform the recording of a clinical conversation into documentation, taking the administrative burden away from the doctor and helping them focus more on the patient.

At Attendi, speech processing lies at the core of our digital scribe product. Apart from understanding what is being said, we also need to know ‘who spoke when’. Speaker diarization models address this task by assigning speaker labels to dialogue segments, and their output significantly affects downstream tasks such as entity extraction and transcript summarisation. In medical conversations, where every word counts, a well-performing speaker diarization model therefore protects the system from propagating errors to the rest of the product pipeline.

Medical consultations typically include one doctor and one patient. This is not always the case, however, since more than two people can be present in spontaneous consultations. Such situations demand even more robust speaker diarization systems, because the system has to take every speaker’s voice characteristics into account in order to assign the correct labels to the speech segments.

Minimizing DER: Why is it important?

The main metric used to measure how well a speaker diarization system performs is the Diarization Error Rate (DER). It measures the fraction of time that is attributed to the wrong speaker or incorrectly labelled as ‘silence’, and equals 0 if all speech segments are classified correctly. DER is also a key metric for Attendi, as it reflects our systems’ reliability.
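To make the metric concrete, here is a minimal sketch of a frame-level DER computation. It follows the commonly used definition (missed speech plus false alarms plus speaker confusion, divided by the total reference speech time) and is not Attendi's evaluation code. The function name, the per-frame label lists and the use of None for silence are assumptions made for illustration, and the sketch assumes that the hypothesis speaker labels have already been mapped onto the reference labels.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER for two equally long label sequences.

    Each element is a speaker label, or None for a frame without speech.
    Assumes hypothesis labels are already mapped onto reference labels.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1        # speech labelled as silence
            elif hyp != ref:
                confusion += 1     # speech assigned to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # silence labelled as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


# Toy example: 10 frames, 8 of which contain speech in the reference.
reference = ["doctor"] * 4 + [None] * 2 + ["patient"] * 4
hypothesis = ["doctor"] * 3 + ["patient"] + [None] + ["patient"] * 5

print(diarization_error_rate(reference, hypothesis))  # 0.25
```

In the toy example one frame is assigned to the wrong speaker and one silent frame is labelled as speech, so 2 of the 8 reference speech frames count as errors. Because false alarms are included in the numerator, the value can exceed 1 for a very poor hypothesis.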

Our recent advancements in speaker diarization have significantly reduced our DER. Unlike our previous solutions, the current model does not depend on a ‘signature’ file to compute the speaker embeddings. Instead, we employ an x-vector architecture, which is based on a deep neural network (DNN). By feeding the input speech segments into the network, we extract DNN vector representations (embeddings), which are subsequently clustered and finally assigned a speaker role. Since the number of speakers is not predefined, the model also handles a multi-speaker setup. Apart from the system architecture, we are aware that the hardware used to record the conversation in the examination room has a considerable impact on speech-to-text performance. Our plan for the upcoming period therefore also includes examining the recording setup and how it can be leveraged to reduce the DER further.
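The clustering step can be illustrated with a small sketch. It assumes agglomerative clustering with cosine distance and average linkage, stopped by a distance threshold so that the number of speakers does not have to be known in advance. This is one common choice, not necessarily the algorithm used in our pipeline. The function names, the threshold value and the three-dimensional toy vectors are hypothetical; real x-vectors have hundreds of dimensions and come from the trained network.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering with a stopping threshold.

    Returns one cluster label per embedding. The number of clusters
    (speakers) follows from the threshold instead of being fixed up front.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are too far apart to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings for five speech segments from three different voices.
segments = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 1.0],
]
labels = cluster_embeddings(segments)
print(labels)            # [0, 0, 1, 1, 2]
print(len(set(labels)))  # 3
```

The resulting cluster indices are anonymous. Mapping them to roles such as doctor or patient is a separate step that this sketch does not cover.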

Finally, deploying our own ASR engine, combined with our in-house expertise, allows us to refine individual parts of the product pipeline. As a result, we are able not only to develop robust solutions, but also to adapt them so that the total inference time and the overall memory requirements stay consistently low.

Danai Xezonaki studied Electrical and Computer Engineering and is currently pursuing an MSc in Artificial Intelligence at the University of Amsterdam. She has also conducted research on the application of Natural Language Processing in healthcare. At Attendi, Danai is working on improving the speaker diarization models as a Machine Learning Intern.