Automatic speech recognition (ASR) is a technology that transcribes spoken words into written text.
Ubiqus uses one form of ASR, Large Vocabulary Continuous Speech Recognition (LVCSR), which is based on the automatic identification of very short audio sequences. This technology can produce a high-quality transcription, provided it is given a high-quality audio recording.
The state of the art in ASR has evolved greatly in recent years, and our R&D team is contributing to its continued progress.
There are four steps to the process:
1 | Voice Activity Detection
First, it is important to identify when speech is present in the recording, in order to cut the soundtrack into segments. The machine will then work on each of these segments.
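To make the idea concrete, here is a minimal sketch of voice activity detection based on a simple energy threshold. It is an illustration only, not the method Ubiqus uses: production systems typically rely on trained models, and the function names, frame length and threshold below are assumptions chosen for the example.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def detect_speech_segments(samples, frame_len=160, threshold=0.01):
    """Return (start_sample, end_sample) pairs for runs of loud frames.

    frame_len and threshold are illustrative values, not tuned settings.
    """
    segments = []
    start = None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = idx * frame_len          # a loud run begins
        elif energy < threshold and start is not None:
            segments.append((start, idx * frame_len))  # the run ends
            start = None
    if start is not None:                    # recording ended mid-run
        segments.append((start, len(energies) * frame_len))
    return segments

# Synthetic signal: silence, a 440 Hz tone standing in for speech, silence.
rate = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(3200)]
signal = silence + tone + silence

segments = detect_speech_segments(signal)
print(segments)
```

Each returned pair marks one segment of the soundtrack that the later steps would process.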
Next, it’s important to identify the different speakers in each recording and to group the segments according to speaker identity, solving the problem of ‘who spoke when?’. For this, the machine uses different models containing specific data (languages, voices). It is therefore able to distinguish the subtleties of a language (such as accents). Note that at this point, we are still in the “mathematical” treatment of the data.
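The grouping of segments by speaker can be sketched as a clustering problem. The example below assumes each segment has already been turned into a numeric "voice fingerprint" vector (the vectors here are made up) and groups segments greedily by cosine similarity. Real systems use learned speaker representations and more robust clustering; the function names and the 0.8 threshold are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def assign_speakers(embeddings, threshold=0.8):
    """Greedy clustering: reuse the first known speaker whose reference
    vector is similar enough, otherwise register a new speaker."""
    references = []   # one reference vector per speaker found so far
    labels = []       # speaker index assigned to each segment
    for emb in embeddings:
        for speaker_id, ref in enumerate(references):
            if cosine(emb, ref) >= threshold:
                labels.append(speaker_id)
                break
        else:
            references.append(emb)
            labels.append(len(references) - 1)
    return labels

# Made-up vectors for four consecutive segments (two alternating voices).
segment_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
    [0.0, 1.0, 0.0],
]

labels = assign_speakers(segment_embeddings)
print(labels)
```

The resulting labels answer ‘who spoke when?’ at the segment level: segments sharing a label are attributed to the same speaker.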
This is when the actual transcription starts. A list of possible speech sounds (phonemes) is established for each audio segment. At this stage, no full sentences have been generated, only one long list of possibilities, each with a score.
The computer chooses, among all the phonemes and words learned during the initial training, those that are most likely to form the most accurate sentence (a bit like how a GPS identifies the best route). It is this sentence that is transcribed into the document.
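The "best route" analogy corresponds to a best-path search. The sketch below uses a Viterbi-style search to combine two kinds of scores: how well each candidate matches the audio, and how likely each word is to follow the previous one. For readability it works on whole words rather than phonemes, and all probabilities are invented for the example; it illustrates the general technique and is not a description of the Ubiqus system.

```python
import math

def best_sentence(candidates, bigram, default=1e-4):
    """Viterbi-style search over per-position word candidates.

    candidates: list of dicts {word: acoustic probability}, one per position
    bigram: dict {(previous_word, word): probability}; unseen pairs
            receive the small `default` probability (an assumption).
    """
    # best[word] = (log score of the best path ending in word, that path)
    best = {w: (math.log(p), [w]) for w, p in candidates[0].items()}
    for position in candidates[1:]:
        new_best = {}
        for word, p in position.items():
            # Pick the predecessor that gives the highest combined score.
            score, path = max(
                (prev_score + math.log(bigram.get((prev, word), default)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[word] = (score + math.log(p), path + [word])
        best = new_best
    return max(best.values())[1]

# Invented scores: the audio alone slightly prefers "their" and "know".
candidates = [
    {"their": 0.6, "there": 0.4},
    {"is": 0.7, "as": 0.3},
    {"know": 0.55, "no": 0.45},
    {"weigh": 0.5, "way": 0.5},
]
bigram = {
    ("there", "is"): 0.5,
    ("their", "is"): 0.01,
    ("is", "no"): 0.3,
    ("is", "know"): 0.001,
    ("no", "way"): 0.4,
    ("no", "weigh"): 0.001,
}

sentence = best_sentence(candidates, bigram)
print(" ".join(sentence))
```

Choosing the top-scoring candidate at each position independently would start with "their is know"; taking the word-to-word scores into account lets the search recover "there is no way" instead.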
This process is applied to every segment of the recording to produce, in the end, the complete transcription.
At the end of this automated process, the document is reread by our teams, as we do for any other Ubiqus document. In addition to verifying the content as a whole, the proofreader also ensures that the speech has been correctly attributed to each speaker.