Submitted:
06 November 2023
Posted:
07 November 2023
You are already at the latest version
Abstract
Keywords:
Introduction
Speech-2-Text Models
| Company Name | API Name | Release Year | Pricing | Supported Languages |
Accuracy/WER |
|---|---|---|---|---|---|
| Assembly AI | AssemblyAI S2T | 2020 | Paid | 9 | 16.8 (WER) |
| AWS | AWS Transcribe | 2017 | Paid | 75 | 18.42% (WER) |
| Wit.ai | Wit Speech | 2015 | Paid | 132 | — |
| Microsoft | Microsoft Azure Speech | 2018 | Paid | 147 | 9.2% (WER) |
| Vosk | VoskAPI | 2020 | Free | 20 | 63.4% Acc. |
| Whisper | OpenAI | 2022 | Free | 99 | 11.4 % / 4.2 % (best model-EN) |
| IBM | IBM Watson S2T | 2017 | Paid | 13 | 11.3%-36.4% |
| Google S2T | 2017 | Paid | 119 | 16% (WER) |
OpenAI Whisper
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
System Overview
Implementation
Sound Conversion Algorithm:
- Utilize the pydub library's AudioSegment.from_wav() function. The initial audio input is converted into the WAV format using this function, which creates an audio_segment object. Notably, the io.BytesIO() function is employed to handle and process the audio data produced from the input object, which facilitates this transformation.
- Use the audio_segment object's native export() method. With this technique, the audio is preserved by having it stored as an MP3 file" The output format is wisely specified by setting the format argument to "mp3," which is a directive.
- Allow for the modification of the audio variable, requiring that it be updated to reflect the location of the just created MP3 audio file ("check.mp3").
- The method finally completes its work by supplying an enhanced audio variable that includes the path to the freshly converted MP3 audio file. The transformed audio content can then be accessed quickly thanks to this route.
Location Extraction Algorithm:
![]() | |
![]() | |
Accuracy
Algorithm for Word Error Rate (WER)
| Algorithm 1: Calculation of WER with Levenshtein distance function | |
| 1 | function WER(Reference r, Hypothesis h) |
| 2 | int [|r| + 1 |h| + 1] D *Initialization |
| 3 | for (i = 0; i <= |r|; i++) do |
| 4 | for (j = 0; j <= |h|; j++) do |
| 5 | if i == 0 then |
| 6 | D[0][j] ← j |
| 7 | else if j == 0 then |
| 8 | D[i][0] ← i |
| 9 | end if |
| 10 | end for |
| 11 | end for |
| 12 | for (i = 1; i <= |r|; i++) *Calculation Part |
| 13 | for (j = 1; j <= |h|; j++) do |
| 14 | if r[i - 1] == h[j - 1] then |
| 15 | D[i][j] ← D[i - 1][j - 1] |
| 16 | else |
| 17 | sub ← D[i - 1][j - 1] + 1 |
| 18 | ins ← D[i][j - 1] + 1 |
| 19 | del ← D[i - 1][j] + 1 |
| 20 | D[i][j] ← min(sub, ins, del) |
| 21 | end if |
| 22 | end for |
| 23 | end for |


Interface Illustration



Conclusion
Key Code Snippets





Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

