Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Improving N-Best Rescoring in Under-Resourced Code-Switched Speech Recognition Using Pretraining and Data Augmentation

Version 1 : Received: 30 April 2022 / Approved: 6 May 2022 / Online: 6 May 2022 (09:09:31 CEST)

How to cite: Jansen van Vüren, J.M.; Niesler, T. Improving N-Best Rescoring in Under-Resourced Code-Switched Speech Recognition Using Pretraining and Data Augmentation. Preprints 2022, 2022050066 (doi: 10.20944/preprints202205.0066.v1). Jansen van Vüren, J.M.; Niesler, T. Improving N-Best Rescoring in Under-Resourced Code-Switched Speech Recognition Using Pretraining and Data Augmentation. Preprints 2022, 2022050066 (doi: 10.20944/preprints202205.0066.v1).

Abstract

We present improvements in n-best rescoring of code-switched speech achieved by n-gram augmentation as well as optimised pretraining of long short-term memory (LSTM) language models with larger corpora of out-of-domain monolingual text. In addition, we consider the application of large pretrained transformer-based architectures. Our experimental evaluation is performed on an under-resourced corpus of code-switched speech comprising four bilingual code-switched sub-corpora, each containing a Bantu language (isiZulu, isiXhosa, Sesotho, or Setswana) and English. We find in our experiments that, by combining n-gram augmentation with the optimised pretraining strategy, speech recognition errors are reduced for each individual bilingual pair by 3.51% absolute on average over the four corpora. Importantly, we find that even speech recognition at language boundaries improves by 1.14% even though the additional data is monolingual. Utilising the augmented n-grams for lattice generation, we then contrast these improvements with those achieved after fine-tuning pretrained transformer-based models such as distilled GPT-2 and M-BERT. We find that, even though these language models have not been trained on any of our target languages, they can improve speech recognition performance even in zero-shot settings. After fine-tuning on in-domain data, these large architectures offer further improvements, achieving a 4.45% absolute decrease in overall speech recognition errors and a 3.52% improvement over language boundaries. Finally, a combination of the optimised LSTM and fine-tuned BERT models achieves a further gain of 0.47% absolute on average for three of the four language pairs compared to M-BERT. We conclude that the careful optimisation of the pretraining strategy used for neural network language models can offer worthwhile improvements in speech recognition accuracy even at language switches, and that much larger state-of-the-art architectures such as GPT-2 and M-BERT promise even further gains.

Keywords

code-switching; automatic speech recognition; low resource languages; language modelling

Subject

MATHEMATICS & COMPUTER SCIENCE, Artificial Intelligence & Robotics

Comments (0)

We encourage comments and feedback from a broad range of readers. See criteria for comments and our diversity statement.

Leave a public comment
Send a private comment to the author(s)
Views 0
Downloads 0
Comments 0
Metrics 0


×
Alerts
Notify me about updates to this article or when a peer-reviewed version is published.
We use cookies on our website to ensure you get the best experience.
Read more about our cookies here.