What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure

In recent times, BERT-based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain, with a multitude of models achieving state-of-the-art results by using audio transformer models to encode speech. This begs the question of what these audio transformer models are learning. Moreover, although the standard methodology is to choose the last-layer embedding for any downstream task, is it the optimal choice? We try to answer these questions for two recent audio transformer models, Mockingjay and wav2vec2.0. We compare them on a comprehensive set of language delivery and structure features including audio, fluency, and pronunciation features. Additionally, we probe the audio models' understanding of textual surface, syntax, and semantic features and compare them to BERT. We do this over exhaustive settings for native, non-native, synthetic, read, and spontaneous speech datasets.


INTRODUCTION
Since the advent of transformers in the computational linguistics field in 2017 [59], they have received great attention for a wide variety of tasks, ranging from constituency parsing [34] to coherence modelling [49] and sentiment analysis [57]. However, until recently transformers were limited to the discrete signal domain; speech, which lives in the continuous domain, has lagged behind.
As one of the first models for transformer-based speech representation, vq-wav2vec [3] proposed a two-stage pipeline. It discretizes input speech into a quantized embedding space (similar to word tokens in NLP tasks). The embeddings are then extracted from a BERT-based transformer model. Mockingjay [39] and AudioALBERT [10] are other such transformer models, taking Mel-scale spectrogram and fbank features as input, respectively. Mel-scale spectrograms are a more compendious acoustic feature compared to linear-scale spectrograms, and fbank features are Mel filter bank coefficients that give better resolution at low frequencies and less at high frequencies, much like the human ear. Wav2vec2.0 [4] is a recent transformer-based speech representation model that converts input audio to latent space embeddings via a contrastive task.
These audio transformers have been applied to many diverse downstream speech and language processing tasks with state-of-the-art results, such as speech translation [61], speaker recognition [58], automatic scoring [26], and sentiment classification [57]. This begs the question of what these transformer models are able to learn during the pretraining phase that helps them on various evaluation tasks. Besides, as more and more applications start relying on such models, it is important to explain what these embeddings capture in order to check for potential flaws and biases, which can affect a large number of applications.
To this end, different research studies have started probing language model embeddings for particular linguistic properties of interest. Belinkov et al. [6] probed for part-of-speech understanding, [29] probed for syntax, [50] for morphology, [64] for scales and numbers, etc. However, progress in the audio domain has been very limited, with only a few works [1,7,51,52]. Most of these works treat the audio encoders as automatic speech recognition (ASR) systems. Because of this restrictive treatment, they probe a limited set of features important for ASR, such as phones, accent, and style (spontaneous vs. non-spontaneous). However, such analysis does not explain the state-of-the-art performance that audio encoders achieve on a wide variety of tasks.
Our contributions are summarized as follows: (1) We introduce 47 probing tasks to capture simple linguistic features of speech audios, and we use them to study the embeddings generated by two different audio transformers on three types of speech, uncovering intriguing properties of the encoders.
(2) We propose a detailed analysis of what is learned by the recent transformer-based semi-supervised audio encoder models, wav2vec2.0 and Mockingjay. We implement post hoc probing on the embeddings extracted from each intermediate unit of the two models. We probe these embeddings using an extensive diversity (4 high-level categories) and number of features (46 in total), each categorized by the linguistic property they probe. We extract results for all the features relevant to speech, covering both what was spoken and how it was spoken. These results help us lay out a map of which particular features are learned in each layer while also providing a metric of comparison between the two models. These features are crucial for downstream applications such as automatic scoring, readability evaluation, automatic speech generation quality, text-to-speech quality, accent detection, ASR models, etc. [5,33,37,54,63,66]. As a proof of concept, we also show the effect of our analysis on two such downstream applications (speaker identification and phone classification) ( §6).
(3) We test the models for their representative effectiveness on different types of speech settings: native-read, native-spontaneous, and non-native-read. We find that, for the most part, the native-spontaneous and non-native speech settings follow the result patterns of the native-read dataset, albeit with worse performance. In general, the type of speaker matters less than the type of speech.
(4) We identify the role of the feature extractor module in wav2vec2.0, which enables it to process raw 16 kHz input audio without any preprocessing. We find that the subsequent layers of the feature encoder encode all features into increasingly dense and informative representation vectors without any "intelligent processing" on them.
(5) We compare the performance of the representations from the audio models and BERT on text features. This is the first work to check the representative capacity of audio representations for the text captured by audio. We find that despite having no text-specific error metrics, the audio models are able to encode text well and are comparable to BERT on several parameters. We find that the dataset used to pre-train audio models has a significant effect on downstream performance.
To the best of our knowledge, this is the first attempt at interpreting audio transformer models. We conclude that the transformers are able to learn a holistic range of features, which enables them to perform with great accuracy on various downstream tasks even when trained solely on unlabeled speech.

BRIEF OVERVIEW OF THE PROBED MODELS
We probe three recent transformer based models: wav2vec2.0, Mockingjay and BERT. Below, we give a brief overview of the three models and their high-level architectures.

wav2vec2.0
wav2vec2.0 is a recent transformer-based speech encoding model. It is composed of three major components: the feature encoder, the transformer, and the quantization module. The feature encoder consists of a multi-layer convolutional network that converts the raw input audio into latent representations z_1, z_2, ..., z_T. These latent vectors are fed into the transformer to build the contextualized representations c_1, c_2, ..., c_T. Training is done by masking certain time-steps in the latent feature representation and learning a contrastive task over them. The contrastive task requires identifying the correct quantized representation corresponding to the masked latent audio representation amongst a set of distractors. The contrastive targets q_t are built by passing the output of the feature encoder through the quantizer at the corresponding time steps.
The model is pretrained on unlabeled LibriSpeech data [48] and then fine-tuned on the TIMIT dataset [22] for phoneme recognition. It achieves 1.8/3.3 WER on the clean/noisy test sets of LibriSpeech when using all of its labeled data, and 5.2/8.6 WER on the clean/noisy test sets using just ten minutes of labeled data. The authors report that even when lowering the amount of labeled data to one hour, wav2vec2.0 outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data.
A point to note is that the output of each transformer block depends on the duration of the audio file. For a moderately sized audio (∼5 seconds), the embedding obtained is huge: it is of the form 768 × T, where T depends on the duration of the audio. Hence, to probe the different features, we time-average the embeddings.
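A minimal sketch of this extraction step is given below, using the HuggingFace port of wav2vec2.0; the checkpoint name and exact shapes are illustrative, and the paper's own pipeline may differ in details.

```python
# Sketch: extract per-layer wav2vec2.0 embeddings and time-average them.
# Assumes the HuggingFace port of wav2vec2.0; the checkpoint name is illustrative.
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform, sr = librosa.load("audio.wav", sr=16000)  # raw 16 kHz input
inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: tuple of (num_blocks + 1) tensors of shape (1, T, 768),
# where T depends on the audio duration. Time-average to obtain one 768-d
# vector per transformer block, as done before probing.
layer_embeddings = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
print(len(layer_embeddings), layer_embeddings[0].shape)  # e.g. 13, torch.Size([768])
```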

Mockingjay
Mockingjay is a bidirectional transformer model which allows representation learning by jointly conditioning on past and future frames. For our experiments, we use the MelBase-libri model. The architecture comprises 12 encoder layers; each has the same output dimension of 768 and comprises sub-layers that include a feed-forward layer of size 3072 and 12 self-attention heads. We probe each of the 12 transformer blocks of both models and the feature encoder of wav2vec2.0 to check if they learn audio, fluency, suprasegmental pronunciation, and text features.
Similar to wav2vec2.0, Mockingjay also produces large embeddings of size 768 × T, with T dependent on the duration of the audio.

BERT
BERT stands for Bidirectional Encoder Representations from Transformers and proved to be a major breakthrough for NLP. The architecture comprises encoder layers stacked upon each other. BERT-Base has 12 such layers while BERT-Large has 24; we probe the uncased Base model. The input to the transformer has three parts: a classification token (CLS), the sequence of word tokens, and a sentence separator (SEP) token. The feed-forward network has 768 hidden units and there are 12 attention heads. BERT achieves strong performance on various NLP tasks. Similar to the audio models, we probe BERT by extracting embeddings from each of the 12 encoder blocks. Since text has no time component, the embeddings are of size 768 × 1.
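A corresponding sketch for BERT is given below, again with an illustrative checkpoint, showing how one 768-dimensional vector per encoder block can be obtained for a transcript.

```python
# Sketch: per-layer BERT embeddings for a transcript (checkpoint name illustrative).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("the quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# 13 hidden states (embedding layer + 12 encoder blocks), each (1, seq_len, 768).
# Mean-pool over tokens to obtain a single 768 x 1 sentence vector per layer.
sentence_vectors = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
```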

PROBING -PROBLEM DEFINITION AND SETUP
Here we specify the probing model and explain how we compare the audio and text transformer models. We also give an overview of all the features and models we probe in the paper, along with the datasets used. (For a primer on log-Mel and other audio feature extraction, refer to [11].)

Probing Model
We define the problem of probing a model M for a feature f as a regression task using a probing model P. P is a 3-layer feed-forward neural network trained on M's embeddings to predict the feature f. For instance, for text transformers, a probing model (P) might map BERT embeddings (M) to syntactic features such as parts of speech (f) [32]. Post model training, the representational capacity of the embeddings is judged based on the ease with which the 3-layer feed-forward probe network is able to learn the said feature. Metrics like accuracy and MSE loss are used for measuring and comparing representational capacities [1,6,7,32,51]. Our probe model consists of a 3-layer fully connected neural network, with the hidden layers having a ReLU activation and dropout to avoid over-fitting. We compare the representative capacity of different audio and text transformers on the basis of the loss values reported by the prober. Furthermore, we take a randomly initialized vector as a baseline to compare against all the 'intelligent' models. This approach is in line with previous works in the model interpretability domain [1,6,7,32,51]. A diagram explaining the overall process is given in the accompanying figure.
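A minimal sketch of such a probe is given below, assuming illustrative hidden sizes and hyperparameters that the text does not specify.

```python
# Sketch of the probing regressor described above: a 3-layer feed-forward
# network (ReLU + dropout in the hidden layers) trained with MSE loss.
# Hidden sizes and hyperparameters are illustrative, not the paper's exact values.
import torch
import torch.nn as nn

class Prober(nn.Module):
    def __init__(self, in_dim=768, hidden_dim=256, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),   # regress one scalar feature value
        )

    def forward(self, x):
        return self.net(x)

prober = Prober()
optimizer = torch.optim.Adam(prober.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # the loss reported when comparing layers and models
```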

Feature Overview
We test the audio transformer models on the following speech features: audio features ( §4.1), fluency features ( §4.2), and pronunciation features ( §4.3). Since spoken language can be considered as a combination of words (what was spoken), and language delivery (how it was spoken), we probe audio transformer models for both speech and text knowledge. For comparing on textual representational capacity, we extract text features from the original transcripts of all the audio datasets considered ( §5). A detailed description of all features extracted and their methodology of extraction is given in Section 4 (audio features) and Section 5 (text features).

Types of Speech Explored
Unlike text, speech varies drastically across speech types. For instance, a model developed for American (native) English speakers produces unintelligible results for Chinese (non-native) English speakers [43]. Since transformer models tend to be used across multiple speech types [17,26], it is important to assess and compare their performance and bias across each of the speech types. Therefore, we test them on native read, native spontaneous, and non-native read speech corpora. For probing on native read speech, we use the LibriSpeech dataset [48]. We take the default 'train-clean-100' set from LibriSpeech for training the probing model and the 'test-clean' set for testing it. For native spontaneous English speech, we use the Mozilla Common Voice dataset [2]. We use a subset of 2000 random audios for training and 200 audios for testing. For interpreting audio transformers on non-native speech, we use L2-Arctic dataset [67]. We take 500 audios of 4 speakers each for training the prober and 50 audios each for testing. The 4 speakers are selected in such a way that there is 1 male and 1 female speaker each with Hindi and Spanish as their first languages.

Models Probed
We probe two recent audio transformers, wav2vec2.0 and Mockingjay for their speech and language representational capacities. For text-based linguistic features particularly, we also compare them with BERT embeddings [16]. See Section 2 for an overview of the three transformer models.
Self-attention is the powerhouse that drives these transformers [59]. It is the main reason behind their state-of-the-art performance on diverse tasks. While Mockingjay is built exclusively of self-attention and feed-forward layers, wav2vec2.0 also has several CNN layers, presented as "feature extractor" layers in the original paper (Figure 1). Therefore, we also investigate the role of the feature extractor in wav2vec2.0. In particular, we investigate whether, similar to computer vision [21,24,35], the CNN layers in speech transformers learn low-level to high-level features in subsequent layers. Very few studies in the speech domain have tried to answer this question [65].
We probe the representational capacity of embeddings from all layers of the three transformer models. This helps us understand the transformer models at four levels, i.e., across models, speech types, input representations (log Mel and raw audio), and layers. This analysis gives us results on a much finer level than just comparing the word error rates of the two models. It helps us to know the linguistic strengths and weaknesses of the models and how they are structuring and extracting information from audio. We also use our interpretability results to improve the performance on some downstream tasks ( §6).

WHAT DO AUDIO TRANSFORMERS HEAR?
In this section, we probe audio ( §4.1), fluency ( §4.2), and pronunciation ( §4.3) features. These features are extracted directly from the audio waveform. Amongst them, the audio features measure the knowledge of the core features of audio including energy, jitter, shimmer and duration. Fluency features measure the smoothness, rate, and effort required in speech production [14,63]. Pronunciation features measure the intelligibility, accentedness and stress features of the audio. Tasks such as automatic scoring, readability evaluation, automatic speech generation quality, text to speech quality, accent detection, ASR models, etc. are impacted by the fluency and pronunciation features [5,33,37,38,54,63,66].
A typical embedding of the transformers at any layer is of size 768 × T, where T depends on the duration of the speech segment. We average it along the time dimension to get a 768 × 1 embedding, which serves as the representation of the speech segment for which we have extracted the features. This is then fed as input to our probing model. Figure 3 depicts the process.

Audio knowledge
We measure the following audio features: total duration, zero-crossing rate, energy entropy, spectral centroid, mean pitch, local jitter, local shimmer, and voiced-to-unvoiced ratio. Total duration is a characteristic feature of the audio length that tells us about the temporal shape of the audio. The temporal feature zero-crossing rate measures the rate at which a signal moves from a positive to a negative value or vice-versa; it is widely used as a key feature in speech recognition and music information retrieval [47,56]. Energy features are an important component that characterizes audio signals; we use energy entropy and the standard deviation of energy (std_dev energy) to evaluate the energy profile of the audio. Spectral centroid is used to characterise the spectrum by its centre of mass. To estimate the quality of speech as perceived by the ear, we measure the mean pitch. We also probe for frequency instability (localJitter), amplitude instability (localShimmer), and the voiced-to-unvoiced ratio.

Audio feature | Description | Extracted using
Total duration | Duration of the audio | Librosa [40]
Zero-crossing rate | Rate of sign changes | pyAudioAnalysis [23]
Energy entropy | Entropy of sub-frame normalized energies | pyAudioAnalysis [23]
Spectral centroid | Center of gravity of the spectrum | pyAudioAnalysis [23]
Mean pitch | Mean of the pitch of the audio | Parselmouth [8,31]
Local jitter | Avg. absolute difference between consecutive periods divided by the avg. period | Parselmouth [8,31]
Local shimmer | Avg. absolute difference between the amplitudes of consecutive periods | Parselmouth [8,31]

[Figure: stacked area charts with the layers of the model on the x-axis and, on the y-axis, the relative performance of each layer with respect to the minimum loss for each feature ((loss - min_loss)*100% / min_loss); the higher the value, the higher the loss and the lower the performance. Feature numbering by category, audio features: 1. total duration, 2. stdev energy, 3. mean pitch, 4. voiced to unvoiced ratio, 5. zero crossing rate, 6. energy entropy, 7. spectral centroid.]

Native Read Speech: The results obtained for audio features probed on wav2vec2.0 and Mockingjay on the LibriSpeech dataset (loss values in Tables 7 and 11 of the Appendix) show that the lowest loss is obtained in the initial two layers for wav2vec2.0, whereas it is the final layer for Mockingjay. These results also indicate that, unlike computer vision, there is no uniform conception of "high-level" or "low-level" in audio transformers [21,24,35]. We can see a clear ascent in the losses as we traverse the wav2vec2.0 layers from left to right, i.e., from the lower layers to the higher layers. This suggests that as we go deeper into the 12-block transformer model, the audio features are diluted by wav2vec2.0. Mockingjay, on the other hand, follows a negative slope for its losses from the first to the last layers. Hence, the audio features are best captured in the final layers of the Mockingjay model. When comparing the minimum losses across both models, the average learning of these features for wav2vec2.0 is better than that of Mockingjay by 28.59%. Even with the final-layer embedding, wav2vec2.0 performs better than Mockingjay by 24.53%. This is interesting given that the final layer of wav2vec2.0 contains the most diluted version of the learned features, while Mockingjay has its best version there. Therefore, wav2vec2.0 has richer audio representations compared to Mockingjay.
Native Spontaneous Speech: For native spontaneous speech, as shown in Figure 5 (loss values in Tables 23 and 25 of the Appendix), wav2vec2.0 is observed to perform better than Mockingjay. Wav2vec2.0, on average, performs better by 41.69% when compared across the best performing layers and by 51.12% when the end-layer losses are compared. The pattern of the best performing layer also remains the same as in the case of native read speech for Mockingjay. For wav2vec2.0, native read speech was best captured in the initial 2 layers, but for spontaneous speech, the best layers are a bit more spread out across the initial half of the transformer model. We also observe that the loss values on native spontaneous speech are higher than the ones for the native read and non-native read corpora.
Non-native Speech: When tested on L2 speakers (Figure 5; loss values in Tables 15 and 19 of the Appendix), wav2vec2.0 outperforms Mockingjay by 9.53% and 12.51% on minimum and end-layer losses, respectively. Additionally, similar to the case of native read speech, Mockingjay learns the audio features best in the final layers. As for wav2vec2.0, the layers learning the audio features are spread out, with the initial half of the model learning them more accurately than the later half.
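For illustration, a sketch of how a few of the audio features listed above could be extracted is given below. The paper uses Librosa, pyAudioAnalysis, and Parselmouth; this abbreviated version uses librosa and parselmouth only, and the Praat call arguments are common defaults rather than the paper's exact settings.

```python
# Sketch: extracting a subset of the probed audio features.
# Praat command arguments below are standard defaults, not the paper's settings.
import librosa
import numpy as np
import parselmouth
from parselmouth.praat import call

path = "audio.wav"
y, sr = librosa.load(path, sr=16000)

total_duration = librosa.get_duration(y=y, sr=sr)
zero_crossing_rate = float(np.mean(librosa.feature.zero_crossing_rate(y)))
spectral_centroid = float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)))

snd = parselmouth.Sound(path)
pitch = snd.to_pitch()
mean_pitch = call(pitch, "Get mean", 0, 0, "Hertz")

point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
local_jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
local_shimmer = call([snd, point_process], "Get shimmer (local)",
                     0, 0, 0.0001, 0.02, 1.3, 1.6)
```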

Fluency knowledge
To the best of our knowledge, we are the first to use features that measure fluency for probing. The key features of fluency are: rate of speech, pauses, and length of runs between pauses [63].
To measure the rate of speech, we measure the speech rate (number of words per second in the total response duration) (speaking_rate) and the articulation rate (number of words per second in the total articulation time, i.e., the duration that results after subtracting the time of silences and filled pauses from the total response duration) (articulation_rate) [60]. Apart from these rates, pauses in speech are the second most observable feature indicating disfluency [30]. Therefore, we measure the duration, location, and frequency of pauses as prototypical features. For this, we measure the number of filled pauses per second (filled_pause_rate) and the silence deviation (absolute difference from the mean of silence durations), which, along with the total duration of the audio, helps to indicate the length of runs between pauses [41]; this also serves as an important indicator of fluency. Other features include the total number of silences (general_silence), the mean duration of silences (mean_silence), the average silence per word (SilenceRate1), the average silence per second (SilenceRate2), and the number of long silences per word (longpfreq). A sketch of how such pause-based features can be approximated is given below.
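The sketch below uses a simple energy-threshold silence detector as an approximation; the paper's exact pause-detection method may differ, and the 30 dB threshold and file names are illustrative.

```python
# Sketch: pause- and rate-based fluency features from an energy-based silence
# detector. The threshold and file names are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("audio.wav", sr=16000)
total_duration = librosa.get_duration(y=y, sr=sr)

# Non-silent intervals (in samples); gaps between them are treated as silences.
speech_intervals = librosa.effects.split(y, top_db=30)
silences = [(next_start - prev_end) / sr
            for prev_end, next_start in zip(speech_intervals[:-1, 1],
                                            speech_intervals[1:, 0])]

general_silence = len(silences)
mean_silence = float(np.mean(silences)) if silences else 0.0
silence_rate2 = general_silence / total_duration            # silences per second

num_words = len(open("transcript.txt").read().split())      # from the transcript
speaking_rate = num_words / total_duration                   # words per second
articulation_time = total_duration - sum(silences)
articulation_rate = num_words / articulation_time            # words per second of speech
silence_rate1 = general_silence / num_words                   # silences per word
```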
Furthermore, conversational fillers are a major source of disfluency. Sounds like uh, um, okay, you know, etc. are used by speakers to bring naturalness and fluency to their speech, and the extent of fillers is an important feature to check for speech fluency. We use the average number of syllables in a word (average_syllables_in_word), the number of words with more than 2 syllables (wordsyll2), and the repetition frequency (repetition_freq) to measure this.
Native Read Speech: For fluency-based features on native read speech, similar to the audio features, wav2vec2.0 performs better than Mockingjay (Figures 4 (a1) and (b1); loss values in Tables 8 and 12 of the Appendix). While the fluency features are not layer specific but are spread across the model for Mockingjay, they tend to show the best performance in the middle layers of wav2vec2.0. With the final-layer embeddings of both models, wav2vec2.0 performs better than Mockingjay by 12.23%. The performance gap increases roughly fourfold, to 42.37%, when compared on the minimum losses (among all observed for the intermediate layers) learnt by both models.
Non-native Speech: For the L2 Arctic dataset (loss values in Tables 16 and 20 of the Appendix), the learning of fluency features is concentrated in the middle layers for wav2vec2.0.
[Figure 5: Performance of each audio feature (on the y-axis) relative to the performance of random embeddings on the three speech types (native read, native spontaneous, and non-native speech). The x-axis represents the MSE loss values relative to the random-embedding loss (loss*100/l2_random_loss).]

Fluency feature | Description | Extracted using
Filled pause rate | Number of filled pauses (uh, um) per second | [8,31]

Pronunciation Features
Similar to the fluency features, we are the first to probe pronunciation features in speech. The intelligibility, perceived comprehensibility, and accentedness of speech are impacted by phonemic errors [15]. Segmental pronunciation is judged based on the amount of listener effort, with lower being better. Hence, we probe the models for segmental pronunciation features. We also study the presence of stress with the characteristic features of stress syllable distance mean (stressDistanceMean) and stress distance syllable mean (stressDistanceSyllMean).
Native Read Speech: Figures 4(a2) and (b2) (loss values in Tables 9 and 13 of the Appendix) show the results for probing pronunciation features on wav2vec2.0 and Mockingjay with the LibriSpeech data. These features are learnt best by the last layers in Mockingjay. Wav2vec2.0 learns these features the most in the 6th to 8th of its 12 layers. Mockingjay performs better for pronunciation-based features than wav2vec2.0 by 30.4% in the final-layer embeddings. Comparing the minimum-loss layers of both models, the difference is 16.19% in favor of Mockingjay.
Non-native Speech: Mockingjay follows the same pattern for the L2 Arctic dataset as for the LibriSpeech dataset: it learns these features better in the last layers. For wav2vec2.0, however, the layers learning each of these pronunciation features are more spread out across the initial layers of the second half of the model. Wav2vec2.0 outperforms Mockingjay, but the differences here are reduced to 8.9% in the end layer and 2.20% in the best performing layer. This pattern follows the non-native speech performance of wav2vec2.0 and Mockingjay seen with the audio and fluency features. Here too, the performance difference between wav2vec2.0 and Mockingjay widens when compared to the native speech scenario.

Feature Extractor Module of wav2vec2.0
As shown in Figure 1, wav2vec2.0 has 7 convolutional layers before the transformer encoder block; the authors call this the "feature extractor" of wav2vec2.0. While the computer vision community has shown that subsequent layers of a CNN architecture look for higher-level features, in the speech community this question has largely been left unaddressed [21,25]. We find that there is a uniform increase in performance across the subsequent CNN layers for all feature types (audio, fluency, and pronunciation), and no difference between features with respect to "high-level" or "low-level". Figure 8 shows this behavior for the audio features (which are expected to be best learnt by the feature extractor of an audio transformer). The CNN layers faithfully extract all the features and show minimum loss at the seventh layer or the post-projection layers.
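A sketch of how the per-layer CNN outputs can be captured for probing is shown below, using forward hooks on the HuggingFace port; the attribute path model.feature_extractor.conv_layers is specific to that implementation and may differ elsewhere.

```python
# Sketch: capture the output of each convolutional layer in wav2vec2.0's
# feature extractor with forward hooks (HuggingFace port assumed).
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

conv_outputs = []

def save_output(module, inputs, output):
    # output has shape (batch, channels, frames); time-average for probing
    conv_outputs.append(output.mean(dim=-1).squeeze(0).detach())

hooks = [layer.register_forward_hook(save_output)
         for layer in model.feature_extractor.conv_layers]

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
with torch.no_grad():
    model(waveform)

for hook in hooks:
    hook.remove()

print(len(conv_outputs))  # one time-averaged vector per convolutional layer
```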

CAN AUDIO MODELS READ TOO?
Speech combines the text and audio parts of language. Conventionally, the audio community (which also deals with speech) has been more involved with signal sciences, while the NLP community has dealt with the text part of speech while ignoring audio. This approach is suboptimal. However, due to the impressive performance of self-supervised transformers in every domain, there is newfound interest in learning task-independent representations and, concurrently, in understanding how these representations work. Therefore, we probe whether the self-supervised audio transformers, on account of their self-supervision tasks, have also accumulated some of the knowledge present in the text. With this motivation, we probe the audio transformer representations for surface ( §5.1), semantic ( §5.2), and syntax ( §5.3) knowledge. For reference, we compare them with BERT-based text-only embeddings, and we use random embeddings as a baseline. We do the experiments for four speech types (native read, native spontaneous, non-native, and synthetic speech).
While the surface features measure the non-linguistic surface knowledge of the encoders, syntax features measure syntax-based linguistic properties. Conneau et al. [12] include features such as sentence length and word content among surface features and syntax tree depth among syntax features. The other category of features we measure is semantic features, in which we include the number of objects and subjects [12].
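For illustration, a sketch of how a few such text features could be computed from a transcript with spaCy is given below; note that tree depth in [12,32] is defined on constituency parses, so the dependency-tree depth here is only an approximation of that feature.

```python
# Sketch: a few of the probed text features computed from a transcript with spaCy.
# Dependency-tree depth approximates the constituency-tree depth used in [12, 32].
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("the quick brown fox jumps over the lazy dog")

unique_word_count = len({tok.text.lower() for tok in doc if tok.is_alpha})

def depth(token):
    # recursive depth of the dependency subtree rooted at `token`
    return 1 + max((depth(child) for child in token.children), default=0)

tree_depth = max(depth(sent.root) for sent in doc.sents)

num_subjects = sum(tok.dep_ in ("nsubj", "nsubjpass") for tok in doc)
num_objects = sum(tok.dep_ in ("dobj", "obj") for tok in doc)
num_nouns = sum(tok.pos_ == "NOUN" for tok in doc)
num_verbs = sum(tok.pos_ == "VERB" for tok in doc)
```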

Surface Level Features
Surface-level features measure the surface properties of sentences. No linguistic knowledge is required for these features; they can be measured by just looking at the tokens [12]. We include the following features: unique word count and average word complexity (Word Complexity), since the lexical diversity of spoken speech is an important metric to evaluate its quality [55].
Native Read Speech: When compared on LibriSpeech (loss values in Tables ?? and 14 of the Appendix), surface-based features are learnt better by Mockingjay than by wav2vec2.0. These features are learnt best in the intermediate layers of wav2vec2.0 and the initial layers of Mockingjay. From the results, we observe that the text understanding of both models becomes increasingly diffused as we go towards the later layers. However, wav2vec2.0 outperforms Mockingjay by 3.01% in the final layer. A contributing factor to these observations is that Mockingjay learns the surface features in the initial layers, while wav2vec2.0 learns them best in the middle layers.
Non-native Speech: For the L2 Arctic data, wav2vec2.0 again learns the surface features best in the middle layers, but for Mockingjay no particular pattern is observed. The difference widens to 38.41% on the end layers and 18.96% on the minimum-loss layer in favour of wav2vec2.0.
Native Spontaneous Speech: Mockingjay learns best in the initial layers, as in the case of native read speech, while wav2vec2.0 performs best in the lower-middle (7-11) layers. The difference increases to 141.42% for native spontaneous speech on the final layer and 132.44% on the best performing layer.

Semantic Level Features
The relationship between the words spoken and our comprehension of that spoken content falls into the domain of semantics. To produce meaning, a sentence almost always needs a subject and a direct object that the subject addresses. The number of subjects, the number of direct objects, and the total numbers of nouns, pronouns, adverbs, adjectives, verbs, conjunctions, and determiners are hence in our set of features to evaluate the spoken content [12,32]. We also probe for tense (past or present); this is framed as a classification task, unlike the rest, which are regression tasks, so the results for tense are reported separately.
Native Read Speech: wav2vec2.0 performs better in this setting by 4.173% and 5.29% on the minimum-loss layer. Like the surface features, the pattern followed by the layers in learning is the same for semantic features: Mockingjay learns them best in the initial layers, while wav2vec2.0 learns them best in the intermediate layers. For tense too, wav2vec2.0 performs best, with 75.04% accuracy in the seventh layer, whereas Mockingjay reaches 56.99% in the last layer.
Non-native Speech: The same pattern as for the surface features in the non-native setting is followed by both transformers: Mockingjay does not follow a clear pattern, but wav2vec2.0 performs best in the middle layers. While wav2vec2.0 outperforms Mockingjay by 7.36% on the minimum layer loss for L2 speech, the margin decreases to 3.26% on the end layer. The accuracy for tense is 57.95% for wav2vec2.0 and 52.27% for Mockingjay, on the 5th and 9th layers respectively.
Native Spontaneous Speech: Mockingjay does not concentrate its learning in any particular layer, but wav2vec2.0 performs best in the second half of the transformer layers. wav2vec2.0 performs better by 9.83% for native spontaneous speech on the best performing layer and by 8.06% on the final layer. Again for tense, the accuracy is 65.79% for wav2vec2.0 and 57.89% for Mockingjay.

Syntax Level Features
Syntax is a key component of the grammatical structure of a sentence, which in turn is a key component of communicative competence [9]. We use the depth of the syntax tree constructed from the sentences spoken in each sound clip as a feature to evaluate the syntax content [12,32,36].
Native Read Speech: In this setting, Mockingjay performs better than wav2vec2.0, by 38.64% on the best performing layer and by 21.5% on the final layer. The final layer captures this feature best for wav2vec2.0 and the initial layers for Mockingjay, which explains the decrease in the percentage difference for the final layer.
Non-native Speech: wav2vec2.0 performs better, by 15.89% on the minimum layer loss and by 30.92% on the final layer. wav2vec2.0 learns best in the eighth layer and Mockingjay in the fourth layer.
Native Spontaneous Speech:

Feature Extractor Module of wav2vec2.0
The pattern observed in the feature extractor module for these surface-level features is the same as that of the audio features, with minimum losses seen in the post-projection layer. However, the value of the minimum loss in this layer is lower than that of the transformer module in wav2vec2.0. This gives some intuition for the better performance of Mockingjay, since wav2vec2.0's transformer is unable to capture, or unlearns, the vocabulary features presented to it.

Comparison with BERT
When we compare the performance of the audio transformer models with BERT (Table 5), we do so in two settings: synthetic speech generated from Wikipedia text and non-native read speech. For the first part, we convert 2000 random sentences from Wikipedia articles to speech using Google's text-to-speech API [19]. We made sure that the constructed audios had lengths similar to those of LibriSpeech. The audios obtained were then passed through both speech transformer models and their layers were probed. On this synthetic dataset, for the semantic features, BERT outperforms both models by more than 10% when compared on the minimum loss across all layers. However, by the end layers, both models learn the features well and the performance difference between BERT and the audio transformer models reduces greatly (2.29% and 0.49% difference for semantic features, 3.83% and 7.68% for syntax, and 7.58% and 21.90% for surface features). These results are encouraging, since they mean that the embeddings of the audio transformers capture not only audio, fluency, and pronunciation features, but also textual features to a large extent.
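A minimal sketch of how such a synthetic set could be built follows; the paper uses Google's text-to-speech API [19], while the gTTS wrapper and file handling here are only illustrative.

```python
# Sketch: building a synthetic (TTS) probing set. gTTS is an illustrative
# wrapper; resampling to 16 kHz assumes an mp3-capable backend is available.
import librosa
import soundfile as sf
from gtts import gTTS

sentences = ["The quick brown fox jumps over the lazy dog."]  # e.g. Wikipedia sentences

for i, text in enumerate(sentences):
    gTTS(text=text, lang="en").save(f"tts_{i}.mp3")
    # Convert to 16 kHz mono wav so it can be fed to the audio transformers.
    y, sr = librosa.load(f"tts_{i}.mp3", sr=16000)
    sf.write(f"tts_{i}.wav", y, sr)
```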
Next, we use the L2-Arctic dataset. Table 5 presents the results for all the experiments. Here the results differ the most from the previous ones: for the semantic, syntax, and surface features, BERT outperforms both models by more than 15%. Compared with the Wikipedia TTS and native read speech results, this implies that the audio models capture text features for native speakers in 'cleaner' settings but are not able to do so in less controlled environments. Therefore, in a general setting, BERT text embeddings combined with audio embeddings can capture all the speech features adequately.

EFFECT ON DOWNSTREAM TASKS
We want to evaluate our finding that different layers of the models capture different features and see its impact on downstream tasks. To this end, we perform two representative tasks: speaker recognition on VoxCeleb [46] (which primarily uses audio features) and phone classification on LibriSpeech (which uses pronunciation features). For speaker recognition, we randomly pick 10 speakers with 50 audios each in the train set and 10 in the test set. For phone classification, we use the libri-clean-100 and libri-clean-test splits. We build a 4-layer linear classifier with dimensions 768, 512, 256, and 10, trained with the Adam optimizer and a learning rate of 0.01. The hidden layers have ReLU activations, and the third layer also has dropout. We perform the tasks using the best performing layer, the final layer, and a weighted average of all layer embeddings of the transformer models as input.
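A sketch of this classifier is given below, with details the text does not state (such as the dropout rate) filled in as assumptions.

```python
# Sketch of the downstream classifier described above: linear layers of widths
# 768 -> 512 -> 256 -> 10, ReLU on hidden layers, dropout on the third layer,
# Adam with lr = 0.01. The dropout rate is an assumption.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(768, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Dropout(0.3),                 # dropout on the third layer (rate assumed)
    nn.Linear(256, 10),              # 10 speakers / phone classes in this setup
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
```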
Results for both the tasks are given in Table 6. The results are consistent with those found for audio ( §4.1) and pronunciation features ( §4.3).

OTHER RELATED WORK
We have already covered closely related work on attribution in Sections 1 and 2. Here, we mention other related work.
Audio Probing: In the domain of speech processing, probes have been carried out on feature vectors, neural networks such as RNNs and DNNs, end-to-end ASR systems, and audio-visual models. In [52], probing x-vectors, which are trained solely to predict the speaker label, revealed that they also contain incidental information about the transcription, channel, or meta-information about the utterance. Probing Music Information Retrieval (MIR) predictions through Local Interpretable Model-Agnostic Explanations (LIME) using audioLIME [28] helped interpret MIR for the first time.
[44] analyses a DNN for phoneme recognition, both at the single-node and population levels. Further research on interpreting the role of the non-linear activations of the nodes of a sigmoid DNN built for phoneme recognition is done in [45]. Research has also addressed why LSTMs work well as a sequence model for statistical parametric speech synthesis [62]. Several other studies have been conducted to interpret the correlation between audio and image structures for audio-visual tasks [1,18,27]. Even for deep ASR models, efforts have been made to comprehend the hidden and learned representations [7,20]. However, probing of representation-learning audio transformers is as yet unexplored.
Text Probing: The field of natural language processing has seen numerous efforts towards understanding the inner workings of large-scale transformers, especially BERT [13,32,53]. Jawahar et al. [32] probe each of the different layers of BERT to find which layers best learn phrase-level information, linguistic information, and long-distance dependencies. The study concluded that the initial layers capture phrase-level information, the middle layers learn syntactic features, the higher layers learn semantic features, and the deeper layers are needed for long-distance dependencies.

CONCLUSION
Speech transformer models, while still new, have shown state-of-the-art performance on various downstream tasks. We probe two such models, wav2vec2.0 and Mockingjay, to understand what they learn. We probe the models on a wide range of features including audio, fluency, suprasegmental pronunciation, and text-based characteristics. For each category of features, we identify a learning pattern over each model and its layers. We find that wav2vec2.0 outperforms Mockingjay on audio and fluency features but underperforms on pronunciation features. Furthermore, we compare BERT with the audio models on text features and find that the audio models surprisingly outperform BERT in the cleaner, controlled settings of native speech, but are not able to do so in uncontrolled environments such as spontaneous and non-native speech.
