Seeking Gender Difference in Code-Switching by Investigating Mandarin-English Child Bilingual in Singapore

As a behavior of bilingual individuals and an indispensable part of bilingual speech, code-switching has been investigated by many researchers. However, many variables influence code-switching, and each of them is a potential confounding variable. Gender is among these variables; however, whether significant gender differences exist in the code-switching of Mandarin-English child bilinguals, and what those differences are, remains unknown, as previous literature diverges on the existence of gender differences. Therefore, this paper seeks potential gender differences in the amount and distribution of code-switching through quantitative analysis of speech data in the Singapore Bilingual Corpus. The results indicate that gender differences are significant in the amount of intra-sentential (intra) code-switching. However, no significant gender difference is observed in the amount of inter-sentential (inter) code-switching or in the code-switching environment.


Introduction
Code-switching is a behavior of bilingual individuals and a sign of language contact [1][2][3][4]. As an indispensable linguistic feature of bilingual speech, code-switching has been investigated by many researchers of bilingualism. It is one of the "mirrors" that reflect bilingualism, and it is affected by many variables; research on code-switching therefore faces a large number of confounding variables. For example, age is reported to be a confounding variable that affects both the patterns of and the motivation for code-switching [5]. Both chronological age and developmental age, the latter typically measured as mean length of utterance (MLU), influence code-switching [6]. The dominant language seems to influence code-switching as well, where the dominant language is assessed by comparing the MLUs of the individual languages [7]. Language input is also highly likely to have a significant impact on code-switching rate and directionality, since code-switching constitutes part of the input to most children acquiring two languages [7,8], and the pattern of code-switching in child bilinguals in Hong Kong is reported to result possibly from socialization into the pattern of adults in the speech community [9]. The rate of children's code-switching is related to that of their parents [10]. As a factor that shapes parental input, parents' educational background can also act as a confounding variable in children's code-switching. Moreover, there are other potential confounding variables such as language policy [11,12] and birth order [13]. Finally, the register of speech may also influence code-switching [14].
Beyond these confounding variables, gender, a common confounding variable in linguistics, is reported to influence code-switching in adult bilingualism. In the study by Wong, interviews and language diaries were used to elicit natural utterances from ten females and ten males in Hong Kong. The results show that females code-switched almost twice as much as males during the interviews, with greater use of English [15]. In addition, the difference in code-switching frequency between females in two types of working environments (i.e., more competitive and less competitive) was 4.2%, larger than the 2.4% found for males. The author claims that code-switching is a symbol of education, and that females code-switched more to show their identity as 'new women,' which differs from traditional gender roles.
Meanwhile, in the Middle East, gender difference has also been studied in different linguistic features, including code-switching. In addition, research on SMS messages written by female and male bilingual university students in Pakistan indicates that females code-switch more than males because females are more self-conscious than males, and that males code-switch more about social life while females code-switch more about personal matters because of a limited social circle [16]. More recently, with gender considered as a social variable, attention has been paid to the context of utterances when investigating gender differences in code-switching. Finnis explored the code-switching behavior of British and Greek bilingual females and males in a meeting context and a dinner context. It turned out that males used more GCD (Greek-Cypriot dialect) and made more jokes than females, in line with the gender pattern in monolingual speech whereby women tend to use more standard and prestigious forms than men [17].
However, whether gender differences persist in child bilinguals' code-switching remains an open question, over which a prolonged "cold war" has taken place between two groups of researchers: those who hold that there is a significant gender difference, and those who consider the gender difference minute or even nonexistent. The term "cold war" is apt because the stances on this question are not stated explicitly as claims but are reflected implicitly in the methods adopted in experiments, where researchers either ignore gender or treat it as a confounding variable in code-switching research.
Generally speaking, many studies on code-switching pay little attention to gender as a confounding variable and ignore gender differences either directly or after a simple screening [18,19]. While these researchers deny (or, more accurately, ignore) the existence of significant gender differences, other researchers consider gender differences in code-switching vital and hence take measures to ensure that gender is properly balanced as a confounding variable. For example, girls are reported to be more prone than boys to grammatical code-switching [20]. Some studies even consider code-switching an indicator of gender [16]. Although these studies focus on adults [16], they imply that gender differences in adult code-switching are substantial and may be passed on to children's code-switching. Therefore, it is necessary for future studies on code-switching to 'look locally into gender' [21].
In addition to this implicit agreement or disagreement on the existence of gender differences in child bilinguals' code-switching, there are also implications from outside the scope of code-switching, which for the most part suggest that gender differences exist. Gender differences in linguistic features have been studied frequently since the 20th century [22][23][24][25][26][27]: the lexicon (e.g., particles, reflexives, and hedges) [23,27], syntax (e.g., tag questions, requests, and orders) [23], phonetics (e.g., glottalization) [24], and language choice [22] are linguistic aspects known to be sensitive to gender. These findings suggest that gender differences may also exist in code-switching, which is likewise a linguistic feature.
So far, it seems that there are gender differences in the code-switching of child bilinguals. However, what these gender differences are remains a mystery, since no research has been done to demonstrate gender differences in the code-switching of child bilinguals. This question demands an answer if more representative corpora are to be constructed and more accurate experiments designed. Therefore, this study attempts to address the following questions: whether there is a significant gender difference and, if there is, how it is reflected in various linguistic features such as the amount of code-switching, the proportion of code-switching, and the environment of code-switching, including the part of speech of the surrounding words and the related action.

Data
In order to seek potential gender differences, a bilingual corpus is needed, preferably a code-switching corpus in which the gender of participants is readily available and utterances involving code-switching are tagged. The corpus should also be as large as possible so that the influence of individual differences can be mitigated.
After screening the accessible corpora at CHILDES, the Singapore Bilingual Corpus was found to fulfill these demands and was thus adopted for this study. Designed for code-switching studies by Yow Wei Quin at Singapore University of Technology and Design, the corpus includes the largest number of participants among the Mandarin-English corpora available at CHILDES: 55 participants, of whom 30 are male and 25 are female [28,29]. The participants are 5 to 6 years old, with MLUs between 3 and 6 (see Figure 1). Their language input is predominantly English (55.30%) and Mandarin Chinese (41.80%), while several Chinese dialects such as Cantonese and other languages such as Japanese constitute a tiny part of the input. All of the data were collected in childcare centers in Singapore in 2013. As a bonus, confounding variables such as age and family education background were controlled when the corpus was constructed, which alleviates the burden of coping with these confounding variables in this study [28,29].
The data were collected from CHILDES and carefully examined for consistency. Two of the children, one male and one female, were excluded because their speech data were not found. Therefore, the number of participants involved is 53, with 29 males and 24 females. After this examination, the data were parsed according to the .cha format by a Python script. Utterances marked [+rou] (routinized forms), [+ prop] (proper-noun-only utterances), [+ prop-intra] (proper name only, in a different language), [+ imit] (imitation), or [+ trans] (translation) were then excluded, since these are not considered code-switching [28].
After that, code-switching utterances tagged [+intra], [+inter], [+inter-utter-switch], or [+ intraoth] were collected for further analysis. These tags represent the four categories of code-switching in the Singapore Bilingual Corpus, namely intra, inter, inter-utterance, and intraoth, whose definitions are discussed in 2.3.
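The filtering step described above can be sketched as follows. This is a minimal illustration, not the script used in the study: it assumes each utterance is available as a single line of text with its postcodes in square brackets, it assumes the child's lines start with the speaker prefix `*CHI:` (a hypothetical choice), and the function names are invented for the example. Tag spacing is normalized so that `[+intra]` and `[+ intra]` are treated alike.

```python
import re

# Tags the text lists as NOT counting as code-switching (spacing normalized).
EXCLUDED_TAGS = {"rou", "prop", "prop-intra", "imit", "trans"}
# Tags the text lists as code-switching categories.
SWITCH_TAGS = {"intra", "inter", "inter-utter-switch", "intraoth"}

# Matches bracketed postcodes such as "[+ intra]" or "[+intra]".
POSTCODE_RE = re.compile(r"\[\+\s*([A-Za-z-]+)\s*\]")


def postcodes(line):
    """Return the set of postcode names found on one transcript line."""
    return {name.lower() for name in POSTCODE_RE.findall(line)}


def switch_categories(line):
    """Return the code-switching tags on a line.

    The result is empty if the line carries an excluded tag or no
    code-switching tag at all.
    """
    tags = postcodes(line)
    if tags & EXCLUDED_TAGS:
        return set()
    return tags & SWITCH_TAGS


def collect_switches(lines, speaker="*CHI:"):
    """Collect (line, categories) pairs for one speaker's tagged utterances.

    The speaker prefix is an assumption; adjust it to the corpus at hand.
    """
    collected = []
    for line in lines:
        if not line.startswith(speaker):
            continue
        categories = switch_categories(line)
        if categories:
            collected.append((line, categories))
    return collected
```

A real .cha file also contains header lines, dependent tiers, and utterances that continue over several lines, so a full parser would need to handle those before this filter is applied.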

Variables
In the introduction, three questions on code-switching are put forward: whether a gender difference exists, how gender differences appear in the amount and distribution of code-switching, and how they appear in the code-switching environment. In order to answer these questions with acceptable accuracy, the independent, dependent, and confounding variables need to be clarified. As this is a study of gender differences, gender is the sole independent variable, while the dependent variables are features of code-switching. The dependent variables are divided into two groups: core code-switching features and the code-switching environment, which concerns the context of code-switching tokens. These are covered in 2.3 and 2.4. Meanwhile, confounding variables are not negligible. Despite the effort made to mitigate their effect, some confounding variables slip through, among which the dominant language may threaten the accuracy of the results. This issue is tackled in 2.5. Other potential confounding variables, such as birth order and language policy, are covered in the discussion.

Core Code-Switching Features
Core code-switching features are features of the code-switching itself, which include the amount of code-switching as a whole and the amount of code-switching in each category. Before these categories are discussed, however, the definition of code-switching needs to be clarified. In some studies, code-switching is distinguished from code-mixing, where code-mixing refers to the use of two languages within one sentence, while code-switching refers to the use of two languages across sentence boundaries [1][2][3][4]. Similar two-way divisions are common in linguistic studies, in which "code-mixing" is labeled intra-sentential code-switching (intra) and "code-switching" is labeled inter-sentential code-switching (inter). This study adopts the "intra-inter" system, and code-switching serves as an umbrella term for both intra code-switching and inter code-switching.
However, there is no unified method for dividing code-switching. Generally speaking, code-switching is divided into intra-sentential code-switching (intra) and inter-sentential code-switching (inter). Intra refers to code-switching within a sentence, while inter refers to code-switching involving two or more sentences. A different system is adopted in the Singapore Bilingual Corpus, where code-switching is divided into four categories: intra, inter, inter-utterance, and intraoth. In this system, intra is the use of two languages within a sentence [30]. Inter involves two consecutive sentences, where the first sentence is in one language and the second is in the other. Inter-utterance also involves two consecutive sentences, where an intra-marked sentence is adjacent to a sentence in the other language, that is, not the predominant language of the intra-marked sentence. The last type is intraoth, which refers to code-switching within a sentence involving a language other than Mandarin and English [30].
Comparing these two systems shows that the traditional intra corresponds to intra in the Singapore Bilingual Corpus, while the traditional inter is a combination of inter and inter-utterance in the Singapore Bilingual Corpus. As for intraoth, since it refers to code-switching beyond the Mandarin-English language pair, it is not considered in this study, which focuses on Mandarin-English code-switching. Although "intra-sentential, inter-sentential, and tag switching" is another possible division [31], incorporating it into this study is infeasible, since the Singapore Bilingual Corpus does not mark tag code-switching. Thus, this division is excluded from this study.
Therefore, apart from intraoth, the code-switching categories of the first two systems are adopted in this study, since a potential gender difference cannot be ruled out in any of them. Four code-switching categories are thus involved: intra and inter as defined in the Singapore Bilingual Corpus, inter-utterance, and cross-sentence, which is the equivalent of the traditional inter. Furthermore, since intraoth is excluded, "overall", which takes all code-switching as a whole, refers to the combination of inter, intra, and inter-utterance in the Singapore Bilingual Corpus. Hence, intra, inter, inter-utterance, cross-sentence, and overall constitute the five levels at which the amount of code-switching is measured.
However, how the amount of code-switching should be measured remains a question. Possible measurements are the number of code-switching-related morphemes, the percentage of code-switching-related morphemes among all morphemes, the number of code-switching utterances, and the percentage of code-switching utterances among all utterances. In this study, all four measurements are used.
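As an illustration of how the five levels and four measurements fit together, the sketch below computes them for one child from per-category counts. It is a sketch under stated assumptions rather than the study's implementation: it assumes each code-switching utterance carries exactly one of the three corpus tags (so that category counts can simply be added), and the function and key names are hypothetical. How code-switching-related morphemes are counted is not specified in the text, so the morpheme counts are taken as given inputs.

```python
LEVELS = ("intra", "inter", "inter-utterance", "cross-sentence", "overall")
MEASURES = ("n_utterances", "pct_utterances", "n_morphemes", "pct_morphemes")


def level_counts(base):
    """Expand counts for the three corpus categories into the five levels.

    cross-sentence = inter + inter-utterance (the traditional inter)
    overall        = intra + inter + inter-utterance (intraoth excluded)
    Assumes the categories do not overlap within one utterance.
    """
    counts = dict(base)
    counts["cross-sentence"] = base["inter"] + base["inter-utterance"]
    counts["overall"] = base["intra"] + counts["cross-sentence"]
    return counts


def percentage(part, whole):
    """Percentage of part in whole, or 0.0 when the whole is empty."""
    return 100.0 * part / whole if whole else 0.0


def measurements(utt_by_cat, morph_by_cat, total_utts, total_morphs):
    """Return table[level][measure] for one child."""
    utts = level_counts(utt_by_cat)
    morphs = level_counts(morph_by_cat)
    table = {}
    for level in LEVELS:
        table[level] = {
            "n_utterances": utts[level],
            "pct_utterances": percentage(utts[level], total_utts),
            "n_morphemes": morphs[level],
            "pct_morphemes": percentage(morphs[level], total_morphs),
        }
    return table
```

For example, a child with 6 intra, 3 inter, and 1 inter-utterance utterances out of 200 would have an overall count of 10 utterances, or 5% of all utterances.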
After the measurements are decided, the next step is to conduct 20 independent t-tests between male and female participants, one for each of the five levels under each of the four measurements. From this step, a 5×4 atlas containing the p-values of these 20 t-tests can be constructed, which gives a global picture of where gender differences are significant. In this study, 0.05 is taken as the upper threshold for claiming a significant difference.
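The construction of the p-value atlas can be sketched as below. The text does not say which variant of the independent t-test was used, so this example assumes the pooled-variance (Student) form; all function names are hypothetical, and the input is assumed to be one table[level][measure] dictionary per child. To keep the example free of external dependencies, the two-sided p-value is obtained by numerically integrating the t density with Simpson's rule, which is adequate for moderate values of |t|; a statistics library would normally be used instead.

```python
import math
from statistics import mean, variance

LEVELS = ("intra", "inter", "inter-utterance", "cross-sentence", "overall")
MEASURES = ("n_utterances", "pct_utterances", "n_morphemes", "pct_morphemes")


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2.0 * total * h / 3.0  # P(-|t| <= T <= |t|) by symmetry
    return max(0.0, 1.0 - central_area)


def student_t_test(a, b):
    """Pooled-variance independent t-test; returns (t, two-sided p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    if se == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(a) - mean(b)) / se
    return t, two_sided_p(t, df)


def p_value_atlas(male_tables, female_tables):
    """Return {(level, measure): p} for all 5 x 4 = 20 comparisons."""
    atlas = {}
    for level in LEVELS:
        for measure in MEASURES:
            males = [table[level][measure] for table in male_tables]
            females = [table[level][measure] for table in female_tables]
            atlas[(level, measure)] = student_t_test(males, females)[1]
    return atlas
```

Cells with p below 0.05 would then be the ones marked as significant in the atlas.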

Code-Switching Environment
Besides core code-switching features, the code-switching environment is also investigated in this study to determine whether there is a gender difference. Two major features are examined: the part of speech of words in the context and the contextual act. Quantitative analyses are performed on both, focusing on the number of related tokens and their proportion.
For the analysis of the part of speech of contextual words, "context" and "part of speech" need to be specified. In this study, context refers to the portion of an utterance carrying one of the code-switching tags that is not marked as L2, while parts of speech comprise the eight traditional categories: noun, verb, adjective, adverb, pronoun, conjunction, preposition, and interjection [32]. Both the number of words of each part of speech and the proportion they take up in the context are measured. As for the contextual act, the number and the proportion of contextual acts in code-switching utterances are calculated.
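The part-of-speech measurement can be illustrated with the short sketch below. It rests on assumptions not stated in the text: tokens are assumed to arrive already tagged as (part of speech, is-L2) pairs, the function name is hypothetical, and the proportion is computed over the context words that fall into the eight categories, which is only one possible choice of denominator.

```python
from collections import Counter

POS_ORDER = ("noun", "verb", "adjective", "adverb",
             "pronoun", "conjunction", "preposition", "interjection")


def context_pos_profile(tokens):
    """Count parts of speech among context words of code-switching utterances.

    tokens: iterable of (pos, is_l2) pairs; words marked as L2 are not
    part of the context and are skipped, as are words outside the eight
    categories. Returns (numbers, proportions), both ordered as POS_ORDER.
    """
    counts = Counter(
        pos for pos, is_l2 in tokens if not is_l2 and pos in POS_ORDER
    )
    total = sum(counts.values())
    numbers = [counts[pos] for pos in POS_ORDER]
    proportions = [n / total if total else 0.0 for n in numbers]
    return numbers, proportions
```

Each child's two lists can then be compared between genders one part of speech at a time.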
After measurements are conducted, independent t-tests are performed between male and female participants, where 0.05 still serves as the upper threshold for claiming significance.

Dealing with Confounding Variables
Before the analysis of gender difference begins, confounding variables need to be controlled or at least explained. As mentioned, age, language input, and parents' education background were already controlled when the corpus was designed, and the register of the transcripts is highly similar, since all data were collected in educational settings. However, the dominant language slips through, and this confounding variable may have a significant influence on the outcome. Should the dominant language differ significantly between male and female participants, it would be impossible to distinguish the effect of gender from that of the dominant language. Therefore, it is necessary to confirm that the two groups, male participants and female participants, are not biased with respect to the dominant language.
Information on each child's dominant language is not provided in the Singapore Bilingual Corpus. However, the difference between Mandarin MLU and English MLU may serve as an indicator of a child's dominant language: if the difference is positive (Mandarin MLU > English MLU), Mandarin is considered the dominant language; if the difference is negative (English MLU > Mandarin MLU), English is regarded as the dominant language; if the difference is zero (English MLU = Mandarin MLU), the child is considered a balanced bilingual, but this is very unlikely [7].
An independent t-test on this indicator is carried out between male and female participants. The p-value is 0.6594, indicating no significant between-group difference in the dominant language, so it is reasonably safe to proceed to the analysis of gender differences in code-switching.
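The dominance indicator described above can be sketched in a few lines. The sketch assumes that each utterance has already been assigned to one language and reduced to its morpheme count, which the text does not detail; the function names are hypothetical.

```python
def mlu(morphemes_per_utterance):
    """Mean length of utterance: average morpheme count per utterance."""
    if not morphemes_per_utterance:
        return 0.0
    return sum(morphemes_per_utterance) / len(morphemes_per_utterance)


def dominance_indicator(mandarin_utts, english_utts):
    """Mandarin MLU minus English MLU for one child."""
    return mlu(mandarin_utts) - mlu(english_utts)


def dominant_language(indicator):
    """Map the sign of the indicator to a dominance label."""
    if indicator > 0:
        return "Mandarin"
    if indicator < 0:
        return "English"
    return "balanced"
```

The per-child indicator values are what the t-test between male and female participants is run on, while the labels are only needed when participants are grouped by dominant language.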

Core Code-switching Features
After the analysis of the code-switching features, the p-value atlas described in 2.3 is obtained and shown in Figure 2, while the averages for the four measurements, namely the number of utterances, the percentage of utterances, the number of morphemes, and the percentage of morphemes, are presented in Figure 3, Figure 4, Figure 5, and Figure 6. In the p-value atlas, red marks a significant gender difference (p<0.05), while teal marks non-significance. The atlas shows significant gender differences for intra code-switching, and the result is robust regardless of the type of measurement. In this category, males consistently code-switch more than females. Therefore, it can be claimed that gender differences exist in intra code-switching.
Meanwhile, a gender difference appears in the inter-utterance category, and it is also significant when code-switching is taken as a whole with no further division. However, the results for inter-utterance and overall are not as robust as that for intra, since they vary with the measurement applied. For example, the difference between male and female children in inter-utterance is significant when code-switching is calculated from the number of utterances, but not when it is calculated from the percentage of utterances, the number of morphemes, or the percentage of morphemes. Thus, it is difficult to decide whether there is a gender difference at the inter-utterance and overall levels. Moreover, whatever measurement is taken, gender differences are never significant for inter and cross-sentence, the latter being what is marked inter in traditional definitions. Hence, inter code-switching shows no significant gender difference under either the traditional categorization or that of the Singapore Bilingual Corpus.

Code-Switching Environment
As mentioned, two facets of the code-switching environment are investigated: the part of speech of contextual words and the contextual act, both of which are discussed below.
For the part of speech of contextual words, a p-value atlas of gender differences in the part-of-speech distribution of the code-switching environment is shown in Figure 7. In it, no significant gender difference is seen for any part of speech. Therefore, there is no significant gender difference in any of the eight fundamental parts of speech taken individually. However, a gender difference may still exist when the part of speech of contextual words is taken as a whole, which demands further discussion.

Figure 7. P-value Atlas on Gender Difference on Part of Speech
Besides, the number of contextual acts of male participants does not appear to differ from that of female participants. This is confirmed by an independent t-test, which yields a p-value of 0.07891, greater than the upper threshold for claiming a significant gender difference. Therefore, there is no significant gender difference in the number of contextual acts.

Potential Preconditions: Re-investigating Confounding Variables
In the previous section, robust gender differences in code-switching are found in the amount of intra, which supports the existence of a gender difference. However, gender differences are not found in other facets, such as the amount of inter and the environment of code-switching, where the absence of a gender difference is robust across different types of measurement. The finding for intra contradicts previous reports claiming that there is no significant gender difference in code-switching. Although the contradiction may be caused by individual differences or by confounding variables that are deemed minute but are not, the current account of gender differences in code-switching is still unsatisfactory, since the previous literature reporting no gender difference becomes the exception if the gender difference holds as in this study. If the earlier reports are correct, there may be a set of conditions that govern whether the gender difference is significant, which is discussed in the following part. It is also acknowledged that the number of confounding variables in code-switching studies is substantial; whether this is the reason behind the incompatibility is discussed below as well.
In order to seek potential preconditions, the previously acknowledged confounding variables are reviewed once again. Despite the effort made to balance them and the claim from Yow, none of them is available as raw data for each individual, since these data are withheld for privacy protection. The exception is the dominant language, which can be determined by comparing Mandarin MLU and English MLU, both of which can be calculated from the speech in the corpus. Therefore, the dominant language is examined as a potential precondition.
Hence, the participants described in the methodology are re-divided into two groups according to their dominant language: 37 participants have Mandarin as the dominant language, while 16 have English. Among the Mandarin-dominant participants, there are 21 males and 16 females, while among the English-dominant participants, there are eight males and eight females. The method for analyzing core code-switching features proposed in the methodology is applied to each group, and t-tests for gender difference are performed separately for the Mandarin-dominant group and the English-dominant group.
P-value atlases are then created from the p-values of these t-tests; they are shown in Figure 8 for the Mandarin-dominant group and Figure 9 for the English-dominant group. As depicted in the atlases, it is surprising to find that all previously observed significant gender differences disappear in the English-dominant group, while the gender differences among Mandarin-dominant children are similar to those described in the results: the significant gender difference is robust in intra, while the absence of a significant gender difference is consistent for both versions of inter.
Based on these data, it is reasonable to hypothesize that the gender difference in code-switching is not always apparent. There are preconditions for a discernible gender difference, among which is the dominant language. Other previously acknowledged confounding variables may also turn out to be preconditions. Therefore, whether these confounding variables are among the preconditions for gender difference may be an interesting topic for future research.

Potential Gender Difference in Part of Speech
In the search for gender differences in the code-switching environment, it remains a question whether there are gender differences when the part of speech of contextual words is taken as a whole. Since the t-tests above compare the two groups on one feature at a time, a new method needs to be introduced to investigate this potential gender difference, and cosine similarity analysis is a possible option. In this analysis, vectors are constructed to represent sets of linguistic features in given texts, and cosine similarity is used to measure the similarity between two such sets [33].
Therefore, in this study, the percentages of the eight fundamental parts of speech are arranged as a vector for each participant in the order [noun, verb, adjective, adverb, pronoun, conjunction, preposition, interjection] (this is the exact order used in this study). The cosine similarity of two part-of-speech vectors can then be calculated, which indicates the similarity between the two participants. In general, it ranges from -1, when the two vectors point in opposite directions (for example, when the first vector is A and the second is -A), to 1, when the two vectors point in the same direction, as identical vectors do.
Hence, cosine similarity is calculated for all possible pairs of participants of different genders, and the result is depicted in Figure 10. In the chart, the cosine similarity is always greater than 0, and many pairs of participants have a cosine similarity greater than 0.8. Therefore, it is reasonable to claim that there is no gender difference either when the part of speech of contextual words is taken as a whole. Combined with the previous result, it can now be claimed that there is no significant gender difference in the part of speech of code-switching contextual words.
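The pairwise comparison described above can be sketched as follows. The function names are hypothetical, and each participant is assumed to be represented by a list of eight part-of-speech percentages in the order given in the text.

```python
import math
from itertools import product


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cross_gender_similarities(male_vectors, female_vectors):
    """Cosine similarity for every (male, female) pair of participants."""
    return [
        cosine_similarity(m, f)
        for m, f in product(male_vectors, female_vectors)
    ]
```

One point worth keeping in mind when reading the results: because percentages are never negative, the cosine similarity of two such vectors cannot fall below 0, so it is the concentration of values near 1, rather than their being above 0, that speaks to how similar the distributions are.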

Conclusion
So far, the results support the existence of gender differences in the code-switching of Mandarin-English child bilinguals. The gender difference is significant in the amount of intra code-switching, where male participants code-switch more than female participants. No significant gender difference is observed in the amount of inter or in the code-switching environment, such as the part-of-speech distribution and the action in context. Since the current finding seems incompatible with previous literature that claims no significant gender difference at all, it is hypothesized that there are preconditions for an explicit significant gender difference, and a further investigation of the amount of code-switching at the five levels is made to seek them. As a result, the dominant language, previously recognized as a confounding variable, turns out to be one of the potential preconditions for an explicit gender difference. In conclusion, the results confirm that there are gender differences in code-switching and that the difference lies predominantly in the intra type. It is thus recommended that gender be incorporated as a confounding variable in the construction of code-switching corpora and in experiment design. The mystery of gender difference in code-switching nevertheless remains, and the interesting question now is: apart from the dominant language, what are the other preconditions for explicit gender differences in code-switching?
Author Contributions: W.H.: conceptualization, methodology, software, data analysis, and writing; D.L.: data collection and writing; J.L.: data collection and writing. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.