Natural language processing of helpline chat data before and during the pandemic revealed significant decrease in self-image appreciation and changes in other traits

During the last two years, the COVID-19 pandemic has affected the world population in several ways. A substantial worldwide increase in mental health problems is one consequence of this pandemic. In this work we study the effect of the pandemic on the mental health of a population of teenagers and young people, combining natural language processing, machine learning algorithms and expert knowledge. The data analysed were obtained from a chat helpline called Safe time, run by the It Get’s Better Foundation in Chile. The data consist of 10,986 conversations gathered from 2018 to 2020 between volunteers from the foundation and users of the platform. We compared the conversations before and during the pandemic in terms of their thematic content. Our analysis found a significant decrease in self-image appreciation during the pandemic, a significant decrease in the quality of personal relationships during the pandemic, and a significant increase in performance appreciation.


Introduction
Mental health may become the next pandemic [1]. Recent studies show that the global prevalence of depression has gone up from 9.6 percent to 28 percent and that of anxiety from 12.9 percent to about 26 percent [2,3]. During the COVID-19 crisis, about 16.4 percent of the global population has shown suicidal thoughts, and over 50 percent of the population has shown symptoms of loneliness, stress and low levels of wellbeing [2]. These studies also point to the fact that structural inequality and poverty are highly related to the prevalence of mental health and psycho-social problems, as well as to countries' abilities to respond to and assess them [3,2]. In a post-pandemic world, governments will have to deal with the mental health consequences in a context of continuing distress produced by the likely economic recession [4]. Moreover, this scenario has shown the need to rethink and drastically improve public health services for the future [5,6].
Assessing the mental health impact of the COVID-19 pandemic is a challenging endeavour as it requires information and data gathered before and during the pandemic [7]. Most mental health population studies depend on large-scale self-reports [8,9,10,11]. Conducting such large-scale population studies can be costly [12,13]. This has led to a paradigm shift in many fields of research. For instance, human-mobility studies used to rely on active solicitation of data through travel surveys and self-reports but have since embraced inferences based on computational analysis of passive data generated by cell phone users [14]. Analyzing direct behavior can also lead to more precise interpretations. For instance, researchers have found that liberals tend to self-report less happiness than conservatives but display more of it in their actual behavior [15].
Lately, computer-based tools, such as Natural Language Processing (NLP) and Machine Learning (ML), have increasingly been adopted to study mental health [16]. Using large amounts of text from either patient records, emergency room data or even social media, researchers have been able to extract symptoms, classify the severity and identify psycho-pathological clues [16]. NLP has even been used to design chat-bots for complementary mental health treatment [17]. Combining linguistics and computer science, researchers have tested automated markers for mental health, such as excessive self-focus shown by first-person pronouns and negative emotions using word dictionaries [18]. Recent studies have sought to use these new computational approaches for population studies in mental health using non-clinical data [19]. They often rely on social media data and the Linguistic Inquiry and Word Count (LIWC) dictionary [20,21]. Other studies also use machine learning/deep learning approaches to inductively assess mental health symptoms in social media forums and communities [12,22].
Dictionary approaches for characterizing mental health problems involve generating groups of words that are hypothesized to relate to specific psychological constructs and then scanning texts for the frequency of those words [23]. In this sense, dictionary approaches presuppose the existence of a data ontology or taxonomy that connects terms in conceptually meaningful ways [23]. For instance, the LIWC proposes a series of words that are hypothesized to relate to particular emotions, cognitive processes and social relations and does not need a model to do inference. On the other hand, deep learning approaches involve supervised training of algorithms using neural networks to estimate the model for classification. Nonetheless, there are some serious limitations of the current uses of both of these approaches that ought to be considered.
'Off-the-Shelf' dictionaries [24] such as the LIWC provide stable and rich markers for psychological constructs and have now been translated into multiple languages. However, these sorts of dictionaries are context-blind [25] in the sense that they do not account for changes in the meaning of words depending on the whole context of the phrase and its use in different 'language games' [26]. General use dictionaries, such as the LIWC, have a top-down bias, as they operate with pre-defined ontologies that are assumed to be stable across domains and discourses, which can lead to significant inaccuracies [24]. Most of the population-level studies using dictionaries utilize rough sentiment analysis to measure mood valence and emotion shifts over time [19]. This is because current markers of the LIWC can only serve as features and complementary data in more specific mental health studies. Despite the value of its emotion, cognitive process and social relation markers, the LIWC lacks more specific mental health constructs, such as symptoms or psychopathology markers. For this reason, computational studies still tend to rely on the application of self-reports to assess mental health symptoms and other domain-specific constructs [21].
Machine learning and, particularly, deep learning studies have shown great accuracy in predicting the mental health status of people using their social media data [27,28,29]. However, deep learning models are perceived as "black boxes" in which inputs are computed and conclusions are reached without much explanation of their inner workings [27]. This lack of transparency is critical when trying to convince mental health experts to embrace the possibilities and conclusions of machine learning models [27]. Moreover, understanding why an algorithm is making certain 'decisions' is important for the goal of learning about that phenomenon. There have been discussions about incorporating Explainable Artificial Intelligence (XAI) techniques for making sense of algorithmic decision-making in health science, but this is still a pending challenge [30]. Some studies include mental health experts, but mostly for labeling data and not for co-constructing the conceptual underpinnings used for interpretations of the data [31].
Social media has proven to be an effective source of big data for mental health analysis [31]. However, the informality of social media data and its public availability raises questions about its quality and its ability to protect the privacy and anonymity of participants [16,12]. This makes it a sub-optimal data source alternative in comparison to clinical interviews and notes or other forms of clinical data in which practitioners are able to exercise content regulation [31]. Although presumably of higher quality, these records would likely be difficult to acquire in the necessary volume because of institutional restrictions in the public sector or the lack of a centralized source of data in the private sector.
As an alternative, mental health helplines are a growing and global phenomenon. In the UK alone there are over 2,500 helplines in operation [32]. In Europe, the International Federation of Telephone Emergency Services (IFOTES) estimates that four million telephone conversations are held every year [33]. In these helplines (chat-based, telephone-based and mixed), conversations are held between participants and paid workers or volunteers. Because of this, conversations are better guided, in the sense that they deliberately cover important pieces of information somewhat consistently across users, which is necessary for better data interpretations [31]. In sum, these data sources seem to provide higher quality data than social media, with rich free text and higher accessibility than medical files and interviews.
In this work we study helpline chat data from the Safe time program that belongs to the It Get's Better Foundation in Chile in order to assess the effect of the pandemic on the mental health of teenagers. We first selected and interviewed seven volunteers from the Foundation. From these interviews, several patterns were identified in terms of the strategies implemented by the volunteers in the conversations and the thematic contents of the conversations. Once we had identified the categories in which each conversation can be classified, we selected six expert volunteers from the foundation plus two professional psychologists from our team to manually label a set of 1000 conversations to train our models. We assessed whether the thematic contents of the conversations changed during the pandemic and whether the strategies implemented by the volunteers also changed during the pandemic. Further, we identified features associated with depressive, anxious or suicidal symptomatologies.

Data ontology construction using qualitative research
We explored the linguistic patterns, beliefs and perceptions of expert volunteers of a mental health assistance NGO in order to construct a contextually-driven data ontology that would serve as the basis for identifying lexical markers associated with mental health problems, as well as to inductively find research questions to drive quantitative testing.
By interviewing seven expert volunteers we were able to identify key aspects on which each conversation could be classified. The resulting thematic model (data ontology) contains four dimensions (See Figure 1). First, the theme of "Gravity" describes perceptions about the seriousness of each case. It is perceived as the presence of suicidal behaviors and as the lack of personal resources (which can either be social networks, hobbies or interests and/or access to professional care).
Second, "Thematic Family" describes overarching topics (semantic context) of each conversation. We identified six main thematic families: Self-image, Relational, Performance, Violence, Emotion management and Sexual diversity.
Third, "Introduction patterns" describes linguistic patterns related to how users start conversations. We identified four main patterns. Users may start by timidly greeting volunteers and waiting to be acknowledged before presenting their reason for accessing the channel (acknowledgment solicitation). Users may try to fully expose their mental health issues in the first message or as early as possible (problem presentation). Users may also start a conversation by conveying their imminent desire to self-harm or commit suicide (imminent risk). Finally, users may start a conversation by greeting but then leave or refuse to answer volunteers (conversation declined).
Fourth, the theme of "Intervention" describes the pragmatic responses that mental health volunteers use to address different user scenarios. We identified seven main intervention options frequently used by volunteers.
Several research questions were identified, drawing from the participants' expert knowledge and the researcher's analytical memos. These questions are the following:
1. How does the relative frequency of thematic families change during the pandemic?
2. How do thematic families relate to mental health symptomatologies, and did these relations change during the pandemic?
3. How are different interventions reflected in the mental health symptomatologies, and did these relations change during the pandemic?
To be able to answer the questions above, a set of 1000 conversations was selected randomly and classified by an expert volunteer in the four dimensions mentioned in the previous section: Gravity, thematic families, introduction patterns and interventions. This set was then used as a training set to estimate all the classifiers developed in the study. In what follows, the analyses used to answer these questions are described and the findings are presented. Table 1 shows the list of terms associated with each of the thematic families. The accuracy and precision metrics were estimated by comparing the dictionary-based classification with the expert's tagging on the 1000-conversation database. Dictionary-based classification performance varies among different thematic families. The performance increases when the semantic families are less prone to contextual changes, i.e., if the addition of new concepts contributes to saturating the family.
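The comparison between dictionary-based labels and expert tags can be illustrated with a minimal sketch. It assumes one binary label per conversation for a single thematic family; the function name and the toy labels are hypothetical and are not taken from the study data.

```python
def accuracy_and_precision(predicted, expert):
    """Compare binary dictionary labels with expert tags for one thematic family."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(predicted, expert))
    true_pos = sum(1 for p, e in pairs if p and e)
    false_pos = sum(1 for p, e in pairs if p and not e)
    correct = sum(1 for p, e in pairs if bool(p) == bool(e))
    accuracy = correct / len(pairs)
    # Precision is undefined when the dictionary never fires; report None then.
    fired = true_pos + false_pos
    precision = true_pos / fired if fired else None
    return accuracy, precision


# Hypothetical toy labels (1 = thematic family present).
dictionary_labels = [1, 1, 0, 0, 1, 0, 1, 0]
expert_labels = [1, 0, 0, 0, 1, 1, 1, 0]
acc, prec = accuracy_and_precision(dictionary_labels, expert_labels)
print(f"accuracy={acc:.2f}, precision={prec:.2f}")
```

Accuracy counts every agreement with the expert, while precision only looks at the conversations that the dictionary labeled as positive, so a family with a noisy word list can score well on the former and poorly on the latter.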

Effect of the pandemic on the thematic content of the conversations
Using the dictionary-based classification on the full dataset, we estimated the monthly prevalence of each thematic family from 2018 to 2020. The differences in prevalence before and during the pandemic were tested with a Welch two-sample t-test between both periods. A significant difference at the 0.01 level was found in Self-image (t = -6.1341, p-value = 2.272e-06), Relational (t = -6.3912, p-value = 5.924e-07) and Performance (t = 3.6293, p-value = 0.001828). A positive value of t means a decrease of theme prevalence during the pandemic, and a negative value means an increase. Figure 2 shows the evolution of these thematic families from 2018 to 2020. This period covers ten months of the pandemic in Chile, from March 2020 to December 2020.
Self-image: The change between the pre-pandemic and pandemic periods may respond to greater exposure to body and appearance issues. Constantly seeing one's own face on virtual platforms resembles an on-screen "mirror" that allows users to inspect their appearance while interacting [34]. The "Zoom effect" caused by videoconferencing systems raises body image concerns, which were associated with self-focused attention and with increased concern about appearance and how to change it due to time spent on video calls [35]. Looking at oneself during video chatting is associated with self-objectification and appearance comparison, and these relate to face and body satisfaction [34]. Also, exposure to weight-stigmatizing content on social media increased during the pandemic among adolescents [36]. Using the hide self-view and "touch up" features of videoconferencing systems can increase bodily discomfort by presenting an idealized image on the screen [35,34]. Furthermore, daily routine disruptions, increased snacking and the lack of outdoor activities may raise weight and shape concerns [37]. In addition, the pandemic and social distancing may have diminished social support and adaptive coping strategies, which may increase discomfort with the body [37].
Relational: Family and social relations have suffered due to confinement, which could explain the change in this topic. An increase in tensions between LGBTI people and adolescents and their families, and also between young adults and their friends and partners, has been reported as a result of the pandemic and confinement [38].
Performance: the decrease in academic performance conversations may reflect the inconsistent empirical research on the impact on academic demands [39,40,41,42]. On the other hand, domestic monetary concerns may have been relaxed due to state money transfers and various withdrawals from pension funds.
Violence: As some studies reported an increase in domestic violence due to the confinement [43,44,45], we would expect a higher presence of this thematic family in the pandemic. However, our results show no difference pre-post pandemic. This suggests that the violence exerted on this segment of the population finds causes in contexts other than confinement. Also, mentions of violent situations are generally not complaints but rather arise as part of the narrative of other problems.

Association of the type of intervention/thematic families with symptomatologies
Logistic regression models were fitted to the three symptomatologies: suicidal, anxious and depressive. The different intervention strategies correspond to dichotomous predictive variables for these models. Table 2 shows the results of the logistic regression on each symptomatology (suicidal, depressive, and anxious), using the thematic families and interventions as predictors. These models were trained with the tagged dataset of 1000 conversations. For suicidal behavior, the main positively associated themes are Emotion management and Self-image. While the former describes the crisis itself, the latter can be seen as a crisis motive. In other words, a self-image problem may generate, along with other factors, emotional management issues. The prevalence of both topics could respond to a "cause-effect" or "description-explanation" scheme. Regarding the strategies, Emotional containment shows the highest positive association with suicidal behavior. This strategy corresponds to the immediate action in such a crisis since the main objective is to stop the suicidal act. The Identification of personal resources also shows a positive and significant effect, with a slightly lower magnitude. This strategy may point to the crisis management itself (as in finding someone who can help in the crisis, for example by driving the user to the hospital) but to the non-immediate causes as well. In this sense, personal resources can also correspond to activities and personal relations that contribute to personal welfare in daily life.
In the case of depressive symptoms, several themes have a positive and significant effect. In order of importance, they are Emotion management, Performance, Self-image and Violence. This could indicate that depressive tendencies are aggravated by multiple causes of a personal or relational nature. Again, Emotion management seems to describe the symptom itself. Regarding the strategies, the Identification of personal resources appears as the most important predictor, followed by Validation of personal experience. In this case, the identification of personal resources probably refers to the search for activities, interests and personal relations that help the user to better face the depressive episode. Validation of personal experience also plays a relevant role because greater validation is associated with decreased negative affect [46].
Suicidal and depressive symptomatologies show a positive and significant association. In this case, the ϕ correlation coefficient equals 0.29. This value can be tested for statistical significance with a χ² test, and the resulting p-value is lower than 0.001. However, even though suicidal and depressive symptomatologies show a significant association, the strategy scheme in each case is slightly different. The element that separates them is the preponderance of containment in suicidal behavior and validation as a relevant element in depressive behavior.
Finally, there is only one theme with a significant and positive effect on anxious symptomatology, and it is Emotion management. This suggests that people with this symptomatology usually use the platform in a crisis context, probably as a last resort. Regarding strategies, Validation of personal experience is the most important predictor, followed by Identification of personal resources. Psycho-education also appears, but with low significance, although its magnitude is similar to that of Identification of personal resources. Since the Psycho-education and Inducing reflection strategies have a ϕ correlation coefficient of 0.39 (the highest association among symptomatologies, themes and strategies), we ran the model without the Inducing reflection strategy, in which case Psycho-education becomes significant at the 5 percent level, and takes second place after Validation of personal experience.
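The relation between the ϕ coefficient and the χ² test for two binary tags can be sketched as follows. This is an illustration under stated assumptions, not the study's computation: the counts in the 2x2 table are hypothetical, the function name is ours, and the χ² statistic is the Pearson statistic without continuity correction, for which χ² = nϕ² holds.

```python
import math


def phi_and_chi2(a, b, c, d):
    """Phi coefficient and chi-square test (1 degree of freedom) for a 2x2 table.

    a: both tags present, b: first tag only, c: second tag only, d: neither.
    """
    n = a + b + c + d
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denom
    chi2 = n * phi ** 2  # Pearson chi-square without continuity correction
    # A chi-square variable with 1 df is a squared standard normal, so its
    # upper tail probability can be written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return phi, chi2, p_value


# Hypothetical counts for 1000 conversations; not the study's contingency table.
phi, chi2, p_value = phi_and_chi2(a=150, b=150, c=100, d=600)
print(f"phi={phi:.3f}, chi2={chi2:.1f}, p<0.001: {p_value < 0.001}")
```

Because χ² grows with the sample size for a fixed ϕ, a moderate association can still be highly significant when about 1000 conversations are tagged.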

Discussion
This article set out to show the efficacy of alternative means to conduct population studies on mental health. Considering the high cost of survey-based approaches, the black-boxing produced by unsupervised machine learning, and the de-contextualization produced by off-the-shelf dictionaries, our manuscript argued for the use of interdisciplinary approaches driven by qualitative understanding.
We employed qualitative research methods to produce a context-driven data ontology to identify analytical categories. These categories allowed us to organize the corpus, identify markers and concepts related to those categories, and formulate research questions that use those categories in meaningful ways. For instance, we were able to map the topics of conversation (thematic families) that operate as the source of distress beyond the type of distress (depressive, anxious or suicidal symptoms). These included relational, self-image, sexual diversity, emotional management, and violence themes. We also identified the strategies used by volunteers to address the concerns of users. These ranged from helping users explore their problems to providing psychological education. Future studies may use these sorts of categories in combination with other frequently assessed variables in text-based psychological assessment, such as personality traits [47,48]. Incorporating more variables with strong empirical support is especially critical if an automatic analysis is used at the individual level and not only for large cohorts.
We operationalized our qualitative findings using both expert human taggers and automatic computational methods. This enabled us to describe how topics of conversations changed during the COVID-19 crisis, how mental health symptomatologies relate to conversation themes and the lexical markers associated with each major symptomatology.
We observed how the thematic composition of conversations changed before and during the COVID-19 crisis. Relational issues and self-image issues were more dominant during COVID-19. The first insight possibly reflects the increase of interpersonal interactions at home and the decrease of in-person interactions with friends. The interpretation of the increase of self-image issues is more complex but perhaps points to the manifestation of personal life through social media, which raises self-worth questions, particularly in adolescents. The decrease of performance issues relates to the inconsistent findings on the effects of COVID-19 in educational settings and the provision of economic aid by the local government. Overall, these findings show that although the dictionaries are contextually bounded, they enable discussions with previous evidence on a global scale.
Gathering mental health conversations from chats means that the nature of the data is dialogical instead of monological. That means that the text's characteristics are modified both by users and volunteers. Controlling for the volunteers' interventions, we assessed how thematic families are linked to mental health symptoms. In all three cases (depressive, anxious, and suicidal), emotional management was a major topic of conversation. For depressive and suicidal symptoms, self-image themes were also relevant. Performance and violence themes were also significantly related to depressive symptoms. These findings point out the need to contextualize population mental health issues within their semantic content, in other words, to understand what people are hurting about rather than only quantifying their illness. For instance, depressive symptoms are associated with more themes, possibly reflecting that the person is more likely to be affected by different sources of life's challenges. On the other hand, anxiety symptoms are related primarily to emotional management, indicating that the person is focusing on the immediate discomfort and consequences of the mental health crisis.
Identifying features associated with mental health symptoms allowed us to understand what factors were significantly associated with the classification of mental health symptomatologies. For Anxious and Depressive symptoms, we observed that participants' self-report was highly consistent with our classifiers. Beyond explicit self-reporting, we observed that in Anxious and Suicidal symptomatologies, crisis-related terms were among the most relevant predictors. In this sense, we conclude that Anxious and Suicidal symptoms operate with greater emphasis on the immediate moment (the "right now"). In the case of Anxious symptoms, this is consistent with its only significant correlation being with emotional management themes.

Data Source
This study uses the data of a Chile-based NGO helpline that provides free-access mental health support through chat. We sought to test our interdisciplinary method with the 45944 available conversations of the NGO and through a qualitative inquiry with its volunteers. Using this method, mental health research questions were identified and tested using NLP. Overall, our sample consisted of 45944 text entries, of which 10986 were included in this article after filtering out conversations with fewer than 10 messages. Of these, 2335 were produced in 2018, 4974 in 2019, and 3701 in 2020. Participants can freely decide to share personal information when reaching out to the helpline. During these three years, 4643 participants revealed their age. Among these participants, the average age was 18.89 years (standard deviation of 5.17 years). The average number of words per conversation in our final database was 742. The total number of distinct volunteers registered in the NGO was 210 for all years. (Ethical review committee: Universidad Adolfo Ibañez; Approval number: N°02b/2021).

Interviews
In order to identify the key dimensions that structure mental health conversations within our sample, we conducted a qualitative inquiry. For this qualitative study, we conducted semi-structured in-depth interviews [49] with expert volunteers of the NGO. In this sense, we utilized purposive sampling through a critical case approach [50]. Supervisor recommendations alongside user evaluations were used to determine the expertise of volunteers. The inclusion criteria were set as belonging to the top quartile in the user assessment and being recommended by a supervisor.
The topics covered by the in-depth interview were the following:
• Representations of "a typical case"
• Perceived typology of users and cases
Seven in-depth interviews were conducted, each lasting approximately one and a half hours. The resulting audio files were professionally transcribed and analyzed using thematic analysis [51]. An investigator triangulation approach was used [52] in which a subset of researchers held critical discussions about coding and the thematic structure of the text. After the fifth interview, qualitative saturation [53] was achieved using the criteria of code saturation [54] as no new themes emerged from the data. This reflects the highly consistent set of practices that make up the expert knowledge of these volunteers. The last interviews were used to add nuances to existing schemes and to confirm saturation. Analytic memos [55] were iteratively written throughout the qualitative research process in order to raise research questions about mental health associated behaviors based on participants' expert knowledge. Finally, a thematic model was developed that served as the data ontology of these conversations (See Figure 1). The final thematic model was triangulated [52] with direct analysis of a random sample of 20 conversations. The selection criteria for this random sample were having more than 10 messages and balancing conversations that the NGO tagged as containing depressive, anxious and/or suicidal symptomatology.

Labeling of conversations
An expert tagging process was employed in order to construct and validate data analysis tools, such as dictionaries and automatic classifiers. Two mental health experts of the research team independently tagged a random sample of 200 conversations (with more than 10 messages) based on the data ontology produced in Study 1. This process was used to validate and improve the new categories. Additionally, they tagged the conversations for the three main symptomatologies used in the NGO, namely, depressive, anxious, and suicidal. These symptoms were used as the basis for comparison as they carry the largest interpretative load. On the 200 conversations, they achieved agreement of 85% for depressive, 88% for anxious and 89% for suicidal symptoms.
NGO volunteers were invited to tag a greater volume of conversations. They were selected based on a test in which they tagged 10 random cases already tagged in agreement by our two experts. Accepted volunteers were required to correctly tag at least 80% of conversations according to the displayed symptomatology. Overall, 12 volunteers took the test and 6 passed. The conversations tagged by the 6 volunteers plus additional conversations tagged by our experts added up to 1000 assessed conversations.
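As a minimal sketch of the agreement figures and the volunteer selection rule described above, the following code computes raw percent agreement between two sets of tags and applies an 80% threshold. The function names and the example tags are hypothetical, and the sketch assumes one binary tag per conversation for a single symptomatology.

```python
def percent_agreement(tags_a, tags_b):
    """Share of conversations on which two taggers assigned the same label."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("tag lists must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(tags_a, tags_b) if x == y)
    return matches / len(tags_a)


def passes_qualification(volunteer_tags, reference_tags, threshold=0.80):
    """True if the volunteer matches the reference on at least `threshold` of cases."""
    return percent_agreement(volunteer_tags, reference_tags) >= threshold


# Hypothetical tags for 10 test conversations (1 = symptomatology present).
reference = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
volunteer = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
agreement = percent_agreement(volunteer, reference)
qualified = passes_qualification(volunteer, reference)
print(agreement, qualified)
```

Raw agreement does not correct for agreement expected by chance, which matters when a symptomatology is rare or very common in the sample.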

Change of thematic families during COVID pandemic
To assess how the relative frequency of thematic families changed during the COVID-19 pandemic, we built a dictionary of thematic families. In this context, a dictionary is a structure of words and categories, where each word is associated with one or more categories. In our dictionary, each category corresponds to one thematic family. The set of words belonging to each category was derived as follows:
• Using the sample of 1000 conversations tagged by the experts, we trained a Random Forest classifier for each thematic family. We then selected the 40 most predictive features in each case and generated a preliminary list of words.
• Then, if any word in a conversation is contained in the dictionary, the conversation is labeled according to the corresponding category.
• We evaluate the dictionary's performance by comparing its classification with that of the experts. In this phase, some words were added or removed in order to improve the dictionary's performance.
• The lists of words were refined by our psychology experts.
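The labeling rule in the steps above (a conversation receives a category if any of its words belongs to that category) can be sketched as follows. The mini-dictionary, its English terms and the function names are hypothetical placeholders; the study's actual word lists are those reported in Table 1.

```python
import re

# Hypothetical mini-dictionary: placeholder terms, not the study's word lists.
THEME_DICTIONARY = {
    "self_image": {"ugly", "weight", "mirror"},
    "relational": {"family", "friends", "partner"},
    "performance": {"exam", "grades", "money"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def label_conversation(text, dictionary=THEME_DICTIONARY):
    """Return the thematic families that have at least one term in the text."""
    tokens = set(tokenize(text))
    return {theme for theme, terms in dictionary.items() if tokens & terms}


conversation = "I hate looking in the mirror and my family does not understand"
labels = label_conversation(conversation)
print(sorted(labels))
```

Because a single matching word is enough, a conversation can belong to several thematic families at once, and the quality of the labels depends entirely on how specific each word list is.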
Once the dictionary is ready, we count how many conversations are labeled in each thematic family by month and year and normalize by the total number of conversations. Then, we compare the relative presence of the thematic family before and after the pandemic arrives in Chile (March 2020), by comparing the prevalence between both periods with a Welch Two Sample t-test.
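A minimal sketch of this comparison is given below, using only the Python standard library. It normalizes monthly counts into prevalences and computes Welch's t statistic with the Welch-Satterthwaite degrees of freedom. The counts are hypothetical, the function names are ours, and the sign convention (before minus during) is chosen to match the reading of t used in the results. The sketch stops at the statistic; obtaining the p-value requires the Student t distribution, available for example through `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
from statistics import mean, variance


def monthly_prevalence(labeled_counts, total_counts):
    """Share of conversations labeled with a theme in each month."""
    return [labeled / total for labeled, total in zip(labeled_counts, total_counts)]


def welch_t(before, during):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    The difference is mean(before) - mean(during), so a positive value
    indicates lower prevalence during the pandemic.
    """
    n1, n2 = len(before), len(during)
    se1, se2 = variance(before) / n1, variance(during) / n2  # squared standard errors
    t_stat = (mean(before) - mean(during)) / math.sqrt(se1 + se2)
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, dof


# Hypothetical monthly counts; not the study's data.
before = monthly_prevalence([10, 12, 11, 9, 13, 10], [100] * 6)
during = monthly_prevalence([36, 40, 34, 42], [200] * 4)
t_stat, dof = welch_t(before, during)
print(f"t={t_stat:.2f}, df={dof:.1f}")
```

Welch's version of the test does not assume equal variances in the two periods, which is convenient here because the pre-pandemic and pandemic periods cover different numbers of months.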

Forms of interventions, thematic families and symptomatologies
To assess how the thematic families and interventions are related to different mental health issues, we set up a logistic regression model for each symptomatology: suicidal, depressive, and anxious. We used the 1000-conversation dataset tagged by the experts to train these models. The independent variables are the presence/absence of the 6 thematic families and 7 forms of interventions identified in each conversation by the experts. Thus, all regressors are binary variables, and the reference category is 0 (absence). Several classifiers were trained and compared, using the scikit-learn library in Python: Logistic Regression, Random Forest, Support Vector Machine, and Gaussian Naive Bayes. For each model and classifier, the main hyper-parameters were tuned.
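To illustrate how a coefficient for a binary regressor with reference category 0 can be read, the sketch below fits a logistic regression with a single 0/1 predictor. In that special case the model is saturated and the maximum-likelihood estimates have a closed form, so no iterative fitting is needed. This is a simplification and not the study's model, which includes 13 binary regressors and would be fitted iteratively (for example with scikit-learn's `LogisticRegression`). The counts and function names are hypothetical.

```python
import math


def fit_single_binary_logit(n11, n10, n01, n00):
    """Maximum-likelihood logistic regression with one binary regressor.

    n11: predictor present, symptom present
    n10: predictor present, symptom absent
    n01: predictor absent, symptom present
    n00: predictor absent, symptom absent
    The intercept is the log-odds of the symptom in the reference group
    (predictor absent) and the slope is the log odds ratio.
    """
    intercept = math.log(n01 / n00)
    slope = math.log((n11 / n10) / (n01 / n00))
    return intercept, slope


def predict_probability(intercept, slope, x):
    """Predicted probability of the symptom for x = 0 (absent) or x = 1 (present)."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))


# Hypothetical counts for 1000 tagged conversations; not the study's data.
intercept, slope = fit_single_binary_logit(n11=60, n10=40, n01=90, n00=810)
odds_ratio = math.exp(slope)
print(f"odds ratio={odds_ratio:.1f}")
print(f"P(symptom | present)={predict_probability(intercept, slope, 1):.2f}")
print(f"P(symptom | absent)={predict_probability(intercept, slope, 0):.2f}")
```

A positive slope means that the odds of the symptomatology are higher when the theme or intervention is present than when it is absent. In a model with several regressors, the same reading applies with the other regressors held fixed.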