Social Media Platform: Measuring Readability and Socio-Economic Status

Social media has opened an unprecedented dimension for analyzing many areas of today's society. Socioeconomic status (SES) is considered a leading indicator in many fields of research, especially in the medical sciences. Forums, hashtags, and similar tools are common means of grouping conversations. Crowdsourcing, on the other hand, involves gathering intelligence to group an online user community based on common interests. This paper provides a mechanism for examining writing on social media and grouping writers by academic background. We build upon earlier work in which we analyzed online forum posts from various geographical regions in the USA and Canada and characterized the readability scores of those users. Specifically, we collected 1000 tweets from the members of the US Senate and computed the Flesch-Kincaid readability score for the Senators. Comparing the Senators' tweets to posts from average citizens, we note the following: 1) the readability score of US Senators' tweets is much higher, affirming the gap between the academic attainment of US Senators and that of the average citizen, and 2) the much larger variation in average citizens' scores compared to those of US Senators is attributed to the wide spectrum of academic attainment.


Motivation and Background
Social computing garnered significant attention after the advent of Web 2.0. The extensive use of blogs, Myspace communities, and various online forums affected the way people conducted social interactions (Parameswaran & Whinston, 2017).
Social media platforms offer a unique opportunity to conduct social science and online research. They offer users a forum to voice their views candidly, since users need not reveal their true identity. Although social media platforms have made great strides in ascertaining the real identity of their users, the internet still grants a great deal of anonymity. In addition, federal regulations bind social media companies not to reveal the true identity of users without their consent. The 2004 US presidential campaign, for example, popularized the idea of online advertising and encouraged many scholars to research its influence (Weinberg & William, 2006).
* Corresponding author. e-mail: aiilahi@effatuniversity.edu.sa.

The launch of Amazon Mechanical Turk in 2005 brought a new dimension to the area of Artificial Intelligence (Irani, 2017). The crowdsourcing platform allowed users to outsource to humans tasks that would be difficult for a computer to perform. It allows a task to be advertised to a group of users who will perform it for an incentive (money, contribution to literature, etc.). Social media platforms have the concept of crowdsourcing embedded in them, as pointed out by (Paniagua & Korzynski, 2017). For example, Twitter has been used successfully for crowdsourcing in domains such as emergencies and disaster relief (Jordan et al., 2018), as discussed further in the next section. In these scenarios, the experts depended on feedback from volunteers in the affected region, based on which agencies could come up with an appropriate real-time response. Such scenarios come under the umbrella of active crowdsourcing. Passive crowdsourcing, on the other hand, involves soliciting user action without the users consciously realizing that they are contributing. The concept of a hashtag on Twitter, where various users contribute to a particular topic, is one example of passive crowdsourcing. In this scenario, people interested in soliciting feedback can start a hashtag that helps gather valuable information.
Researchers in the social and medical sciences have begun to focus on the vast amount of available data. Although social network data are not by themselves the means by which a particular individual's problems are identified or treated, the data can be used to identify symptoms that serve as indicators of certain mental health problems (Rajput & Ahmed, 2018a). Natural Language Processing (NLP) is a cross-disciplinary field between Computer Science and Linguistics that aims to help in segregating relevant data using various segmenting techniques. The choice of the corpus is one of the main requirements for these steps. We use the definition of a corpus as "a collection of naturally occurring text, chosen to characterize a state or variety of a language" (Schvaneveldt et al., 1976). In general, constructing a corpus includes selecting text specific to the problem and deriving keywords, bigrams, and sometimes trigrams (two- or three-word sequences) that are used heavily in a given area. As an example, (Rajput & Ahmed, 2018b) argue that a corpus should be developed to assist mental health professionals in detecting depression among a given group of users. The researchers base their observations on the Twitter hashtag #depression. The study gathered the most prominent terms and found that these words are part of the language of depression patients. Once such a corpus is established, researchers can examine an arbitrary text and assess, with some level of confidence, whether the words used by the individual occur with frequencies similar to those in the corpus.

Problem Description
Experts and researchers in various fields, especially medical science and sociology, consider the socioeconomic status (SES) of society members a strong heuristic for anticipating potential pitfalls that can cause issues in people's lives. (Collins, 2016), for example, discusses SES and its comorbidity with alcoholism. The level of education is one of the main determinants of a person's SES. The education level, in turn, is correlated with the ability to write (Kellogg & Raulerson, 2007). (Geiser & Studley, 2002) argue in their findings that a student's ability to compose extended text is the single best predictor of success in finishing freshman-level coursework. (Flesch et al., 1975) defined the Flesch-Kincaid test by providing a formula that maps two proportions of a text to a grade level:
1. The ratio of total words to total sentences in a given text
2. The ratio of total syllables to total words in a given text
The grade level computed from these two ratios corresponds directly to a school grade level. A score of 12, for instance, shows that a grade 12 student can understand the given text, while a score of 14 means that the text is written at the level of a second-year university student.
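As an illustration of how the two ratios above combine, the following minimal sketch applies the commonly published Flesch-Kincaid Grade Level coefficients (0.39, 11.8, and 15.59). The function names and the vowel-group syllable heuristic are assumptions made for this example; they are not the implementation used in this study, and dedicated packages count syllables more carefully.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption of this sketch): one syllable per vowel group."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fk_grade_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts, using the published coefficients."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fk_grade(text: str) -> float:
    """Estimate the grade level of a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade_from_counts(len(words), len(sentences), syllables)


if __name__ == "__main__":
    sample = "The committee will reconvene tomorrow. We expect a thorough discussion."
    print(round(fk_grade(sample), 2))
```

Longer sentences and longer words both raise the score, which is why the measure can be computed from short texts such as tweets without any knowledge of their content.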
Given the above, our current work aims to establish a measure that will serve as a proxy for SES from publicly available data. Drawing our motivation from the concept of crowdsourcing in social media, we explore the possibility of using the grade level of a person's writing to predict that person's education level, which in turn indicates one aspect of his/her SES. The study was based on a group of people, all of whom hold at least a college degree. A Pew Research Center report (Pew, 2019) mentioned that every member of the US Senate in 2019 had attained at least a college degree, compared to only 34% of American adults aged 25 or older. Thus, our study comprised the following: 1) analyzing US Senators' impromptu writing and 2) comparing it with random online forum posts that reflect the writing of an average member of American society.
One of the biggest challenges we faced was how to segregate posts by geographical region. Based on our work in (Rajput et al., 2019), we looked at discussion forums for vehicles by GM, where users were asked to post in the section for their particular region in the US and Canada. Using NLP techniques, we scraped the forums for texts posted by users, with the goal of analyzing how users in various English-speaking regions differ in terms of readability scores.
The contributions of this work are summarized as follows:
1. Establish a baseline from US Senators' tweets (assumption: tweets are written with little or no preparation and reflect the writer's immediate thoughts).
2. Compare the Flesch-Kincaid grade level score obtained for US Senators to the grade level scores of user communities from different regions of the US and Canada.

Relationship between Reading and Socioeconomic Status (SES)
In the earlier work described in (Rajput et al., 2019), the authors established from the literature three factors contributing to the SES of a person, namely education, income, and occupation. The authors provided examples from the literature discussing the effect of all three variables on various facets of individuals' lives. In this paper, we focus on the academic aspect and specifically on the work done in the realm of readability.
Earlier work in this area focused on establishing metrics to measure the reading and writing abilities of people in various income brackets. Income, on the other hand, was easier to categorize based on a multitude of factors such as self-reported data, housing prices, etc. (Chall & Jacobs, 1983) selected children from economically challenged families in different grades and chose various metrics to measure progress in reading and writing. Specifically, they looked at how writing ability evolves from merely listing contents to storytelling. The authors showed that children moving from grade five to grade seven started lagging behind in this area. The authors continued their work in this area and published their findings in their seminal work (Chall and Jacobs, 2003). The children's SES was established by their eligibility for the free-lunch program. The authors described reading stages numbered from 0 to 5, where stage 0 refers to pre-reading and stage 5 refers to the most mature reading stage. The authors argue that students first go through a "learning to read" phase, characterized by stages 1 and 2 and typically acquired in grades 1-3, followed by a longer "reading to learn" phase, characterized by stages 3 and 4. The study focused on students in grade 4, when they transition from "learning to read" to "reading to learn". The authors noted that students from economically challenged backgrounds not only struggled in transitioning to fourth grade (reflecting their lag in the "learning to read" phase) but also continued to struggle in grades 7, 9, and 11. (Bowey, 1995) presented similar results, examining five-year-olds' preschool phonological development and first-grade reading skills in relation to paternal occupation. The author established that, even after accounting for IQ, children of lower SES (based on paternal occupation) struggled in reading.
In contrast to the aforementioned work, which focused on the English language (US and Australia), (Heppt et al., 2015) looked at a subset of German students comprising low-SES native speakers and non-native speakers. The results showed that these students struggled to acquire the basic reading skills necessary to learn and to communicate their achievements.
Lastly, the work done by (D'Angiulli et al., 2005) focused on fifth graders in British Columbia, Canada, and showed that fifth-grade children from lower income levels needed remedial work during their school years to catch up with their counterparts in reading skills.
Having established the importance of reading, researchers showed the correlation between poor reading and poor writing skills, as discussed in the next subsection. Our work in this paper builds upon the finding that lower reading and writing skills usually correspond to poor performance in school, which in turn is reflected in the occupation/income status of the individual. We focus on gleaning the reading/writing skills of an individual from social media platforms and public repositories. We argue that social media platforms provide ample evidence that can act as a proxy for an individual's SES through the readability scores of their writing.

Measuring Readability
Having established the correlation between reading skills and SES, we explore in this section the correlation of poor writing with reading and, in turn, with SES. Writing, however, can occur under different conditions: it can take the form of a manuscript, impromptu writing, or extemporaneous writing (Blankenship, 1974). The manuscript form follows a thorough process of thought, reflection, writing, and revision. The impromptu writing process does not assume any prior deliberation, while the extemporaneous form assumes a short time of reflection before the thoughts are written down (Cronn-Mills & Croucher, 2001).
Writing has received much attention, as many scholars consider it an afterthought of the thinking process (Applebee, 1984; Emig, 1977; Odell, 1980). Work done by (Howard, 1988) considers the act of writing the father of thought. Across the literature, scholars agree that the language used in writing is superior to the oral form (Devito, 1965). (Chafe and Tannen, 1987) presented a detailed review of the work done on written and oral language. (Nippold et al., 2014) divided communication into three forms, namely social, academic, and practical, and explored the use of complex syntax in conversational and narrative speaking. The results showed that people use more complex syntax and more sophisticated language when narrating a particular event than when engaging in conversation.
(Blankenship, 1974) performed a detailed study of various forms of written and oral communication, dividing communication into conversation, oral impromptu, written impromptu, oral extemporaneous, written extemporaneous, and manuscript. The author employed various metrics such as sentence and word length, cloze score, adjective/verb quotient, and preposition/token quotient, among others. The work is important to our study, as we look at both Twitter messages and floor speeches delivered by members of the US Senate. The underlying assumption of our work is that Twitter provides a mode for written impromptu/written extemporaneous communication, while the speeches delivered on the Senate floor reflect the manuscript mode.
The concept of measuring readability was introduced in the US military. The resulting tests, the Flesch-Kincaid Readability Tests, are now part of many word processing programs such as Microsoft Word (Stockmeyer, 2009). The method rates a particular collection of text on a 100-point scale where a higher score indicates better readability (Flesch, 1948). Based on the original work, the Flesch-Kincaid Grade Level Test ranks the readability of a collection of text against U.S. school grade levels. The score ranges from grades 1-12 through college, college graduate, and professional. Recently, readability tests have been incorporated in big data research (Flaounas et al., 2013). Some variants of readability tests have also been shown to be superior to Flesch-Kincaid (Si & Callan, 2001). Despite important progress, however, there remain some key problems in terms of consistency, outlined in (Mailloux et al., 1995) and (Wang et al., 2013). Because of its common use, we opted for the Flesch-Kincaid test as the basis of our research. Although various methods exist in the literature, the Flesch-Kincaid test uses the length of the sentences and of the words in the sentences to estimate the grade level of a text, which we take as an indicator of the writer's level of education.
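For readers unfamiliar with the 100-point scale mentioned above, the following sketch computes the Flesch Reading Ease score from raw counts using the commonly published coefficients (206.835, 1.015, and 84.6). The function name is an assumption of this example, and the word, sentence, and syllable counts are assumed to come from a separate tokenizer.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher values indicate text that is easier to read."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


if __name__ == "__main__":
    # Same number of words and syllables, split into fewer or more sentences.
    print(flesch_reading_ease(words=100, sentences=5, syllables=150))
    print(flesch_reading_ease(words=100, sentences=10, syllables=150))
```

Unlike the grade level score, this scale decreases as sentences and words get longer, so the two measures move in opposite directions for the same text.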
As discussed later in the paper, we establish a base case by gathering both the tweets of the members of the US Senate and their floor speeches. We chose the tweets randomly and did not consider the context, content, or length of the speeches, as this is beyond the scope of this work.

Crowdsourcing
Merriam-Webster (footnote 1) defines crowdsourcing as "the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people and especially from the online community rather than from traditional employees or suppliers". (Doan et al., 2011) first discussed the use of the World Wide Web as a medium for crowdsourcing and identified four significant challenges: recruiting contributors, gauging their abilities, combining the work performed, and avoiding abuse of the system. The paper also discussed challenges in maintaining the quality of the work performed. Crowdsourcing has become increasingly popular for gathering and combining huge amounts of data from diverse groups of people. The work done by (Wazny, 2017) looks at various topologies of crowdsourcing and at underlying issues from different perspectives, such as legal and ethical. Specifically, the work points out that such techniques and methodologies can contribute significantly to various fields, including the medical sciences. The concept of hashtags on Twitter has proven to be an effective way to gather information on a given topic. It is worth noting that the crowdsourcing concept predates social media, as researchers have worked for many years on amalgamating data from heterogeneous sources. One such approach is defined in (Adali et al., 1995). This approach also became popular in the P2P paradigm (Rajput & Rotenstreich, 2004).
The work in this paper segments users based on their geographical context and creates a corpus for these users. We start by crowdsourcing a group of users who have achieved high academic accomplishment and study the level of their impromptu writing. Furthermore, we use the terms impromptu writing and extemporaneous writing as synonyms, despite the distinction drawn by (Blankenship, 1974) as discussed above. The reason is that we have no mechanism to decide whether writing on Twitter is an example of impromptu or extemporaneous writing. Note that extemporaneous writing assumes a short amount of deliberation, while impromptu writing is a direct reflection of the instantaneous thoughts of the writer, as pointed out by (Blankenship, 1974). Once we have established a baseline, we will look at discussion forums for a particular commercial vehicle. We started exploring this in our earlier work (Rajput et al., 2019).
Footnote 1: https://www.merriam-webster.com/dictionary/crowdsourcing

SES Application in Medicine and Psychiatry
The work of (Kawachi, 1999) built on the concept of social capital and reported that individual-level factors, such as low income and low education, are strongly correlated with self-reported poor health. Social capital is defined as "an individual's personal network and elite institutional affiliations" (Belliveau et al., 1996). (Veenstra, 2000) built upon this work, looking at three elements of social capital, namely trust, commitment, and identity, and showed that both income and education were related to self-reported health data. The turn of the century saw researchers study the comorbidity of low SES with different diseases across the vast realms of medicine. Because our research can have a strong application in mental health, we briefly discuss efforts in this area only; a full discussion of the areas that can be affected by SES is beyond the scope of this paper. (Baker & Wagner, 1966) made the case that researchers and practitioners were ignoring the social aspect of patients when defining treatments and that the seeking of psychotherapy for children is inversely proportional to the social class of the patient and the caretakers. (Dohrenwend, 1990) observed that research in the 1980s established the relationship between SES and various psychiatric disorders such as schizophrenia. In other words, high poverty levels were shown to be related to high levels of psychiatric disorders. However, the author argues that it had been difficult to establish whether low SES is a cause or a consequence of psychiatric disorders. (Vitaro and Tremblay, 1999) showed that impulsivity in gambling was highly prevalent among low-SES adolescent males. (Piko & Fitzpatrick, 2001) found a correlation between SES and psychosocial health among Hungarian adults.
(Mayes & Calhoun, 2011) studied the effect of various variables, including SES, on autistic symptoms and established that the symptoms were more prevalent at lower SES. (Goldberg et al., 2011) established a relationship among the risk of hospitalization for schizophrenia, SES, and cognitive functioning. (Hanscombe et al., 2012) studied gene-environment interaction among a group of children in the UK. (Bates et al., 2016) looked further into this area and found no evidence that SES could alter the intelligence of an individual; rather, the effect is confined to the development of the individual's abilities in various disciplines. (Fernald et al., 2013) discussed that SES differences might result in clear differences in language processing and vocabulary development starting at 18 months. This is also the basis of our work, which is founded on the premise that differences in SES are directly reflected in the way a person writes. Looking at a person's writing across various media will provide a preponderance of evidence on their SES. Such information can prove invaluable in studying the behavior of people across different SES levels and, in turn, can help in detecting certain mental illnesses such as depression, as shown in (Rajput and Ahmed, 2018b).

Data Sources and Data Gathering
One of the biggest challenges when gathering data is ensuring the legality of using the data, as discussed in (Youyou et al., 2015), (Ahmed & Rajput, 2020), and (Ahmed, 2019). All the data that we gathered is available from public sources, and we use the NLP techniques and tools described in (Rajput, 2020). Specifically, we gathered the following sets of data:
1. Tweets from the official Twitter accounts of US Senators.
2. Posts from a vehicle forum for a specific manufacturer, divided by region. Such forums help the manufacturer learn from customers about the issues and problems they might face. We chose the specific vehicle because it is a likely choice for an average American. Specifically, we tied the average American income to the class of cars (footnote 2) and looked for cars in that category.
3. The posts in the aforementioned forum, gathered as a sample of impromptu writing.
Furthermore, to ensure the anonymity of the users, we masked their online identity by assigning each user a fictitious pseudonym before storing the information in a database. We used the following tools and APIs for our work:
1. Mongo Database (footnote 3)
2. Twitter API (footnote 4)
3. textstat package (footnote 5)
4. beautifulsoup package (footnote 6)
Footnote 2: https://www.financialsamurai.com/the-110th-rule-for-car-buyingeveryone-must-follow/
Footnote 3: https://www.mongodb.com/
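The identity masking step described above can be sketched as follows. This is a hypothetical illustration only: the paper does not state how pseudonyms were generated, so the keyed-hash approach, the function names, and the record layout (`user` and `text` fields) are all assumptions of this example.

```python
import hashlib
import hmac
from typing import Dict, List


def pseudonymize(username: str, secret_key: bytes) -> str:
    """Map a username to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


def mask_posts(posts: List[Dict[str, str]], secret_key: bytes) -> List[Dict[str, str]]:
    """Return copies of the posts with the real username replaced before storage."""
    return [
        {"user": pseudonymize(post["user"], secret_key), "text": post["text"]}
        for post in posts
    ]


if __name__ == "__main__":
    key = b"replace-with-a-private-key"  # hypothetical key; keep it out of the database
    raw = [
        {"user": "forum_member_a", "text": "My truck runs fine."},
        {"user": "forum_member_a", "text": "Changed the oil today."},
        {"user": "forum_member_b", "text": "Any advice on tires?"},
    ]
    for record in mask_posts(raw, key):
        print(record)
```

A deterministic mapping keeps all posts by the same user grouped under one pseudonym, which matters here because each user's posts are later treated as a single corpus.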

Preprocessing and Processing Data
We followed the process below for preprocessing and processing the data:
1. Collected the US Senators' Twitter accounts and retrieved the last 1000 tweets from each.
2. Treated each set of 1000 tweets as a corpus and calculated its Flesch-Kincaid scores.
3. Calculated the average score for each corpus and then the mean of the means.
4. Used the 'urllib' Python library to help capture metadata from the URLs.
5. Built upon the beautifulsoup package to scrape data from the forums.
6. Segregated the forums by region across the United States and Canada.
7. As with the tweets from the US Senators, treated the posts by each user as a corpus.
8. Computed the average of each corpus under each region.
9. Computed the mean of the means for each region.
10. Compared the results of the regions to those from the US Senators.
Figure 1 presents the average Flesch-Kincaid score for the tweets (impromptu writing) of the US Senators. Table 1 summarizes the average tweet scores and the standard deviation. A cursory look at these data shows that the level of impromptu writing reflects the high academic attainment of the US Senators. Compare this to the data obtained from various forums, which represent the impromptu writing of an average American across the continental United States, presented in Rajput et al.
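The per-corpus averaging and mean-of-means steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the dictionary layout (one list of texts per author), and the toy word-count scorer are hypothetical. In practice a readability function such as `flesch_kincaid_grade` from the textstat package could be passed as the `score` argument.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List, Tuple


def mean_of_means(
    corpora: Dict[str, List[str]],
    score: Callable[[str], float],
) -> Tuple[Dict[str, float], float, float]:
    """Average the scores within each corpus, then average those averages.

    Returns the per-corpus means, the mean of the means, and the sample
    standard deviation of the per-corpus means (0.0 for a single corpus).
    """
    per_corpus = {
        owner: mean(score(text) for text in texts)
        for owner, texts in corpora.items()
        if texts  # skip authors with no collected texts
    }
    values = list(per_corpus.values())
    overall = mean(values)
    spread = stdev(values) if len(values) > 1 else 0.0
    return per_corpus, overall, spread


if __name__ == "__main__":
    # Toy scorer (an assumption for this demo): number of words in the text.
    def word_count(text: str) -> float:
        return float(len(text.split()))

    demo = {
        "author_a": ["one two", "one two three four"],
        "author_b": ["one"],
    }
    print(mean_of_means(demo, word_count))
```

Averaging per author first prevents prolific posters from dominating a regional score, which is the reason for taking a mean of means rather than a single pooled mean.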

Discussion
The aforementioned results provide us a basis to compare and confirm various theories presented in the literature before the advent of Web 2.0 and the spread of social media.
To begin with, our results are consistent with the statistics reported by (Pew, 2019), according to which only 34% of American adults aged 25 or older have a Bachelor's degree. Impromptu writing by US Senators had an average score of 12, compared to no more than eight for an average American citizen. Also, note that the data for the average American was gleaned by looking at the average income in each of the regions above and tying it to the types of vehicles they drive (Rajput et al., 2019). (Note that a difference of one on the Flesch-Kincaid score is equivalent to one grade level.) Secondly, comparing the standard deviation of the Senators' tweets to that of average Americans' impromptu writing, the variation is much larger for the general population. This can possibly be explained by factors such as 1) income disparity is much smaller among US Senators than among average American citizens, and 2) about one-third of average Americans have attained the same academic level as the Senators. This group will score much higher, hence the high variation.
Thirdly, note that the corpus chosen for the average American is very specific and is not drawn from Twitter, as we wanted to focus on responses from a very specific group, namely average Americans with average income and a high probability of being native English speakers.
Lastly, one of the underlying assumptions of academic attainment is that the age of the person writing is 25 or older. While we can confirm this for the members of the US Senate, we cannot state with absolute certainty that this is the case for the corpus we harvested. However, given that the forum is for vehicle owners, we have high confidence that the average age of writers on the forum is 25 or older, as reported by CarMax (footnote 8).

Conclusion
In this paper, we proposed groundwork for predicting a proxy of the SES of different communities. First, we analyzed the US Senators' tweets and determined their average Flesch-Kincaid scores. We considered this our baseline and compared the results to different groups of online communities in the US, organized by geographic region, by analyzing their forum posts about a commercial vehicle. The results helped to note the difference in the choice of language, which in turn acts as an indicator of academic attainment, used as a proxy for the SES of the writer. Our work compares the impromptu writing of native speakers. In the future, we would like to focus on separating native speakers' writing from that of non-native speakers. Furthermore, we currently assume that Twitter reflects a person's impromptu form of writing. We plan to survey the literature to devise algorithms that can help us verify this assumption. Lastly, our work focuses solely on writing in the English language. We would like to replicate the study for another language and compare the results.
Footnote 8: https://www.carmax.com/articles/which-car-brands-have-oldestyoungest-buyers