Machine Learning Algorithm’s Measurement and Analytical Visualization of User’s Reviews for Google Play Store

It is evident that Android apps are in use almost everywhere in the world. Half of the population of this planet uses messaging, social media, gaming, and browser apps. The Google Play Store, an online marketplace, provides free and paid access to users, who are encouraged to download countless applications belonging to predefined categories. In this research paper, we have scraped thousands of user reviews and app ratings: the reviews of 148 apps from 14 categories, for a total of 506259 reviews from the Google Play Store. We subsequently checked the semantics of the users' reviews of these applications to determine whether each review is positive, negative, or neutral. We have evaluated the results using different machine learning algorithms: Naïve Bayes, Random Forest, and Logistic Regression. We have calculated Term Frequency (TF) and Inverse Document Frequency (IDF) features, measured accuracy, precision, recall, and F1 for each, and compared the statistical results of these algorithms. We have visualized these statistical results in the form of bar charts. In this paper, each algorithm is analyzed one by one, and the results are compared. We found that Logistic Regression is the best algorithm for review analysis across the Google Play Store categories studied. Logistic Regression achieves the best accuracy, precision, recall, and F1 on this dataset both after preprocessing and on the data as collected.


Introduction
An essential task of natural language processing is the classification of text strings or documents into categories according to their content. Text classification has a variety of applications, including detection of user sentiment in comments or tweets and classification of an email as spam. Presently, text classification has gained vital importance in organizing online information [1]. User reviews and the mobile app ecosystem hold an abundance of information regarding user expectations and experience. Programmers and app store regulators can leverage this data to better understand their audience. App stores enable users to search for, buy, and install mobile programs and to give feedback in the form of ratings and reviews. The rapid increase in the specific to linguistic and cultural contexts, to the extent that such a project is possible [10]. The elements of idiom and figurative speech, being cultural, are often also converted into relatively invariant meanings in semantic analysis. Semantics, although related to pragmatics, is distinct in that the former deals with word or sentence choice in any given context, while pragmatics considers the unique meaning derived from context or tone. In other words, semantics is about universally coded meaning, and pragmatics is about the meaning encoded in words that are then interpreted by an audience [11].
In information retrieval, TF/IDF, short for Term Frequency-Inverse Document Frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF/IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general [12]. TF/IDF is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use TF/IDF. With TF alone, common words such as articles receive a significant weight even though they contribute no real information. In TF/IDF, the more common a word is in the corpus, the smaller the weight it receives. Thus, common words like articles receive small weights, while rare words, which are assumed to carry more information, receive larger weights [13].
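The weighting described above can be illustrated with a minimal sketch. It assumes one common variant of the formula (term frequency as count divided by document length, inverse document frequency as log(N / df)); the paper does not state which variant was used, and the function name `tf_idf` and the sample reviews are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

reviews = ["the app is great", "the app crashes", "great game"]
weights = tf_idf(reviews)
```

In the first review, "the" occurs in two of the three documents and "is" in only one, so "is" receives the larger weight even though both words occur once. A word that occurs in every document receives a weight of zero under this variant.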
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with the user's preferred parser to provide idiomatic ways of navigating, searching, and modifying the parse tree [14]. The re module is the core of text processing in Python. It provides sophisticated ways to create and use regular expressions [15]. A regular expression is a kind of formula that specifies patterns in text strings. The name "regular expression" comes from the earlier mathematical treatment of "regular sets", and the phrase has stuck. Regular expressions give us a simple way to specify a set of related strings by describing the pattern they have in common. We write a pattern to summarize some set of matching strings. This pattern string can be compiled into an object that efficiently determines if and where a given string matches the pattern [16,17].
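As a small illustration of compiling a pattern into an object and asking if and where it matches, the following sketch uses Python's re module. The pattern itself (a star rating written in words) is a hypothetical example and is not taken from the paper's pipeline.

```python
import re

# Hypothetical pattern: a rating written as "<digit> star" or "<digit> stars".
RATING_PATTERN = re.compile(r"\b([1-5])\s*stars?\b", re.IGNORECASE)

def find_ratings(text):
    """Return (rating, start position) for every rating mentioned in the text."""
    return [(int(m.group(1)), m.start()) for m in RATING_PATTERN.finditer(text)]

sample = "Gave it 5 stars at first, now only 2 Stars."
ratings = find_ratings(sample)
```

Compiling once and reusing the compiled object avoids re-parsing the pattern for every review.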

Literature Review
In this paper, the authors propose a framework that allows developers to filter, summarize, and analyze user reviews written about applications. The authors automatically extract relevant features from app reviews (e.g., information about functionalities, bugs, and requirements) and analyze the sentiment associated with each of them. Three main building blocks are discussed: (i) topic modeling, (ii) sentiment analysis, and (iii) a summarization interface. The topic modeling block aims to find semantic topics in textual comments, extracting features based on the most relevant words of each topic. The sentiment analysis block detects the sentiment associated with each discovered feature [18]. The summarization interface provides developers with an intuitive visualization of the features (i.e., topics) along with their associated sentiment, providing more valuable information than a 'star rating'. Their evaluation shows that the topic modeling block can organize the information provided by users into subcategories that facilitate the understanding of features that may have a positive, negative, or neutral impact on the overall evaluation of the application.
Regarding user satisfaction, the authors observe that, although the star rating is a good measure of evaluation, the sentiment analysis technique is more precise in capturing the sentiment conveyed by the user in a comment [19].
The authors discussed sentiment analysis of app reviews. The results of this study show that sentiment analysis approaches for app reviews are helpful for developers. With these approaches, the developer can collect, filter, and examine user reviews. The research uses simple natural language techniques to recognize fine-grained app features in the reviews. It then extracts the opinion that users give about the recognized features by assigning a sentiment score to each review [20]. By using topic modeling techniques, the authors group fine-grained features into higher-level, more meaningful features. The authors analyzed and compared the results for 7 applications taken from the Apple App Store and the Google Play Store. In this way, the app developer can systematically examine user opinion about a single feature and filter the reviews. The authors obtained an accuracy of up to 91% and a recall of up to 73% [21].
Text mining techniques have recently been employed to classify and summarize user reviews on mobile application stores. However, due to the inherently diverse and unstructured nature of user-generated online textual data, text-based review mining techniques often produce excessively complicated models that are prone to overfitting. In this paper, the authors proposed an approach to review mining based on frame semantics [22]. Semantic frames help to generalize from the raw text (individual words) to more abstract scenarios (contexts). This representation of text is expected to enhance the predictive capabilities of review mining techniques and reduce the chances of overfitting [23]. First, the authors investigate the performance of semantic frames in classifying informative user reviews into various categories of software maintenance requests. Second, the authors propose and evaluate the performance of multiple summarization algorithms in generating concise and representative summaries of informative reviews. Three different datasets of app store reviews, sampled from a broad range of application domains, have been used to conduct the experimental analysis [24]. The results have shown that semantic frames can enable a quick and precise review classification process. However, in review summarization tasks, their results indicate that text-based summarization generates more comprehensive summaries than frame-based summarization. In closing, the authors have introduced MARC 2.0, a review classification and summarization suite that implements the algorithms investigated in the analysis [25].
Nowadays, the use of apps has increased with the spread of advanced mobile technology. Users prefer mobile phones to other gadgets for running applications. Users have already downloaded different applications to their mobile phones; they use these applications and leave reviews about them [26]. In the mobile app market, fraudulent ranking manipulation may push mobile apps up the popularity list. Indeed, it is becoming more frequent for app developers to use fraudulent mechanisms. The paper proposed semantic analysis of app reviews for fraud detection in mobile apps. First, the authors proposed to detect the misrepresentation by accurately mining the active periods of the mobile apps, also called leading sessions [27].
The authors intend to inspect two types of evidence, ranking-based review and -based, and use natural language processing (NLP) to extract action words. Next, the authors convert reviews to ratings and finally perform pattern analysis on the sessions with app data gathered from the app store. The paper thus proposes an approach, validates its effectiveness, and shows the scalability of the detection algorithm [28]. The question arises: how can millions of user reviews be summarized automatically and made sense of? Unfortunately, beyond simple summaries such as histograms of user ratings, few analytic tools can provide insights into user reviews [29]. In this paper, the authors proposed a system, "Wiscom", that can analyze millions of user ratings and comments in mobile app markets at three different levels of detail. This system is able to (a) discover inconsistencies in reviews; (b) identify reasons why users like or dislike a given app, and provide an interactive, zoomable view of how users' reviews evolve over time; and (c) provide valuable insights into the entire app market, identifying users' significant concerns and preferences for different types of apps. Results are reported on a 32 GB dataset consisting of over 13 million user reviews of 171,493 Android apps in the Google Play Store [30]. Three operator such as Google as well as individual app developers and end-users [31]. Unlike the products on "Amazon.com", mobile apps are continuously evolving, with newer versions rapidly superseding the older ones. Many app stores still use an Amazon-style rating system, which aggregates every rating ever assigned to an app into one store rating [32]. To examine whether the store rating captures the fickle user-satisfaction levels regarding new app versions, researchers mined the store ratings of more than 10,000 mobile apps in Google Play every day for a year.
Even though many apps' version ratings rose or fell, their store rating was resilient to fluctuations once they had gathered a substantial number of raters. The conclusion is that current store ratings are not dynamic enough to capture changing user satisfaction levels. This resilience is a significant problem that can discourage developers from improving app quality [33].

The methodology of Analytical Measurement and Visualization of Users' Reviews
The classification methodology starts with the scraping of application reviews. On the Google Play Store, we use the AppID to request the reviews of a specific application and scrape several pages of its reviews and ratings. We scraped this dataset in order to classify each user review as positive, negative, or neutral. After scraping the bulk raw reviews, the next step is preprocessing, which normalizes the reviews. The preprocessing steps are: removing special characters, removing single characters, removing a single character from the start, substituting multiple spaces with a single space, removing prefixes, converting the data to lowercase, and finally removing stop words and stemming. These are the significant steps for refining our reviews. After refining the reviews, the bag-of-words approach is applied. In the next step, we apply TF (Term Frequency) to the reviews using Python, and after that we apply TF/IDF (Term Frequency-Inverse Document Frequency), which is often used in information retrieval and text mining. After applying TF/IDF, feature extraction is performed for each application. Using Python, we apply different classification algorithms (Naïve Bayes, Random Forest, and Logistic Regression), measure accuracy, precision, recall, and F1-score, and collect the statistics for these parameters. After analyzing and testing these statistics, we determine which algorithm has the maximum accuracy, precision, recall, and F1-score, and can assess which algorithm is best for classifying reviews, as shown in Figure 1.
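The preprocessing steps listed above can be sketched with Python's re module. This is a minimal illustration under stated assumptions, not the paper's actual code: the stop-word list is a tiny hypothetical sample, `simple_stem` is a crude hypothetical stand-in for a real stemmer, and the prefix-removal step is left out because the text does not specify which prefix is removed.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of"}

def simple_stem(word):
    """Hypothetical suffix-stripping stemmer, far cruder than a Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(review):
    text = re.sub(r"\W", " ", review)            # remove special characters
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # remove single characters
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # remove a single character at the start
    text = re.sub(r"\s+", " ", text).strip()     # collapse multiple spaces into one
    text = text.lower()                          # convert to lowercase
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return [simple_stem(t) for t in tokens]      # stem what remains

tokens = preprocess("I LOVED this app!!!   It keeps crashing :( ")
```

The resulting token lists are what a bag-of-words or TF/IDF step would then count.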

The Methodology of Data Collection Process
With the advancement of technology, mobile applications have become a part of our daily life.
Half a million applications were introduced in 2011, and in October 2012, 0.675 million applications were accessible on the Google Play Store. Nowadays Android apps are used by a great many people for messaging, social media, games, and browsing. This online marketplace provides mobile users with free and paid access to over a million mobile applications, also referred to as "mobile apps". On the Google Play Store website, users can choose from over a million mobile apps in various predefined categories. Data collection is an important task in every research project, and the validity and accuracy of the dataset is a significant part of any data collection process. In this research, we scrape thousands of user reviews and ratings of different applications across different categories, as shown in Figure 2. We select 14 categories of the Google Play Store and scrape different applications from each category, as shown in Table 1. These categories are Action, Arcade, Card, Communication, Finance, Health and Fitness, Photography, Shopping, Sports, Video Player Editor, Weather, Casual, Medical, and Racing. We scrape thousands of reviews and ratings of these applications and convert the data into the .CSV file format. After this, we apply preprocessing to the data in the .CSV file: removing special characters, removing single characters, removing a single character from the start, substituting multiple spaces with a single space, removing prefixes, converting the data to lowercase, removing stop words, and stemming. We then evaluate the results using different machine learning algorithms and find the best algorithm for classification. We selected 148 apps that appear in 14 categories of the Google Play Store, fetched their reviews, and requested as many pages as the reviews required. We collected a total of 506259 reviews from the Google Play Store website, as shown in Figure 2.
To fetch the data, we first use the requests library. We use Python's Scikit-Learn library (Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel & Vanderplas, 2011) for machine learning because this library provides machine learning functionality such as classification, regression, clustering, and model validation [34]. The requests library allows the user to send HTTP/1.1 requests from Python and to add content such as headers. It also allows users to process the response data in Python. We then use the re library for text processing. A regular expression is a special sequence of characters that helps the user match or find other strings or sets of strings, using a specific syntax held in a pattern. After the re library, we use the Beautiful Soup library, which extracts data from HTML and XML files. This library works quickly and saves the programmer's time, as shown in Table 1.
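The step from a fetched HTML page to rows in a .CSV file can be sketched as follows. To stay self-contained, the sketch swaps in the standard library's `html.parser` for Beautiful Soup and parses a hard-coded string instead of an HTTP response. The `<div class="review">` markup and the app ID `com.example.app` are hypothetical; the real Google Play page structure is not described in the text.

```python
import csv
import io
from html.parser import HTMLParser

class ReviewParser(HTMLParser):
    """Collect the text of every <div class="review"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "review") in attrs:
            self._inside = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.reviews[-1] += data

def reviews_to_csv(html_page, app_id):
    """Parse one page and return its reviews as CSV text."""
    parser = ReviewParser()
    parser.feed(html_page)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["app_id", "review"])
    for text in parser.reviews:
        writer.writerow([app_id, text.strip()])
    return buffer.getvalue()

page = '<div class="review">Great app</div><div class="review">Too many ads</div>'
csv_text = reviews_to_csv(page, "com.example.app")
```

In a real run, the page string would come from the HTTP response body, and the buffer would be a file opened for writing.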

Results and Experiments
The results have been evaluated by fetching the reviews of different categories and performing a series of steps that predict the sentiment of those reviews. We use Python's Scikit-Learn library because it provides features such as classification, regression, clustering, and model validation. The steps performed are as follows: importing the data, feature extraction, converting the text into numbers, splitting into training and testing sets, training the text classification model and predicting sentiments, and evaluating the model.
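The final step, evaluating the model, relies on the four parameters reported throughout the results. The sketch below computes them from scratch for a multi-class problem. It assumes macro averaging (each class weighted equally); the text does not say which averaging was used, and the labels in the example are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
scores = evaluate(y_true, y_pred)
```

Accuracy can be noticeably higher than macro precision, recall, and F1 when one class dominates, which is one possible reading of the gap between those figures in the reported results.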

Analytical Measurement and Visualization After Preprocessing
This section reports the statistical results of the different algorithms on the different parameters after preprocessing, and compares them to find the best algorithm for the analysis and classification of reviews.

Naïve Bayes Multinomial
Naïve Bayes is used for classification. It assumes that the occurrence of a specific feature is independent of the occurrence of other features. It is fast at building models and making predictions. We have scraped the reviews of 148 apps from 14 categories of the Google Play Store. There are 40 reviews on one page, and we have collected a total of 506259 reviews of Google Play Store applications. We apply the Naïve Bayes algorithm for classification on this dataset of reviews and collect the results for the different parameters on TF and TF/IDF. For each application category we find the accuracy of classification together with the precision, recall, and F1 score; these parameters, which measure the quality of the classification, are shown in Table 2. The bar chart visualization of the Naïve Bayes algorithm, in which series1 shows the accuracy, series2 the precision, series3 the recall, and series4 the F1 score, is shown in Figure 3.
The Random Forest algorithm is evaluated on the same parameters on TF and TF/IDF. For each application category we find the accuracy of classification together with the precision, recall, and F1 score; these parameters are shown in Table 3. The bar chart visualization of the Random Forest algorithm, in which series1 shows the accuracy, series2 the precision, series3 the recall, and series4 the F1 score, is shown in Figure 4.
The Google Play Store is an online marketplace that provides free and paid access to users, who can choose from over a million apps in various predefined categories. In this research, we scraped thousands of user reviews and app ratings. We evaluated the results using different machine learning algorithms (Naïve Bayes, Random Forest, and Logistic Regression) that check the semantics of users' reviews of applications, that is, whether the reviews are positive, negative, or neutral. We calculated Term Frequency (TF) and Inverse Document Frequency (IDF) features, measured accuracy, precision, recall, and F1 score after preprocessing the raw reviews, and compared the statistical results of these algorithms. We visualized these statistical results in the form of a bar chart, as shown in Figure 6. After comparison, we found that Logistic Regression is the best algorithm for the semantic analysis of users' reviews of any Google Play application on both the TF and TF/IDF bases. For example, in the Sports category on the TF basis, the Logistic Regression algorithm has an accuracy of 0.622, a precision of 0.414, a recall of 0.343, and an F1 score of 0.343; the statistics for the other application categories are shown in Table 10. On the TF/IDF basis, the Logistic Regression algorithm has an accuracy of 0.621, a precision of 0.404, a recall of 0.319, and an F1 score of 0.315; the statistics for the other application categories are shown in Table 5.

Analytical Measurement and Visualization Without Preprocessing of Dataset
This section reports the statistical results of the different algorithms on the different parameters for the data as collected, without preprocessing, and compares them to find the best algorithm for the analysis and classification of reviews.

Naïve Bayes Multinomial
Naïve Bayes is a commonly used classification algorithm. It assumes that the occurrence of a specific feature is independent of the occurrence of other features. It is fast at building models and making predictions. We apply the Naïve Bayes algorithm for classification on the dataset of reviews and collect the results for the different parameters on TF and TF/IDF. For each application category we find the accuracy of classification together with the precision, recall, and F1 score; these parameters, which measure the quality of the classification, are shown in Table 7. The bar chart visualization of the Naïve Bayes algorithm, in which series1 shows the accuracy, series2 the precision, series3 the recall, and series4 the F1 score, is shown in Figure 8.
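The independence assumption can be made concrete with a minimal multinomial Naïve Bayes written from scratch. This is an illustrative sketch, not the Scikit-Learn implementation used in the experiments; it assumes Laplace (add-one) smoothing and simply ignores words never seen in training. The class name and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # log P(class) + sum of log P(word | class): each word contributes
            # independently of the others, which is the "naive" assumption.
            score = math.log(n_docs / total_docs)
            for word in tokens:
                if word in self.vocabulary:
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train = [["great", "app"], ["love", "it"], ["crashes", "often"], ["bad", "app"]]
labels = ["positive", "positive", "negative", "negative"]
model = TinyMultinomialNB().fit(train, labels)
```

Training only counts words per class, which is why model building and prediction are fast.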

Random Forest Algorithm
The Random Forest classifier is an ensemble method built on decision trees. It grows many decision trees, each based on a random selection of the data and a random selection of the variables. We apply the Random Forest algorithm for classification on the dataset of reviews and collect the results for the different parameters on TF and TF/IDF. For each application category we find the accuracy of classification together with the precision, recall, and F1 score; these parameters, which measure the quality of the classification, are shown in Table 8. The bar chart visualization of the Random Forest algorithm, in which series1 shows the accuracy, series2 the precision, series3 the recall, and series4 the F1 score, is shown in Figure 9.

Logistic Regression Algorithm
In statistics, the logistic model is a widely used statistical model that, in its basic form, uses a logistic function to model a binary dependent variable; many more complex extensions exist. For each application category we find the accuracy of classification together with the precision, recall, and F1 score; these parameters, which measure the quality of the classification, are shown in Table 9. The bar chart visualization of the Logistic Regression algorithm, in which series1 shows the accuracy, series2 the precision, series3 the recall, and series4 the F1 score, is shown in Figure 10.
The comparison of the algorithms is visualized in Figure 11. After comparison, we find that Logistic Regression is the best algorithm for the semantic analysis of users' reviews of any Google Play application on both the TF and TF/IDF bases. For example, in the Sports category on the TF basis, the Logistic Regression algorithm has an accuracy of 0.623, a precision of 0.416, a recall of 0.35, and an F1 score of 0.353; the statistics for the other application categories are shown in Table 15. On the TF/IDF basis, the Logistic Regression algorithm has an accuracy of 0.629, a precision of 0.416, a recall of 0.331, and an F1 score of 0.328; the statistics for the other application categories are shown in Table 10.
Figure 11. Bar chart visualization of the comparison of different machine learning algorithms on the TF basis without preprocessing of the data

Algorithm
After checking the different parameters, we find that Logistic Regression is the best algorithm, having the highest accuracy. In this section, we analyze all reviews and classify them into the classes positive, negative, and neutral. The target value is set to 1 if the review is positive and to 0 if the review is negative. The neutral class is determined by the confidence rate: if the confidence rate lies between 0 and 1, the review is classified as neutral.
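The labeling rule described above can be written as a small function. This is a sketch of one reading of the rule. The `tolerance` parameter is a hypothetical addition, not something the text specifies: with its default of 0.0 only confidence rates of exactly 1 and 0 count as positive and negative, and everything strictly in between is neutral.

```python
def label_review(confidence, tolerance=0.0):
    """Map a confidence rate in [0, 1] to a sentiment class.

    `tolerance` is a hypothetical knob: values within it of 0 or 1 still
    count as negative or positive. With the default, only exact 0 and 1 do.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= 1.0 - tolerance:
        return "positive"   # target value 1
    if confidence <= tolerance:
        return "negative"   # target value 0
    return "neutral"        # strictly between the two targets

classes = [label_review(c) for c in (1.0, 0.0, 0.5)]
```

In practice a classifier rarely outputs exactly 0 or 1, so some nonzero tolerance would likely be needed; its value would have to be chosen from the data.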
Our dataset contains different parameters, such as the application category, application name, application ID, reviews, and rating, as shown in Figure 13. However, for checking the semantics of each review, these parameters are more than is needed, which is why we select only the reviews of all applications.

HTML Decoding
HTML decoding converts HTML-encoded sequences back into plain text. Such sequences can appear at the start or end of the text field, for example as '&amp' or '\amp'.
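One way to perform this decoding in Python is the standard library's `html.unescape`. The text does not say which tool was used, so this is an assumed choice, and the sample review is hypothetical.

```python
import html

def decode_review(text):
    """Turn HTML entities such as &amp; or &quot; back into plain characters."""
    return html.unescape(text)

decoded = decode_review("Fast &amp; simple, &quot;best&quot; app &lt;3")
```

Text that contains no entities passes through unchanged, so the function can be applied to every review.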

Find Null Entries from Reviews
There appear to be about 700-800 null entries in the reviews column of the dataset, which might have arisen during the cleaning process. These null entries are removed from the dataset.
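Removing null entries can be sketched with the standard library's csv module. The original commands are not reproduced in the text, so this is an assumed equivalent; the column names `app_id` and `review` and the sample rows are hypothetical.

```python
import csv
import io

def drop_null_reviews(rows, column="review"):
    """Keep only rows whose review field is present and not blank."""
    return [row for row in rows if (row.get(column) or "").strip()]

raw = io.StringIO(
    "app_id,review\n"
    "com.a,Great app\n"
    "com.b,\n"
    "com.c,   \n"
    "com.d,Too slow\n"
)
rows = list(csv.DictReader(raw))
clean = drop_null_reviews(rows)
```

Reviews that became empty during cleaning (for instance, a review made up only of special characters) are dropped by the same check.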

Negative and Positive Words Dictionary
Using a word cloud of the corpus, we built dictionaries of negative and positive words based on the occurrence of words in the sentences, to get an idea of which words are frequent in the corpus, as shown in Figure 16.
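The word counts that underlie such a word cloud can be sketched with `collections.Counter`. This shows only the frequency counting, not the rendering of the cloud, and the sample reviews are hypothetical.

```python
from collections import Counter

def frequent_words(reviews, top_n=3):
    """Return the top_n most frequent words across the given reviews."""
    counts = Counter(
        word for review in reviews for word in review.lower().split()
    )
    return counts.most_common(top_n)

positive_reviews = ["great app", "great game love it", "love this app"]
negative_reviews = ["bad app", "bad ads and crashes", "crashes again"]

positive_top = frequent_words(positive_reviews)
negative_top = frequent_words(negative_reviews, top_n=2)
```

Words that rank high in only one of the two lists are natural candidates for the positive or negative dictionary; words that rank high in both, such as "app" here, carry little sentiment.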

Conclusion and Future Work
On the Google Play Store, users may download more than one million applications from different categories. In this research, we have collected hundreds of thousands of user reviews of applications. We cover 14 categories and the reviews of 148 apps, accumulating 506259 reviews from the Google Play Store. We assessed the outcome using machine learning algorithms (Naïve Bayes, Random Forest, and Logistic Regression) that determine whether the semantics of users' application reviews are positive, negative, or neutral. We calculated Term Frequency (TF) and Inverse Document Frequency (IDF) features, measured accuracy, precision, recall, and F1 score, and compared the statistical results of the algorithms. Using only TF, it makes no difference whether a word is common or not. Thus, common words such as articles receive a large weight even if they contribute no real information. In TF/IDF, the more common a word is in the corpus, the smaller the weight it receives. Thus, common words like articles receive small weights, while rare words, which are assumed to carry more information, receive larger weights. We visualized these statistical results in the form of bar charts. Each algorithm has been studied one by one, and the results can be compared. The evaluated results show that Logistic Regression is the best of the algorithms for analyzing Google Play Store reviews. Logistic Regression obtained the best accuracy, precision, recall, and F1 score both before and after preprocessing of the dataset. For example, in the Sports category on the TF basis after preprocessing, the Logistic Regression algorithm has an accuracy of 0.622, a precision of 0.414, a recall of 0.343, and an F1 score of 0.343, and on the TF/IDF basis it has an accuracy of 0.621, a precision of 0.404, a recall of 0.319, and an F1 score of 0.315.
Also, in the Sports category on the TF basis for the data as collected, the Logistic Regression algorithm has an accuracy of 0.623, a precision of 0.416, a recall of 0.35, and an F1 score of 0.353, and on the TF/IDF basis it has an accuracy of 0.629, a precision of 0.416, a recall of 0.331, and an F1 score of 0.328. The statistics for the other application categories are analyzed in the table below, which supports this analysis. The target value is set to 1 if the review is positive and to 0 if the review is negative. The neutral class is determined by the confidence rate: if the confidence rate lies between 0 and 1, the review is classified as neutral.
In the future, we plan to increase the number of application categories and the number of reviews.
We also plan to compare the accuracy of the Logistic Regression algorithm with that of other algorithms, and to generate clusters and examine the relationship between the reviews and ratings of an application so that each application can be analyzed more precisely.