A Security-Related Reputation Scheme of Android Apps based on NLP Analysis of Comments

Product vendors use comments to measure consumer satisfaction. With the advent of Natural Language Processing (NLP), comments on Google Play can be processed to extract knowledge about applications, such as their reputation. Existing proposals in that direction are either informal or concerned only with functionality. In contrast, this work aims to determine the reputation of Android applications in terms of confidentiality, integrity, availability and authentication (CIAA). It proposes a model for assessing app reputation that relies on sentiment analysis and text analysis of comments. Assuming that comments are reliable, we collect Google Play applications whose comments include security keywords. An in-depth analysis of keywords based on Naive Bayes classification provides the polarity of each comment. Based on comment polarities, the reputation of the whole application is evaluated. Experiments on real applications, with dozens to billions of comments, reveal that developers make little effort to guarantee CIAA services. A fine-grained analysis shows that applications without a good overall security reputation can still be well reputed for specific CIAA services. Results also show that applications with negative security polarities generally display positive functional polarities. This result suggests that security checking should include careful comment analysis to improve the security of applications.


Introduction
Due to its open-source nature, Android is the most popular mobile operating system [1]. It is expected to remain popular until 2023 [2]. Android markets, such as Google Play and third-party markets (AppChina, Anzhi), play an important role in the popularity of Android devices. The number of applications has increased significantly, from one million in 2013 to about 2,900,000 [3], making Google Play a rich place where people can satisfy their needs. Despite Google Play Protect, which is used for threat protection, there are still malicious applications carefully designed by attackers to undermine user security and privacy [4,5]. There is therefore an urgent need for effective solutions to identify malware. Google lets consumers evaluate published applications after using them for a while. Based on their experience, they can give a positive or negative evaluation by providing opinions and reviews that express complaints, appreciation or suggestions about the functioning of the application. Such comments are useful and relevant to developers and application store owners to better understand their customers and to recommend improvements [6]. However, they are subjective and descriptive, and therefore hard to exploit when deciding whether or not to install an application. As in any social ecosystem, an application (seen as an agent) should be associated with a reputation score to predict bad behavior [7]. Tesfay et al. [8] evaluate security-related reputation based on user feedback, the number of users who installed the studied application and the reputation of vendors. This proposal is subjective, does not apply sentiment analysis to the feedback and is prone to false reputations when feedback is scarce. Some authors rely on review analysis to reveal aspects to improve during application updates [9,10,11].
Their interest focuses on functionality. Other reputation proposals rely on the risk induced by permission requests [12,13], Application Programming Interface (API) calls [9,14] and information flow analysis [15]. However, they require the applications to be installed: even if they succeed, bad applications will already have caused damage. In contrast, the aim of this research is to prevent bad installations by investigating how to exploit consumers' comments to build the security-related reputation of Android applications. Assuming that comments are not fake, this research proposes a reputation model relying on sentiment analysis of opinions and reviews related to security aspects such as confidentiality, integrity, availability and authentication. Experiments on real applications from Google Play reveal that they have bad security-related reputations but acceptable functionality-related reputations. This document is organized as follows. Section 2 provides a broad overview of related work. Section 3 presents the research methodology, including all aspects of the reputation model. Section 4 presents and discusses the results obtained with real applications. The document ends with a conclusion and perspectives.

Related Works
According to Genc-Nayebi and Abran [6], the app store environment and user reviews contain relevant information about user experience and expectations. They recommend that developers and application store owners leverage this information to better understand their customers. Their findings motivate this work, which evaluates applications based on analyzed comments in order to recommend security-related improvements to developers and owners. Any agent in a social network ecosystem should be assigned a reputation score, which is a predictor of future behavior based on previous interactions [7]. Agents that acted positively in the past are likely to be trusted in the future because they are expected to act likewise. Reputation is thus a characteristic that helps to limit the impact of dishonest agents. In this regard, every app in the Android ecosystem is an agent that should be associated with a reputation score. This work predicts an app's actions from the experiences of users who installed it. Based on comment analysis, this work derives a probabilistic model to assess the reputation of any application and therefore to estimate whether it is honest or dishonest, that is, benign or malicious in terms of security aspects. Tesfay et al. [8] set up a private cloud to control user installations of Android applications and to keep track of user feedback. They evaluate the reputation of applications based on feedback, the vendor's reputation and the number of users who run the same application. Their proposal is subjective, as is their analysis of feedback. Our work is more formal, coupling sentiment analysis with an NLP scheme. Reputation can also be evaluated through the risk induced by permission requests [12,13,16], API calls [9,14] and information flow analysis [15]. The main flaw of such reputation schemes is that they require the application to be installed. Even if they are successful, the application will already have caused damage.
The purpose of our proposal is to prevent dangerous installations by investigating applications in the store. Several prior works analyze reviews to guide vendors and developers on which actions to perform on their applications. Nguyen et al. [9] empirically evaluate the influence of user reviews on updates of Android application security and privacy features such as permissions. They demonstrate that reviews could have served as an indicator of code changes in a later version. Our proposal can complement this with a reputation scheme that indicates which security service the developer should concentrate on. Noei et al. [10] investigated the relationship between reviews and increases in star ratings. They looked for key topics in user reviews, per category, that developers should address to achieve higher star ratings. Our work complements theirs in that we provide topics specifically on security services, per application, to be improved by developers. Noei and Lyons [11] further investigate approaches that can help developers and researchers analyze user reviews to improve the app development process. However, the works studied mostly consider functionality aspects of the improvements. In our work, we argue that security aspects are fundamental, which is why the reputation based on review analysis is computed only from security comments.

Reputation Model
The methodology can be summarized in seven essential steps, depicted in the diagram shown in Figure 1. The first step concerns the collection of applications based on security-related comments. The second step identifies keywords indicating confidentiality, authentication, integrity and availability flaws. The third step determines the polarity of comments based on sentiment analysis. In the fourth step, comment polarities are classified by security service. The fifth step determines the reputation of an application based on the comment polarities of its security-related keywords. Each step is detailed in the following.

Collecting applications
In this phase, we gather applications from Google Play based on comments that point out security concerns and on other criteria that can influence the reputation of the application. These criteria include user ratings, the number of users who downloaded the application and the number of comments made by customers. Within the scope of this work, we restrict ourselves to French comments. Table 1 depicts 10 comments concerning security flaws made on the Instagram application. This example is used further on to illustrate the steps of the methodology.

Comment 9
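Depuis la dernière mise à jour, les conversations privées sont impraticables. Les messages prennent beaucoup plus de place, les images et les gifs prennent absolument toute la page, les déformant, et rendant la conversation illisible et surtout, il est devenu impossible de cliquer dessus pour mieux les voir. Si vous pouviez régler rapidement ce problème car ce n'est pas la première fois qu'il y a des bugs après de MAJ et ça arrive de plus en plus souvent, ça en dit long sur l'intérêt que vous portez à votre appli...
(English translation: Since the last update, private conversations are unusable. Messages take up much more space, images and GIFs fill the entire page, which distorts them and makes the conversation unreadable, and above all it is no longer possible to tap on them to see them better. Please fix this problem quickly, because this is not the first time there have been bugs after an update and it is happening more and more often, which says a lot about how much you care about your app...)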

Extraction of keywords
This step analyzes the collected comments by highlighting their keywords, in particular those related to security. Within the scope of this work, we mix French and English keywords. For this, we use keyword extraction tools such as the Search Engine Optimization (SEO) analysis tool, TextStat and Appbot. The relevance of the extracted keywords is studied to bring out those related to security problems. These keyword extraction tools are based on a global analysis of the positive and negative words highlighted in a comment. We then classify the keywords by security service, as presented in Table 2. We justify the selection of these keywords by their definitions in the Lexicon of computer security and by Wikipedia's definitions of security services [17].
Confidentiality: Only authorized persons can have access to the information intended for them (notions of rights or permissions). All unwanted access should be prevented.
Authentication: users must prove their identity by using an access code. Identification and authentication differ: with identification, the user is recognized only by an identifier, while with authentication the user must also provide a password. This makes it possible to manage access rights to the resources concerned and to maintain trust in exchanges.
Integrity: the data must be that which is expected and must not be altered accidentally, unlawfully or maliciously. Clearly, the elements considered must be exact and complete.
Availability: access to information system resources must be permanent and flawless during the planned periods of use. Services and resources are accessible quickly and regularly.
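The classification of keywords by security service can be illustrated with a minimal Python sketch. The keyword lists below are hypothetical placeholders, since the actual entries of Table 2 are not reproduced here, and the names `SERVICE_KEYWORDS`, `count_by_service` and `services_of` are introduced only for illustration.

```python
import re

# Hypothetical keyword lists (French and English mixed, as in this work).
# They stand in for Table 2, whose actual content is not reproduced here.
SERVICE_KEYWORDS = {
    "confidentiality": {"privé", "privée", "privées", "confidentialité", "permission"},
    "integrity": {"bug", "bugs", "modifié", "corrompu", "erreur"},
    "authentication": {"authentification", "password", "login", "identifiant"},
    "availability": {"impossible", "bloqué", "crash", "indisponible", "lent"},
}


def tokenize(text):
    """Lower-case the comment and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def count_by_service(comment):
    """Count, for each security service, how many of its keywords occur."""
    tokens = tokenize(comment)
    return {
        service: sum(1 for token in tokens if token in keywords)
        for service, keywords in SERVICE_KEYWORDS.items()
    }


def services_of(comment):
    """Return the services for which the comment contains at least one keyword."""
    return [s for s, n in count_by_service(comment).items() if n > 0]
```

Under these assumed lists, a comment mentioning "privées", "impossible" and "bugs" would be attached to confidentiality, availability and integrity; a real deployment would need the curated keywords of Table 2 and handling of multi-word expressions.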

Determination of comment polarity
This step evaluates the feeling that each user expresses when writing a review. It is represented here as positive (+), negative (-) or neutral. This evaluation relies on Naïve Bayes. A Naïve Bayes classifier is a classifier based on Bayes' theorem with the naive assumption that the features are conditionally independent of each other given the class. For a feature vector X = (X1, X2, ..., Xn) and a class variable Ck, Bayes' theorem [18] states that:

$$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)} \qquad (1)$$

where P(Ck|X) is the posterior probability, P(X|Ck) is the likelihood, P(Ck) is the prior class probability and P(X) is the prior probability of the predictor (evidence). Using the chain rule together with the independence assumption, the probability P(Ck|X) can be broken down as in Equation (2).

$$P(C_k \mid X_1, \ldots, X_n) = \frac{P(C_k) \prod_{i=1}^{n} P(X_i \mid C_k)}{P(X)} \qquad (2)$$
We use two analysis environments to obtain polarity: the Natural Language Toolkit (NLTK) sentiment analysis environment and the Spider environment. These environments support sentiment analysis and text classification. They evaluate whether a text expresses a positive, negative or neutral feeling. Using hierarchical classification, neutrality is determined first; the polarity of the feeling is then determined, but only if the text is not neutral. This classification is a naive Bayes classification, which is simply the application of Bayes' rule to compute classification probabilities. Such an environment provides the overall feeling that a user has about a given application. In our work, the goal is to obtain the polarity of a comment with respect to the security aspect. Appendix A shows the Python script used to exploit the keywords related to security services. Figure 2 depicts the polarity of each comment as produced by the script. In this specific case, the comment has a negative polarity of 77.9% and a positive polarity of 22.1%. This comment is therefore classified as negative.
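To make the polarity computation concrete, here is a minimal pure-Python sketch of a two-class Naive Bayes classifier with add-one (Laplace) smoothing. It is not the script of Appendix A: the training sentences are invented for illustration, the neutral class and the hierarchical step are omitted, and the class name `NaiveBayesPolarity` is hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesPolarity:
    """Tiny multinomial Naive Bayes classifier (illustrative sketch only)."""

    def __init__(self):
        self.class_counts = Counter()            # number of comments per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()

    def train(self, labelled_comments):
        for text, label in labelled_comments:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def posteriors(self, text):
        """Return P(class | text) for each class, normalised to sum to 1."""
        total_docs = sum(self.class_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                # Add-one smoothing avoids zero likelihoods.
                likelihood = (self.word_counts[label][word] + 1) / (total_words + len(self.vocab))
                score += math.log(likelihood)
            log_scores[label] = score
        top = max(log_scores.values())
        unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
        evidence = sum(unnormalised.values())
        return {c: v / evidence for c, v in unnormalised.items()}


# Invented training data, for illustration only.
TRAINING_COMMENTS = [
    ("application sécurisée et fiable", "pos"),
    ("connexion rapide et stable", "pos"),
    ("bugs après la mise à jour", "neg"),
    ("impossible de se connecter compte bloqué", "neg"),
]

classifier = NaiveBayesPolarity()
classifier.train(TRAINING_COMMENTS)
polarity = classifier.posteriors("impossible de cliquer, encore des bugs")
```

The two values in `polarity` play the role of the positive and negative percentages reported for a comment; with a realistic corpus, a library implementation such as the one shipped with NLTK would normally replace this hand-written class.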

Polarity classification by security service
This step determines the feeling behind each comment from its polarity and classifies it by security service. Table 3 presents comment polarities for the Instagram application. Comment C2, for example, expresses a negative feeling with a probability of 0.84. We then classify polarities by security service. Table 4 shows the breakdown of keywords by security service for the comments of a given application.

Table 4 shows the results of matching comment feelings to security services. Based on Table 4's outputs:
• C1 has a negative polarity in terms of confidentiality and availability;
• C2 has a negative polarity in terms of integrity and availability;
• C3 has a negative polarity in terms of integrity and availability;
• C4 has a negative polarity in terms of availability;
• C5 has a positive polarity in terms of authentication and availability.
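The matching of comment polarities to security services listed above can be represented with a small data structure. The following Python sketch uses the five Instagram comments C1 to C5 exactly as described in the text; the function name `polarity_by_service` is an assumption made for this illustration.

```python
from collections import defaultdict

# Polarity and security services of comments C1..C5, as listed in the text.
COMMENTS = {
    "C1": ("negative", ["confidentiality", "availability"]),
    "C2": ("negative", ["integrity", "availability"]),
    "C3": ("negative", ["integrity", "availability"]),
    "C4": ("negative", ["availability"]),
    "C5": ("positive", ["authentication", "availability"]),
}


def polarity_by_service(comments):
    """Group comment identifiers by security service and by polarity."""
    table = defaultdict(lambda: {"positive": [], "negative": []})
    for comment_id, (polarity, services) in comments.items():
        for service in services:
            table[service][polarity].append(comment_id)
    return dict(table)


service_table = polarity_by_service(COMMENTS)
```

Reading `service_table["availability"]` then gives the comments that speak positively or negatively about availability, which is the information the next step aggregates into a reputation value.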

Calculation of reputation by security service
At this stage, the reputation of an application is determined from the number of comments recorded in Google Play and their respective polarities, taking into account the security service that each comment addresses. We obtain the probability that an application has a good or a bad reputation by relying on the formula of the probabilistic model [7].
• α is the total number of positive polarities (Npos), such that $\alpha = \sum (\alpha_C + \alpha_I + \alpha_A + \alpha_D)$;
• β is the total number of negative polarities (Nneg), such that $\beta = \sum (\beta_C + \beta_I + \beta_A + \beta_D)$;
where
• αC (resp. βC) is the positive (resp. negative) polarity value for the service confidentiality;
• αI (resp. βI) is the positive (resp. negative) polarity value for the service integrity;
• αA (resp. βA) is the positive (resp. negative) polarity value for the service authentication;
• αD (resp. βD) is the positive (resp. negative) polarity value for the service availability;
such that

$$\alpha_i + \beta_i = 1 \qquad (4)$$

Table 5 summarizes the polarity values for each comment of the Instagram application. These values are obtained as follows. Running the Python script on a given comment yields its positive and negative polarities (Table 3), which are then combined with the breakdown of keywords by security service given above (Table 4). Thus, to obtain the values αi and βi for a given comment, we take the number of words of the comment distributed by security service, then we report the exact polarity value obtained in Table 3 for each security service having a positive number (i.e., αi, βi > 0). The final polarity value is therefore obtained as a function of the total number of words in the comment highlighting the security aspect and the number of words distributed by security service (Table 4). We can therefore determine the reputation likelihood by security service by applying the formula of the probabilistic model E(p|α,β) to each service. The reputation E obtained must lie in the interval [0, 1]. For an application with a good reputation, the value of E must be greater than 0.5; the reputation is bad for a value less than 0.5 and neutral for a value equal to 0.5. For instance, the Instagram application does not have a good reputation regarding the security aspect.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 23 March 2020 | doi:10.20944/preprints202003.0332.v1
More specifically, we can infer the reputation of the application for each security service, which tells developers which security aspects to improve in the next version.
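The reputation computation can be sketched in Python as follows. The sketch assumes that the probabilistic model of [7] is the expected value of a Beta distribution, E(p|α,β) = α / (α + β); this form is an assumption, since the formula itself is not reproduced in this section. The per-comment polarity values are hypothetical and are not those of Table 5.

```python
SERVICES = ("C", "I", "A", "D")  # confidentiality, integrity, authentication, availability

# Hypothetical polarity values: for each comment, service -> (alpha_i, beta_i),
# i.e. (positive, negative) polarity with alpha_i + beta_i = 1.
COMMENT_POLARITIES = [
    {"C": (0.20, 0.80), "D": (0.20, 0.80)},
    {"I": (0.16, 0.84), "D": (0.16, 0.84)},
    {"A": (0.70, 0.30), "D": (0.70, 0.30)},
]


def expected_reputation(alpha, beta):
    """Assumed form of E(p | alpha, beta): alpha / (alpha + beta), neutral if no evidence."""
    total = alpha + beta
    return 0.5 if total == 0 else alpha / total


def reputation_label(value):
    """Map a reputation value in [0, 1] to good / bad / neutral using the 0.5 threshold."""
    if value > 0.5:
        return "good"
    if value < 0.5:
        return "bad"
    return "neutral"


def reputation_by_service(comments):
    """Aggregate positive and negative polarities per security service."""
    result = {}
    for service in SERVICES:
        alpha = sum(c[service][0] for c in comments if service in c)
        beta = sum(c[service][1] for c in comments if service in c)
        result[service] = expected_reputation(alpha, beta)
    return result


def overall_reputation(comments):
    """Aggregate all services: alpha and beta are sums over comments and services."""
    alpha = sum(a for c in comments for a, _ in c.values())
    beta = sum(b for c in comments for _, b in c.values())
    return expected_reputation(alpha, beta)


per_service = reputation_by_service(COMMENT_POLARITIES)
overall = overall_reputation(COMMENT_POLARITIES)
```

With these made-up values, the application would be labelled as having a bad overall security reputation while still being well reputed for authentication, which mirrors the kind of fine-grained reading the model is meant to support.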

Results and Discussions
This section presents the main results of our findings. It is subdivided into two points. The first point presents the reputation of Android applications on the security aspect and the second point deals with the functionality aspect.

Dataset
We gathered 13 applications from Google Play covering live messaging, social media, mobile banking and antivirus software. Because an application can receive hundreds to millions of comments, because applications are filtered on comments containing security terms, and because the collection is manual, it was difficult to gather a large number of applications.

Reputation based on security
Applying the reputation model to the dataset yields the results detailed in Table 6. Table 6 presents the polarity of the applications over a defined number of comments related to security services. It then presents the security-related reputation of these applications, overall and by security service. Some of them (Lecteur de code QR, Messenger, WhatsApp) have an almost average reputation. None of the applications has a very good reputation. For example, "Security Master" has a bad reputation (42%) in terms of security design.
Users do not feel comfortable while using this application. Its developers have put little effort into security in general. However, its authentication measures are appreciated (67%), whereas the other services are not ensured. Although WhatsApp has an average reputation, its developers put effort into integrity (58%), authentication (68%) and availability (67%) to guarantee that exchanges are not altered, to verify the identity of communicators and to keep the service available. The "Express Union" application, which provides banking operations, has a bad reputation (41%), meaning that users do not feel safe when using it to perform banking transactions.

Reputation based on functionality
Table 7 shows the reputation of the same applications in terms of functionality. It is obtained with Appbot, which evaluates the functionality of Android applications from ratings and comment polarity. It reveals, for example, that "Security Master" has a very good reputation (86%) in terms of what it has been designed for: users feel very comfortable while using this application. In contrast, the "Express Union" application has a bad reputation (15%), meaning that users are not satisfied when using it to perform banking transactions. Overall, applications have a better reputation in terms of functionality than in security. This fact can be explained by two situations. The first is that people feel more comfortable with the service provided than with its security. The second is that developers use vulnerable Application Programming Interfaces (APIs), which generate security flaws. Another remark is that developers of messaging applications tend to put more emphasis on security services than the others. This is the case of Messenger and WhatsApp. The correlation coefficient between security-based reputation and functionality-based reputation is 0.44, indicating that there is no strong relation between the two. This result confirms that security enforcement cannot be estimated from the implementation of functionalities. In other words, consumer comments reveal that developers pay little attention to security aspects.
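For readers who wish to reproduce this kind of comparison, the sketch below computes a Pearson correlation coefficient in plain Python. The text does not state which coefficient was used, so Pearson is an assumption, and the two score lists are hypothetical values rather than the contents of Tables 6 and 7.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical reputation scores (in %) for a handful of applications.
security_scores = [42, 41, 50, 49, 38]
functionality_scores = [86, 15, 70, 75, 60]
coefficient = pearson(security_scores, functionality_scores)
```

A coefficient close to 1 or -1 would indicate a strong linear relation between the two reputations, whereas values nearer to 0 indicate a weak one; with only 13 applications, any such coefficient should be read with caution.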

Advantages
The proposed model contributes on several points.
• The model reveals that comments can effectively indicate security flaws to be fixed in updated versions.
• The model is able to prioritize the security services to reinforce in the next version.
• The model is able to provide fine-grained security reputations. Through user opinions, it finds various CIAA security pitfalls to overcome before an application is installed on a smartphone. It can be used to make recommendations to developers and vendors.
• The reputation based on functionality is in most cases much higher than the reputation based on security, except for certain messaging applications.

Limitations
Results show that the proposed model is effective in determining the security-related reputation of an Android application. However, this model has some limitations:
• Retrieving comments from Google Play is done manually and takes a lot of time;
• The model does not distinguish fake comments from real ones; we assumed that all comments are genuine;
• The security services addressed by a given comment are determined manually rather than automatically;
• The publication date of comments is not taken into account. Over time, applications can be improved through updates and customers' feelings can become positive;
• We have considered only one application source, the Google Play Store, whereas people can comment on the same application in different application stores.

Conclusion and Perspectives
Mobile applications, which we could almost call "shortcuts", are present most of the time on our mobile phones. While some people can no longer do without them, others still find it difficult to adopt them and to "stay connected" day and night. Mobile applications are very recent, and yet their appearance quickly overwhelmed and completely changed the daily life of the population. Despite Android's popularity and momentum, its open-source nature and programmable structure make it vulnerable to attacks. Our goal was therefore to assess the reputation of Android applications based on the opinions and reviews made by users. For this, we have proposed a model based on sentiment analysis and NLP to determine the reputation of a given application and therefore whether it is potentially risky. Results revealed that the reputation of an application from a functional point of view is not as reliable as that based on the security aspect. This is because an application that meets the security criteria is easier to trust and gives users more confidence. In future work, the model will take into account that comments can be made on the same application across different stores and that, within the same store, comments can change following developer updates.

Conflicts of Interest:
The authors declare no conflict of interest.