Preprint
Article

This version is not peer-reviewed.

CrossCFD: Leveraging Cross Market Inconsistencies for Fake Review Detection

Submitted:

30 June 2026

Posted:

01 July 2026


Abstract
User reviews serve as a primary information source for users to assess app quality, gauge trustworthiness, and make installation decisions in mobile app stores. However, malicious developers frequently manipulate app reputation through fake reviews and inflated ratings, undermining the credibility of these platforms. Existing detection methods are mostly confined to a single app store, overlooking the cross market inconsistencies and coordinated manipulation campaigns that affect the same app across multiple platforms, leading to poor generalization and high false negative rates. To address this gap, we propose CrossCFD, a novel framework that leverages cross market inconsistencies for fake review detection. Our approach reuses four single market features, two enhanced single market features, and introduces nine cross market features to characterize coordinated promotional behavior. These features are then fused and fed into a gradient boosting classifier, which is trained on a labeled benchmark to distinguish fake reviews from genuine ones. Evaluation on a benchmark dataset shows that CrossCFD achieves 93.5% precision and 93.6% recall, demonstrating stronger detection effectiveness than baselines based on single market evidence and text only features. Applied to approximately 1.5 million reviews from 175 rapidly growing apps across eight major stores, CrossCFD identifies 21.37% of reviews as potentially fake. Our findings highlight the value of cross market evidence for understanding and detecting fake review manipulation.

1. Introduction

User reviews have become a central component of mobile app markets. Before installing an app, users often rely on reviews and ratings to assess its quality, reliability, and popularity. App markets also use single market evidence to support search, ranking, recommendation, and quality control. As a result, reviews directly influence user acquisition, app visibility, and developer commercial success. These economic incentives have encouraged fake review manipulation, where developers or third party promotion providers post misleading reviews, inflate ratings, or insert promotional keywords. Previous work has shown that fake and incentivized reviews are prevalent in app markets and that black hat App Store Optimization (ASO) services use fake reviews and sockpuppet accounts to manipulate app visibility and reputation [1,2,3].
A large body of research has studied fake review detection using textual, behavioral, temporal, and rating evidence. Typical approaches extract features from review content, rating distributions, reviewer activity, sentiment inconsistency, or abnormal review bursts, and then train classifiers to distinguish fake reviews from genuine ones [4,5]. These studies have significantly advanced the understanding of review manipulation. However, most existing methods are designed for a single market, such as Google Play or the Apple App Store. They implicitly assume that the evidence needed to identify fake reviews is fully observable within one market. This assumption becomes increasingly restrictive in fragmented mobile app ecosystems, where the same app may be distributed simultaneously through multiple app markets.
In practice, promotional fake review campaigns may operate across multiple app stores rather than within a single store alone. Promotion providers can reuse similar review content, adjust ratings, and schedule reviews across markets to improve the perceived credibility and visibility of the same app. Figure 1 presents a motivating example. Figure 1(a) shows that the same app, identified by package name com.yiban1314.yiban, appears in multiple app stores, and duplicated positive reviews are observed across these markets at different posting times. Figure 1(b) further shows discrepancies in app score and review count for the same app across markets. These observations suggest that promotional manipulation may leave evidence in different stores, including duplicated review content, inconsistent reputation statistics, and abnormal temporal patterns.
Such cross market manipulation creates new challenges for fake review detection. On the one hand, a detector that analyzes each market independently may miss coordinated manipulation whose evidence is weak within a single store but clearer after the same app is aligned across markets. On the other hand, it may produce false alarms by treating market specific variations as suspicious without considering whether similar patterns also appear in other stores. Therefore, fake review detection in mobile app ecosystems requires cross market analysis that jointly models review content, ratings, timestamps, rankings, and app metadata across stores.
To address this problem, we propose CrossCFD, a cross market fake review detection framework that leverages inconsistencies in the review patterns of the same app across multiple app markets. Our key observation is that legitimate apps tend to exhibit relatively consistent review characteristics across markets, whereas coordinated promotional campaigns often disrupt such consistency by reusing similar review templates, posting duplicated reviews within short time windows, or selectively inflating ratings in markets with weaker moderation. CrossCFD first collects review related records of the same app from multiple markets, such as review text, ratings, timestamps, rankings, and app metadata. It then cleans and aligns these records, extracts review features and cross market inconsistency features, and uses the fused features for fake review detection. By jointly modeling evidence within each market and discrepancies across markets, CrossCFD can identify promotional fake reviews that are difficult to capture from evidence in a single market alone.
We evaluated CrossCFD on a benchmark dataset of 9,000 reviews and further applied it to a large scale dataset containing approximately 1.5 million reviews from 175 apps across eight major app markets. Our evaluation shows that cross market information provides substantial benefits for detecting fake reviews. On the benchmark, CrossCFD achieves 93.5% precision and 93.6% recall, outperforming baselines that use only single market evidence and review text. The large scale measurement further reveals that potential fake reviews remain widespread across mobile app markets, with clear differences across markets and app categories. These findings suggest that cross market analysis is not only useful for improving detection accuracy, but also necessary for understanding how review manipulation campaigns operate across fragmented app ecosystems. The main contributions of this paper are as follows:
(1)
We formalize a 15 dimensional feature set for cross market fake review detection. The feature set retains four single market features, enhances two semantic features, and designs nine cross market features to capture inconsistency and coordination patterns across app markets.
(2)
We design CrossCFD, a fake review detector based on feature fusion. CrossCFD extends conventional fake review classifiers from single market detection to cross market detection by integrating local review evidence, semantic evidence, and cross market inconsistency evidence into a unified supervised model.
(3)
We conduct a large scale measurement study of fake reviews across app markets. Applying CrossCFD to 1,566,296 reviews from 175 apps across eight markets, we identify 334,778 potentially fake reviews and analyze their distribution across markets, categories, and apps.

3. Approach

3.1. High Level Overview

CrossCFD targets fake review detection for apps distributed across multiple markets. Its core insight is that coordinated promotion may be less visible within a single market but becomes clearer when the same app is examined across markets. Such campaigns often leave traces across markets, such as reused review templates, synchronized posting, inconsistent ratings, and market specific manipulation patterns. As shown in Figure 2, the framework contains four stages. The first stage collects review data of the same app from multiple markets. The second stage cleans and normalizes heterogeneous records, then constructs labels through LLM assisted human verification. The third stage extracts review level features and cross market inconsistency features. The final stage fuses these features into a supervised classifier for fake review detection.

3.2. Multi Market Crawling

A target app is included only when it can be reliably matched across at least two app markets. For each candidate app, crawlers query app names and package identifiers to retrieve potential listings from each market. Because app markets provide different interfaces, we use structured API crawling and UI rendering when needed. Only apps with consistent evidence across markets are retained. For each validated app and market pair, the crawler collects review text, rating, timestamp, ranking, and app metadata, while preserving the source market identifier for cross market alignment and feature computation.
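The inclusion rule above (keep an app only when it is matched in at least two markets) can be sketched as follows. This is a minimal illustration that matches listings on package identifier only; the `match_apps` helper, the `(market, package)` tuple layout, and the package names in the usage note are hypothetical and are not taken from the CrossCFD crawler.

```python
from collections import defaultdict


def match_apps(listings, min_markets=2):
    """Group (market, package) listings and keep apps seen in enough markets.

    Matching on the package identifier alone is an assumption of this sketch;
    the paper also uses app names and metadata as supporting evidence.
    """
    markets_by_package = defaultdict(set)
    for market, package in listings:
        markets_by_package[package].add(market)
    return {
        package: sorted(markets)
        for package, markets in markets_by_package.items()
        if len(markets) >= min_markets
    }
```

For example, `match_apps([("Huawei", "com.example.dating"), ("Xiaomi", "com.example.dating"), ("OPPO", "com.example.single")])` keeps only `com.example.dating`, because the other package appears in a single market.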

3.3. Preprocessing and LLM Analysis

The collected review records are preprocessed by removing records with missing text, invalid ratings, invalid timestamps, very short content, market specific artifacts, duplicated interface text, and developer replies. Ratings are normalized to a unified scale, and timestamps are converted into a consistent format for cross market comparison.
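A minimal sketch of this cleaning step is given below. The field names (`text`, `rating`, `timestamp`, `market`), the six-character minimum length, the 1 to 5 rating scale, and the Unix-epoch input format are all assumptions made for illustration; the filters for market specific artifacts, interface text, and developer replies are omitted because their rules are not specified.

```python
from datetime import datetime, timezone

MIN_CHARS = 6  # assumed threshold for "very short content"


def clean_review(record, max_rating=5):
    """Return a normalized copy of `record`, or None if it should be dropped."""
    text = (record.get("text") or "").strip()
    rating = record.get("rating")
    if len(text) < MIN_CHARS:
        return None  # missing or very short text
    if not isinstance(rating, (int, float)) or not 1 <= rating <= max_rating:
        return None  # invalid rating
    try:
        # Assumed input: Unix epoch seconds. Output: ISO 8601 string in UTC.
        when = datetime.fromtimestamp(float(record.get("timestamp")), tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        return None  # invalid timestamp
    return {
        "text": text,
        "rating": (rating - 1) / (max_rating - 1),  # unified [0, 1] scale
        "timestamp": when.isoformat(),
        "market": record.get("market"),  # keep the source market identifier
    }
```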
After preprocessing, we construct a labeled benchmark through LLM assisted annotation and human verification. Specifically, 11,832 candidate reviews are sampled from nine apps across three representative categories, and each review is annotated by Qwen 2.5 Max [41] using the app description, feature list, category, and market metadata as context. The LLM provides preliminary labels and rationales based on app content consistency, review specificity, sentiment rating consistency, and generic promotional expressions. Final labels are assigned by three human annotators after reviewing the LLM outputs and app context. A review is labeled as suspicious if it conflicts with the app functionality, contains unsupported promotional claims, shows sentiment rating inconsistency, or exhibits duplication or temporal coordination with other reviews. After removing ambiguous and inconsistent cases, 9,000 high confidence reviews are retained for model training and evaluation.

3.4. Feature Extraction

Each review is represented by a 15 dimensional feature vector. The feature construction starts from 11 conventional fake review features. Among them, four features are retained as traditional single market features, denoted as $tf_1$ to $tf_4$, because they directly describe basic review properties such as rating, length, sentiment consistency, and review burst behavior. Two features are further enhanced as $ef_1$ and $ef_2$, because simple lexicon or surface based measurements are insufficient to capture semantic sentiment and language complexity in app reviews. Several conventional features are then extended to the cross market setting, forming $cf_3$ to $cf_8$, to capture repeated content, rating discrepancies, ranking discrepancies, and review growth differences across markets. Finally, three new cross market features, $cf_1$, $cf_2$, and $cf_9$, are introduced to describe sentiment inconsistency and temporal coordination patterns that can only be observed after aligning the same app across multiple markets.
For the $i$-th review $c_i$ of an app, the feature vector is denoted as $F(c_i) = \{tf_1^i, \ldots, tf_4^i, ef_1^i, ef_2^i, cf_1^i, \ldots, cf_9^i\}$, where $tf^i$, $ef^i$, and $cf^i$ denote traditional single market features, enhanced single market features, and cross market features, respectively. Figure 3 summarizes the feature construction process.
Single Market Features. Single Market features provide basic evidence from individual reviews and local review activity within a single market. For review $c_i$, $A_i$ denotes its app, $r_i$ denotes the original rating, and $\tilde{r}_i = (r_i - 1)/4$ denotes the normalized rating.
$tf_1^i$: App rating score. The app rating score is defined as $tf_1^i = \tilde{r}_i$.
$tf_2^i$: Sentiment rating inconsistency. Fake reviews may exhibit a mismatch between the sentiment expressed in the review text and the assigned rating. Such inconsistency can indicate abnormal review behavior or mechanically generated promotional content. Therefore, the sentiment rating inconsistency feature is defined as $tf_2^i = |e_i - \tilde{r}_i|$, where $e_i \in [0, 1]$ denotes the sentiment score of review $c_i$. A larger value of $tf_2^i$ indicates a stronger mismatch between textual sentiment and numerical rating.
$tf_3^i$: App review length. The app review length feature is defined as $tf_3^i = L_i$, where $L_i$ denotes the character length of review $c_i$.
$tf_4^i$: App review count anomaly. Fake reviews often appear in short bursts, leading to abnormal daily review-count changes. The feature is calculated using the isolation forest and is defined as $tf_4^i = a_i = 2^{-E(h(x))/c(n)}$, where $x$ is the daily review count, $E(h(x))$ denotes the expected path length of $x$ through isolation trees, and $c(n)$ is the average path length for a sample size of $n$. A larger anomaly score indicates a stronger deviation from normal review volume patterns.
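The single market formulas above can be sketched as follows. The function names are hypothetical. The closed form used for $c(n)$ is the standard isolation forest normalizer $c(n) = 2H(n-1) - 2(n-1)/n$ with $H(k) \approx \ln k + 0.5772$, which the text does not spell out and is assumed here; the expected path length is taken as an input rather than computed from trees.

```python
import math

EULER_GAMMA = 0.5772156649


def normalized_rating(rating):
    """Map a 1-5 star rating to [0, 1] via (r - 1) / 4."""
    return (rating - 1) / 4


def sentiment_rating_inconsistency(sentiment, rating):
    """tf_2: absolute gap between sentiment score and normalized rating."""
    return abs(sentiment - normalized_rating(rating))


def average_path_length(n):
    """c(n): average path length for a sample of size n (assumed closed form)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(expected_path_length, n):
    """tf_4: 2 ** (-E(h(x)) / c(n)); scores closer to 1 are more anomalous."""
    if n < 2:
        raise ValueError("sample size must be at least 2")
    return 2.0 ** (-expected_path_length / average_path_length(n))
```

A day whose expected path length equals $c(n)$ scores exactly 0.5, and shorter paths (days that are isolated quickly) push the score toward 1.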
Enhanced Single Market Features. These features characterize the semantic and linguistic properties of individual reviews.
$ef_1^i$: Sentiment score. Promotional fake reviews often contain strong emotional expressions to influence user installation decisions. The sentiment score is obtained using a Chinese sentiment classifier based on StructBERT-base-chinese [42], trained on four review datasets with 115K samples. The feature is defined as $ef_1^i = e_i$, where $e_i \in [0, 1]$ denotes the predicted sentiment score of review $c_i$. Larger values indicate more positive sentiment. Table 1 gives representative examples.
$ef_2^i$: Language complexity. Fake reviews may contain rigid templates, repeated promotional wording, or unnatural expressions that differ from ordinary user feedback. We measure language complexity using a RoBERTa language assessment model [43]. For review $c_i$, the feature is defined as $ef_2^i = p_i = 2^{-\ell_i}$, where $\ell_i = \frac{1}{N} \sum_{k=1}^{N} \log_2 P(w_k \mid w_1, \ldots, w_{k-1})$ denotes the normalized log probability of the review, $w_k$ denotes the token at position $k$, $N$ is the number of tokens, and $p_i$ is the resulting perplexity value. In implementation, $p_i$ is clipped to $[18000, 26000]$ for numerical stability, with larger values indicating lower language model likelihood and higher linguistic complexity. We keep the clipped perplexity value without additional normalization because its absolute scale reflects the output range of the language assessment model and preserves magnitude differences among reviews. Table 1 gives representative examples.
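As a rough illustration of the language complexity feature, the sketch below computes a clipped perplexity from per-token conditional probabilities. The `perplexity_feature` name and the `token_probs` input are hypothetical: in practice the probabilities would come from the language assessment model, which is not reproduced here.

```python
import math


def perplexity_feature(token_probs, low=18000.0, high=26000.0):
    """Perplexity 2 ** (-mean log2 probability), clipped to [low, high].

    `token_probs` holds P(w_k | w_1, ..., w_{k-1}) for each token of a review.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_log2 = sum(math.log2(p) for p in token_probs) / len(token_probs)
    raw = 2.0 ** (-mean_log2)
    return min(max(raw, low), high)
```

With the clipping bounds relaxed, two tokens of probability 0.5 each give a perplexity of 2; with the default bounds any raw value below 18000 is reported as 18000.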
Cross Market Features. These features capture discrepancy and synchronization patterns of the same app across multiple markets. For review c i , all cross market features are computed over its associated app A i and then assigned to c i .
$cf_1^i$: Cross market sentiment discrepancy. This feature measures the variation of average sentiment across markets for app $A_i$. It is defined as $cf_1^i = \sigma_{A_i} = \sqrt{\frac{1}{m_{A_i} - 1} \sum_{j=1}^{m_{A_i}} (\mu_{A_i,j} - \mu_{A_i})^2}$, where $\mu_{A_i,j} = \frac{1}{n_{A_i,j}} \sum_{k=1}^{n_{A_i,j}} e_{A_i,j,k}$ is the average sentiment score of app $A_i$ in market $j$, $e_{A_i,j,k}$ is the sentiment score of the $k$-th review of app $A_i$ in market $j$, $n_{A_i,j}$ is the number of reviews of app $A_i$ in market $j$, $\mu_{A_i} = \frac{1}{m_{A_i}} \sum_{j=1}^{m_{A_i}} \mu_{A_i,j}$ is the cross market mean sentiment, and $m_{A_i}$ is the number of markets where app $A_i$ is listed. A larger value indicates stronger sentiment discrepancy across markets.
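A minimal sketch of the cross market sentiment discrepancy follows, reading $\sigma_{A_i}$ as the sample standard deviation of the per-market mean sentiment (an interpretation of the notation). The function name, the dictionary input, and the fallback value of 0.0 for apps with fewer than two non-empty markets are assumptions.

```python
import statistics


def sentiment_discrepancy(scores_by_market):
    """Sample standard deviation of per-market mean sentiment for one app.

    `scores_by_market` maps a market name to the list of sentiment scores of
    the app's reviews in that market.
    """
    means = [statistics.fmean(s) for s in scores_by_market.values() if s]
    if len(means) < 2:
        return 0.0  # assumed fallback: discrepancy undefined for one market
    return statistics.stdev(means)  # divides by (m - 1)
```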
$cf_2^i$: Cross market temporal variance of duplicate app reviews. This feature measures whether duplicated reviews are posted synchronously across markets. It is defined as $cf_2^i = d_i = \frac{1}{|T_{dup}(i)|} \sum_{t \in T_{dup}(i)} (t - \bar{t}_{dup}(i))^2$, where $T_{dup}(i)$ is the set of timestamps of reviews whose content is identical to review $c_i$, and $\bar{t}_{dup}(i) = \frac{1}{|T_{dup}(i)|} \sum_{t \in T_{dup}(i)} t$ is their average timestamp. If no duplicated review is observed, $cf_2^i$ is set to 0.
$cf_3^i$: Cross market count of duplicate app reviews. This feature captures exact content reuse across markets. It is defined as $cf_3^i = n_i$, where $n_i$ denotes the number of reviews whose content is identical to review $c_i$ across all markets where app $A_i$ is listed.
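The two duplicate-based features can be sketched together. The text leaves open whether the review itself belongs to the duplicate set; this sketch assumes the timestamp set includes the review's own timestamp while the count excludes the review itself. The function name and the `(text, timestamp)` input layout are hypothetical.

```python
from collections import defaultdict
import statistics


def duplicate_features(reviews):
    """Return one (cf2, cf3) pair per review, in input order.

    `reviews` is a list of (text, timestamp) pairs for a single app, pooled
    over all markets where the app is listed.
    """
    times_by_text = defaultdict(list)
    for text, timestamp in reviews:
        times_by_text[text].append(timestamp)

    features = []
    for text, _ in reviews:
        times = times_by_text[text]
        if len(times) < 2:
            features.append((0.0, 0))  # no duplicate observed
        else:
            # Population variance matches the 1 / |T_dup| normalization.
            features.append((statistics.pvariance(times), len(times) - 1))
    return features
```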
$cf_4^i$: Cross market count of similar app reviews. This feature captures cross market review reuse with shared templates or minor textual variations. It is defined as $cf_4^i = q_i = \sum_{c \in C_{A_i} \setminus \{c_i\}} I(\mathrm{sim}(c_i, c) > \theta)$, where $C_{A_i}$ denotes the set of reviews of app $A_i$ across all markets, $\mathrm{sim}(\cdot)$ is the Hamming-distance-based similarity score, $\theta = 0.75$ is the similarity threshold, and $I(\cdot)$ is the indicator function.
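The text does not specify how reviews are encoded before the Hamming distance is taken. The sketch below assumes one common choice, a 64-bit SimHash fingerprint over character bigrams with similarity $1 - \text{Hamming distance}/64$; this encoding, the BLAKE2b hash, and all function names are assumptions rather than the CrossCFD implementation.

```python
import hashlib

BITS = 64


def simhash(text, n=2):
    """64-bit SimHash over character n-grams (assumed fingerprint choice)."""
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    weights = [0] * BITS
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(BITS):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming_similarity(a, b):
    """1 - (Hamming distance between fingerprints) / BITS, in [0, 1]."""
    return 1.0 - bin(simhash(a) ^ simhash(b)).count("1") / BITS


def similar_review_count(target, others, theta=0.75):
    """cf_4: number of other reviews whose similarity to `target` exceeds theta."""
    return sum(1 for c in others if hamming_similarity(target, c) > theta)
```

Here `others` is expected to hold the remaining reviews of the same app across all markets, with the target review itself excluded by the caller.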
$cf_5^i$: Cross market app ranking discrepancy. This feature measures whether the same app exhibits inconsistent ranking dynamics across markets. It is defined as $cf_5^i = rd_{A_i} = \frac{SB_{A_i}}{SW_{A_i}}$, where $SB_{A_i}$ denotes the mean square between markets and $SW_{A_i}$ denotes the mean square within markets, both computed from ANOVA over the daily ranking changes of app $A_i$. The daily ranking change is $\Delta R_{A_i,j,t} = R_{A_i,j,t} - R_{A_i,j,(t-1)}$, where $R_{A_i,j,t}$ denotes the ranking of app $A_i$ in market $j$ at time $t$. A larger value indicates stronger cross market divergence in ranking dynamics.
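The ratio of mean squares can be sketched with a plain one-way ANOVA over daily ranking changes, grouped by market. The function name, the dictionary input, and the fallback values for degenerate cases (fewer than two markets, or zero within-market variation) are assumptions of this sketch.

```python
def ranking_discrepancy(rankings_by_market):
    """One-way ANOVA ratio (mean square between / mean square within).

    `rankings_by_market` maps a market name to the app's daily rankings in
    chronological order; the groups are the day-over-day ranking changes.
    """
    groups = []
    for series in rankings_by_market.values():
        deltas = [b - a for a, b in zip(series, series[1:])]
        if deltas:
            groups.append(deltas)

    total = sum(len(g) for g in groups)
    k = len(groups)
    if k < 2 or total <= k:
        return 0.0  # assumed fallback: not enough data for ANOVA

    grand_mean = sum(sum(g) for g in groups) / total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (total - k))
```

For rankings `[10, 11, 13]` and `[10, 15, 19]` in two markets, the daily changes are `[1, 2]` and `[5, 4]`, which gives a between-market mean square of 9, a within-market mean square of 0.5, and a ratio of 18.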
$cf_6^i$: Cross market app review length discrepancy. This feature measures whether review verbosity differs across markets. It is defined as $cf_6^i = ld_{A_i} = \mathrm{Var}(\bar{L}_{A_i,1}, \ldots, \bar{L}_{A_i,m_{A_i}})$, where $\bar{L}_{A_i,j}$ is the average review length of app $A_i$ in market $j$, and $m_{A_i}$ is the number of markets where app $A_i$ is listed.
$cf_7^i$: Cross market app burst discrepancy. This feature captures abnormal short term growth in review volume across markets. For app $A_i$ in market $j$, the burst ratio is defined as $B_{A_i,j} = \frac{\max_t (N_{A_i,j,t})}{\bar{N}_{A_i,j}}$, where $N_{A_i,j,t}$ is the number of reviews of app $A_i$ in market $j$ on day $t$, and $\bar{N}_{A_i,j}$ is the average daily review count. The feature is defined as $cf_7^i = \mathrm{Var}(B_{A_i,1}, B_{A_i,2}, \ldots, B_{A_i,m_{A_i}})$. A higher value indicates stronger inconsistency in burst review behaviors across markets.
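A short sketch of the burst discrepancy is shown below. The text does not say whether $\mathrm{Var}$ is the population or the sample variance; population variance is assumed here. The function names and the dictionary input are hypothetical, and markets with no reviews are skipped to avoid dividing by zero.

```python
import statistics


def burst_ratio(daily_counts):
    """Peak daily review count divided by the mean daily review count."""
    return max(daily_counts) / statistics.fmean(daily_counts)


def burst_discrepancy(daily_counts_by_market):
    """cf_7: variance of per-market burst ratios for one app."""
    ratios = [
        burst_ratio(counts)
        for counts in daily_counts_by_market.values()
        if sum(counts) > 0
    ]
    if len(ratios) < 2:
        return 0.0  # assumed fallback for apps observed in a single market
    return statistics.pvariance(ratios)
```

The same variance-over-markets pattern applies to the review length discrepancy, with per-market average review lengths in place of burst ratios.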
$cf_8^i$: Cross market app rating discrepancy. This feature measures the variation of app ratings across markets. It is defined as $cf_8^i = sd_{A_i}$, where $sd_{A_i}$ is the standard deviation of the average ratings of app $A_i$ across all markets. A larger value indicates stronger rating inconsistency across markets.
$cf_9^i$: Cross market temporal variance of similar app reviews. This feature extends $cf_2^i$ from exact duplicates to semantically or lexically similar reviews. Review similarity is computed using a Hamming distance based similarity score with a threshold of $\theta = 0.75$. The feature is defined as $cf_9^i = v_i = \frac{1}{|T_{sim}(i)|} \sum_{t \in T_{sim}(i)} (t - \bar{t}_{sim}(i))^2$, where $T_{sim}(i)$ is the timestamp set of reviews similar to $c_i$, and $\bar{t}_{sim}(i) = \frac{1}{|T_{sim}(i)|} \sum_{t \in T_{sim}(i)} t$ is their average timestamp. If no similar review is observed, $cf_9^i$ is set to 0.

3.5. Fake Review Detection

Fake review detection is formulated as a supervised binary classification task. For each review $c_i$, CrossCFD takes the feature vector $F(c_i)$ as input and predicts $\hat{y}_i = f(F(c_i))$, where $\hat{y}_i = 1$ denotes a fake review and $\hat{y}_i = 0$ denotes a genuine review. The feature vector consists of single market, enhanced single market, and cross market features. The classifier $f(\cdot)$ is selected from five supervised models.

4. Evaluation

4.1. Evaluation Setup

We evaluate CrossCFD on the labeled benchmark described in Section 3.3. The dataset contains 9,000 labeled reviews, with 6,750 reviews used for training and 2,250 reviews reserved for testing. Features are extracted for each review following Section 3.4. Model selection and hyperparameter tuning are conducted on the training set using five fold cross validation, while the test set is used only for final evaluation.
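The split described above can be sketched with index lists alone. The random shuffle, the fixed seed, and the unstratified folds are assumptions; the text does not state how the 6,750 / 2,250 partition or the five folds were drawn.

```python
import random


def train_test_split_indices(n, test_fraction=0.25, seed=0):
    """Shuffle indices 0..n-1 and return (train, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (fit, validate) index lists; fold sizes differ by at most one."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validate in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate
```

With `n = 9000` this yields 6,750 training indices and 2,250 test indices, and each of the five validation folds holds 1,350 training reviews. The test indices never enter the folds, which mirrors the rule that the test set is used only for final evaluation.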

4.2. Detection Evaluation

The detection experiment evaluates whether cross market inconsistency features improve fake review classification beyond baselines that use only review text or evidence from a single market. We compare three feature settings: text only features (word2vec), review features combined with word2vec representations, and the full CrossCFD feature set. Table 2 reports the detection results. Across all classifiers, the full CrossCFD feature set achieves better performance than the baselines that use only review text or review evidence. Gradient Boosting obtains the best overall result among the evaluated classifiers, reaching 93.5% precision, 93.6% recall, and 93.5% F1 score.
Figure 4 reports the ROC curve of the selected Gradient Boosting detector. CrossCFD achieves an AUC of 0.976, showing clear separation between fake and genuine reviews. When the true positive rate reaches 90%, the false positive rate is 8.6%, indicating that the detector can identify most fake reviews while keeping false alarms relatively low.
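For readers who want to reproduce this kind of operating-point reading, the sketch below computes AUC as a pairwise ranking statistic and reports the lowest false positive rate among thresholds that reach a target true positive rate. The function names are hypothetical, labels are assumed to be 0 or 1, and a review is assumed to be flagged when its score is at or above the threshold.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a fake review outscores a genuine one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fpr_at_tpr(labels, scores, target_tpr=0.90):
    """Smallest false positive rate among thresholds with TPR >= target_tpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        if tp / n_pos >= target_tpr:
            best = min(best, fp / n_neg)
    return best
```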

4.3. Generalization Evaluation

To evaluate robustness under distribution shifts, CrossCFD is tested in two settings, Only app and Only market. In the Only app setting, reviews from one app are used only for testing, while reviews from the remaining apps are used for training. In the Only market setting, reviews from one app store are used only for testing, while reviews from the remaining markets are used for training. These settings are stricter than the standard split because they avoid overlap of the same app or the same market between training and testing.
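Both settings amount to a leave-one-group-out split, keyed by app in the first case and by market in the second. A minimal sketch, assuming each labeled review is a dictionary with hypothetical `app` and `market` fields:

```python
def leave_one_group_out(records, key):
    """Yield (held_out_value, train, test) splits with no shared `key` value.

    Use key="app" for the Only app setting and key="market" for the
    Only market setting.
    """
    for value in sorted({r[key] for r in records}):
        train = [r for r in records if r[key] != value]
        test = [r for r in records if r[key] == value]
        yield value, train, test
```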
CrossCFD is compared with a baseline that uses only evidence from a single market. Precision, recall, F1 score, and false negative rate are reported. Recall and FNR are emphasized because missed detections correspond to fake reviews that remain unfiltered in deployment. In both settings, cross market features are computed from available unlabeled observations across markets, while labels from the target app or target market are used only for evaluation.
Table 3 reports the generalization results. Compared with the single market baseline, CrossCFD consistently improves recall and reduces false negative rates. In the app level setting, CrossCFD improves F1 score from 78.4% to 88.0% and reduces FNR from 23.0% to 13.8%. In the market level setting, CrossCFD improves F1 score from 74.5% to 82.7% and reduces FNR from 31.0% to 17.3%. The recall improvement is especially clear under the market level setting, suggesting that cross market inconsistency features help recover fake reviews that are missed when only evidence from a single market is used.
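The reported metrics are linked by simple identities: the false negative rate is one minus recall, and F1 is the harmonic mean of precision and recall. The sketch below computes them from confusion counts; the function name is hypothetical and any counts passed to it are illustrative, not values from Table 3.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, F1, and false negative rate, with fake as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fnr = fn / (tp + fn)  # equals 1 - recall
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}
```

Under this identity, the reported FNR of 13.8% in the app level setting corresponds to a recall of 86.2%, and the FNR of 17.3% in the market level setting corresponds to a recall of 82.7%.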

4.4. Feature Analysis

This experiment analyzes whether the proposed features show clear differences between fake and genuine reviews. Rather than focusing only on overall classification performance, this analysis examines the feature space itself, covering enhanced review features, repeated or similar review content across markets, and representative cross market inconsistency features.
Enhanced single market features. The sentiment score $ef_1$ captures the emotional intensity of a review. Promotional fake reviews often use exaggerated emotional expressions to influence user decisions, making accurate sentiment modeling important. Table 4 compares the StructBERT based sentiment model used in CrossCFD with a dictionary based method. The StructBERT based model improves accuracy from 81.7% to 90.6%, precision from 84.4% to 93.5%, and F1 score from 80.9% to 90.3%, indicating that contextual sentiment modeling provides a stronger signal than lexicon matching.
The language complexity feature $ef_2$ captures stylistic regularity. Figure 5 shows the KDE distributions of language complexity for fake and genuine reviews. Genuine reviews span a wider range, reflecting more diverse user expressions, whereas fake reviews are concentrated in a narrower region, suggesting more homogeneous and templated writing.
Cross market propagation of duplicated and similar reviews. To examine whether review reuse provides a cross market signal, we analyze an example app, com.zzjr.niubanjin, across three markets, namely 360 Mobile Assistant, Baidu App Store, and Yingyongbao. After removing null records and reviews shorter than six characters, each review is represented as a node, and an edge is added when two reviews are identical or similar. As shown in Figure 6, duplicated and similar reviews propagate across markets rather than remaining isolated within a single market.
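The graph construction can be sketched with a small union-find that links identical or similar reviews and keeps only the clusters spanning at least two markets. The function name, the `(market, text)` input layout, and the pluggable `is_linked` predicate are assumptions; the pairwise loop is quadratic and is meant for a single app's reviews, not for the full dataset.

```python
def cross_market_clusters(reviews, is_linked):
    """Return clusters of review indices that span at least two markets.

    `reviews` is a list of (market, text) pairs; `is_linked(a, b)` decides
    whether two review texts are identical or similar.
    """
    parent = list(range(len(reviews)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(reviews)):
        for j in range(i + 1, len(reviews)):
            if is_linked(reviews[i][1], reviews[j][1]):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(reviews)):
        clusters.setdefault(find(i), []).append(i)
    return [
        members
        for members in clusters.values()
        if len({reviews[i][0] for i in members}) >= 2
    ]
```

Passing `lambda a, b: a == b` as `is_linked` restricts the edges to exact duplicates; a thresholded similarity score can be substituted to include similar reviews.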
The association between repeated or similar review content across markets and fake reviews is substantial. Among 137 identical reviews appearing in two markets, 110 are fake reviews, accounting for 80.3%. Among 26 identical reviews appearing in three markets, 22 are fake reviews, accounting for 84.6%. For similar reviews, 198 of 285 reviews appearing across two markets are fake reviews, accounting for 69.5%. In addition, 55 of 67 similar reviews appearing across three markets are fake reviews, accounting for 82.1%. Moreover, more than 65% of identical or similar reviews are posted across markets within 12 days. These results support the use of duplicate count, similar review count, and temporal coordination as cross market features.
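The percentages quoted above follow directly from the reported counts, as the short check below shows. The `share` helper is a hypothetical name; the counts are the ones stated in this paragraph.

```python
def share(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


reported = {
    "identical, two markets": (110, 137),
    "identical, three markets": (22, 26),
    "similar, two markets": (198, 285),
    "similar, three markets": (55, 67),
}

for name, (fake, total) in reported.items():
    print(f"{name}: {fake}/{total} = {share(fake, total)}%")
```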
Discriminativeness of cross market features. We further inspect three representative cross market features, namely cross market count of similar app reviews $cf_4$, cross market app rating score discrepancy $cf_8$, and cross market temporal variance of similar app reviews $cf_9$. As shown in Figure 7(a), fake reviews tend to have higher counts of similar reviews, indicating repeated or template based review propagation across markets. Figure 7(b) shows that fake reviews are associated with larger rating score discrepancies across markets, while real reviews are more concentrated at lower discrepancy values. Figure 7(c) shows that fake reviews exhibit larger temporal variance among similar reviews, suggesting that suspicious review content may be distributed across different markets and time windows.

5. Measuring Fake Reviews in Multiple Mobile App Stores

To better understand fake review activity across real app markets, we conduct a large scale measurement study. The study is guided by the following research questions.
RQ1. How prevalent are fake reviews in app markets and which markets show stronger manipulation patterns?
We measure the overall scale of fake reviews, compare fake review proportions across markets, and analyze synchronization patterns across markets.
RQ2. Which app categories are more frequently targeted by fake review campaigns across markets?
We examine category differences in fake review prevalence and identify whether commercially sensitive categories, such as dating, gaming, and finance, are more affected.
RQ3. How is fake review prevalence related to app popularity and market distribution?
We analyze the relationship between fake review ratios and app popularity, and further characterize how apps with high fake review ratios distribute suspicious reviews across multiple markets.

5.1. Measurement Dataset

The measurement study is based on three datasets:
  • Hu et al. dataset [45]. This historical dataset contains 26,735,573 reviews from five Android app markets, including 360 Mobile Assistant, Baidu App Store, Wandoujia, Xiaomi, and Yingyongbao. It is used to characterize cross market duplicate and similar review propagation.
  • CHAMP dataset [44]. This dataset contains more than 730K reviews from major domestic app markets, including Huawei, Meizu, OPPO, VIVO, Xiaomi, and Yingyongbao. It complements the Hu et al. dataset by expanding market coverage for category analysis.
  • Recent review dataset. This dataset is collected from eight app markets between January 1, 2024 and January 1, 2025. Guided by the historical analysis, we focus on dating, gaming, and finance apps selected from rank-gain and rank-drop lists. Candidate apps are matched across markets using app names and package identifiers, and are further verified with available app metadata. Apps appearing in fewer than two markets are discarded. The final dataset contains 175 apps and 1,566,296 reviews. Each record includes review text, rating score, published time, ranking score, market source, and available app metadata.
Overall, these three datasets provide complementary coverage for the measurement study, where the Hu et al. dataset supports historical cross market propagation analysis, the CHAMP dataset expands category and market coverage, and the recent review dataset enables focused analysis of recent dating, gaming, and finance apps across eight markets.

5.2. RQ1: Prevalence and Market Manipulation Patterns

We first investigate the distribution of potential fake reviews across app markets and identify markets with stronger manipulation and synchronization patterns. To answer RQ1, we apply the trained CrossCFD detector to the recent review dataset, compare proportions of suspicious reviews across markets, and use the CMS score to examine whether suspicious reviews in one market tend to co-occur with suspicious activity in other markets.
Overall prevalence of fake reviews. Table 5 reports the estimated fake review proportions across app stores in the recent review dataset. CrossCFD identifies 334,778 fake reviews in total, indicating that reviews classified as fake remain prevalent in the current mobile app ecosystem. The distribution is highly uneven across markets. OPPO has the lowest fake review proportion, with 16.36% of its reviews classified as fake. VIVO, Xiaomi, Huawei, and App Store show moderate fake review proportions, ranging from 22.90% to 29.48%. In contrast, Yingyongbao exhibits a substantially higher fake review proportion of 60.15%, while 360 Mobile Assistant and Meizu reach 94.69% and 97.39%, respectively. These results indicate that fake review manipulation varies substantially across app stores. Large markets still contain many suspected fake reviews, while smaller markets may show stronger manipulation patterns. The high ratios in 360 Mobile Assistant and Meizu should be interpreted cautiously due to their limited review volumes, but they still suggest that fake review activity can be highly concentrated in specific markets.
Market manipulation evidence. In addition to potential fake review proportions, we examine cross market synchronization to identify stores that are more closely associated with coordinated promotion. Table 6 reports the distribution of suspicious reviews across markets under different cross market coverage levels. Specifically, the columns indicate suspicious reviews associated with apps whose suspicious reviews appear in one market, two markets, three markets, or more than three markets. A higher CMS score means that suspicious reviews in a given market are more likely to appear with suspicious activity in other markets. Yingyongbao obtains the highest CMS score of 0.81, followed by 360 Mobile Assistant with 0.68. Yingyongbao also accounts for 72.8% of apps whose suspicious reviews appear in more than three markets, suggesting a strong connection with cross market propagation. In comparison, OPPO, VIVO, Xiaomi, Huawei, and Meizu show lower CMS scores, ranging from 0.30 to 0.39. Together, fake review proportion and CMS capture different aspects of manipulation risk. A high fake review proportion indicates concentrated suspicious activity within a market, while a high CMS score indicates stronger involvement in suspicious activity across markets. This distinction shows that a market with a moderate fake review proportion may still be important in coordinated promotion if its suspicious reviews are widely synchronized with other stores.
Answer to RQ1. Potential fake reviews are not uniformly distributed across app markets, but vary substantially in both estimated prevalence and cross market synchronization. Markets with higher suspicious review proportions do not always provide equally reliable evidence, since small valid review volumes may amplify estimated ratios in some stores. These findings suggest that fake review risk should be assessed by jointly considering within market prevalence, valid review volume, and cross market synchronization.

5.3. RQ2: Category Targeting Patterns

We next examine whether fake review campaigns are concentrated in specific app categories. To answer RQ2, we analyze category differences using both historical datasets and the recent review dataset, since apps with stronger user acquisition pressure, higher monetization potential, or stronger trust requirements may provide greater incentives for promotional manipulation.
Category prevalence. To obtain broad category coverage, we first analyze the Hu et al. dataset and the CHAMP dataset. Since these datasets do not provide complete category labels for all package names, we collect app metadata from third party app intelligence platforms, including Kuchuan and Qimai. We then group apps into ten major categories and assign apps with ambiguous labels to the Others category.
Figure 8 reports the estimated fake review proportions across categories and markets. The heatmap shows clear category differences. Social Communication has the highest proportions in most markets, reaching 0.563 in Yingyongbao, 0.458 in Wandoujia, and 0.390 in 360. Finance also shows consistently high values, such as 0.352 in Yingyongbao, 0.339 in Wandoujia, and 0.279 in 360. Gaming presents strong manipulation patterns as well, with proportions of 0.350 in OPPO, 0.345 in Xiaomi, and 0.309 in Meizu. In contrast, categories such as Education, News, Reading, and Others generally show lower proportions, often below 0.10 in several markets.
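Each heatmap cell described above is a simple ratio of reviews classified as fake to all reviews for one category and market pair. A minimal sketch of that aggregation follows, assuming reviews arrive as (category, market, is_fake) tuples. The function name and the min_reviews cutoff are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def fake_proportion_matrix(reviews, min_reviews=1):
    """Map (category, market) -> share of reviews classified as fake.

    reviews: iterable of (category, market, is_fake) tuples.
    min_reviews: hypothetical cutoff that drops sparse cells, since
    ratios over very few reviews are unstable.
    """
    fake = defaultdict(int)
    total = defaultdict(int)
    for category, market, is_fake in reviews:
        cell = (category, market)
        total[cell] += 1
        if is_fake:
            fake[cell] += 1
    return {
        cell: fake[cell] / total[cell]
        for cell in total
        if total[cell] >= min_reviews
    }


# Toy input, not real data from the paper.
toy = [
    ("Finance", "OPPO", True),
    ("Finance", "OPPO", False),
    ("Finance", "OPPO", False),
    ("Finance", "OPPO", False),
    ("Gaming", "Xiaomi", True),
]
matrix = fake_proportion_matrix(toy)
```

With a cutoff such as min_reviews=2, the single review Gaming and Xiaomi cell would be dropped instead of being reported as a proportion of 1.0.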
Focused measurement on selected categories. Guided by the historical category analysis, the recent review dataset focuses on dating, gaming, and finance apps. Figure 9 reports the estimated fake review proportions for these three categories. Dating apps show the highest proportion at 25.25%, followed by gaming apps at 22.84%. Finance apps show a lower proportion of 15.29%, but still contain a non negligible proportion of suspicious reviews. Meizu and 360 Mobile Assistant are not shown in this figure because their valid samples in the selected categories are too sparse after applying the app matching and review filtering rules.
These differences suggest that potential fake review manipulation is related to category specific incentives. Dating apps often depend on perceived popularity, trust, and successful social interaction to attract users and encourage paid engagement. Gaming apps operate in a competitive market where ratings and reviews can affect visibility, downloads, and in app spending. Finance apps are more function oriented and trust sensitive. Although their estimated fake review proportion is lower than that of dating and gaming apps, suspicious reviews in this category may still be important because they can influence user decisions in financial scenarios.
Promotional language in selected categories. To further characterize fake review content in the selected categories, we perform word cloud analysis on representative apps from dating, finance, and gaming. Figure 10 shows that fake reviews in these categories are dominated by positive and persuasive words. For the dating app, frequent words are related to recommendation, liking, and social connection. For the finance app, the words emphasize convenience, speed, and loan services. For the gaming app, the words highlight entertainment quality, classic gameplay, and game equipment related terms.
Although the keywords differ across categories, they serve a similar promotional purpose. Potential fake reviews tend to emphasize the most commercially attractive aspects of each category, such as social connection in dating apps, fast loan services in finance apps, and entertainment experience in gaming apps. Instead of providing detailed user experiences, these reviews often use generic positive language to improve the perceived credibility of the app.
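The word clouds above rest on a top 30 word frequency count over reviews classified as fake. The sketch below illustrates that counting step under a strong simplifying assumption: tokens are separated by whitespace. Chinese review text would first need a word segmenter, and the tokenizer and stopword list actually used are not specified here, so top_words and its parameters are hypothetical.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'()"


def top_words(reviews, stopwords=frozenset(), k=30):
    """Return the k most frequent tokens across the given review texts."""
    counts = Counter()
    for text in reviews:
        for raw in text.lower().split():
            token = raw.strip(PUNCTUATION)
            if token and token not in stopwords:
                counts[token] += 1
    return counts.most_common(k)


# Toy reviews, not taken from the dataset.
sample = ["Great app, highly recommend!", "great loan, fast loan"]
words = top_words(sample, stopwords={"app"}, k=2)
```

The resulting (word, count) pairs can be handed to any plotting tool to size the words by frequency.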
Answer to RQ2. Potential fake review activity is more visible in commercially sensitive categories such as social communication, dating, gaming, and finance, while categories such as education, news, and reading generally show lower estimated proportions. This pattern suggests that suspicious review manipulation is more likely to emerge in categories where user trust, reputation signals, and acquisition incentives are closely connected to monetization. However, these results should be interpreted as category associated risk patterns rather than a complete ranking of all app categories.

5.4. RQ3: App Popularity and Cross Market Distribution

To answer RQ3, we examine whether fake review prevalence is associated with app popularity and how apps with high fake review proportions distribute suspicious reviews across markets. We first rank apps by download counts and plot each app according to its download count and fake review ratio. We then divide apps into three download groups to compare fake review ratios across popularity ranges. Finally, we inspect the apps with the highest fake review proportions to observe how their suspicious reviews are distributed across app stores.
Fake review prevalence versus app popularity. Figure 11 shows the relationship between app downloads and potential fake review ratios, where each point represents one app. We further divide apps into three groups based on download counts to compare potential fake review ratios across different popularity ranges. The average potential fake review ratio is 35.85% in the top download group and 56.60% in the low download group. To quantify the relationship within each market, we compute Pearson correlation coefficients between app download counts and fake review ratios. The correlations are close to zero, with 0.03 for Huawei, 0.02 for OPPO, 0.00 for VIVO, and 0.04 for Xiaomi. Other markets are omitted from this analysis due to missing or inconsistent download statistics and sparse valid samples after app matching and filtering. As supplementary examples, Table 7 shows the top 10 apps ranked by fake review proportion and their fake review distributions across markets. These apps show different distribution patterns. For example, com.bingo.yeliao has many fake reviews in OPPO, VIVO, Yingyongbao, and 360 Mobile Assistant, while com.leniu.shdl appears across App Store, Huawei, OPPO, VIVO, Xiaomi, and Yingyongbao. Other apps show fake reviews mainly in one or two markets, while some markets contain few reviews, no detected fake reviews, or no available listing.
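The per market correlation analysis described above can be sketched as follows. This is an illustrative implementation of the standard Pearson coefficient, assuming one (market, downloads, fake_ratio) row per app listing and raw, untransformed download counts. Whether the measurement used raw or log scaled downloads is not stated, and the function names are ours.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / sqrt(var_x * var_y)


def correlation_by_market(rows):
    """rows: iterable of (market, downloads, fake_ratio), one per app."""
    downloads = defaultdict(list)
    ratios = defaultdict(list)
    for market, count, ratio in rows:
        downloads[market].append(count)
        ratios[market].append(ratio)
    return {m: pearson(downloads[m], ratios[m]) for m in downloads}


# Toy rows, not real measurements.
rows = [
    ("Huawei", 100, 0.2), ("Huawei", 200, 0.4), ("Huawei", 300, 0.6),
    ("OPPO", 10, 0.5), ("OPPO", 20, 0.1), ("OPPO", 30, 0.5),
]
result = correlation_by_market(rows)
```

A coefficient near zero, as reported for Huawei, OPPO, VIVO, and Xiaomi, only rules out a linear association. Download counts are typically heavy tailed, so a rank based coefficient could be a useful complementary check.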
Answer to RQ3. Potential fake review risk is not explained by app popularity alone, as the observed correlation between download counts and estimated fake review proportions is weak. Some low popularity apps still show high suspicious review proportions, and high risk apps differ in how suspicious reviews are distributed across markets. This indicates that fake review analysis should jointly consider popularity signals, market coverage, store specific variation, and cross market propagation rather than relying on download counts alone.

6. Discussion

6.1. Implications

For app market maintainers, fake review moderation should not rely only on evidence observed within a single store. Markets can compare the same app across stores and use repeated review content, abnormal rating gaps, synchronized posting, and burst patterns as early warning evidence for coordinated promotion. Such evidence can help prioritize manual review for apps and categories with stronger manipulation risk.
For app developers, the results suggest that reputation building should rely on transparent user acquisition and genuine feedback rather than artificial promotion, since suspicious review patterns may become visible when the app is examined across markets. Developers can also monitor their own review distributions across stores to identify abnormal third party promotion or reputation attacks.
For mobile users, reviews and ratings should be interpreted with caution, especially when an app shows many generic positive reviews, sudden review growth, or inconsistent reputation across stores. Comparing reviews across markets and paying attention to detailed user experiences can help users make more reliable installation and payment decisions.

6.2. Limitations

Our study carries several limitations. First, the large scale measurement relies on CrossCFD predictions rather than fully manual labels, which may introduce false positives and false negatives. To alleviate this issue, we evaluate CrossCFD on a high confidence labeled benchmark before applying it to large scale data, and we report the measurement results as estimated manipulation evidence rather than exact ground truth prevalence. Second, cross market analysis depends on accurate app alignment, while app names, package identifiers, developer information, and descriptions may be incomplete or inconsistent across stores. To reduce this risk, we match apps using multiple metadata fields and exclude apps that cannot be confidently matched in at least two markets. Finally, CrossCFD captures observable evidence from reviews, ratings, timestamps, rankings, and market distributions, but cannot directly observe hidden promotion infrastructures such as paid review networks, install farms, or account collusion. Therefore, we avoid attributing suspicious reviews to specific attackers or organizations, and instead focus on measurable manipulation patterns across markets.
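The alignment constraint mentioned above, keeping only apps that can be matched in at least two markets, can be illustrated with a small sketch. It is not the matching procedure used in this work: the listing fields (market, package, name, developer), the normalization rule, and the fallback from package name to a (name, developer) pair are all assumptions made for illustration.

```python
from collections import defaultdict


def norm(text):
    """Lowercase and remove all whitespace for tolerant comparison."""
    return "".join(text.lower().split())


def align_apps(listings, min_markets=2):
    """Group store listings that likely refer to the same app.

    listings: dicts with 'market', 'package' (may be None, e.g. for
    stores without Android package names), 'name', and 'developer'.
    Returns {canonical key: sorted list of markets}, keeping only apps
    observed in at least min_markets markets.
    """
    # Pass 1: learn which package a (name, developer) pair belongs to.
    package_of = {}
    for item in listings:
        if item.get("package"):
            meta = (norm(item["name"]), norm(item["developer"]))
            package_of[meta] = item["package"]

    # Pass 2: group markets under the package, or under the metadata
    # pair when no package can be resolved.
    markets = defaultdict(set)
    for item in listings:
        meta = (norm(item["name"]), norm(item["developer"]))
        key = item.get("package") or package_of.get(meta) or meta
        markets[key].add(item["market"])

    return {k: sorted(v) for k, v in markets.items() if len(v) >= min_markets}


# Hypothetical listings for illustration only.
listings = [
    {"market": "Huawei", "package": "com.example.app",
     "name": "Example App", "developer": "Example Ltd"},
    {"market": "OPPO", "package": "com.example.app",
     "name": "Example  App", "developer": "Example Ltd"},
    {"market": "App Store", "package": None,
     "name": "example app", "developer": "Example Ltd"},
    {"market": "VIVO", "package": "com.solo.app",
     "name": "Solo", "developer": "Solo Inc"},
]
aligned = align_apps(listings)
```

Exact matching on normalized strings is deliberately conservative. It avoids merging unrelated apps at the cost of missing listings whose names or developer fields differ across stores, which is the trade-off the limitation above describes.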

7. Conclusion

This paper presents CrossCFD, a fake review detection framework that uses evidence across app markets. CrossCFD aligns the same app across different stores and combines review evidence with cross market inconsistency evidence, covering rating discrepancy, sentiment deviation, duplicated or similar review content, temporal coordination, ranking variation, and burst behavior. The evaluation shows that cross market evidence improves detection effectiveness and helps reduce missed detections under stricter generalization settings. The measurement study further shows that potential fake review activity appears across multiple app markets, but its estimated proportion and synchronization degree vary substantially by store. Markets with higher CMS scores show stronger co occurrence of suspicious reviews with other stores, indicating closer involvement in cross market promotional activity. The category analysis further shows that suspicious reviews are prominent in several commercially sensitive categories, and the focused measurement on dating, gaming, and finance apps also observes substantial fake review activity. In addition, app popularity alone cannot explain fake review prevalence, since apps with high fake review proportions show different download levels and different distributions across markets. These findings highlight the need to assess fake review risk from both within market prevalence and cross market coordination perspectives.

References

  1. Martens, D.; Maalej, W. Towards Understanding and Detecting Fake Reviews in App Stores. Empir. Softw. Eng. 2019, 24, 3316–3355.
  2. Rahman, M.; Hernandez, N.; Recabarren, R.; Ahmed, S.I.; Carbunar, B. The Art and Craft of Fraudulent App Promotion in Google Play. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 2437–2454.
  3. Hernandez, N.; Recabarren, R.; Carbunar, B.; Ahmed, S.I. RacketStore: Measurements of ASO Deception in Google Play via Mobile and App Usage. In Proceedings of the 21st ACM Internet Measurement Conference, Virtual Event, 2–4 November 2021; pp. 639–657.
  4. Shan, G.; Zhou, L.; Zhang, D. Examining Review Inconsistency for Fake Review Detection. Decis. Support Syst. 2021, 144, 113513.
  5. Gupta, R.; Jindal, V.; Kashyap, I. Recent State-of-the-Art of Fake Review Detection: A Comprehensive Review. Knowl. Eng. Rev. 2024, 39, e67.
  6. Jindal, N.; Liu, B. Opinion Spam and Analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, Palo Alto, CA, USA, 11–12 February 2008; pp. 219–230.
  7. Lim, E.-P.; Nguyen, V.-A.; Jindal, N.; Liu, B.; Lauw, H.W. Detecting Product Review Spammers Using Rating Behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, 26–30 October 2010; pp. 939–948.
  8. Mukherjee, A.; Venkataraman, V.; Liu, B.; Glance, N. Spotting Opinion Spammers Using Behavioral Footprints. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 632–640.
  9. Li, J.; Ott, M.; Cardie, C.; Hovy, E. Towards a General Rule for Identifying Deceptive Opinion Spam. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 22–27 June 2014; pp. 1566–1576.
  10. Rayana, S.; Akoglu, L. Collective Opinion Spam Detection: Bridging Review Networks and Metadata. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 985–994.
  11. Zhao, J.; Shao, M.; Tang, H.; Liu, J.; Du, L.; Wang, H. RHGNN: Fake Reviewer Detection Based on Reinforced Heterogeneous Graph Neural Networks. Knowl.-Based Syst. 2023, 280, 111029.
  12. Cheng, L.-C.; Wu, Y.-T.; Chao, C.-T.; Wang, J.-H. Detecting Fake Reviewers from the Social Context with a Graph Neural Network Method. Decis. Support Syst. 2024, 179, 114150.
  13. Yao, J.; Jiang, L.; Shi, C.; Yan, S. Fake Review Detection with Label-Consistent and Hierarchical-Relation-Aware Graph Contrastive Learning. Knowl.-Based Syst. 2024, 303, 112385.
  14. Duma, R.A.; Niu, Z.; Nyamawe, A.S.; Manjotho, A.A. A Deep Feature Interaction and Fusion Model for Fake Review Detection: Advocating Heterogeneous Graph Convolutional Network. Neurocomputing 2024, 598, 128097.
  15. Sun, P.; Bi, W.; Zhang, Y.; Wang, Q.; Kou, F.; Lu, T.; Chen, J. Fake Review Detection Model Based on Comment Content and Review Behavior. Electronics 2024, 13, 4322.
  16. Duma, R.A.; Niu, Z.; Nyamawe, A.S.; Manjotho, A.A. An Analysis of Graph Neural Networks for Fake Review Detection: A Systematic Literature Review. Neurocomputing 2025, 623, 129341.
  17. Mohawesh, R.; Salameh, H.B.; Jararweh, Y.; Alkhalaileh, M.; Maqsood, S. Fake Review Detection Using Transformer-Based Enhanced LSTM and RoBERTa. Intell. Syst. Appl. 2024, 23, 200406.
  18. Liu, J.; Quan, P.; Zhang, W. A Study on Fake Review Detection Based on RoBERTa and Behavioral Features. Procedia Comput. Sci. 2024, 242, 1323–1330.
  19. Geetha, S.; Elakiya, E.; Kanmani, R.S.; Das, M.K. High Performance Fake Review Detection Using Pretrained DeBERTa Optimized with Monarch Butterfly Paradigm. Sci. Rep. 2025, 15, 7445.
  20. Liu, M.; Poesio, M. Data Augmentation for Fake Reviews Detection in Multiple Languages and Multiple Domains. arXiv 2025, arXiv:2504.06917.
  21. Liu, X.; Xu, R.; Jia, X.; Liao, J.; Sun, J.; Huang, L.; Xu, W. Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network. arXiv 2025, arXiv:2510.01801.
  22. Meng, W.; Harvey, J.; Goulding, J.; Carter, C.J.; Lukinova, E.; Smith, A.; Frobisher, P.; Forrest, M.; Nica-Avram, G. Large Language Models as Hidden Persuaders: Fake Product Reviews Are Indistinguishable to Humans and Machines. arXiv 2025, arXiv:2506.13313.
  23. Chen, N.; Lin, J.; Hoi, S.C.H.; Xiao, X.; Zhang, B. AR-Miner: Mining Informative Reviews for Developers from Mobile App Marketplace. In Proceedings of the 36th International Conference on Software Engineering, Hyderabad, India, 31 May–7 June 2014; pp. 767–778.
  24. Wang, L.; Wang, H.; Luo, X.; et al. Demystifying “removed reviews” in iOS App Store. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022; pp. 1489–1499.
  25. Maalej, W.; Kurtanović, Z.; Nabil, H.; Stanik, C. On the Automatic Classification of App Reviews. Requir. Eng. 2016, 21, 311–331.
  26. Gao, C.; Zeng, J.; Lyu, M.R.; King, I. Emerging app issue identification from user feedback: Experience on WeChat. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP); IEEE, 2019; pp. 279–288.
  27. Nilizadeh, S.; Groggel, A.; Lista, P.; Das, S.; Ahn, G.-J.; Kapadia, A. Think outside the dataset: Finding fraudulent reviews using cross-dataset analysis. In Proceedings of The Web Conference, San Francisco, CA, USA, 2019; pp. 3108–3115.
  28. Mohawesh, R.; Xu, S.; Tran, S.N.; Ollington, R.; Springer, M.; Jararweh, Y.; Maqsood, S. Fake Reviews Detection: A Survey. IEEE Access 2021, 9, 65771–65802.
  29. Kumar, P.N.V.S.P.; Kasiviswanath, N.; Babu, A.S. Detecting Mobile App Fraud Review and Fake Ranking. In Proceedings of the International Conference on Advanced Materials, Manufacturing and Sustainable Development, 2024; Advances in Engineering Research; Babu, B.S., Ed.; Atlantis Press: Dordrecht, The Netherlands, 2025; pp. 314–319.
  30. Chen, K.; Li, S.; Wang, W. You can promote, but you can’t hide: Large-scale abused app detection in mobile app stores. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 2016; pp. 374–385.
  31. Fan, M.; et al. LLM App Store Analysis: A Vision and Roadmap. arXiv 2024, arXiv:2404.12737.
  32. Adelani, D.I.; Mai, H.; Fang, F.; Nguyen, H.H.; Yamagishi, J.; Echizen, I. Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human- and Machine-Based Detection. arXiv 2019, arXiv:1907.09177.
  33. Farooqi, S.; Feal, Á.; Lauinger, T.; McCoy, D.; Shafiq, Z.; Vallina-Rodriguez, N. Understanding incentivized mobile app installs on Google Play Store. In Proceedings of the ACM Internet Measurement Conference, 2020; pp. 696–709.
  34. Apple. The App Store Prevented More than $9 Billion in Fraudulent Transactions over the Last Five Years. Available online: https://www.apple.com/newsroom/2025/05/the-app-store-prevented-more-than-9-billion-usd-in-fraudulent-transactions/ (accessed on 26 June 2026).
  35. Gyöngyi, Z.; Garcia-Molina, H. Web spam taxonomy. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, 2005; 5, pp. 39–47.
  36. Ntoulas, A.; Najork, M.; Manasse, M.; Fetterly, D. Detecting Spam Web Pages through Content Analysis. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, UK, 23–26 May 2006; pp. 83–92.
  37. Castillo, C.; Donato, D.; Gionis, A.; Murdock, V.; Silvestri, F. Know Your Neighbors: Web Spam Detection Using the Web Topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 23–27 July 2007; pp. 423–430.
  38. Spirin, N.; Han, J. Survey on Web Spam Detection: Principles and Algorithms. ACM SIGKDD Explor. Newsl. 2012, 13, 50–64.
  39. Bevendorff, J.; Wiegmann, M.; Potthast, M.; Stein, B. Is Google getting worse? A longitudinal investigation of SEO spam in search engines. In Proceedings of the 46th European Conference on Information Retrieval; Springer: Cham, Switzerland, 2024; pp. 56–71.
  40. Aggarwal, P.; Singh, V.M.; Zhang, T.; Mandal, S.; Bansal, M.; McAuley, J.; Jha, S. GEO: Generative engine optimization. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 2024; pp. 5–16.
  41. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
  42. Wang, W.; Bi, B.; Yan, M.; Wu, C.; Bao, Z.; Xia, J.; Peng, L.; Si, L. StructBERT: Incorporating Language Structures into Pre-Training for Deep Language Understanding. arXiv 2019, arXiv:1908.04577.
  43. Bao, J. nlp-fluency; GitHub Repository, 2021. Available online: https://github.com/baojunshan/nlp-fluency (accessed on 2 June 2026).
  44. Hu, Y.; Wang, H.; Ji, T.; et al. CHAMP: Characterizing Undesired App Behaviors from User Comments Based on Market Policies. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering, Madrid, Spain, 25–28 May 2021; pp. 933–945.
  45. Hu, Y.; Wang, H.; Zhou, Y.; et al. Dating with Scambots: Understanding the Ecosystem of Fraudulent Dating Applications. IEEE Trans. Dependable Secur. Comput. 2019, 18, 1033–1050.
Footnote 1. App Store, https://apps.apple.com. Huawei AppGallery, https://appgallery.huawei.com. OPPO App Market, https://store.oppomobile.com. VIVO App Store, https://dev.vivo.com.cn. Xiaomi GetApps, https://global.app.mi.com. Yingyongbao, https://sj.qq.com. Meizu App Store, https://app.meizu.com. 360 Mobile Assistant, https://zhushou.360.cn.
Figure 1. Motivating example of cross market promotional review manipulation. (a) The same app, identified by package name com.yiban1314.yiban, is distributed across multiple app stores, and duplicated positive reviews appear across these markets at different posting times. (b) The same app exhibits discrepancies in app score and review count. The full app score is 10 in 360 Mobile Assistant and 5 in the other app stores.
Figure 2. Overview of our approach.
Figure 3. Feature design of CrossCFD. White boxes represent traditional features; yellow, pink, and green boxes represent reused single market features, enhanced single market features, and proposed cross market features, respectively. Arrows indicate the mapping from prior signals to CrossCFD feature construction.
Figure 4. ROC curve of the selected CrossCFD detector.
Figure 5. Distribution of language complexity for fake and genuine reviews.
Figure 6. Identical and similar review connections for an example app across three markets.
Figure 7. CDF distributions of representative cross market features. (a) Cross market count of similar app reviews $cf_4$. (b) Cross market app rating score discrepancy $cf_8$. (c) Cross market temporal variance of similar app reviews $cf_9$.
Figure 8. Heatmap of estimated fake review proportions across app categories and markets. Each cell shows the proportion of reviews classified as fake for a given app category and market pair. Darker colors represent higher fake review proportions.
Figure 9. Estimated fake review proportions of dating, gaming, and finance apps in the recent review dataset.
Figure 10. Word clouds of the top 30 words in three apps with high fake review proportions. (a) Dating app. (b) Finance app. (c) Game app.
Figure 11. Relationship between app downloads and fake review ratios.
Table 1. Examples of sentiment score and language complexity.
| Review | Sentiment score | Language complexity |
| --- | --- | --- |
| I’m so grateful, you’ve really helped me a lot. | 0.9465 | 21998.372 |
| It keeps crashing, what’s wrong with it? | 0.0861 | 20384.556 |
| I can’t open the 2024 version. | 0.3535 | 23231.899 |
| A very convenient investment and financial management tool that can solve my spare cash problem. | 0.9485 | 21643.835 |
| Sing in the daylight with wine, as youth accompanies you back home. | 0.8366 | 22684.603 |
Table 2. Detection performance of different models and feature settings.
| Feature set | Logistic P | Logistic R | Logistic F1 | SVM P | SVM R | SVM F1 | Random Forest P | Random Forest R | Random Forest F1 | Gradient Boosting P | Gradient Boosting R | Gradient Boosting F1 | MLP P | MLP R | MLP F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| word2vec | 0.711 | 0.708 | 0.708 | 0.747 | 0.740 | 0.737 | 0.780 | 0.778 | 0.776 | 0.783 | 0.782 | 0.782 | 0.750 | 0.746 | 0.744 |
| tf+word2vec | 0.766 | 0.731 | 0.748 | 0.813 | 0.816 | 0.815 | 0.859 | 0.853 | 0.856 | 0.853 | 0.850 | 0.851 | 0.818 | 0.818 | 0.818 |
| CrossCFD | 0.872 | 0.872 | 0.872 | 0.877 | 0.877 | 0.877 | 0.907 | 0.909 | 0.908 | 0.935 | 0.936 | 0.935 | 0.884 | 0.882 | 0.883 |
Table 3. Generalization performance under only app and only market settings.
| Setting | Model | Precision | Recall | FNR | F1 |
| --- | --- | --- | --- | --- | --- |
| Only app | Baseline | 0.799 | 0.770 | 0.230 | 0.784 |
| Only app | CrossCFD | 0.898 | 0.862 | 0.138 | 0.880 |
| Only market | Baseline | 0.809 | 0.690 | 0.310 | 0.745 |
| Only market | CrossCFD | 0.827 | 0.827 | 0.173 | 0.827 |
Table 4. Effectiveness of the sentiment feature.
| Method | Model type | ACC | P | F1 |
| --- | --- | --- | --- | --- |
| Dictionary based | Lexicon | 81.7% | 84.4% | 80.9% |
| StructBERT based | Neural | 90.6% | 93.5% | 90.3% |
Table 5. Overview of estimated fake review proportions across app markets.
| Market | Number of reviews | Fake prop. |
| --- | --- | --- |
| App Store | 123,238 | 29.02% |
| Huawei | 148,720 | 29.48% |
| OPPO | 790,261 | 16.36% |
| VIVO | 386,535 | 22.90% |
| Xiaomi | 105,677 | 27.44% |
| Yingyongbao | 8,396 | 60.15% |
| Meizu | 1,304 | 97.39% |
| 360 Mobile Assistant | 2,165 | 94.69% |
| Total | 1,566,296 | 21.37% |
Table 6. Fake review distribution across markets under different cross market coverage levels in the recent review dataset.
| Market | #1 Market | #2 Markets | #3 Markets | #>3 Markets | CMS |
| --- | --- | --- | --- | --- | --- |
| Huawei | 8,200 (10.6%) | 4,500 (13.1%) | 2,100 (12.5%) | 1,300 (14.6%) | 0.39 |
| Xiaomi | 7,800 (10.1%) | 4,200 (12.2%) | 1,900 (11.3%) | 980 (11.0%) | 0.35 |
| OPPO | 6,500 (8.4%) | 3,800 (11.0%) | 1,600 (9.5%) | 750 (8.4%) | 0.33 |
| VIVO | 5,200 (6.7%) | 2,900 (8.4%) | 1,300 (7.7%) | 620 (6.9%) | 0.30 |
| App Store | 19,100 (24.6%) | 5,300 (15.4%) | 2,200 (13.1%) | 1,100 (12.3%) | 0.42 |
| Yingyongbao | 18,400 (23.7%) | 14,200 (41.2%) | 9,800 (58.3%) | 6,500 (72.8%) | 0.81 |
| 360 | 12,500 (16.1%) | 9,600 (27.9%) | 5,300 (31.5%) | 3,200 (35.8%) | 0.68 |
| Meizu | 4,880 (6.3%) | 2,670 (7.8%) | 1,240 (7.4%) | 590 (6.6%) | 0.31 |
Table 7. Top 10 apps with the highest fake review proportions and their cross market fake review distributions. Market columns report # fake reviews / # reviews.
| Package name | Fake prop. | App Store | Huawei | OPPO | VIVO | Xiaomi | Yingyongbao | 360 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| com.xiao.chengshi | 0.971 | 553/556 | 2/2 | 1536/1605 | 407/410 | 0/1 | NAN | NAN |
| com.st.QSB | 0.962 | 16412/16959 | 1/19 | 290/307 | 246/328 | 0/6 | 0/1 | NAN |
| com.gxyt.truthlove | 0.959 | NAN | 6/6 | 59/60 | 52/54 | 0/2 | NAN | NAN |
| com.bingo.yeliao | 0.937 | NAN | 0/1 | 396/399 | 228/317 | NAN | 2008/2110 | 1435/1514 |
| com.ts.facai | 0.930 | NAN | 17/19 | 1143/1181 | 1537/1686 | 0/8 | 49/60 | NAN |
| com.kangluoer.tomato | 0.928 | NAN | 1/1 | 366/370 | 417/507 | NAN | 436/449 | 615/651 |
| com.huihe.tmydl | 0.904 | 0/47 | 0/1 | 3196/3225 | 2913/3487 | NAN | NAN | NAN |
| com.wiwj.xiangyucustomer | 0.791 | 2035/2504 | 7/27 | 4/25 | 2/14 | 0/20 | 1/2 | 265/310 |
| com.leniu.shdl | 0.648 | 504/2143 | 3917/5671 | 4952/7224 | 6045/8309 | 3122/4804 | 247/332 | NAN |
| com.yitantech.gaigai | 0.621 | 26/409 | 213/366 | 13654/17160 | 3917/5671 | 4/62 | 1/15 | 0/39 |
NAN indicates that the app is unavailable in the corresponding store or that no valid reviews were observed during data collection.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.