CrossCFD: Leveraging Cross Market Inconsistencies for Fake Review Detection

Pengfei Wang; Lu Wang; Chuanhui Ma; Yangyu Hu

doi:10.20944/preprints202607.0059.v1

Submitted: 30 June 2026

Posted: 01 July 2026


Abstract

User reviews serve as a primary information source for users to assess app quality, gauge trustworthiness, and make installation decisions in mobile app stores. However, malicious developers frequently manipulate app reputation through fake reviews and inflated ratings, undermining the credibility of these platforms. Existing detection methods are mostly confined to a single app store, overlooking the cross market inconsistencies and coordinated manipulation campaigns that affect the same app across multiple platforms, leading to poor generalization and high false negative rates. To address this gap, we propose CrossCFD, a novel framework that leverages cross market inconsistencies for fake review detection. Our approach reuses four single market features, two enhanced single market features, and introduces nine cross market features to characterize coordinated promotional behavior. These features are then fused and fed into a gradient boosting classifier, which is trained on a labeled benchmark to distinguish fake reviews from genuine ones. Evaluation on a benchmark dataset shows that CrossCFD achieves 93.5% precision and 93.6% recall, demonstrating stronger detection effectiveness than baselines based on single market evidence and text only features. Applied to approximately 1.5 million reviews from 175 rapidly growing apps across eight major stores, CrossCFD identifies 21.37% of reviews as potentially fake. Our findings highlight the value of cross market evidence for understanding and detecting fake review manipulation.

Keywords: fake review detection; cross market inconsistencies; ranking fraud; mobile security

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

User reviews have become a central component of mobile app markets. Before installing an app, users often rely on reviews and ratings to assess its quality, reliability, and popularity. App markets also use single market evidence to support search, ranking, recommendation, and quality control. As a result, reviews directly influence user acquisition, app visibility, and developer commercial success. These economic incentives have encouraged fake review manipulation, where developers or third party promotion providers post misleading reviews, inflate ratings, or insert promotional keywords. Previous work has shown that fake and incentivized reviews are prevalent in app markets and that black hat App Store Optimization (ASO) services use fake reviews and sockpuppet accounts to manipulate app visibility and reputation [1,2,3].

A large body of research has studied fake review detection using textual, behavioral, temporal, and rating evidence. Typical approaches extract features from review content, rating distributions, reviewer activity, sentiment inconsistency, or abnormal review bursts, and then train classifiers to distinguish fake reviews from genuine ones [4,5]. These studies have significantly advanced the understanding of review manipulation. However, most existing methods are designed for a single market, such as Google Play or the Apple App Store. They implicitly assume that the evidence needed to identify fake reviews is fully observable within one market. This assumption becomes increasingly restrictive in fragmented mobile app ecosystems, where the same app may be distributed simultaneously through multiple app markets.

In practice, promotional fake review campaigns may operate across multiple app stores rather than within a single store alone. Promotion providers can reuse similar review content, adjust ratings, and schedule reviews across markets to improve the perceived credibility and visibility of the same app. Figure 1 presents a motivating example. Figure 1(a) shows that the same app, identified by package name com.yiban1314.yiban, appears in multiple app stores, and duplicated positive reviews are observed across these markets at different posting times. Figure 1(b) further shows discrepancies in app score and review count for the same app across markets. These observations suggest that promotional manipulation may leave evidence in different stores, including duplicated review content, inconsistent reputation statistics, and abnormal temporal patterns.

Such cross market manipulation creates new challenges for fake review detection. On the one hand, a detector that analyzes each market independently may miss coordinated manipulation whose evidence is weak within a single store but clearer after the same app is aligned across markets. On the other hand, it may produce false alarms by treating market specific variations as suspicious without considering whether similar patterns also appear in other stores. Therefore, fake review detection in mobile app ecosystems requires cross market analysis that jointly models review content, ratings, timestamps, rankings, and app metadata across stores.

To address this problem, we propose CrossCFD, a cross market fake review detection framework that leverages inconsistencies in the review patterns of the same app across multiple app markets. Our key observation is that legitimate apps tend to exhibit relatively consistent review characteristics across markets, whereas coordinated promotional campaigns often disrupt such consistency by reusing similar review templates, posting duplicated reviews within short time windows, or selectively inflating ratings in markets with weaker moderation. CrossCFD first collects review related records of the same app from multiple markets, such as review text, ratings, timestamps, rankings, and app metadata. It then cleans and aligns these records, extracts review features and cross market inconsistency features, and uses the fused features for fake review detection. By jointly modeling evidence within each market and discrepancies across markets, CrossCFD can identify promotional fake reviews that are difficult to capture from evidence in a single market alone.

We evaluated CrossCFD on a benchmark dataset of 9,000 reviews and further applied it to a large scale dataset containing approximately 1.5 million reviews from 175 apps across eight major app markets. Our evaluation shows that cross market information provides substantial benefits for detecting fake reviews. On the benchmark, CrossCFD achieves 93.5% precision and 93.6% recall, outperforming baselines that use only single market evidence and review text. The large scale measurement further reveals that potential fake reviews remain widespread across mobile app markets, with clear differences across markets and app categories. These findings suggest that cross market analysis is not only useful for improving detection accuracy, but also necessary for understanding how review manipulation campaigns operate across fragmented app ecosystems. The main contributions of this paper are as follows:

(1): We formalize a 15 dimensional feature set for cross market fake review detection. The feature set retains four single market features, enhances two semantic features, and designs nine cross market features to capture inconsistency and coordination patterns across app markets.
(2): We design CrossCFD, a fake review detector based on feature fusion. CrossCFD extends conventional fake review classifiers from single market detection to cross market detection by integrating local review evidence, semantic evidence, and cross market inconsistency evidence into a unified supervised model.
(3): We conduct a large scale measurement study of fake reviews across app markets. Applying CrossCFD to 1,566,296 reviews from 175 apps across eight markets, we identify 334,778 potentially fake reviews and analyze their distribution across markets, categories, and apps.

2. Related Work

2.1. Fake Review Detection

Fake review detection has been studied through linguistic, behavioral, relational, and neural signals. Early studies formulated deceptive reviews as opinion spam and exploited duplicated or near duplicated content, rating behaviors, and reviewer activities as manipulation signals [6,7]. Subsequent work further integrated textual, rating, temporal, and behavioral evidence to detect deceptive reviews and suspicious reviewers [8,9,10], and showed that inconsistency among ratings, sentiment, and review content is informative for fake review detection [4]. Recent studies have moved beyond handcrafted features and increasingly adopt graph models, neural models, and feature fusion methods. Graph based methods construct reviewer, review, and product graphs, reviewer and product interaction graphs, or social context graphs to capture collusion, camouflage, and propagation patterns [11,12,13,16]. Feature fusion and neural methods integrate review text, emotions, ratings, reviewer behavior, and semantic representations to improve robustness over classifiers that rely only on review text [14,15]. Transformer based detectors using BERT or RoBERTa encoders further improve semantic modeling of deceptive content [17,18,19]. Recent studies on reviews generated by large language models (LLMs) show that fluent machine generated reviews can weaken detectors that rely mainly on surface level linguistic cues [21,22,32]. Surveys also conclude that modern fake review detection increasingly depends on multiple sources of evidence, graph structures, and hybrid neural architectures rather than review text alone [5,16].

In mobile app markets, fake review detection has received particular attention because reviews and ratings directly affect app visibility, ranking, recommendation, and installation decisions. Martens [1] analyzed fake reviews in app stores through review content, ratings, and review provider behaviors, while Wang [24] studied removed reviews in the iOS App Store as moderation evidence to understand review manipulation. Related studies on app reviews analyze review bursts, rating anomalies, reviewer behaviors, and review content to support app quality analysis and fraud detection [2,3,23,25,26]. Evidence across platforms has also been explored in adjacent settings. OneReview correlates reviews of the same merchant across multiple crowd sourced review websites to detect suspicious reputation shifts across platforms [27], while studies on domain and language transfer examine model transfer across domains, platforms, and languages [20,28].

2.2. Ranking Fraud and App Promotion Abuse

Ranking fraud and app promotion abuse study how adversaries manipulate platform visibility across ranking, recommendation, and search systems. Kumar [29] investigated fake reviews and fake ranking in mobile app markets, highlighting that fraudulent developers may manipulate app visibility through deceptive reviews, ranking distortion, and abnormal promotional behaviors. Chen [30] further showed that malicious app promotion can be coordinated through fake reviews, ratings, installs, and abnormal popularity evidence. Black hat App Store Optimization studies provide a more operational view of this ecosystem. Farooqi [33] studied incentivized mobile app installs as a mechanism for distorting app popularity. Hernandez [3] measured ASO deception services that use bulk installs and fake reviews to manipulate app visibility. More recently, Fan [31] showed that similar app store ranking fraud risks may also arise in LLM app stores through fake ratings, downloads, reviews, and keyword stuffing. These studies show that app promotion abuse is not limited to isolated fake reviews, but combines review manipulation, rating manipulation, installation inflation, retention manipulation, and keyword optimization. Recent platform evidence also suggests that fraudulent ratings, reviews, and chart or search manipulation remain persistent operational threats in app markets [34]. Similar visibility manipulation has also been studied beyond app stores. Web spam and black hat SEO manipulate ranking evidence through keyword stuffing, cloaking, link schemes, affiliate spam, and low quality optimized content [35,36,37,38]. Bevendorff [39] conducted a longitudinal study of SEO spam and found that highly optimized affiliate content remains a persistent challenge for search quality. With the rise of generative search, Aggarwal [40] introduced Generative Engine Optimization, where content is optimized for visibility in LLM generated answers rather than traditional ranked lists.

3. Approach

3.1. High Level Overview

CrossCFD targets fake review detection for apps distributed across multiple markets. Its core insight is that coordinated promotion may be less visible within a single market but becomes clearer when the same app is examined across markets. Such campaigns often leave traces across markets, such as reused review templates, synchronized posting, inconsistent ratings, and market specific manipulation patterns. As shown in Figure 2, the framework contains four stages. The first stage collects review data of the same app from multiple markets. The second stage cleans and normalizes heterogeneous records, then constructs labels through LLM assisted human verification. The third stage extracts review level features and cross market inconsistency features. The final stage fuses these features into a supervised classifier for fake review detection.

3.2. Multi Market Crawling

A target app is included only when it can be reliably matched across at least two app markets. For each candidate app, crawlers query app names and package identifiers to retrieve potential listings from each market. Because app markets provide different interfaces, we use structured API crawling and UI rendering when needed. Only apps with consistent evidence across markets are retained. For each validated app and market pair, the crawler collects review text, rating, timestamp, ranking, and app metadata, while preserving the source market identifier for cross market alignment and feature computation.

3.3. Preprocessing and LLM Analysis

The collected review records are preprocessed by removing records with missing text, invalid ratings, invalid timestamps, very short content, market specific artifacts, duplicated interface text, and developer replies. Ratings are normalized to a unified scale, and timestamps are converted into a consistent format for cross market comparison.

After preprocessing, we construct a labeled benchmark through LLM assisted annotation and human verification. Specifically, 11,832 candidate reviews are sampled from nine apps across three representative categories, and each review is annotated by Qwen 2.5 Max [41] using the app description, feature list, category, and market metadata as context. The LLM provides preliminary labels and rationales based on app content consistency, review specificity, sentiment rating consistency, and generic promotional expressions. Final labels are assigned by three human annotators after reviewing the LLM outputs and app context. A review is labeled as suspicious if it conflicts with the app functionality, contains unsupported promotional claims, shows sentiment rating inconsistency, or exhibits duplication or temporal coordination with other reviews. After removing ambiguous and inconsistent cases, 9,000 high confidence reviews are retained for model training and evaluation.

3.4. Feature Extraction

Each review is represented by a 15 dimensional feature vector. The feature construction starts from 11 conventional fake review features. Among them, four features are retained as traditional single market features, denoted as $tf_{1}$ to $tf_{4}$, because they directly describe basic review properties such as rating, length, sentiment consistency, and review burst behavior. Two features are further enhanced as $ef_{1}$ and $ef_{2}$, because simple lexicon or surface based measurements are insufficient to capture semantic sentiment and language complexity in app reviews. Several conventional features are then extended to the cross market setting, forming $cf_{3}$ to $cf_{8}$, to capture repeated content, rating discrepancies, ranking discrepancies, and review growth differences across markets. Finally, three new cross market features, $cf_{1}$, $cf_{2}$, and $cf_{9}$, are introduced to describe sentiment inconsistency and temporal coordination patterns that can only be observed after aligning the same app across multiple markets.

For the $i$-th review $c_{i}$ of an app, the feature vector is denoted as $F(c_{i}) = \{tf_{1}^{i}, \dots, tf_{4}^{i}, ef_{1}^{i}, ef_{2}^{i}, cf_{1}^{i}, \dots, cf_{9}^{i}\}$, where $tf^{i}$, $ef^{i}$, and $cf^{i}$ denote traditional single market features, enhanced single market features, and cross market features, respectively. Figure 3 summarizes the feature construction process.

Single Market Features. Single market features provide basic evidence from individual reviews and local review activity within a single market. For review $c_{i}$, $A_{i}$ denotes its app, $r_{i}$ denotes the original rating, and $\tilde{r}_{i} = (r_{i} - 1) / 4$ denotes the normalized rating.

$tf_{1}^{i}$: App rating score. The app rating score is defined as $tf_{1}^{i} = \tilde{r}_{i}$.

$tf_{2}^{i}$: Sentiment rating inconsistency. Fake reviews may exhibit a mismatch between the sentiment expressed in the review text and the assigned rating. Such inconsistency can indicate abnormal review behavior or mechanically generated promotional content. Therefore, the sentiment rating inconsistency feature is defined as $tf_{2}^{i} = |e_{i} - \tilde{r}_{i}|$, where $e_{i} \in [0, 1]$ denotes the sentiment score of review $c_{i}$. A larger value of $tf_{2}^{i}$ indicates a stronger mismatch between textual sentiment and numerical rating.

$tf_{3}^{i}$: App review length. The app review length feature is defined as $tf_{3}^{i} = L_{i}$, where $L_{i}$ denotes the character length of review $c_{i}$.
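The three per review features above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the rating is assumed to lie on a 1 to 5 scale (which is what the $(r_{i} - 1)/4$ normalization suggests), and the sentiment score $e_{i}$ is assumed to be supplied by a separate model.

```python
def normalized_rating(rating: int) -> float:
    """Map a star rating onto [0, 1] via (r - 1) / 4 (assumes a 1-5 scale)."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be in 1..5")
    return (rating - 1) / 4


def basic_features(text: str, rating: int, sentiment: float) -> dict:
    """Return tf1, tf2, tf3 for one review.

    `sentiment` is assumed to be a score in [0, 1] computed elsewhere.
    """
    r_tilde = normalized_rating(rating)
    return {
        "tf1": r_tilde,                   # normalized rating
        "tf2": abs(sentiment - r_tilde),  # sentiment rating inconsistency
        "tf3": len(text),                 # character length of the review
    }
```

For instance, a five star review whose text is scored as mostly negative (`sentiment = 0.25`) yields a large inconsistency value of 0.75.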

$tf_{4}^{i}$: App review count anomaly. Fake reviews often appear in short bursts, leading to abnormal changes in daily review counts. The feature is computed with an isolation forest and is defined as $tf_{4}^{i} = a_{i} = 2^{-E(h(x)) / c(n)}$, where $x$ is the daily review count, $E(h(x))$ denotes the expected path length of $x$ through the isolation trees, and $c(n)$ is the average path length for a sample size of $n$. A larger anomaly score indicates a stronger deviation from normal review volume patterns.
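To make the scoring formula concrete, the sketch below evaluates $2^{-E(h(x))/c(n)}$ for a given expected path length. It does not build the isolation trees themselves. The expression used for $c(n)$ is the standard one from the isolation forest literature, $c(n) = 2H(n-1) - 2(n-1)/n$ with $H(k) \approx \ln k + 0.5772$; the text does not spell it out, so treating it as the authors' choice is an assumption, and the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful binary search tree lookup.

    Standard isolation forest normalizer (assumed, not stated in the text).
    """
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(expected_path_length: float, n: int) -> float:
    """a = 2 ** (-E(h(x)) / c(n)); shorter paths give scores closer to 1."""
    c_n = average_path_length(n)
    if c_n == 0.0:
        raise ValueError("sample size must be at least 2")
    return 2.0 ** (-expected_path_length / c_n)
```

A daily review count that is isolated after only a few splits has a short expected path and therefore a score close to 1, whereas a count whose path length equals $c(n)$ scores exactly 0.5.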

Enhanced Single Market Features. These features characterize the semantic and linguistic properties of individual reviews.

$ef_{1}^{i}$: Sentiment score. Promotional fake reviews often contain strong emotional expressions to influence user installation decisions. The sentiment score is obtained using a Chinese sentiment classifier based on StructBERT-base-chinese [42], trained on four review datasets with 115K samples. The feature is defined as $ef_{1}^{i} = e_{i}$, where $e_{i} \in [0, 1]$ denotes the predicted sentiment score of review $c_{i}$. Larger values indicate more positive sentiment. Table 1 gives representative examples.

$ef_{2}^{i}$: Language complexity. Fake reviews may contain rigid templates, repeated promotional wording, or unnatural expressions that differ from ordinary user feedback. We measure language complexity using a RoBERTa language assessment model [43]. For review $c_{i}$, the feature is defined as $ef_{2}^{i} = p_{i} = 2^{-\ell_{i}}$, where $\ell_{i} = \frac{1}{N} \sum_{k=1}^{N} \log_{2} P(w_{k} \mid w_{1}, \dots, w_{k-1})$ denotes the normalized log probability of the review, $w_{k}$ denotes the token at position $k$, $N$ is the number of tokens, and $p_{i}$ is the resulting perplexity value. In implementation, $p_{i}$ is clipped to $[18000, 26000]$ for numerical stability, with larger values indicating lower language model likelihood and higher linguistic complexity. We keep the clipped perplexity value without additional normalization because its absolute scale reflects the output range of the language assessment model and preserves magnitude differences among reviews. Table 1 gives representative examples.
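The perplexity computation and clipping step can be illustrated as follows. The sketch assumes the per token conditional probabilities have already been obtained from a language model; how the RoBERTa model produces them is not covered here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """p = 2 ** (-l), with l the mean log2 probability over the N tokens."""
    if not token_probs:
        raise ValueError("review has no tokens")
    mean_log2 = sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2.0 ** (-mean_log2)


def clipped_perplexity(token_probs: Sequence[float],
                       low: float = 18000.0,
                       high: float = 26000.0) -> float:
    """Clip the perplexity to [low, high], the range quoted in the text."""
    return min(max(perplexity(token_probs), low), high)
```

With every token assigned probability 0.25, the mean log probability is -2 and the unclipped perplexity is 4; after clipping, any value below the lower bound is reported as 18000.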

Cross Market Features. These features capture discrepancy and synchronization patterns of the same app across multiple markets. For review $c_{i}$, all cross market features are computed over its associated app $A_{i}$ and then assigned to $c_{i}$.

$cf_{1}^{i}$: Cross market sentiment discrepancy. This feature measures the variation of average sentiment across markets for app $A_{i}$. It is defined as $cf_{1}^{i} = \sigma_{A_{i}} = \sqrt{\frac{1}{m_{A_{i}} - 1} \sum_{j=1}^{m_{A_{i}}} (\mu_{A_{i} j} - \mu_{A_{i}})^{2}}$, where $\mu_{A_{i} j} = \frac{1}{n_{A_{i} j}} \sum_{k=1}^{n_{A_{i} j}} e_{A_{i} j k}$ is the average sentiment score of app $A_{i}$ in market $j$, $e_{A_{i} j k}$ is the sentiment score of the $k$-th review of app $A_{i}$ in market $j$, $n_{A_{i} j}$ is the number of reviews of app $A_{i}$ in market $j$, $\mu_{A_{i}} = \frac{1}{m_{A_{i}}} \sum_{j=1}^{m_{A_{i}}} \mu_{A_{i} j}$ is the cross market mean sentiment, and $m_{A_{i}}$ is the number of markets where app $A_{i}$ is listed. A larger value indicates stronger sentiment discrepancy across markets.
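As a small illustration of this definition, the sketch below computes the sample standard deviation (denominator $m_{A_{i}} - 1$) of the per market mean sentiment. The input layout, a mapping from market name to a list of sentiment scores, and the zero fallback for fewer than two markets are assumptions made for the example.

```python
from statistics import mean, stdev


def sentiment_discrepancy(scores_by_market: dict) -> float:
    """Sample standard deviation of per market mean sentiment for one app."""
    market_means = [mean(scores) for scores in scores_by_market.values() if scores]
    if len(market_means) < 2:
        # Assumption: the statistic is undefined for one market, so return 0.
        return 0.0
    return stdev(market_means)  # stdev uses the (m - 1) denominator
```

Because the paper only retains apps matched across at least two markets, the fallback branch should rarely be reached in practice.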

$cf_{2}^{i}$: Cross market temporal variance of duplicate app reviews. This feature measures whether duplicated reviews are posted synchronously across markets. It is defined as $cf_{2}^{i} = d_{i} = \frac{1}{|T_{\mathrm{dup}}(i)|} \sum_{t \in T_{\mathrm{dup}}(i)} (t - \bar{t}_{\mathrm{dup}}(i))^{2}$, where $T_{\mathrm{dup}}(i)$ is the set of timestamps of reviews whose content is identical to review $c_{i}$, and $\bar{t}_{\mathrm{dup}}(i) = \frac{1}{|T_{\mathrm{dup}}(i)|} \sum_{t \in T_{\mathrm{dup}}(i)} t$ is their average timestamp. If no duplicated review is observed, $cf_{2}^{i}$ is set to 0.

$cf_{3}^{i}$: Cross market count of duplicate app reviews. This feature captures exact content reuse across markets. It is defined as $cf_{3}^{i} = n_{i}$, where $n_{i}$ denotes the number of reviews whose content is identical to review $c_{i}$ across all markets where app $A_{i}$ is listed.
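The two duplicate based features, $cf_{2}^{i}$ and $cf_{3}^{i}$, can be computed together by grouping reviews of one app by their exact text. The sketch below is illustrative only. It assumes timestamps are expressed as numbers (for example, days), that the duplicate count excludes the review itself, and that the timestamp variance is taken over the whole group of identical reviews including $c_{i}$; the text leaves these conventions open.

```python
from collections import defaultdict
from statistics import pvariance


def duplicate_features(reviews):
    """Return a list of (cf3, cf2) pairs, one per review.

    `reviews` is a list of (text, timestamp) tuples pooled over all markets
    of a single app; timestamps are numeric (an assumption of this sketch).
    """
    times_by_text = defaultdict(list)
    for text, timestamp in reviews:
        times_by_text[text].append(timestamp)

    features = []
    for text, _ in reviews:
        group = times_by_text[text]
        if len(group) < 2:
            features.append((0, 0.0))  # no duplicate observed
        else:
            # pvariance divides by |T|, matching the 1/|T_dup| factor.
            features.append((len(group) - 1, pvariance(group)))
    return features
```

A small variance means the identical reviews were posted close together in time, while a count of zero marks a review whose text appears only once.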

$cf_{4}^{i}$: Cross market count of similar app reviews. This feature captures cross market review reuse with shared templates or minor textual variations. It is defined as $cf_{4}^{i} = q_{i} = \sum_{c' \in C_{A_{i}} \setminus \{c_{i}\}} I(\mathrm{sim}(c_{i}, c') > \theta)$, where $C_{A_{i}}$ denotes the set of reviews of app $A_{i}$ across all markets, $\mathrm{sim}(\cdot)$ is the Hamming-distance-based similarity score, $\theta = 0.75$ is the similarity threshold, and $I(\cdot)$ is the indicator function.
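The text does not state how reviews are fingerprinted before the Hamming distance is taken. One common choice, used here purely as an assumption, is a 64 bit SimHash over character bigrams, with similarity defined as one minus the fraction of differing bits. The sketch below counts reviews whose similarity to a target review exceeds $\theta$; all function names are hypothetical.

```python
import hashlib

BITS = 64


def simhash(text: str, n: int = 2) -> int:
    """64 bit SimHash over character n-grams (assumed fingerprinting scheme)."""
    grams = [text[k:k + n] for k in range(max(len(text) - n + 1, 1))]
    weights = [0] * BITS
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(BITS):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming_similarity(a: int, b: int) -> float:
    """1 minus the normalized Hamming distance between two fingerprints."""
    return 1.0 - bin(a ^ b).count("1") / BITS


def similar_count(index: int, texts: list, theta: float = 0.75) -> int:
    """Number of other reviews whose similarity to texts[index] exceeds theta."""
    target = simhash(texts[index])
    return sum(
        1
        for k, other in enumerate(texts)
        if k != index and hamming_similarity(target, simhash(other)) > theta
    )
```

Under this definition identical reviews have similarity 1 and are therefore also counted. The pairwise loop is quadratic in the number of reviews, so a large scale run would likely need an index over fingerprints.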

$cf_{5}^{i}$: Cross market app ranking discrepancy. This feature measures whether the same app exhibits inconsistent ranking dynamics across markets. It is defined as $cf_{5}^{i} = rd_{A_{i}} = \frac{\mathrm{SB}_{A_{i}}}{\mathrm{SW}_{A_{i}}}$, where $\mathrm{SB}_{A_{i}}$ denotes the mean square between markets and $\mathrm{SW}_{A_{i}}$ denotes the mean square within markets, both computed from ANOVA over the daily ranking changes of app $A_{i}$. The daily ranking change is $\Delta R_{A_{i} j t} = R_{A_{i} j t} - R_{A_{i} j (t-1)}$, where $R_{A_{i} j t}$ denotes the ranking of app $A_{i}$ in market $j$ at time $t$. A larger value indicates stronger cross market divergence in ranking dynamics.
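The ratio above is the usual one way ANOVA F statistic with markets as groups. A minimal sketch follows, assuming each market supplies a chronologically ordered list of daily ranks. The degrees of freedom ($m - 1$ between markets, $N - m$ within markets) follow the standard ANOVA definition, and the fallback values for degenerate inputs are assumptions of this example.

```python
def ranking_discrepancy(ranks_by_market: dict) -> float:
    """Mean square between markets divided by mean square within markets."""
    groups = []
    for ranks in ranks_by_market.values():
        deltas = [later - earlier for earlier, later in zip(ranks, ranks[1:])]
        if deltas:
            groups.append(deltas)

    m = len(groups)
    n_total = sum(len(g) for g in groups)
    if m < 2 or n_total <= m:
        return 0.0  # assumption: not enough data for ANOVA

    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ss_within = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)

    ms_between = ss_between / (m - 1)
    ms_within = ss_within / (n_total - m)
    if ms_within == 0:
        # Assumption: identical changes inside every market.
        return float("inf") if ms_between > 0 else 0.0
    return ms_between / ms_within
```

For two markets with daily changes [1, 2] and [5, 3], the mean square between markets is 6.25 and the mean square within markets is 1.25, so the feature equals 5.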

$cf_{6}^{i}$: Cross market app review length discrepancy. This feature measures whether review verbosity differs across markets. It is defined as $cf_{6}^{i} = ld_{A_{i}} = \mathrm{Var}(\bar{L}_{A_{i} 1}, \dots, \bar{L}_{A_{i} m_{A_{i}}})$, where $\bar{L}_{A_{i} j}$ is the average review length of app $A_{i}$ in market $j$, and $m_{A_{i}}$ is the number of markets where app $A_{i}$ is listed.

$cf_{7}^{i}$: Cross market app burst discrepancy. This feature captures abnormal short term growth in review volume across markets. For app $A_{i}$ in market $j$, the burst ratio is defined as $B_{A_{i} j} = \frac{\max_{t}(N_{A_{i} j, t})}{\bar{N}_{A_{i} j}}$, where $N_{A_{i} j, t}$ is the number of reviews of app $A_{i}$ in market $j$ on day $t$, and $\bar{N}_{A_{i} j}$ is the average daily review count. The feature is defined as $cf_{7}^{i} = \mathrm{Var}(B_{A_{i} 1}, B_{A_{i} 2}, \dots, B_{A_{i} m_{A_{i}}})$. A higher value indicates stronger inconsistency in burst review behaviors across markets.
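A short sketch of the burst ratio and its variance across markets is given below. Whether $\mathrm{Var}$ denotes the population or the sample variance is not specified; the population variance is used here as an assumption, and the function names and input layout are illustrative.

```python
from statistics import mean, pvariance


def burst_ratio(daily_counts) -> float:
    """Peak daily review count divided by the average daily review count."""
    average = mean(daily_counts)
    return max(daily_counts) / average if average > 0 else 0.0


def burst_discrepancy(daily_counts_by_market: dict) -> float:
    """Variance of per market burst ratios (population variance assumed)."""
    ratios = [burst_ratio(c) for c in daily_counts_by_market.values() if c]
    return pvariance(ratios) if len(ratios) >= 2 else 0.0
```

A market with flat daily counts has a burst ratio of 1, so an app that spikes in only one store produces a large variance across stores.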

$cf_{8}^{i}$: Cross market app rating discrepancy. This feature measures the variation of app ratings across markets. It is defined as $cf_{8}^{i} = sd_{A_{i}}$, where $sd_{A_{i}}$ is the standard deviation of the average ratings of app $A_{i}$ across all markets. A larger value indicates stronger rating inconsistency across markets.

$cf_{9}^{i}$: Cross market temporal variance of similar app reviews. This feature extends $cf_{2}^{i}$ from exact duplicates to semantically or lexically similar reviews. Review similarity is computed using a Hamming distance based similarity score with a threshold of $\theta = 0.75$. The feature is defined as $cf_{9}^{i} = v_{i} = \frac{1}{|T_{\mathrm{sim}}(i)|} \sum_{t \in T_{\mathrm{sim}}(i)} (t - \bar{t}_{\mathrm{sim}}(i))^{2}$, where $T_{\mathrm{sim}}(i)$ is the timestamp set of reviews similar to $c_{i}$, and $\bar{t}_{\mathrm{sim}}(i) = \frac{1}{|T_{\mathrm{sim}}(i)|} \sum_{t \in T_{\mathrm{sim}}(i)} t$ is their average timestamp. If no similar review is observed, $cf_{9}^{i}$ is set to 0.

3.5. Fake Review Detection

Fake review detection is formulated as a supervised binary classification task. For each review $c_{i}$, CrossCFD takes the feature vector $F(c_{i})$ as input and predicts $\hat{y}_{i} = f(F(c_{i}))$, where $\hat{y}_{i} = 1$ denotes a fake review and $\hat{y}_{i} = 0$ denotes a genuine review. The feature vector consists of single market, enhanced single market, and cross market features. The classifier $f(\cdot)$ is selected from five supervised models.

4. Evaluation

4.1. Evaluation Setup

We evaluate CrossCFD on the labeled benchmark described in Section 3.3. The dataset contains 9,000 labeled reviews, with 6,750 reviews used for training and 2,250 reviews reserved for testing. Features are extracted for each review following Section 3.4. Model selection and hyperparameter tuning are conducted on the training set using five fold cross validation, while the test set is used only for final evaluation.

4.2. Detection Evaluation

The detection experiment evaluates whether cross market inconsistency features improve fake review classification beyond baselines that use only review text or evidence from a single market. We compare three feature settings: text only features (word2vec), review features combined with word2vec representations, and the full CrossCFD feature set. Table 2 reports the detection results. Across all classifiers, the full CrossCFD feature set achieves better performance than the baselines that use only review text or review evidence. Gradient Boosting obtains the best overall result among the evaluated classifiers, reaching 93.5% precision, 93.6% recall, and 93.5% F1 score.

Figure 4 reports the ROC curve of the selected Gradient Boosting detector. CrossCFD achieves an AUC of 0.976, showing clear separation between fake and genuine reviews. When the true positive rate reaches 90%, the false positive rate is 8.6%, indicating that the detector can identify most fake reviews while keeping false alarms relatively low.

4.3. Generalization Evaluation

To evaluate robustness under distribution shifts, CrossCFD is tested in two settings, Only app and Only market. In the Only app setting, reviews from one app are used only for testing, while reviews from the remaining apps are used for training. In the Only market setting, reviews from one app store are used only for testing, while reviews from the remaining markets are used for training. These settings are stricter than the standard split because they avoid overlap of the same app or the same market between training and testing.
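The two hold-out settings can be expressed with one grouping helper. The sketch assumes each labeled review is a dictionary carrying `app` and `market` keys, which are hypothetical field names, and it rotates the held-out group over every distinct value; the text does not say whether every app or market is held out in turn.

```python
def leave_one_group_out(records, key):
    """Yield (held_out_value, train, test) for each distinct value of `key`.

    Use key="app" for the Only app setting and key="market" for the
    Only market setting.
    """
    values = sorted({record[key] for record in records})
    for held_out in values:
        train = [r for r in records if r[key] != held_out]
        test = [r for r in records if r[key] == held_out]
        yield held_out, train, test
```

By construction no record from the held-out app or market appears in the training split, which is the property that makes these settings stricter than a random split.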

CrossCFD is compared with a baseline that uses only evidence from a single market. Precision, recall, F1 score, and false negative rate are reported. Recall and FNR are emphasized because missed detections correspond to fake reviews that remain unfiltered in deployment. In both settings, cross market features are computed from available unlabeled observations across markets, while labels from the target app or target market are used only for evaluation.
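For reference, the four reported metrics follow directly from confusion counts, with fake reviews as the positive class. The helper below is a generic sketch rather than the authors' evaluation script; it also makes explicit that the false negative rate is the complement of recall.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, F1, and false negative rate from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0  # equals 1 - recall
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}
```

Because FNR equals one minus recall, reducing missed fake reviews and improving recall are the same objective viewed from two sides.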

Table 3 reports the generalization results. Compared with the single market baseline, CrossCFD consistently improves recall and reduces false negative rates. In the app level setting, CrossCFD improves F1 score from 78.4% to 88.0% and reduces FNR from 23.0% to 13.8%. In the market level setting, CrossCFD improves F1 score from 74.5% to 82.7% and reduces FNR from 31.0% to 17.3%. The recall improvement is especially clear under the market level setting, suggesting that cross market inconsistency features help recover fake reviews that are missed when only evidence from a single market is used.

4.4. Feature Analysis

This experiment analyzes whether the proposed features show clear differences between fake and genuine reviews. Rather than focusing only on overall classification performance, this analysis examines the feature space itself, covering enhanced review features, repeated or similar review content across markets, and representative cross market inconsistency features.

Enhanced single market features. The sentiment score $ef_{1}$ captures the emotional intensity of a review. Promotional fake reviews often use exaggerated emotional expressions to influence user decisions, making accurate sentiment modeling important. Table 4 compares the StructBERT based sentiment model used in CrossCFD with a dictionary based method. The StructBERT based model improves accuracy from 81.7% to 90.6%, precision from 84.4% to 93.5%, and F1 score from 80.9% to 90.3%, indicating that contextual sentiment modeling provides a stronger signal than lexicon matching.

The language complexity feature $ef_{2}$ captures stylistic regularity. Figure 5 shows the KDE distributions of language complexity for fake and genuine reviews. Genuine reviews span a wider range, reflecting more diverse user expressions, whereas fake reviews are concentrated in a narrower region, suggesting more homogeneous and templated writing.

Cross market propagation of duplicated and similar reviews. To examine whether review reuse provides a cross market signal, we analyze an example app, com.zzjr.niubanjin, across three markets, namely 360 Mobile Assistant, Baidu App Store, and Yingyongbao. After removing null records and reviews shorter than six characters, each review is represented as a node, and an edge is added when two reviews are identical or similar. As shown in Figure 6, duplicated and similar reviews propagate across markets rather than remaining isolated within a single market.

The association between repeated or similar review content across markets and fake reviews is substantial. Among 137 identical reviews appearing in two markets, 110 are fake reviews, accounting for 80.3%. Among 26 identical reviews appearing in three markets, 22 are fake reviews, accounting for 84.6%. For similar reviews, 198 of 285 reviews appearing across two markets are fake reviews, accounting for 69.5%. In addition, 55 of 67 similar reviews appearing across three markets are fake reviews, accounting for 82.1%. Moreover, more than 65% of identical or similar reviews are posted across markets within 12 days. These results support the use of duplicate count, similar review count, and temporal coordination as cross market features.

Discriminativeness of cross market features. We further inspect three representative cross market features, namely cross market count of similar app reviews $cf_{4}$, cross market app rating score discrepancy $cf_{8}$, and cross market temporal variance of similar app reviews $cf_{9}$. As shown in Figure 7(a), fake reviews tend to have higher counts of similar reviews, indicating repeated or template based review propagation across markets. Figure 7(b) shows that fake reviews are associated with larger rating score discrepancies across markets, while real reviews are more concentrated at lower discrepancy values. Figure 7(c) shows that fake reviews exhibit larger temporal variance among similar reviews, suggesting that suspicious review content may be distributed across different markets and time windows.
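To make the three features concrete, the sketch below computes plausible versions of them. The formal definitions are given elsewhere in the paper and are not reproduced in this section, so every formula here is an assumption: the count is taken over similar reviews found in other markets, the rating discrepancy is the gap between the highest and lowest rating after normalizing by each store's full score (Figure 1 notes that the full score is 10 in 360 Mobile Assistant and 5 elsewhere), and the temporal variance is the population variance of posting days. All function names and sample values are hypothetical.

```python
from statistics import pvariance


def similar_count(matches, own_market):
    """Assumed cf_4: number of similar reviews found in other markets.

    matches: list of (market, day_offset) for reviews similar to the target.
    """
    return sum(1 for market, _ in matches if market != own_market)


def rating_discrepancy(ratings):
    """Assumed cf_8: max minus min of the app rating across markets.

    ratings: {market: (score, full_score)}; scores are normalized to [0, 1]
    so that stores with different rating scales are comparable.
    """
    normalized = [score / full for score, full in ratings.values()]
    return max(normalized) - min(normalized)


def temporal_variance(matches):
    """Assumed cf_9: population variance of posting days of similar reviews."""
    days = [day for _, day in matches]
    return pvariance(days) if len(days) >= 2 else 0.0


# Hypothetical inputs for one target review posted in Huawei.
example_matches = [("Huawei", 0), ("OPPO", 4), ("OPPO", 8)]
example_ratings = {
    "360 Mobile Assistant": (9.0, 10),
    "Huawei": (3.5, 5),
    "OPPO": (4.5, 5),
}
cf4 = similar_count(example_matches, own_market="Huawei")
cf8 = rating_discrepancy(example_ratings)
cf9 = temporal_variance(example_matches)
```

Under these assumptions, larger values of all three quantities point in the direction that Figure 7 associates with fake reviews.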

5. Measuring Fake Reviews in Multiple Mobile App Stores

To better understand fake review activity across real app markets, we conduct a large scale measurement study. The study is guided by the following research questions.

RQ1. How prevalent are fake reviews in app markets and which markets show stronger manipulation patterns?

We measure the overall scale of fake reviews, compare fake review proportions across markets, and analyze synchronization patterns across markets.

RQ2. Which app categories are more frequently targeted by fake review campaigns across markets?

We examine category differences in fake review prevalence and identify whether commercially sensitive categories, such as dating, gaming, and finance, are more affected.

RQ3. How is fake review prevalence related to app popularity and market distribution?

We analyze the relationship between fake review ratios and app popularity, and further characterize how apps with high fake review ratios distribute suspicious reviews across multiple markets.

5.1. Measurement Dataset

The measurement study is based on three datasets:

Hu et al. dataset. [45] This historical dataset contains 26,735,573 reviews from five Android app markets, including 360 Mobile Assistant, Baidu App Store, Wandoujia, Xiaomi, and Yingyongbao. It is used to characterize cross market duplicate and similar review propagation.
CHAMP dataset. [44] This dataset contains more than 730K reviews from major domestic app markets, including Huawei, Meizu, OPPO, VIVO, Xiaomi, and Yingyongbao. It complements the Hu et al. dataset by expanding market coverage for category analysis.
Recent review dataset. This dataset is collected from eight app markets (listed in footnote 1) between January 1, 2024 and January 1, 2025. Guided by the historical analysis, we focus on dating, gaming, and finance apps selected from rank-gain and rank-drop lists. Candidate apps are matched across markets using app names and package identifiers, and are further verified with available app metadata. Apps appearing in fewer than two markets are discarded. The final dataset contains 175 apps and 1,566,296 reviews. Each record includes review text, rating score, published time, ranking score, market source, and available app metadata.
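The app alignment and filtering step for the recent review dataset can be sketched as below. This is a simplified illustration under stated assumptions: it keys only on the package identifier, whereas the text also uses app names and a metadata verification step that are not modeled here. The listing tuples and the second package name are hypothetical.

```python
from collections import defaultdict

MIN_MARKETS = 2  # from the text: apps in fewer than two markets are discarded


def align_apps(listings):
    """listings: iterable of (market, package_id, app_name).

    Groups listings by package identifier and keeps only apps that are
    listed in at least MIN_MARKETS distinct markets.
    """
    markets_by_app = defaultdict(set)
    for market, package_id, _name in listings:
        markets_by_app[package_id.strip()].add(market)
    return {
        pkg: sorted(markets)
        for pkg, markets in markets_by_app.items()
        if len(markets) >= MIN_MARKETS
    }


# Hypothetical listings; app names are placeholders.
sample_listings = [
    ("Huawei", "com.leniu.shdl", "app A"),
    ("OPPO", "com.leniu.shdl ", "app A"),       # trailing space is stripped
    ("OPPO", "com.example.single", "app B"),    # only one market: discarded
]
aligned = align_apps(sample_listings)
```

A production pipeline would likely add fuzzy name matching and developer checks, since package identifiers can be missing or inconsistent in some stores.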

Overall, these three datasets provide complementary coverage for the measurement study, where the Hu et al. dataset supports historical cross market propagation analysis, the CHAMP dataset expands category and market coverage, and the recent review dataset enables focused analysis of recent dating, gaming, and finance apps across eight markets.

5.2. RQ1: Prevalence and Market Manipulation Patterns

We first investigate the distribution of potential fake reviews across app markets and identify markets with stronger manipulation and synchronization patterns. To answer RQ1, we apply the trained CrossCFD detector to the recent review dataset, compare proportions of suspicious reviews across markets, and use the CMS score to examine whether suspicious reviews in one market tend to co-occur with suspicious activity in other markets.

Overall prevalence of fake reviews. Table 5 reports the estimated fake review proportions across app stores in the recent review dataset. CrossCFD identifies 334,778 fake reviews in total, indicating that reviews classified as fake remain prevalent in the current mobile app ecosystem. The distribution is highly uneven across markets. OPPO has the lowest fake review proportion, with 16.36% of its reviews classified as fake. VIVO, Xiaomi, Huawei, and App Store show moderate fake review proportions, ranging from 22.90% to 29.48%. In contrast, Yingyongbao exhibits a substantially higher fake review proportion of 60.15%, while 360 Mobile Assistant and Meizu reach 94.69% and 97.39%, respectively. These results indicate that fake review manipulation varies substantially across app stores. Large markets still contain many suspected fake reviews, while smaller markets may show stronger manipulation patterns. The high ratios in 360 Mobile Assistant and Meizu should be interpreted cautiously due to their limited review volumes, but they still suggest that fake review activity can be highly concentrated in specific markets.
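The totals in this paragraph can be cross-checked against Table 5 with a few lines of arithmetic. The sketch below weights each market's fake proportion by its review count; the counts and percentages are copied from Table 5, while the function and variable names are introduced here for illustration.

```python
# (number of reviews, fake proportion in percent) per market, from Table 5
TABLE_5 = {
    "App Store": (123_238, 29.02),
    "Huawei": (148_720, 29.48),
    "OPPO": (790_261, 16.36),
    "VIVO": (386_535, 22.90),
    "Xiaomi": (105_677, 27.44),
    "Yingyongbao": (8_396, 60.15),
    "Meizu": (1_304, 97.39),
    "360 Mobile Assistant": (2_165, 94.69),
}


def overall_fake_share(table):
    """Return (total reviews, estimated fake reviews, overall fake share)."""
    total_reviews = sum(n for n, _ in table.values())
    est_fake = sum(n * pct / 100 for n, pct in table.values())
    return total_reviews, est_fake, est_fake / total_reviews


total_reviews, est_fake, overall_share = overall_fake_share(TABLE_5)

# Share of all reviews contributed by the three smallest markets.
small_markets = ("Yingyongbao", "Meizu", "360 Mobile Assistant")
small_share = sum(TABLE_5[m][0] for m in small_markets) / total_reviews

print(total_reviews, round(est_fake), f"{overall_share:.2%}", f"{small_share:.2%}")
```

The weighted sum reproduces the reported 1,566,296 reviews, about 334,778 fake reviews, and the overall 21.37% proportion. It also shows that Yingyongbao, Meizu, and 360 Mobile Assistant together contribute under 1% of all reviews, which is why their very high ratios barely move the overall figure.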

Market manipulation evidence. In addition to potential fake review proportions, we examine cross market synchronization to identify stores that are more closely associated with coordinated promotion. Table 6 reports the distribution of suspicious reviews across markets under different cross market coverage levels. Specifically, the columns indicate suspicious reviews associated with apps whose suspicious reviews appear in one market, two markets, three markets, or more than three markets. A higher CMS score means that suspicious reviews in a given market are more likely to appear with suspicious activity in other markets. Yingyongbao obtains the highest CMS score of 0.81, followed by 360 Mobile Assistant with 0.68. Yingyongbao also accounts for 72.8% of apps whose suspicious reviews appear in more than three markets, suggesting a strong connection with cross market propagation. In comparison, App Store has an intermediate CMS score of 0.42, while OPPO, VIVO, Xiaomi, Huawei, and Meizu show lower CMS scores, ranging from 0.30 to 0.39. Together, fake review proportion and CMS capture different aspects of manipulation risk. A high fake review proportion indicates concentrated suspicious activity within a market, while a high CMS score indicates stronger involvement in suspicious activity across markets. This distinction shows that a market with a moderate fake review proportion may still be important in coordinated promotion if its suspicious reviews are widely synchronized with other stores.

Answer to RQ1. Potential fake reviews are not uniformly distributed across app markets, but vary substantially in both estimated prevalence and cross market synchronization. Markets with higher suspicious review proportions do not always provide equally reliable evidence, since small valid review volumes may amplify estimated ratios in some stores. These findings suggest that fake review risk should be assessed by jointly considering within market prevalence, valid review volume, and cross market synchronization.

5.3. RQ2: Category Targeting Patterns

We next examine whether fake review campaigns are concentrated in specific app categories. To answer RQ2, we analyze category differences using both historical datasets and the recent review dataset, since apps with stronger user acquisition pressure, higher monetization potential, or stronger trust requirements may provide greater incentives for promotional manipulation.

Category prevalence. To obtain broad category coverage, we first analyze the Hu et al. dataset and the CHAMP dataset. Since these datasets do not provide complete category labels for all package names, we collect app metadata from third party app intelligence platforms, including Kuchuan and Qimai. We then group apps into ten major categories and assign apps with ambiguous labels to the Others category.

Figure 8 reports the estimated fake review proportions across categories and markets. The heatmap shows clear category differences. Social Communication has the highest proportions in most markets, reaching 0.563 in Yingyongbao, 0.458 in Wandoujia, and 0.390 in 360. Finance also shows consistently high values, such as 0.352 in Yingyongbao, 0.339 in Wandoujia, and 0.279 in 360. Gaming presents strong manipulation patterns as well, with proportions of 0.350 in OPPO, 0.345 in Xiaomi, and 0.309 in Meizu. In contrast, categories such as Education, News, Reading, and Others generally show lower proportions, often below 0.10 in several markets.
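The per-cell values of such a heatmap can be derived with a simple aggregation, sketched below. The raw label mapping is hypothetical, since the full list of ten categories and the Kuchuan and Qimai label vocabularies are not given in this section. Only the target category names and the fallback to Others come from the text.

```python
from collections import defaultdict

# Hypothetical mapping from raw store labels to major categories.
CATEGORY_MAP = {
    "chat": "Social Communication",
    "dating": "Social Communication",
    "loan": "Finance",
    "game": "Gaming",
}


def to_category(raw_label):
    """Map a raw label to a major category, defaulting to Others."""
    return CATEGORY_MAP.get(raw_label.strip().lower(), "Others")


def fake_proportions(records):
    """records: iterable of (raw_label, market, is_fake).

    Returns {(category, market): proportion of reviews classified as fake}.
    """
    counts = defaultdict(lambda: [0, 0])  # [fake count, total count]
    for raw_label, market, is_fake in records:
        cell = counts[(to_category(raw_label), market)]
        cell[0] += int(is_fake)
        cell[1] += 1
    return {key: fake / total for key, (fake, total) in counts.items()}


# Hypothetical classified reviews.
sample_records = [
    ("dating", "Yingyongbao", True),
    ("chat", "Yingyongbao", False),
    ("loan", "OPPO", True),
    ("weather", "OPPO", False),   # unknown label falls back to Others
]
heatmap_cells = fake_proportions(sample_records)
```

Cells with very few reviews produce unstable proportions, so a minimum sample size per cell would be a sensible addition before plotting.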

Focused measurement on selected categories. Guided by the historical category analysis, the recent review dataset focuses on dating, gaming, and finance apps. Figure 9 reports the estimated fake review proportions for these three categories. Dating apps show the highest proportion at 25.25%, followed by gaming apps at 22.84%. Finance apps show a lower proportion of 15.29%, but still contain a non negligible proportion of suspicious reviews. Meizu and 360 Mobile Assistant are not shown in this figure because their valid samples in the selected categories are too sparse after applying the app matching and review filtering rules.

These differences suggest that potential fake review manipulation is related to category specific incentives. Dating apps often depend on perceived popularity, trust, and successful social interaction to attract users and encourage paid engagement. Gaming apps operate in a competitive market where ratings and reviews can affect visibility, downloads, and in app spending. Finance apps are more function oriented and trust sensitive. Although their estimated fake review proportion is lower than that of dating and gaming apps, suspicious reviews in this category may still be important because they can influence user decisions in financial scenarios.

Promotional language in selected categories. To further characterize fake review content in the selected categories, we perform word cloud analysis on representative apps from dating, finance, and gaming. Figure 10 shows that fake reviews in these categories are dominated by positive and persuasive words. For the dating app, frequent words are related to recommendation, liking, and social connection. For the finance app, the words emphasize convenience, speed, and loan services. For the gaming app, the words highlight entertainment quality, classic gameplay, and game equipment related terms.

Although the keywords differ across categories, they serve a similar promotional purpose. Potential fake reviews tend to emphasize the most commercially attractive aspects of each category, such as social connection in dating apps, fast loan services in finance apps, and entertainment experience in gaming apps. Instead of providing detailed user experiences, these reviews often use generic positive language to improve the perceived credibility of the app.

Answer to RQ2. Potential fake review activity is more visible in commercially sensitive categories such as social communication, dating, gaming, and finance, while categories such as education, news, and reading generally show lower estimated proportions. This pattern suggests that suspicious review manipulation is more likely to emerge in categories where user trust, reputation signals, and acquisition incentives are closely connected to monetization. However, these results should be interpreted as category associated risk patterns rather than a complete ranking of all app categories.

5.4. RQ3: App Popularity and Cross Market Distribution

To answer RQ3, we examine whether fake review prevalence is associated with app popularity and how apps with high fake review proportions distribute suspicious reviews across markets. We first rank apps by download counts and plot each app according to its download count and fake review ratio. We then divide apps into three download groups to compare fake review ratios across popularity ranges. Finally, we inspect the apps with the highest fake review proportions to observe how their fake reviews are distributed across app stores.

Fake review prevalence versus app popularity. Figure 11 shows the relationship between app downloads and potential fake review ratios, where each point represents one app. We further divide apps into three groups based on download counts to compare potential fake review ratios across different popularity ranges. The average potential fake review ratio is 35.85% in the top download group and 56.60% in the low download group. To quantify the relationship within each market, we compute Pearson correlation coefficients between app download counts and fake review ratios. The correlations are close to zero, with 0.03 for Huawei, 0.02 for OPPO, 0.00 for VIVO, and 0.04 for Xiaomi. Other markets are omitted from this analysis due to missing or inconsistent download statistics and sparse valid samples after app matching and filtering. As supplementary examples, Table 7 shows the top 10 apps ranked by fake review proportion and their fake review distributions across markets. These apps show different distribution patterns. For example, com.bingo.yeliao has many fake reviews in OPPO, VIVO, Yingyongbao, and 360 Mobile Assistant, while com.leniu.shdl appears across App Store, Huawei, OPPO, VIVO, Xiaomi, and Yingyongbao. Other apps show fake reviews mainly in one or two markets, while some markets contain few reviews, no detected fake reviews, or no available listing.
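The two popularity analyses in this paragraph can be sketched as follows. The Pearson coefficient is the standard formula. The grouping rule is an assumption, since the text does not state whether the three download groups are equal sized or defined by download thresholds, and equal sized groups are used here. The sample apps are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def group_means(apps, n_groups=3):
    """apps: list of (downloads, fake_ratio).

    Ranks apps by downloads (highest first), splits them into up to
    n_groups chunks of equal size, and returns the mean fake ratio per chunk.
    """
    ranked = sorted(apps, key=lambda a: a[0], reverse=True)
    size = -(-len(ranked) // n_groups)  # ceiling division
    groups = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    return [sum(r for _, r in g) / len(g) for g in groups]


# Hypothetical apps in one market: (download count, fake review ratio).
sample_apps = [
    (6000, 0.10), (5000, 0.20), (4000, 0.30),
    (3000, 0.40), (2000, 0.50), (1000, 0.60),
]
means_by_group = group_means(sample_apps)
r = pearson([d for d, _ in sample_apps], [f for _, f in sample_apps])
```

This toy input is perfectly anti-correlated by construction. The point of the sketch is the procedure, and the coefficients reported above for Huawei, OPPO, VIVO, and Xiaomi come from the real data. Download counts are often heavy tailed, so a rank-based coefficient could be a useful robustness check.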

Answer to RQ3. Potential fake review risk is not explained by app popularity alone, as the observed correlation between download counts and estimated fake review proportions is weak. Some low popularity apps still show high suspicious review proportions, and high risk apps differ in how suspicious reviews are distributed across markets. This indicates that fake review analysis should jointly consider popularity signals, market coverage, store specific variation, and cross market propagation rather than relying on download counts alone.

6. Discussion

6.1. Implications

For app market maintainers, fake review moderation should not rely only on evidence observed within a single store. Markets can compare the same app across stores and use repeated review content, abnormal rating gaps, synchronized posting, and burst patterns as early warning evidence for coordinated promotion. Such evidence can help prioritize manual review for apps and categories with stronger manipulation risk. For app developers, the results suggest that reputation building should rely on transparent user acquisition and genuine feedback rather than artificial promotion, since suspicious review patterns may become visible when the app is examined across markets. Developers can also monitor their own review distributions across stores to identify abnormal third party promotion or reputation attacks. For mobile users, reviews and ratings should be interpreted with caution, especially when an app shows many generic positive reviews, sudden review growth, or inconsistent reputation across stores. Comparing reviews across markets and paying attention to detailed user experiences can help users make more reliable installation and payment decisions.

6.2. Limitations

Our study carries several limitations. First, the large scale measurement relies on CrossCFD predictions rather than fully manual labels, which may introduce false positives and false negatives. To alleviate this issue, we evaluate CrossCFD on a high confidence labeled benchmark before applying it to large scale data, and we report the measurement results as estimated manipulation evidence rather than exact ground truth prevalence. Second, cross market analysis depends on accurate app alignment, while app names, package identifiers, developer information, and descriptions may be incomplete or inconsistent across stores. To reduce this risk, we match apps using multiple metadata fields and exclude apps that cannot be confidently matched in at least two markets. Finally, CrossCFD captures observable evidence from reviews, ratings, timestamps, rankings, and market distributions, but cannot directly observe hidden promotion infrastructures such as paid review networks, install farms, or account collusion. Therefore, we avoid attributing suspicious reviews to specific attackers or organizations, and instead focus on measurable manipulation patterns across markets.

7. Conclusion

This paper presents CrossCFD, a fake review detection framework that uses evidence across app markets. CrossCFD aligns the same app across different stores and combines review evidence with cross market inconsistency evidence, covering rating discrepancy, sentiment deviation, duplicated or similar review content, temporal coordination, ranking variation, and burst behavior. The evaluation shows that cross market evidence improves detection effectiveness and helps reduce missed detections under stricter generalization settings. The measurement study further shows that potential fake review activity appears across multiple app markets, but its estimated proportion and synchronization degree vary substantially by store. Markets with higher CMS scores show stronger co-occurrence of suspicious reviews with other stores, indicating closer involvement in cross market promotional activity. The category analysis further shows that suspicious reviews are prominent in several commercially sensitive categories, and the focused measurement on dating, gaming, and finance apps also observes substantial fake review activity. In addition, app popularity alone cannot explain fake review prevalence, since apps with high fake review proportions show different download levels and different distributions across markets. These findings highlight the need to assess fake review risk from both within market prevalence and cross market coordination perspectives.

References

[1] Martens, D.; Maalej, W. Towards Understanding and Detecting Fake Reviews in App Stores. Empir. Softw. Eng. 2019, 24, 3316–3355.
[2] Rahman, M.; Hernandez, N.; Recabarren, R.; Ahmed, S.I.; Carbunar, B. The Art and Craft of Fraudulent App Promotion in Google Play. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 2437–2454.
[3] Hernandez, N.; Recabarren, R.; Carbunar, B.; Ahmed, S.I. RacketStore: Measurements of ASO Deception in Google Play via Mobile and App Usage. In Proceedings of the 21st ACM Internet Measurement Conference, Virtual Event, 2–4 November 2021; pp. 639–657.
[4] Shan, G.; Zhou, L.; Zhang, D. Examining Review Inconsistency for Fake Review Detection. Decis. Support Syst. 2021, 144, 113513.
[5] Gupta, R.; Jindal, V.; Kashyap, I. Recent State-of-the-Art of Fake Review Detection: A Comprehensive Review. Knowl. Eng. Rev. 2024, 39, e67.
[6] Jindal, N.; Liu, B. Opinion Spam and Analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, Palo Alto, CA, USA, 11–12 February 2008; pp. 219–230.
[7] Lim, E.-P.; Nguyen, V.-A.; Jindal, N.; Liu, B.; Lauw, H.W. Detecting Product Review Spammers Using Rating Behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, 26–30 October 2010; pp. 939–948.
[8] Mukherjee, A.; Venkataraman, V.; Liu, B.; Glance, N. Spotting Opinion Spammers Using Behavioral Footprints. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 632–640.
[9] Li, J.; Ott, M.; Cardie, C.; Hovy, E. Towards a General Rule for Identifying Deceptive Opinion Spam. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 22–27 June 2014; pp. 1566–1576.
[10] Rayana, S.; Akoglu, L. Collective Opinion Spam Detection: Bridging Review Networks and Metadata. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 985–994.
[11] Zhao, J.; Shao, M.; Tang, H.; Liu, J.; Du, L.; Wang, H. RHGNN: Fake Reviewer Detection Based on Reinforced Heterogeneous Graph Neural Networks. Knowl.-Based Syst. 2023, 280, 111029.
[12] Cheng, L.-C.; Wu, Y.-T.; Chao, C.-T.; Wang, J.-H. Detecting Fake Reviewers from the Social Context with a Graph Neural Network Method. Decis. Support Syst. 2024, 179, 114150.
[13] Yao, J.; Jiang, L.; Shi, C.; Yan, S. Fake Review Detection with Label-Consistent and Hierarchical-Relation-Aware Graph Contrastive Learning. Knowl.-Based Syst. 2024, 303, 112385.
[14] Duma, R.A.; Niu, Z.; Nyamawe, A.S.; Manjotho, A.A. A Deep Feature Interaction and Fusion Model for Fake Review Detection: Advocating Heterogeneous Graph Convolutional Network. Neurocomputing 2024, 598, 128097.
[15] Sun, P.; Bi, W.; Zhang, Y.; Wang, Q.; Kou, F.; Lu, T.; Chen, J. Fake Review Detection Model Based on Comment Content and Review Behavior. Electronics 2024, 13, 4322.
[16] Duma, R.A.; Niu, Z.; Nyamawe, A.S.; Manjotho, A.A. An Analysis of Graph Neural Networks for Fake Review Detection: A Systematic Literature Review. Neurocomputing 2025, 623, 129341.
[17] Mohawesh, R.; Salameh, H.B.; Jararweh, Y.; Alkhalaileh, M.; Maqsood, S. Fake Review Detection Using Transformer-Based Enhanced LSTM and RoBERTa. Intell. Syst. Appl. 2024, 23, 200406.
[18] Liu, J.; Quan, P.; Zhang, W. A Study on Fake Review Detection Based on RoBERTa and Behavioral Features. Procedia Comput. Sci. 2024, 242, 1323–1330.
[19] Geetha, S.; Elakiya, E.; Kanmani, R.S.; Das, M.K. High Performance Fake Review Detection Using Pretrained DeBERTa Optimized with Monarch Butterfly Paradigm. Sci. Rep. 2025, 15, 7445.
[20] Liu, M.; Poesio, M. Data Augmentation for Fake Reviews Detection in Multiple Languages and Multiple Domains. arXiv 2025, arXiv:2504.06917.
[21] Liu, X.; Xu, R.; Jia, X.; Liao, J.; Sun, J.; Huang, L.; Xu, W. Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network. arXiv 2025, arXiv:2510.01801.
[22] Meng, W.; Harvey, J.; Goulding, J.; Carter, C.J.; Lukinova, E.; Smith, A.; Frobisher, P.; Forrest, M.; Nica-Avram, G. Large Language Models as Hidden Persuaders: Fake Product Reviews Are Indistinguishable to Humans and Machines. arXiv 2025, arXiv:2506.13313.
[23] Chen, N.; Lin, J.; Hoi, S.C.H.; Xiao, X.; Zhang, B. AR-Miner: Mining Informative Reviews for Developers from Mobile App Marketplace. In Proceedings of the 36th International Conference on Software Engineering, Hyderabad, India, 31 May–7 June 2014; pp. 767–778.
[24] Wang, L.; Wang, H.; Luo, X.; et al. Demystifying “removed reviews” in iOS app store. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022; pp. 1489–1499.
[25] Maalej, W.; Kurtanović, Z.; Nabil, H.; Stanik, C. On the Automatic Classification of App Reviews. Requir. Eng. 2016, 21, 311–331.
[26] Gao, C.; Zeng, J.; Lyu, M.R.; King, I. Emerging app issue identification from user feedback: Experience on WeChat. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP); IEEE, 2019; pp. 279–288.
[27] Nilizadeh, S.; Groggel, A.; Lista, P.; Das, S.; Ahn, G.-J.; Kapadia, A. Think outside the dataset: Finding fraudulent reviews using cross-dataset analysis. In Proceedings of The Web Conference, San Francisco, CA, USA, 2019; pp. 3108–3115.
[28] Mohawesh, R.; Xu, S.; Tran, S.N.; Ollington, R.; Springer, M.; Jararweh, Y.; Maqsood, S. Fake Reviews Detection: A Survey. IEEE Access 2021, 9, 65771–65802.
[29] Kumar, P.N.V.S.P.; Kasiviswanath, N.; Babu, A.S. Detecting Mobile App Fraud Review and Fake Ranking. In Proceedings of the International Conference on Advanced Materials, Manufacturing and Sustainable Development, 2024; Advances in Engineering Research; Babu, B.S., Ed.; Atlantis Press: Dordrecht, The Netherlands, 2025; pp. 314–319.
[30] Chen, K.; Li, S.; Wang, W. You can promote, but you can’t hide: Large-scale abused app detection in mobile app stores. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 2016; pp. 374–385.
[31] Fan, M.; et al. LLM App Store Analysis: A Vision and Roadmap. arXiv 2024, arXiv:2404.12737.
[32] Adelani, D.I.; Mai, H.; Fang, F.; Nguyen, H.H.; Yamagishi, J.; Echizen, I. Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human- and Machine-Based Detection. arXiv 2019, arXiv:1907.09177.
[33] Farooqi, S.; Feal, Á.; Lauinger, T.; McCoy, D.; Shafiq, Z.; Vallina-Rodriguez, N. Understanding incentivized mobile app installs on Google Play Store. In Proceedings of the ACM internet measurement conference, 2020; pp. 696–709.
[34] Apple. The App Store Prevented More than $9 Billion in Fraudulent Transactions over the Last Five Years. Available online: https://www.apple.com/newsroom/2025/05/the-app-store-prevented-more-than-9-billion-usd-in-fraudulent-transactions/ (accessed on 26 June 2026).
[35] Gyöngyi, Z.; Garcia-Molina, H. Web spam taxonomy. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, 2005; 5, pp. 39–47.
[36] Ntoulas, A.; Najork, M.; Manasse, M.; Fetterly, D. Detecting Spam Web Pages through Content Analysis. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, UK, 23–26 May 2006; pp. 83–92.
[37] Castillo, C.; Donato, D.; Gionis, A.; Murdock, V.; Silvestri, F. Know Your Neighbors: Web Spam Detection Using the Web Topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 23–27 July 2007; pp. 423–430.
[38] Spirin, N.; Han, J. Survey on Web Spam Detection: Principles and Algorithms. ACM SIGKDD Explor. Newsl. 2012, 13, 50–64.
[39] Bevendorff, J.; Wiegmann, M.; Potthast, M.; Stein, B. Is Google getting worse? A longitudinal investigation of SEO spam in search engines. In Proceedings of the 46th European Conference on Information Retrieval; Springer: Cham, Switzerland, 2024; pp. 56–71.
[40] Aggarwal, P.; Singh, V.M.; Zhang, T.; Mandal, S.; Bansal, M.; McAuley, J.; Jha, S. GEO: Generative engine optimization. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 2024; pp. 5–16.
[41] Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
[42] Wang, W.; Bi, B.; Yan, M.; Wu, C.; Bao, Z.; Xia, J.; Peng, L.; Si, L. StructBERT: Incorporating Language Structures into Pre-Training for Deep Language Understanding. arXiv 2019, arXiv:1908.04577.
[43] Bao, J. nlp-fluency; GitHub Repository, 2021. Available online: https://github.com/baojunshan/nlp-fluency (accessed on 2 June 2026).
[44] Hu, Y.; Wang, H.; Ji, T.; et al. CHAMP: Characterizing Undesired App Behaviors from User Comments Based on Market Policies. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering, Madrid, Spain, 25–28 May 2021; pp. 933–945.
[45] Hu, Y.; Wang, H.; Zhou, Y.; et al. Dating with Scambots: Understanding the Ecosystem of Fraudulent Dating Applications. IEEE Trans. Dependable Secur. Comput. 2019, 18, 1033–1050.

Footnote 1. App Store, https://apps.apple.com. Huawei AppGallery, https://appgallery.huawei.com. OPPO App Market, https://store.oppomobile.com. VIVO App Store, https://dev.vivo.com.cn. Xiaomi GetApps, https://global.app.mi.com. Yingyongbao, https://sj.qq.com. Meizu App Store, https://app.meizu.com. 360 Mobile Assistant, https://zhushou.360.cn.

Figure 1. Motivating example of cross market promotional review manipulation. (a) The same app, identified by package name com.yiban1314.yiban, is distributed across multiple app stores, and duplicated positive reviews appear across these markets at different posting times. (b) The same app exhibits discrepancies in app score and review count. The full app score is 10 in 360 Mobile Assistant and 5 in the other app stores.

Figure 2. Overview of our approach.

Figure 3. Feature design of CrossCFD. White boxes represent traditional features; yellow, pink, and green boxes represent reused single market features, enhanced single market features, and proposed cross market features, respectively. Arrows indicate the mapping from prior signals to CrossCFD feature construction.

Figure 4. ROC curve of the selected CrossCFD detector.

Figure 5. Distribution of language complexity for fake and genuine reviews.

Figure 6. Identical and similar review connections for an example app across three markets.

Figure 7. CDF distributions of representative cross market features. (a) Cross market count of similar app reviews $cf_{4}$. (b) Cross market app rating score discrepancy $cf_{8}$. (c) Cross market temporal variance of similar app reviews $cf_{9}$.

Figure 8. Heatmap of estimated fake review proportions across app categories and markets. Each cell shows the proportion of reviews classified as fake for a given app category and market pair. Darker colors represent higher fake review proportions.

Figure 9. Estimated fake review proportions of dating, gaming, and finance apps in the recent review dataset.

Figure 10. Word clouds of the top 30 words in three apps with high fake review proportions. (a) Dating app. (b) Finance app. (c) Game app.

Figure 11. Relationship between app downloads and fake review ratios.

Table 1. Examples of sentiment score and language complexity.

Review	Sentiment score	Language complexity
I’m so grateful, you’ve really helped me a lot.	0.9465	21998.372
It keeps crashing, what’s wrong with it?	0.0861	20384.556
I can’t open the 2024 version.	0.3535	23231.899
A very convenient investment and financial management tool that can solve my spare cash problem.	0.9485	21643.835
Sing in the daylight with wine, as youth accompanies you back home.	0.8366	22684.603

Table 2. Detection performance of different models and feature settings.

Feature set	Logistic			SVM			Random Forest			Gradient Boosting			MLP
Feature set	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1
word2vec	0.711	0.708	0.708	0.747	0.740	0.737	0.780	0.778	0.776	0.783	0.782	0.782	0.750	0.746	0.744
tf+word2vec	0.766	0.731	0.748	0.813	0.816	0.815	0.859	0.853	0.856	0.853	0.850	0.851	0.818	0.818	0.818
CrossCFD	0.872	0.872	0.872	0.877	0.877	0.877	0.907	0.909	0.908	0.935	0.936	0.935	0.884	0.882	0.883

Table 3. Generalization performance under only app and only market settings.

Setting	Model	Precision	Recall	FNR	F1
Only app	Baseline	0.799	0.770	0.230	0.784
	CrossCFD	0.898	0.862	0.138	0.880
Only market	Baseline	0.809	0.690	0.310	0.745
	CrossCFD	0.827	0.827	0.173	0.827

Table 4. Effectiveness of the sentiment feature.

Method		Metrics
Model	Type	ACC	P	F1
Dictionary based	Lexicon	81.7%	84.4%	80.9%
StructBERT based	Neural	90.6%	93.5%	90.3%

Table 5. Overview of estimated fake review proportions across app markets.

Market	Number of reviews	Fake prop.
App Store	123,238	29.02%
Huawei	148,720	29.48%
OPPO	790,261	16.36%
VIVO	386,535	22.90%
Xiaomi	105,677	27.44%
Yingyongbao	8,396	60.15%
Meizu	1,304	97.39%
360 Mobile Assistant	2,165	94.69%
Total	1,566,296	21.37%

Table 6. Fake review distribution across markets under different cross market coverage levels in the recent review dataset.

Market	#1 Market	#2 Markets	#3 Markets	#>3 Markets	CMS
Huawei	8,200 (10.6%)	4,500 (13.1%)	2,100 (12.5%)	1,300 (14.6%)	0.39
Xiaomi	7,800 (10.1%)	4,200 (12.2%)	1,900 (11.3%)	980 (11.0%)	0.35
OPPO	6,500 (8.4%)	3,800 (11.0%)	1,600 (9.5%)	750 (8.4%)	0.33
VIVO	5,200 (6.7%)	2,900 (8.4%)	1,300 (7.7%)	620 (6.9%)	0.30
App Store	19,100 (24.6%)	5,300 (15.4%)	2,200 (13.1%)	1,100 (12.3%)	0.42
Yingyongbao	18,400 (23.7%)	14,200 (41.2%)	9,800 (58.3%)	6,500 (72.8%)	0.81
360	12,500 (16.1%)	9,600 (27.9%)	5,300 (31.5%)	3,200 (35.8%)	0.68
Meizu	4,880 (6.3%)	2,670 (7.8%)	1,240 (7.4%)	590 (6.6%)	0.31

Table 7. Top 10 apps with the highest fake review proportions and their cross market fake review distributions.

Package name	Fake prop.	# Fake reviews / # Reviews per market
		App Store	Huawei	OPPO	VIVO	Xiaomi	Yingyongbao	360
`com.xiao.chengshi`	0.971	553/556	2/2	1536/1605	407/410	0/1	NAN	NAN
`com.st.QSB`	0.962	16412/16959	1/19	290/307	246/328	0/6	0/1	NAN
`com.gxyt.truthlove`	0.959	NAN	6/6	59/60	52/54	0/2	NAN	NAN
`com.bingo.yeliao`	0.937	NAN	0/1	396/399	228/317	NAN	2008/2110	1435/1514
`com.ts.facai`	0.930	NAN	17/19	1143/1181	1537/1686	0/8	49/60	NAN
`com.kangluoer.tomato`	0.928	NAN	1/1	366/370	417/507	NAN	436/449	615/651
`com.huihe.tmydl`	0.904	0/47	0/1	3196/3225	2913/3487	NAN	NAN	NAN
`com.wiwj.xiangyucustomer`	0.791	2035/2504	7/27	4/25	2/14	0/20	1/2	265/310
`com.leniu.shdl`	0.648	504/2143	3917/5671	4952/7224	6045/8309	3122/4804	247/332	NAN
`com.yitantech.gaigai`	0.621	26/409	213/366	13654/17160	3917/5671	4/62	1/15	0/39

NAN indicates that the app is unavailable in the corresponding store or that no valid reviews were observed during data collection.
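
The per-app fake proportion in Table 7 looks consistent with pooling the counts over all stores in which the app has reviews. The sketch below assumes that reading: it parses the `fake/total` cells, skips `NAN` entries, and divides pooled fake reviews by pooled reviews. The helper names are hypothetical, and only three rows are included as a spot check.

```python
def parse_cell(cell):
    """Return (fake, total) for a 'fake/total' cell, or None for a 'NAN' cell."""
    if cell.strip().upper() == "NAN":
        return None
    fake, total = cell.split("/")
    return int(fake), int(total)


def pooled_proportion(cells):
    """Pooled fake proportion over all markets with observed reviews."""
    pairs = [pair for pair in map(parse_cell, cells) if pair is not None]
    fake = sum(f for f, _ in pairs)
    total = sum(t for _, t in pairs)
    return fake / total if total else float("nan")


# Cells in the column order of Table 7:
# App Store, Huawei, OPPO, VIVO, Xiaomi, Yingyongbao, 360.
ROWS = {
    "com.st.QSB": ["16412/16959", "1/19", "290/307", "246/328", "0/6", "0/1", "NAN"],
    "com.gxyt.truthlove": ["NAN", "6/6", "59/60", "52/54", "0/2", "NAN", "NAN"],
    "com.bingo.yeliao": ["NAN", "0/1", "396/399", "228/317", "NAN", "2008/2110", "1435/1514"],
}

if __name__ == "__main__":
    for package, cells in ROWS.items():
        print(package, round(pooled_proportion(cells), 3))
```

Pooling weights each store by its review volume, so an app such as `com.st.QSB` can have a high overall proportion even though several of its smaller stores show almost no flagged reviews.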

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

