During the data acquisition phase, all platform usage policies and relevant legal regulations were strictly adhered to. As a globally recognized hub for data science, Kaggle aggregates an extensive array of high-quality open-source datasets, offering a rich repository for interdisciplinary academic research.
This chapter centers on the “Amazon Fine Food Reviews” dataset, curated by user mdraselsarker and published on the Kaggle platform. It specifically captures user-generated reviews of gourmet products sold on Amazon, spanning the period from October 1999 to October 2012. The dataset encompasses 568,454 reviews from 256,059 users and 74,258 distinct products. Among these users, 260 submitted more than 50 reviews [6]. Upon downloading and extracting the dataset from Kaggle, two essential files are obtained: Reviews.csv and database.sqlite. The Reviews.csv file is directly derived from the “Reviews” table within the database.sqlite database and is stored in CSV format, facilitating subsequent data analysis tasks.
Covering over a decade of user activity, the dataset comprises ten key fields: unique review ID (Id), product ID (ProductId), user ID (UserId), user nickname (ProfileName), timestamp (Time), star rating from 1 to 5 (Score), number of helpful votes (HelpfulnessNumerator), total number of votes (HelpfulnessDenominator), review summary (Summary), and full review text (Text). These structured attributes offer comprehensive and valuable support for the in-depth exploration of product review patterns, sentiment trends, and user behavioral drivers in the e-commerce domain.
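As a minimal sketch of how the ten fields listed above might be checked once the file has been read, the following example uses only the Python standard library. The two sample rows, the EXPECTED_COLUMNS constant, and the missing_columns helper are illustrative assumptions and are not part of the dataset or of the original analysis; in practice the header would come from Reviews.csv itself.

```python
import csv
import io

# The ten fields named in the text, in the order given there.
EXPECTED_COLUMNS = [
    "Id", "ProductId", "UserId", "ProfileName", "Time", "Score",
    "HelpfulnessNumerator", "HelpfulnessDenominator", "Summary", "Text",
]


def missing_columns(header):
    """Return the expected field names that are absent from a CSV header."""
    present = set(header)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Hypothetical two-row sample standing in for Reviews.csv (values are made up).
SAMPLE_CSV = (
    "Id,ProductId,UserId,ProfileName,Time,Score,"
    "HelpfulnessNumerator,HelpfulnessDenominator,Summary,Text\n"
    "1,P001,U001,alice,1300000000,5,1,1,Great,Tasty and fresh\n"
    "2,P002,U002,bob,1340000000,1,0,2,Bad,Arrived stale\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)

print(missing_columns(reader.fieldnames))  # empty list when all ten fields are present
print(len(rows), rows[0]["Score"])
```

Checking the header before any analysis is a cheap safeguard: every later step (vote correlation, sentiment labeling, length statistics) depends on these column names being present.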
4.3. Data Visualization and Analysis
Key influencing factors in this analysis include total vote count, helpful vote count, sentiment polarity, rating, review length, and high-frequency words. Taking individual reviews as the unit of analysis, the study examines intrinsic relationships, distributional characteristics, and latent patterns across multiple dimensions. By leveraging data visualization and comprehensive analytical techniques, user behavioral tendencies, sentiment orientations, and core product concerns are systematically uncovered. These insights provide robust data support and valuable references for optimizing product strategies, enhancing user experience, and fostering deeper user engagement.
4.3.1. Correlation Between Helpful Votes and Total Votes
Figure 7 shows the relationship between the number of helpful votes and the total number of votes.
To quantify the degree of correlation between the two, the Pearson correlation coefficient is used. The formula is:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where $r$ is the correlation coefficient, $x_i$ and $y_i$ are the $i$-th observations of variables $X$ and $Y$, respectively, $\bar{x}$ and $\bar{y}$ are the sample means of variables $X$ and $Y$, and $n$ is the sample size, i.e., the number of observations. To ensure the scientific validity and reproducibility of the analysis, the calculation of the correlation coefficient is implemented using the scipy.stats.pearsonr function in Python [9], which returns the correlation coefficient $r$ and the corresponding p-value. The p-value represents the probability of obtaining a result at least as extreme as the observed one under the assumption that the null hypothesis (i.e., no linear correlation between the two variables) is true. When the p-value is less than a pre-specified significance level (commonly 0.05), the null hypothesis is rejected and the linear correlation is considered statistically significant. The calculation results are shown in Figure 8, indicating that the correlation coefficient $r$ between the two variables in the scatter plot is 0.98, and the p-value is approximately 0.00, i.e., far below the 0.05 threshold.
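To make the definition of the Pearson coefficient concrete, the sketch below computes it directly from sums of deviations using only the standard library. The function name pearson_r and the vote counts are assumptions introduced for illustration; the study itself relies on scipy.stats.pearsonr, which additionally returns the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed directly from its definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(ss_x * ss_y)


# Made-up vote counts for illustration only (not taken from the dataset).
total_votes = [1, 2, 3, 5, 8, 13]
helpful_votes = [1, 1, 3, 4, 7, 12]

r = pearson_r(total_votes, helpful_votes)
print(round(r, 3))
```

With SciPy installed, calling scipy.stats.pearsonr on the same two lists should give the same coefficient alongside its p-value. Because helpful votes can never exceed total votes, a high coefficient is expected for structural reasons alone, which is worth keeping in mind when interpreting the reported value.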
Results indicate a strong and statistically significant positive correlation between the number of helpful votes and the total number of votes. As shown in the scatter plot, most data points are densely clustered along a trend line extending from the bottom left to the upper right, visually illustrating that an increase in total votes is typically accompanied by a corresponding rise in helpful votes. Some points, however, deviate from this pattern. Reviews with high total vote counts but low helpfulness may reflect superficial content, lack substantive value, or even contain misleading information, thus failing to earn broad user recognition. In contrast, reviews with fewer total votes yet high helpfulness may precisely address the needs of specific user groups.
4.3.2. Sentiment Distribution Pie Chart
Figure 9 shows the distribution of sentiment polarity in the comments, covering three categories: Positive, Neutral, and Negative.
During data preprocessing, the original dataset did not contain a direct “Sentiment” column. Sentiment polarity labels were inferred for each comment using natural language processing (NLP) techniques. The specific implementation code is shown in Figure 10.
Sentiment analysis determines public opinion and expression toward products based on online reviews. Its core task is to classify text as positive, negative, or neutral sentiment. This relies on techniques such as text analytics, natural language processing (NLP), and computational linguistics to extract key information from user reviews, helping marketers assess market demand, understand public emotions and customer attitudes, and support customer base expansion. Technically, sentiment classification is divided into rule-based methods and machine learning-based automatic methods. The latter is more widely used due to its deeper understanding of opinionated text. Classifiers such as Naive Bayes, Maximum Entropy, and Support Vector Machine (SVM) can enhance classification accuracy. TextBlob is a Python library that offers a simple API, functioning similarly to Python strings, and supports NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation [10].
In the function plot_sentiment_distribution, the get_sentiment function is defined to invoke TextBlob’s sentiment analysis capability. This function takes the text of each comment as input and uses the sentiment.polarity attribute of TextBlob to calculate the sentiment polarity score. A score greater than 0 is classified as positive sentiment; a score less than 0 as negative sentiment; and a score equal to 0 as neutral sentiment. The operation df['Sentiment'] = df['Text'].apply(get_sentiment) applies sentiment polarity labeling to all comments in the DataFrame df, providing the basis for subsequent analysis.
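The thresholding rule described above can be sketched as follows. This is an illustrative reconstruction, not the code of Figure 10: the helper label_from_polarity is a hypothetical name introduced here, and get_sentiment assumes that the third-party textblob package is installed. The import is placed inside the function so that the threshold logic can be exercised without it.

```python
def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to a label using the thresholds in the text."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


def get_sentiment(text):
    """Label one review; assumes the third-party textblob package is installed."""
    # Imported lazily so label_from_polarity stays usable without textblob.
    from textblob import TextBlob

    return label_from_polarity(TextBlob(str(text)).sentiment.polarity)


# The mapping itself needs no external library.
print([label_from_polarity(p) for p in (0.6, 0.0, -0.25)])
# → ['Positive', 'Neutral', 'Negative']
```

One consequence of this rule is that the neutral class only captures reviews whose polarity is exactly 0, which may partly explain why that class tends to be small; a tolerance band around zero would be an alternative design, though it is not what the text describes.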
Compared to the random assignment method using np.random.choice, TextBlob’s sentiment tagging is based on the textual content of each review. It follows predefined dictionary rules and weight computation logic, so the same text always receives the same label. The outcomes are therefore reproducible and interpretable, avoiding the fluctuations that random assignment would introduce.
The final sentiment distribution results show: positive sentiment comments account for 88.3%, indicating that most users hold a favorable attitude toward the product or service; neutral sentiment comments account for only 1.5%, suggesting that a small number of users have ambiguous attitudes due to average experiences; and negative sentiment comments make up 10.2%, indicating dissatisfaction among some users.
To quantitatively understand the characteristics and uncertainty of this sentiment distribution, the concept of information entropy $H$ is introduced. Information entropy is a key metric for measuring the uncertainty of random variables. In sentiment polarity analysis, it can help assess the dispersion of the sentiment distribution. Its calculation formula is:

$$ H = -\sum_{i=1}^{n} p_i \log p_i $$

where $p_i$ is the probability of the $i$-th sentiment category, and $n$ is the total number of sentiment categories (here $n = 3$, corresponding to Positive, Neutral, and Negative). A higher entropy value indicates a more dispersed sentiment distribution and greater uncertainty, reflecting diverse and inconsistent user attitudes; a lower value indicates a more concentrated distribution and higher consistency in user attitudes. Substituting the category proportions (0.883, 0.015, and 0.102) into the formula yields an entropy well below the maximum of $\log n$ that a uniform distribution over the three categories would give. This result shows that the sentiment distribution in the current dataset is relatively concentrated, and user attitudes are highly consistent, with most users inclined to give positive evaluations.
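The entropy calculation can be reproduced from the reported category shares with a few lines of standard-library Python. The logarithm base is not recoverable from the text, so base 2 (bits) is an assumption here, and the function name shannon_entropy is introduced only for illustration.

```python
import math


def shannon_entropy(probabilities, base=2.0):
    """Shannon entropy H = -sum(p * log(p)); zero-probability terms contribute 0."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


# Category shares reported in the text: Positive, Neutral, Negative.
shares = [0.883, 0.015, 0.102]

h_bits = shannon_entropy(shares, base=2.0)  # base 2 is an assumption
h_max = math.log(len(shares), 2.0)          # entropy of a uniform 3-way split

print(round(h_bits, 3), round(h_max, 3), round(h_bits / h_max, 3))
```

Under the base-2 assumption the result should be roughly 0.59 bits, against a maximum of about 1.58 bits for three equally likely categories. Reporting the ratio to the maximum makes the "relatively concentrated" reading independent of the chosen logarithm base.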
4.3.3. Box Plot of Score vs. Review Length
Figure 11 shows the distribution of comment text lengths (measured by character count) under different rating scores (from 1 to 5 stars), visualized using a box plot.
Box plots are composed of three main elements: the box itself, the median line within the box, and the whiskers. Representing the central 50% of the data, the box spans from the first quartile ($Q_1$), marking the 25th percentile, to the third quartile ($Q_3$), at the 75th percentile. The interquartile range (IQR), calculated as $\mathrm{IQR} = Q_3 - Q_1$, defines the length of the box. A larger IQR indicates greater variability among the central data, while a smaller IQR suggests a more concentrated distribution. The line inside the box denotes the median, a measure resilient to outliers and indicative of the data’s central tendency. Whiskers extend to the most extreme observations lying within $Q_1 - 1.5 \times \mathrm{IQR}$ and $Q_3 + 1.5 \times \mathrm{IQR}$; data points outside this range are treated as outliers. These outliers are plotted individually and reflect the extent of data dispersion. Longer whiskers indicate wider value ranges and greater spread, whereas shorter whiskers correspond to more tightly clustered data.
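The quantities behind a box plot can be computed by hand, as in the following sketch. It assumes the conventional 1.5 × IQR rule for the whiskers and the "inclusive" quartile method of the standard library's statistics.quantiles; plotting libraries may use a slightly different quartile interpolation. The function name box_plot_stats and the sample lengths are illustrative assumptions.

```python
import statistics


def box_plot_stats(values):
    """Quartiles, IQR, whisker ends, and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations that lie inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Made-up review lengths in characters, for illustration only.
lengths = [120, 150, 180, 200, 220, 260, 300, 340, 400, 1500]
stats = box_plot_stats(lengths)
print(stats)
```

In this toy sample the single very long review falls above the upper fence and is reported as an outlier, while the upper whisker stops at the longest review inside the fence. The median is barely affected by that extreme value, which illustrates why it is described as resilient to outliers.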
According to existing studies, comment length has been found to correlate with fake review labels, where fake reviews tend to show more concentrated length distributions. This provides empirical support for the observed clustering of lengths in extreme ratings: short 1-star reviews may involve emotional or manipulative expressions, while brief 5-star reviews may reflect templated praise. Both exhibit low dispersion due to the unnatural nature of the content. Furthermore, analysis of review polarity shows that the sources of fake reviews differ by sentiment polarity: positive fake reviews often stem from seller inducement, while negative ones may involve competitive manipulation. These source differences partially explain the length distribution characteristics observed under extreme ratings. In contrast, 3-star reviews exhibit higher dispersion, aligning more closely with natural, authentic review behavior [11].
In this study, the character length of each review was calculated using the line df['TextLength'] = df['Text'].apply(lambda x: len(str(x))) and stored in a new column named TextLength. This length was considered an effective quantitative indicator of the internal structure of the review. Based on this, the sns.boxplot function from the seaborn library was used to create the box plot. The “Score” column served as the grouping variable on the horizontal axis, while “TextLength” was plotted on the vertical axis to show the distribution of comment lengths. The parameter showfliers=False was set to hide outlier points from the figure; the quartiles and whiskers are unchanged by this setting. This keeps the visual focus on the core data distribution and better reflects the comment lengths typically encountered by users.
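The grouping step that underlies the box plot can be sketched without any plotting library. The example below mirrors the len(str(x)) length calculation and groups the lengths by rating; the function name lengths_by_score and the sample reviews are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def lengths_by_score(reviews):
    """Group the character lengths of review texts by star rating."""
    grouped = defaultdict(list)
    for score, text in reviews:
        # str() mirrors len(str(x)) in the text and tolerates non-string values.
        grouped[score].append(len(str(text)))
    return dict(grouped)


# Made-up (score, text) pairs for illustration only.
reviews = [
    (1, "Stale."),
    (1, "Awful taste"),
    (3, "Fine, but the bag was half empty on arrival."),
    (3, None),
    (5, "Love it!"),
    (5, "Great"),
]

grouped = lengths_by_score(reviews)
medians = {score: median(vals) for score, vals in sorted(grouped.items())}
print(medians)
```

With pandas and seaborn available, the call described in the text, sns.boxplot with Score on the x-axis, TextLength on the y-axis, and showfliers=False, would draw the corresponding figure from the full DataFrame. Note that wrapping the text in str() means a missing value still receives a small nonzero length (that of its string form), so filtering missing texts beforehand may be preferable.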
Significant heterogeneity is observed in the distribution of comment lengths across different rating levels. In the case of 1-star reviews, a notably low median length, together with narrow box widths and short whiskers, indicates a limited interquartile range (IQR) and low dispersion. This pattern suggests that users are inclined to convey intense negative emotions through succinct expressions, with the observed variability partially driven by emotional intensity and the presence of manipulative content. For 2-star ratings, the median comment length increases slightly, and the expansion of both the box and whiskers reflects a broader IQR and greater dispersion, indicating a shift from purely emotional outbursts to more varied expressions informed by actual user experience. In contrast, 3-star reviews exhibit the highest median value, the widest box, and the longest whiskers, resulting in the largest IQR and the greatest overall dispersion. Such characteristics point to more nuanced and multifaceted user feedback, encompassing ambiguous sentiments and a coexistence of brief and extensive reviews—features aligned with patterns of spontaneous and authentic expression.
For 4-star ratings, the median is close to that of 3-star ratings, but the dispersion characteristics differ. This suggests that users recognize product strengths while also focusing on details. Their comments tend to combine positive sentiment with rational judgments, resulting in a level of dispersion between that of extreme and moderate ratings. Lastly, 5-star reviews show a lower median and a small IQR, as users often express satisfaction concisely.
The differences in comment length distributions across ratings not only reveal expression patterns under various emotional inclinations but may also reflect links to the authenticity of reviews. These findings are consistent with a systematic association between rating scores and comment length distributions.
4.3.4. Word Cloud of Frequent Words
Figure 12 presents the word cloud of frequent terms extracted from the review dataset. As a commonly used tool for visualizing textual data, word clouds provide an intuitive representation of word frequency through visual weight, enabling rapid identification of central themes and key information in a text corpus. Traditional word clouds, due to spatial constraints, randomly arrange words on a canvas. Typically, they compute word frequency from the corpus, sort the terms accordingly, and adjust font size proportionally to frequency [12].
In this figure, the font size reflects the frequency of word occurrences. Notably, terms such as "brand", "product", "taste", "love", and "good" appear in large and prominent fonts. These high-frequency terms encapsulate rich user feedback and reflect the core concerns expressed in user reviews.
Due to the inherent randomness in word cloud layout algorithms, different visual representations may emerge from the same text data, with variations in word placement and spatial distribution. However, despite these visual differences, the identification of core frequent terms remains stable. Therefore, the analysis and interpretation of users’ key concerns based on word frequency are unaffected by such randomness.
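The frequency counting that precedes any word cloud layout is deterministic, which is why the set of dominant terms does not depend on the random placement. The following sketch illustrates this step with the standard library; the stop-word list, the tokenization pattern, the linear font scaling, and the function names are all simplifying assumptions, not the procedure used to produce Figure 12.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real tools ship larger ones).
STOP_WORDS = {"the", "a", "and", "is", "it", "i", "this", "of", "to", "for"}


def word_frequencies(texts):
    """Count lowercase word tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", str(text).lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


def font_sizes(counts, min_size=10, max_size=60):
    """Scale font size linearly with frequency relative to the top word."""
    top = max(counts.values())
    return {w: min_size + (max_size - min_size) * c / top for w, c in counts.items()}


# Made-up review snippets for illustration only.
texts = ["I love the taste", "Good taste and good product", "Love this brand"]

counts = word_frequencies(texts)
sizes = font_sizes(counts)
print(counts.most_common(3))
print(sizes["taste"], sizes["brand"])
```

Running the counting step twice on the same corpus yields identical frequencies and font sizes; only the subsequent placement of words on the canvas varies between renderings.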
From the specific visualization of the current word cloud, it can be observed that in the dimensions of product and brand, terms such as "brand" and "product" occupy prominent positions. This indicates that users frequently mention the products and their associated brands, demonstrating a high level of concern for attributes like quality and functionality. In terms of taste and experience, the frequent appearance of the word "taste"—especially for food-related products—suggests that taste serves as a critical evaluation criterion. Moreover, positive emotional expressions such as "love" and "good" appear frequently, reflecting a generally satisfied user experience across the reviewed products. Additionally, terms like "order" and "use" appear with high frequency, signifying user attention toward the purchasing and usage processes, including factors such as convenience of purchase and frequency of use.
4.3.5. Summary Analysis of Visualizations
Through multi-dimensional data visualization and analysis, this study reveals the internal associations among various features within user review data. The correlation analysis shows a strong positive correlation between helpful votes and total votes. However, certain data points that deviate from the trend line suggest inconsistencies in comment quality. This implies that some low-quality reviews may diminish the reference value for other users. Enterprises should thus pay closer attention to the quality of review content to prevent ineffective or misleading information from influencing consumer decisions.
Sentiment polarity is analyzed using the TextBlob library, which generates sentiment labels based on the semantic and syntactic characteristics of the text. This method allows for a more objective representation of users’ emotional inclinations. The final sentiment distribution chart demonstrates that positive reviews overwhelmingly dominate the dataset, indicating that most users maintain a favorable attitude toward the products or services. Although neutral and negative sentiments constitute a smaller proportion, they should not be neglected. Enterprises can further investigate the underlying reasons behind these less favorable sentiments and respond by optimizing products and services accordingly to enhance satisfaction among these users.
Box plots comparing ratings and comment lengths reveal distinct distributional patterns across rating levels. Visual interpretations of these plots yield valuable insights into user behavior and inform actionable strategies. For example, organizations can focus on negative feedback contained in low-rated reviews to pinpoint and address critical issues, while simultaneously leveraging positive narratives from high-rated comments to enhance marketing efforts.
Word clouds vividly capture users’ core concerns by emphasizing frequent terms associated with product features such as brand identity, taste, and aspects of the purchase or usage process. These visual cues imply that enterprises ought to reinforce brand positioning, elevate product quality, and refine the overall user experience. In particular, sensory attributes like taste—which play a crucial role in food-related products—should receive focused optimization. In addition, the prominence of terms like “order” and “use” underscores user appreciation for convenience and efficiency in both purchasing and consumption. Consequently, companies should continuously adapt and improve processes to foster higher user satisfaction and loyalty.