In this section, we present the empirical findings of our study, with each result directly linked to the core objectives of the study. First, we compare the sentiment analysis tools and the ML algorithms, and then we move on to the engagement metrics.
4.1. Comparison of Sentiment Analysis Tools: TextBlob and VADER
In sentiment analysis, accurate data labeling is crucial. This study compares two widely used libraries, TextBlob and VADER, to identify the more suitable sentiment analysis tool [61,62,63]. VADER, known for its use in social media sentiment analysis [64], evaluates individual words and sentences and provides sentiment scores within the context of social media [65]. It articulates expressed sentiments, as discussed by Hutto and Gilbert [66]. TextBlob, widely used in sentiment analysis tasks [67,68], is also selected. The labeling process categorizes data into positive, negative, and neutral classes and is applied consistently to both datasets. Sentiment labels for each tool are illustrated in Table 5 and Table 6.
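The three-way labeling can be sketched as a thresholding step on each tool's score. The cut-offs below are assumptions, since the text does not state them: ±0.05 on VADER's compound score follows the convention suggested in VADER's documentation, and the sign of TextBlob's polarity is a common choice. The function name `label_from_score` is hypothetical, and the scores are illustrative values rather than data from the study.

```python
VADER_THRESHOLD = 0.05     # assumed cut-off on VADER's compound score
TEXTBLOB_THRESHOLD = 0.0   # assumed cut-off on TextBlob's polarity


def label_from_score(score: float, threshold: float = 0.0) -> str:
    """Map a sentiment score in [-1, 1] to positive / negative / neutral.

    Scores strictly between -threshold and +threshold are neutral;
    with threshold=0.0 only an exact zero is neutral.
    """
    if score > 0 and score >= threshold:
        return "positive"
    if score < 0 and score <= -threshold:
        return "negative"
    return "neutral"


# Illustrative scores only (hypothetical, not taken from the datasets).
for score in (0.62, 0.03, 0.0, -0.41):
    print(
        score,
        label_from_score(score, VADER_THRESHOLD),
        label_from_score(score, TEXTBLOB_THRESHOLD),
    )
```

Under these assumed cut-offs, the same raw score can receive different labels depending on the threshold, which is one reason the neutral class is sensitive to the choice of tool.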
After labeling the comments, we compared the tools. The results are depicted in Figure 1 and Figure 2, which show how each tool classified the comments.
In both datasets, TextBlob and VADER provided similar results. Positive comments predominated, with negative ones being a minority. However, a notable difference emerged in the number of comments labeled as neutral. TextBlob assigned more comments as neutral than VADER, especially in the hedonic dataset, where neutral comments closely rivaled positive ones. The matching label percentage between TextBlob and VADER in the plant-based dataset is 72.48%, and in the hedonic dataset, it is 71.05%.
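The matching-label percentage is the share of comments to which both tools assign the same class. A minimal sketch follows, using short hypothetical label lists; the function name `agreement_percentage` is an assumption for illustration.

```python
from collections import Counter


def agreement_percentage(labels_a, labels_b):
    """Percentage of positions where two label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Hypothetical labels for five comments (not data from the study).
textblob_labels = ["positive", "neutral", "neutral", "negative", "positive"]
vader_labels = ["positive", "positive", "neutral", "negative", "negative"]

print(f"{agreement_percentage(textblob_labels, vader_labels):.2f}%")  # 60.00%

# Class distribution per tool, and the comments the tools disagree on.
print(Counter(textblob_labels), Counter(vader_labels))
disagreements = [
    i for i, (a, b) in enumerate(zip(textblob_labels, vader_labels)) if a != b
]
print(disagreements)  # [1, 4]
```

The list of disagreeing indices is the pool from which a sample for manual comparison could be drawn.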
To determine which tool is more aligned with human understanding, we manually compared a sample of 300 differently labeled comments in each dataset. Table 7 and Table 8 showcase parts of our datasets, highlighting in bold instances where the tools seemed closer to human understanding.
For the comments that the two tools classified differently, VADER seems to be closer to human understanding. We therefore considered VADER the most appropriate sentiment analysis tool for our study. According to VADER, in the plant-based dataset 58.9% of the comments were labeled positive, 32.6% neutral, and 8.5% negative, while in the hedonic dataset 52.5% were labeled positive, 34.6% neutral, and 12.9% negative.
4.2. Performance and Comparison of the ML Algorithms
In this section, we discuss procedures applied to both the plant-based and hedonic datasets. Choosing between micro, macro, and weighted averages as a performance metric is a common challenge in ML. Our study compared all three for model analysis. Figure 3 illustrates these averages for the Random Forest model. The macro and weighted averages present a similar picture, with a slight divergence: macro-average values are slightly lower, which suggests class imbalance, while the weighted average, which accounts for class distribution, tends to be slightly higher. Under the weighted average, the F1 score looks good. However, the model does not classify negative comments with great confidence, which is why we believe the macro-average of 0.81 is a better measure. Similar results were obtained for the other models.
We therefore decided to calculate the evaluation metrics using the macro-average, which assigns equal significance to each class and so handles the dataset imbalance. According to Hamid et al. [69], a slightly imbalanced dataset is defined as having a distribution like 60:40, 55:45, or 70:30 (majority:minority). Furthermore, Opitz [70] notes a growing adoption of ’macro’ metrics in recent years, and Guo et al. [71] highlight an increasing trend in utilizing macro-average indicators for sentiment analysis evaluation.
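The difference between the averaging schemes can be illustrated with a small worked example. The confusion matrix below is hypothetical (rows are true classes, columns are predicted classes) and is not taken from Figure 3; it only shows why a weak minority class pulls the macro-average below the weighted average.

```python
# Hypothetical 3-class confusion matrix: rows = true, columns = predicted.
CLASSES = ["negative", "neutral", "positive"]
CONFUSION = [
    [60, 25, 15],    # true negative-class comments
    [10, 350, 40],   # true neutral-class comments
    [5, 30, 665],    # true positive-class comments
]


def per_class_f1(confusion):
    """F1 per class, computed as 2*TP / (predicted count + actual count)."""
    scores = []
    for k in range(len(confusion)):
        tp = confusion[k][k]
        actual = sum(confusion[k])                    # TP + FN
        predicted = sum(row[k] for row in confusion)  # TP + FP
        denom = actual + predicted
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def averaged_f1(confusion):
    """Return (macro, weighted, micro) F1 for a confusion matrix."""
    f1s = per_class_f1(confusion)
    supports = [sum(row) for row in confusion]
    total = sum(supports)
    macro = sum(f1s) / len(f1s)                       # every class counts equally
    weighted = sum(f * s for f, s in zip(f1s, supports)) / total
    # In single-label multiclass problems, micro-F1 equals accuracy.
    micro = sum(confusion[k][k] for k in range(len(confusion))) / total
    return macro, weighted, micro


macro, weighted, micro = averaged_f1(CONFUSION)
for name, f1 in zip(CLASSES, per_class_f1(CONFUSION)):
    print(f"{name}: F1 = {f1:.3f}")
print(f"macro = {macro:.3f}, weighted = {weighted:.3f}, micro = {micro:.3f}")
```

In this example the minority (negative) class has the lowest F1, so the macro-average is lower than the weighted average, which is the pattern described above.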
4.2.1. Performance and Comparison of the ML Algorithms in Plant-Based Dataset
First, we examine the performance metrics for the plant-based dataset. Table 9 presents the performance values obtained from all five ML algorithms.
In our study, we evaluated performance with respect to accuracy and F1 score. A higher accuracy indicates better model performance. The Support Vector Machine and Logistic Regression models achieved the highest accuracy among the considered models, both scoring 0.93. The F1 score, which combines precision and recall, provides a balanced evaluation of a model’s performance, and a higher F1 score implies a better trade-off between the two. F1 remains a popular metric among researchers, and in multiclass cases the micro/macro averaging procedure offers flexibility, allowing the metric to be tailored to specific goals in diverse contexts [72]. On F1, the Support Vector Machine model outperformed the other models with a score of 0.89, indicating that it maintained a good balance between precision and recall in its predictions. Overall, our results suggest that the Support Vector Machine model excels in both accuracy and F1 score, making it a suitable choice for sentiment analysis of YouTube comments on plant-based products.
4.2.2. Performance and Comparison of the ML Algorithms in Hedonic Dataset
As for the previous dataset, Table 10 presents the performance metrics of all five ML algorithms for the hedonic dataset.
Our evaluation shows that the Support Vector Machine and Logistic Regression models stand out as top performers, both achieving an accuracy of 96% and an F1 score of 94%. This F1 score indicates that they maintain a strong balance between precision and recall in their predictions. To determine which of the two performed better on our dataset, we used their confusion matrices (Figure 4 and Figure 5) to compare the true positive and true negative values of each model in Table 11 and Table 12.
The performance of the two models is quite similar, with subtle differences. If we were to choose one model, the Support Vector Machine emerges as the preferable choice due to its slightly higher combined count of True Positive and True Negative values, indicating a marginally stronger overall performance.
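For a three-class problem, true positives and true negatives are defined per class in a one-vs-rest sense. The sketch below shows one way to derive them from a confusion matrix and to sum them over classes. Both matrices are hypothetical stand-ins (the values in Figure 4 and Figure 5 are not reproduced here), and summing over classes is an assumption about how the counts were combined.

```python
def one_vs_rest_counts(confusion, k):
    """TP, FP, FN, TN for class k (rows = true, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fn - fp
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def combined_tp_tn(confusion):
    """Sum of TP + TN over all classes."""
    return sum(
        one_vs_rest_counts(confusion, k)["TP"] + one_vs_rest_counts(confusion, k)["TN"]
        for k in range(len(confusion))
    )


# Hypothetical confusion matrices for two models on the same 600 comments.
svm_confusion = [[50, 8, 2], [5, 200, 15], [1, 10, 309]]
logreg_confusion = [[48, 9, 3], [6, 198, 16], [1, 11, 308]]

print(one_vs_rest_counts(svm_confusion, 0))
print(combined_tp_tn(svm_confusion), combined_tp_tn(logreg_confusion))  # 1718 1708
```

Summed over all classes, TP + TN equals the number of classes times the sample size minus twice the number of misclassified comments, so on a shared test set this comparison ranks models in the same order as accuracy computed on the raw counts.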
4.3. Engagement Metrics
The Mann-Whitney U tests show significant differences in user engagement metrics between the two datasets, leading us to reject the null hypothesis (Figure 6). Notably, Views, Comments, Likes, and Engagement Rate all exhibit consistent disparities, indicating non-random variation. These findings add quantitative insight into user behavior and emphasize the importance of exploring the underlying factors. For a deeper understanding, descriptive statistics are calculated for each dataset in Table 13.
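The mechanics of the Mann-Whitney U test can be sketched in plain Python. The version below uses the normal approximation with a tie correction and no continuity correction; which variant the study used is not stated, so this is an assumption, and in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used. The per-video engagement rates are illustrative values, not the study's data.

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-sided p-value) using the normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = list(sample_a) + list(sample_b)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)
    n = n_a + n_b
    # Tie correction: each group of t tied values contributes t^3 - t.
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all observations identical: no evidence of a difference
    z = (u - n_a * n_b / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Illustrative engagement rates in percent (hypothetical values).
plant_rates = [3.4, 2.9, 4.1, 5.0, 3.8, 9.08, 1.31]
hedonic_rates = [2.1, 1.9, 0.70, 2.6, 4.06, 2.3]
u_stat, p = mann_whitney_u(plant_rates, hedonic_rates)
print(f"U = {u_stat}, p = {p:.4f}")
```

Because the test is rank-based, it does not assume normally distributed metrics, which suits heavily skewed counts such as views and likes.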
The hedonic dataset has a mean view count of about 10 million, indicating broad reach and potential virality, with views ranging from 1.68 million to 68.75 million. Active audience participation is shown by a mean of 6579 comments and a mean like count of 170,862, reflecting a positive response. The engagement rate suggests moderate interaction, with a mean of 2.29% and a range from 0.70% to 4.06%. In contrast, the plant-based dataset has a lower mean view count of 432,891, indicating less diverse viewership. Comments are fewer, with a mean of 510, and like counts range from 1,109 to 113,000, indicating varying popularity. The mean engagement rate is notable at 3.90%, with a range from 1.31% to 9.08%, showing a more actively engaged audience than in the hedonic dataset. After examining both datasets, we calculate the overall engagement rate from total views, comments, and likes. The engagement rate, computed using the formula in Figure 7, is shown in Table 14. Dataset sizes (59 videos for plant-based and 24 for hedonic) should be kept in mind when interpreting the engagement metrics.
Table 13.
Descriptive Statistics
| Plant-Based | Views | Comments | Likes | Engagement Rate (%) |
|---|---|---|---|---|
| mean | | 510.12 | | 3.90 |
| median | | 307.00 | | 3.42 |
| std | | 510.50 | | 1.64 |
| min | | 104.00 | | 1.31 |
| max | | 2174.00 | | 9.08 |
| var | | 260611.14 | | 2.69 |
| range | | 2070.00 | | 7.77 |
| Hedonic | Views | Comments | Likes | Engagement Rate (%) |
| mean | | 6579.75 | | 2.29 |
| median | | 5544.00 | | 2.10 |
| std | | 4978.95 | | 0.77 |
| min | | 1798.00 | | 0.70 |
| max | | 25168.00 | | 4.06 |
| var | | 24789956.63 | | 0.59 |
| range | | 23370.00 | | 3.36 |
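The summary rows of Table 13 can be computed per metric with the standard library. The sketch below assumes sample (n − 1) standard deviation and variance, which the text does not specify, and uses a short hypothetical list of comment counts; the function name `describe` is illustrative.

```python
import statistics


def describe(values):
    """Summary statistics for one engagement metric."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),      # sample standard deviation (assumed)
        "min": min(values),
        "max": max(values),
        "var": statistics.variance(values),   # sample variance (assumed)
        "range": max(values) - min(values),
    }


# Hypothetical per-video comment counts (not the study's data).
comment_counts = [104, 307, 412, 980, 2174]
for name, value in describe(comment_counts).items():
    print(f"{name}: {value:.2f}")
```

The range row is simply the maximum minus the minimum, which can be verified directly against the min and max rows of the table.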
In total, the plant-based dataset achieved 25.5 million views, 30,097 comments, and 812,411 likes, resulting in a 3.30% engagement rate. The hedonic dataset garnered 241.4 million views, 157,914 comments, and 4.1 million likes, with a lower engagement rate of 1.77%. Despite the higher absolute engagement in hedonic videos, the plant-based dataset maintained a superior rate, suggesting a more consistent and impactful connection per video. The video counts differ as well: the hedonic dataset reached its view total across only 24 videos, whereas the plant-based dataset, with 59 videos and 25.5 million views, still achieved the higher rate. This emphasizes that plant-based content not only attracted attention but also fostered a more engaged audience per video.
Beyond these metrics, we analyze the effect of comment length, measured as the number of characters in each comment, on viewer engagement. In both datasets (Figure 8), the correlation coefficients between comment length and comment likes, and between comment length and reply count, are low, suggesting a weak linear relationship. In the plant-based dataset, the coefficients are 0.03 for comment likes and 0.10 for reply count. In the hedonic dataset, they are even smaller, at 0.02 and 0.06, respectively.
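A linear correlation between comment length and an engagement count can be sketched as follows. The Pearson coefficient is assumed here because the text speaks of a linear connection; the type of coefficient is not stated explicitly. The comments and like counts are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical comments with their like counts (not the study's data).
comments = [
    "Love it!",
    "Tried the oat version last week and it was great",
    "Not for me",
    "Where can I buy this?",
]
likes = [12, 3, 0, 7]

lengths = [len(text) for text in comments]  # comment length in characters
print(f"r = {pearson_r(lengths, likes):.2f}")
```

A coefficient close to zero, as reported for both datasets, means that longer comments do not systematically attract more likes or replies, although it does not rule out a non-linear relationship.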
Table 14.
Overall Engagement Rate of the two datasets.
| Dataset | Views | Comments | Likes | Engagement Rate |
|---|---|---|---|---|
| Plant-Based | 25,540,596 | 30,097 | 812,411 | 3.30% |
| Hedonic | 241,400,820 | 157,914 | 4,100,700 | 1.77% |
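The overall rates can be checked from the dataset totals. The formula in Figure 7 is not reproduced in this section, so the sketch below assumes engagement rate = (comments + likes) / views × 100. Under that assumption the plant-based figure is reproduced and the hedonic figure comes out within 0.01 percentage points of the reported value; the small gap may stem from rounding in the reported totals or from a formula that differs from this assumption.

```python
def engagement_rate(views: int, comments: int, likes: int) -> float:
    """Engagement rate in percent, assuming (comments + likes) / views * 100."""
    if views <= 0:
        raise ValueError("views must be positive")
    return 100.0 * (comments + likes) / views


# Dataset totals as reported for the two datasets.
plant_based_rate = engagement_rate(25_540_596, 30_097, 812_411)
hedonic_rate = engagement_rate(241_400_820, 157_914, 4_100_700)

print(f"plant-based: {plant_based_rate:.2f}%")  # 3.30%
print(f"hedonic:     {hedonic_rate:.2f}%")
```

Because the rate is normalized by views, a dataset with far fewer total views can still show the higher rate, as is the case for the plant-based videos.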
Moving on from comment length correlations, we turn to another engagement measure: the total number of comments per year. Table 15 shows the trends in both datasets and offers insights into user interaction dynamics. The plant-based dataset showed an upward trend from 2018 to 2021, followed by a decline in 2022 and 2023. Similarly, the hedonic dataset peaked in 2020 and declined in subsequent years.
Table 15.
Total number of comments.
Lastly, beyond numerical counts, we examine user activity patterns within each dataset (Figure 9). In the plant-based dataset, activity rises in late morning and peaks between 14:00 and 18:00, with consistently higher activity on Sundays. In contrast, the hedonic dataset sees a peak in activity during the afternoon and evening, especially on Sundays.
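Activity patterns of this kind can be derived by bucketing comment timestamps by hour of day and day of week. The sketch below assumes ISO 8601 timestamps in UTC, as returned in the `publishedAt` field of YouTube comment data; how time zones were handled in the study is not stated, and the timestamps shown are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def activity_counts(timestamps):
    """Count comments per hour of day and per weekday."""
    by_hour = Counter()
    by_weekday = Counter()
    for stamp in timestamps:
        # Older Python versions do not parse a trailing "Z", so normalize it.
        moment = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        by_hour[moment.hour] += 1
        by_weekday[WEEKDAYS[moment.weekday()]] += 1
    return by_hour, by_weekday


# Hypothetical comment timestamps (not the study's data).
sample_timestamps = [
    "2021-05-16T15:04:05Z",
    "2021-05-16T16:30:00Z",
    "2021-05-17T09:12:44Z",
    "2022-01-02T15:45:10Z",
]

by_hour, by_weekday = activity_counts(sample_timestamps)
print(sorted(by_hour.items()))
print(by_weekday.most_common())
```

Since the timestamps are in UTC, peak hours reflect the viewers' local time only if the audience is concentrated in a known time zone, which is worth keeping in mind when interpreting hourly peaks.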