I. Introduction
Big data is playing an increasingly transformative role in modern analytics, revolutionizing industries by enabling deeper insights and more informed decision-making [1]. One of the most impactful areas of big data analytics is in entertainment and recommendation systems, particularly in generating localized recommendations, such as identifying popular movies in specific regions (e.g., Minnesota). To effectively process and analyze large-scale datasets, frameworks like Apache Spark have become essential due to their scalability, performance, and flexibility. Apache Spark is an open-source distributed computing framework designed for big data processing and analytics. It provides superior speed via in-memory processing and accommodates various workloads, such as batch processing, interactive queries, and machine learning.
However, working with large datasets presents several significant challenges, primarily related to data quality, scalability, and the development of efficient algorithms. Data quality issues often stem from semantic heterogeneity and the prevalence of unstructured data, which complicate integration and analysis efforts [2,3]. For instance, inconsistent formatting or missing information can reduce the reliability of results. Scalability also remains a critical concern, as traditional data processing tools struggle to manage the vast volumes and velocities of big data, necessitating advanced frameworks like Hadoop and Spark. These frameworks enable distributed computing but require substantial infrastructure and expertise to implement effectively [2,4]. Furthermore, extracting meaningful insights from complex datasets necessitates the development of efficient algorithms, which often rely on high-quality training data to achieve optimal results [2,5]. While big data analytics holds tremendous potential, addressing these challenges is essential to fully unlock its benefits [4,6].
Apache Spark offers significant advantages in analyzing and processing large datasets. Its architecture, based on the resilient distributed dataset (RDD) programming model, supports efficient data partitioning and locality-aware task placement, optimizing resource utilization and minimizing execution times [7]. Spark's ability to process structured, semi-structured, and unstructured data through DataFrames enhances its performance compared to traditional RDDs, making it a preferred framework for diverse applications, including recommendation systems [8]. Additionally, Spark's integration with artificial intelligence techniques enables robust batch processing, supporting the analysis of massive datasets in fields such as medical and entertainment analytics [9,10]. Despite its advantages, performance tuning and optimization remain critical for fully leveraging Spark's capabilities in the rapidly evolving landscape of big data [11].
This paper addresses a pressing challenge in modern data science: analyzing and extracting actionable insights from large-scale datasets. It contributes to the fields of big data analytics, recommendation systems, and regional data analysis by demonstrating how scalable frameworks like Apache Spark can effectively handle real-world challenges. Specifically, the paper bridges the gap between theoretical big data concepts and practical applications. Using the MovieLens 20M dataset, the research demonstrates an efficient, novel methodology for identifying the top-ranked movies. The results highlight the effectiveness of the proposed approach in key tasks such as data preprocessing, large-scale analysis, user preference modeling, and recommendation. This work provides a replicable blueprint for building recommendation systems, making it valuable both for researchers in big data and for practitioners seeking scalable solutions for regional data analysis and personalized recommendations.
The remainder of this paper is organized as follows: Section II details the proposed methodology, including the data preprocessing steps and model design. Section III presents the experimental results. Finally, Section IV offers a summary of conclusions and potential avenues for future research.
II. Methodology
A. Dataset
The dataset used in this project was sourced from the GroupLens website, a research initiative from the Department of Computer Science and Engineering at the University of Minnesota. GroupLens provides a variety of datasets, including MovieLens, HetRec2011, WikiLens, Book-Crossing, Jester, and EachMovie. For this research, the MovieLens 20M dataset was selected because it is recommended for new research applications.
The MovieLens 20M dataset contains six CSV files and one text file, summarizing 27,278 movies and 138,493 randomly selected users. In terms of the dataset structure, the first file, genome-scores.csv, consists of three columns: movieId, tagId, and relevance. The relevance column represents how strongly a specific tag matches a particular movie, with values ranging from 0 to 1. The second file, genome-tags.csv, contains two columns: tagId and tag. This file provides user-generated labels for movies, which are often descriptive words or phrases. The third file, links.csv, contains movieId, imdbId, and tmdbId, which allow cross-referencing of movies with external databases such as IMDb and TMDb.
The fourth file, movies.csv, contains three columns: movieId, title, and genres. The title includes the movie name along with its release year in parentheses, and the genres column lists the categories associated with each movie in a pipe-separated format. The fifth file, ratings.csv, has four columns: userId, movieId, rating, and timestamp. The rating ranges from 0.5 to 5.0, and each user has rated at least 20 movies. Finally, the sixth file, tags.csv, records user-generated tags along with timestamps. A text file named README provides a detailed description of the dataset and its structure.
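The structure of a movies.csv record described above can be illustrated with a minimal sketch in plain Python (not the Spark code used in this work). The names MovieInfo, split_title, and to_movie_info are hypothetical, and the sketch assumes that the release year, when present, is a four-digit number in parentheses at the end of the title.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Assumption: the year is a four-digit number in parentheses at the end of the title.
_YEAR_PATTERN = re.compile(r"^(?P<name>.*)\((?P<year>\d{4})\)\s*$")


class MovieInfo(NamedTuple):
    """Hypothetical record type for one row of movies.csv."""
    movie_id: int
    name: str
    year: Optional[int]  # None when the title carries no year
    genres: List[str]


def split_title(title: str) -> Tuple[str, Optional[int]]:
    """Separate the movie name from the release year in parentheses."""
    match = _YEAR_PATTERN.match(title.strip())
    if match is None:
        return title.strip(), None
    return match.group("name").strip(), int(match.group("year"))


def to_movie_info(movie_id: str, title: str, genres: str) -> MovieInfo:
    """Build a typed record; genres arrive in pipe-separated form."""
    name, year = split_title(title)
    return MovieInfo(int(movie_id), name, year, genres.split("|"))
```

For the record with movieId 858, title "Godfather, The (1972)", and genres "Crime|Drama", this yields the name "Godfather, The", the year 1972, and the genre list ["Crime", "Drama"].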
B. Data Preprocessing
Data preprocessing was a crucial step in this research to ensure the integrity and usability of the dataset [12]. One significant issue encountered was with movie titles containing commas, which disrupted the parsing of the movies.csv file. For example, the title “Godfather, The (1972)” caused the genre column to be misread as part of the title. To address this, the preprocessing included validating data formats and correcting misaligned fields. Specifically, the input file was filtered to only include rows where all columns satisfied specific data type requirements. For instance, the project ensured that the movie ID consisted of numeric characters and that ratings adhered to a valid floating-point format.
A test case was designed to identify such issues by filtering records for specific movie IDs, such as movieId 858, and checking whether the genre column displayed the correct value. In the case of movieId 858, the expected genre, “Crime|Drama,” was initially replaced by part of the title. By addressing this misalignment, the preprocessing step ensured that subsequent analyses could be conducted without errors. Other than this issue, the dataset was found to be clean and ready for further processing.
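The misalignment and the corresponding test case can be reproduced in a few lines of plain Python. This is a sketch rather than the preprocessing code used in this work: SAMPLE_LINE is an illustrative line that assumes the raw file wraps comma-containing titles in double quotes, and the helper names are hypothetical. It contrasts a naive comma split with the quote-aware reader from Python's standard csv module.

```python
import csv

# Illustrative line; assumes titles with embedded commas are double-quoted in the file.
SAMPLE_LINE = '858,"Godfather, The (1972)",Crime|Drama'


def naive_split(line):
    """Split on every comma, ignoring quoting (the source of the misalignment)."""
    return line.split(",")


def quote_aware_split(line):
    """Split with the csv module, which keeps quoted commas inside one field."""
    return next(csv.reader([line]))


def genre_is_correct(fields, expected="Crime|Drama"):
    """Test case: the row has three fields and the third is the expected genre."""
    return len(fields) == 3 and fields[2] == expected
```

With the naive split, the sample line produces four fields and the third position holds the tail of the title instead of the genre, so genre_is_correct fails. The quote-aware split returns three fields and the check passes.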
C. Spark Algorithm
The analysis relied on Apache Spark for its distributed computing capabilities [13], which facilitated efficient processing of the large dataset. First, the ratings.csv, movies.csv, and genome-scores.csv files were read into Spark as RDDs. Each file was parsed by splitting rows into arrays based on commas. A case class was defined for each dataset to provide structure, allowing elements to be assigned appropriate names and data types such as strings or doubles. The RDDs were then converted into DataFrames for ease of manipulation, and their schemas were printed to verify the data structure.
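The role of the case class can be illustrated outside Spark with a typed record in plain Python. The sketch below is an analogy only, not the Spark code used in this work: Rating and parse_rating are hypothetical names, and the field types are assumptions based on the ratings.csv columns described in Section II.A.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """Hypothetical typed record for one ratings.csv row (types are assumed)."""
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Split a data row on commas and convert each element to its declared type."""
    user_id, movie_id, rating, timestamp = line.strip().split(",")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


def schema(record_type) -> dict:
    """Return field names and type names, comparable to printing a schema."""
    return {name: tp.__name__ for name, tp in record_type.__annotations__.items()}
```

Calling schema(Rating) returns the field names with their types, which serves the same verification purpose as printing a DataFrame schema. The function parse_rating expects a data row; a header row would raise a ValueError, which is why headers must be filtered out before conversion.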
Second, headers were removed from each dataset by applying filters to exclude rows that did not conform to expected patterns. This step ensured that all subsequent computations were based on valid data. For instance, the ratings dataset was filtered to retain only rows where the movieId and rating columns matched numeric patterns.
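A pattern-based filter of this kind might look as follows in plain Python. The exact patterns used in this work are not stated, so the regular expressions and the names is_valid_rating_row and drop_invalid_rows are assumptions made for illustration.

```python
import re

# Assumed patterns: unsigned integer IDs and unsigned decimal ratings.
_INT_PATTERN = re.compile(r"\d+")
_FLOAT_PATTERN = re.compile(r"\d+(\.\d+)?")


def is_valid_rating_row(fields):
    """True when the row has four fields and movieId and rating look numeric."""
    return (
        len(fields) == 4
        and _INT_PATTERN.fullmatch(fields[1]) is not None
        and _FLOAT_PATTERN.fullmatch(fields[2]) is not None
    )


def drop_invalid_rows(lines):
    """Split each line on commas and keep only rows that pass the check."""
    rows = (line.strip().split(",") for line in lines)
    return [row for row in rows if is_valid_rating_row(row)]
```

Because the header row contains column names rather than numbers, it fails the numeric check and is removed by the same filter that rejects malformed data rows.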
Third, key transformations were applied to derive meaningful insights. The ratings dataset was used to calculate statistics such as the count, minimum, maximum, mean, and standard deviation of ratings. Additionally, genres and tags were joined with ratings data to provide a comprehensive view of each movie's characteristics. For example, the project performed an intersection of movies with five-star ratings and movies with high tag relevance to identify the top-ranked films.
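The summary statistics named above can be written out explicitly. The following plain-Python sketch uses the standard statistics module; the function name describe_ratings is hypothetical, and the sample standard deviation is used on the assumption that it matches what Spark's summary statistics report.

```python
import statistics


def describe_ratings(ratings):
    """Return count, min, max, mean, and sample standard deviation.

    Requires at least two values, since the sample standard deviation
    is undefined for a single observation.
    """
    return {
        "count": len(ratings),
        "min": min(ratings),
        "max": max(ratings),
        "mean": statistics.fmean(ratings),
        "stddev": statistics.stdev(ratings),  # sample (n - 1) form, assumed
    }
```

For the three illustrative values 3.0, 4.0, and 5.0, the mean is 4.0 and the sample standard deviation is 1.0.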
Finally, the top 10 movies were extracted by ranking movies based on their ratings and relevance. These results were saved to an HDFS output directory for further verification. To optimize performance and scalability, Spark's configuration settings were adjusted, including increasing the number of executors and memory allocation. This ensured that the system could handle larger datasets with minimal degradation in performance.
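The intersection and ranking steps can be sketched in plain Python. The exact ranking criterion is not specified above, so the sketch makes several assumptions, all of them hypothetical: each movie has one aggregated relevance score, a movie counts as highly relevant at or above a threshold of 0.9, and candidates are ordered by mean rating, then relevance, then movie ID.

```python
def top_movies(mean_rating, five_star_ids, relevance, min_relevance=0.9, n=10):
    """Rank movies that have five-star ratings and high tag relevance.

    mean_rating:   dict movieId -> mean rating (must cover every candidate)
    five_star_ids: set of movieIds with at least one five-star rating
    relevance:     dict movieId -> aggregated tag relevance (assumed form)
    """
    high_relevance_ids = {
        movie_id for movie_id, score in relevance.items() if score >= min_relevance
    }
    # Intersection of the two candidate sets.
    candidates = five_star_ids & high_relevance_ids
    # Assumed ordering: mean rating, then relevance, then ID for determinism.
    ranked = sorted(
        candidates,
        key=lambda movie_id: (-mean_rating[movie_id], -relevance[movie_id], movie_id),
    )
    return ranked[:n]
```

The final sort key on movie ID only makes the order deterministic when two movies tie on both rating and relevance.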
III. Experimental Results
A. Statistics of the Dataset
After applying data preprocessing techniques, the dataset was analyzed to extract key characteristics, uncover trends, and gain insights into its structure. This section highlights the distribution of genres, user rating patterns, and tag relevance, offering a comprehensive view of the dataset.
Tag correlation analysis provides crucial insights into user preferences and the latent attributes of movies. The most relevant tags, including "original," "mentor," and "great ending," indicate strong user engagement and specific expectations for movies. By leveraging these insights, recommendation algorithms can refine movie suggestions by prioritizing films with strong correlations to high-rated tags. Incorporating tag correlation into collaborative filtering models enhances recommendation accuracy by considering the qualitative aspects of user feedback rather than just numerical ratings.
As illustrated in Figure 1, the distribution of genres across movies is highly diverse, but certain genres dominate the dataset. Drama is the most prevalent genre, reflecting its universal appeal and storytelling depth. This is followed by Comedy, which caters to a wide audience seeking lighthearted entertainment, and Thriller, known for its suspenseful and engaging narratives. The prominence of these genres suggests they are central to the dataset and may align with popular user preferences. Other genres such as Action, Romance, and Horror also hold notable shares, underscoring the variety in movie offerings and user interests.
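A genre distribution of this kind can be derived from the pipe-separated genres column by counting each genre once per movie. The plain-Python sketch below illustrates the counting step only; the function name genre_counts is hypothetical, and the inputs in the usage note are illustrative values rather than dataset results.

```python
from collections import Counter


def genre_counts(genre_fields):
    """Count how many movies list each genre, given pipe-separated genre strings."""
    counts = Counter()
    for field in genre_fields:
        counts.update(field.split("|"))
    return counts
```

For the three illustrative inputs "Crime|Drama", "Comedy|Drama", and "Drama", the most common genre is Drama with a count of 3. Because a movie can carry several genres, the counts sum to more than the number of movies.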
The user rating patterns, visualized in Figure 2, show a clear positive bias. Most ratings are concentrated in the range of 3.5 to 4.5, with a peak around 4.0, indicating that users tend to rate movies favorably. This skew toward higher ratings could be attributed to the dataset's nature, where movies that receive poor ratings may be less frequently watched or rated. The relatively small proportion of ratings below 2.0 suggests that either the movies in the dataset are generally well-regarded or that users are reluctant to provide harsh ratings. The histogram in Figure 2 provides a detailed representation of this trend, helping to understand the overall sentiment in user feedback.
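The quantities behind such a histogram are simple to compute. The following plain-Python sketch tallies ratings by value and measures the share that falls below a threshold; the function names are hypothetical, and the values in the usage note are illustrative rather than taken from the dataset.

```python
from collections import Counter


def rating_histogram(ratings):
    """Map each distinct rating value to its frequency, ordered by rating."""
    return dict(sorted(Counter(ratings).items()))


def share_below(ratings, threshold=2.0):
    """Fraction of ratings strictly below the threshold (ratings must be non-empty)."""
    return sum(1 for rating in ratings if rating < threshold) / len(ratings)
```

For the illustrative ratings 4.0, 4.0, 3.5, and 1.0, the histogram has its peak at 4.0 and the share of ratings below 2.0 is 0.25.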
An analysis of tag relevance, presented in Figure 3, reveals the most dominant tags associated with movies. These tags provide an additional layer of context to the dataset, highlighting specific qualities and attributes that resonate with users. The tags “original,” “mentor,” and “great ending” emerged as the most relevant, indicating that users often value originality, the presence of a guiding or inspiring character, and a satisfying resolution. Other notable tags include “emotional,” “unexpected twist,” and “visually stunning,” which reflect diverse user preferences and experiences. This analysis underscores the depth of user engagement and the role of specific attributes in shaping movie perceptions.
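One way to rank tags by relevance is to average the relevance column of genome-scores.csv per tag and attach names from genome-tags.csv. The plain-Python sketch below follows that approach; the aggregation by mean is an assumption, since the ranking criterion is not stated above, and the function name top_tags is hypothetical.

```python
from collections import defaultdict


def top_tags(scores, tag_names, n=3):
    """Rank tags by mean relevance across movies (assumed criterion).

    scores:    iterable of (movieId, tagId, relevance) rows
    tag_names: dict tagId -> tag text
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _movie_id, tag_id, relevance in scores:
        totals[tag_id] += relevance
        counts[tag_id] += 1
    means = {tag_names[tag_id]: totals[tag_id] / counts[tag_id] for tag_id in totals}
    # Highest mean first; ties are broken alphabetically for determinism.
    return sorted(means.items(), key=lambda item: (-item[1], item[0]))[:n]
```

Other aggregations, such as the maximum relevance per tag or the number of movies above a relevance threshold, would fit the same structure by changing the value stored in means.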
In summary, the dataset demonstrates a rich diversity in genres, a tendency toward positive ratings, and meaningful engagement with specific movie attributes through tagging. These insights lay the foundation for further analyses, such as recommendation systems or predictive modeling, by emphasizing the factors that most significantly influence user preferences.
IV. Conclusion
This study presents a scalable and efficient approach to analyzing the MovieLens 20M dataset using Apache Spark, with a focus on identifying the top-ranked movies in Minnesota. The analysis of genres, user ratings, and tag correlations provides key insights into regional user preferences. Furthermore, integrating tag correlation analysis enhances the accuracy of recommendation systems by considering qualitative user preferences beyond numerical ratings. Apache Spark's role in this study demonstrates its effectiveness in large-scale data processing, significantly improving computational efficiency and scalability. The findings emphasize the importance of distributed computing frameworks in modern recommendation systems and highlight potential areas for further enhancement.
Despite its strengths, this methodology has certain limitations. The reliance on collaborative filtering approaches may struggle to capture evolving user preferences or address the cold-start problem, where limited data for new users or items hinders recommendations. Additionally, the analysis does not incorporate external contextual factors, such as social media sentiment or real-time user interactions, which could further enhance predictive accuracy.
Future research will explore advanced machine learning techniques, such as neural collaborative filtering, to improve personalization and adapt to non-linear user-item relationships. Moreover, integrating external data sources, including social media sentiment analysis and real-time user interactions, will provide a more comprehensive understanding of user preferences and trends. By addressing these aspects, future studies can further optimize recommendation systems and refine big data analytics techniques, ensuring more accurate and personalized content recommendations for users.
References
- Janssen, M., Van Der Voort, H., & Wahyudi, A. (2017). Factors influencing big data decision-making quality. Journal of Business Research, 70, 338-345.
- Farhana Zaman Rozony, Mst Nahida Aktar Aktar, Md Ashrafuzzaman, Ashraful Islam. (2024). A systematic review of big data integration challenges and solutions for heterogeneous data sources.
- Omaiyma Abbas, Romaytha Salih, Samah Abdalla, Aisha Elhassan, Al-Alas Mohammed, Shima Suliman. (2023). Big data issues and challenges. International Research Journal of Modernization in Engineering Technology and Science.
- Nand Kumar et al. (2023). Harnessing the Power of Big Data: Challenges and Opportunities in Analytics.
- Salil Bharany, Nasser Taleb, Muhammad Tariq Sadiq, Nayab Kanwal, Taher M. Ghazal, Manas Pradhan, Ateeq Ur Rehman. (2023). A Comprehensive Review on Big Data Challenges.
- Saeed, I., & Kumar, R. (2023). Challenges and Emerging Patterns in Big Data Analytics. Authorea Preprints.
- Vishnu Prasad Verma, T. P. Sinha, Santosh Kumar, Nenavath Srinivas Naik. (2024). Performance Analysis of Apache Spark Job Schedulers for Big Data Processing. 2017 IEEE Region 10 Symposium (TENSYMP).
- Ashima Sahni. (2024). A Comparative Analysis of Apache Spark Dataframes over Resilient Distributed Datasets (RDDs). Indian Scientific Journal of Research in Engineering and Management.
- Himanshu Gupta. (2024). Big Data Analytics using Artificial Intelligence: Apache Spark for Scalable Batch Processing. International Journal of Innovative Science and Research Technology.
- Dragan Stojanović, Dušan Jovanović, Natalija Stojanović. (2024). Big Medical Data Analytics Using Apache Spark Framework.
- Chaganti Sri Karthikeya Sahith, Satish Muppidi, Suneetha Merugula. (2023). Apache Spark Big data Analysis, Performance Tuning, and Spark Application Optimization.
- García, S., Luengo, J., & Herrera, F. (2016). Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowledge-Based Systems, 98, 1-29.
- Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1, 145-164.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).