I. Introduction
The reusability of software packages (i.e., artifacts) has significantly enhanced developer productivity and improved the quality of software systems by enabling developers to build upon pre-existing packages. However, the extensive availability of open-source software packages in ecosystems such as Maven often makes it challenging for developers to select, contribute to, or reuse the most suitable packages. Popularity metrics, which quantify aspects of a package’s usage, community engagement, and technical activity, play a crucial role in guiding these decisions.
This study investigates the role of popularity metrics in evaluating open-source software packages in Maven and addresses the following research questions: (1) How popular are the packages in the Maven ecosystem? (2) How are different popularity metrics of Maven packages correlated with each other? (3) Which of the studied metrics are the most important in determining highly popular packages?
To answer these questions, we analyzed 103,315 Maven packages, all of which are at least two years old. We collected metrics related to popularity from two key sources: the Maven Central Neo4j dataset [1] and GitHub repositories. Metrics such as release frequency, dependencies, and usages reflect the technical and maintenance activity of packages, while GitHub metrics like star count, fork count, pull requests, contributors, and the presence of README files provide insights into community engagement and developer support [2]. Logistic regression analysis was employed to determine the impact of these metrics on the popularity of the Maven packages.
Our findings reveal that metrics such as license status, the number of commits, the presence of README files, and usages are the most significant predictors of package popularity, with strong statistical support. Notably, while vulnerabilities were intuitively expected to be impactful, they did not reach statistical significance in this analysis. This study provides actionable insights for developers and researchers, highlighting the most critical factors for evaluating and selecting open-source software packages effectively. For more details, please refer to the dataset and findings at https://zenodo.org/records/14788435.
The paper is organized as follows: Section II outlines data collection, including sources, metrics, and filtering. Section III presents the analysis on popularity, feature importance, and correlations. Section IV reviews related studies, and Section V concludes the paper.
II. Data Collection
The data collection process involves retrieving package IDs, release versions, the number of dependencies, usage counts, etc., from the Maven Central Neo4j dataset [1]; collecting GitHub source code URLs; gathering repository metrics from GitHub; and applying filtering techniques to ensure data quality and relevance.
From the Maven Central Neo4j dataset (Version 2024-08-30), we collect artifact IDs and the latest release versions. Using the information, we construct Maven URLs to retrieve pom.xml files for each package. If a GitHub repository URL is specified in the pom.xml, we collect the URL for further analysis.
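The URL construction and repository lookup described above can be sketched as follows. This is a minimal illustration rather than the study's actual script: it assumes the standard Maven repository layout on the `repo1.maven.org` mirror, and the function names `pom_url` and `github_repo_from_pom` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional

# Assumption: POMs are fetched from this Maven Central mirror.
MAVEN_BASE = "https://repo1.maven.org/maven2"


def pom_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build a release's POM URL using the standard Maven repository layout."""
    group_path = group_id.replace(".", "/")
    return (f"{MAVEN_BASE}/{group_path}/{artifact_id}/{version}/"
            f"{artifact_id}-{version}.pom")


def _local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}scm'."""
    return tag.rsplit("}", 1)[-1]


def github_repo_from_pom(pom_xml: str) -> Optional[str]:
    """Return 'owner/repo' if the POM's <scm> block or project <url> points at GitHub."""
    root = ET.fromstring(pom_xml)
    candidates: List[str] = []
    # Prefer the <scm> children (url, connection, developerConnection).
    for child in root:
        if _local(child.tag) == "scm":
            candidates += [(e.text or "").strip() for e in child]
    # Fall back to the project-level <url>.
    for child in root:
        if _local(child.tag) == "url":
            candidates.append((child.text or "").strip())
    pattern = r"github\.com[:/]([\w.-]+)/([\w.-]+?)(?:\.git)?(?:[/#?]|$)"
    for text in candidates:
        match = re.search(pattern, text)
        if match:
            return f"{match.group(1)}/{match.group(2)}"
    return None
```

Packages for which `github_repo_from_pom` returns `None` would correspond to those without usable GitHub repository information.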
We utilize the Goblin Weaver tool [3] to extract metrics from the Maven Central Neo4j dataset, encompassing release count, release frequency (i.e., how often new versions of a package are published within a given timeframe), dependencies (number of required packages), usages (number of packages that depend on it), popularity over the past year (activity level of the package), and vulnerabilities (count of vulnerable releases). Additionally, we retrieve GitHub repository metrics via the GitHub API: star count, fork count, pull requests, subscriber count, presence of a license (binary), tag count, open and closed issue counts, contributor count (number of individuals who contributed), commit count, creation and update timestamps, and the presence of README files (binary) and repository descriptions. These metrics are then processed and structured for further analysis.
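As an illustration of the structuring step, the sketch below flattens the JSON object returned by the GitHub REST API repository endpoint (`GET /repos/{owner}/{repo}`) into a metric record. The input field names are those of the GitHub response; the output keys and the function name `repo_metrics` are hypothetical. Counts that require separate endpoints (e.g., commits, contributors, tags, pull requests) are omitted here.

```python
from typing import Any, Dict


def repo_metrics(repo: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a GitHub repository JSON object into a metric record."""
    return {
        "star_count": repo.get("stargazers_count", 0),
        "fork_count": repo.get("forks_count", 0),
        "subscriber_count": repo.get("subscribers_count", 0),
        # Binary indicators: 1 if present, 0 otherwise.
        "has_license": int(repo.get("license") is not None),
        "has_description": int(bool(repo.get("description"))),
        "created_at": repo.get("created_at"),
        "updated_at": repo.get("updated_at"),
    }
```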
We collect 658,078 package IDs from the Maven Central Neo4j dataset. We remove 228,934 packages whose pom.xml files lack source code repository information. The remaining packages are distributed across 8,571 platforms, with the majority hosted on https://github.com (141,382 packages). For this study, we only consider those packages that keep their source code in GitHub repositories. We further remove 38,067 packages because their GitHub repositories are no longer available, leaving a final dataset of 103,315 packages for which we collect GitHub and Maven Central metrics.
III. Analysis and Result
This section addresses the research questions of our study.
RQ1: How Popular are the Packages in the Maven Ecosystem?
To evaluate the popularity of packages in the Maven ecosystem, we analyzed the distribution of packages based on their GitHub star count. GitHub stars serve as a widely accepted proxy for popularity, as they reflect user interest, adoption, and community endorsement. While not a perfect measure, stars indicate the visibility and perceived value of a package within the developer community [4]. Our analysis reveals a heavily skewed distribution, where most packages have low star counts, and only a small fraction achieve high levels of recognition.
Figure 1 illustrates the distribution of packages across different star ranges. The largest group of packages (about 40%) has 10 or fewer stars, indicating minimal popularity for a large portion of the dataset. As the star count increases, the number of packages decreases significantly, with only a small subset achieving widespread recognition.
Packages in the 1–10 stars range dominate the dataset, with 44,388 packages representing repositories with minimal popularity. Moving to the 11–100 stars range, 27,817 packages demonstrate moderate popularity, though still significantly below higher ranges. In the 101–1000 stars range, 21,989 packages reflect a notable increase in recognition compared to the lower ranges. The 1001–10000 stars range includes 10,282 highly popular repositories, while only 4,128 packages surpass the 10000+ stars mark, making up the most widely recognized and utilized repositories.
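The grouping into star ranges can be sketched as below. The range boundaries follow the text; placing repositories with zero stars in a separate bucket is an assumption, and the names `STAR_RANGES` and `star_bucket` are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Inclusive star ranges taken from the text above.
STAR_RANGES = [
    (1, 10, "1-10"),
    (11, 100, "11-100"),
    (101, 1000, "101-1000"),
    (1001, 10000, "1001-10000"),
]


def star_bucket(stars: int) -> str:
    """Map a star count to its range label."""
    for low, high, label in STAR_RANGES:
        if low <= stars <= high:
            return label
    # Assumption: zero-star repositories get their own bucket.
    return "10000+" if stars > 10000 else "0"


def star_distribution(star_counts: Iterable[int]) -> Counter:
    """Count how many packages fall into each star range."""
    return Counter(star_bucket(s) for s in star_counts)
```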
These findings highlight the uneven distribution of package popularity in Maven, with a few gaining significant recognition while most receive little attention.
RQ2: How are Different Popularity Metrics of Maven Packages Correlated With Each Other?
To investigate the relationships between different popularity metrics (see Table 1) of Maven packages, we conducted a correlation analysis using both Pearson and Spearman methods. Pearson correlation measures the linear relationship between metrics, while Spearman correlation assesses their monotonic association, capturing both linear and non-linear trends. The results are presented in Figure 2, which provides a detailed view of the strength and direction of correlations among the metrics.
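To make the distinction between the two coefficients concrete, the following dependency-free sketch computes both. Spearman correlation is implemented as the Pearson correlation of average ranks, which is its standard definition; the function names are illustrative, and a statistics library would normally be used in practice.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks (monotonic association)."""
    return pearson(average_ranks(x), average_ranks(y))
```

For a monotonic but non-linear pair such as `x = [1, 2, 3, 4, 5]` and `y = [1, 4, 9, 16, 25]`, `spearman` returns a value of 1 (up to floating-point error) while `pearson` stays below 1, which is why reporting both is informative for skewed metrics such as star counts.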
The analysis reveals that community-driven metrics, such as stars and subscribers, exhibit strong positive correlations (e.g., 0.82 between stars and subscribers). A similarly strong relationship is observed between stars and forks (0.63), indicating that popular repositories tend to attract more engagement in the form of forks and subscribers. Furthermore, pull requests show a strong correlation with closed issues (0.88), reflecting the active development and responsiveness of popular repositories.
Moderate correlations are observed between contributors count and closed issues count (0.56), as well as between open issues count and tags count (0.59). These findings indicate that repositories with more contributors and frequent updates handle issues more effectively, enhancing their popularity.
In contrast, technical metrics such as dependencies (both upstream and downstream), release frequency, and vulnerabilities exhibit weak or very weak correlations (less than 0.2) with community-driven metrics like stars and forks. For instance, the correlation between dependencies and other metrics is negligible, indicating that while dependencies may impact usability, they do not strongly correlate with engagement or popularity. Similarly, vulnerabilities show almost no relationship with metrics like stars or forks, implying that security concerns are not reflected in GitHub-based popularity indicators.
Overall, these findings highlight that community engagement metrics, such as stars, subscribers, and forks, are interrelated and serve as strong indicators of package popularity. On the other hand, Maven-specific technical metrics, while critical for functionality, have limited influence on defining package popularity and engagement.
RQ3: Which of the Studied Metrics are the Most Important in Determining Highly Popular Packages?
Table 1 provides a comparative statistical summary of the metrics for the top 20% and bottom 20% of packages based on popularity. Metrics such as Star Count (median: 415.0 for the top 20% vs. 3.0 for the bottom 20%) and Fork Count (median: 53.0 for the top 20% vs. 1.0 for the bottom 20%) show substantial differences, with high statistical significance and medium to large effect sizes, underscoring their strong influence on package popularity. Similarly, metrics like Closed Issues Count and Contributors Count demonstrate large effect sizes, indicating their critical role in reflecting the activity and collaboration around the package.
Conversely, certain metrics, such as Vulnerabilities, Usages, and Release Frequency, show negligible effect sizes, suggesting limited direct impact on package popularity. While these metrics provide valuable context about a package, their negligible effect sizes suggest that developers may prioritize other factors, such as community engagement and active development, when assessing package adoption.
To identify the most significant metrics influencing the popularity of Maven packages, we performed a detailed quantitative analysis using logistic regression and hierarchical clustering. This analysis aimed to reduce multicollinearity among metrics and provide actionable insights into the factors most correlated with package popularity.
The analysis considered packages that were at least two years old to ensure maturity and sufficient historical data. Metrics were standardized using standard scaling to normalize feature values across different ranges. The metric About Info was preprocessed by removing stop words and counting unique words to quantify the richness of the package descriptions.
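The two preprocessing steps can be sketched as follows. The stop-word list here is a tiny illustrative set rather than the one used in the study, the tokenization rule is an assumption, and scaling uses the population standard deviation; the function names are hypothetical.

```python
import re
from math import sqrt
from typing import List, Sequence

# Assumption: a small illustrative stop-word list, not the study's list.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "for", "to",
              "in", "on", "with", "is", "are"}


def about_info_score(text: str) -> int:
    """Count unique non-stop-word tokens in a repository description."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return len({w for w in words if w not in STOP_WORDS})


def standard_scale(values: Sequence[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0] * n
    return [(v - mean) / std for v in values]
```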
Hierarchical clustering addressed multicollinearity, grouping metrics with a cutoff. A representative metric was selected for each cluster based on interpretability and relevance.
The selected metrics include: License, Commits Count, Readme Exists, About Info, Dependencies, Usages, Closed Issues Percentage, Release Frequency, and Vulnerabilities.
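One way to group correlated metrics is sketched below. It makes several assumptions not stated in the text: the distance between two metrics is taken as 1 - |r|, single linkage is used, and the cutoff of 0.3 is purely illustrative. Cutting a single-linkage dendrogram at a given height is equivalent to finding the connected components of the graph that links every pair of metrics within that distance, so a union-find structure suffices. The function name `correlated_clusters` is hypothetical.

```python
from typing import Dict, FrozenSet, List


def correlated_clusters(names: List[str],
                        corr: Dict[FrozenSet[str], float],
                        cutoff: float = 0.3) -> List[List[str]]:
    """Group metrics whose correlation distance (1 - |r|) is within the cutoff."""
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = corr.get(frozenset((a, b)), 0.0)  # missing pairs count as uncorrelated
            if 1 - abs(r) <= cutoff:
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())
```

With the correlations reported for RQ2 (0.82 for stars and subscribers, 0.63 for stars and forks, 0.88 for pull requests and closed issues) and this illustrative cutoff, stars and subscribers fall into one cluster, pull requests and closed issues into another, and forks remain on their own. One representative per cluster would then be kept.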
Packages were categorized as “highly popular” (top 20%) or “less popular” (bottom 20%) based on their star count. The middle 60% of packages were excluded to create a clear separation between the two classes. A logistic regression model was trained using these labels as the dependent variable and the selected metrics as independent variables. The model’s performance was evaluated using the area under the ROC curve (AUC), with an AUC of 0.85 indicating strong discriminative ability.
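The labeling and evaluation steps can be sketched as below; fitting the logistic regression itself is omitted. The AUC is computed through its rank interpretation, i.e., the probability that a randomly chosen highly popular package receives a higher model score than a randomly chosen less popular one, with ties counted as one half. The function names and the handling of ties at the 20% boundaries are assumptions.

```python
from typing import Dict, Sequence


def label_extremes(star_counts: Sequence[int], fraction: float = 0.2) -> Dict[int, int]:
    """Label the top `fraction` of packages 1 and the bottom `fraction` 0.

    Returns a mapping from package index to label; the middle is left out.
    """
    n = len(star_counts)
    k = int(n * fraction)
    order = sorted(range(n), key=star_counts.__getitem__)
    labels = {i: 0 for i in order[:k]}
    labels.update({i: 1 for i in order[n - k:]})
    return labels


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of packages and is meant only to show the definition; a sorted-rank formulation would be preferable at the scale of the full dataset.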
The logistic regression model results are summarized in Table 2, detailing the Wald statistics, p-values, and significance levels for each metric.
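For reference, and assuming the standard definition is the one used here, the Wald statistic for the coefficient of metric j compares the estimated coefficient with its standard error:

```latex
W_j = \left( \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)} \right)^{2},
\qquad
W_j \;\overset{\text{approx.}}{\sim}\; \chi^2_1 \quad \text{under } H_0 : \beta_j = 0 .
```

A larger value of W_j therefore indicates stronger evidence that metric j contributes to separating highly popular from less popular packages, given the other metrics in the model.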
The analysis reveals critical insights into the factors influencing package popularity in the Maven ecosystem. The most important factors include License, Commits Count, README Exists (i.e., the presence of README files), About Info, and Usages. Among these, License stands out as the most influential metric, with the highest Wald value (1326.15). This result highlights the role of permissive licensing in encouraging broader usage and adoption of packages. Similarly, Commits Count is another vital metric, as frequent commits indicate active development and maintenance, which fosters user confidence in the package’s reliability. The presence of a README file is also significant, as it improves package accessibility by providing essential documentation, thereby making it easier for developers to adopt and integrate the package. About Info contributes to popularity by offering clear and detailed descriptions of the package’s functionality, enhancing its visibility and appeal. Finally, the Usages metric reflects the package’s real-world application and community adoption, serving as a direct indicator of its popularity.
Metrics such as Closed Issues Percentage and Release Frequency showed moderate importance. A higher percentage of closed issues signifies the maintainers’ ability to efficiently address problems, which enhances user satisfaction and confidence. Similarly, frequent releases suggest active maintenance and responsiveness to evolving user needs, contributing to the package’s reputation and reliability. These factors, while not as influential as the most critical metrics, still play an essential role in shaping user perception and adoption.
Interestingly, some metrics that are intuitively expected to be significant, such as Vulnerabilities, were found to have a limited statistical impact. Although a low number of vulnerabilities could enhance trust in a package, this metric did not achieve statistical significance. This could be attributed to limited variation in vulnerabilities across the dataset or its indirect relationship with perceived popularity.
The analysis highlights that packages with permissive licenses, active development (high commits count), and robust documentation (README exists) are more likely to be highly popular. These insights can guide developers and ecosystem maintainers in prioritizing package features to enhance their utility and adoption.