Submitted:
21 August 2025
Posted:
22 August 2025
Abstract
Keywords:
1. Introduction
2. Existing Literature Review: Big Data Definition
2.1. Foundational Frameworks: The "3 Vs" Model
2.2. Expanded Definitions: Moving Beyond the "3 Vs"
2.3. Industry-Specific Definitions: Real-Time Analytics and Decision-Making
2.4. Critical Challenges and the Need for Standardization
2.5. Toward a Quantitative and Contextual Understanding
3. Data bigness: A statistical variability-based framework
3.1. Multidimensional Quantitative Definition of Big Data
3.1.1. Statistical Complexity
- Mean norm: The norm of the estimated mean vector. Significant magnitude or temporal drift in the mean vector can complicate modeling, especially in non-stationary contexts [6].
- Absolute multivariate kurtosis: Defined by K.V. Mardia in [7], it assesses the "tailedness" and peakedness of the data relative to a multivariate normal distribution. High kurtosis indicates heavy tails (leptokurtosis), suggesting a higher propensity for outliers or extreme events that can destabilize standard estimators and necessitate robust statistical approaches [23].
- Tr($\Sigma$): The trace of the covariance matrix $\Sigma$, equivalent to the sum of the variances of the individual variables. It provides a simple scalar summary of the total variance or dispersion within the dataset [28]. High total variance can be indicative of complexity, particularly in high-dimensional settings.
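The three components above can be illustrated numerically. The following is a minimal sketch, assuming NumPy and assuming that the kurtosis term is Mardia's $b_{2,p}$ statistic computed with the maximum-likelihood covariance; the function name `statistical_complexity_components` and the synthetic data are hypothetical, and how the paper weights or normalizes these quantities is not shown here.

```python
import numpy as np

def statistical_complexity_components(X):
    """Return mean norm, Mardia kurtosis and covariance trace of an (n, p) array."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mean = X.mean(axis=0)
    centered = X - mean
    # Maximum-likelihood covariance (divide by n), as used in Mardia's statistic.
    cov = centered.T @ centered / n
    # Squared Mahalanobis distance of every observation from the sample mean.
    d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(cov), centered)
    return {
        "mean_norm": float(np.linalg.norm(mean)),
        "mardia_kurtosis": float(np.mean(d2 ** 2)),
        "cov_trace": float(np.trace(cov)),
    }

# Illustrative data only: a Gaussian sample and a heavier-tailed Student-t sample.
rng = np.random.default_rng(0)
gaussian = rng.normal(size=(20000, 3))
heavy = rng.standard_t(df=5, size=(20000, 3))

print(statistical_complexity_components(gaussian))
print(statistical_complexity_components(heavy))
```

For a p-variate normal distribution the expected value of Mardia's kurtosis is p(p + 2), so the Gaussian sample should land near 15 for p = 3, while the heavy-tailed sample should score noticeably higher.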
3.1.2. Computational Complexity
- Time-complexity score: A numerical representation of the algorithm’s asymptotic time complexity (e.g., mapping successive complexity classes to a monotonically increasing numerical scale). High time complexity signifies that execution time grows rapidly with data size n, potentially exceeding acceptable latency or processing windows [17,23,24].
3.1.3. Algorithmic (NP-Hard) Complexity
- NP-hardness indicator: An indicator reflecting the complexity class of the problem solved by the algorithm A. For instance, the indicator could be assigned a high value (e.g., 1) if the problem is NP-hard and typically requires non-exact methods for large instances encountered in Big Data, and a low value (e.g., 0) if the problem admits efficient polynomial-time algorithms.
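The computational and algorithmic dimensions can both be encoded as simple numeric scores. The sketch below is illustrative only: the complexity classes listed, the integer scale, and the names `TIME_COMPLEXITY_SCALE`, `time_complexity_score`, and `np_hard_indicator` are assumptions made for this example, not values taken from the framework.

```python
# Hypothetical ordinal scale: classes and scores are illustrative assumptions.
TIME_COMPLEXITY_SCALE = {
    "O(1)": 0,
    "O(log n)": 1,
    "O(n)": 2,
    "O(n log n)": 3,
    "O(n^2)": 4,
    "O(n^3)": 5,
    "O(2^n)": 6,
}

def time_complexity_score(big_o: str) -> int:
    """Map an asymptotic class label to a monotonically increasing score."""
    return TIME_COMPLEXITY_SCALE[big_o]

def np_hard_indicator(is_np_hard: bool) -> int:
    """Return 1 for NP-hard problems and 0 for polynomial-time solvable ones."""
    return 1 if is_np_hard else 0

# Example: an O(n^2) clustering routine applied to an NP-hard objective.
profile = {
    "time_score": time_complexity_score("O(n^2)"),
    "np_hard": np_hard_indicator(True),
}
print(profile)
```

Any strictly increasing scale would serve the same purpose; what matters for threshold comparisons is the ordering of the classes, not the particular integers.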
3.2. Establishing Domain-Specific Thresholds and Implication
- Finance: Statistical thresholds for high kurtosis might be benchmarked against historical crisis data [15,21]. Computational thresholds could be dictated by latency requirements (e.g., algorithms slower than a required complexity bound being deemed too slow). NP-hard complexity could be triggered by tasks like complex portfolio optimization [22,27,28].
- Retail: Statistical thresholds might relate to identifying significant deviations in customer behavior [29]. Computational limits could be set by the feasibility of daily analyses on massive transaction volumes (e.g., algorithms above a certain complexity class being infeasible). NP-hard challenges arise in vehicle routing or large-scale clustering [30].
- High Algorithmic Complexity: Requires shifting from exact methods to well-justified heuristics, approximation algorithms, randomized algorithms, or specialized solvers [32].
3.3. Contextualizing Statistical Complexity: Positioning within Existing Frameworks
- Elevated variance (trace or norm of the covariance matrix)
- Non-Gaussian behavior (skewness and kurtosis)
- Strong inter-variable correlation (covariance structure)
- Shifting distributional centers (mean norm)
- Objective: Classical complexity measures aim to understand system behavior; our statistical complexity measure aims to support model selection and window adaptation.
- Granularity: Other measures operate at the level of entire systems or sequences; ours is local and segment-level.
- Actionability: Its values directly inform whether a dataset slice warrants specialized modeling strategies or resource allocation.
3.4. Advantages of Our Proposed Multi-Dimensional Framework
- Theoretical Rigor and Comprehensiveness: Provides a more complete, multifaceted, and conceptually sound basis for characterizing the challenges posed by modern datasets, grounded in statistics and computer science principles.
- Enhanced Interpretability: It clearly separates distinct sources of difficulty (statistical properties, resource demands, intrinsic problem hardness), allowing a more precise diagnosis of analytical bottlenecks.
- Actionable Analytical Guidance: It directly informs strategic decisions on selecting appropriate statistical methods, computational infrastructure, and algorithmic techniques tailored to the specific complexities encountered.
4. Review of Big Data Analytics Challenges
4.1. Challenges Stemming from Statistical Complexity and Data Veracity
4.2. Challenges Stemming from Computational Complexity, Volume, and Velocity
4.3. Challenges Stemming from Algorithmic Complexity
4.4. Interrelated Challenges: Variety, Integration, Security, Value Extraction, and Skills
- Variety and Integration: Big Data encompasses diverse types (structured, unstructured, semi-structured) from multiple sources. Integrating these heterogeneous sources for analysis is challenging, sometimes requiring specialized tools (e.g., NoSQL databases [49]) and potentially leading to data silos that hinder comprehensive analysis [42,44,45,47].
- Security and Privacy: Protecting vast amounts of potentially sensitive data is paramount [40,41]. Concerns include data breaches, compliance with regulations (e.g., GDPR, CCPA), unauthorized access, and ensuring privacy throughout the data lifecycle [42,44,45,50]. Big Data environments, including IoT initiatives, increase the potential attack surface by introducing more endpoints [50]. Robust security measures like comprehensive data protection strategies, encryption, authentication, authorization, and strict, granular access control are essential but challenging to implement at scale [41,42,45,50,51].
- Value Extraction and Skills Gap: Extracting meaningful, actionable insights and generating tangible value from big data is the ultimate goal, yet it remains a significant challenge [41,51]. Furthermore, surveys indicate that a lack of skilled personnel (e.g., data scientists) with the expertise to manage the infrastructure, apply advanced analytical techniques, and correctly interpret results is a major barrier to adoption [45,50]. Selecting the right tools and platforms is also crucial but complex, as no single solution fits all Big Data needs [40,44].
5. Proposed Methodology: Adaptive High-Fluctuation Recursive Segmentation
5.1. Introduction and Context
5.2. Review of Baseline Segmentation Approaches and Their Limitations
5.2.1. Fixed-Size Sliding Windows
5.2.2. ADWIN (Adaptive Windowing)
- ADWIN is optimized for detecting recent changes and retaining data from the current distribution. In contrast, our approach identifies historically significant segments with pronounced statistical fluctuation and integrates them with recent observations to form a forecasting-optimized dataset.
- Upon detecting change, ADWIN discards the older portion of its window to maintain adaptability. In contrast, our method explicitly retrieves and reuses historical segments, selecting those with statistically significant fluctuation for inclusion in the forecasting dataset. Thus, where ADWIN forgets, our approach remembers selectively.
- ADWIN primarily bases its adaptation on changes in means or simple distributional statistics. Our method incorporates a richer multivariate statistical characterization, leveraging higher-order moments such as skewness and kurtosis. This design allows our method to adapt to complex features like volatility, asymmetry, and tailedness, which are important in real-world, high-dimensional forecasting across diverse industries [7,21,53].
5.3. Foundational Recursive Segmentation (Likelihood Ratio)
5.4. Adaptive Window Size Determination
- Preliminary Similarity, which captures differences in central tendency and overall variance. It is computed from the norm of the mean vector and the trace of the covariance matrix of each of the two data segments being compared.
- Detailed Similarity, which measures differences in higher-order moments using the composite variability metric defined in [18]. This metric includes skewness and kurtosis and ensures that the two segments are also aligned in their distributional characteristics, such as shape, asymmetry, and tail behavior [7,21,54]. It is computed from the weighted-aggregation composite metrics (defined in [18]) of the two data segments.
5.5. High-Fluctuation Segment Selection and Optimal Dataset Construction
- Target Set for Segmentation: Constructed by combining the full past segment with the earliest portion of the recent segment.
- Recursive Segmentation: Apply the likelihood-ratio segmentation algorithm (Section 5.3) on the target set, yielding a set of statistically significant boundaries.
- Top-k Segment Selection: Rank boundaries by descending fluctuation intensity and select the top k. For each selected boundary, extract a segment around it. The value of k and the chunk sizes are chosen such that the resulting dataset respects the global memory constraint.
- Optimal Dataset Construction: Combine the selected high-fluctuation segments with the most recent observations to form the training dataset.
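The selection and construction steps listed above can be sketched as a budgeted greedy procedure. This is a simplified illustration under stated assumptions, not the authors' implementation: boundaries are assumed to arrive as `(index, score)` pairs from a prior segmentation step, each chunk is a fixed-width window around a boundary, and the names `build_training_indices`, `budget`, `recent_len`, and `half_width` are hypothetical.

```python
from typing import List, Sequence, Tuple

def build_training_indices(
    n_obs: int,
    boundaries: Sequence[Tuple[int, float]],
    budget: int,
    recent_len: int,
    half_width: int,
) -> List[int]:
    """Pick the recent window plus top-scoring historical chunks within a budget."""
    if recent_len > budget or recent_len > n_obs:
        raise ValueError("recent window must fit in the budget and the series")
    recent_start = n_obs - recent_len
    selected = set(range(recent_start, n_obs))  # always keep recent data
    remaining = budget - recent_len
    # Visit boundaries from the highest to the lowest score.
    for idx, _score in sorted(boundaries, key=lambda b: b[1], reverse=True):
        lo = max(0, idx - half_width)
        hi = min(recent_start, idx + half_width)
        chunk = [t for t in range(lo, hi) if t not in selected]
        # Skip chunks that are empty or would exceed the remaining budget.
        if not chunk or len(chunk) > remaining:
            continue
        selected.update(chunk)
        remaining -= len(chunk)
    return sorted(selected)

# Toy example: 100 observations, budget of 40, last 20 always retained.
indices = build_training_indices(
    n_obs=100,
    boundaries=[(20, 5.0), (50, 9.0), (70, 1.0)],
    budget=40,
    recent_len=20,
    half_width=5,
)
print(len(indices), indices[:3], indices[-3:])
```

In the toy example the two highest-scoring boundaries consume the remaining budget, so the lowest-scoring one at index 70 is left out. A greedy rule is only one possible design; other selection rules that respect the same budget would fit the description equally well.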
5.6. Advantages and Scalability
6. Evaluation Across Univariate and Multivariate Forecasting
6.1. Univariate Forecasting: Bitcoin Case Study
6.1.1. Dataset and Forecast Objective
6.1.2. Experimental and Environment Setup
- Mean norm becomes the scalar mean.
- Covariance matrix norm is reduced to the sample variance.
- Skewness and kurtosis are, respectively, reduced to their standard univariate forms.
- Trace of the covariance matrix is equivalent to the sample variance.
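For reference, the univariate reductions listed above correspond to the following standard moment definitions. This is a sketch in assumed notation ($x_t$ for the observations, $n$ for their count, $g_1$ and $g_2$ for skewness and kurtosis); whether the original work normalizes by $n$ or $n-1$, or reports excess kurtosis, is not stated here.

```latex
\begin{align}
  \bar{x} &= \frac{1}{n}\sum_{t=1}^{n} x_t, &
  s^2 &= \frac{1}{n}\sum_{t=1}^{n} \left(x_t - \bar{x}\right)^2, \\
  g_1 &= \frac{1}{n}\sum_{t=1}^{n} \left(\frac{x_t - \bar{x}}{s}\right)^{3}, &
  g_2 &= \frac{1}{n}\sum_{t=1}^{n} \left(\frac{x_t - \bar{x}}{s}\right)^{4}.
\end{align}
```

With a single variable the covariance matrix is the $1 \times 1$ matrix $[s^2]$, which is why both its norm and its trace reduce to the sample variance.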
- Trend: multiplicative
- Seasonal: multiplicative
- Seasonal period: 365 (to capture daily seasonal cycles)
6.1.3. Forecasting Results
- A. Dynamic Window Computation
To comply with the system-imposed memory constraint on the number of time steps, the AHFRS framework constructs a training dataset by combining two complementary segments:
- Recent segment: the most recent 14,288 observations, representing (1 – 24.62%) of the memory budget, and
- High-fluctuation segments: a set of non-contiguous high-fluctuation historical segments totaling 4,668 observations, accounting for the remaining 24.62%.
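The split between the two segments can be checked with simple arithmetic. The sketch below assumes that the two segments together exactly exhaust the memory budget, so the total it computes is an implied value rather than a figure quoted from the text; the variable names are hypothetical.

```python
# Segment sizes as reported for the univariate Bitcoin experiment.
recent_len = 14_288
historical_len = 4_668

# Assumption: the two segments together use the whole memory budget.
implied_budget = recent_len + historical_len

historical_share = 100 * historical_len / implied_budget
recent_share = 100 * recent_len / implied_budget

print(f"implied budget: {implied_budget} time steps")
print(f"historical share: {historical_share:.3f}%")
print(f"recent share: {recent_share:.3f}%")
```

Under that assumption the historical segments account for roughly a quarter of the training data, in line with the reported 24.62% up to rounding.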
These high-fluctuation segments are identified using the likelihood ratio–based recursive segmentation method developed in our earlier work. This approach partitions the time series into statistically homogeneous intervals by computing likelihood ratios between adjacent windows and selecting breakpoints where a significant statistical shift is detected. The segments are then ranked by their fluctuation intensity, and the top-K segments are chosen based on their relative contribution to the total variability, all while respecting the global memory constraint.
Figure 5 illustrates this segmentation layout for the univariate Bitcoin dataset. At the end of the series, the contiguous recent segment provides short-term contextual information. Interleaved across the earlier timeline are the selected high-variability segments, which capture historically significant behavioral shifts. The combination ensures that the training dataset contains both up-to-date signals and long-range fluctuation patterns that might otherwise be excluded under recency-based schemes.
This segmentation strategy distinguishes AHFRS from conventional sliding window methods. Instead of discarding older data outright, AHFRS selectively incorporates historically significant segments based on structural changes in the series. This dynamic windowing capability allows the framework to construct a statistically optimized and computationally feasible training dataset that retains both short-term trends and long-term variability patterns critical to accurate forecasting.
- B. Forecasting Generation and Evaluation Metrics
The forecasting process involves training the Holt-Winters exponential smoothing model (configured as detailed in Section 6.1.2) on the respective training datasets: either the ‘No Segmentation’ baseline (the most recent observations) or the ‘AHFRS-enhanced’ dataset. Once trained, the model generates multi-step-ahead predictions for the defined forecast horizon, which is the period 2021-03-01 to 2021-03-30 for the Bitcoin dataset. These predicted values, alongside the actual test data, are visually presented in Figure 6.
Figure 6 illustrates the actual Bitcoin Weighted Price (TEST Data), the predictions from the ‘No Segmentation’ baseline, and the predictions from the AHFRS-enhanced approach for the evaluation period. It highlights how the AHFRS framework leads to predictions that more closely track the actual price movements compared to the baseline. To quantitatively assess and compare the accuracy of these generated forecasts, we utilize three well-established metrics:
- Root Mean Squared Error (RMSE): $$\text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$
- Mean Absolute Error (MAE): $$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|$$
- Mean Absolute Percentage Error (MAPE): $$\text{MAPE} = \frac{100\%}{m}\sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
where m is the number of predictions, $y_i$ is the actual value for the i-th prediction, and $\hat{y}_i$ is the predicted value for the i-th prediction.
- C. Performance Summary
The univariate forecasting results provide strong empirical evidence of the AHFRS framework’s effectiveness in memory-constrained environments. By combining recent data with historically significant, high-fluctuation segments, the training dataset constructed by AHFRS achieves superior statistical diversity and predictive quality compared to a purely recent-window baseline.
As shown in Table 2, the AHFRS-enhanced configuration yields consistent improvements across all three evaluation metrics:
- RMSE is reduced from 5743.12 to 4517.31, reflecting a 21.34% improvement,
- MAE drops from 4916.20 to 3553.85, a 27.71% improvement,
- MAPE decreases from 8.69% to 6.33%, representing a 27.15% relative reduction.
These gains highlight the forecasting benefits of retaining select, high-fluctuation historical segments, rather than relying solely on recency, especially in volatile time series contexts like cryptocurrency pricing. The forecasting model used in this univariate scenario is the Holt-Winters exponential smoothing method, configured with multiplicative trend and seasonality. Notably, despite the model’s relative simplicity, the predictive performance is significantly enhanced when paired with AHFRS, highlighting the importance of adaptive, informative input segmentation over model complexity alone.
In summary, AHFRS significantly boosts univariate prediction accuracy under strict computational constraints. These findings affirm the framework’s relevance and lay the empirical foundation for its subsequent application to multivariate forecasting.
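The error metrics and the relative improvements quoted in the performance summary can be recomputed with a few lines of standard-library Python. This is a minimal sketch: the helper names are hypothetical, the metric definitions are the conventional ones, and the inputs are the rounded values reported in Table 2.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error over paired sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error in percent; assumes no actual value is zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def relative_improvement(baseline, proposal):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100 * (baseline - proposal) / baseline

# (baseline, proposal) pairs as reported in Table 2.
table2 = {
    "RMSE": (5743.12, 4517.31),
    "MAE": (4916.20, 3553.85),
    "MAPE": (8.69, 6.33),
}
for name, (baseline, proposal) in table2.items():
    print(f"{name}: {relative_improvement(baseline, proposal):.2f}% lower than baseline")
```

Because the table entries are themselves rounded, a recomputed percentage may differ from the reported one in the last digit.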
6.2. Multivariate Forecasting
6.2.1. System Constraint and Forecasting Objective
6.2.2. Domain-Specific Synthetic Datasets Definition
- Finance Dataset: Features include age, income, credit_score, loan_amount, loan_duration_months, interest_rate, default_risk_index. It simulates financial volatility, abrupt credit score shifts, and latent risk cycles.
- Retail Dataset: Features include age, spending_score, number_of_purchases, average_purchase_value, churn_likelihood. It embeds patterns of consumer engagement, reactive purchase bursts, and promotional seasonality.
- Healthcare Dataset: Features include age, bmi, blood_pressure, cholesterol_level, exercise_hours_per_week, disease_risk_score. It models longitudinal health trajectories with gradual physiological changes and sporadic clinical risk spikes.
6.2.3. Experimental and Environment Setup
- A. Comparative Method: Latest- Baseline
The Latest- strategy represents a common industry practice where the model is trained on the most recent observations. This method assumes that recent data contains the most relevant patterns, but it discards older segments that may contain valuable structural information.
- B. Evaluation Metrics and Model Selection Rationale
Forecasting performance is evaluated using RMSE, MAE, and MAPE. These metrics are first computed for each individual customer i and then averaged across all customers to derive consolidated performance measures (Mean_RMSE, Mean_MAE, and Mean_MAPE):
$$\text{Mean\_RMSE} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\frac{1}{m_i}\sum_{j=1}^{m_i}\left(y_{i,j} - \hat{y}_{i,j}\right)^2}$$
$$\text{Mean\_MAE} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{m_i}\sum_{j=1}^{m_i}\left|y_{i,j} - \hat{y}_{i,j}\right|$$
$$\text{Mean\_MAPE} = \frac{1}{N}\sum_{i=1}^{N}\frac{100\%}{m_i}\sum_{j=1}^{m_i}\left|\frac{y_{i,j} - \hat{y}_{i,j}}{y_{i,j}}\right|$$
where $m_i$ is the number of predictions for customer i, $y_{i,j}$ is the actual value, $\hat{y}_{i,j}$ is the predicted value, and N is the total number of customers.
We conducted experiments using two tree-based ensemble learning methods: Random Forest Regressor (RF) and Gradient Boosting Regressor (GB). These non-parametric models are robust to non-linearity and heterogeneity, common in real-world data [57,58]. Their effectiveness is well-documented in finance [59,60], healthcare [61,62], and retail [63,64]. They are also computationally tractable and compatible with distributed frameworks like Spark and Hadoop [57,65], making them suitable for budget-constrained pipelines.
6.2.4. Results and Discussion
- A. Industry-Specific Dynamic Window Computation
Figure 7 illustrates the average proportion of the dynamic window selected by AHFRS across the three industries. The variation (Finance: 18.06%, Retail: 25.07%, Healthcare: 24.1%) highlights AHFRS’s dynamic adjustment based on statistical variability. This aligns with the data bigness model, where statistical complexity interacts with resource constraints. In industries like Finance, with abrupt shifts, the model selects concise, fluctuation-rich segments. Conversely, domains like Retail and Healthcare, with more gradual shifts, warrant longer segments.
- B. Comparative Forecasting Performance
Table 3 presents the comparative forecasting performance of RF and GB models under baseline and AHFRS-enhanced regimes. Key observations:
- Substantial Accuracy Gains with AHFRS: Across all industries and metrics, models trained using AHFRS-selected windows consistently outperform their baseline counterparts. For example, in the Finance industry, the RF model’s Mean_RMSE decreases from 0.72 (baseline) to 0.27 with AHFRS, a relative reduction of about 62.5%.
- Retail Domain Sensitivity: Despite already low error values in the retail baseline, AHFRS delivers notable improvements, emphasizing its efficacy even in domains with high-frequency and potentially noisy data.
- Robustness in Healthcare: The Healthcare dataset benefits markedly from AHFRS segmentation. RF’s Mean_MAPE improves from 15.83% to 5.96%, enhancing reliability in critical health forecasting.
- Model-Agnostic Benefits: Both RF and GB models benefit from AHFRS, suggesting the strategy enhances predictive capacity through upstream data selection, independent of the downstream model architecture.
6.2.5. Summary of Evaluation
- Dynamic Adaptability: Selection of optimal historical windows varies by industry, highlighting that effective forecasting under constrained resources requires context-sensitive segmentation.
- Consistent Predictive Improvements: Across all industries, AHFRS-enhanced training sets yield lower forecasting errors.
- Model-Independent Gains: The segmentation benefits are robust across both RF and GB models, affirming the general applicability of AHFRS.
7. Conclusion
- Dynamic adaptability in historical window sizing, guided by statistical variability, significantly improves forecast accuracy over fixed-length or naively adaptive baselines.
- AHFRS consistently reduces forecasting error across all three industries, with reductions in Mean_RMSE reaching up to 62.5% in finance and Mean_MAPE improvements of nearly 10 percentage points in healthcare (from 15.83% to 5.96% for RF).
- The framework’s effectiveness is model-agnostic, delivering performance gains across both Random Forest and Gradient Boosting regressors without relying on architecture-specific optimizations.
- Most importantly, AHFRS adheres to the imposed processing budget, thus validating its scalability and practicality in real-world resource-constrained analytical environments.
References
- De Mauro, A.; Greco, M.; Grimaldi, M. A Formal Definition of Big Data Based on its Essential Features. Libr. Rev. 2016, 65, 122–135. [CrossRef]
- Ajah, I.A.; Nweke, H.F. Big Data and Business Analytics: Trends, Platforms, Success Factors and Applications. Big Data Cogn. Comput. 2019, 3, 32. [CrossRef]
- Lee, I. Big Data: Dimensions, evolution, impacts, and challenges. Bus. Horiz. 2017, 60, 293–303. [CrossRef]
- Laney, D. 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Group, Inc., 2001.
- Laney, D. 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Group Research Note, Feb. 6, 2001.
- Wang, X.; Liu, J.; Zhu, Y.; Li, J.; He, X. Mean Vector and Covariance Matrix Estimation for Big Data. IEEE Trans. Big Data 2017, 3, 75–86. [CrossRef]
- Mardia, K.V. Measures of Multivariate Skewness and Kurtosis with Applications. Biometrika 1970, 57, 519–530. [CrossRef]
- Fomo, D.; Sato, A. High Fluctuation Based Recursive Segmentation for Big Data. In Proceedings of the 2024 9th International Conference on Big Data Analytics (ICBDA), Tokyo, Japan, 8–10 March 2024; pp. 358–363.
- De Mauro, A.; Greco, M.; Grimaldi, M. What is Big Data? A Consensual Definition and a Review of Key Research Topics. In Proceedings of the 4th International Conference on Integrated Information, Madrid, Spain, 1–4 September 2014.
- Kitchin, R.; McArdle, G. What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data Soc. 2016, 3, 2053951716631120. [CrossRef]
- Gandomi, A.; Haider, M. Beyond the hype: Big Data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144. [CrossRef]
- Schüssler-Fiorenza Rose, S.M.; Contrepois, K.; Moneghetti, K.J.; et al. A Longitudinal Big Data Approach for Precision Health. Nat. Med. 2019, 25, 792–804. [CrossRef]
- Seyedan, M.; Mafakheri, F. Predictive Big Data analytics for supply chain demand forecasting: Methods, applications, and research opportunities. J. Big Data 2020, 7, 53. [CrossRef]
- Torrence, C.; Compo, G.P. A Practical Guide to Wavelet Analysis. Bull. Am. Meteorol. Soc. 1998, 79, 61–78.
- Bollerslev, T. Generalized Autoregressive Conditional Heteroskedasticity. J. Econom. 1986, 31, 307–327. [CrossRef]
- Bhandari, A.; Rahman, S. Big Data in Financial Markets: Algorithms, Analytics, and Applications; Springer Nature: Cham, Switzerland, 2021.
- Bhosale, H.S.; Gadekar, D.P. A Review Paper on Big Data and Hadoop. Int. J. Sci. Res. Publ. 2014, 4, 1–8.
- Fomo, D.; Sato, A.-H. Enhancing Big Data Analysis: A Recursive Window Segmentation Strategy for Multivariate Longitudinal Data. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 July 2024; pp. 870–879.
- Jolliffe, I.T. Principal Component Analysis, 2nd ed.; Springer: New York, NY, USA, 2002.
- Brys, G.; Hubert, M.; Struyf, A. A robust measure of skewness. J. Comput. Graph. Stat. 2004, 13, 996–1017. [CrossRef]
- Kim, T.H.; White, H. On more robust estimation of skewness and kurtosis: Simulation and application to the S&P 500 index. Finance Res. Lett. 2004, 1, 56–73. [CrossRef]
- Markowitz, H. Portfolio selection. J. Finance 1952, 7, 77–91. [CrossRef]
- Hashem, I.A.T.; Yaqoob, I.; Anuar, N.B.; Mokhtar, S.; Gani, A.; Khan, S.U. The rise of Big Data on cloud computing: Review and open research issues. Inf. Syst. 2015, 47, 98–115. [CrossRef]
- Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms, 3rd ed.; MIT Press: Cambridge, MA, USA, 2009.
- Garey, M.R.; Johnson, D.S. Computers and Intractability: A Guide to the Theory of NP-Completeness; W. H. Freeman: San Francisco, CA, USA, 1979.
- Sipser, M. Introduction to the Theory of Computation, 3rd ed.; Cengage Learning: Boston, MA, USA, 2012.
- Bienstock, D. Computational complexity of analyzing credit risk. J. Bank. Finance 1996, 20, 1233–1249. [CrossRef]
- Hirsa, A. Computational Methods in Finance, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2016.
- Sabbirul, H. Retail Demand Forecasting: A Comparative Study for Multivariate Time Series. arXiv 2023, arXiv:2308.11939.
- Hillier, F.S.; Lieberman, G.J. Introduction to Operations Research, 10th ed.; McGraw-Hill: New York, NY, USA, 2014.
- Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology; Cambridge Univ. Press: Cambridge, U.K., 1997.
- Vazirani, V.V. Approximation Algorithms; Springer: New York, NY, USA, 2001.
- López-Ruiz, R.; Mancini, H.L.; Calbet, X. A statistical measure of complexity. Phys. Lett. A 1995, 209, 321–326.
- Feldman, D.P.; Crutchfield, J.P. Measures of statistical complexity: Why? Phys. Lett. A 1998, 238, 244–252.
- Crutchfield, J.P.; Young, K. Inferring statistical complexity. Phys. Rev. Lett. 1989, 63, 105–108.
- Kolmogorov, A.N. Three approaches to the quantitative definition of information. Probl. Inf. Transm. 1965, 1, 1–7. [CrossRef]
- Lempel, A.; Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theory 1976, 22, 75–81.
- Foster, D.J.; Kakade, S.M.; Qian, R.; Rakhlin, A. The Statistical Complexity of Interactive Decision Making. J. Mach. Learn. Res. 2023, 24, 1–78.
- Tononi, G.; Sporns, O.; Edelman, G.M. A measure for brain complexity: relating functional segregation and integration in the nervous system. PNAS 1994, 91, 5033–5037. [CrossRef]
- Tableau. Big Data Analytics: What It Is, How It Works, Benefits, And Challenges. Available online: https://www.tableau.com/learn/articles/big-data-analytics.
- Simplilearn. Challenges of Big Data: Basic Concepts, Case Study, and More. Available online: https://www.simplilearn.com/challenges-of-big-data-article.
- GeeksforGeeks. Big Challenges with Big Data. Available online: https://www.geeksforgeeks.org/big-challenges-with-big-data/.
- Al-Turjman, F.; Hasan, M.Z.; Al-Oqaily, M. Exploring the Intersection of Machine Learning and Big Data: A Survey. Sensors 2024, 7, 13.
- ADA Asia. Big Data Analytics: Challenges and Opportunities. Available online: https://ada-asia.com/big-data-analytics-challenges-and-opportunities/.
- Datamation. Top 7 Challenges of Big Data and Solutions. Available online: https://www.datamation.com/big-data/big-data-challenges/.
- Yusuf, I.; Adams, C.; Abdullah, N.A. Current Challenges of Big Data Quality Management in Big Data Governance: A Literature Review. In Proceedings of the Future Technologies Conference (FTC) 2024; Springer: Cham, Switzerland, 2024; Vol. 2.
- Kumar, A.; Singh, S.; Singh, P. Big Data Analytics: Challenges, Tools. Int. J. Innov. Res. Comput. Sci. Technol. 2015, 3, 1–5.
- Rathore, M.M.; Paul, A.; Ahmad, A.; Chen, B.; Huang, B.; Ji, W. A Survey on Big Data Analytics: Challenges, Open Research Issues and Tools. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 421–437.
- Cattell, R. Operational NoSQL Systems: What’s New and What’s Next? Computer 2016, 49, 23–30. [CrossRef]
- 3Pillar Global. Current Issues and Challenges in Big Data Analytics. Available online: https://www.3pillarglobal.com/insights/current-issues-and-challenges-in-big-data-analytics/.
- Sharma, S.; Gupta, R.; Dwivedi, A. A Challenging Tool for Research Questions in Big Data Analytics. Int. J. Res. Publ. Seminar 2022, 3, 1–7.
- Bifet, A.; Gavaldà, R. Learning from Time-Changing Data with Adaptive Windowing. In Proceedings of the SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007.
- Bai, J.; Ng, S. Tests for Skewness, Kurtosis, and Normality for Time Series Data. J. Bus. Econ. Stat. 2005, 23, 49–60. [CrossRef]
- Sato, A. Segmentation analysis on a multivariate time series of the foreign exchange rates. Physica A 2012, 388, 1972–1980.
- JMP Statistical Discovery LLC. Statistical Details for Change Point Detection. Available online: https://www.jmp.com/support/help/en/17.2/index.shtml#page/jmp/change-point-detection.shtml.
- Aminikhanghahi, M.; Cook, D.J. A Survey of Methods for Time Series Change Point Detection. Knowl. Inf. Syst. 2017, 51, 339–367. [CrossRef]
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Sirignano, J.; Cont, R. Universal features of price formation in financial markets. Quant. Finance 2019, 19, 1449–1459.
- Heaton, J.B.; Polson, N.G.; Witte, J.H. Deep learning in finance. Appl. Stoch. Models Bus. Ind. 2017, 33, 3–12.
- Alaa, A.; van der Schaar, M. Forecasting individualized disease trajectories. Nat. Commun. 2018, 9, 276.
- Rajkomar, A.; Oren, E.; Chen, K.; et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 2018, 1, 18. [CrossRef]
- Zheng, Y.; Liu, Q.; Chen, E.; Ge, Y.; Zhao, J.L. Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks. In Proceedings of the International Conference on Web-Age Information Management, Macau, China, 16–18 June 2014. [CrossRef]
- Chu, W.; Park, S. Personalized recommendation on dynamic content. In Proceedings of the 18th International Conference on World Wide Web, Madrid, Spain, 20–24 April 2009; pp. 691–700.
- Zaharia, M.; Xin, R.S.; Wendell, P.; et al. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 2016, 59, 56–65.







| Measure | Purpose | Domain | Typical Use Case |
|---|---|---|---|
| López-Ruiz-Mancini-Calbet (CLMC) [33] | Balance between entropy and disequilibrium | Statistical physics | Identify intermediate complexity states |
| Excess Entropy [34] | Quantify mutual information across time | Dynamical systems | Memory and structure estimation |
| Statistical Complexity ($C_\mu$) [35] | Minimal memory for optimal prediction | Computational mechanics | Structural modeling |
| Kolmogorov Complexity (KC) [36,37] | Shortest description length / compressibility | Universal modeling | Algorithmic regularity, anomaly detection |
| Decision-Estimation Coefficient (DEC) [38] | Sample complexity of decision tasks | Interactive learning | Regret-bound estimation |
| Neural/Integrated Complexity ($C_N$) [39] | Causal integration and differentiation | Neuroscience, systems biology | Consciousness quantification |
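
Table 2. Univariate forecasting performance: baseline versus AHFRS-enhanced (Proposal).

| Dataset | Model | Scenario | RMSE | MAE | MAPE (%) |
|---|---|---|---|---|---|
| Hourly Bitcoin Weighted Price (USD) | Holt-Winters | Baseline | 5743.12 | 4916.20 | 8.69 |
| Hourly Bitcoin Weighted Price (USD) | Holt-Winters | Proposal | 4517.31 | 3553.85 | 6.33 |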

Table 3. Comparative forecasting performance of RF and GB models under baseline and AHFRS-enhanced (Proposal) regimes.

| Industry | Model | Scenario | Mean_RMSE | Mean_MAE | Mean_MAPE (%) |
|---|---|---|---|---|---|
| Finance | RF | Baseline | 0.72 | 0.58 | 21.79 |
| Finance | RF | Proposal | 0.27 | 0.21 | 8.05 |
| Finance | GB | Baseline | 0.70 | 0.56 | 21.20 |
| Finance | GB | Proposal | 0.55 | 0.44 | 16.80 |
| Retail | RF | Baseline | 0.03 | 0.03 | 18.34 |
| Retail | RF | Proposal | 0.01 | 0.01 | 6.84 |
| Retail | GB | Baseline | 0.03 | 0.03 | 17.94 |
| Retail | GB | Proposal | 0.02 | 0.02 | 14.47 |
| Healthcare | RF | Baseline | 4.48 | 3.59 | 15.83 |
| Healthcare | RF | Proposal | 1.70 | 1.35 | 5.96 |
| Healthcare | GB | Baseline | 4.25 | 3.42 | 15.16 |
| Healthcare | GB | Proposal | 3.41 | 2.73 | 12.05 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).