Preprint
Article

This version is not peer-reviewed.

Optimizing Multi-time Series Forecasting for Enhanced Cloud Resource Utilization Based on Machine Learning

A peer-reviewed article of this preprint also exists.

Submitted:

08 September 2024

Posted:

09 September 2024

You are already at the latest version

Abstract
Due to its flexibility, cloud computing has become essential in modern operational schemes. However, the effective management of cloud resources to ensure cost-effectiveness and maintain high performance presents significant challenges. The pay-as-you-go pricing model, while convenient, can lead to escalated expenses and hinder long-term planning. Consequently, FinOps advocates proactive management strategies, with resource usage prediction emerging as a crucial optimization category. In this research, we introduce the multi-time series forecasting system (MSFS), a novel approach for data-driven resource optimization alongside the hybrid ensemble anomaly detection algorithm (HEADA). Our method prioritizes the concept-centric approach, focusing on factors such as prediction uncertainty, interpretability and domain-specific measures. Furthermore, we introduce the similarity-based time-series grouping (STG) method as a core component of MSFS for optimizing multi-time series forecasting, ensuring its scalability with the rapid growth of the cloud environment. The experiments performed demonstrate that our group-specific forecasting model (GSFM) approach enabled MSFS to achieve a significant cost reduction of up to 44%.
Keywords: 
;  ;  ;  ;  

1. Introduction

The increasing importance of cloud computing under- scores its undeniable flexibility, which renders it indispensable in modern organizations. However, the operational model inherent in cloud environments carries significant risks that can impact both cost-effectiveness and energy utilization. While the pay-as-you-go pricing scheme provides a convenient interface for accessing cloud resources, expenses can escalate uncontrollably [40].
Ensuring top-notch QoS (Quality of Service) and steering clear of SLA (Service Level Agreement) violations often involves provisioning more resources than necessary, leaving a considerable margin of unused cloud capacity. While overprovisioning leads to resource wastage and higher costs, under provisioning risks service downtime. Thus, FinOps (financial operations) emphasize finding a middle ground to encourage proactive resource management strategies [49].
Among the leading FinOps principles are continuous monitoring, embracing serverless architectures, utilizing preemptive computing instances, and employing resource reservations [49]. Since on-demand provisioning carries the risk of resource shortage, organizations can assess their workload patterns in a data-driven manner without compromising on performance and scalability. Moreover, long-term reservation plans can serve as the foundation for committed use discounts, further enhancing the substantial value generated by cloud environments [37].
As cloud computing continues to grow in scale and complexity, optimizing resource utilization is crucial for achieving environmental sustainability. Effective optimization can result in significant cost savings by reducing unnecessary computational expenses for both users and cloud service providers. Moreover, improving resource management aligns with the principles of GCC (green cloud computing) and sustainable computing by lowering energy consumption and reducing the carbon footprint of data centers, thereby facilitating economically viable cloud computing practices [44]. Given the rapid evolution of cloud environments, it is also essential that optimization solutions are scalable and do not introduce excessive operational overhead.
Therefore, resource usage prediction emerges as a significant optimization category. However, workload patterns in cloud environments are influenced by various factors, and as a result time series data represent historical resource usage characterized by complex, multi-seasonal dependencies and considerable variability [27].
Consequently, in this research, we introduce a novel system for data-driven cloud resource usage optimization called the multi-time series forecasting system (MSFS). While many studies tend to prioritize a model-centric approach [41], we propose a concept-centric attitude, emphasizing factors such as prediction uncertainty, interpretability and efficient data utilization in a knowledge-based manner. Recognizing the importance of scalability, especially given the fast-paced evolution of cloud environments, we focus on optimizing multi-time series forecasting through the introduction of similarity-based time-series grouping (STG) method and another related concept of group-specific forecasting model (GSFM).
Therefore, the major contributions of this paper can be summarized as follows:
  • A novel cost-effective multi-time series forecasting system (MSFS) for dynamic cloud resource reservation planning.
  • A context-aware multi-time series optimization method called similarity-based time-series grouping (STG) for scalable, automated, and feature-based time series segmentation.
  • A novel hybrid ensemble anomaly detection algorithm (HEADA).
  • A multifaceted evaluation of multi-time series forecasting (MSFS) system using a real-life dataset from a production cloud environment.
  • FinOps-driven qualitative and quantitative assessment of dynamic resource reservation plans.
  • The experiments conducted demonstrated that the group-specific forecasting model (GSFM) approach presented in this study outperformed both the global forecasting model (GFM) and local forecasting model (LFM) concepts, resulting in an average cost reduction of up to 44.71% in cloud environments.
The rest of this paper is structured as follows: Section 2 contains a description of related work, Section 3 focuses on multi-time series forecasting system (MSFS) architecture, Section 4 concentrates on the experiments performed, and Section 5 presents conclusions and further work.

3. Multi-Time Series Forecasting System

In this section, we introduce a novel concept called the multi-time series forecasting system (MSFS) for FinOps- aware resource optimization (Section 3.1). Additionally, we present a formal notation to further explain key design decisions and showcase the challenges of time series forecasting in cloud computing that are targeted by MSFS (Section 3.2).

3.1. System Overview

The primary objective of enhancing cloud resource usage is to strike a balance between reducing costs, minimizing resource wastage and improving long-term operational decisions. Consequently, MSFS utilizes time series forecasting to enable demand-based resource scaling and reservations. Given that data from cloud environments encompass multiple multivariate time series of varying lengths and dimensions, our focus is on optimizing the time series forecasting itself to ensure overall MSFS scalability.
In general, MSFS involves adopting a hybrid approach between the global forecasting model (GFM) and local forecasting model (LFM) concepts. While the GFM proposes a shared forecasting model for all virtual machines, the LFM approach necessitates a separate entity for each virtual machine. Therefore, the hybrid solution – group-specific forecasting model (GSFM) – seeks to find a compromise between these extremes, offering scalability and improved accuracy in dynamic resource reservation. GSFM is thus a concept involving separate model forecasting for each group of similar time series, which is identified through similarity- based time-series grouping (STG) – a multi-time series optimization method. Consequently, MSFS is a self-adapting system that enables data-driven decisions, influencing the state of the cloud environment by creating dynamic resource reservation plans and delegating scaling recommendations to the resource manager.
Therefore, Figure 1 illustrates the concept of integrating MSFS within a cloud environment. The system relies on diverse metrics related to virtual machine resource usage, collected from cloud monitoring. This monitoring system is a core built-in component of cloud environments, which is responsible for aggregating data on the operational status of various infrastructure resources. Given the potential for numerous virtual machines within a cloud infrastructure (as illustrated in Figure 1, where there are M virtual machines), each with its own set of metrics (Figure 1 also shows the scenario where each of the M machines has K registered metrics), the system processes a dataset consisting of multiple time series. Within this framework, the MSFS system functions as a data consumer, utilizing historical data to forecast future resource usage. Simultaneously, MSFS acts as a data producer, generating resource reservation plans – essentially resource allocation recommendations for each virtual machine – based on the forecasts made. Subsequently, reservation plans serve as the basis for scaling decisions, which are delegated to a resource manager, thereby directly influencing the cloud infrastructure, for instance recommending adjustments to the computational capacity of individual virtual machines.
Considering that as a solution that enhances resource utilization, MSFS should not incur higher costs or lead to increased management overhead, system architecture (Figure 2) has been designed as a set of eight stages, each with its individual execution schedule. Consequently, each stage can be executed as a serverless pipeline, and the execution of individual stages can be supervised by any workflow scheduler. In Figure 2, each stage contains a single step (highlighted block) indicating the starting point for on- schedule execution. If off-schedule execution occurs, the interactions depicted with dashed lines are followed.
The initial stage of MSFS – Data Processing – interfaces with cloud monitoring to retrieve resource usage metrics. This stage encompasses two modules: the Ingestion Module and the Anomaly Module. The former is responsible for applying lightweight data transformations, primarily to address erroneous readings and missing observations, while also performing streamlined data aggregations to unify dimensionality across time series. Moreover, the Anomaly Module is used for the detection and handling of outlying samples and anomalous sequences, leveraging the novel hybrid ensemble anomaly detection algorithm (HEADA), which is discussed in Section 4.1. Given that the Anomaly Module accesses historical data already devoid of erroneous readings, its primary objective is to detect time points where virtual machines encountered anomalous operation conditions, thereby ensuring the stability of classifications. Consequently, anomaly detection is integrated into the MSFS system through a dedicated Anomaly Module. In our proposed operational model, as outlined in the MSFS architecture, the HEADA algorithm plays a central role in the anomaly detection process, serving as an integral component of the system.
Throughout various stages, the steps of data loading and extraction into either TSDB (time series database) or VDB (vector database) occur. The use of MSFS in production environments is particularly recommended in order to ensure persistence, idempotency and facilitate backward filling. Additionally, it is worth noting that updating the computing capabilities of virtual machines often requires turning them off. To counteract this, downtime prevention strategies are possible, such as provisioning a new machine in advance. In such cases, it is necessary to associate MSFS-specific identifiers with cloud-specific ones to unambiguously distinguish virtual machines. Moreover, we introduce a concept of ID-SET as a collection of unique TSIDs (time series identifiers) for which the execution of specific stages takes place. This abstract collection is exchanged between stages during off-schedule execution and by default includes all the series identifiers in on-schedule runs. For instance, in the event that new time series are identified during the final step of the Anomaly Module (series meeting the minimum length criterion – determining the necessary data threshold for series inclusion in MSFS), these can be seamlessly added to the ID-SET collection, prompting the immediate initiation of the subsequent stage.
The second stage of MSFS – Embedding – is responsible for training the embedding model (e.g., autoencoder) within the Training Module, which is incrementally fine-tuned in subsequent stage executions. Such model is expected to efficiently capture nonlinearity and temporal dynamics in order to generate semantically accurate latent representations of sequences intended for feature-based similarity determination. Consequently, the design decision to implement a global model involves incorporating data from all available time series (post-AD – after anomaly detection). This design decision prioritizes overall effectiveness in sequence reconstruction over exceptional performance in corner cases. Furthermore, the Embedding Module is responsible for creating vectorized representations of sequences belonging to individual time series considered at this stage.
Leveraging the embedding model to generate semantically accurate predictors reduces the need for handcrafting features. As the global embedding model undergoes periodic fine-tuning, the maintenance of feature importance ranking becomes obsolete.
In the subsequent stage of MSFS – Grouping – embedded representations of sequences are subjected to K- means clustering. With semantically accurate embeddings, this algorithm clusters together time series exhibiting similar resource usage patterns – this forms the foundation of the STG method. Additionally, if the dataset contains metrics that deviate significantly from the norm (for instance representing a virtual machine that has been accidentally left without decommissioning and thus with near-zero resource usage), STG can isolate such series, effectively acting as a filter. This prevents such series from adversely affecting other series, thereby enhancing further modeling outcomes and facilitating the identification of unused resources. Section 4.2 covers the selection methodology for the embedding model and focuses on K-means clustering, including the evaluation and determination of the appropriate number of clusters with the use of novel domain-specific metrics.
According to execution schedules, the subsequent stage of MSFS is Training, which directly leverages the associations of time series with groups. Following the introduced optimization strategy – STG – at this stage the Training Module aims to prepare group-specific models, thus implementing the GSFM concept. Subsequently, within the Prediction stage – and specifically as part of the Prediction Module – forecasting is performed exclusively for CPU and RAM (pertaining solely to active virtual machines), as these two resources determine the configuration parameters of virtual machines in a cloud environment. Consequently, predictions are made using group-specific models. Section 4.3 focuses on the evaluation of weekly forecasts for multiple virtual machines using GSFM, compared with GFM and LFM- based approaches, alongside a comprehensive assessment against domain-specific metrics.
In the subsequent stage of MSFS – Planning – forecasts of resource consumption in percentage terms are translated into initial resource reservation plans, specifying recommended vCPUs (virtual CPUs) and RAM over time for each active virtual machine included in MSFS. Recognizing the inherent uncertainty in forecasts and the potential impact of inaccurate predictions – leading to either SLA breaches or significant resource wastage – we introduce the concept of administrative rules. These rules form a set of policies guiding the refinement of initial reservation plans into adjusted ones, proactively minimizing the repercussions of incorrect data-driven recommendations. An example of administrative rules might entail adjusting initial vCPUs or RAM values if the recommendations surpass 80% of the machine’s resources, as this could result in performance degradation. Consequently, administrative rules also serve as a means to inject domain knowledge of cloud platform administrators, when time series data is limited. Since the resource usage characteristics of virtual machines can change dynamically, historical data from a single machine may not be sufficiently representative for effective forecasting. Accurate predictions require a robust historical database to leverage past information and detect similar future events with high accuracy. Moreover, LFM overlooks potential interactions between virtual machines that may belong to the same cluster or communicate with one another, resulting in an erroneous assumption that they are isolated and independent. Thus, LFM can be seen as a significant example of inefficient data utilization. On the other hand, GSFM improves upon the GFM approach by grouping similar or related time series instead of using all available data indiscriminately. By grouping time series with similar characteristics, GSFM avoids the oversimplified assumption underlying GFM that resource usage patterns across all machines have a uniform constructive influence. GSFM employs knowledge-based decisions when constructing datasets for training group- specific models, addressing data insufficiency issues that affect LFM. In contrast, GFM aims for broad applicability but may struggle with highly diverse and chaotic usage patterns that a single model may fail to capture effectively. Thus, GSFM’s approach of distributing knowledge across several representative models (serving as experts in specific resource usage characteristics) is more effective than relying on a single model that strives to be adequate for all possible scenarios and patterns.
Table 8. Back-to-back evaluation metrics for resource usage prediction at the virtual machine level in the test set.
Table 8. Back-to-back evaluation metrics for resource usage prediction at the virtual machine level in the test set.
Co ntext Virtual Machine
Evaluation Metric Resource Method VM01 VM02 VM03 VM04 VM05 VM06 VM07 VM08
MAE CPU GFM 0.123 0.209 3.282 1.653 3.863 2.656 1.510 2.787
GSFM 0.025 0.249 2.788 1.674 3.552 2.882 1.479 2.775
LFM 1.019 0.308 2.966 2.379 12.511 4.156 4.503 2.910
RAM GFM 1.057 0.288 0.850 1.173 1.152 1.012 0.841 0.633
GSFM 0.216 0.072 0.724 0.617 1.284 0.994 0.475 0.704
LFM 8.764 0.285 1.024 4.270 2.291 1.270 3.712 1.364
RMSE CPU GFM 0.123 0.370 3.700 1.993 4.335 3.066 1.758 3.405
GSFM 0.031 0.369 3.205 2.046 4.109 3.208 1.773 3.414
LFM 1.022 0.396 3.390 2.698 13.027 4.743 4.785 3.429
RAM GFM 1.061 0.296 0.938 1.213 1.368 1.160 0.884 0.789
GSFM 0.264 0.078 0.846 0.658 1.481 1.178 0.523 0.836
LFM 8.793 0.288 0.179 4.336 2.426 1.333 3.761 1.489
MdAE CPU GFM 0.128 0.085 3.155 1.460 3.819 2.438 1.453 2.290
GSFM 0.022 0.149 2.652 1.391 3.180 2.910 1.314 2.441
LFM 1.045 0.260 2.857 2.420 13.348 4.792 4.752 2.424
RAM GFM 1.102 0.298 0.823 1.217 1.052 0.911 0.869 0.529
GSFM 0.189 0.071 0.670 0.617 1.210 0.859 0.473 0.635
LFM 8.984 0.274 0.980 4.363 2.368 1.288 3.784 1.419
FR CPU GFM 1.000 0.383 0.269 0.363 0.352 0.441 0.351 0.436
GSFM 1.000 0.792 0.397 0.443 0.369 0.315 0.458 0.465
LFM 1.000 0.818 0.402 0.182 0.038 0.150 0.022 0.523
RAM GFM 1.000 0.000 0.327 0.000 0.405 0.351 0.000 0.349
GSFM 1.000 0.000 0.426 0.000 0.324 0.373 0.000 0.281
LFM 1.000 0.000 0.843 0.000 0.133 0.286 0.000 0.077
However, the noted superiority of GSFM does not change the fact that predictions are subject to high uncertainty, and in the face of dynamic changes – resource usage characteristics previously unobserved in historical data – they may become significantly inaccurate. Therefore, proactive and reactive mechanisms are critical to minimize the negative effects of forecast inaccuracies. In MSFS, these mechanisms encompass administrative rules, adjusted reservation plans and various anti-discrepancy strategies. Given that the Prediction stage is just one component of MSFS, all pre-prediction and post-prediction mechanisms play a crucial role. Therefore, optimizing multi-time series forecasting should extend beyond solely improving prediction evaluation metrics to mitigating the risks associated with GFM and LFM, ensuring cost-effective model maintenance and addressing the critical FinOps factor – the scalability of undertaken actions and strategies. These objectives remain consistent with the properties of the GSFM approach introduced in MSFS.

4.4. Dynamic Resource Reservation Planning

Dynamic resource reservation plans are derived from mapping resource usage predictions obtained from group- specific forecasting models – GSFM, providing recommendations for the specific configuration of virtual machines. Therefore, we introduce the concept of a reference virtual machine type – a machine of this type would be statically reserved throughout the entire evaluation period, regardless of resource utilization.
This approach enables the creation, and subsequent assessment, of diverse resource reservation plans. Consequently, the reference machine types selected for the evaluation of the Planning stage of MSFS are general-purpose machines available on the Google Cloud Platform within the e2standard flavor. Therefore, when the reference type is set as c2d-standard-16, it is assumed that the forecasted usage percentage corresponds to the number of vCPUs and gigabytes of RAM associated with that specific machine type – 16 and 64, respectively.
Therefore, Figure 15 illustrates the consolidated initial CPU reservation plans for VM07 (which experienced the most SLA violations), with c2d-standard-16 set as the reference for forecast translation. Without MSFS enabled, this type would be used throughout the entire test set period. However, with MSFS, both dynamic reservations and scaling based on predicted demand are possible, resulting in significant resource savings. Additionally, a critical factor influencing the quality of initial reservation plans is the appropriate choice of the execution schedule for the Prediction stage. In the case of weekly predictions (Figure 15a), forecasts are considerably less accurate and lead to more SLA violations. With dynamic time series, more frequent decisions such as selecting the daily execution schedule enable greater accuracy and an improved balance between resource utilization and service availability (Figure 15b). With multiple forecasts per timestamp, the newer one overrides the older, resulting in possible updates of reservation plans. The mapping methodology presented for translating predictions into resource recommendations remains the same for RAM usage, as determining the recommended reservation type requires knowledge of both CPU and RAM demand.
Figure 15. Initial CPU reservation plan for VM07 with e2-standard-16 as the reference virtual machine type.
Figure 15. Initial CPU reservation plan for VM07 with e2-standard-16 as the reference virtual machine type.
Preprints 117619 g003
As a result, the assessment of MSFS’s Planning stage involved preparing initial reservation plans for each of the eight virtual machines in the dataset, followed by evaluating their characteristics across various selected reference types. Among the metrics applied are both quantitative costeffectiveness and qualitative indicators. Table 9 provides a summary of initial reservation plans based on weekly CPU and RAM predictions made with a daily execution schedule throughout the test set. Qualitative and quantitative metrics are averaged across all virtual machines to draw general conclusions.
Table 9. Averaged qualitative and quantitative indicators for initial reservation plans based on various reference virtual machine types.
Table 9. Averaged qualitative and quantitative indicators for initial reservation plans based on various reference virtual machine types.
Reference type Percentage cost reduction(with MSFS) Daily USD cost (without MSFS) Daily USD cost (with MSFS) Percentage CPU usage(with MSFS) Percentage RAM usage(with MSFS) Scaling events Violation events Percentage availability
e2-standard-2 0.00 2.07 2.07 17.69 22.13 0.00 0.00 100.00
e2-standard-4 39.40 4.14 2.51 35.39 38.17 0.63 0.13 99.81
e2-standard-8 48.24 8.29 4.29 48.88 44.40 1.13 0.63 99.04
e2-standard-16 50.74 16.58 8.17 55.41 56.86 2.25 2.00 96.93
e2-standard-32 57.58 33.15 14.06 60.60 57.52 2.25 2.13 96.75
Enabling MSFS involves transitioning to a less computationally demanding virtual machine type whenever feasible and scaling up as required. As a result, the average daily operating cost of a virtual machine in USD is significantly reduced with MSFS – Daily USD cost (with MSFS). For instance, the daily cost of an e2-standard-16 machine with static reservation – Daily USD cost (without MSFS) – would be USD 16.58 (specific to the europe-central2 location). However, by adjusting such an initial reference type within the e2-standard flavor to forecasts, the average cost can be reduced to approximately USD 8.17.
Qualitative indicators such as average CPU and RAM usage enable the assessment of resources that remain unused despite the application of MSFS. While the evaluation in this study relies solely on predefined machine types, utilizing custom machine types with finer vCPU and RAM level granularity can further optimize resource utilization. This is particularly noticeable in the case of e2-standard-2 selected as the reference type. As it is the lowest-tier variant in the e2-standard flavor hierarchy, further down-scaling is not possible, resulting in Percentage cost reduction (with MSFS) equal to zero. However, excessive resource occupancy may impact service and application performance stability, so aiming for near-perfect waste reduction may not be a desirable direction. Additionally, crucial metrics encompass the frequency of scaling events – Scaling events, denoting the average number of transitions between machine types (upscaling and down-scaling) per single virtual machine during the test set period. Excessive scaling can induce instability and incur additional overhead. Moreover, Violation events denote situations where the actual usage surpassed the forecasted predictions, impacting service availability and SLAs.
Adjusted reservation plans result from incorporating domain knowledge through administrative rules, modifying initial plans to enhance the interpretability of MSFS. Table 10 presents a summary of adjusted reservation plans. The administrative rule used involved recommending a more robust machine type within the e2-standard flavor hierarchy if the forecasted demand exceeded 80% of the initially suggested machine’s computational capacity, in terms of either CPU or RAM.
Such an administrative rule prioritizes enhancing overall cloud environment performance over achieving consistent cost reduction. Consequently, it leads to a significant decrease in the number of violation events and an increase in service availability, ensuring compliance with SLAs despite achieving a less significant increase in cost-effectiveness compared to the initial reservation plans. Thus, administrative rules not only enhance the interpretability of MSFS decisions but also enable the following choice: maximizing cost-effectiveness, maintaining high QoS or a tradeoff between these two. Analyzing both Table 9 and Table 10 reveals that the greater the computational capacity of the virtual machine, the higher the probability of resource wastage, where versatile MSFS can yield particularly favorable results. As shown in Table 9, the highest average percentage cost reduction was achieved for the e2-standard- 32 reference type compared to initial reservation plans – 57.57%. However, we use the results obtained from the adjusted reservation plans as a reference point, since these embody the desired balance between resource utilization and performance. Consequently, with respect to adjusted reservation plans and Table 10, the greatest average percentage cost reduction was achieved with experimental scenarios assuming e2-standard-32 as the reference type – 44.71%. It is worth noting that creating reservation plans enables committed use discounts to be leveraged, which can further aid in cost optimization.
Table 10. Averaged qualitative and quantitative indicators for adjusted reservation plans based on various reference virtual machine types.
Table 10. Averaged qualitative and quantitative indicators for adjusted reservation plans based on various reference virtual machine types.
Reference type Percentage cost reduction(with MSFS) Daily USD cost (without MSFS) Daily USD cost (with MSFS) Percentage CPU usage(with MSFS) Percentage RAM usage(with MSFS) Scaling events Violation events Percentage availability
e2-standard-2 0.00 2.07 2.07 17.69 22.13 0.00 0.00 100.00
e2-standard-4 20.87 4.14 3.28 29.52 36.42 0.75 0.00 100.00
e2-standard-8 34.02 8.29 5.47 35.33 40.23 1.50 0.00 100.00
e2-standard-16 36.98 16.58 10.45 35.06 41.12 4.75 0.25 99.62
e2-standard-32 44.71 33.15 18.33 36.07 39.75 4.75 0.25 99.62

4.5. Summary

The application of a multi-faceted GSFM concept in MSFS aims to construct group-specific forecasting models, thereby striking a compromise between the GFM and LFM concepts in a knowledge-based manner. STG allows for the optimization of multi-time series forecasting, making it scalable in the face of a growing number of virtual machines while enabling GSFM to achieve improved evaluation metrics compared to competing approaches.
In general, the concept-centric approach underscores the meticulous attention given to both pre-prediction and post- prediction phases. Therefore, within MSFS, alongside data- driven STG, various mechanisms are integrated, including ensemble-based anomaly detection using HEADA, dynamic resource reservation, and addressing prediction uncertainties. While administrative rules and adjusted plans serve to mitigate discrepancies, continuous monitoring remains paramount. Thus, MSFS components are FinOps-driven, playing a pivotal role in the domain of applied machine learning within the context of cloud environments.

5. Conclusions

In this research, we introduced STG as an effective method for optimizing multi-time series forecasting, addressing the need for scalable and reliable solutions to enhance resource utilization in rapidly evolving cloud environments. STG further served as the basis for the GSFM concept, representing a tradeoff between the two contrasting approaches – GFM and LFM. The multivariate time series data originating from these are characterized by high complexity, dynamism, and variations in both dimensionality and lengths across individual signals. Nevertheless, MSFS efficiently accommodates these properties. During evaluation, our approach demonstrated both qualitative and quantitative benefits, resulting not only in outperforming the GFM and LFM concepts but also in significant cost savings based on resource reservation plan estimations, with average operational expense reductions of up to 44.71%.
Throughout the study, we consistently advocate incorporating domain-level metrics, as they enhance interpretability in applied machine learning and facilitate multi-criteria decision-making. Additionally, we consider leveraging the FinOps context to be crucial, enabling the challenges associated with cloud computing to be overcome. While cloud computing environments are highly appealing, achieving high stability and performance while optimizing costs and reducing resource wastage inevitably entails a tradeoff. The imperative for such a balance is apparent in both the design of MSFS and the outcomes of its evaluation. Additionally, the need for stability in decisions and verdicts when relying on machine learning has also been demonstrated with respect to HEADA, where the concepts of hierarchy, ensemble approach, voting mechanisms and weighting have been integrated to achieve greater outlier detection certainty. While various cloud environments may have diverse operational purposes, the MSFS concept emerges as cloud-agnostic due to its scalability, cost-awareness, and a range of options enabling it to be tuned to specific use cases. This solution can be utilized in both small and large cloud environments, accommodating transitions between these states, allowing for the injection of domain knowledge, and thus justifying actions undertaken in terms of cloud resource management. Resource usage prediction greatly improves allocation efficiency over traditional methods. Conventional cloud resource management models, often labeled as pessimistic, reserve resources with a substantial buffer above the expected demand to accommodate potential spikes and ensure that SLAs are met. This approach frequently leads to over-provisioning, wasted resources and higher operational costs, and it also lacks a data-driven basis. In contrast, our method leverages machine learning for dynamic predictions, aligning resource reservations more closely with anticipated demand. This approach minimizes waste and enhances cost effectiveness by adjusting allocations based on forecasted needs, providing a more efficient resource management solution. Nevertheless, dynamic resource allocation based on
forecasts introduces risks associated with prediction inaccuracies, which are less pronounced in traditional models. To mitigate these risks, implementing robust monitoring and anti-discrepancy mechanisms is essential. These safeguards help balance resource wastage with effective utilization.
Despite the fact that MSFS addresses many of the key challenges related to cloud resource usage optimization, the introduction of both the Embedding and Grouping stages may introduce additional overhead. Consequently, MSFS requires the embedding model to be up-to-date, necessitating periodic retraining to ensure that time series similarity detection remains reliable. Careful retraining, monitoring, and concept drift detection strategies may be necessary to maintain the high quality and effectiveness of Embedding. Additionally, the Grouping stage requires that forecasting models for groups remain current, ensuring that reservation plans are based on predictions that accurately reflect the dynamic characteristics of resource usage. Therefore, timely re-evaluation of virtual machine assignments to groups is critical. In a production environment, the frequency of these updates would depend on the dynamics of the environment and would need to be empirically determined. On the other hand, overly frequent re-evaluation or determining group membership based on an excessively short period of resource usage could lead to MSFS instability. Furthermore, our solution proves effective even in small and medium- sized cloud environments, particularly when only limited data is available. However, it is important to highlight that where more data is available, forecasts generally become more accurate. With smaller datasets, there is a higher risk of inaccurate predictions, a factor that needs to be clearly acknowledged. Additionally, in our future research, we aim to focus on optimizing resource usage in data-scarce cloud environments. We plan to explore the use of continuous learning, generative-based data augmentation and genetic algorithms while addressing challenges like concept drift and data drift within a FinOps context.

CRediT Authorship Contribution Statement

Mateusz Smendowski: Methodology, Software, Investigation, Writing – original draft, Piotr Nawrocki: Supervision, Conceptualization, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability

The data that support the findings of this study are available from Polcom, but restrictions apply to the availability of these data, which were used under license for the current study, and thus are not publicly available. The data are, however, available from the authors upon reasonable request and with permission of Polcom.

Acknowledgments

The research presented in this paper was supported by funds from the Polish Ministry of Science and Higher Education allocated to the AGH University of Krakow. The authors would like to thank Polcom for providing the data used in the tests.

References

  1. Alqahtani, D., 2023. Leveraging sparse auto-encoding and dynamic learning rate for efficient cloud workloads prediction. IEEE Access.
  2. Audibert, J., Michiardi, P., Guyard, F., Marti, S., Zuluaga, M.A., 2020. Usad: Unsupervised anomaly detection on multivariate time series, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, New York, NY, USA. p. 3395–3404.
  3. Bader, J., Lehmann, F., Thamsen, L., Leser, U., Kao, O. Lotaru: Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures. Future Generation Computer Systems 2024, 150, 171–185. [CrossRef]
  4. Bandara, K. , Bergmeir, C., Hewamalage, H., 2021a. LSTM-MSNet: Leveraging forecasts on sets of related time series with multiple seasonal patterns. IEEE Transactions on Neural Networks and Learning Systems 32, 1586–1599.
  5. Bandara, K., Bergmeir, C., Smyl, S. Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert systems with applications 2020, 140, 112896. [CrossRef]
  6. Bandara, K., Hewamalage, H., Liu, Y.H., Kang, Y., Bergmeir, C., 2021b. Improving the accuracy of global forecasting models using time series data augmentation. Pattern Recognition 120, 108148.
  7. Barbado, A., Óscar Corcho, Benjamins, R. Rule extraction in unsupervised anomaly detection for model explainability: Application to oneclass svm. Expert Systems with Applications 2022, 189, 116100. [CrossRef]
  8. Barut, C., Yildirim, G., Tatar, Y., 2024a. An intelligent and interpretable rule-based metaheuristic approach to task scheduling in cloud systems. Knowledge-Based Systems 284, 111241.
  9. Barut, C., Yildirim, G., Tatar, Y., 2024b. An intelligent and interpretable rule-based metaheuristic approach to task scheduling in cloud systems. Knowledge-Based Systems 284, 111241.
  10. Behera, I., Sobhanayak, S. Task scheduling optimization in heterogeneous cloud computing environments: A hybrid ga-gwo approach. Journal of Parallel and Distributed Computing 2024, 183, 104766. [CrossRef]
  11. Cai, B., Yang, S., Gao, L., Xiang, Y., 2023. Hybrid variational autoencoder for time series forecasting. arXiv preprint arXiv:2303.07048.
  12. Cai, T.T., Ma, R. Theoretical foundations of t-sne for visualizing high-dimensional clustered data. Journal of Machine Learning Research 2022, 23, 1–54.
  13. Chen, W.J., Yao, J.J., Shao, Y.H. Volatility forecasting using deep neural network with time-series feature embedding. Economic research-Ekonomska istraživanja 2023, 36, 1377–1401. [CrossRef]
  14. Daraghmeh, M., Agarwal, A., Manzano, R., Zaman, M., 2021. Time series forecasting using facebook prophet for cloud resource management, in: 2021 IEEE International Conference on Communications Workshops (ICC Workshops), pp. 1–6.
  15. Dogani, J., Khunjush, F., Mahmoudi, M.R., Seydali, M. Multivariate workload and resource prediction in cloud computing using cnn and gru by attention mechanism. The Journal of Supercomputing 2023, 79, 3437–3470. [CrossRef]
  16. Froese, V., Jain, B., Rymar, M., Weller, M. Fast exact dynamic time warping on run-length encoded time series. Algorithmica 2023, 85, 492–508. [CrossRef]
  17. Gabhane, J.P., Pathak, S., Thakare, N.M. A novel hybrid multi-resource load balancing approach using ant colony optimization with tabu search for cloud computing. Innovations in Systems and Software Engineering 2023, 19, 81–90. [CrossRef]
  18. Optimizing multi-time series forecasting for enhanced cloud resource utilization.
  19. Gagolewski, M., Bartoszuk, M., Cena, A. Are cluster validity measures (in) valid? Information Sciences 2021, 581, 620–636. [CrossRef]
  20. Geetha, P., Vivekanandan, S., Yogitha, R., Jeyalakshmi, M. Optimal load balancing in cloud: Introduction to hybrid optimization algorithm. Expert Systems with Applications 2024, 237, 121450. [CrossRef]
  21. Gupta, R., Saxena, D., Gupta, I., Makkar, A., Singh, A.K., 2022a. Quantum machine learning driven malicious user prediction for cloud network communications. IEEE Networking Letters 4, 174–178.
  22. Gupta, R., Saxena, D., Gupta, I., Singh, A.K., 2022b. Differential and triphase adaptive learning-based privacy-preserving model for medical data in cloud environment. IEEE Networking Letters 4, 217– 221.
  23. Hao, Y., Zhao, C., Li, Z., Si, B., Unger, H. A learning and evolution-based intelligence algorithm for multi-objective heterogeneous cloud scheduling optimization. Knowledge-Based Systems 2024, 286, 111366. [CrossRef]
  24. Hewamalage, H., Bergmeir, C., Bandara, K. Global models for time series forecasting: A simulation study. Pattern Recognition 2022, 124, 108441. [CrossRef]
  25. Jastrzebska, A., Nápoles, G., Salgueiro, Y., Vanhoof, K. Evaluating time series similarity using concept-based models. KnowledgeBased Systems 2022, 238, 107811.
  26. Jeong, B., Baek, S., Park, S., Jeon, J., Jeong, Y.S. Stable and efficient resource management using deep neural network on cloud computing. Neurocomputing 2023, 521, 99–112. [CrossRef]
  27. Jiang, J., Liu, F., Ng, W.W., Tang, Q., Zhong, G., Tang, X., Wang, B. Aerf: Adaptive ensemble random fuzzy algorithm for anomaly detection in cloud computing. Computer Communications 2023, 200, 86–94. [CrossRef]
  28. Kim, S., Chung, E., Kang, P. Feat: A general framework for feature-aware multivariate time-series representation learning. Knowledge-Based Systems 2023, 277, 110790. [CrossRef]
  29. Li, H., Zheng, W., Tang, F., Zhu, Y., Huang, J., 2023a. Few-shot time-series anomaly detection with unsupervised domain adaptation. Information Sciences 649, 119610.
  30. Parekh, Ruchit, and Charles Smith. “Innovative AI-driven software for fire safety design: Implementation in vast open structure.” World Journal of Advanced Engineering Technology and Sciences 12.2 (2024): 741-750.
  31. fficient time series augmentation methods, in: 2020 13th International.
  32. Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), pp. 1004–1009.
  33. Liu, P., Liu, J., Wu, K., 2020b. Cnn-fcm: System modeling promotes stability of deep learning in time series prediction. Knowledge-Based Systems 203, 106081.
  34. Liu, Z., Zhang, J., Li, Y. Towards better time series prediction with model-independent, low-dispersion clusters of contextual subsequence embeddings. Knowledge-Based Systems 2022, 235, 107641. [CrossRef]
  35. Mahadevan, A. Mahadevan, A., Mathioudakis, M. Cost-aware retraining for machine learning. Knowledge-Based Systems 2024, 293, 111610. [Google Scholar] [CrossRef]
  36. Nalmpantis, C., Vrakas, D., 2019. Signal2vec: Time series embedding representation, in: International conference on engineering applications of neural networks, Springer. pp. 80–90.
  37. Nawrocki, P., Smendowski, M., 2023. Long-term prediction of cloud resource usage in high-performance computing, in: International Conference on Computational Science, Springer. pp. 532–546.
  38. Nawrocki, P., Smendowski, M., 2024a. Finops-driven optimization of cloud resource usage for high-performance computing using machine learning. Journal of Computational Science , 102292.
  39. Nawrocki, P., Smendowski, M., 2024b. Optimization of the use of cloud computing resources using exploratory data analysis and machine learning. Journal of Artificial Intelligence and Soft Computing Research 14, 287–308.
  40. Nawrocki, P., Sus, W. Anomaly detection in the context of longterm cloud resource usage planning. Knowledge and Information Systems 2022, 64, 2689–2711. [CrossRef]
  41. Osypanka, P., Nawrocki, P. Qos-aware cloud resource prediction for computing services. IEEE Transactions on Services Computing 2023, 16, 1346–1357. [CrossRef]
  42. Parekh, Ruchit, and Charles Smith. “Innovative AI-driven software for fire safety design: Implementation in vast open structure.” World Journal of Advanced Engineering Technology and Sciences 12.2 (2024): 741-750.
  43. D., Gupta, I., Gupta, R., Singh, A.K., Wen, X., 2023a. An ai-driven vm threat prediction model for multi-risks analysis-based cloud cybersecurity. IEEE Transactions on Systems, Man, and Cybernetics: Systems.
  44. Saxena, D., Gupta, R., Singh, A.K., Vasilakos, A.V., 2023b. Emerging vm threat prediction and dynamic workload estimation for secure resource management in industrial clouds. IEEE Transactions on Automation Science and Engineering.
  45. Shu, W., Cai, K., Xiong, N.N. Research on strong agile response task scheduling optimization enhancement with optimal resource usage in green cloud computing. Future Generation Computer Systems 2021, 124, 12–20. [CrossRef]
  46. Si, W., Pan, L., Liu, S. A cost-driven online auto-scaling algorithm for web applications in cloud environments. KnowledgeBased Systems 2022, 244, 108523.
  47. Singh, A.K., Gupta, R. A privacy-preserving model based on differential approach for sensitive data in cloud environment. Multimedia Tools and Applications 2022, 81, 33127–33150. [CrossRef]
  48. Singh, A.K., Swain, S.R., Saxena, D., Lee, C.N., 2023. A bio- inspired virtual machine placement toward sustainable cloud resource management. IEEE Systems Journal.
  49. Song, K., Yu, Y., Zhang, T., Li, X., Lei, Z., He, H., Wang, Y., Gao, S. Short-term load forecasting based on ceemdan and dendritic deep learning. Knowledge-Based Systems 2024, 294, 111729. [CrossRef]
  50. Storment, J., Fuller, M., 2023. Cloud FinOps. O’Reilly Media, Inc.
  51. Tran, M.N., Kim, Y. Optimized resource usage with hybrid auto-scaling system for knative serverless edge computing. Future Generation Computer Systems 2024, 152, 304–316. [CrossRef]
  52. Uribarri, G., Mindlin, G.B. Dynamical time series embeddings in recurrent neural networks. Chaos, Solitons & Fractals 2022, 154, 111612.
  53. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems 30.
  54. Wang, Y., Long, H., Zheng, L., Shang, J., 2023. Graphformer: Adaptive graph correlation transformer for multivariate long sequence time series forecasting. Knowledge-Based Systems , 111321.
  55. Wang, Z., Pei, C., Ma, M., Wang, X., Li, Z., Pei, D., Rajmohan, S., Zhang, D., Lin, Q., Zhang, H., et al., 2024. Revisiting vae for unsupervised time series anomaly detection: A frequency perspective, in: Proceedings of the ACM on Web Conference, pp. 3096–3105.
  56. Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., Sun, L., 2022. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125.
  57. Yadav, H., Thakkar, A. Noa-lstm: An efficient lstm cell archi- tecture for time series forecasting. Expert Systems with Applications 2024, 238, 122333. [CrossRef]
  58. Yi, K., Zhang, Q., Fan, W., Wang, S., Wang, P., He, H., An, N., Lian, D., Cao, L., Niu, Z., 2024. Frequency-domain mlps are more effective learners in time series forecasting. Advances in Neural Information Processing Systems 36.
  59. Yuan, H., Liao, S., 2024. A time series-based approach to elastic kubernetes scaling. Electronics 13.
  60. Zhang, F., Guo, T., Wang, H. Dfnet: Decomposition fusion model for long sequence time-series forecasting. Knowledge-Based Systems 2023, 277, 110794. [CrossRef]
  61. Zhu, J., Bai, W., Zhao, J., Zuo, L., Zhou, T., Li, K. Variational mode decomposition and sample entropy optimization based trans- former framework for cloud resource load prediction. Knowledge- Based Systems 2023, 280, 111042. [CrossRef]
Figure 1. The concept of integrating the MSFS within a cloud environment.
Figure 1. The concept of integrating the MSFS within a cloud environment.
Preprints 117619 g001
Figure 2. Diagram of the MSFS operation.
Figure 2. Diagram of the MSFS operation.
Preprints 117619 g002
Table 1. Feature analysis.
Table 1. Feature analysis.
Feature Articles
[4,5,6,11,13,23,27,31,32,51,56,59],
Machine learning-based [1,2,3,10,17,24,25,28,32,34,36,37,38,39,40,42,47,48,50,52,53,54,57,58,60],
[20,21,43,46]
Statistical learning-based [29,36,37,38]
Clustering enabled [5,23,29,32,33]
Anomaly detection aware [2,26,28,36,37,38,39,40,54]
Resource reservation [37,39,40]
FinOps aware [37,40]
Forecasting optimization [5,24,32,33,34,53,57]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated