A. Dataset
This study uses the Alibaba Cluster Trace 2018 dataset as the primary data source for model validation and methodological analysis. The dataset consists of backend monitoring records from a production cloud platform, covering a wide range of computing, storage, and scheduling activities of online services. It includes multiple performance indicators for microservice instances, such as CPU utilization, memory usage, network latency, task scheduling, and request throughput. These metrics comprehensively reflect the dynamic behavior of the system under different loads and operational states. Compared with traditional single-node monitoring data, this dataset more closely resembles a real distributed environment: it exhibits multi-tenant, multi-task, and strongly coupled dynamics, making it well suited to research on performance anomaly prediction and time-series modeling in backend microservices.
The dataset records continuous system operation over multiple days at a temporal resolution down to the second. It captures the complete task lifecycle, including submission, start, execution, and completion. Each record is associated with a specific machine identifier, service instance number, and timestamp, which allows the construction of service dependency graphs that evolve over time. This high spatiotemporal resolution enables a detailed exploration of system performance variations from both structural and temporal perspectives, facilitating the identification of potential anomaly patterns. In addition, the dataset provides comparisons between requested and actually used resources, which can be used to characterize resource scheduling efficiency and performance degradation trends, offering a realistic basis for dynamic load modeling.
During data preprocessing, sampling errors, missing values, and extreme outliers are first cleaned and imputed to ensure the continuity and stability of the input sequences. Then, all monitoring indicators are standardized and segmented into time windows, transforming the data into a multi-dimensional time-series format. Based on service dependency logs, dynamic adjacency matrices are constructed to generate input samples suitable for contrastive time-series representation learning. The dataset's large scale, high dimensionality, and noisy characteristics provide a solid experimental foundation for validating the robustness and generalization of the proposed model in dynamic environments. It also serves as an important public data resource supporting research in cloud computing and microservice anomaly analysis.
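For concreteness, a minimal sketch of this preprocessing pipeline is given below. The column names (timestamp, caller, callee), the window length, and the 3-sigma outlier rule are illustrative assumptions rather than the exact settings used in our experiments.

```python
import numpy as np
import pandas as pd

WINDOW = 60  # hypothetical window length (samples at one-second resolution)

def preprocess(df: pd.DataFrame, metric_cols: list[str]) -> np.ndarray:
    """Clean, impute, standardize, and segment monitoring metrics into windows."""
    df = df.sort_values("timestamp")
    # Impute missing samples and clip extreme outliers (3-sigma rule assumed here).
    df[metric_cols] = df[metric_cols].interpolate(limit_direction="both")
    for c in metric_cols:
        mu, sigma = df[c].mean(), df[c].std()
        df[c] = df[c].clip(mu - 3 * sigma, mu + 3 * sigma)
        df[c] = (df[c] - mu) / (sigma + 1e-8)  # z-score standardization
    values = df[metric_cols].to_numpy()
    # Segment into non-overlapping windows: (num_windows, WINDOW, num_metrics).
    n = len(values) // WINDOW * WINDOW
    return values[:n].reshape(-1, WINDOW, len(metric_cols))

def adjacency_from_calls(call_log: pd.DataFrame, services: list[str]) -> np.ndarray:
    """Build a dynamic adjacency matrix from service dependency (caller/callee) logs."""
    idx = {s: i for i, s in enumerate(services)}
    adj = np.zeros((len(services), len(services)))
    for _, row in call_log.iterrows():
        adj[idx[row["caller"]], idx[row["callee"]]] = 1.0
    return adj
```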
B. Experimental Results
This paper first conducts a comparative experiment, and the experimental results are shown in Table 1.
Table 1 presents the comparative performance of different methods on the backend microservice performance anomaly prediction task. From the overall results, it can be observed that traditional machine learning models, such as Decision Tree and XGBoost, show relatively low accuracy and recall, indicating their limited expressive ability when handling multi-dimensional and complex time-series data. These models rely mainly on static features and shallow architectures, which limits their ability to capture dynamic dependencies and temporal evolution patterns. As a result, they fail to effectively identify potential performance degradation trends in highly dynamic microservice systems. In contrast, deep neural networks such as MLP, Transformer, and BERT demonstrate stronger capabilities in nonlinear mapping and high-dimensional feature extraction, leading to a significant improvement in overall performance metrics.
A further comparison shows that the Transformer and BERT models perform well in time-series feature modeling, achieving AUC values of 0.934 and 0.946, respectively. This demonstrates the advantage of the self-attention mechanism in capturing global dependencies. However, these models mainly focus on sequence modeling at the feature level and lack an integrated consideration of structural dependencies and topological evolution among services. Therefore, in complex distributed environments, they still struggle to capture the complete propagation paths of anomalies. The LSTM-Transformer model, by combining recurrent memory units with attention mechanisms, partially alleviates this issue and improves both recall and precision. Nevertheless, it still shows limitations in learning structural consistency across heterogeneous data.
The proposed model in this study outperforms all other methods across four evaluation metrics, with particularly strong performance in AUC and recall. This indicates that the model not only maintains high detection accuracy but also achieves a significant improvement in anomaly coverage. Such superiority is mainly attributed to the contrastive time-series representation learning framework, which enables adaptive modeling of multi-scale spatiotemporal features under unsupervised conditions. By integrating dynamic graph construction with temporal encoding, the model effectively captures dynamic dependencies among microservices and forms discriminative representations in the latent space that distinguish between normal and abnormal states. This results in more robust performance anomaly prediction and verifies the effectiveness of the proposed approach for dynamic dependency modeling and early risk identification in complex cloud environments.
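To illustrate how dynamic graph construction can be combined with temporal encoding, the sketch below applies one round of neighbor averaging over a dynamic adjacency matrix and then summarizes each window with a GRU. The module names, hidden sizes, and pooling choices are illustrative and do not reproduce the exact architecture evaluated here.

```python
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """Illustrative encoder: graph message passing per step, then a GRU over time."""
    def __init__(self, num_metrics: int, hidden: int = 64):
        super().__init__()
        self.node_proj = nn.Linear(num_metrics, hidden)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, hidden)  # projection head for the contrastive loss

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, nodes, metrics); adj: (nodes, nodes) dynamic adjacency.
        h = torch.relu(self.node_proj(x))                 # per-node feature projection
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        h = torch.einsum("ij,btjd->btid", adj / deg, h)   # one round of neighbor averaging
        h = h.mean(dim=2)                                 # pool nodes -> (batch, time, hidden)
        _, last = self.temporal(h)                        # GRU summarizes the window
        return self.head(last.squeeze(0))                 # latent representation per window
```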
Furthermore, this paper presents the impact of the learning rate on the experimental results, as shown in Table 2.
As shown in Table 2, different learning rate settings have a clear impact on model performance, and the overall trend follows a pattern of rising first and then stabilizing. When the learning rate is high (for example, 0.0004), the model converges faster in the early training stage. However, the large step size can cause instability in parameter updates, leading to the loss of fine-grained features in modeling complex spatiotemporal dependencies. As a result, both accuracy and recall remain relatively low. As the learning rate decreases, the model approaches the optimal solution more smoothly, allowing it to better learn the underlying patterns of backend microservice performance variations.
When the learning rate decreases to 0.0003 and 0.0002, all evaluation metrics improve significantly. This indicates that the parameter update speed and the loss reduction process become more stable, enabling the model to learn both local details and global representations effectively. The improvement at this stage mainly benefits from the enhanced discriminative capability of the contrastive time-series representation learning framework under small-step optimization. This allows the model to more accurately separate the temporal characteristics of normal and abnormal states. The increase in the recall metric, in particular, reflects that the model becomes more sensitive to anomaly detection and can identify potential performance degradation trends earlier.
When the learning rate is further reduced to 0.0001, the model achieves its best performance across all metrics, with an accuracy of 0.947, precision of 0.941, recall of 0.936, and AUC of 0.974. This shows that the optimization process is most stable at this setting. A smaller learning rate enables the model to make finer parameter adjustments during gradient updates, leading to more precise feature alignment and semantic matching in complex dynamic service topologies and multi-source time-series inputs. This result demonstrates that a moderately small learning rate enhances the representational power of the contrastive learning framework, allowing spatiotemporal features to form clearer aggregation and separation structures in the latent space.
Overall, changes in the learning rate affect not only the convergence speed of the model but also directly determine the stability and generalization of the latent representations. In backend microservice performance anomaly prediction tasks, an excessively large learning rate can cause the model to chase local fluctuations and destabilize training, while an overly small one may trap it in poor local minima or slow convergence. The learning rate of 0.0001 used in this study achieves the best balance between optimization efficiency and model stability, demonstrating that an appropriate learning rate plays a crucial role in enabling contrastive spatiotemporal representation learning models to achieve high-precision anomaly prediction in dynamic and complex systems.
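The sweep itself reduces to a loop over the candidate learning rates in Table 2. In the sketch below, the linear layer and the commented-out train_and_evaluate call are placeholders standing in for the full encoder and the training and validation procedure.

```python
import torch
import torch.nn as nn

# Hypothetical sweep over the learning rates reported in Table 2.
results = {}
for lr in (4e-4, 3e-4, 2e-4, 1e-4):
    model = nn.Linear(64, 2)  # stand-in for the full spatiotemporal encoder
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    # results[lr] = train_and_evaluate(model, optimizer)  # placeholder training + validation loop
```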
In addition, we evaluate the effect of different optimizers on model performance, and the corresponding experimental results are summarized in Table 3.
The choice of optimizer has a significant impact on both model convergence and prediction performance. The overall trend shows a gradual improvement across all metrics, including accuracy, precision, recall, and AUC, from AdaGrad to AdamW. AdaGrad adjusts the learning rate adaptively using the accumulated squared gradients, which allows for rapid convergence in the early stages of training. However, as the training progresses, the learning rate decays to very small values, limiting the model's ability to learn complex patterns in later stages. As a result, its overall performance remains low. SGD alleviates this issue to some extent, but its fixed learning rate mechanism causes slow convergence in high-dimensional non-convex loss spaces and insufficient modeling of dynamic dependencies.
Adam combines the advantages of momentum and adaptive learning rate adjustment, allowing it to dynamically adapt the update magnitude based on gradient changes during training. This enhances the model's ability to learn complex temporal dependencies and service node variations. Such capability is particularly important for multi-metric joint modeling tasks, where large-scale differences exist among feature dimensions. Traditional optimizers often struggle to balance these differences. The results obtained with Adam demonstrate that adaptive optimization strategies can effectively improve model stability and generalization, enabling the contrastive time-series representation learning framework to form clearer feature distribution structures in the latent space.
Further analysis shows that AdamW performs better than all other optimizers, achieving the highest values across all metrics. This indicates that the weight decay mechanism plays an effective regularization role during training. Compared with Adam, AdamW separates the weight decay term from gradient updates, allowing the model to control parameter adjustments more precisely and reduce overfitting. This optimization method is especially critical in contrastive learning frameworks, where excessive fitting of the latent space can weaken semantic separability among features. The superior performance of AdamW in this task demonstrates its ability to balance model complexity and feature consistency, enhancing the model's sensitivity to backend performance degradation.
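The difference between the two update rules is visible directly in how the optimizers are configured in PyTorch; the hyperparameter values below are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 2)  # stand-in for the full encoder

# Adam folds weight_decay into the gradient as an L2 penalty, so the decay is
# rescaled by the adaptive per-parameter step sizes.
adam = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-2)

# AdamW applies the decay directly to the weights, decoupled from the
# gradient-based update, which acts as a cleaner regularizer.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```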
In summary, the choice of optimizer directly affects the representational power and generalization ability of contrastive time-series representation learning models under complex spatiotemporal structures. Traditional optimizers provide stability but fail to meet the learning requirements of multi-dimensional heterogeneous inputs and dynamic graph features. In contrast, optimizers with adaptive adjustment and weight regularization mechanisms can better capture dynamic dependencies among services. The final results confirm that adopting the AdamW optimization strategy significantly improves model convergence efficiency and anomaly prediction accuracy, providing a stronger foundation for robustness and reliability in backend microservice performance anomaly detection tasks.
Figure 2 illustrates the experimental results obtained under different temperature coefficient settings.
The temperature coefficient has a significant impact on model performance, but the individual metrics do not respond to it consistently, reflecting a trade-off. When the temperature coefficient is 0.05, both Precision and Recall are low, indicating that the similarity distribution in contrastive learning is too sharp and sample representations tend to over-cluster, leaving insufficient separability between normal and abnormal states in the latent space and thus weakening anomaly identification. As the temperature coefficient increases to 0.1, Precision and Accuracy improve, indicating that a moderate temperature clarifies the discrimination boundary and makes the model more robust in reducing false positives.
As the temperature coefficient increases further to 0.2 and 0.3, Recall rises continuously and AUC also increases, reaching its highest value at 0.3. This indicates that higher temperatures help widen the overall ranking gap between abnormal and normal samples, improving the coverage of potential anomalies and the overall discriminative power. At the same time, Accuracy and Precision decrease at higher temperatures, indicating that an excessively high temperature makes the contrastive constraint too smooth, reduces the model's sensitivity to fine-grained features, and increases false alarms. Overall, the temperature coefficient plays a key moderating role between false-alarm control and anomaly coverage, and an appropriate value should be chosen according to the operational tolerance for missed detections versus false alarms.
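A minimal InfoNCE-style loss makes the role of the temperature explicit. The sketch assumes that z1[i] and z2[i] are two views of the same window, which is one common instantiation rather than necessarily the exact loss used here.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss over a batch of paired views; z1[i] and z2[i] are positives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # cosine similarities scaled by the temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    # A lower temperature sharpens the softmax (tighter clusters, as at 0.05);
    # a higher temperature smooths it (broader ranking, as at 0.2-0.3).
    return F.cross_entropy(logits, targets)
```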
This paper also presents a sensitivity experiment on the effect of the data mixing ratio on accuracy, and the experimental results are shown in Figure 3.
Figure 3 shows that the metrics do not simply decrease monotonically; instead, they exhibit differentiated sensitivity and distinct inflection points. Accuracy rises slightly between mixing ratios of 0.25 and 0.5, reaching a peak, and then drops significantly between 0.75 and 1.0; AUC declines more steadily and monotonically, indicating that the higher the mixing ratio, the more the model's overall ranking and discriminative ability is weakened. In summary, a low to moderate mixing ratio is more conducive to maintaining the stability of the representation space, whereas an excessively high mixing ratio disrupts the constraint that contrastive learning places on structural consistency, making it harder to align temporal dependencies with cross-service association information and leading to overall performance degradation.
Further analysis of Precision and Recall reveals an even stronger nonlinear effect: Precision drops rapidly between 0.25 and 0.75, reaching its lowest point at 0.75 before recovering slightly at 1.0; Recall peaks at 0.25, falls sharply at 0.5, and then recovers only marginally between 0.75 and 1.0 before stabilizing. This indicates that a higher mixing ratio significantly alters the model's discriminative bias: on the one hand it reduces sensitivity to anomalous patterns, and on the other hand it makes predictions more conservative or the decision boundaries more ambiguous, ultimately reflected in the continuous decline in AUC and the significant drop in Accuracy at high mixing ratios. Overall, this experiment demonstrates that the mixing data ratio is a key factor affecting the robustness of contrastive time-series representation learning, requiring a balance between covering more mixed information and maintaining the separability of the learned representations.
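Since the corruption mechanism behind this ratio is not detailed above, the sketch below shows one plausible way such a ratio could be applied to the standardized windows, either by masking entries as missing or by mixing them with values from other windows; it is an illustrative assumption rather than the exact protocol used in the experiment.

```python
import numpy as np

def corrupt_windows(x: np.ndarray, ratio: float, mode: str = "mix", seed: int = 0) -> np.ndarray:
    """Corrupt a fraction `ratio` of entries per window, either by zero-masking
    ("missing") or by replacing them with values from other, shuffled windows ("mix")."""
    rng = np.random.default_rng(seed)
    out = x.copy()
    mask = rng.random(x.shape) < ratio
    if mode == "mix":
        donor = x[rng.permutation(len(x))]  # values taken from other windows
        out[mask] = donor[mask]
    else:
        out[mask] = 0.0                     # treat the selected entries as missing
    return out
```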