A. Dataset
The dataset used in this study is the Alibaba Cloud AIOps Dataset, a real-world industrial dataset designed for research on microservice performance monitoring and anomaly analysis. It covers dozens of core microservice components from a large-scale online e-commerce platform and includes multidimensional performance metrics, trace logs, and system event records. The dataset spans a period of more than three months. Each microservice node provides high-frequency monitoring samples, including key operational indicators such as CPU usage, memory consumption, network latency, thread pool utilization, and request throughput. These detailed and multi-level features offer a rich foundation for modeling system performance degradation. The openness and high-dimensional structure of this dataset make it an essential experimental basis for self-supervised modeling and spatiotemporal feature extraction in microservice systems.
The structured design of the dataset captures the complexity of real-world microservice environments. Each sample record contains a timestamp, service instance identifier, metric vector, and associated topological dependency information. By constructing service dependency graphs, dynamic time-varying directed graphs can be formed to describe the evolution of invocation chains and resource contention among services. The dataset also provides multi-source feature dimensions, including application-level request logs, container-level performance metrics, and system-level resource states. This enables models to learn causal patterns of performance degradation across different granularities. In addition, the dataset defines performance states for different time windows, which facilitates the construction of sliding window sequences for self-supervised temporal modeling.
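To make this data organization concrete, the following is a minimal sketch of how per-instance sliding-window sequences could be assembled from such records. The column names (timestamp, instance_id, and the metric fields) and the window length are illustrative assumptions rather than the dataset's actual schema.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the actual dataset schema may differ.
METRICS = ["cpu_usage", "mem_usage", "net_latency", "thread_pool_util", "req_throughput"]

def build_windows(records: pd.DataFrame, window: int = 30, stride: int = 5):
    """Slice each service instance's metric series into overlapping windows.

    records is assumed to hold one row per (timestamp, instance_id) with the
    metric columns above; each returned sample is a (window, n_metrics) array
    suitable for self-supervised temporal modeling.
    """
    samples, owners = [], []
    for inst, group in records.sort_values("timestamp").groupby("instance_id"):
        values = group[METRICS].to_numpy(dtype=np.float32)
        for start in range(0, len(values) - window + 1, stride):
            samples.append(values[start:start + window])
            owners.append(inst)
    return np.stack(samples), owners
```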
Based on this dataset, it is possible to evaluate the adaptive feature learning capability of models in complex topological environments. The performance degradation processes reflected in the dataset are often gradual rather than sudden, caused by combined factors such as network congestion, service dependency latency, and resource bottlenecks. Modeling and validation on this dataset can simulate how degradation propagates in real cloud-native microservice systems, providing a stable foundation for training self-supervised recognition models. The multidimensional, time-varying, and highly correlated characteristics of this dataset make it an important benchmark resource for research on performance degradation identification and spatiotemporal feature learning in large-scale microservice architectures.
B. Experimental Results
This paper first conducts a comparative experiment, and the experimental results are shown in Table 1.
As shown in Table 1, the proposed self-supervised performance degradation identification model achieves the best results across all evaluation metrics in the comparative experiments. Traditional machine learning methods, such as XGBoost and MLP, show significantly lower accuracy, precision, and recall compared with deep learning-based models. This is because these traditional methods cannot effectively capture the dynamic dependencies in high-dimensional time series data of microservice systems. They usually rely on static feature inputs for decision-making and cannot adaptively learn the inter-service relationships and temporal evolution patterns. Therefore, their performance is limited when dealing with complex topological structures and non-stationary performance variations.
In contrast, deep neural network models such as 1DCNN, Transformer, and GNN improve the ability to capture degradation patterns by incorporating convolutional and graph-based modeling mechanisms. 1DCNN can recognize local feature fluctuations within short time windows, while Transformer and GNN further combine global attention with structural dependencies, demonstrating stronger spatiotemporal modeling capabilities. However, since these models still rely on supervised learning paradigms and lack semantic consistency constraints across time and services, their generalization and stability remain limited in dynamic microservice environments.
The performance improvement of self-supervised contrastive learning methods such as SimCLR indicates that unsupervised representation learning can effectively alleviate the limitations caused by label scarcity. By pulling similar samples closer and separating dissimilar ones in the feature space, the model can learn the underlying distribution of degradation patterns without explicit labels. This enhances the discriminability and robustness of learned features. These results confirm the applicability of self-supervised mechanisms in complex operations and maintenance scenarios, especially under high-dimensional and heterogeneous data conditions, where they outperform traditional supervised methods in feature learning.
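For reference, the following is a minimal PyTorch sketch of a SimCLR-style NT-Xent objective that pulls two views of the same window together and pushes all other samples in the batch apart; the temperature value and embedding shapes are placeholders, and this is not claimed to be the exact loss used by the proposed model.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.2):
    """SimCLR-style contrastive loss: z1[i] and z2[i] are embeddings of two
    augmented views of the same window; all other pairs in the batch act as
    negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d)
    sim = z @ z.t() / temperature                              # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                      # exclude self-pairs
    # The positive for sample i is its other view, located n positions away.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```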
Overall, the proposed model significantly outperforms all comparison methods in terms of accuracy, precision, recall, and F1-score. This demonstrates the effectiveness of the proposed spatiotemporal self-supervised framework for performance degradation identification. By integrating multi-source feature fusion, dynamic graph structure modeling, and self-supervised consistency constraints, the model captures the complex dependencies and temporal evolution patterns in microservice systems. The results show that the model not only achieves superior recognition accuracy but also exhibits strong generalization and stability, providing a reliable performance perception and early warning mechanism for intelligent operations in large-scale microservice environments.
The experimental results for different optimizers are also presented in Table 2.
As shown in Table 2, different optimizers exhibit distinct differences in training stability and final model performance. The overall trend indicates that traditional gradient-based methods, such as AdaGrad and SGD, converge relatively quickly but are prone to getting stuck in local optima when dealing with complex spatiotemporal features and non-stationary distributions. This leads to noticeable fluctuations in precision and recall. AdaGrad suffers from rapid learning rate decay, which limits its ability to optimize fine-grained features in later stages. SGD shows more stable global convergence but lacks sensitivity in updating dynamic feature weights.
In contrast, Adam provides higher adaptability in parameter optimization through bias-corrected estimates of the first and second moments of the gradient. It balances the gradient update speed across different feature dimensions, maintaining a stable convergence process on the complex microservice dataset. In this study's self-supervised task, Adam performs better than the traditional optimizers, especially in the F1-score, showing an improved balance between precision and recall and indicating its superiority in handling data heterogeneity and temporal variation across multiple monitoring sources.
AdamW further introduces a weight decay mechanism, which helps prevent overfitting while maintaining convergence speed. This optimizer achieves the highest overall performance in accuracy, precision, and recall, demonstrating its ability to preserve feature distribution consistency and stability within the self-supervised learning framework. Its suppression of noisy gradients in high-dimensional feature spaces allows the model to capture latent temporal patterns of performance degradation more effectively, resulting in finer dynamic representations.
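As a point of reference, the snippet below contrasts the two optimizer configurations in PyTorch; the encoder stand-in and the learning-rate and weight-decay values are illustrative assumptions, not the exact settings used in this study.

```python
import torch

model = torch.nn.Linear(64, 32)  # stand-in for the actual spatiotemporal encoder

# Adam folds L2 regularization into the gradient, entangling it with the adaptive
# moment estimates; AdamW applies weight decay directly to the parameters, which
# is the decoupling referred to above.
adam = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```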
In summary, AdamW shows the best training stability and generalization capability in this task. This finding suggests that choosing optimizers with proper regularization can significantly enhance model robustness and spatiotemporal modeling effectiveness in large-scale microservice performance degradation identification. Its superior performance also provides a reliable optimization reference for deploying self-supervised identification models in complex industrial environments.
This paper also presents an experiment on the sensitivity of the F1-score to the learning rate, and the experimental results are shown in Figure 2.
As shown in Figure 2, different learning rates have a significant impact on the model's F1-score. This indicates that optimizer parameters play a crucial role in self-supervised performance degradation identification tasks. When the learning rate is too low (such as 1e-5), the model updates slowly and tends to fall into local optima. As a result, spatiotemporal features are not fully fitted, and the recognition performance is limited. When the learning rate increases moderately (up to 1e-4), the model can better capture the dynamic patterns of microservice performance degradation. It forms clearer decision boundaries in the feature space, and the F1-score reaches its highest value, suggesting that the spatiotemporal modeling and contrastive optimization processes are most stable.
When the learning rate continues to increase (such as 5e-4 or 1e-3), the model performance slightly declines. This is mainly due to unstable feature distributions caused by overly rapid gradient updates, which weaken the aggregation of degradation-related features. A higher learning rate causes oscillations during optimization, making it difficult for the model to maintain consistent representations in the self-supervised contrastive space. This reduces global feature consistency and local feature sensitivity. These results indicate that learning rate control is essential to ensure stable convergence when dealing with complex, multi-source monitoring data.
Overall, when the learning rate is around 1e-4, the model achieves the best balance between spatiotemporal representation and self-supervised optimization objectives. This reflects the adaptive advantage of self-supervised mechanisms when applied to multidimensional performance data. The results also verify that the model shows low sensitivity to hyperparameters in dynamic microservice scenarios. It can maintain stable performance degradation identification within an appropriate learning rate range, providing valuable guidance for parameter selection and automatic tuning in large-scale systems.
This paper also presents an experiment on the sensitivity of the F1-score to the residual coefficient, and the experimental results are shown in Figure 3.
As shown in Figure 3, the residual coefficient has a significant impact on the model's F1-score, indicating that residual fusion plays a key regulatory role in self-supervised performance degradation identification tasks. When the residual coefficient is small (such as 0.1), the model mainly relies on representations from the current time step. This leads to insufficient information transfer and weak memory of historical dependencies, which limits the global aggregation of degradation features. As the residual coefficient increases, the model gradually enhances cross-time-step information fusion during feature updates. This improves the spatiotemporal consistency of the latent space, and the F1-score rises accordingly.
When the residual coefficient reaches a moderate level (such as 0.5), the model achieves peak performance. At this point, the residual branch and the main branch reach an optimal balance in information fusion. A moderate residual signal can preserve historical features while suppressing noise accumulation, improving the robustness and generalization of the model under complex dependency topologies. The model can more accurately capture latent performance relationships among microservices, making spatiotemporal feature representations more stable and supporting effective self-supervised aggregation and discrimination of degradation patterns.
When the residual coefficient continues to increase (such as 0.7 or higher), model performance begins to decline. This occurs because an overly strong residual signal weakens the dominance of the main feature update process, causing the model to rely too heavily on historical representations. As a result, representation redundancy and gradient imbalance appear under dynamic performance fluctuations. These findings suggest that in large-scale microservice environments, the design of residual mechanisms must maintain a dynamic balance between information transmission and feature innovation. Only by doing so can the self-supervised framework fully leverage its advantages in spatiotemporal feature modeling and achieve efficient and stable identification of performance degradation processes.
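To make the role of the coefficient concrete, the following is a minimal sketch of a coefficient-weighted residual fusion between the previous hidden state and the current update; the convex-combination form and the variable names are assumptions for illustration, not the exact formulation of the proposed model.

```python
import torch

def residual_fusion(h_prev: torch.Tensor, h_new: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend the previous time step's representation into the current update.

    A small alpha (e.g. 0.1) anchors the output to the current step and keeps
    little history, alpha around 0.5 balances history and innovation, and a
    large alpha (0.7 or more) lets the historical branch dominate, mirroring
    the trend discussed above.
    """
    return alpha * h_prev + (1.0 - alpha) * h_new
```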
This paper also examines the effect of the temperature coefficient, and the experimental results are shown in Figure 4.
As shown in Figure 4, the temperature coefficient in the contrastive learning framework has a significant impact on model performance. The temperature coefficient controls the separation scale between positive and negative samples, influencing how features are clustered and separated in the latent space. When the temperature coefficient is small (such as 0.05), the model pulls similar samples too close together, causing the feature distribution to become overly concentrated. This leads to unstable gradient updates and lower F1-score and other metrics. Although the model aggregates features quickly in this case, it struggles to distinguish between different degradation patterns, resulting in weak and less robust spatiotemporal feature boundaries.
When the temperature coefficient increases to a moderate range (such as 0.2), the model achieves peak performance across all metrics. At this level, the temperature balances the contrast strength between positive and negative samples, enabling the latent feature space to maintain intra-class consistency while preserving sufficient inter-class separability. This result shows that a moderate temperature setting can effectively enhance the discriminative capability of features in self-supervised contrastive learning, making the spatiotemporal representations clearer and more stable. This property is crucial for capturing fine-grained variations in microservice performance degradation and improves the model’s sensitivity to degradation trends under complex dependency topologies.
When the temperature coefficient continues to increase (such as 0.5 or higher), model performance begins to decline slightly. A higher temperature weakens the distance constraints between positive and negative samples, causing features to become too dispersed in the latent space. This reduces the aggregation of degradation-related features. Under high temperatures, the model struggles to maintain stable alignment in feature contrast, leading to an imbalance between global feature consistency and local feature resolution, which results in decreased precision and recall.
Overall, the model performs best when the temperature coefficient is around 0.2. This indicates that in large-scale microservice performance degradation identification, the temperature setting in contrastive learning is a key factor affecting feature distribution stability and semantic aggregation capability. Proper temperature adjustment helps the model maintain information balance in high-dimensional feature spaces, allowing the self-supervised learning process to strengthen discriminative representations without excessive separation. This improves both the accuracy and generalization ability of performance degradation pattern recognition.
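To illustrate the scaling effect numerically, the short sketch below applies a softmax to one anchor's similarities at the three temperatures discussed above; the similarity values themselves are made-up placeholders.

```python
import torch

# Made-up cosine similarities of an anchor to one positive and three negatives.
sims = torch.tensor([0.8, 0.5, 0.3, 0.1])

for tau in (0.05, 0.2, 0.5):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: weight on the positive = {probs[0]:.3f}")
# A small tau concentrates almost all probability mass on the closest sample,
# while a large tau flattens the distribution and weakens the contrast signal.
```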
This paper also analyzes the impact of the monitoring sampling interval on model performance, and the experimental results are shown in Figure 5.
As shown in Figure 5, the monitoring sampling interval has a clear impact on model performance. The four main metrics (Accuracy, Precision, Recall, and F1-score) all show a gradual decline as the sampling interval increases. When the sampling interval is short (such as 1s), the model can capture fine-grained system fluctuations and effectively learn short-term dependencies and sudden degradation patterns among microservices. At this stage, the model achieves the best overall accuracy and stability. The high temporal resolution of the data allows the model to better reconstruct the dynamic evolution of performance degradation, resulting in stronger spatiotemporal representation capability.
As the sampling interval increases (such as 3s or 5s), model performance gradually decreases. This suggests that longer sampling periods reduce the model’s sensitivity to short-term anomalies and subtle fluctuations. Some transient degradation behaviors are smoothed out by the longer intervals, making it difficult for the model to capture fine-grained performance variations during training. As a result, feature representations in the latent space become more blurred, which lowers overall detection accuracy and recall. Although the model’s training efficiency improves slightly in this setting, its ability to describe temporal degradation trends becomes clearly limited.
When the sampling interval is further extended to 10s or 20s, model performance declines significantly. Longer sampling intervals disrupt the temporal continuity of monitoring data, making it harder for the model to establish causal and sequential dependencies among services. Consequently, degradation patterns become diluted or obscured. In dynamic topologies and heterogeneous monitoring environments, long sampling intervals also amplify system noise and delay effects, reducing the model’s responsiveness in highly dynamic workload scenarios.
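The smoothing effect described above can be reproduced with a short resampling sketch; the 1-second base series and the injected spike are illustrative assumptions rather than actual dataset values.

```python
import pandas as pd

# A made-up 1 s latency series containing a 3 s transient degradation burst.
idx = pd.date_range("2024-01-01", periods=120, freq="1s")
latency = pd.Series(20.0, index=idx)
latency.iloc[60:63] = 180.0

for interval in ("1s", "5s", "20s"):
    resampled = latency.resample(interval).mean()
    print(interval, "max observed latency:", float(resampled.max()))
# At 1 s the burst is fully visible; averaging over 5 s or 20 s windows dilutes
# it, which is how longer sampling intervals mask transient degradation.
```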
Overall, the experimental results show that the choice of sampling interval is crucial for performance degradation identification. Very short intervals provide richer information but increase monitoring costs and computational overhead, while overly long intervals significantly weaken spatiotemporal perception. The proposed model achieves its best performance at a 1s sampling interval, demonstrating its ability to fully capture the spatiotemporal dependencies of microservice performance degradation in high-frequency data environments. This provides valuable guidance for real-time performance analysis in intelligent operations and maintenance systems.