Submitted: 23 June 2025
Posted: 25 June 2025
Abstract
Keywords:
1. Introduction
1.1. Understanding Anomalies in Financial Markets
In financial time series, an anomaly is typically an observation that:
- Breaks typical statistical expectations (e.g., returns several standard deviations from the mean),
- Occurs under abnormal market conditions (e.g., flash crashes or illiquidity events),
- Violates known inter-variable relationships (e.g., price-volume decoupling),
- Indicates potential concerns such as fraud, algorithmic failure, or market manipulation [3].
Two types of anomaly are commonly distinguished:
- Point Anomalies: Individual observations that sharply diverge from historical norms (e.g., an isolated price spike). These are commonly detected using statistical models such as Z-score thresholds.
- Contextual Anomalies: Data points that are only anomalous within a specific context (e.g., high trading volume during a normally quiet period). Detection of these requires auxiliary variables or temporal context.
1.2. Why Anomalies Matter
Timely detection of such anomalies enables:
- Proactive risk mitigation during crises or regime shifts,
- Enhanced performance of algorithmic trading strategies,
- Smarter regulatory surveillance and compliance systems,
- Increased model robustness under high-stress or non-stationary market conditions.
2. Challenges in Financial Anomaly Detection
2.1. Lack of Labeled Data
2.2. Data Accessibility Constraints
2.3. Evaluation Under Uncertainty
2.4. Research Questions
How can we uncover abnormal trading patterns before they lead to systemic disruptions? Can we design intelligent systems that identify early warning signs in dynamic and volatile markets? What mechanisms allow us to discern structural anomalies amid noisy financial data?
To address these questions, we propose two complementary unsupervised detectors:
- An LSTM Autoencoder paired with a One-Class SVM, which flags anomalies through reconstruction errors and deviations in the learned latent space.
- A Generative Adversarial Network (GAN) [12] that learns the underlying distribution of return-volume dynamics and detects outliers via reconstruction errors.
2.5. Why LSTM Autoencoders and GANs?
Our contributions are fourfold:
- We develop and evaluate a hybrid detection framework that captures both point-based and collective anomalies.
- We design an artificial anomaly injection procedure for robust evaluation, enabling quantitative benchmarking despite the absence of ground truth labels.
- We conduct comprehensive sensitivity testing across economic regimes, asset classes, and model parameters to ensure reliability and interpretability.
- We demonstrate the practical utility of our models through consistent performance across volatile periods, such as the Global Financial Crisis and the COVID-19 pandemic.
3. Related Work
4. Methodology
4.1. Baseline Methods
- Z-Score: This simple statistical method flags extreme observations by identifying values that deviate significantly from a rolling mean, based on standard deviation thresholds. Though computationally efficient, it is sensitive to window size and assumes normality and stationarity, limiting its utility in volatile markets.
- GARCH(1,1): The Generalized Autoregressive Conditional Heteroskedasticity model [5] captures time-varying volatility, a key feature in financial returns. Anomalies are inferred from sudden spikes in conditional variance. While GARCH is widely used in econometrics, it may overlook non-volatility-driven anomalies.
- One-Class SVM on Raw Features: This model estimates the support of the input distribution [6] using only raw return and volume data, without accounting for temporal dependencies. It constructs a hyperplane to separate “normal” from “abnormal” points but lacks sequential context.
4.2. Hybrid LSTM Autoencoder + One-Class SVM
Algorithm 1: Hybrid LSTM-SVM Detection
4.3. GAN-Based Anomaly Detection
4.4. Robustness and Sensitivity Analysis
- Sequence Window Length (W): We explore the influence of different temporal contexts by varying the LSTM window size W. Shorter windows emphasize local fluctuations, while longer sequences provide a broader view of market behavior. The model performs consistently across settings, highlighting its adaptability to varying temporal scales.
- SVM Regularization Parameter (ν): We vary ν to examine how the One-Class SVM balances model sensitivity and tolerance to noise. As expected, tighter margins (lower ν) reduce false positives but may under-detect novel anomalies. Moderate values yield the best trade-off between precision and recall.
- GAN Training Stability: To verify convergence and generalization, we train the GAN over a range of epochs (50 to 200). The model stabilizes reliably by 100 epochs, with negligible variation beyond. This confirms that the GAN learns a robust representation without overfitting.
- Anomaly Injection Thresholds: We inject synthetic anomalies by perturbing return-volume sequences at controlled magnitudes (90th to 99th percentile). This benchmark allows us to evaluate the model’s recall and F1 score under a spectrum of anomaly severities. Our method maintains strong detection performance, even under extreme perturbations.
Algorithm 2: GAN-Based Anomaly Detection
6. Experimental Setup
6.1. Market Coverage and Dataset Composition
- Equity Indices: Widely tracked benchmarks such as SPY and QQQ.
- Mega Cap Stocks: Large, well-established firms (e.g., AAPL, MSFT, AMZN).
- Small Cap Equities: Higher volatility firms with less liquidity (e.g., CHGG, PLUG).
- Penny Stocks: Low-priced, highly volatile stocks (e.g., SNDL, COSM).
- High/Low Volatility Stocks: To explore model behavior under different risk regimes.
6.2. Historical Coverage and Relevance
6.3. Preprocessing
6.4. Artificial Anomaly Injection
- If the return is positive, it is increased by the 95th percentile of the absolute return distribution.
- If the return is negative, it is decreased by the 95th percentile of the absolute return distribution.
- Trading volume is scaled by a factor up to the defined anomaly size.
- The direction of adjustment (increase or decrease) is randomly assigned.
6.5. Training Protocol
- LSTM Autoencoder: Trained using mean squared error (MSE) loss over sliding sequences of return-volume data.
- One-Class SVM: Trained exclusively on latent vectors from presumed normal data.
- GAN: Trained in an unsupervised manner with alternating generator and discriminator updates, using return-volume vectors.
6.6. Evaluation Metrics
- Precision: The proportion of predicted anomalies that are correct.
- Recall: The proportion of actual anomalies that are successfully detected.
- F1 Score: The harmonic mean of precision and recall.
- F4 Score: The Fβ score with β = 4, which treats recall as four times as important as precision; suitable for early warning systems, where a missed anomaly is costlier than a false alarm.
6.7. Implementation Details
7. Results and Discussion
7.1. Model Performance Across Stock Categories and Market Regimes
7.2. Overall Detection Performance
7.3. Behavior During Crisis Periods
7.4. Sensitivity to Hyperparameters
7.5. Visualization of Detection Mechanics
7.6. Discussion and Practical Implications
- Unsupervised Learning: Requires no labeled data, making it broadly applicable in real-world settings.
- Modular Design: LSTM handles temporal structure, while GAN captures statistical shifts—making the system adaptive and interpretable.
- Resilience Across Market Regimes: Robust performance across bullish, bearish, and volatile periods highlights the model’s generalization ability.
8. Conclusions
8.1. Reflections and Future Directions
8.2. Toward Practical Impact
References
- V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, vol. 41, no. 3, pp. 1–58, 2009. [CrossRef]
- M. Ahmed, A. N. Mahmood, and M. R. Islam, “A survey of anomaly detection techniques in financial domain,” Future Generation Computer Systems, vol. 55, pp. 278–288, 2016. [CrossRef]
- M. Goldstein and S. Uchida, “A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data,” PLOS ONE, vol. 11, no. 4, p. e0152173, 2016. [CrossRef]
- L. Liu, C. Y. Chan, and K. K. Ang, “Transfer learning on convolutional activation feature as applied to a building quality assessment robot,” Advances in Science, Technology and Engineering Systems Journal, vol. 4, no. 2, pp. 115–123, 2019. [CrossRef]
- T. Bollerslev, “Generalized autoregressive conditional heteroskedasticity,” Journal of Econometrics, vol. 31, no. 3, pp. 307–327, 1986. [CrossRef]
- B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001. [CrossRef]
- F. T. Liu, K. M. Ting, and Z. H. Zhou, “Isolation forest,” in Proc. 2008 IEEE Int. Conf. on Data Mining (ICDM), pp. 413–422, 2008. [CrossRef]
- P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, “Long short term memory networks for anomaly detection in time series,” in Proc. Eur. Symp. on Artificial Neural Networks (ESANN), 2015.
- D. Li, D. Chen, J. Jin, L. Shi, J. Goh, and S. Ng, “MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks,” in Proc. Int. Conf. on Artificial Neural Networks (ICANN), pp. 703–716, 2019. [CrossRef]
- R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” ACM Computing Surveys, vol. 51, no. 3, pp. 1–36, 2019.
- J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” Special Lecture on IE, vol. 2, no. 1, pp. 1–18, 2015.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014.
- L. Liu, Y. Chen, W. Wang, and Y. Sun, “CNN-based automatic coating inspection system,” Advances in Science, Technology and Engineering Systems, vol. 3, no. 6, pp. 469–478, 2018. [CrossRef]
- L. Liu, Y. Chen, and W. Wang, “AI-facilitated coating corrosion assessment system for productivity enhancement,” Engineering: Open Access, vol. 3, no. 2, pp. XX–XX, 2023. [CrossRef]
- H. Huang and Y. Wu, “Deep learning-based high-frequency jump test for detecting stock market manipulation: evidence from China’s securities market,” Kybernetes, 2024. [CrossRef]
- P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, “Time-series anomaly detection with stacked LSTM and multivariate Gaussian distribution,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2016.
- A. Blázquez-García, A. Conde, U. Mori, and J. A. Lozano, “A review on outlier/anomaly detection in time series data,” ACM Computing Surveys, vol. 54, no. 3, pp. 1–33, 2021. [CrossRef]
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020. [CrossRef]
| Method | Sequential | Unsupervised | Used in Finance | Captures Volatility | Handles Dist. Shift | Complexity |
|---|---|---|---|---|---|---|
| Z-Score | No | Yes | Yes | No | No | Low |
| GARCH | Yes (volatility) | Yes | Yes | Yes | No | Medium |
| One-Class SVM | No | Yes | Yes | No | No | Medium |
| LSTM Autoencoder | Yes | Yes | Partial | No | Yes (latent space) | High |
| GAN | No | Yes | Partial | No | Yes | High |
| Ours (LSTM + GAN) | Yes | Yes | Yes | Yes (via hybrid) | Yes | High |
| Category | Example Tickers |
|---|---|
| Indices | SPY, QQQ, DIA, IWM, VTI |
| Mega Cap | AAPL, MSFT, GOOG, AMZN, BRK-B, TSLA, UBER, SNAP, PTON, LYFT |
| Small Cap | ETSY, CHGG, PLNT, SFIX, RVLV, PLUG, FCEL, SPCE, BYND, HCMC |
| High Volatility | AMD, NVDA, MRNA, ZM |
| Low Volatility | KO, JNJ, PG, PEP, MCD |
| Penny Stocks | SNDL, ZOMDF, CTRM, COSM |
| Model | Strength | Weakness | Best Use Case |
|---|---|---|---|
| LSTM Autoencoder | High recall in structured regimes | Struggles with sharp distributional shocks | Sequential anomalies in mega/small caps |
| GAN (unsupervised) | Sensitive to complex shifts | May overfit small data | Regime detection under volatility (COVID) |
| One-Class SVM | Simple, interpretable | Poor performance in latent space | Baseline detection on raw features |
| GARCH(1,1) | Captures volatility clustering | Misses collective anomalies | Low-volatility indices with smooth trends |
| Z-Score | Fast, robust to noise | Low recall, overly simplistic | Only effective for extreme spikes |
| Model | Precision | Recall | F1 Score | F4 Score |
|---|---|---|---|---|
| Z-Score | 0.51 | 0.38 | 0.44 | 0.40 |
| GARCH(1,1) | 0.58 | 0.43 | 0.49 | 0.46 |
| One-Class SVM (raw) | 0.61 | 0.50 | 0.55 | 0.53 |
| LSTM Autoencoder | 0.64 | 0.71 | 0.67 | 0.69 |
| GAN (unsupervised) | 0.66 | 0.68 | 0.67 | 0.68 |
| Ours (LSTM + GAN) | 0.70 | 0.80 | 0.74 | 0.78 |
| Scenario | LSTM Weight | GAN Weight | Rationale |
|---|---|---|---|
| Balanced Importance | 0.5 | 0.5 | Equal weight to LSTM and GAN. A good default when no prior preference exists. |
| Temporal Emphasis | 0.7 | 0.3 | Highlights sequential anomalies. Ideal when time-based disruptions are more critical. |
| Distributional Emphasis | 0.3 | 0.7 | Focuses on statistical irregularities. Useful for detecting regime shifts. |
| Adaptive Tuning | CV tuned | CV tuned | Automatically selected based on validation metrics (e.g., ROC-AUC or F1). |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).