Preprint
Data Descriptor

This version is not peer-reviewed.

A Data-Centric Network Traffic Dataset for Anomaly Detection: Construction, Reproducible Pipeline, and Technical Validation

Submitted: 08 June 2026

Posted: 10 June 2026

Abstract
The effectiveness of machine learning and deep learning methods for network anomaly detection depends strongly on the quality and representativeness of the datasets used for training and evaluation. However, many publicly available benchmarks rely on synthetic traffic, outdated attack scenarios, or limited representation of encrypted communications. This work presents a network traffic dataset derived from operational firewall logs collected in a heterogeneous institutional environment dominated by HTTPS/TLS traffic. A structured data-centric pipeline was implemented, including preprocessing, behavioral feature engineering, unsupervised pseudo-labeling through the EFMS-KMeans algorithm, class balancing using SMOTE, and temporal sequence generation for sequential analysis. The resulting dataset contains large-scale flow-level records describing volumetric, behavioral, and temporal traffic characteristics while preserving privacy through anonymization procedures. Technical validation was conducted using statistical analysis, entropy-based measurements, clustering quality metrics, and dimensionality reduction techniques, confirming data consistency, diversity, and class separability. The dataset is publicly available through the Mendeley Data repository together with metadata and documentation supporting anomaly detection research, encrypted traffic analysis, and the evaluation of machine learning and deep learning approaches in realistic cybersecurity environments.
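Two of the steps named above, temporal sequence generation and entropy-based measurement, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `shannon_entropy` and `sliding_windows`, the window length, the step size, and the sample destination ports are all hypothetical and chosen only for illustration. The sketch assumes flow records are already ordered by time.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    if not values:
        return 0.0
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_windows(records: Sequence, length: int, step: int = 1) -> List[list]:
    """Split time-ordered records into fixed-length, possibly overlapping windows."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    return [
        list(records[i:i + length])
        for i in range(0, len(records) - length + 1, step)
    ]


# Hypothetical destination ports taken from six time-ordered flows.
ports = [443, 443, 443, 80, 443, 53]

# Each window is one candidate input sequence for a sequential model.
windows = sliding_windows(ports, length=3, step=1)

# Per-window entropy: low when one port dominates, higher when ports are mixed.
entropies = [shannon_entropy(w) for w in windows]
```

In this sketch a window made up entirely of port 443 has zero entropy, while a window with three distinct ports reaches the maximum for its length, log2(3) bits. A per-window measure of this kind is one simple way to quantify traffic diversity over time.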
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated