A Data-Centric Network Traffic Dataset for Anomaly Detection: Construction, Reproducible Pipeline, and Technical Validation

Daniel Quirumbay Yagual; Diego Fernández Iglesias; Francisco J. Nóvoa; Daniel Garabato

doi:10.20944/preprints202606.0740.v1

Submitted: 08 June 2026

Posted: 10 June 2026


Abstract

The effectiveness of machine learning and deep learning methods for network anomaly detection depends strongly on the quality and representativeness of the datasets used for training and evaluation. However, many publicly available benchmarks rely on synthetic traffic, outdated attack scenarios, or limited representation of encrypted communications. This work presents a network traffic dataset derived from operational firewall logs collected in a heterogeneous institutional environment dominated by HTTPS/TLS traffic. A structured data-centric pipeline was implemented, including preprocessing, behavioral feature engineering, unsupervised pseudo-labeling through the EFMS-KMeans algorithm, class balancing using SMOTE, and temporal sequence generation for sequential analysis. The resulting dataset contains large-scale flow-level records describing volumetric, behavioral, and temporal traffic characteristics while preserving privacy through anonymization procedures. Technical validation was conducted using statistical analysis, entropy-based measurements, clustering quality metrics, and dimensionality reduction techniques, confirming data consistency, diversity, and class separability. The dataset is publicly available through the Mendeley Data repository together with metadata and documentation supporting anomaly detection research, encrypted traffic analysis, and the evaluation of machine learning and deep learning approaches in realistic cybersecurity environments.
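Two of the steps named above, temporal sequence generation and entropy-based measurement, can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout (destination port, bytes), the window length, the stride, and the function names `shannon_entropy` and `sliding_windows` are all assumptions chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_windows(records, length, stride=1):
    """Split a time-ordered list of records into fixed-length windows."""
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    last_start = len(records) - length
    return [records[i:i + length] for i in range(0, last_start + 1, stride)]


# Hypothetical flow records (destination port, bytes), assumed time-ordered.
flows = [(443, 1200), (443, 800), (53, 90), (443, 1500), (22, 300), (443, 700)]

# Assumed window length and stride; the dataset's real values may differ.
windows = sliding_windows(flows, length=4, stride=2)

# Per-window entropy of destination ports: higher means more diverse traffic.
port_entropy = [shannon_entropy([port for port, _ in w]) for w in windows]
```

A window dominated by a single port (for example, HTTPS on 443) yields low entropy, while a window spread across several ports yields higher entropy, which is one simple way to quantify traffic diversity per sequence.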

Keywords: anomaly detection; data-centric cybersecurity; network traffic dataset; pseudo-labeling; reproducible pipeline

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
