Phishing detection models often report strong benchmark performance, yet their reliability under realistic deployment conditions remains uncertain. This study investigates three failure modes of cross-dataset phishing email detection: corpus generalisation failure, asymmetric prevalence-shift failure, and artifact-driven spurious learning. Using six public email corpora (CEAS_08, Enron, Ling, Nazario, Nigerian Fraud, and SpamAssassin), the study evaluates Term Frequency-Inverse Document Frequency (TF-IDF)-based Logistic Regression and Linear Support Vector Classifier (SVC) models. The evaluation covers pooled baseline testing, single-corpus cross-dataset transfer, leave-one-corpus-out pooled training, prevalence-shift simulation, training-prevalence manipulation, dataset-identification analysis, top-feature inspection, artifact-removal ablation, and targeted artifact masking.
The findings show that single-corpus models are unstable under cross-dataset transfer, with F1-scores varying substantially across source–target combinations. In contrast, leave-one-corpus-out pooled training improves robustness: across unseen corpora, Logistic Regression achieves F1-scores between 0.8201 and 0.8994, and Linear SVC achieves F1-scores between 0.7607 and 0.8910. Prevalence-shift experiments reveal that failure is asymmetric and threshold-dependent. High-prevalence-trained models maintain high recall under fixed thresholds but suffer sharp recall degradation when operational alert-budget constraints are imposed. Conversely, low-prevalence-trained models become overly conservative in high-threat environments, producing high precision but substantially lower recall and poorer calibration. Artifact analyses further show that source corpus identity is highly learnable, with dataset-identification accuracy reaching 0.9722 for Logistic Regression and 0.9806 for Linear SVC. Top-feature and masking analyses indicate that models rely partly on corpus markers, date tokens, URL/domain terms, headers, and other artifact-like features rather than solely on general phishing indicators.
The study contributes a deployment-aware and adversary-aware evaluation framework for phishing detection. It shows that benchmark accuracy alone is insufficient for assessing real-world robustness and that reliable phishing detection requires cross-corpus validation, prevalence-aware thresholding, and systematic testing for artifact-driven spurious learning.
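The threshold-dependence of the prevalence-shift result can be illustrated with a minimal sketch. It assumes that an alert budget means only the highest-scoring emails, up to a fixed count, can be reviewed; the function names, the 0.5 threshold, the budget of 3, and the toy scores and labels are all hypothetical and are not taken from the study's experiments.

```python
def recall_fixed_threshold(scores, labels, threshold=0.5):
    """Recall when every email scoring at or above `threshold` is flagged."""
    true_positives = sum(
        1 for s, y in zip(scores, labels) if s >= threshold and y == 1
    )
    positives = sum(labels)
    return true_positives / positives if positives else 0.0


def recall_under_alert_budget(scores, labels, budget):
    """Recall when only the `budget` highest-scoring emails can be reviewed."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(y for _, y in ranked[:budget])
    positives = sum(labels)
    return true_positives / positives if positives else 0.0


# Hypothetical high-threat inbox: 6 phishing emails (label 1) out of 10.
# These values are invented for illustration only.
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

fixed = recall_fixed_threshold(scores, labels, threshold=0.5)
budgeted = recall_under_alert_budget(scores, labels, budget=3)
print(f"fixed-threshold recall: {fixed:.2f}")
print(f"alert-budget recall:    {budgeted:.2f}")
```

In this toy case the classifier ranks every phishing email above every legitimate one, so recall is perfect at the fixed threshold, yet the budget caps recall at the fraction of phishing emails that fit within the review capacity. The ranking quality is unchanged; only the operating constraint differs.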