Contrastive learning–based models such as DrugCLIP have recently emerged as scalable tools for structure-based virtual screening, embedding protein structures and small molecules into a shared representation space. While these approaches demonstrate high throughput and competitive performance in ligand retrieval tasks, their ability to correctly identify biologically relevant ligand-binding pockets has not been systematically evaluated. Here, we construct a benchmark dataset comprising 42 pharmacologically diverse human protein targets with experimentally validated drug-bound structures spanning multiple target families. Using this dataset, we evaluate the pocket-recognition capability of DrugCLIP and compare its performance with a traditional structure-based workflow (Fpocket combined with ESSA) and a machine learning-based method (P2Rank). DrugCLIP shows robust performance for well-characterized target classes, including kinases (10/10) and nuclear receptors (5/5), but exhibits markedly reduced accuracy for ion channels (1/4), GPCRs (3/5), and transporters (3/5). Notably, pocket prediction accuracy does not correlate strongly with structural data availability, suggesting that intrinsic pocket characteristics, rather than training data abundance, primarily determine model performance. Across the benchmark, DrugCLIP achieves an overall success rate of 71% (95% CI: 56–83%), compared with 79% (95% CI: 64–88%) for Fpocket+ESSA and 93% (95% CI: 81–98%) for P2Rank. McNemar's test shows no significant difference between DrugCLIP and Fpocket+ESSA (p=0.508), whereas P2Rank significantly outperforms DrugCLIP (p=0.012). Together, these results provide a quantitative evaluation of pocket recognition by contrastive learning–based models and highlight key limitations of embedding-based approaches for pocket localization.
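
The sketch below (not part of the original study) shows one way the reported statistics can be reproduced in Python. It assumes per-method success counts of 30/42, 33/42, and 39/42, inferred from the quoted percentages, Wilson score intervals for the CIs, and the exact (binomial) form of McNemar's test; the discordant-pair counts (3 vs. 6 and 1 vs. 10) are hypothetical values chosen only for consistency with the reported p-values.

```python
# Minimal reproduction sketch, assuming success counts inferred from the
# reported percentages (30/42, 33/42, 39/42); not the study's actual code.
import math
from scipy.stats import binom

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant-pair counts b and c."""
    return min(1.0, 2 * binom.cdf(min(b, c), b + c, 0.5))

for name, k in [("DrugCLIP", 30), ("Fpocket+ESSA", 33), ("P2Rank", 39)]:
    lo, hi = wilson_ci(k, 42)
    print(f"{name}: {k / 42:.0%} (95% CI: {lo:.0%}-{hi:.0%})")

# Hypothetical discordant-pair counts (b = targets only DrugCLIP solved,
# c = targets only the comparator solved), consistent with the reported p-values.
print("DrugCLIP vs Fpocket+ESSA: p =", round(mcnemar_exact(3, 6), 3))  # 0.508
print("DrugCLIP vs P2Rank:       p =", round(mcnemar_exact(1, 10), 3))  # 0.012
```

Under these assumptions the script reproduces all three success rates, their 95% CIs, and both p-values exactly as quoted above.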