Contrastive learning–based models such as DrugCLIP have recently emerged as scalable tools for structure-based virtual screening by embedding protein structures and small molecules into a shared representation space. While these approaches demonstrate high throughput and competitive performance in ligand retrieval tasks, their ability to correctly identify biologically relevant ligand-binding pockets has not been systematically evaluated. Here, we construct a benchmarking dataset comprising 42 pharmacologically diverse human protein targets with experimentally validated drug-bound structures spanning multiple target families. Using this dataset, we evaluate the pocket recognition capability of DrugCLIP and compare its performance with a traditional structure-based workflow that integrates geometric pocket detection (Fpocket) with dynamics-informed pocket ranking (ESSA). DrugCLIP achieves perfect success rates for several well-studied target classes, including kinases (10/10), GPCRs (5/5), and nuclear receptors (5/5), but shows markedly reduced performance for ion channels (1/4) and transporters (2/5). Notably, pocket prediction accuracy does not correlate strongly with the availability of structural data, suggesting that intrinsic pocket characteristics rather than training data abundance primarily determine model performance. Across the benchmark, DrugCLIP does not outperform traditional pocket identification strategies (DrugCLIP vs. Fpocket+ESSA: 74% vs. 79%). Together, these results provide a quantitative evaluation of pocket recognition by contrastive learning–based models and highlight key limitations that should be considered when applying embedding-based approaches in prospective structure-based drug discovery.
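As a back-of-the-envelope consistency check, the per-family counts and overall rates reported above can be combined as follows. This is an illustrative sketch only: the five family counts and the 74%/79% aggregates come from the abstract, while the remaining target families are left implicit and no counts are assumed for them.

```python
# Per-family DrugCLIP success counts stated in the abstract: (hits, targets).
reported = {
    "kinases": (10, 10),
    "GPCRs": (5, 5),
    "nuclear receptors": (5, 5),
    "ion channels": (1, 4),
    "transporters": (2, 5),
}

listed_hits = sum(h for h, _ in reported.values())     # successes on listed families
listed_targets = sum(n for _, n in reported.values())  # listed targets

total_targets = 42       # full benchmark size
drugclip_rate = 0.74     # reported overall DrugCLIP success rate

# Implied successes on the remaining (unlisted) targets.
drugclip_total_hits = round(drugclip_rate * total_targets)
remaining_hits = drugclip_total_hits - listed_hits
remaining_targets = total_targets - listed_targets

print(listed_hits, listed_targets, drugclip_total_hits,
      remaining_hits, remaining_targets)
```

Under these numbers, the five listed families account for 23/29 successes, implying roughly 8 of the remaining 13 targets were predicted correctly to reach the reported 74% overall.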