Gait is a repetitive whole-body movement that encodes inter-segmental coordination and spatiotemporal patterns; beyond identity recognition, it has been used to infer attributes such as sex. Many vision-based approaches, however, rely on appearance cues, which are sensitive to occlusion and clothing variation and may raise privacy concerns; moreover, robustness under everyday perturbations remains insufficiently quantified. Here, we investigate skeleton-based gait sex classification using 2D pose sequences from the PsyMo dataset. We rendered the 17 COCO keypoints of each frame into 50×50 grayscale skeleton images and trained a 3D residual CNN on non-overlapping 15-frame clips. Evaluation used a subject-wise, stratified split with balanced sexes, and the same set of test subjects was shared across four aggregated conditions (A: overall; B: partial occlusion/carrying; C: speed changes; D: smartphone use). Accuracy ranged from 0.658 to 0.749 across conditions, with the lowest performance in B. Confusion-matrix error decomposition with subject-level bootstrap confidence intervals revealed a pronounced sex-wise error asymmetry in B and C, driven by reduced male recall and increased male-to-female misclassification. In D, a simple arm-swing amplitude index was not significantly associated with prediction confidence or misclassification. Grad-CAM quantification further suggested that joint-group importance shifts across conditions, indicating condition-dependent reliance on motion cues.
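As a concrete illustration of the preprocessing step summarized above, the sketch below rasterizes 17 COCO keypoints per frame into a 50×50 grayscale skeleton image and stacks frames into non-overlapping 15-frame clips suitable as input volumes for a 3D CNN. This is a minimal reconstruction under stated assumptions, not the released pipeline: the COCO limb connectivity, the per-frame bounding-box normalization, and all function names are illustrative.

```python
# Minimal sketch (assumptions noted): render 17 COCO keypoints per frame into
# a 50x50 grayscale skeleton image, then stack non-overlapping 15-frame clips.
import numpy as np

# Assumed standard COCO-17 limb connectivity.
COCO_EDGES = [
    (0, 1), (0, 2), (1, 3), (2, 4),           # head
    (5, 6), (5, 7), (7, 9), (6, 8), (8, 10),  # arms
    (5, 11), (6, 12), (11, 12),               # torso
    (11, 13), (13, 15), (12, 14), (14, 16),   # legs
]

def render_skeleton(kpts, size=50):
    """Rasterize one frame of 17 (x, y) keypoints into a size x size image."""
    img = np.zeros((size, size), dtype=np.float32)
    # Normalize keypoints into the image square (assumed normalization scheme).
    mins, maxs = kpts.min(axis=0), kpts.max(axis=0)
    scale = (maxs - mins).max() + 1e-6
    pts = (kpts - mins) / scale * (size - 1)
    for a, b in COCO_EDGES:
        # Draw each limb as a dense set of linearly interpolated points.
        for t in np.linspace(0.0, 1.0, num=size):
            x, y = (1 - t) * pts[a] + t * pts[b]
            img[int(round(y)), int(round(x))] = 1.0
    return img

def make_clips(sequence, clip_len=15):
    """Split a (T, 17, 2) pose sequence into non-overlapping clips,
    each rendered as a (clip_len, 50, 50) volume."""
    n = len(sequence) // clip_len
    return np.stack([
        np.stack([render_skeleton(f)
                  for f in sequence[i * clip_len:(i + 1) * clip_len]])
        for i in range(n)
    ])

# Example: a random 60-frame pose sequence yields four 15-frame clips.
poses = np.random.rand(60, 17, 2)
clips = make_clips(poses)   # shape: (4, 15, 50, 50)
print(clips.shape)
```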