Background: Psychological safety—the belief that one can speak up without fear of negative consequences—is fundamental to team learning and performance, yet controlled experimental research on it is constrained by practical and ethical limitations. This study validates large language model (LLM) agents as a methodological tool for simulating team psychological safety dynamics by comparing AI-simulated teams against human teams across identical experimental scenarios.

Methods: We conducted parallel experiments with 5,280 AI teams (26,400 agent interactions across 5 LLM architectures) and 249 human teams (1,245 participants; final analytic sample: 247 teams, 1,235 participants after quality screening) using a 2×2 factorial design manipulating leader inclusiveness (High/Low) and error management culture (Learning/Blaming). Teams completed realistic work scenarios while we measured psychological safety perceptions, learning behaviors, team performance, and the moderating effects of demographic diversity. A comprehensive validation framework assessed convergent validity (main effects, moderation patterns, mediation pathways), discriminant validity (falsification tests), and measurement properties.

Results: AI simulations demonstrated strong convergent validity for main effects: leader inclusiveness (AI: d = 2.21, 95% CI [2.13, 2.29]; Human: d = 1.58, 95% CI [1.42, 1.74]) and error management culture (AI: d = 1.39, 95% CI [1.32, 1.46]; Human: d = 0.97, 95% CI [0.82, 1.12]). AI effects were consistently larger than human effects across all relationship types (mean ratio = 1.40×), indicating a systematic calibration factor: the main-effect calibration ratio was 1.42× (95% CI [1.37×, 1.49×]), and the precision-weighted calibration across all 14 effect comparisons was 1.38× (95% CI [1.32×, 1.44×]). This systematic inflation requires effect size adjustment when extrapolating to human teams: multiply main effects by ≈0.70 and correlations by ≈0.88, with type-specific calibration factors reported for each relationship type. Mediation pathways showed parallel structure (AI: 77.7% mediated, 95% CI [73.2%, 82.2%]; Human: 90.7%, 95% CI [83.8%, 97.6%]); a bootstrap difference test indicated that the proportions do not differ significantly (p = .182) despite narrowly non-overlapping individual confidence intervals. Moderator convergence varied: demographic composition effects showed lower pattern correlations (r = .43, 95% CI [.09, .68]) than main effects (r = .97, 95% CI [.89, .99]). Eight falsification tests confirmed discriminant validity: AI teams showed theoretically appropriate null effects in control scenarios (8/8 tests supported predictions after theoretical refinement). Cross-model consistency was high (ICC = .79, 95% CI [.73, .84]), and calibration factors were stable across architectures (SD = 0.04), indicating systematic rather than model-specific inflation; GPT-4 and Claude-3.5 showed the closest absolute alignment to human effect magnitudes.

Conclusions: LLM-based simulations offer valid approximations of psychological safety dynamics for theory testing, with predictable calibration requirements (effect size multiplier ≈ 0.70). These tools enable hypothesis testing at scales, and with a degree of experimental control, infeasible with human participants, though current limitations in capturing complex moderator interactions and exact effect magnitudes warrant continued validation. This methodology substantially expands the experimental toolkit for team science research.
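To make the calibration arithmetic concrete, the following is a minimal sketch, not the paper's exact estimator: it assumes each comparison's calibration ratio is the AI-to-human effect-size ratio, that precision weights are inverse-variance weights on the log ratio (delta-method approximation), and that standard errors can be back-derived from the reported 95% CIs; only the two reported main effects are included, not all 14 comparisons.

```python
import numpy as np

# (ai_d, ai_se, human_d, human_se) per effect comparison.
# SEs are back-derived from the reported 95% CIs (half-width / 1.96);
# only the two reported main effects are shown, not all 14 comparisons.
effects = [
    (2.21, 0.041, 1.58, 0.082),  # leader inclusiveness
    (1.39, 0.036, 0.97, 0.077),  # error management culture
]

log_ratios, weights = [], []
for ai_d, ai_se, hu_d, hu_se in effects:
    ratio = ai_d / hu_d                                   # AI-to-human calibration ratio
    se_log = np.sqrt((ai_se / ai_d) ** 2 + (hu_se / hu_d) ** 2)  # delta method on log(ratio)
    log_ratios.append(np.log(ratio))
    weights.append(1.0 / se_log ** 2)                     # inverse-variance (precision) weight

# Precision-weighted calibration ratio and the implied human-scale multiplier.
r_bar = float(np.exp(np.average(log_ratios, weights=weights)))
print(f"calibration ratio ~ {r_bar:.2f}x; human-scale multiplier ~ {1 / r_bar:.2f}")

# Applying the multiplier to the AI leader-inclusiveness effect for comparison
# with the observed human d = 1.58.
adj = 2.21 / r_bar
print(f"adjusted leader-inclusiveness d ~ {adj:.2f} (observed human d = 1.58)")
```

With only these two comparisons the weighted ratio comes out near 1.4 and the multiplier near 0.7, consistent with the reported 1.38×–1.42× range and ≈0.70 adjustment; the values in the abstract aggregate all 14 effect comparisons.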