Accurate prediction of pedestrian intention and future trajectories is essential for traffic safety, urban planning, and autonomous navigation. This study develops a multimodal prediction model that combines semantic image-text features, motion trajectories, and social interactions. We extract visual-language information from RGB sequences with a CLIP-based encoder and model group behavior with a Social-GRU network. To improve the reliability of predictions, we apply Bayesian modeling to quantify uncertainty. We evaluate the method on the Waymo and ETH/UCY datasets. On the ETH dataset, the model reduces average displacement error by 14.2% and final displacement error by 17.6% compared with leading baseline methods. The model remains effective in crowded scenes, under degraded visual conditions, and during abrupt motion changes. These results confirm that combining visual-language and motion information improves prediction accuracy, offering a practical solution for real-world pedestrian analysis in intelligent transport systems.
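To make the described pipeline concrete, the following is a minimal PyTorch sketch of how frozen visual-language embeddings, a Social-GRU over neighbor trajectories, and an uncertainty-aware output head could be fused. It is not the authors' implementation: the module names, dimensions, max-pooling social aggregation, Gaussian output parameterization, and the Monte Carlo dropout stand-in for Bayesian inference are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SocialGRU(nn.Module):
    """Encodes the ego trajectory and pools GRU states of neighboring pedestrians."""
    def __init__(self, in_dim=2, hid=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid, batch_first=True)
        self.pool = nn.Sequential(nn.Linear(hid, hid), nn.ReLU())

    def forward(self, ego, neighbors):
        # ego: (B, T, 2) observed positions; neighbors: (B, N, T, 2)
        _, h_ego = self.gru(ego)                              # (1, B, hid)
        B, N, T, D = neighbors.shape
        _, h_nb = self.gru(neighbors.reshape(B * N, T, D))    # shared encoder
        h_nb = h_nb.reshape(B, N, -1).max(dim=1).values       # social max-pooling
        return h_ego.squeeze(0) + self.pool(h_nb)             # (B, hid)

class TrajectoryPredictor(nn.Module):
    """Fuses CLIP-style scene features with social motion features and
    predicts per-step means and log-variances of future positions."""
    def __init__(self, clip_dim=512, hid=64, horizon=12):
        super().__init__()
        self.social = SocialGRU(hid=hid)
        self.fuse = nn.Sequential(nn.Linear(clip_dim + hid, hid), nn.ReLU(),
                                  nn.Dropout(0.1))            # dropout reused for MC sampling
        self.head = nn.Linear(hid, horizon * 4)               # (mu_x, mu_y, logvar_x, logvar_y)
        self.horizon = horizon

    def forward(self, clip_feat, ego, neighbors):
        z = torch.cat([clip_feat, self.social(ego, neighbors)], dim=-1)
        out = self.head(self.fuse(z)).reshape(-1, self.horizon, 4)
        return out[..., :2], out[..., 2:]                     # mean, log-variance

# Usage with random tensors standing in for real data:
model = TrajectoryPredictor()
clip_feat = torch.randn(8, 512)       # e.g. output of a frozen CLIP image encoder
ego = torch.randn(8, 8, 2)            # 8 observed (x, y) steps per pedestrian
neighbors = torch.randn(8, 5, 8, 2)   # 5 neighbors per scene
mu, logvar = model(clip_feat, ego, neighbors)

# Monte Carlo dropout as a simple proxy for Bayesian predictive uncertainty:
model.train()                         # keep dropout active at inference
samples = torch.stack([model(clip_feat, ego, neighbors)[0] for _ in range(20)])
pred_mean, pred_std = samples.mean(0), samples.std(0)
```

In practice the random `clip_feat` tensor would be replaced by embeddings from a pretrained CLIP image (and optionally text) encoder, and the dropout-based sampling shown here is only one of several ways a Bayesian treatment of predictive uncertainty could be approximated.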