Preprint
Concept Paper

This version is not peer-reviewed.

SignFuse: A Proposed Dual-Stream Cross-Modal Framework for Gloss-Free Sign Language Translation with Large Language Models

Submitted:

14 April 2026

Posted:

15 April 2026


Abstract
Sign language translation (SLT) aims to convert sign language videos into spoken language text, serving as a critical bridge for communication between the Deaf and hearing communities. While recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in gloss-free SLT, existing methods typically rely on single-modality visual features, failing to fully exploit the complementary nature of appearance and structural cues inherent in sign language. In this architectural proposition paper, we introduce SignFuse, a novel dual-stream cross-modal fusion framework that synergistically combines CNN-based visual features with Graph Convolutional Network (GCN)-based skeletal features for gloss-free sign language translation. Our framework introduces three key innovations: (1) a Cross-Modal Fusion Attention (CMFA) module that performs bidirectional cross-attention between visual and skeletal modalities to produce enriched multimodal representations; (2) a Hierarchical Temporal Aggregation (HTA) mechanism that captures sign language dynamics at multiple temporal scales—frame-level, segment-level, and sequence-level; and (3) a Progressive Multi-Stage Training blueprint that systematically aligns visual-skeletal features with the LLM’s linguistic space through contrastive pre-training, feature alignment, and LoRA-based fine-tuning. We provide the complete mathematical formulation, detailed architectural specifications, and a fully implemented PyTorch codebase. As the computational barriers to training MLLMs remain high, we formalize the experimental methodology required to validate this framework on standard benchmarks (PHOENIX-14T, CSL-Daily, How2Sign) and extend an open invitation to the broader research community to conduct empirical validation and advance this architectural paradigm through collaboration. 
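The Cross-Modal Fusion Attention (CMFA) module described above can be illustrated with a minimal single-head sketch: each stream attends over the other, and the two context-enriched streams are concatenated. The toy dimensions, the residual connections, and the single-head formulation are illustrative assumptions, not the paper's full specification.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16  # frames and feature dim (toy sizes; real values are assumptions)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(query, context):
    """Single-head cross-attention: the query stream attends over the context stream."""
    scores = query @ context.T / np.sqrt(d)  # (T, T) cross-stream affinities
    return softmax(scores) @ context         # (T, d) context-enriched queries

visual = rng.normal(size=(T, d))    # stand-in for CNN appearance features per frame
skeletal = rng.normal(size=(T, d))  # stand-in for GCN skeletal features per frame

# Bidirectional cross-attention with residual connections, then channel-wise fusion.
v_enriched = visual + cross_attend(visual, skeletal)
s_enriched = skeletal + cross_attend(skeletal, visual)
fused = np.concatenate([v_enriched, s_enriched], axis=-1)  # (T, 2d)
print(fused.shape)  # (8, 32)
```

In a full PyTorch implementation the same bidirectional pattern would typically use multi-head attention with learned projections; the sketch only shows the information flow.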
This work is presented as a concept and architectural framework paper, aiming to establish a theoretical foundation and encourage future empirical validation by the research community.
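The Hierarchical Temporal Aggregation (HTA) mechanism can likewise be sketched as pooling the fused per-frame features at three scales and stacking the results into one token sequence for the LLM. The window size of 4 frames and the use of mean pooling are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 12, 16                     # frames and feature dim (toy sizes)
frames = rng.normal(size=(T, d))  # stand-in for fused per-frame features

# Frame level: keep per-frame features as-is.
frame_feats = frames                                           # (12, d)

# Segment level: mean-pool non-overlapping windows of 4 frames.
seg = 4
segment_feats = frames.reshape(T // seg, seg, d).mean(axis=1)  # (3, d)

# Sequence level: mean-pool the whole clip into a single vector.
sequence_feat = frames.mean(axis=0, keepdims=True)             # (1, d)

# Stack all three scales into one multi-scale token sequence.
multi_scale = np.concatenate([frame_feats, segment_feats, sequence_feat], axis=0)
print(multi_scale.shape)  # (16, 16): 12 frame + 3 segment + 1 sequence tokens
```

A learned aggregation (e.g. strided attention per scale) could replace the mean pooling; the sketch only fixes the three-scale structure named in the abstract.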
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

