Preprint
Concept Paper

This version is not peer-reviewed.

SignFuse: A Proposed Dual-Stream Cross-Modal Framework for Gloss-Free Sign Language Translation with Large Language Models

Submitted: 14 April 2026
Posted: 15 April 2026


Abstract
Sign language translation (SLT) aims to convert sign language videos into spoken language text, serving as a critical bridge for communication between the Deaf and hearing communities. While recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in gloss-free SLT, existing methods typically rely on single-modality visual features, failing to fully exploit the complementary nature of appearance and structural cues inherent in sign language. In this architectural proposition paper, we introduce SignFuse, a novel dual-stream cross-modal fusion framework that synergistically combines CNN-based visual features with Graph Convolutional Network (GCN)-based skeletal features for gloss-free sign language translation. Our framework introduces three key innovations: (1) a Cross-Modal Fusion Attention (CMFA) module that performs bidirectional cross-attention between visual and skeletal modalities to produce enriched multimodal representations; (2) a Hierarchical Temporal Aggregation (HTA) mechanism that captures sign language dynamics at multiple temporal scales—frame-level, segment-level, and sequence-level; and (3) a Progressive Multi-Stage Training blueprint that systematically aligns visual-skeletal features with the LLM’s linguistic space through contrastive pre-training, feature alignment, and LoRA-based fine-tuning. We provide the complete mathematical formulation, detailed architectural specifications, and a fully implemented PyTorch codebase. As the computational barriers to training MLLMs remain high, we formalize the experimental methodology required to validate this framework on standard benchmarks (PHOENIX-14T, CSL-Daily, How2Sign) and extend an open invitation to the broader research community to conduct empirical validation and advance this architectural paradigm through collaboration. 
This work is presented as a concept and architectural framework paper, aiming to establish a theoretical foundation and encourage future empirical validation by the research community.

1. Introduction

Sign language is the primary mode of communication for over 70 million Deaf individuals worldwide [1]. Unlike spoken languages, sign languages are visual-gestural systems that convey meaning through a rich combination of manual articulations (hand shapes, orientations, and movements), non-manual markers (facial expressions, mouthing, and eye gaze), and body posture [2]. Despite the linguistic complexity and cultural significance of sign languages, the communication barrier between sign language users and the hearing majority remains a significant societal challenge.
The rapid evolution of artificial intelligence from early symbolic systems to modern deep learning paradigms [3] has enabled transformative progress across numerous domains, including Sign Language Translation (SLT), which aims to automatically translate sign language videos into spoken language text. SLT is fundamentally more challenging than Sign Language Recognition (SLR), as it requires not only identifying individual signs but also understanding grammatical structures, handling word reordering between sign and spoken languages, and generating fluent natural language output [4].
Traditional SLT approaches rely on an intermediate gloss representation, a written annotation of individual signs, as a bridge between the visual and linguistic domains [2,4]. However, gloss annotations suffer from several critical limitations: (i) they are expensive and time-consuming to produce, requiring trained linguistic annotators; (ii) they are linguistically imprecise, as different sign languages have distinct grammatical structures from their spoken counterparts; and (iii) they fail to capture the full complexity of sign language, including non-manual features and simultaneous multi-channel information. These limitations have motivated a growing body of research on gloss-free SLT, which directly translates sign language videos into spoken language without intermediate glosses [5,6].
The recent emergence of Large Language Models (LLMs) [7,8] and, more broadly, Multimodal Large Language Models (MLLMs), as comprehensively surveyed in [9], has catalyzed significant progress in gloss-free SLT. Models such as SignLLM [10], Sign2GPT [11], and SpaMo [12] have demonstrated that the powerful language generation capabilities of LLMs can be effectively harnessed for sign language translation when provided with appropriate visual representations. More recently, MMSLT [13] leverages multimodal LLMs to generate textual descriptions of sign language components, while SCOPE [14] introduces context-aware SLT for dialogue scenarios.
Despite these advances, we identify two critical limitations in existing approaches that motivate our work:
Limitation 1: Underutilization of complementary modalities. Current LLM-based SLT methods predominantly rely on a single visual modality: either RGB frames processed through CNNs or Vision Transformers [11,12], or pose-based skeletal data [15]. Sign language, however, is inherently multimodal, encoding information through multiple simultaneous channels. Appearance-based features from RGB frames capture hand shapes, finger configurations, and facial expressions, while skeletal features capture joint positions, movement trajectories, and limb articulations. These two modalities provide complementary information that is essential for accurate translation. No existing LLM-based SLT approach effectively fuses these complementary modalities at a deep feature level through learned cross-modal interactions.
Limitation 2: Insufficient temporal modeling. Sign language operates at multiple temporal granularities: individual hand configurations are captured within single frames, signs typically span 5–30 frames, and sentences require understanding long-range dependencies across hundreds of frames. Existing approaches typically use a single temporal resolution through fixed-stride feature sampling or simple temporal attention, which inadequately captures the hierarchical temporal dynamics of sign language. This limitation is particularly acute for translating compound sentences, temporal modifiers, and discourse-level structures.
To address these limitations, we propose SignFuse (Sign Language Fused Translation), a novel theoretical and architectural framework that introduces three key innovations:
  • Dual-Stream Feature Extraction: We employ a CNN-based visual stream (ResNet-50) and a GCN-based skeletal stream (ST-GCN) to extract complementary features. The visual stream captures appearance information (e.g., hand shapes, facial expressions), while the skeletal stream captures structural and kinematic information (e.g., joint angles, movement trajectories).
  • Cross-Modal Fusion Attention (CMFA): We introduce a novel bidirectional cross-attention module that enables each modality to attend to the other, producing enriched multimodal representations that leverage the strengths of both visual and skeletal features through learned gated aggregation.
  • Hierarchical Temporal Aggregation (HTA): We design a multi-scale temporal modeling module that captures sign language dynamics at frame, segment, and sequence levels, enabling the model to understand both fine-grained gestures and long-range contextual dependencies through adaptive temporal fusion.
Furthermore, we propose a progressive multi-stage training strategy that systematically: (i) pre-aligns visual and skeletal features via contrastive learning, (ii) aligns fused multimodal features to the LLM’s input space, and (iii) performs end-to-end fine-tuning with LoRA [16] for efficient adaptation. An overview of the proposed framework is shown in Figure 1.
Because training end-to-end MLLMs on massive video datasets requires substantial GPU compute clusters (typically 4–8× A100 GPUs for days of continuous training), empirical validation remains a barrier for many research groups. Therefore, this paper is presented as a formal architectural blueprint and research hypothesis. We provide the complete mathematical formulation, detailed architectural specifications, and a fully implemented PyTorch codebase, and we propose a rigorous experimental protocol for future evaluation. We open-source our theoretical framework to the community, inviting institutions with the necessary computational resources to execute the training phases and collaborate with us on validating this paradigm.
Contributions. Our contributions are summarized as follows:
  • We propose SignFuse, a novel dual-stream architecture that, for the first time, integrates CNN-based visual features and GCN-based skeletal features through deep bidirectional cross-modal attention for LLM-based sign language translation.
  • We introduce the Cross-Modal Fusion Attention (CMFA) module with gated aggregation and the Hierarchical Temporal Aggregation (HTA) mechanism, both of which are formally specified with complete mathematical formulations.
  • We design a three-stage progressive training strategy that systematically bridges the gap between multimodal sign language representations and the linguistic space of LLMs.
  • We provide a complete, runnable PyTorch implementation and formalize a comprehensive experimental protocol on three standard benchmarks, extending an open call for community-driven empirical validation.

3. Proposed Methodology: SignFuse

3.1. Problem Formulation

Given a sign language video $V = \{v_1, v_2, \ldots, v_T\}$ consisting of $T$ RGB frames, where each frame $v_t \in \mathbb{R}^{3 \times H \times W}$, and a corresponding skeleton sequence $S = \{s_1, s_2, \ldots, s_T\}$, where each skeleton $s_t \in \mathbb{R}^{V \times C}$ contains $V$ joint coordinates in $C$ dimensions, our goal is to generate a spoken language sentence $Y = \{y_1, y_2, \ldots, y_N\}$ with $N$ tokens. This can be formulated as maximizing the conditional probability:
$$Y^* = \arg\max_{Y} P(Y \mid V, S; \Theta)$$
where Θ denotes all learnable parameters. As illustrated in Figure 1, the SignFuse framework decomposes this problem into five components: (1) a Visual Feature Stream, (2) a Skeletal Feature Stream, (3) a Cross-Modal Fusion Attention (CMFA) module, (4) a Hierarchical Temporal Aggregation (HTA) module, and (5) a Sign-Language Adapter (SLA) connected to a frozen LLM.

3.2. Visual Feature Stream

The visual stream processes raw RGB frames to capture appearance-level features including hand shapes, finger configurations, facial expressions, and spatial layouts. We propose utilizing a pre-trained ResNet-50 [39] as the backbone, chosen for its strong feature extraction capabilities, widespread availability, and computational efficiency.
For each frame v t , we extract a spatial feature vector from the global average pooling layer:
$$F_v^t = \mathrm{GAP}(\mathrm{ResNet50}(v_t)) \in \mathbb{R}^{d_v}$$
where $d_v = 2048$ is the output dimension of ResNet-50's final convolutional block. The complete visual feature sequence is $F_v = \{F_v^1, F_v^2, \ldots, F_v^T\} \in \mathbb{R}^{T \times d_v}$.
To reduce computational cost while preserving temporal information, we apply a 1D temporal convolution with batch normalization and activation:
$$\hat{F}_v = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv1D}_{k=3,\,s=2}(F_v))) \in \mathbb{R}^{T' \times d}$$
where $T' = T/2$ is the temporally downsampled length, $d = 512$ is the unified feature dimension, and $s = 2$ denotes the stride. This projection simultaneously reduces the temporal resolution and aligns the feature dimension with the skeletal stream for downstream fusion.

3.3. Skeletal Feature Stream

The skeletal stream captures structural and kinematic information about body joint configurations and their temporal dynamics. Unlike RGB features that encode appearance, skeletal features explicitly represent the spatial topology of the signer’s body, providing invariance to visual nuisances such as clothing, lighting, and background clutter.
Pose Extraction. We propose using MediaPipe Holistic [26] to extract 2D landmarks from each frame, yielding 21 hand landmarks per hand (42 total), 33 body pose landmarks, and optionally 468 face mesh landmarks. For computational efficiency, and to focus on the most translation-relevant signals, we select $V = 75$ key joints: both hands (42), upper-body pose (23), and key facial landmarks (10). Each joint is represented by its normalized $(x, y)$ coordinates, giving $s_t \in \mathbb{R}^{75 \times 2}$.
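The joint selection and normalization step can be sketched as follows. The specific pose and face index subsets below are illustrative placeholders (the paper does not enumerate its exact selection); the landmark arrays are assumed to come from a detector such as MediaPipe Holistic.

```python
import numpy as np

def build_skeleton_frame(left_hand, right_hand, pose, face, frame_wh):
    """Assemble the V=75 joint set of Section 3.3 from per-frame landmark
    arrays in pixel coordinates. Shapes: left_hand/right_hand (21, 2),
    pose (33, 2), face (468, 2). Index subsets are hypothetical."""
    upper_body_idx = np.arange(23)                   # placeholder: first 23 pose joints
    face_idx = np.linspace(0, 467, 10).astype(int)   # placeholder: 10 spread-out face points
    joints = np.concatenate([
        left_hand, right_hand,    # 42 hand joints
        pose[upper_body_idx],     # 23 upper-body joints
        face[face_idx],           # 10 facial joints
    ], axis=0)                    # (75, 2)
    w, h = frame_wh
    return joints / np.array([w, h], dtype=np.float32)  # normalize to [0, 1]
```

Stacking the per-frame outputs over time yields the skeleton sequence $S \in \mathbb{R}^{T \times 75 \times 2}$ consumed by the ST-GCN encoder below.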
Graph Construction. We construct a spatial-temporal skeleton graph G = ( V , E s E t ) where V represents the set of body joints, E s represents spatial edges following the natural skeleton connectivity (bone connections), and E t represents temporal edges connecting the same joint across consecutive frames.
ST-GCN Encoder. We employ a multi-layer Spatial-Temporal Graph Convolutional Network [25] to process the skeleton sequences. Each ST-GCN layer performs spatial graph convolution followed by temporal convolution:
$$F_s^{(l+1)} = \sigma\left(\sum_{k=1}^{K} \tilde{D}_k^{-\frac{1}{2}} \tilde{A}_k \tilde{D}_k^{-\frac{1}{2}} F_s^{(l)} W_k^{(l)}\right)$$
where $\tilde{A}_k = A_k + I$ is the $k$-th normalized adjacency matrix partition with self-connections, $\tilde{D}_k$ is the corresponding degree matrix, $W_k^{(l)} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ are learnable parameters, and $\sigma$ denotes the ReLU activation function. We use $K = 3$ spatial partitions (self, inward, and outward neighbors).
We stack $L = 4$ ST-GCN blocks whose channels grow from the 2-channel coordinate input through $[64, 128, 256, 512]$, with a temporal stride of 2 in the third block to align the temporal resolution with the visual stream. Joint-level features are aggregated via global average pooling over the joint dimension:
$$\hat{F}_s = \mathrm{AvgPool}_{\text{joints}}(\mathrm{STGCN}(S)) \in \mathbb{R}^{T' \times d}$$
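One ST-GCN layer, corresponding to the graph convolution above, can be sketched as follows. The sketch assumes the partition adjacencies are supplied already normalized ($\tilde{D}_k^{-1/2} \tilde{A}_k \tilde{D}_k^{-1/2}$); the temporal kernel size of 9 is a common ST-GCN default, not a value fixed by the paper.

```python
import torch
import torch.nn as nn

class STGCNLayer(nn.Module):
    """One spatial graph convolution over K adjacency partitions followed
    by a temporal convolution (Section 3.3). Sketch with pre-normalized
    adjacencies A_norm of shape (K, V, V)."""
    def __init__(self, c_in: int, c_out: int, A_norm: torch.Tensor, t_stride: int = 1):
        super().__init__()
        self.register_buffer("A", A_norm)
        K = A_norm.shape[0]
        # 1x1 conv produces K channel groups, one per partition.
        self.spatial = nn.Conv2d(c_in, c_out * K, kernel_size=1)
        self.temporal = nn.Conv2d(c_out, c_out, kernel_size=(9, 1),
                                  stride=(t_stride, 1), padding=(4, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, T, V)
        B, _, T, V = x.shape
        K = self.A.shape[0]
        y = self.spatial(x).view(B, K, -1, T, V)        # (B, K, C_out, T, V)
        # Aggregate over joints with each partition's adjacency, sum over k.
        y = torch.einsum("bkctv,kvw->bctw", y, self.A)
        return self.relu(self.temporal(y))
```

Stacking four such layers with the channel schedule above, and `t_stride=2` in the third, reproduces the stream's stated output resolution.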

3.4. Cross-Modal Fusion Attention (CMFA)

The core theoretical contribution of SignFuse is the CMFA module (Figure 2), designed to perform deep bidirectional cross-modal fusion between visual and skeletal features. Unlike simple concatenation or element-wise addition used in baseline models, CMFA enables each modality to selectively attend to complementary information in the other modality, producing enriched representations that capture cross-modal correlations.
Multi-Head Cross-Attention. For each temporal position, we compute queries, keys, and values from both modalities using separate linear projections:
$$Q_v = \hat{F}_v W_Q^v, \quad K_v = \hat{F}_v W_K^v, \quad V_v = \hat{F}_v W_V^v$$
$$Q_s = \hat{F}_s W_Q^s, \quad K_s = \hat{F}_s W_K^s, \quad V_s = \hat{F}_s W_V^s$$
where $W_Q^{(\cdot)}, W_K^{(\cdot)}, W_V^{(\cdot)} \in \mathbb{R}^{d \times d}$ are learnable projection matrices. The bidirectional cross-attention is computed as:
$$O_{v \to s} = \mathrm{softmax}\!\left(\frac{Q_v K_s^\top}{\sqrt{d_k}}\right) V_s$$
$$O_{s \to v} = \mathrm{softmax}\!\left(\frac{Q_s K_v^\top}{\sqrt{d_k}}\right) V_v$$
where $d_k = d/h$ is the per-head dimension and $h = 8$ is the number of attention heads. $O_{v \to s}$ represents visual features enriched by attending to skeletal information ("what does the skeleton tell me about what I'm seeing?"), while $O_{s \to v}$ represents skeletal features enriched by visual context ("what does the visual appearance tell me about the joint configuration?").
Gated Aggregation. Rather than simply concatenating or averaging the two cross-attended outputs, we employ a learned gating mechanism that dynamically determines the optimal mixing ratio for each temporal position:
$$\alpha = \sigma(W_g [O_{v \to s}; O_{s \to v}] + b_g)$$
$$F_{\text{fused}} = \mathrm{LN}(\alpha \odot O_{v \to s} + (1 - \alpha) \odot O_{s \to v}) + F_{\text{res}}$$
where $\sigma$ is the sigmoid function, $W_g \in \mathbb{R}^{2d \times d}$ is the gate projection, $\mathrm{LN}$ denotes Layer Normalization, and $F_{\text{res}} = (\hat{F}_v + \hat{F}_s)/2$ is a residual connection ensuring gradient flow and information preservation.
Feed-Forward Network. Following standard Transformer practice, we apply a position-wise feed-forward network after the fusion:
$$F'_{\text{fused}} = \mathrm{LN}(F_{\text{fused}} + \mathrm{FFN}(F_{\text{fused}}))$$
where $\mathrm{FFN}(x) = W_2 \cdot \mathrm{GELU}(W_1 x + b_1) + b_2$ with inner dimension $4d = 2048$.
We stack L = 3 CMFA layers to enable progressively deeper cross-modal interactions. After the first layer, both inputs are derived from the fused representation, allowing the model to iteratively refine the cross-modal alignment.
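A single CMFA layer can be sketched directly from the formulation above using `nn.MultiheadAttention`; all names are illustrative, with the layer count, head count, and FFN width following the paper's stated defaults.

```python
import torch
import torch.nn as nn

class CMFA(nn.Module):
    """One Cross-Modal Fusion Attention layer (Section 3.4): bidirectional
    cross-attention, gated aggregation, residual, and FFN. Sketch only."""
    def __init__(self, d: int = 512, heads: int = 8):
        super().__init__()
        self.v2s = nn.MultiheadAttention(d, heads, batch_first=True)
        self.s2v = nn.MultiheadAttention(d, heads, batch_first=True)
        self.gate = nn.Linear(2 * d, d)
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, Fv: torch.Tensor, Fs: torch.Tensor) -> torch.Tensor:
        # Fv, Fs: (B, T', d) temporally aligned visual / skeletal features
        O_vs, _ = self.v2s(Fv, Fs, Fs)   # visual queries attend to skeleton
        O_sv, _ = self.s2v(Fs, Fv, Fv)   # skeletal queries attend to visuals
        # Per-position gate chooses the mixing ratio between the two streams.
        alpha = torch.sigmoid(self.gate(torch.cat([O_vs, O_sv], dim=-1)))
        fused = self.ln1(alpha * O_vs + (1 - alpha) * O_sv) + (Fv + Fs) / 2
        return self.ln2(fused + self.ffn(fused))
```

Per the stacking scheme above, after the first layer both inputs would be fed the fused output, iteratively refining the cross-modal alignment.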

3.5. Hierarchical Temporal Aggregation (HTA)

Sign language conveys meaning at multiple temporal granularities, which necessitates multi-scale temporal modeling. We identify three distinct temporal scales:
  • Frame-level (fine-grained): Individual hand configurations, finger spelling, and instantaneous facial expressions occur at the frame level (about 33 ms per frame at 30 fps). These are critical for distinguishing between signs with similar global motions but different hand shapes.
  • Segment-level (mid-grained): Individual sign units typically span 5–30 frames (150–1000ms). Capturing segment-level dynamics enables recognition of transitional movements, sign boundaries, and phonological components.
  • Sequence-level (coarse-grained): Sentence-level structures, discourse patterns, and long-range dependencies extend across hundreds of frames. Understanding these is essential for generating grammatically correct, contextually appropriate translations.
Given the fused features $F_{\text{fused}} \in \mathbb{R}^{T' \times d}$, we apply depthwise separable temporal convolutions at three scales:
$$H_{\text{frame}} = \mathrm{DWConv1D}_{k=1}(F_{\text{fused}}) \in \mathbb{R}^{T' \times d}$$
$$H_{\text{seg}} = \mathrm{DWConv1D}_{k=5,\,s=2}(F_{\text{fused}}) \in \mathbb{R}^{T'/2 \times d}$$
$$H_{\text{seq}} = \mathrm{DWConv1D}_{k=11,\,s=4}(F_{\text{fused}}) \in \mathbb{R}^{T'/4 \times d}$$
We employ adaptive average pooling to align all scales to a uniform temporal length $T'' = 64$, then concatenate along the feature dimension:
$$H_{\text{multi}} = [\mathrm{Pool}_{T''}(H_{\text{frame}}); \mathrm{Pool}_{T''}(H_{\text{seg}}); \mathrm{Pool}_{T''}(H_{\text{seq}})]$$
A linear projection $H_{\text{combined}} = H_{\text{multi}} W_c$, with $W_c \in \mathbb{R}^{3d \times d}$, merges the multi-scale features, followed by multi-head temporal self-attention for scale integration:
$$H_{\text{final}} = \mathrm{LN}(H_{\text{combined}} + \mathrm{MHSA}(H_{\text{combined}})) \in \mathbb{R}^{T'' \times d}$$
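The HTA module can be sketched as three depthwise separable convolutions pooled to a common length and merged; names are illustrative, with the kernel sizes, strides, and target length following the paper's defaults.

```python
import torch
import torch.nn as nn

def dwsep_conv(d: int, k: int, s: int) -> nn.Module:
    # Depthwise separable 1D conv: depthwise (groups=d) then pointwise 1x1.
    return nn.Sequential(
        nn.Conv1d(d, d, kernel_size=k, stride=s, padding=k // 2, groups=d),
        nn.Conv1d(d, d, kernel_size=1),
    )

class HTA(nn.Module):
    """Frame/segment/sequence-scale convolutions pooled to a uniform length,
    concatenated, projected, and refined with self-attention (Section 3.5)."""
    def __init__(self, d: int = 512, T_out: int = 64, heads: int = 8):
        super().__init__()
        self.frame = dwsep_conv(d, 1, 1)
        self.seg = dwsep_conv(d, 5, 2)
        self.seq = dwsep_conv(d, 11, 4)
        self.pool = nn.AdaptiveAvgPool1d(T_out)
        self.proj = nn.Linear(3 * d, d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T', d) fused features -> (B, d, T') for Conv1d
        x = x.transpose(1, 2)
        multi = torch.cat([self.pool(self.frame(x)),
                           self.pool(self.seg(x)),
                           self.pool(self.seq(x))], dim=1)   # (B, 3d, T_out)
        h = self.proj(multi.transpose(1, 2))                 # (B, T_out, d)
        return self.ln(h + self.attn(h, h, h)[0])
```

Adaptive pooling makes the module indifferent to the input video length, so variable-duration clips all yield 64 temporal positions for the adapter stage.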

3.6. Sign-Language Adapter and LLM Integration

The Sign-Language Adapter (SLA) maps the fused multimodal features into the LLM’s input embedding space. It consists of a temporal compression stage followed by a two-layer MLP projection:
$$Z = \mathrm{AdaptivePool}_{N_s}(H_{\text{final}}) \in \mathbb{R}^{N_s \times d}$$
$$E_{\text{sign}} = W_2 \cdot \mathrm{GELU}(W_1 Z + b_1) + b_2 \in \mathbb{R}^{N_s \times d_{\text{LLM}}}$$
where $N_s = 32$ is the number of sign tokens and $d_{\text{LLM}} = 4096$ for LLaMA-2 7B. Layer normalization is applied after the final projection to stabilize the input distribution.
LLM Integration. We propose utilizing LLaMA-2 7B [7] as the language model backbone with LoRA [16] for parameter-efficient adaptation. LoRA introduces low-rank matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$ into the attention layers, where $r \ll d$ is the rank (we use $r = 16$, $\alpha = 32$). The sign embeddings are prepended to a text prompt and fed to the LLM:
$$P(Y \mid V, S) = \prod_{n=1}^{N} P(y_n \mid E_{\text{sign}}, e_{\text{prompt}}, y_{<n}; \Theta)$$
where $e_{\text{prompt}}$ encodes the instruction "Translate the following sign language video into a spoken language sentence:" and $\Theta$ includes both the frozen LLM parameters and the trainable LoRA parameters.
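The adapter itself is a small module; a minimal sketch under the dimensions above follows. In the full pipeline the resulting sign embeddings would be concatenated with the prompt's token embeddings before the frozen LLM, with LoRA applied via a library such as `peft` (not shown here). Names are illustrative.

```python
import torch
import torch.nn as nn

class SignLanguageAdapter(nn.Module):
    """Compress HTA output to N_s sign tokens and project to the LLM
    embedding width (Section 3.6). d_llm=4096 matches LLaMA-2 7B."""
    def __init__(self, d: int = 512, d_llm: int = 4096, n_sign: int = 32):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(n_sign)
        self.mlp = nn.Sequential(
            nn.Linear(d, d_llm), nn.GELU(),
            nn.Linear(d_llm, d_llm), nn.LayerNorm(d_llm),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T'', d)  ->  sign embeddings (B, N_s, d_LLM)
        z = self.pool(h.transpose(1, 2)).transpose(1, 2)
        return self.mlp(z)
```

Because the adapter is the only bridge into the LLM's input space, its output distribution is what Stage 2 of the training strategy explicitly regularizes.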

3.7. Progressive Multi-Stage Training

We propose a three-stage progressive training strategy (Figure 3) to systematically bridge the gap between multimodal sign language features and the linguistic space of the LLM. Each stage addresses a specific modality gap.
Stage 1: Visual-Skeletal Contrastive Pre-training. The first stage trains the dual-stream encoders and CMFA to produce aligned visual-skeletal representations. We employ an InfoNCE contrastive loss [15]:
$$\mathcal{L}_{\text{contra}} = -\frac{1}{2B} \sum_{i=1}^{B} \left[ \log \frac{e^{\mathrm{sim}(f_v^i, f_s^i)/\tau}}{\sum_j e^{\mathrm{sim}(f_v^i, f_s^j)/\tau}} + \log \frac{e^{\mathrm{sim}(f_s^i, f_v^i)/\tau}}{\sum_j e^{\mathrm{sim}(f_s^i, f_v^j)/\tau}} \right]$$
where $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity, $\tau = 0.07$ is the temperature parameter, $B$ is the batch size, and $f_v^i, f_s^i$ are pooled visual and skeletal features. The LLM remains frozen in this stage. Trainable: CNN backbone, ST-GCN, CMFA.
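The symmetric InfoNCE objective above reduces to a pair of cross-entropy losses over a similarity matrix; a compact sketch:

```python
import torch
import torch.nn.functional as F

def info_nce(f_v: torch.Tensor, f_s: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between pooled visual and skeletal features
    (Stage 1, Section 3.7). f_v, f_s: (B, d); row i of each is a positive pair."""
    f_v = F.normalize(f_v, dim=-1)
    f_s = F.normalize(f_s, dim=-1)
    logits = f_v @ f_s.t() / tau            # (B, B) cosine similarities / tau
    targets = torch.arange(f_v.shape[0], device=f_v.device)
    # Matched pairs sit on the diagonal; penalize both attention directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Identical (perfectly aligned) feature batches drive this loss toward zero, while mismatched pairings increase it, which is exactly the pressure that aligns the two encoders before LLM training begins.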
Stage 2: Multimodal-Text Alignment. With frozen feature encoders, this stage trains the CMFA, HTA, and SLA modules to map fused sign features into the LLM’s embedding space:
$$\mathcal{L}_{\text{align}} = \| E_{\text{sign}} - \mathrm{sg}(E_{\text{text}}) \|_2^2 + \mathcal{L}_{\text{LM}}$$
where sg ( · ) denotes stop-gradient, E text are the LLM’s text embeddings for the target sentence, and L LM is the standard autoregressive language modeling loss. Trainable: CMFA, HTA, SLA. Frozen: CNN, ST-GCN, LLM.
Stage 3: End-to-End Fine-tuning. Finally, we unfreeze all trainable components and apply LoRA to the LLM:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda_1 \mathcal{L}_{\text{contra}} + \lambda_2 \mathcal{L}_{\text{align}}$$
where λ 1 = 0.1 and λ 2 = 0.05 are balancing coefficients determined by preliminary analysis. Trainable: All components + LoRA adapters. Frozen: LLM backbone weights.
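The stage-wise freezing schedule can be expressed as a small helper. This is a hypothetical sketch: it assumes the top-level model exposes child modules under the names used below (`visual_stream`, `skeletal_stream`, `cmfa`, `hta`, `sla`, `lora`), which are illustrative rather than taken from the released codebase.

```python
def configure_stage(model, stage: int) -> None:
    """Toggle trainable submodules per the three-stage schedule of
    Section 3.7. Any child not listed for the stage is frozen
    (in particular, the LLM backbone is never listed)."""
    trainable = {
        1: ["visual_stream", "skeletal_stream", "cmfa"],
        2: ["cmfa", "hta", "sla"],
        3: ["visual_stream", "skeletal_stream", "cmfa", "hta", "sla", "lora"],
    }[stage]
    for name, module in model.named_children():
        requires = name in trainable
        for p in module.parameters():
            p.requires_grad = requires
```

Calling `configure_stage(model, 2)` before Stage 2, for example, freezes both feature encoders while leaving the CMFA, HTA, and SLA modules trainable, matching the stage description above.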

4. Implementation Specifications

To facilitate reproducibility and community adoption, we provide complete implementation specifications.

4.1. Model Configuration

Table 1 summarizes the proposed model configuration. The total trainable parameter count (excluding the frozen LLM backbone) is approximately 65.9M parameters, making SignFuse comparable in overhead to existing methods like SpaMo and MMSLT.

4.2. Proposed Training Hyperparameters

Table 2 details the proposed training configuration for each stage. All stages use AdamW optimizer with weight decay 0.01 and cosine annealing with linear warmup.

4.3. Software Availability

We provide a complete PyTorch implementation consisting of:
  • model.py: Full architecture implementation including all five proposed components (VisualStream, SkeletalStream, CMFA, HTA, SLA), loss functions (InfoNCE, Alignment, Combined), and a model builder utility. The implementation has been validated for correct tensor shapes through a comprehensive forward-pass test.
  • train.py: Complete three-stage training pipeline with stage-specific parameter freezing, optimizer configuration, learning rate scheduling, checkpoint management, and synthetic data fallback for pipeline testing.
  • evaluate.py: Evaluation utilities implementing BLEU-1/2/3/4, ROUGE-L, and METEOR metrics from scratch, along with visualization tools for generating comparison charts.

5. Proposed Experimental Protocol

In this section, we formalize the experimental methodology that we invite the community to execute.

5.1. Target Datasets

We propose evaluating SignFuse on three widely-used SLT benchmarks that span different sign languages, vocabulary sizes, and difficulty levels:
PHOENIX-14T [2]: A German Sign Language (DGS) dataset containing 8,257 aligned video-text pairs from weather broadcasts. The dataset is split into 7,096 training, 519 development, and 642 test samples. Videos feature a single signer with a clean background, making it the most controlled benchmark.
CSL-Daily [40]: A large-scale Chinese Sign Language dataset with 20,654 video-sentence pairs covering daily life topics. The dataset features 10 signers and more diverse content than PHOENIX-14T, providing a stronger test of generalization.
How2Sign [41]: A large-scale American Sign Language (ASL) dataset with over 35,000 video clips covering instructional content. Due to its diverse topics, multiple signers, and varying backgrounds, it represents the most challenging benchmark.

5.2. Evaluation Metrics

Following standard evaluation protocols for SLT, we recommend reporting:
  • BLEU-n ( n { 1 , 2 , 3 , 4 } ) [42]: Measures n-gram precision between hypothesis and reference translations with a brevity penalty. BLEU-4 is the primary metric for comparison.
  • ROUGE-L [43]: Measures the longest common subsequence between hypothesis and reference, capturing sentence-level fluency.
  • METEOR [44]: Combines precision, recall, and alignment with synonym matching for more comprehensive evaluation.
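For concreteness, a minimal single-sentence BLEU with clipped n-gram precision and brevity penalty can be sketched as follows. This is an illustrative simplification (sentence-level, smoothed to avoid log 0); benchmark reporting should use an established implementation such as sacrebleu.

```python
import math
from collections import Counter

def bleu(hyp: list, ref: list, max_n: int = 4) -> float:
    """Toy sentence-level BLEU-4 over tokenized hyp/ref lists:
    clipped n-gram precisions, geometric mean, brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((h & r).values())        # clipped matches
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smoothing
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical hypothesis and reference score 1.0; a short, unrelated hypothesis is driven toward 0 by both the precision terms and the brevity penalty.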

5.3. Baseline Methods for Comparison

We recommend comparing against the following state-of-the-art methods, whose results are publicly available:
Table 3. Existing SOTA results on PHOENIX-14T (gloss-free, test set) to compare against. B@4 = BLEU-4, R-L = ROUGE-L.
Method B@4 R-L
GFSLT-VLP [6] (ICCV 2023) 21.44 43.78
Sign2GPT [11] (2024) 22.52 44.83
SignLLM [10] (CVPR 2024) 23.51 46.35
SpaMo [12] (NAACL 2025) 24.32 47.56
MMSLT [13] (ICCV 2025) 25.18 48.34

5.4. Proposed Ablation Studies

To validate the contribution of each component, we recommend the following ablation experiments on the PHOENIX-14T development set:
  • Component ablation: Progressively add components (Visual-only → +Skeletal → +CMFA → +HTA → +Progressive Training) to measure incremental gains.
  • Fusion strategy comparison: Compare CMFA against concatenation, element-wise addition, unidirectional cross-attention (V→S and S→V), and bidirectional without gating.
  • Temporal scale analysis: Evaluate different combinations of HTA temporal scales to identify the most impactful granularities.
  • Training stage analysis: Measure the benefit of each progressive training stage compared to direct end-to-end training.
  • LLM backbone sensitivity: Test with different LLM backbones (Flan-T5, Vicuna, LLaMA-2 7B/13B) to assess the framework’s generalizability.

5.5. Expected Theoretical Advantages

Based on our architectural analysis, we hypothesize that SignFuse will demonstrate significant improvements over existing methods for the following reasons:
Complementary modality fusion. RGB features and skeletal features provide complementary information channels. When motion blur degrades RGB features (a well-documented limitation [30]), skeletal landmarks remain stable, and vice versa: in conditions of partial occlusion where pose estimation may fail, RGB features can still capture hand shapes. The CMFA module's bidirectional attention enables the model to dynamically leverage whichever modality is more reliable for each temporal position.
Multi-scale temporal understanding. The HTA module's hierarchical design aligns with the linguistic structure of sign language. Frame-level features should improve fingerspelling accuracy, segment-level features should capture individual sign boundaries, and sequence-level features should improve the translation of compound sentences and temporal modifiers, a common failure mode in existing systems [12].
Progressive training stability. By decomposing the training into three stages, each addressing a specific modality gap, we expect improved optimization stability and final performance compared to direct end-to-end training, which must simultaneously learn cross-modal alignment and language generation.

6. Discussion

6.1. Theoretical Analysis of CMFA

The CMFA module can be understood as computing a soft correspondence between visual and skeletal features at each temporal position. The attention weights softmax ( Q v K s / d k ) represent the degree to which each visual feature should attend to each skeletal feature, effectively creating a learned visual-skeletal alignment. This is particularly important for sign language because the same hand shape may appear in different spatial positions (captured by skeleton) with different orientations (captured by appearance), and vice versa.
The gating mechanism α provides an additional degree of flexibility, allowing the model to emphasize visual features when appearance information is more discriminative (e.g., for distinguishing similar hand shapes) and skeletal features when structural information is more informative (e.g., for encoding spatial relationships between both hands). We hypothesize that the learned gate values will exhibit interpretable patterns correlating with sign linguistic categories.

6.2. Computational Cost Analysis

Table 4 provides a theoretical comparison of computational costs. While SignFuse introduces additional parameters from the skeletal stream and fusion modules, the overhead is modest relative to the LLM backbone.

6.3. Limitations and Future Directions

We acknowledge the following limitations of the current work:
  • Pending empirical validation. The most significant limitation is the absence of empirical results. While the architecture is fully specified and the codebase is complete, the claims remain theoretical until validated on benchmark datasets with real training.
  • Pose estimation reliability. Our reliance on MediaPipe for skeleton extraction means that the skeletal stream’s effectiveness is bounded by MediaPipe’s accuracy, which can degrade under poor lighting, occlusion, or non-frontal camera angles.
  • Fixed temporal scales. The HTA module uses fixed kernel sizes { 1 , 5 , 11 } , which may not be optimal for all signing speeds and styles. Adaptive kernel selection could improve robustness.
  • Single sign language. While we propose evaluation on multiple datasets, each represents a different sign language. Cross-lingual transfer between sign languages remains an open challenge.
Future work should explore: (i) adaptive pose estimation with learned error correction; (ii) dynamic temporal scaling based on signing speed; (iii) explicit incorporation of non-manual features (mouthing, eye gaze) as a third modality; and (iv) multi-lingual sign language translation leveraging shared skeletal representations across sign languages.

7. Open Call to the Research Community

The computational hardware required to validate high-parameter multimodal LLMs, often requiring continuous access to A100/H100 GPU clusters over several days, remains a severe bottleneck for many research groups. This resource asymmetry creates a gap between architectural innovation and empirical validation that we aim to bridge through open collaboration.
This paper serves as both a theoretical hypothesis and an open-source collaboration request. We have constructed the complete mathematical framework, detailed architectural specifications, and a fully functional PyTorch codebase for SignFuse. We cordially invite research laboratories and academic partners who possess the requisite computational resources to:
  • Execute the proposed training pipeline on PHOENIX-14T, CSL-Daily, and/or How2Sign using the provided codebase and hyperparameters.
  • Report empirical results following the evaluation protocol defined in Section 5, including all recommended ablation studies.
  • Extend or modify the architecture based on empirical findings, e.g., exploring alternative backbone networks, different fusion strategies, or additional modalities.
  • Collaborate on joint publications reporting the empirical validation results.
We request that researchers who validate, falsify, or modify the SignFuse hypothesis cite this foundational architectural blueprint. We believe that open, collaborative research practices, where theoretical contributions and empirical validations are distributed across groups with complementary resources, will accelerate progress in sign language translation and accessibility technology.

8. Conclusion

We have presented the complete theoretical blueprint for SignFuse, a proposed dual-stream cross-modal fusion framework for gloss-free sign language translation with Large Language Models. By formalizing the mathematical integration of a ResNet-based visual stream with an ST-GCN-based skeletal stream via our novel Cross-Modal Fusion Attention (CMFA) module, and proposing a Hierarchical Temporal Aggregation (HTA) mechanism for multi-scale temporal modeling, SignFuse offers a principled solution to the limitations found in existing single-modality SLT systems.
Our progressive three-stage training strategy provides a systematic approach to bridging the gap between multimodal sign language features and the linguistic space of LLMs, while the complete PyTorch implementation ensures that the proposed architecture can be immediately deployed for empirical validation. While the empirical evaluation is left to future community collaboration, the structural foundations, mathematical rigor, and implementation detail established herein offer a compelling and reproducible paradigm for advancing sign language translation technology toward more accurate, inclusive, and practical systems for the Deaf and hard-of-hearing community.

Data Availability Statement

Code is available at: GitHub Repository.

References

  1. World Health Organization. Deafness and Hearing Loss. 2023. Available online: https://www.who.int/health-topics/hearing-loss (accessed on 2025-01-15).
  2. Camgöz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018; pp. 7784–7793. [Google Scholar]
  3. Singh, G.; Banerjee, T.; Ghosh, N. Tracing the Evolution of Artificial Intelligence: A Review of Tools, Frameworks, and Technologies (1950–2025). Preprints 2025. [Google Scholar] [CrossRef]
  4. Camgöz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020; pp. 10023–10033. [Google Scholar]
  5. Camgöz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Multi-channel Transformers for Multi-articulatory Sign Language Translation. arXiv 2020, arXiv:2009.00299. [Google Scholar]
  6. Zhou, B.; Chen, Z.; Clapés, A.; Shao, J.; Wang, J.; Xiao, J.; Zha, Z.J. Gloss-Free Sign Language Translation: Improving From Visual-Language Pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023; pp. 2816–2825. [Google Scholar]
  7. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  8. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  9. Singh, G.; Qamar, L.; Volta, N.V.; Velamuri, A.; Khanyile, A. Vision–Language Foundation Models and Multimodal Large Language Models: A Comprehensive Survey of Architectures, Benchmarks, and Open Challenges. Preprints 2026. [Google Scholar] [CrossRef]
  10. Gong, J.; Foo, L.G.; He, Y.; Rahmani, H.; Liu, J. LLMs are Good Sign Language Translators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024; pp. 14754–14764. [Google Scholar]
  11. Wong, R.; Camgöz, N.C.; Bowden, R. Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation. arXiv 2024, arXiv:2405.04164. [Google Scholar]
  12. Hwang, E.J.; Cho, S.; Lee, J.; Park, J.C. An Efficient Gloss-Free Sign Language Translation Using Spatial Configurations and Motion Dynamics with LLMs. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025. [Google Scholar]
  13. Kim, J.; Jeon, H.; Bae, J.; Kim, H.Y. Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025; pp. 21048–21058. [Google Scholar]
  14. Liu, Y.; Zhang, W.; Ren, S.; Huang, C.; Yu, J.; Xu, L. SCOPE: Sign Language Contextual Processing with Embedding from LLMs. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025; Vol. 39. [Google Scholar]
  15. Jiang, Z.; Sant, G.; Moryossef, A.; Müller, M.; Sennrich, R.; Ebling, S. SignCLIP: Connecting Text and Sign Language by Contrastive Learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. [Google Scholar]
  16. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. [Google Scholar]
  17. Koller, O.; Forster, J.; Ney, H. Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Systems Handling Multiple Signers. Computer Vision and Image Understanding 2015, 141, 108–125. [Google Scholar] [CrossRef]
  18. Cui, R.; Liu, H.; Zhang, C. Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017; pp. 7361–7369. [Google Scholar]
  19. Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; Li, W. Video-based Sign Language Recognition without Temporal Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. [Google Scholar]
  20. Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2020; pp. 1459–1469. [Google Scholar]
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017. [Google Scholar]
  22. Chen, Y.; Wei, F.; Sun, X.; Wu, Z.; Lin, S. Two-stream Network for Sign Language Recognition and Translation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022. [Google Scholar]
  23. Park, M.; Kim, S.; Lee, J. ConSignformer: A Transformer-based Architecture for Continuous Sign Language Recognition. arXiv 2024, arXiv:2405.12018. [Google Scholar]
  24. Chen, Y.; Li, Z.; Wang, M. Skeleton-Based Sign Language Recognition via Graph Convolutional Networks. arXiv 2024, arXiv:2407.11960. [Google Scholar]
  25. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018; 32. [Google Scholar]
  26. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A Framework for Building Perception Pipelines. arXiv 2019, arXiv:1906.08172. [Google Scholar] [CrossRef]
  27. Kumar, A.; Singh, R.; Patel, N. Continuous Sign Language Recognition System using Deep Learning with MediaPipe Holistic. arXiv 2024, arXiv:2411.04517. [Google Scholar] [CrossRef]
  28. Ahmed, Z.; Khan, M.; Ali, T. Enhancing Sign Language Detection through MediaPipe and Convolutional Neural Networks. arXiv 2024, arXiv:2406.03729. [Google Scholar] [CrossRef]
  29. Ferreira, P.M.; Cardoso, J.S.; Rebelo, A. Sign Language Recognition Based on Deep Learning and Low-cost Handcrafted Descriptors. arXiv 2024, arXiv:2408.07244. [Google Scholar] [CrossRef]
  30. Alyami, S.; Luqman, H. A Comparative Study of Continuous Sign Language Recognition Techniques. arXiv 2024, arXiv:2406.12369. [Google Scholar] [CrossRef]
  31. Rastgoo, R.; Kiani, K.; Escalera, S. Recent Advances on Deep Learning for Sign Language Recognition. Computer Modeling in Engineering & Sciences, 2024. [Google Scholar]
  32. Yin, A.; Zhong, T.; Tang, L.; Jin, W.; Jin, T.; Zhao, Z. Gloss Attention for Gloss-free Sign Language Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023; pp. 2551–2562. [Google Scholar]
  33. Inan, M.; Sicilia, A.; Alikhani, M. SignAlignLM: Integrating Multimodal Sign Language Processing into Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics (ACL), 2025. [Google Scholar]
  34. Zheng, W.; Li, H.; Chen, W.; Wang, Y. CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition. arXiv 2024, arXiv:2407.12174. [Google Scholar]
  35. Singh, G. A Review of Multimodal Vision–Language Models: Foundations, Applications, and Future Directions. Preprints 2025. [Google Scholar] [CrossRef]
  36. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019. [Google Scholar]
  37. Li, J.; Selvaraju, R.R.; Gotmare, A.D.; Joty, S.; Xiong, C.; Hoi, S. Align Before Fuse: Vision and Language Representation Learning with Momentum Distillation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021. [Google Scholar]
  38. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: a Visual Language Model for Few-Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022. [Google Scholar]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 770–778. [Google Scholar]
  40. Zhou, H.; Zhou, W.; Qi, W.; Pu, J.; Li, H. Improving Sign Language Translation with Monolingual Data by Sign Back-Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021; pp. 1316–1325. [Google Scholar]
  41. Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; DeHaan, K.; Metze, F.; Torres, J.; Giró-i-Nieto, X. How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021; pp. 2735–2744. [Google Scholar]
  42. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002; pp. 311–318. [Google Scholar]
  43. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, 2004. [Google Scholar]
  44. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005. [Google Scholar]
Figure 1. Overview of the proposed SignFuse Framework. Our dual-stream architecture extracts visual features via a CNN backbone and skeletal features via a pose estimator followed by a GCN encoder. The CMFA module performs bidirectional cross-attention between the two modalities. Fused features are processed by the HTA module before mapping to the LLM input space through the SLA.
Figure 2. Proposed Cross-Modal Fusion Attention (CMFA) Module. Bidirectional cross-attention enables each modality to attend to complementary features in the other, producing enriched multimodal representations via gated aggregation.
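The bidirectional cross-attention and gated aggregation described in the Figure 2 caption can be sketched in PyTorch; this is a minimal illustration assuming d = 512 and h = 8 from Table 1, and all module and variable names (`CMFABlock`, `v2s`, `s2v`) are ours for exposition, not identifiers from the released codebase:

```python
import torch
import torch.nn as nn

class CMFABlock(nn.Module):
    """Illustrative single CMFA layer: bidirectional cross-attention + gated fusion."""

    def __init__(self, d: int = 512, heads: int = 8):
        super().__init__()
        # Visual queries attend to skeletal keys/values, and vice versa.
        self.v2s = nn.MultiheadAttention(d, heads, batch_first=True)
        self.s2v = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d)
        self.norm_s = nn.LayerNorm(d)
        # Gated aggregation of the two enriched streams.
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, fv: torch.Tensor, fs: torch.Tensor) -> torch.Tensor:
        # fv, fs: (B, T, d) visual and skeletal feature sequences.
        v_enriched = self.norm_v(fv + self.v2s(fv, fs, fs)[0])
        s_enriched = self.norm_s(fs + self.s2v(fs, fv, fv)[0])
        g = self.gate(torch.cat([v_enriched, s_enriched], dim=-1))
        return g * v_enriched + (1 - g) * s_enriched  # fused (B, T, d)
```

In the full module, L = 3 such layers would be stacked, per Table 1.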
Figure 3. Proposed Multi-Stage Training Strategy. Stage 1 aligns visual-skeletal features via contrastive learning. Stage 2 maps fused features to the LLM embedding space. Stage 3 performs end-to-end LoRA fine-tuning. Snowflake/flame icons denote frozen/trainable components.
Table 1. Proposed model configuration.

Component             Specification
Visual Stream
   Backbone           ResNet-50 (ImageNet pre-trained)
   Output dimension   d_v = 2048
   Temporal conv      k = 3, stride s = 2, output d = 512
Skeletal Stream
   Pose estimator     MediaPipe Holistic
   Number of joints   V = 75 (hands + body + face)
   ST-GCN layers      4 layers, channels [2, 64, 128, 256, 512]
   Temporal kernel    k = 9, stride 2 at layer 3
CMFA Module
   Feature dimension  d = 512
   Number of layers   L = 3
   Attention heads    h = 8
   FFN inner dim      4d = 2048
HTA Module
   Temporal kernels   {1, 5, 11}
   Target length      T = 64
SLA + LLM
   Sign tokens        N_s = 32
   LLM backbone       LLaMA-2 7B (frozen)
   LLM dimension      d_LLM = 4096
   LoRA rank / alpha  r = 16, α = 32
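Table 1 can be transcribed directly into a configuration object for a reference implementation. The dataclass below is a sketch whose field names are ours (the released codebase may organize its configuration differently); the values mirror the table:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SignFuseConfig:
    # Visual stream (Table 1)
    visual_backbone: str = "resnet50"          # ImageNet pre-trained
    d_visual: int = 2048                       # ResNet-50 output dimension
    temporal_kernel: int = 3                   # temporal conv k
    temporal_stride: int = 2                   # temporal conv stride s
    d_model: int = 512                         # shared feature dimension d
    # Skeletal stream
    num_joints: int = 75                       # hands + body + face (MediaPipe Holistic)
    stgcn_channels: tuple = (2, 64, 128, 256, 512)  # 4 ST-GCN layers from 2-D coords
    stgcn_temporal_kernel: int = 9             # stride 2 applied at layer 3
    # CMFA module
    cmfa_layers: int = 3
    cmfa_heads: int = 8
    ffn_dim: int = 2048                        # 4 * d_model
    # HTA module
    hta_kernels: tuple = (1, 5, 11)
    target_length: int = 64
    # SLA + LLM
    num_sign_tokens: int = 32
    d_llm: int = 4096                          # LLaMA-2 7B (frozen)
    lora_rank: int = 16
    lora_alpha: int = 32
```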
Table 2. Proposed training hyperparameters per stage.

Parameter           Stage 1       Stage 2       Stage 3
Epochs              30            20            15
Learning rate       1 × 10⁻⁴      5 × 10⁻⁵      2 × 10⁻⁵
Batch size          32            32            32
Warmup epochs       3             2             1
Gradient clipping   1.0           1.0           1.0
Hardware            4× NVIDIA A100 80GB (all stages)
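As an illustration of Table 2, the per-stage settings can be collected in a small configuration map together with a warmup-aware learning-rate helper. The constant-after-warmup schedule below is our assumption for the sketch; the table itself does not specify a decay policy, and the stage keys are illustrative names:

```python
# Per-stage hyperparameters from Table 2; gradient clipping is 1.0 throughout.
STAGE_HPARAMS = {
    "stage1_contrastive": dict(epochs=30, lr=1e-4, batch_size=32, warmup_epochs=3),
    "stage2_alignment":   dict(epochs=20, lr=5e-5, batch_size=32, warmup_epochs=2),
    "stage3_lora":        dict(epochs=15, lr=2e-5, batch_size=32, warmup_epochs=1),
}

def lr_at_epoch(stage: str, epoch: int) -> float:
    """Linear warmup to the stage's peak LR, then constant (decay policy assumed)."""
    hp = STAGE_HPARAMS[stage]
    if epoch < hp["warmup_epochs"]:
        return hp["lr"] * (epoch + 1) / hp["warmup_epochs"]
    return hp["lr"]
```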
Table 4. Estimated computational cost comparison. Extra parameters are relative to the frozen LLM backbone.

Method             Extra Params (M)   Modalities
Sign2GPT [11]      ∼45                RGB
SpaMo [12]         ∼52                RGB (spatial + motion)
MMSLT [13]         ∼69                RGB + MLLM descriptions
SignFuse (Ours)    ∼66                RGB + Skeleton (GCN)
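The "Extra Params" column counts trainable parameters only, since the LLM backbone is frozen. A sketch of that bookkeeping (the helper and the stand-in modules below are ours, not part of the released codebase):

```python
import torch.nn as nn

def trainable_params_millions(model: nn.Module) -> float:
    """Trainable parameters in millions; frozen (requires_grad=False) weights are excluded."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Stand-ins: a frozen "backbone" plus a trainable "adapter".
backbone = nn.Linear(1000, 1000)
for p in backbone.parameters():
    p.requires_grad = False        # mimic the frozen LLM
adapter = nn.Linear(1000, 1000)    # mimic the trainable SignFuse modules
model = nn.Sequential(backbone, adapter)
print(trainable_params_millions(model))  # 1.001 — only the adapter counts
```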
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.