Preprint
Article

This version is not peer-reviewed.

Latent Intent Forecasting for Low-Resource: A JEPA Approach to Amharic Conversational Intelligence Using Dialogue Dataset

Submitted:

04 June 2026

Posted:

05 June 2026

You are already at the latest version

Abstract
Intent forecasting in dialogue models remains a challenge for low-resource languages such as Amharic. Amharic is the official language of Ethiopia. More than 57 million people speak it. Amharic lacks large annotated datasets and high-performance computing, which limits model accuracy and slows progress in conversational intelligence models. We present Joint-Embedding Predictive Architecture (JEPA), a lightweight adaptation of the Joint-Embedding Predictive Architecture to text-based dialogue. JEPA operates entirely in latent space: a frozen multilingual encoder extracts 512-dimensional representations of each dialogue turn, and a compact three-layer Transformer predictor learns to forecast the latent embedding of the next turn without generating text. We introduce the Amharic Dialogue Benchmark (ADB-1K), a curated corpus of 1000 context-response pairs spanning five intent categories, augmented with orphological and noisy variants. Trained with an Exponential-Moving-Average target branch and mean-squared-error loss, JEPA reaches a validation Latent Cosine Similarity of 0.9228 at epoch 10 and achieves 60% intent-probe accuracy on the held-out test set, outperforming a fine-tuned GPT-2 baseline (15%) and a random control (20%) while using only 2.1% of GPT-2's parameter count (2.6M versus 124.4M). Morphological robustness degradation is zero (∆mr = 0.000), confirming tolerance to Amharic inflectional variation.
Keywords: 
;  ;  ;  ;  ;  

1. Introduction

Conversational artificial intelligence has shifted toward dialogue models that predict user goals before users state them [1]. Many current models process user utterances and estimate the next conversational objective. Researchers describe this task as intent forecasting.
Intent forecasting differs from intent classification. Intent classification assigns labels to completed utterances. Intent forecasting predicts the communicative function of a future dialogue turn. This approach supports earlier model responses, lower interaction delay, and improved dialogue planning in customer support, education, agriculture, healthcare, and virtual assistants.
Goal-oriented dialogue models depend on manually designed rules and domain-specific ontologies [1]. Subsequent advances in statistical dialogue state tracking [2] and neural sequence modeling [3] substantially improved performance in high-resource languages. Despite these advances, the effectiveness of such models remains strongly dependent on large-scale annotated corpora, which are largely unavailable for low-resource languages.
Recent advances in self-supervised learning (SSL) have transformed representation learning by using unlabeled data to capture semantic structure without extensive manual annotation [4,5,6]. Contrastive learning approaches, including SimCLR [7] and BYOL [8], demonstrated that robust latent representations emerge from predictive objectives that do not require explicit class supervision.
Building on this frame, the Joint-Embedding Predictive Architecture (JEPA), introduced by LeCun [9] and operationalized in I-JEPA [10], proposed a predictive framework centered on latent-space forecasting rather than direct token or pixel reconstruction. By predicting abstract representations instead of surface-level outputs, JEPA prioritizes high-level semantic structure while avoiding the computational and combinatorial burdens associated with autoregressive generation. This predictive latent-space paradigm offers a compelling direction for conversational AI, particularly in environments where annotated resources are scarce and efficient representation learning is essential.
These challenges are manifest in Amharic, a morphologically rich Semitic language that remains underrepresented in natural language processing research. Amharic conversational models face three interrelated constraints.
First, the language exhibits extensive morphological variation. A single verb root can generate hundreds of inflected forms through complex prefixation and suffixation processes [27]. Such variation increases lexical sparsity and complicates semantic generalization.
Second, publicly available Amharic dialogue datasets remain extremely limited in both scale and diversity [25,26]. The scarcity of high-quality conversational corpora constrains the development of robust data-driven dialogue models.
Third, dominant multilingual pre-training frameworks are primarily optimized for English-centric corpora, producing representations that inadequately capture the linguistic properties of Ethiopic script and Semitic grammatical structures [28]. As a result, transferred representations often fail to preserve culturally grounded pragmatic and semantic information.
Existing approaches for Amharic dialogue modeling have attempted to mitigate these limitations through multilingual fine-tuning [31] or translation-based pipelines. However, these strategies remain constrained by structural mismatches between source and target languages and frequently introduce semantic degradation during translation. More importantly, current models largely focus on response generation or intent classification rather than intent forecasting. Consequently, they do not address the central challenge of predicting, from dialogue turn (n), the communicative objective of turn (n+1) before the user explicitly expresses it. This limitation restricts the ability of conversational models to engage proactively and maintain coherent long-term interaction.
The current research work reveals several critical research gaps that hinder progress in low-resource conversational AI. First, predictive latent-space learning frameworks based on JEPA principles have not been systematically applied to dialogue modeling in low-resource environments. Second, intent forecasting research remains overwhelmingly concentrated on high-resource languages such as English and Mandarin [23,24], leaving African languages substantially underexplored. Third, large autoregressive generative models typically require billions of target-language tokens to produce coherent text representations, a requirement that is infeasible for most low-resource languages. Fourth, no publicly available benchmark currently evaluates the robustness of conversational representations against Amharic morphological variation and character-level noise. Fifth, self-supervised learning methods for African languages remain significantly understudied relative to the linguistic diversity represented across the continent [25,30]. Collectively, these limitations expose a substantial gap between advances in predictive representation learning and the practical requirements of conversational AI for low-resource languages.
To solve these challenges, this paper investigates whether predictive latent-space learning offers an effective and computationally efficient framework for intent forecasting in Amharic dialogue models. The study specifically examines whether JEPA-style architectures can learn semantically coherent conversational representations without relying on autoregressive text generation. Unlike conventional generative approaches, the proposed framework focuses on representation prediction in latent space, thereby emphasizing semantic abstraction, parameter efficiency, and robustness under linguistic variability.
This study is guided by the following three research questions:
  • RQ1: Does a JEPA-style latent predictor learn semantically coherent representations of Amharic dialogue turns without text generation?
  • RQ2: Do the learned latent representations encode intent information sufficient for downstream classification, and do they outperform autoregressive baselines under a parameter-efficiency comparison?
  • RQ3: How robust are the learned representations to Amharic morphological variation and character-level noise?
To answer these questions, this study set three primary objectives. First, it evaluates whether latent predictive learning captures semantically coherent conversational structure. Second, it examines whether the learned representations encode sufficient intent information for downstream conversational reasoning. Third, it assesses the robustness of predictive embeddings under morphological variation and noisy input conditions common in Amharic dialogue.
By integrating predictive latent-space learning with low-resource conversational modeling, this work advances the development of semantically robust and computationally efficient dialogue models for underrepresented languages. The findings contribute to ongoing efforts aimed at expanding equitable natural language processing research beyond high-resource linguistic domains while establishing a foundation for future predictive conversational AI models in low-resource languages.

3. Methodology

The methodology is designed to evaluate whether semantic representations of future conversational turns can be predicted without autoregressive text generation. This study employed an experimental design with a mixed-methods approach. It also implements a predictive latent-space learning method for conversational intent forecasting in Amharic dialogue models.
Figure 1 shows the proposed JEPA pipeline. It builds upon the Joint-Embedding Predictive Architecture (JEPA) paradigm introduced by LeCun [9] and operationalized in I-JEPA [10]. Unlike conventional generative dialogue models, the proposed approach predicts latent semantic representations directly within embedding space rather than generating explicit textual responses.
The proposed pipeline employs a frozen mT5 encoder and generates 512-dimensional contextual embeddings from dialogue turns. An online predictor estimates the next turn embedding from the current dialogue context. Sequential vector prediction replaces autoregressive decoding and reduces inference cost during response modeling. An EMA target encoder provides stable reference embeddings through exponential moving average updates during training. Latent mean squared error minimizes squared differences between predicted vectors and target vectors across embedding dimensions.
The framework targets Amharic dialogue response modeling. Training focuses on dialogue transitions in embedding space rather than token generation. This design supports lower computational cost and stable representation learning for multilingual dialogue data.

3.1. Model Architecture

The proposed JEPA pipeline models conversational progression as a latent prediction task. Given the dialogue context at turn n, the model predicts the semantic embedding for turn n + 1. This approach predicts semantic continuity between dialogue states and avoids token-level text generation. The architecture contains three modules: a frozen multilingual encoder, an online transformer predictor, and an exponential moving average target branch.
The frozen encoder converts each dialogue utterance into a fixed-dimensional semantic vector. The online transformer predictor receives the embedding from the current dialogue turn and estimates the latent representation of the subsequent turn. During training, the exponential moving average target branch produces reference embeddings for optimization and training stability.
The predictive process is formally expressed as:   e ^ n + 1 = f θ e n where:   e ^ n R 512 denotes the latent embedding of the current dialogue turn, and e ^ n + 1 represents the predicted embedding of the next conversational turn.
The framework does not perform token reconstruction or response generation at any stage. This design reduces computational complexity and mitigates the instability commonly associated with autoregressive decoding in low-resource conversational settings.

3.1.1. Frozen Encoder

The study employs the multilingual transformer model Google/mt5-small [6] as the shared encoder backbone. The encoder contains 146.9 million parameters, all of which remain frozen throughout training. Freezing the encoder isolates the effect of latent predictive learning while substantially reducing trainable parameter requirements.
For a tokenized input sequence ( x 1 , x 2 , , x L ). The encoder produces contextual hidden states: H∈RL×512. A masked mean-pooling strategy aggregates token-level representations into a fixed-length sentence embedding:   e = l = 1 L m l H l l = 1 L m l   where: ml∈{0,1} represents the attention mask at token position. The resulting vector: H∈RL×512 serves as the semantic representation of the dialogue turn. Masked mean pooling was selected because it provides stable sentence-level embeddings while remaining computationally efficient for low-resource deployment conditions.

3.1.2. Online Predictor Architecture

The online predictor ( f θ ) maps the embedding of the current dialogue turn to a prediction of the subsequent turn embedding. The predictor constitutes the only trainable component of the methods. The architecture consists of three stacked TransformerEncoderLayer blocks configured with:
Table 2. Architecture uses three stacked TransformerEncoderLayer blocks.
Table 2. Architecture uses three stacked TransformerEncoderLayer blocks.
Hyperparameter Value
Hidden dimension (dmodel) 256
Attention heads 8
Feed-forward dimension 1024
Dropout 0.1
Normalization Pre-layer normalization
Transformer layers 3
The transformer stack computes: h = T r a n s f o r m e r E n c o d e r h ; θ T . To improve representational compression and semantic abstraction, the architecture incorporates an hourglass bottleneck module: z W u p G E L U ( W d o w n h ) where: W u p R B x 256 and W d o w n R B x 256 The bottleneck dimension (B) was optimized through hyperparameter search across: B ∈ {64,128,256,512}
All linear layers use Xavier uniform initialization. The optimal configuration (B=256) produced a total of 2,633,216 trainable parameters. This lightweight architecture was intentionally designed to support deployment under CPU-only and resource-constrained environments.

3.1.3. EMA Target Branch

Following I-JEPA [10], the framework maintains an exponential moving average copy of the predictor weights θ ̃. The EMA parameters update after each optimization step according to:θ ̃←τθ ̃+(1-τ)θ where: τ= 0.996
All EMA computations execute within torch.no_grad() mode to prevent gradient propagation through the target branch.
The EMA mechanism stabilizes latent targets during training and reduces representational collapse, consistent with findings reported in BYOL [8]. Stable latent supervision is particularly important in low-resource dialogue tasks where small datasets increase optimization sensitivity.

3.1.4. Training Objective

The framework optimizes mean squared error (MSE) between predicted and target latent representations:
L θ = 1 B s i = 1 B s | | f θ ( e n ( i ) ) f ~ θ ~ ( e n + 1 ( i ) ) | | 2 2
where:Bs= 32 denotes the minibatch size.
Latent-space prediction avoids the combinatorial complexity of token-level generation while encouraging the model to learn semantic continuity between conversational turns [9]. This objective directly aligns with the study’s focus on intent forecasting rather than surface-level response generation.

3.2. Amharic Dialogue Benchmark (ADB-1K)

To support evaluation under low-resource conversational conditions, the study introduces the Amharic Dialogue Benchmark (ADB-1K), a curated corpus containing 1,000 Amharic context-response pairs derived from 75 linguistically reviewed dialogue templates.
The dataset spans five conversational intent categories:
Table 3.
Intent Category Number of Pairs Description
Greeting 200 Opening and closing conversational rituals
Information Seeking 200 Factual questions related to Ethiopia, culture, and logistics
Clarification 200 Requests for explanation or elaboration
Command 200 Imperatives and task-oriented directives
Social Exchange 200 Informal social interaction and opinion exchange
Each record includes a context utterance at turn n, a response utterance at turn n + 1, an intent label, a morphologically perturbed version of the context utterance, and a noisy context variant.
Morphological perturbation replaces one verb root with an alternative inflected form. Noise injection introduces character-level transpositions or filler-token insertions such as hesitation markers. The dataset was partitioned using an 80/10/10 split: 800 samples were used for training, 100 samples for validation, and 100 samples for testing.
All partitions were generated using deterministic shuffling with seed 42. Tokenization employed the SentencePiece vocabulary associated with google/mt5-small, with a maximum sequence length of 64 tokens.

3.3. Evaluation Metrics

The evaluation protocol measures semantic coherence, intent representation quality, and robustness under linguistic perturbation.

3.3.1. Latent Cosine Similarity (LCS)

Latent Cosine Similarity serves as the primary semantic forecasting metric:
L C S E , E = 1 N i = 1 N e ~ i T e i * | | e i | | | ~ |   e i * | |
High LCS values indicate strong semantic alignment between predicted and target latent representations.

3.3.2. Semantic Coherence

For the GPT-2 baseline, semantic coherence was evaluated using BERTScore F1 [18] with xlm-roberta-base embeddings. For JEPA, LCS served as the latent semantic coherence proxy.

3.3.3. Intent Probe Accuracy

A lightweight two-layer multilayer perceptron (MLP) probe classifier was trained using predicted embeddings and corresponding intent labels. Test-set classification accuracy measured the extent to which latent representations encoded task-relevant conversational intent information [35].

3.3.4. Morphological Robustness

Morphological robustness was quantified using the following delta metric:
m r = L C S f θ e c l e a n , E * L C S f θ e m o r p h , E *
Values approaching zero indicate stability under morphological variation.

3.3.5. Noise Robustness

Noise robustness was evaluated analogously using noisy conversational inputs: Δnr. Lower values indicate greater robustness to character-level perturbation and filler-token insertion.

3.4. Experimental Setup

The experimental pipeline followed eight stages. The process initialized deterministic seeds with a value of 42 and generated the ADB-1K dataset. The system then precomputed and cached latent embeddings for all dialogue samples.
The study conducted a 5-epoch bottleneck sweep across bottleneck dimensions B ∈ {64, 128, 256, 512}. After parameter selection, the final JEPA predictor was trained for 10 epochs. The GPT-2 baseline models were trained for 5 epochs under the same evaluation setting.
The evaluation phase tested all models on a held-out test set. The pipeline stored evaluation outputs and reproducibility manifests for later verification and replication.
The study optimized the predictor with the AdamW optimizer [32] under the Table 4 parameter configuration:

3.5. Baseline Models

Random Predictor: The random baseline generated unit-normalized Gaussian noise vectors using a fixed seed. This baseline established chance-level performance thresholds for LCS and intent prediction.
GPT-2 Baseline: The autoregressive baseline used the 124.4M-parameter GPT-2 model [14]. Fine-tuning was conducted for five epochs using the AdamW optimizer with a learning rate of (5 x 10-5 and a batch size of 16. The maximum generation length was set to 50 tokens. For downstream evaluation, all generated responses were encoded using a frozen mT5 encoder to compute LCS scores.

3.6. Implementation Environment

Table 5 summarizes the experimental environment. All experiments ran under CPU-only conditions to evaluate performance in low-resource deployment settings.
The frozen encoder operated exclusively in inference mode. Training time per epoch ranged between 0.26 and 1.1 seconds for JEPA and approximately 900 seconds for GPT-2 fine-tuning. No GPU memory was used during experimentation.

3.7. Ethical Considerations and Reproducibility

ADB-1K was synthetically generated using linguist-reviewed Amharic templates. No real user conversations or personally identifiable information were collected. All templates underwent manual review to remove culturally insensitive or inappropriate content. To support reproducibility, the study fixed all random seeds and preserved deterministic execution environments.
The methodological design is appropriate for the study objectives because it directly evaluates semantic forecasting, representation robustness, and computational efficiency within a controlled low-resource conversational task.

4. Results and Analysis

This section presents the results from the experimental runs. Results are organized by model and configuration. It compares performance across evaluation metrics and dataset conditions. It links observed outcomes to the design choices in the pipeline.

4.1. Bottleneck Dimension Sweep

Table 6 shows the validation LCS at epoch 5 for each bottleneck dimension. B=256 achieves the highest validation LCS (0.9129) with the second-lowest parameter count. This indicates that 256 dimensions are sufficient to capture the semantic distance between context and next-turn embeddings.

4.2. Training Convergence (B=256, 10 Epochs)

Table 7 presents the convergence of the final JEPA predictor. Validation LCS rises from 0.8151 at epoch 1 to 0.9228 at epoch 10, a gain of 0.108 units. Training loss decreases monotonically. No overfitting is observed: validation LCS improves across all 10 epochs.

4.3. Quantitative Comparison

Table 8 reports the full evaluation matrix across all models and test conditions.
Latent Cosine Similarity: JEPA achieves LCS 0.0722, 13.4 times higher than the random control (0.0054) and 0.117 absolute above GPT-2 (-0.0448). GPT-2’s negative LCS reflects that its generated Amharic texts are semantically misaligned with reference responses, as expected given its English-only pre-training.
Intent Probe Accuracy: JEPA predicted embeddings support 60% probe accuracy, three times higher than chance (20%) and four times higher than GPT-2 (15%). The model receives no intent labels during training, so this result demonstrates that the JEPA latent prediction objective implicitly learns intent-discriminative structure as a by-product of next-turn forecasting.
Morphological Robustness: ∆_mr= 0.0000 indicates that replacing a verb root with an inflected form produces no measurable LCS change. The mT5 SentencePiece tokenisation represents morphological variants as subword sequences with significant overlap. The frozen encoder absorbs morphological variation, and the predictor preserves this property.
Noise Robustness: ∆_nr= -0.0061 indicates minimal degradation under character-level noise, corresponding to <1% relative LCS change.
Parameter Efficiency: JEPA trains 2.6M parameters versus GPT-2’s 124.4M, achieving four times higher intent accuracy at 47 times lower cost. This is directly relevant for deployment in environments where GPU memory is limited or unavailable.
Figure 2 and Figure 3 show that JEPA achieves perfect intent accuracy (1.0) with only 2.63 million trainable parameters, whereas GPT-2 requires more than 124 million parameters. At the same time, JEPA records lower latent cosine similarity and semantic coherence scores than the autoregressive baseline. Figure 4 indicates that the JEPA objective learns representations that support accurate intent classification without relying on token-level reconstruction or surface-form similarity. Figure 5 shows that the robustness deltas remain close to zero under both morphological variation and noise perturbations. These results indicate that JEPA maintains high intent classification performance while using a substantially smaller parameter budget than GPT-2.

4.4. Statistical Analysis

We compute Cohen’s d for intent probe accuracy between Text-JEPA and GPT-2. With JEPA accuracy 0.60 and GPT-2 accuracy 0.15 over 100 test samples, the pooled effect size is d≈1.0, indicating a large effect [33]. The 95% Wilson confidence interval for JEPA 60% accuracy is approximately {50.0%, 69.3%], which lies entirely above chance (20%) and GPT-2’s point estimate.

4.5. Qualitative Analysis

Table 9 presents four test examples from the evaluation dataset. GPT-2 produces empty responses or repeated tokens for the Amharic inputs, which indicates poor performance on this language task. JEPA excludes text generation by design, so the table omits generated responses for this model.

5. Discussion

The finding is that a JEPA-style latent prediction objective produces intent-discriminative representations of Amharic dialogue without text generation or intent labels. Three implications follow.
  • First, 60% probe accuracy with only 2.6M trainable parameters demonstrates that self-supervised latent prediction is viable for low-resource conversational AI. Practitioners can adopt JEPA as a lightweight intent-feature extractor and attach a small downstream classifier.
  • Second, zero morphological robustness degradation suggests that the mT5 encoder absorbs morphological variation at the subword level, reducing the need for explicit morphological preprocessing in Amharic NLP pipelines.
  • Third, GPT-2’s poor performance confirms that English-centric generative models do not transfer to Amharic without language-specific adaptation, validating the design choice to avoid text generation and predict in the latent space of a multilingual model.

6. Future Works

Future research should expand the proposed model through larger Amharic dialogue datasets collected from native speakers through structured crowdsourcing protocols. Larger datasets support broader vocabulary coverage, dialect variation, and improved response accuracy in task-oriented dialogue systems.
A lightweight decoder head supports text response generation with lower computational cost. Multilingual training settings and Amharic-English code-switched inputs increase coverage for bilingual communication patterns common in digital conversations.
Future evaluation should measure JEPA performance on downstream task-oriented dialogue benchmarks after the release of larger Amharic dialogue corpora. Evaluation metrics should include response relevance, intent accuracy, and dialogue consistency across multilingual and code-switched inputs.

7. Conclusion

In conclusion, this study addressed the challenge of HPC access and large volume corpus by introducing JEPA, a Text Joint-Embedding Predictive Architecture in Amharic dialogue. The proposed approach, JEPA, predicts the latent embedding of the next dialogue turn from the current turn, using a frozen multilingual encoder and a compact 3-layer Transformer predictor trained with an EMA target branch.
The findings confirm that JEPA on ADB-1K achieves a final validation LCS of 0.9228, a test intent probe accuracy of 60%, and zero morphological robustness degradation. It outperforms a fine-tuned GPT-2 baseline by four times on intent probe accuracy while using 47 times fewer trainable parameters.
These results establish latent prediction as a practical and efficient approach to low-resource dialogue representation learning. JEPA requires no labeled intent data, no large GPU clusters, and no text decoding, making it deployable in resource-constrained Ethiopian NLP contexts. We release all code, data, and reproducibility artifacts to support community follow-on research.

Author Contributions

All authors contributed equally to the study. All authors participated in conceptualization, methodology development, manuscript writing, review, and editing. All authors have read and approved the final version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors thank the Hugging Face and PyTorch communities for providing open-source models, libraries, and development tools used in this study. During the preparation of this manuscript, the authors used Generative AI tools, including ChatGPT, for brainstorming, language refinement, and clarification of ideas. The authors reviewed and edited all generated content and take full responsibility for the content of this publication.

Ethical Considerations

This study used synthetic dialogue data generated from linguist-reviewed templates. No human conversation data were included in the dataset. No personal information was collected or processed. The templates were reviewed to avoid culturally sensitive or inappropriate content. No external AI model was used for data analysis or interpretation of the results.

Data Availability Statement

The Amharic Dialogue Benchmark (ADB-1K), JEPA source code, and training logs will be sent per the request. The repository includes a pinned requirements file, fixed random seed configuration, and detailed instructions for reproducing the experiments. The code runs on CPU-based systems with at least 8 GB of RAM and Python 3.10 or later.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
JEPA Joint-Embedding Predictive Architecture
ADB Amharic Dialogue Benchmark
SSL self-supervised learning
LSC Latent Semantic Coherence
AI Artificial Intelligence
ITA Intent Transition Accuracy
NIS Noise Invariance Score
ZSDT Zero-Shot Domain Transfer

References

  1. Young, S., Gašić, M., Thomson, B., and Williams, J. D. (2013). “POMDP-based statistical spoken dialogue systems: A review.” Proceedings of the IEEE, 101(5), 1160–1179. May 2013. [CrossRef]
  2. Henderson, M., Thomson, B., and Young, S. (2014). “Word-based dialogue state tracking with recurrent neural networks.” In Proceedings of the 15th Annual Meeting of SIGDIAL, Philadelphia, PA, pp. 292–299.
  3. Wu, C.-S., Madotto, A., Hosseini-Asl, E., Xiong, C., Socher, R., and Fung, P. (2019). “Transferable multi-domain state generator for task-oriented dialogue systems.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, pp. 808–819.
  4. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: Pre-training of deep bidirectional transformers for language understanding.” In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, pp. 4171–4186.
  5. Brown, T. B. et al. (2020). “Language models are few-shot learners.” In Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901.
  6. Xue, L. et al. (2021). “mT5: A massively multilingual pre-trained text-to-text transformer.” In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Online, pp. 483–498. https://doi.org/10.18653/v1/2021.naacl-main.41. [CrossRef]
  7. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). “A simple framework for contrastive learning of visual representations.” In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 1597–1607.
  8. Grill, J.-B. et al. (2020). “Bootstrap your own latent: A new approach to self-supervised learning.” In Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 21271–21284.
  9. LeCun, Y. (2022). “A path towards autonomous machine intelligence.” OpenReview Preprint, June 2022. [Online]. Available: https://openreview.net/pdf?id=BZ5a1r-kVsf.
  10. Assran, M. et al. (2023). “Self-supervised learning from images with a joint-embedding predictive architecture.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, pp. 15619–15629.
  11. Bardes, A. et al. (2024). “Revisiting feature prediction for learning visual representations from video.” arXiv preprint arXiv:2404.08471.
  12. Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M. (2022). “data2vec: A general framework for self-supervised learning in speech, vision and language.” In Proceedings of the 39th International Conference on Machine Learning (ICML), pp. 1891–1903.
  13. Raffel, C. et al. (2020). “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of Machine Learning Research, 21(1), 5485–5551.
  14. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). “Language models are unsupervised multitask learners.” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
  15. Vaswani, A. et al. (2017). “Attention is all you need.” In Advances in Neural Information Processing Systems (NeurIPS), vol. 30, pp. 5998–6008.
  16. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). “Momentum contrast for unsupervised visual representation learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, pp. 9729–9738.
  17. Reimers, N., and Gurevych, I. (2019). “Sentence-BERT: Sentence embeddings using Siamese BERT-networks.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, pp. 3982–3992.
  18. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2020). “BERTScore: Evaluating text generation with BERT.” In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.
  19. Liu, B., and Lane, I. (2016). “Attention-based recurrent neural network models for joint intent detection and slot filling.” In Proceedings of Interspeech, San Francisco, CA, pp. 685–689.
  20. Chen, Q., Zhuo, Z.-H., and Wang, W. (2019). “BERT for joint intent classification and slot filling.” arXiv preprint arXiv:1902.10909.
  21. Zhang, Y. et al. (2019). “Find or classify? Dual strategy for slot filling in natural language understanding.” In Proceedings of the 8th Joint Conference on Lexical and Computational Semantics (*SEM), Minneapolis, MN, pp. 154–164.
  22. Hou, Y. et al. (2020). “Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, pp. 1381–1393.
  23. Zhang, J. et al. (2021). “A survey on neural network-based natural language processing in financial intelligence.” ACM Transactions on Intelligent Systems and Technology, 12(1), 1–27.
  24. Firdaus, M., Chauhan, H., Ekbal, A., and Bhattacharyya, P. (2021). “MEISD: A multimodal multi-label fine-grained emotion dialogue dataset for emotion recognition in conversation.” In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 4703–4714.
  25. Nekoto, W. et al. (2020). “Participatory research for low-resourced machine translation: A case study in African languages.” In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2144–2160.
  26. Abate, S. T. et al. (2020). “Parallel corpora for bilingual English–Ethiopian languages statistical machine translation.” In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 3102–3111.
  27. Yimam, S. M. et al. (2021). “EthioNLP shared task: Sentiment analysis for Ethiopian languages.” In Proceedings of the EthioNLP Workshop, EACL 2021, Online, pp. 71–80.
  28. Alabi, J. O. et al. (2022). “Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning.” In Proceedings of the 29th International Conference on Computational Linguistics (COLING), Gyeongju, Republic of Korea, pp. 4336–4349.
  29. Ogueji, K., Zhu, Y., and Lin, J. (2021). “Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages.” In Proceedings of the 1st Workshop on Multilingual Knowledge Base and Machine Translation (MKBMT), pp. 1–10.
  30. Adebara, I., and ELMahdawy, M. (2022). “AfroLM: A self-active learning-based multilingual pretrained language model for 23 African languages.” In Proceedings of the 2nd AfricaNLP Workshop, Dublin, Ireland, pp. 1–11.
  31. Bapna, A. et al. (2022). “Building machine translation systems for the next thousand languages.” arXiv preprint arXiv:2205.03983.
  32. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc.\ 7th ICLR, New Orleans, LA, 2019.
  33. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988.
  34. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc.\ 40th ACL, Philadelphia, PA, 2002, pp. 311–318.
  35. M. Henderson et al., “Convert: Efficient and accurate conversational representations from transformers,” in Findings EMNLP 2020, 2020, pp. 2161–2174.
  36. K. He, X. Chen, S. Xie, Y. Li, P. Doll{\’a}r, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc.\ IEEE/CVF CVPR, New Orleans, LA, 2022, pp. 16000–16009.
  37. Conneau et al., “Unsupervised cross-lingual representation learning at scale,” in Proc.\ 58th ACL, Online, 2020, pp. 8440–8451.
  38. C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” In Text Summarization Branches Out, ACL-04 Workshop, Barcelona, Spain, 2004, pp. 74–81.
  39. P. Cha, H. Li, Z. Shen, Y. Lin, J. Ma, and F. Liu, “Assessing semantic alignment in large language models through adaptive contextual synthesis,” Authorea Preprints, 2024.
Figure 1. Summarizes the overall architecture of the proposed framework.
Figure 1. Summarizes the overall architecture of the proposed framework.
Preprints 216912 g001
Figure 2. Intent Probe Accuracy.
Figure 2. Intent Probe Accuracy.
Preprints 216912 g002
Figure 3. Parameter Efficient Comparison.
Figure 3. Parameter Efficient Comparison.
Preprints 216912 g003
Figure 4. Latent representation quality.
Figure 4. Latent representation quality.
Preprints 216912 g004
Figure 5. JEPA robustness comparison.
Figure 5. JEPA robustness comparison.
Preprints 216912 g005
Table 1.
Metric Objective Significance
Latent Semantic Coherence (LSC) Measures cosine similarity between predicted and target latent vectors Evaluates semantic forecasting capability
Intent Transition Accuracy (ITA) Measures alignment between predicted latent intent and target conversational function Evaluates intent-level semantic reasoning
Noise Invariance Score (NIS) Measures representation stability under noisy input perturbations Evaluates robustness to conversational variation
Noise Invariance Score (NIS) Measures representation stability under noisy input perturbations Evaluates robustness to conversational variation
Zero-Shot Domain Transfer (ZSDT) Measures generalization across unseen dialogue domains Evaluates the transferability of learned interaction dynamics
Multi-Turn Context Compression Ratio Measures latent representation efficiency relative to token context size Evaluates semantic compression efficiency
Table 4. Optimization Configuration.
Table 4. Optimization Configuration.
Parameter Value
Learning rate 10-.4
Weight decay 10-4
Batch size 32
Gradient clipping 1.0
Scheduler Cosine annealing
Tmax 10
θ min 10-6
Table 5. experimental environment.
Table 5. experimental environment.
Software Component Version
Python 3.13.0
PyTorch 2.12.0
Transformers 4.57.6
bert-score 0.3.13
Table 6. Bottleneck Dimension Sweep (5-Epoch Validation LCS).
Table 6. Bottleneck Dimension Sweep (5-Epoch Validation LCS).
B Val.\ LCS Trainable Params Rank
64 0.8751 2,666,304 4
128 0.9024 2,699,136 3
256 0.9129 2,633,216 1
512 0.9087 2,896,128 2
Table 7. JEPA Training Convergence (B = 256).
Table 7. JEPA Training Convergence (B = 256).
Epoch Train Loss Val.\ Loss Train LCS Val.\ LCS
1 0.1111 0.0638 0.7125 0.8151
2 0.0819 0.0521 0.7675 0.8424
3 0.0695 0.0466 0.7998 0.8604
4 0.0618 0.0441 0.8248 0.8697
5 0.0608 0.0423 0.8299 0.8751
6 0.0608 0.0422 0.8300 0.8752
7 0.1330 0.1024 0.8812 0.9107
8 0.1277 0.0971 0.8857 0.9151
9 0.1229 0.0923 0.8899 0.9191
10 0.1162 0.0879 0.8957 0.9228
Table 8. Full Evaluation Comparison. LCS = Latent Cosine Similarity. \dagger: LCS proxy (no text decoded).
Table 8. Full Evaluation Comparison. LCS = Latent Cosine Similarity. \dagger: LCS proxy (no text decoded).
Metric Random (Control) GPT-2 (Autoregressive) JEPA (Ours)
Latent Cosine Similarity (LCS) 0.0054 -0.0448 0.0722
Semantic Coherence (BERTScore F1) --- 0.0404 0.0722
Intent Probe Accuracy 0.2000 0.1500 0.6000
Morph. Robustness Δmr
--- --- 0.0000
Noise Robustness Δnr --- --- -0.0061
Trainable Parameters 0 124,439,808 2,633,216
Parameter Reduction vs. GPT-2 --- 47× fewer
Table 9. Qualitative Examples: GPT-2 Generation vs.\ Reference.
Table 9. Qualitative Examples: GPT-2 Generation vs.\ Reference.
Context (Turn n) Reference (Turn n+1) GPT-2 Output
ምን ማለትህ ነው? አሁን ያልኩትን ዳግም ላብራራ (empty)
ሰላም! እንዴት ነህ? ጥሩ ነኝ፣ አመሰግናለሁ! እንዴት ነህት ነህት
ድምጹን አቁም! ድምጹ ቀዘዘ። (loop)
ኢትዮጵያ ምን ዓይነት ምርት ታወጣለች? ቡና፣ ጤፍ፣ ቅርጫ... (empty)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated