Latent Intent Forecasting for Low-Resource: A JEPA Approach to Amharic Conversational Intelligence Using Dialogue Dataset

Kindie Alebachew Tsega; Durga Prasad Sharma; Mesay Samuel Gondere; Mohammed Abebe Yimer

doi:10.20944/preprints202606.0409.v1

Submitted: 04 June 2026

Posted: 05 June 2026

Abstract

Intent forecasting in dialogue models remains a challenge for low-resource languages such as Amharic, the official language of Ethiopia, spoken by more than 57 million people. Work on Amharic lacks large annotated datasets and high-performance computing resources, which limits model accuracy and slows progress in conversational intelligence models. We present a lightweight adaptation of the Joint-Embedding Predictive Architecture (JEPA) to text-based dialogue. JEPA operates entirely in latent space: a frozen multilingual encoder extracts 512-dimensional representations of each dialogue turn, and a compact three-layer Transformer predictor learns to forecast the latent embedding of the next turn without generating text. We introduce the Amharic Dialogue Benchmark (ADB-1K), a curated corpus of 1000 context-response pairs spanning five intent categories, augmented with morphological and noisy variants. Trained with an Exponential-Moving-Average target branch and mean-squared-error loss, JEPA reaches a validation Latent Cosine Similarity of 0.9228 at epoch 10 and achieves 60% intent-probe accuracy on the held-out test set, outperforming a fine-tuned GPT-2 baseline (15%) and a random control (20%) while using only 2.1% of GPT-2's parameter count (2.6M versus 124.4M). Morphological robustness degradation is zero (∆mr = 0.000), confirming tolerance to Amharic inflectional variation.

Keywords: intent forecasting; self-supervised learning; joint-embedding predictive architecture; Amharic NLP; latent space prediction; multilingual representation learning

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Conversational artificial intelligence has shifted toward dialogue models that predict user goals before users state them [1]. Many current models process user utterances and estimate the next conversational objective. Researchers describe this task as intent forecasting.

Intent forecasting differs from intent classification. Intent classification assigns labels to completed utterances. Intent forecasting predicts the communicative function of a future dialogue turn. This approach supports earlier model responses, lower interaction delay, and improved dialogue planning in customer support, education, agriculture, healthcare, and virtual assistants.

Early goal-oriented dialogue models depended on manually designed rules and domain-specific ontologies [1]. Subsequent advances in statistical dialogue state tracking [2] and neural sequence modeling [3] substantially improved performance in high-resource languages. Despite these advances, the effectiveness of such models remains strongly dependent on large-scale annotated corpora, which are largely unavailable for low-resource languages.

Recent advances in self-supervised learning (SSL) have transformed representation learning by using unlabeled data to capture semantic structure without extensive manual annotation [4,5,6]. Contrastive learning approaches, including SimCLR [7] and BYOL [8], demonstrated that robust latent representations emerge from predictive objectives that do not require explicit class supervision.

Building on this frame, the Joint-Embedding Predictive Architecture (JEPA), introduced by LeCun [9] and operationalized in I-JEPA [10], proposed a predictive framework centered on latent-space forecasting rather than direct token or pixel reconstruction. By predicting abstract representations instead of surface-level outputs, JEPA prioritizes high-level semantic structure while avoiding the computational and combinatorial burdens associated with autoregressive generation. This predictive latent-space paradigm offers a compelling direction for conversational AI, particularly in environments where annotated resources are scarce and efficient representation learning is essential.

These challenges are manifest in Amharic, a morphologically rich Semitic language that remains underrepresented in natural language processing research. Amharic conversational models face three interrelated constraints.

First, the language exhibits extensive morphological variation. A single verb root can generate hundreds of inflected forms through complex prefixation and suffixation processes [27]. Such variation increases lexical sparsity and complicates semantic generalization.

Second, publicly available Amharic dialogue datasets remain extremely limited in both scale and diversity [25,26]. The scarcity of high-quality conversational corpora constrains the development of robust data-driven dialogue models.

Third, dominant multilingual pre-training frameworks are primarily optimized for English-centric corpora, producing representations that inadequately capture the linguistic properties of Ethiopic script and Semitic grammatical structures [28]. As a result, transferred representations often fail to preserve culturally grounded pragmatic and semantic information.

Existing approaches for Amharic dialogue modeling have attempted to mitigate these limitations through multilingual fine-tuning [31] or translation-based pipelines. However, these strategies remain constrained by structural mismatches between source and target languages and frequently introduce semantic degradation during translation. More importantly, current models largely focus on response generation or intent classification rather than intent forecasting. Consequently, they do not address the central challenge of predicting, from dialogue turn (n), the communicative objective of turn (n+1) before the user explicitly expresses it. This limitation restricts the ability of conversational models to engage proactively and maintain coherent long-term interaction.

A review of current research reveals several critical gaps that hinder progress in low-resource conversational AI. First, predictive latent-space learning frameworks based on JEPA principles have not been systematically applied to dialogue modeling in low-resource environments. Second, intent forecasting research remains overwhelmingly concentrated on high-resource languages such as English and Mandarin [23,24], leaving African languages substantially underexplored. Third, large autoregressive generative models typically require billions of target-language tokens to produce coherent text representations, a requirement that is infeasible for most low-resource languages. Fourth, no publicly available benchmark currently evaluates the robustness of conversational representations against Amharic morphological variation and character-level noise. Fifth, self-supervised learning methods for African languages remain significantly understudied relative to the linguistic diversity represented across the continent [25,30]. Collectively, these limitations expose a substantial gap between advances in predictive representation learning and the practical requirements of conversational AI for low-resource languages.

To address these challenges, this paper investigates whether predictive latent-space learning offers an effective and computationally efficient framework for intent forecasting in Amharic dialogue models. The study specifically examines whether JEPA-style architectures can learn semantically coherent conversational representations without relying on autoregressive text generation. Unlike conventional generative approaches, the proposed framework focuses on representation prediction in latent space, thereby emphasizing semantic abstraction, parameter efficiency, and robustness under linguistic variability.

This study is guided by the following three research questions:

RQ1: Does a JEPA-style latent predictor learn semantically coherent representations of Amharic dialogue turns without text generation?
RQ2: Do the learned latent representations encode intent information sufficient for downstream classification, and do they outperform autoregressive baselines under a parameter-efficiency comparison?
RQ3: How robust are the learned representations to Amharic morphological variation and character-level noise?

To answer these questions, this study sets three primary objectives. First, it evaluates whether latent predictive learning captures semantically coherent conversational structure. Second, it examines whether the learned representations encode sufficient intent information for downstream conversational reasoning. Third, it assesses the robustness of predictive embeddings under morphological variation and noisy input conditions common in Amharic dialogue.

By integrating predictive latent-space learning with low-resource conversational modeling, this work advances the development of semantically robust and computationally efficient dialogue models for underrepresented languages. The findings contribute to ongoing efforts aimed at expanding equitable natural language processing research beyond high-resource linguistic domains while establishing a foundation for future predictive conversational AI models in low-resource languages.

2. Related Works

This section presents a critical analysis of previously published works on conversational AI, intent modeling, self-supervised learning, JEPA architectures, and low-resource language NLP. The review focuses on the Amharic language, dialogue models, and predictive representation learning, and concludes by synthesizing findings and identifying gaps in existing knowledge.

2.1. Evolution of Conversational AI and Predictive Dialogue Modeling

Recent advances in conversational artificial intelligence have shifted research attention from surface-level response generation toward deeper semantic reasoning and anticipatory interaction. Early conversational models primarily optimized lexical overlap metrics such as BLEU and ROUGE, emphasizing token-level generation quality rather than communicative alignment [34,38].

However, recent work questions the adequacy of these metrics for dialogue understanding, particularly in low-resource conversational tasks where semantic intent frequently diverges from lexical similarity. According to several recent analyses, evaluation paradigms increasingly prioritize intent alignment, contextual coherence, and latent semantic consistency over exact token reconstruction [20,39].

This transition has become particularly important for low-resource languages such as Amharic, where conventional large language models are data-inefficient and depend on high-performance computing (HPC) resources. Previous studies have shown that transformer-based generative models require extensive target-language corpora to achieve stable semantic generation [5,6,15]. In contrast, low-resource African languages remain constrained by limited digitized corpora and insufficient computational infrastructure [25,30]. As noted by several scholars, these limitations are amplified in Ethiopic-script languages because tokenization and morphological segmentation substantially increase vocabulary sparsity and computational overhead.

Research trends have increasingly emphasized effective and robust conversational architectures capable of operating in resource-constrained environments. Edge-AI deployment requirements in emerging digital ecosystems, particularly across East Africa, have intensified the need for lightweight dialogue models executable on consumer-grade hardware. At the same time, generative conversational models continue to exhibit hallucination-related instability, especially under morphologically rich and low-resource conditions. Consequently, recent studies have explored non-generative predictive architectures that focus on semantic abstraction rather than token-by-token generation.

This shift culminated in the growing interest in predictive world models and latent reasoning frameworks. More recently, research trends moved from next-token prediction toward what several scholars describe as “next-thought prediction,” where models forecast future semantic states instead of reconstructing explicit surface forms. Within this context, JEPA-style architectures emerged as promising alternatives because they prioritize latent semantic forecasting and representation stability. These developments provide the conceptual foundation for the present study, which investigates predictive latent-space learning for Amharic conversational intent forecasting.

2.2. Intent Detection and Dialogue State Tracking

Intent detection and dialogue state tracking constitute foundational tasks in conversational AI. Early neural architectures employed convolutional and recurrent encoders to model semantic intent from dialogue utterances [19]. Subsequent transformer-based approaches substantially improved classification performance across benchmark datasets such as ATIS and SNIPS. BERT-based models [4,20] demonstrated strong contextual representation capabilities and achieved state-of-the-art intent classification accuracy. However, these approaches remain heavily dependent on large annotated datasets.

To address annotation scarcity, few-shot intent detection methods introduced meta-learning and transfer-learning strategies that reduce supervision requirements [22]. Although these approaches improved generalization under limited-data conditions, their effectiveness still relies largely on English-centric pre-training corpora. Similarly, dialogue state tracking models [3,21] extended conversational understanding by maintaining structured belief states across multi-turn interactions. These frameworks proved effective in task-oriented dialogue environments with predefined slot-value ontologies.

However, several limitations persist. Existing dialogue state tracking methods assume relatively stable semantic structures and fixed ontologies, assumptions that do not generalize effectively to open-domain Amharic dialogue. In contrast, morphologically rich conversational tasks often exhibit flexible syntax, lexical variation, and implicit pragmatic cues.

Furthermore, prior studies primarily focus on intent recognition after user input is observed rather than forecasting future communicative objectives. In contrast, the present study investigates predictive intent forecasting by estimating the latent representation of the next conversational turn before the user speaks. This distinction shifts the modeling objective from reactive classification toward anticipatory semantic reasoning.

2.3. Self-Supervised and Contrastive Representation Learning

Self-supervised learning has become a dominant paradigm in natural language representation learning because it exploits large quantities of unlabeled text to capture semantic structure. BERT [4] established masked language modeling as a foundational SSL objective for contextual representation learning. Subsequently, GPT [14] and T5 [13] extended this paradigm through causal and sequence-to-sequence objectives, enabling large-scale generative modeling across diverse downstream tasks.

Parallel developments in contrastive learning introduced alternative representation learning mechanisms based on similarity optimization. SimCLR [7] and MoCo [16] demonstrated that semantic representations improve when models attract positive samples while separating negative examples within latent space. Similarly, Sentence-BERT [17] adapted contrastive objectives for sentence embedding and semantic similarity tasks. Previous studies have shown that contrastive objectives often produce semantically meaningful embeddings with strong transfer performance.

Despite these advances, contrastive learning introduces important limitations for low-resource and morphologically complex languages. Effective contrastive learning depends heavily on augmentation design and hard-negative sampling strategies, both of which remain difficult to construct reliably for Amharic text. Morphological richness frequently alters surface forms without changing semantic meaning, thereby complicating positive-negative pair construction.

In addition, token-level perturbations often introduce semantic distortion rather than meaningful augmentation. Recent analyses of multilingual representation learning provide a strong critique of augmentation-sensitive contrastive frameworks in underrepresented languages because they fail to capture deeper semantic invariances.

To address these limitations, predictive latent-space learning offers an alternative formulation. Rather than contrasting explicit sample pairs, latent predictive models estimate future semantic representations directly within embedding space. This strategy reduces dependence on handcrafted augmentations and encourages models to capture abstract semantic structure beyond surface-level lexical variation.

2.4. JEPA and Predictive World Models

Predictive world models have emerged as an influential direction within self-supervised representation learning. LeCun [9] argued that intelligent models should learn abstract world models capable of predicting future latent states rather than reconstructing raw sensory inputs. According to LeCun, latent prediction encourages models to capture semantic structure, causal dependencies, and contextual regularities while avoiding unnecessary reconstruction detail.

I-JEPA [10] operationalized this principle for visual representation learning by employing an online encoder that predicts masked latent representations generated by an exponential moving average target encoder. The study found that latent predictive learning produces transferable and semantically robust representations without explicit generative reconstruction. Similarly, V-JEPA [11] extended this framework to video understanding by modeling temporal semantic evolution across visual sequences. Data2Vec [12] further generalized latent prediction across speech, vision, and text modalities, demonstrating the scalability of predictive representation learning.

Collectively, these studies suggest that predictive latent-space learning consistently captures higher-level semantic abstractions more effectively than reconstruction-based objectives. However, the existing literature remains concentrated primarily on vision and multimodal representation learning [24,36]. To understand the mechanisms of predictive semantic modeling and its implications for dialogue models, several recent studies have examined latent prediction in sequential tasks. Nevertheless, no prior work has systematically applied JEPA-style architectures to conversational text forecasting within a low-resource multilingual domain such as Amharic. This gap is particularly significant for languages such as Amharic, where generative modeling remains computationally expensive and semantically unstable.

2.5. Amharic and African NLP Research

Research in African natural language processing has expanded substantially during the past decade, although major disparities persist between high-resource and low-resource language technologies. The authors of [26] released parallel corpora for Ethiopian languages, providing foundational resources for multilingual modeling. Similarly, the authors of [27] documented annotation efforts for Amharic named entity recognition and highlighted the linguistic complexity associated with Ethiopian Semitic languages.

Subsequent studies increasingly emphasized African-centric pre-training strategies. The authors of [28] showed that continued pre-training on African textual corpora substantially improves downstream task performance. Likewise, AfriBERTa [29] and AfroLM [30] introduced dedicated African language models trained on multilingual African corpora. These studies demonstrated that region-specific pre-training improves semantic alignment and representation quality for underrepresented languages.

However, significant limitations remain. Existing African NLP research focuses predominantly on classification, translation, and token-level generation tasks rather than predictive conversational reasoning. In addition, publicly available conversational benchmarks for Amharic remain scarce, limiting reproducibility and comparative evaluation. The authors of [25] strongly advocated participatory approaches to African NLP, emphasizing the importance of community-driven datasets, open benchmarks, and culturally grounded language technologies. The present study aligns with this direction by introducing a reproducible benchmark for Amharic conversational intent forecasting.

2.6. Evaluation Metrics for Predictive Conversational Modeling

Recent conversational AI research increasingly emphasizes semantic evaluation metrics that measure representation quality beyond lexical similarity. Within predictive latent-space frameworks, Latent Semantic Coherence (LSC) measures the cosine similarity between predicted latent vectors and target latent representations generated by a frozen encoder. High LSC values indicate successful semantic anticipation of future conversational turns and provide evidence of effective world-model learning.

Similarly, Intent Transition Accuracy (ITA) evaluates whether predicted latent representations encode sufficient information to infer the communicative function of upcoming dialogue turns. Previous studies have shown that latent semantic representations frequently preserve functional conversational structure even when surface lexical forms vary substantially.

Noise Invariance Score (NIS) measures the robustness of latent predictions under synthetic perturbations such as typographical errors, filler expressions, and word-order variation. This metric is particularly important for morphologically rich conversational environments where surface variation frequently masks semantic consistency. In contrast to autoregressive language models, predictive latent architectures aim to preserve stable semantic representations despite input noise.

Zero-Shot Domain Transfer (ZSDT) further evaluates whether predictive models learn generalized interaction dynamics rather than memorizing domain-specific patterns. By training on open conversational corpora and evaluating on specialized domains without additional fine-tuning, ZSDT measures cross-domain semantic transferability. Finally, the Multi-Turn Context Compression Ratio evaluates representational efficiency by comparing latent vector dimensionality against raw token-context length. A high compression ratio with preserved semantic coherence indicates efficient information distillation and reduced computational complexity.

Table 1 summarizes the principal evaluation dimensions used in predictive conversational modeling.

2.7. Synthesis of the Literature and Identified Research Gaps

The analysis of previously published research works demonstrates substantial progress in conversational AI, self-supervised representation learning, and African NLP. Previous studies have shown that transformer-based architectures and contrastive learning frameworks significantly improve semantic representation quality [4,7,15,17]. Similarly, predictive latent-space models such as JEPA and Data2Vec demonstrate strong capacity for abstract semantic reasoning and transferable representation learning [10,11,12].

However, several unresolved gaps remain at the intersection of these domains. First, existing intent detection and dialogue state tracking models primarily operate as reactive classifiers rather than predictive conversational reasoners. Second, current generative conversational architectures remain computationally intensive and vulnerable to hallucination under low-resource conditions. Third, contrastive learning methods exhibit limited robustness in morphologically rich languages because they depend heavily on augmentation quality and hard-negative construction. Fourth, JEPA-style latent predictive architectures have not been systematically explored for sequential conversational text forecasting. Fifth, African NLP research continues to lack standardized conversational benchmarks that evaluate semantic robustness, intent forecasting, and latent representation quality in Amharic dialogue tasks.

These gaps collectively motivate the present paper, which investigates whether predictive latent-space learning provides a robust, computationally efficient, and semantically coherent framework for conversational intent forecasting in low-resource Amharic dialogue models.

3. Methodology

The methodology is designed to evaluate whether semantic representations of future conversational turns can be predicted without autoregressive text generation. This study employs an experimental design with a mixed-methods approach and implements a predictive latent-space learning method for conversational intent forecasting in Amharic dialogue models.

Figure 1 shows the proposed JEPA pipeline. It builds upon the Joint-Embedding Predictive Architecture (JEPA) paradigm introduced by LeCun [9] and operationalized in I-JEPA [10]. Unlike conventional generative dialogue models, the proposed approach predicts latent semantic representations directly within embedding space rather than generating explicit textual responses.

The proposed pipeline employs a frozen mT5 encoder and generates 512-dimensional contextual embeddings from dialogue turns. An online predictor estimates the next turn embedding from the current dialogue context. Sequential vector prediction replaces autoregressive decoding and reduces inference cost during response modeling. An EMA target encoder provides stable reference embeddings through exponential moving average updates during training. Latent mean squared error minimizes squared differences between predicted vectors and target vectors across embedding dimensions.

The framework targets Amharic dialogue response modeling. Training focuses on dialogue transitions in embedding space rather than token generation. This design supports lower computational cost and stable representation learning for multilingual dialogue data.

3.1. Model Architecture

The proposed JEPA pipeline models conversational progression as a latent prediction task. Given the dialogue context at turn n, the model predicts the semantic embedding for turn n + 1. This approach predicts semantic continuity between dialogue states and avoids token-level text generation. The architecture contains three modules: a frozen multilingual encoder, an online transformer predictor, and an exponential moving average target branch.

The frozen encoder converts each dialogue utterance into a fixed-dimensional semantic vector. The online transformer predictor receives the embedding from the current dialogue turn and estimates the latent representation of the subsequent turn. During training, the exponential moving average target branch produces reference embeddings for optimization and training stability.

The predictive process is formally expressed as:

\hat{e}_{n+1} = f_{\theta}(e_{n})

where e_{n} \in \mathbb{R}^{512} denotes the latent embedding of the current dialogue turn and \hat{e}_{n+1} represents the predicted embedding of the next conversational turn.

The framework does not perform token reconstruction or response generation at any stage. This design reduces computational complexity and mitigates the instability commonly associated with autoregressive decoding in low-resource conversational settings.

3.1.1. Frozen Encoder

The study employs the multilingual transformer model google/mt5-small [6] as the shared encoder backbone. The encoder contains 146.9 million parameters, all of which remain frozen throughout training. Freezing the encoder isolates the effect of latent predictive learning while substantially reducing trainable parameter requirements.

For a tokenized input sequence (x_{1}, x_{2}, \dots, x_{L}), the encoder produces contextual hidden states H \in \mathbb{R}^{L \times 512}. A masked mean-pooling strategy aggregates token-level representations into a fixed-length sentence embedding:

e = \frac{\sum_{l=1}^{L} m_{l} H_{l}}{\sum_{l=1}^{L} m_{l}}

where m_{l} \in \{0, 1\} is the attention mask at token position l and H_{l} is the l-th row of H. The resulting vector e \in \mathbb{R}^{512} serves as the semantic representation of the dialogue turn. Masked mean pooling was selected because it provides stable sentence-level embeddings while remaining computationally efficient for low-resource deployment conditions.
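The pooling step can be illustrated with a minimal sketch. It is written in plain Python so that it runs without a tensor library; the function name `masked_mean_pool` and the toy values are illustrative assumptions, and a real pipeline would apply the same operation to batched encoder outputs.

```python
from typing import List, Sequence


def masked_mean_pool(hidden: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the rows of `hidden` (L x d) whose mask entry is 1."""
    if len(hidden) != len(mask):
        raise ValueError("hidden states and mask must have the same length")
    kept = sum(mask)
    if kept == 0:
        raise ValueError("mask selects no tokens")
    dim = len(hidden[0])
    pooled = [0.0] * dim
    for row, m in zip(hidden, mask):
        if m:  # padding positions (m == 0) contribute nothing
            for j in range(dim):
                pooled[j] += row[j]
    return [value / kept for value in pooled]


# Toy example: 3 token positions, d = 2 (the paper uses d = 512).
# The last position is padding and must not influence the result.
H = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
m = [1, 1, 0]
e = masked_mean_pool(H, m)
print(e)  # [2.0, 3.0]
```

Dividing by the number of unmasked tokens rather than by L keeps embeddings of short and long utterances on the same scale.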

3.1.2. Online Predictor Architecture

The online predictor f_{\theta} maps the embedding of the current dialogue turn to a prediction of the subsequent turn embedding. The predictor constitutes the only trainable component of the framework. The architecture consists of three stacked TransformerEncoderLayer blocks configured as shown in Table 2.

Table 2. Configuration of the three stacked TransformerEncoderLayer blocks.

Hyperparameter	Value
Hidden dimension (d_model)	256
Attention heads	8
Feed-forward dimension	1024
Dropout	0.1
Normalization	Pre-layer normalization
Transformer layers	3

The transformer stack computes:

h' = \mathrm{TransformerEncoder}(h; \theta_{T})

To improve representational compression and semantic abstraction, the architecture incorporates an hourglass bottleneck module:

z = W_{up} \, \mathrm{GELU}(W_{down} h')

where W_{up} \in \mathbb{R}^{256 \times B} and W_{down} \in \mathbb{R}^{B \times 256}.

The bottleneck dimension B was selected through a hyperparameter search over B \in \{64, 128, 256, 512\}.

All linear layers use Xavier uniform initialization. The optimal configuration (B=256) produced a total of 2,633,216 trainable parameters. This lightweight architecture was intentionally designed to support deployment under CPU-only and resource-constrained environments.
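A back-of-the-envelope count helps relate the predictor hyperparameters to the reported model size. The sketch below assumes each block follows the parameter layout of PyTorch's `nn.TransformerEncoderLayer` (fused query/key/value projection, attention output projection, two feed-forward linear layers, and two LayerNorms, all with biases). That layout is an assumption, since the text names the block type but not its exact configuration. Under it, the three layers account for most of the reported trainable parameters, and the remainder would belong to modules outside the stack, such as projections and the bottleneck, whose shapes are not fully specified here.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional bias of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one layer, assuming the nn.TransformerEncoderLayer layout."""
    qkv = linear_params(d_model, 3 * d_model)   # fused query/key/value projection
    attn_out = linear_params(d_model, d_model)  # attention output projection
    ff = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    norms = 2 * (2 * d_model)                   # two LayerNorms, weight + bias each
    return qkv + attn_out + ff + norms


per_layer = encoder_layer_params(d_model=256, d_ff=1024)
stack = 3 * per_layer
reported_total = 2_633_216   # trainable parameters reported for B = 256
gpt2_params = 124_400_000    # GPT-2 baseline size quoted in the abstract (124.4M)

print(per_layer)                                      # 789760
print(stack)                                          # 2369280
print(reported_total - stack)                         # 263936
print(round(100 * reported_total / gpt2_params, 1))   # 2.1
```

The number of attention heads does not appear in the count because splitting the projections into heads does not change their total size.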

3.1.3. EMA Target Branch

Following I-JEPA [10], the framework maintains an exponential moving average copy \tilde{\theta} of the predictor weights \theta. The EMA parameters update after each optimization step according to:

\tilde{\theta} \leftarrow \tau \tilde{\theta} + (1 - \tau) \theta

where \tau = 0.996.

All EMA computations execute within torch.no_grad() mode to prevent gradient propagation through the target branch.
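The update rule can be sketched without any deep-learning framework. In the snippet below, parameters are stored as a dictionary of float lists; this container and the name `ema_update` are illustrative assumptions, not the authors' implementation, which operates on model tensors.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(target: Params, online: Params, tau: float = 0.996) -> None:
    """In-place update: target <- tau * target + (1 - tau) * online."""
    for name, online_values in online.items():
        target_values = target[name]
        for i, w in enumerate(online_values):
            target_values[i] = tau * target_values[i] + (1.0 - tau) * w


online = {"w": [1.0, -2.0]}
target = {"w": [0.0, 0.0]}
ema_update(target, online)
# Each target weight has moved 0.4% of the way toward its online counterpart.
print(target)
```

With tau close to 1, the target branch changes slowly, which is what makes it a stable regression target for the online predictor.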

The EMA mechanism stabilizes latent targets during training and reduces representational collapse, consistent with findings reported in BYOL [8]. Stable latent supervision is particularly important in low-resource dialogue tasks where small datasets increase optimization sensitivity.

3.1.4. Training Objective

The framework optimizes mean squared error (MSE) between predicted and target latent representations:

\mathcal{L}(\theta) = \frac{1}{B_{s}} \sum_{i=1}^{B_{s}} \left\| f_{\theta}(e_{n}^{(i)}) - \tilde{f}_{\tilde{\theta}}(e_{n+1}^{(i)}) \right\|_{2}^{2}

where B_{s} = 32 denotes the minibatch size.

Latent-space prediction avoids the combinatorial complexity of token-level generation while encouraging the model to learn semantic continuity between conversational turns [9]. This objective directly aligns with the study’s focus on intent forecasting rather than surface-level response generation.
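As a worked illustration of the objective, the following sketch computes the batch-averaged squared L2 distance between predicted and target vectors in plain Python. The function name `latent_mse` and the toy vectors are assumptions for illustration only.

```python
from typing import Sequence


def latent_mse(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean over the batch of the squared L2 distance between paired vectors."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally sized")
    total = 0.0
    for p, t in zip(pred, target):
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(pred)


pred = [[1.0, 0.0], [0.0, 2.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
print(latent_mse(pred, target))  # 2.5
```

Note that PyTorch's `nn.MSELoss` with its default `reduction="mean"` also averages over the embedding dimension, so it differs from this sum-over-dimensions form by a constant factor equal to the embedding size. The gradient direction is the same in both cases.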

3.2. Amharic Dialogue Benchmark (ADB-1K)

To support evaluation under low-resource conversational conditions, the study introduces the Amharic Dialogue Benchmark (ADB-1K), a curated corpus containing 1,000 Amharic context-response pairs derived from 75 linguistically reviewed dialogue templates.

The dataset spans five conversational intent categories:

Table 3. Intent categories in ADB-1K.

Intent Category	Number of Pairs	Description
Greeting	200	Opening and closing conversational rituals
Information Seeking	200	Factual questions related to Ethiopia, culture, and logistics
Clarification	200	Requests for explanation or elaboration
Command	200	Imperatives and task-oriented directives
Social Exchange	200	Informal social interaction and opinion exchange

Each record includes a context utterance at turn n, a response utterance at turn n + 1, an intent label, a morphologically perturbed version of the context utterance, and a noisy context variant.

Morphological perturbation replaces one verb root with an alternative inflected form. Noise injection introduces character-level transpositions or filler-token insertions such as hesitation markers. The dataset was partitioned using an 80/10/10 split: 800 samples were used for training, 100 samples for validation, and 100 samples for testing.
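
The noise-injection step can be sketched as follows. This is an assumed, simplified version rather than the generation script used for ADB-1K: the helper names are hypothetical, and `<filler>` is a placeholder standing in for a hesitation marker, since the actual filler tokens are not listed here. The example sentence is taken from Table 9.

```python
import random


def transpose_adjacent(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def insert_filler(text: str, rng: random.Random, filler: str = "<filler>") -> str:
    """Insert a filler token at a randomly chosen word boundary."""
    words = text.split()
    pos = rng.randrange(len(words) + 1)
    return " ".join(words[:pos] + [filler] + words[pos:])


rng = random.Random(42)           # fixed seed so the variants are reproducible
clean = "ሰላም! እንዴት ነህ?"
noisy_swap = transpose_adjacent(clean, rng)
noisy_filler = insert_filler(clean, rng)
```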

All partitions were generated using deterministic shuffling with seed 42. Tokenization employed the SentencePiece vocabulary associated with google/mt5-small, with a maximum sequence length of 64 tokens.
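
A deterministic 80/10/10 partition of this kind can be sketched as below. The function name `split_indices` is hypothetical, and the use of Python's `random.Random` is an assumption; the original pipeline may shuffle with a different generator, so the exact membership of each partition would differ.

```python
import random


def split_indices(n: int, seed: int = 42, train: float = 0.8, val: float = 0.1):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into train/val/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train)
    n_val = round(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(1000)
```

Because the generator is seeded, repeated calls return identical partitions, which is the property the reproducibility claim relies on.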

3.3. Evaluation Metrics

The evaluation protocol measures semantic coherence, intent representation quality, and robustness under linguistic perturbation.

3.3.1. Latent Cosine Similarity (LCS)

Latent Cosine Similarity serves as the primary semantic forecasting metric:

\mathrm{LCS}(\tilde{E}, E^{*}) = \frac{1}{N} \sum_{i=1}^{N} \frac{\tilde{e}_{i}^{T} e_{i}^{*}}{\|\tilde{e}_{i}\| \, \|e_{i}^{*}\|}

High LCS values indicate strong semantic alignment between predicted and target latent representations.
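
For concreteness, a minimal sketch of the metric on plain Python lists is given below. The function names and the toy vectors are illustrative assumptions; a practical implementation would operate on batched tensors.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def latent_cosine_similarity(pred, target) -> float:
    """Average cosine similarity over paired predicted and target embeddings."""
    return sum(cosine(p, t) for p, t in zip(pred, target)) / len(pred)


# Toy example: one exact prediction and one that is 45 degrees off.
pred = [[1.0, 0.0], [1.0, 1.0]]
target = [[1.0, 0.0], [1.0, 0.0]]
lcs = latent_cosine_similarity(pred, target)
```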

3.3.2. Semantic Coherence

For the GPT-2 baseline, semantic coherence was evaluated using BERTScore F1 [18] with xlm-roberta-base embeddings. For JEPA, LCS served as the latent semantic coherence proxy.

3.3.3. Intent Probe Accuracy

A lightweight two-layer multilayer perceptron (MLP) probe classifier was trained using predicted embeddings and corresponding intent labels. Test-set classification accuracy measured the extent to which latent representations encoded task-relevant conversational intent information [35].

3.3.4. Morphological Robustness

Morphological robustness was quantified using the following delta metric:

∆_{mr} = \mathrm{LCS}(f_{θ}(e^{\mathrm{clean}}), E^{*}) - \mathrm{LCS}(f_{θ}(e^{\mathrm{morph}}), E^{*})

Values approaching zero indicate stability under morphological variation.

3.3.5. Noise Robustness

Noise robustness was evaluated analogously by substituting noisy conversational inputs for the morphologically perturbed inputs, yielding Δ_nr. Lower values indicate greater robustness to character-level perturbation and filler-token insertion.
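
Both robustness deltas share one computation: LCS on clean inputs minus LCS on perturbed inputs, measured against the same targets. The sketch below assumes that reading of the definition; the names `mean_cosine` and `robustness_delta` and the toy vectors are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mean_cosine(preds: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Average cosine similarity over paired vectors (the LCS definition)."""
    total = 0.0
    for p, t in zip(preds, targets):
        dot = sum(a * b for a, b in zip(p, t))
        total += dot / (math.hypot(*p) * math.hypot(*t))
    return total / len(preds)


def robustness_delta(clean_preds, perturbed_preds, targets) -> float:
    """LCS on clean inputs minus LCS on perturbed inputs."""
    return mean_cosine(clean_preds, targets) - mean_cosine(perturbed_preds, targets)


targets = [[1.0, 0.0]]
clean_preds = [[1.0, 0.0]]
perturbed_preds = [[1.0, 1.0]]   # toy prediction that drifts after perturbation
delta_unchanged = robustness_delta(clean_preds, clean_preds, targets)
delta_drifted = robustness_delta(clean_preds, perturbed_preds, targets)
```

Under this sign convention a positive delta means the perturbation lowered LCS, zero means no change, and a negative delta means LCS was higher on the perturbed inputs.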

3.4. Experimental Setup

The experimental pipeline followed eight stages. The process initialized deterministic seeds with a value of 42 and generated the ADB-1K dataset. The system then precomputed and cached latent embeddings for all dialogue samples.

The study conducted a 5-epoch bottleneck sweep across bottleneck dimensions B ∈ {64, 128, 256, 512}. After parameter selection, the final JEPA predictor was trained for 10 epochs. The GPT-2 baseline models were trained for 5 epochs under the same evaluation setting.

The evaluation phase tested all models on a held-out test set. The pipeline stored evaluation outputs and reproducibility manifests for later verification and replication.

The predictor was optimized with AdamW [32] using the configuration listed in Table 4.
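
The cosine-annealing schedule can be illustrated with its standard closed form, η_t = η_min + (η_0 - η_min)(1 + cos(π t / T_max)) / 2. The sketch below takes the learning rate as 10^-4, treats the 10^-6 entry in Table 4 as the scheduler's minimum learning rate, and uses T_max = 10; these readings of the table, and the function name, are assumptions.

```python
import math


def cosine_annealed_lr(epoch: int, base_lr: float = 1e-4,
                       eta_min: float = 1e-6, t_max: int = 10) -> float:
    """Closed-form cosine annealing from base_lr at epoch 0 to eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


# Learning rate at the start of each of the 10 epochs, plus the final value.
schedule = [cosine_annealed_lr(e) for e in range(11)]
```

Under these assumptions the rate starts at 10^-4, passes through roughly 5.05 × 10^-5 at the midpoint, and ends at 10^-6.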

3.5. Baseline Models

Random Predictor: The random baseline generated unit-normalized Gaussian noise vectors using a fixed seed. This baseline established chance-level performance thresholds for LCS and intent prediction.
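
A random control of this kind can be sketched as follows. The 512-dimensional size follows the encoder output described in the abstract; the seed value 42, the sample count, and the function name are assumptions, since the text only states that a fixed seed was used.

```python
import math
import random
from typing import List


def random_unit_vectors(n: int, dim: int, seed: int = 42) -> List[List[float]]:
    """Draw n Gaussian vectors of size dim and scale each to unit L2 norm."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


vectors = random_unit_vectors(n=100, dim=512)
```

Independent random directions in a high-dimensional space are nearly orthogonal on average, so a control built this way is expected to score an LCS close to zero.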

GPT-2 Baseline: The autoregressive baseline used the 124.4M-parameter GPT-2 model [14]. Fine-tuning was conducted for five epochs using the AdamW optimizer with a learning rate of 5 × 10^-5 and a batch size of 16. The maximum generation length was set to 50 tokens. For downstream evaluation, all generated responses were encoded using a frozen mT5 encoder to compute LCS scores.

3.6. Implementation Environment

Table 5 summarizes the experimental environment. All experiments ran under CPU-only conditions to evaluate performance in low-resource deployment settings.

The frozen encoder operated exclusively in inference mode. Training time per epoch ranged between 0.26 and 1.1 seconds for JEPA and approximately 900 seconds for GPT-2 fine-tuning. No GPU memory was used during experimentation.

3.7. Ethical Considerations and Reproducibility

ADB-1K was synthetically generated using linguist-reviewed Amharic templates. No real user conversations or personally identifiable information were collected. All templates underwent manual review to remove culturally insensitive or inappropriate content. To support reproducibility, the study fixed all random seeds and preserved deterministic execution environments.

The methodological design is appropriate for the study objectives because it directly evaluates semantic forecasting, representation robustness, and computational efficiency within a controlled low-resource conversational task.

4. Results and Analysis

This section presents the results from the experimental runs. Results are organized by model and configuration. It compares performance across evaluation metrics and dataset conditions. It links observed outcomes to the design choices in the pipeline.

4.1. Bottleneck Dimension Sweep

Table 6 shows the validation LCS at epoch 5 for each bottleneck dimension. B = 256 achieves the highest validation LCS (0.9129) and, as listed in Table 6, the lowest trainable-parameter count (2,633,216). This suggests that 256 dimensions are sufficient to capture the semantic distance between context and next-turn embeddings.

4.2. Training Convergence (B=256, 10 Epochs)

Table 7 presents the convergence of the final JEPA predictor. Validation LCS rises from 0.8151 at epoch 1 to 0.9228 at epoch 10, a gain of 0.108. Training loss decreases from 0.1111 to 0.0608 over epochs 1 to 6, jumps to 0.1330 at epoch 7, and then decreases again to 0.1162 by epoch 10. No overfitting is observed: validation LCS improves across all 10 epochs.

4.3. Quantitative Comparison

Table 8 reports the full evaluation matrix across all models and test conditions.

Latent Cosine Similarity: JEPA achieves LCS 0.0722, 13.4 times higher than the random control (0.0054) and 0.117 absolute above GPT-2 (-0.0448). GPT-2’s negative LCS reflects that its generated Amharic texts are semantically misaligned with reference responses, as expected given its English-only pre-training.

Intent Probe Accuracy: JEPA predicted embeddings support 60% probe accuracy, three times higher than chance (20%) and four times higher than GPT-2 (15%). The model receives no intent labels during training, so this result demonstrates that the JEPA latent prediction objective implicitly learns intent-discriminative structure as a by-product of next-turn forecasting.

Morphological Robustness: ∆_mr= 0.0000 indicates that replacing a verb root with an inflected form produces no measurable LCS change. The mT5 SentencePiece tokenisation represents morphological variants as subword sequences with significant overlap. The frozen encoder absorbs morphological variation, and the predictor preserves this property.

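Noise Robustness: ∆_nr = -0.0061 indicates that LCS is essentially unchanged under character-level noise. The absolute shift is below 0.01, and the negative sign means that LCS on noisy inputs was marginally higher than on clean inputs.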

Parameter Efficiency: JEPA trains 2.6M parameters versus GPT-2’s 124.4M, achieving four times higher intent probe accuracy with about 47 times fewer trainable parameters. This is directly relevant for deployment in environments where GPU memory is limited or unavailable.
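
The efficiency figures follow directly from the parameter counts and accuracies reported in Table 8, as the short check below shows. The variable names are illustrative.

```python
# Values copied from Table 8.
jepa_params = 2_633_216
gpt2_params = 124_439_808
jepa_accuracy = 0.60
gpt2_accuracy = 0.15

reduction_factor = gpt2_params / jepa_params       # how many times fewer parameters
parameter_fraction = jepa_params / gpt2_params     # JEPA size as a share of GPT-2
accuracy_ratio = jepa_accuracy / gpt2_accuracy     # relative intent probe accuracy
```

The reduction factor comes out slightly above 47, the parameter share is about 2.1%, and the accuracy ratio is 4.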

Figure 2 and Figure 3 show that JEPA achieves the highest intent probe accuracy (0.60) with only 2.63 million trainable parameters, whereas GPT-2 reaches 0.15 with more than 124 million parameters. At the same time, test-set LCS remains low in absolute terms for every model (0.0722 for JEPA, 0.0054 for the random control, and -0.0448 for GPT-2; Table 8). Figure 4 indicates that the JEPA objective learns representations that support intent classification without relying on token-level reconstruction or surface-form similarity. Figure 5 shows that the robustness deltas remain close to zero under both morphological variation and noise perturbations. These results indicate that JEPA delivers the highest intent probe accuracy among the compared models while using a substantially smaller parameter budget than GPT-2.

4.4. Statistical Analysis

We compute Cohen’s d for intent probe accuracy between JEPA and GPT-2. With JEPA accuracy 0.60 and GPT-2 accuracy 0.15 over 100 test samples, the pooled effect size is d ≈ 1.0, indicating a large effect [33]. The 95% Wilson confidence interval for JEPA’s 60% accuracy is approximately [50.2%, 69.1%], which lies entirely above chance (20%) and GPT-2’s point estimate.
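
Both statistics can be recomputed from the reported accuracies. The sketch below is one plausible reading, not necessarily the exact procedure used: it treats each accuracy as the mean of 0/1 outcomes with variance p(1 - p), pools the two variances with equal weight, and uses the standard Wilson score interval with z = 1.96. The function names are hypothetical.

```python
import math


def cohens_d_proportions(p1: float, p2: float) -> float:
    """Cohen's d for two accuracies, pooling Bernoulli variances with equal weight."""
    pooled_sd = math.sqrt((p1 * (1.0 - p1) + p2 * (1.0 - p2)) / 2.0)
    return (p1 - p2) / pooled_sd


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


d = cohens_d_proportions(0.60, 0.15)      # JEPA versus GPT-2
low, high = wilson_interval(60, 100)      # 60 correct out of 100 test samples
```

Under these assumptions d is close to 1.05 and the interval runs from roughly 50% to 69%, so its lower bound stays well above both the 20% chance level and the 15% GPT-2 estimate.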

4.5. Qualitative Analysis

Table 9 presents four test examples from the evaluation dataset. GPT-2 produces empty responses or repeated tokens for the Amharic inputs, which indicates poor performance on this language task. JEPA excludes text generation by design, so the table omits generated responses for this model.

5. Discussion

The central finding is that a JEPA-style latent prediction objective produces intent-discriminative representations of Amharic dialogue without text generation or intent labels. Three implications follow.

First, 60% probe accuracy with only 2.6M trainable parameters demonstrates that self-supervised latent prediction is viable for low-resource conversational AI. Practitioners can adopt JEPA as a lightweight intent-feature extractor and attach a small downstream classifier.
Second, zero morphological robustness degradation suggests that the mT5 encoder absorbs morphological variation at the subword level, reducing the need for explicit morphological preprocessing in Amharic NLP pipelines.
Third, GPT-2’s poor performance confirms that English-centric generative models do not transfer to Amharic without language-specific adaptation, validating the design choice to avoid text generation and predict in the latent space of a multilingual model.

6. Future Works

Future research should expand the proposed model through larger Amharic dialogue datasets collected from native speakers through structured crowdsourcing protocols. Larger datasets support broader vocabulary coverage, dialect variation, and improved response accuracy in task-oriented dialogue systems.

A lightweight decoder head could support text response generation at lower computational cost. Multilingual training settings and Amharic-English code-switched inputs could increase coverage of bilingual communication patterns common in digital conversations.

Future evaluation should measure JEPA performance on downstream task-oriented dialogue benchmarks after the release of larger Amharic dialogue corpora. Evaluation metrics should include response relevance, intent accuracy, and dialogue consistency across multilingual and code-switched inputs.

7. Conclusion

In conclusion, this study addressed the challenges of limited high-performance computing (HPC) access and scarce large-scale corpora by introducing JEPA, a text-based Joint-Embedding Predictive Architecture for Amharic dialogue. JEPA predicts the latent embedding of the next dialogue turn from the current turn, using a frozen multilingual encoder and a compact 3-layer Transformer predictor trained with an EMA target branch.

The findings confirm that JEPA on ADB-1K achieves a final validation LCS of 0.9228, a test intent probe accuracy of 60%, and zero morphological robustness degradation. It outperforms a fine-tuned GPT-2 baseline by four times on intent probe accuracy while using 47 times fewer trainable parameters.

These results establish latent prediction as a practical and efficient approach to low-resource dialogue representation learning. JEPA requires no labeled intent data, no large GPU clusters, and no text decoding, making it deployable in resource-constrained Ethiopian NLP contexts. We release all code, data, and reproducibility artifacts to support community follow-on research.

Author Contributions

All authors contributed equally to the study. All authors participated in conceptualization, methodology development, manuscript writing, review, and editing. All authors have read and approved the final version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors thank the Hugging Face and PyTorch communities for providing open-source models, libraries, and development tools used in this study. During the preparation of this manuscript, the authors used Generative AI tools, including ChatGPT, for brainstorming, language refinement, and clarification of ideas. The authors reviewed and edited all generated content and take full responsibility for the content of this publication.

Ethical Considerations

This study used synthetic dialogue data generated from linguist-reviewed templates. No human conversation data were included in the dataset. No personal information was collected or processed. The templates were reviewed to avoid culturally sensitive or inappropriate content. No external AI model was used for data analysis or interpretation of the results.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Amharic Dialogue Benchmark (ADB-1K), JEPA source code, and training logs are available from the authors upon request. The repository includes a pinned requirements file, fixed random seed configuration, and detailed instructions for reproducing the experiments. The code runs on CPU-based systems with at least 8 GB of RAM and Python 3.10 or later.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

JEPA	Joint-Embedding Predictive Architecture
ADB	Amharic Dialogue Benchmark
SSL	self-supervised learning
LSC	Latent Semantic Coherence
AI	Artificial Intelligence
ITA	Intent Transition Accuracy
NIS	Noise Invariance Score
ZSDT	Zero-Shot Domain Transfer

References

Young, S., Gašić, M., Thomson, B., and Williams, J. D. (2013). “POMDP-based statistical spoken dialogue systems: A review.” Proceedings of the IEEE, 101(5), 1160–1179.
Henderson, M., Thomson, B., and Young, S. (2014). “Word-based dialogue state tracking with recurrent neural networks.” In Proceedings of the 15th Annual Meeting of SIGDIAL, Philadelphia, PA, pp. 292–299.
Wu, C.-S., Madotto, A., Hosseini-Asl, E., Xiong, C., Socher, R., and Fung, P. (2019). “Transferable multi-domain state generator for task-oriented dialogue systems.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, pp. 808–819.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: Pre-training of deep bidirectional transformers for language understanding.” In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, pp. 4171–4186.
Brown, T. B. et al. (2020). “Language models are few-shot learners.” In Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901.
Xue, L. et al. (2021). “mT5: A massively multilingual pre-trained text-to-text transformer.” In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Online, pp. 483–498. https://doi.org/10.18653/v1/2021.naacl-main.41.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). “A simple framework for contrastive learning of visual representations.” In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 1597–1607.
Grill, J.-B. et al. (2020). “Bootstrap your own latent: A new approach to self-supervised learning.” In Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 21271–21284.
LeCun, Y. (2022). “A path towards autonomous machine intelligence.” OpenReview Preprint, June 2022. [Online]. Available: https://openreview.net/pdf?id=BZ5a1r-kVsf.
Assran, M. et al. (2023). “Self-supervised learning from images with a joint-embedding predictive architecture.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, pp. 15619–15629.
Bardes, A. et al. (2024). “Revisiting feature prediction for learning visual representations from video.” arXiv preprint arXiv:2404.08471.
Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M. (2022). “data2vec: A general framework for self-supervised learning in speech, vision and language.” In Proceedings of the 39th International Conference on Machine Learning (ICML), pp. 1891–1903.
Raffel, C. et al. (2020). “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of Machine Learning Research, 21(1), 5485–5551.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). “Language models are unsupervised multitask learners.” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
Vaswani, A. et al. (2017). “Attention is all you need.” In Advances in Neural Information Processing Systems (NeurIPS), vol. 30, pp. 5998–6008.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). “Momentum contrast for unsupervised visual representation learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, pp. 9729–9738.
Reimers, N., and Gurevych, I. (2019). “Sentence-BERT: Sentence embeddings using Siamese BERT-networks.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, pp. 3982–3992.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2020). “BERTScore: Evaluating text generation with BERT.” In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.
Liu, B., and Lane, I. (2016). “Attention-based recurrent neural network models for joint intent detection and slot filling.” In Proceedings of Interspeech, San Francisco, CA, pp. 685–689.
Chen, Q., Zhuo, Z.-H., and Wang, W. (2019). “BERT for joint intent classification and slot filling.” arXiv preprint arXiv:1902.10909.
Zhang, Y. et al. (2019). “Find or classify? Dual strategy for slot filling in natural language understanding.” In Proceedings of the 8th Joint Conference on Lexical and Computational Semantics (*SEM), Minneapolis, MN, pp. 154–164.
Hou, Y. et al. (2020). “Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, pp. 1381–1393.
Zhang, J. et al. (2021). “A survey on neural network-based natural language processing in financial intelligence.” ACM Transactions on Intelligent Systems and Technology, 12(1), 1–27.
Firdaus, M., Chauhan, H., Ekbal, A., and Bhattacharyya, P. (2021). “MEISD: A multimodal multi-label fine-grained emotion dialogue dataset for emotion recognition in conversation.” In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 4703–4714.
Nekoto, W. et al. (2020). “Participatory research for low-resourced machine translation: A case study in African languages.” In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2144–2160.
Abate, S. T. et al. (2020). “Parallel corpora for bilingual English–Ethiopian languages statistical machine translation.” In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 3102–3111.
Yimam, S. M. et al. (2021). “EthioNLP shared task: Sentiment analysis for Ethiopian languages.” In Proceedings of the EthioNLP Workshop, EACL 2021, Online, pp. 71–80.
Alabi, J. O. et al. (2022). “Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning.” In Proceedings of the 29th International Conference on Computational Linguistics (COLING), Gyeongju, Republic of Korea, pp. 4336–4349.
Ogueji, K., Zhu, Y., and Lin, J. (2021). “Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages.” In Proceedings of the 1st Workshop on Multilingual Knowledge Base and Machine Translation (MKBMT), pp. 1–10.
Adebara, I., and ELMahdawy, M. (2022). “AfroLM: A self-active learning-based multilingual pretrained language model for 23 African languages.” In Proceedings of the 2nd AfricaNLP Workshop, Dublin, Ireland, pp. 1–11.
Bapna, A. et al. (2022). “Building machine translation systems for the next thousand languages.” arXiv preprint arXiv:2205.03983.
Loshchilov, I., and Hutter, F. (2019). “Decoupled weight decay regularization.” In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). “BLEU: A method for automatic evaluation of machine translation.” In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, pp. 311–318.
Henderson, M. et al. (2020). “ConveRT: Efficient and accurate conversational representations from transformers.” In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2161–2174.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022). “Masked autoencoders are scalable vision learners.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, pp. 16000–16009.
Conneau, A. et al. (2020). “Unsupervised cross-lingual representation learning at scale.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, pp. 8440–8451.
Lin, C.-Y. (2004). “ROUGE: A package for automatic evaluation of summaries.” In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, pp. 74–81.
Cha, P., Li, H., Shen, Z., Lin, Y., Ma, J., and Liu, F. (2024). “Assessing semantic alignment in large language models through adaptive contextual synthesis.” Authorea Preprints.

Figure 1. Overall architecture of the proposed framework.

Figure 2. Intent Probe Accuracy.

Figure 3. Parameter Efficiency Comparison.

Figure 4. Latent representation quality.

Figure 5. JEPA robustness comparison.

Table 1.

Metric	Objective	Significance
Latent Semantic Coherence (LSC)	Measures cosine similarity between predicted and target latent vectors	Evaluates semantic forecasting capability
Intent Transition Accuracy (ITA)	Measures alignment between predicted latent intent and target conversational function	Evaluates intent-level semantic reasoning
Noise Invariance Score (NIS)	Measures representation stability under noisy input perturbations	Evaluates robustness to conversational variation
Zero-Shot Domain Transfer (ZSDT)	Measures generalization across unseen dialogue domains	Evaluates the transferability of learned interaction dynamics
Multi-Turn Context Compression Ratio	Measures latent representation efficiency relative to token context size	Evaluates semantic compression efficiency

Table 4. Optimization Configuration.

Parameter	Value
Learning rate	10^-4
Weight decay	10^-4
Batch size	32
Gradient clipping	1.0
Scheduler	Cosine annealing
T_max	10
η_min	10^-6

Table 5. Experimental Environment.

Software Component	Version
Python	3.13.0
PyTorch	2.12.0
Transformers	4.57.6
bert-score	0.3.13

Table 6. Bottleneck Dimension Sweep (5-Epoch Validation LCS).

B	Val. LCS	Trainable Params	Rank
64	0.8751	2,666,304	4
128	0.9024	2,699,136	3
256	0.9129	2,633,216	1
512	0.9087	2,896,128	2

Table 7. JEPA Training Convergence (B = 256).

Epoch	Train Loss	Val. Loss	Train LCS	Val. LCS
1	0.1111	0.0638	0.7125	0.8151
2	0.0819	0.0521	0.7675	0.8424
3	0.0695	0.0466	0.7998	0.8604
4	0.0618	0.0441	0.8248	0.8697
5	0.0608	0.0423	0.8299	0.8751
6	0.0608	0.0422	0.8300	0.8752
7	0.1330	0.1024	0.8812	0.9107
8	0.1277	0.0971	0.8857	0.9151
9	0.1229	0.0923	0.8899	0.9191
10	0.1162	0.0879	0.8957	0.9228

Table 8. Full Evaluation Comparison. LCS = Latent Cosine Similarity. †: LCS proxy (no text decoded).

Metric	Random (Control)	GPT-2 (Autoregressive)	JEPA (Ours)
Latent Cosine Similarity (LCS)	0.0054	-0.0448	0.0722
Semantic Coherence (BERTScore F1)	---	0.0404	0.0722†
Intent Probe Accuracy	0.2000	0.1500	0.6000
Morph. Robustness Δmr	---	---	0.0000
Noise Robustness Δnr	---	---	-0.0061
Trainable Parameters	0	124,439,808	2,633,216
Parameter Reduction vs. GPT-2	---	1×	47× fewer

Table 9. Qualitative Examples: GPT-2 Generation vs. Reference.

Context (Turn n)	Reference (Turn n+1)	GPT-2 Output
ምን ማለትህ ነው?	አሁን ያልኩትን ዳግም ላብራራ	(empty)
ሰላም! እንዴት ነህ?	ጥሩ ነኝ፣ አመሰግናለሁ!	እንዴት ነህት ነህት
ድምጹን አቁም!	ድምጹ ቀዘዘ።	(loop)
ኢትዮጵያ ምን ዓይነት ምርት ታወጣለች?	ቡና፣ ጤፍ፣ ቅርጫ...	(empty)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.