1. Introduction
Conversational artificial intelligence has shifted toward dialogue models that predict user goals before users state them [1]. Many current models process user utterances and estimate the next conversational objective. Researchers describe this task as intent forecasting.
Intent forecasting differs from intent classification. Intent classification assigns labels to completed utterances. Intent forecasting predicts the communicative function of a future dialogue turn. This approach supports earlier model responses, lower interaction delay, and improved dialogue planning in customer support, education, agriculture, healthcare, and virtual assistants.
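The distinction can be illustrated by how training pairs are built from the same annotated dialogue. The following is a minimal sketch, not the pipeline used in this study; the function names, the English placeholder utterances, and the intent labels are all hypothetical.

```python
from typing import List, Tuple

# A dialogue turn as (utterance, intent label). Labels are hypothetical.
Turn = Tuple[str, str]


def classification_pairs(dialogue: List[Turn]) -> List[Tuple[str, str]]:
    """Intent classification: pair each utterance with its OWN intent."""
    return [(utterance, intent) for utterance, intent in dialogue]


def forecasting_pairs(dialogue: List[Turn]) -> List[Tuple[str, str]]:
    """Intent forecasting: pair the utterance at turn n with the intent of turn n+1."""
    return [
        (dialogue[n][0], dialogue[n + 1][1])
        for n in range(len(dialogue) - 1)
    ]


# Placeholder dialogue for illustration only.
dialogue = [
    ("Hello, I need help with my order.", "greet_and_request_help"),
    ("It has not arrived yet.", "report_delay"),
    ("Can I get a refund?", "request_refund"),
]

for source, target in forecasting_pairs(dialogue):
    print(f"{source!r} -> next intent: {target}")
```

A dialogue of k turns yields k classification pairs but only k-1 forecasting pairs, because the final turn has no successor to predict.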
Early goal-oriented dialogue models depended on manually designed rules and domain-specific ontologies [1]. Subsequent advances in statistical dialogue state tracking [2] and neural sequence modeling [3] substantially improved performance in high-resource languages. Despite these advances, the effectiveness of such models remains strongly dependent on large-scale annotated corpora, which are largely unavailable for low-resource languages.
Recent advances in self-supervised learning (SSL) have transformed representation learning by using unlabeled data to capture semantic structure without extensive manual annotation [4, 5, 6]. Joint-embedding approaches, including the contrastive method SimCLR [7] and the negative-free method BYOL [8], demonstrated that robust latent representations emerge from objectives that do not require explicit class supervision.
Building on this line of work, the Joint-Embedding Predictive Architecture (JEPA), introduced by LeCun [9] and operationalized in I-JEPA [10], proposed a predictive framework centered on latent-space forecasting rather than direct token or pixel reconstruction. By predicting abstract representations instead of surface-level outputs, JEPA prioritizes high-level semantic structure while avoiding the computational and combinatorial burdens associated with autoregressive generation. This predictive latent-space paradigm offers a compelling direction for conversational AI, particularly in environments where annotated resources are scarce and efficient representation learning is essential.
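As an illustration only, a JEPA-style objective in the spirit of I-JEPA [10] can be written for consecutive dialogue turns as follows. The notation is an assumption introduced here and is not taken from the method described later: $x_n$ is the input turn, $f_\theta$ a context encoder, $f_{\bar\theta}$ a target encoder whose weights are an exponential moving average of $\theta$ with momentum $\tau$, $g_\phi$ a predictor, and $\operatorname{sg}(\cdot)$ a stop-gradient.

```latex
\begin{aligned}
\hat{s}_{n+1} &= g_\phi\big(f_\theta(x_n)\big), \\
\mathcal{L}(\theta,\phi) &= \big\lVert \hat{s}_{n+1} - \operatorname{sg}\big(f_{\bar\theta}(x_{n+1})\big) \big\rVert_2^2, \\
\bar\theta &\leftarrow \tau\,\bar\theta + (1-\tau)\,\theta .
\end{aligned}
```

The loss compares vectors in representation space, so no token of turn $n+1$ is ever decoded. This is the sense in which the approach avoids autoregressive generation.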
These challenges are especially pronounced in Amharic, a morphologically rich Semitic language that remains underrepresented in natural language processing research. Amharic conversational models face three interrelated constraints.
First, the language exhibits extensive morphological variation. A single verb root can generate hundreds of inflected forms through complex prefixation and suffixation processes [27]. Such variation increases lexical sparsity and complicates semantic generalization.
Second, publicly available Amharic dialogue datasets remain extremely limited in both scale and diversity [25, 26]. The scarcity of high-quality conversational corpora constrains the development of robust data-driven dialogue models.
Third, dominant multilingual pre-training frameworks are primarily optimized for English-centric corpora, producing representations that inadequately capture the linguistic properties of Ethiopic script and Semitic grammatical structures [28]. As a result, transferred representations often fail to preserve culturally grounded pragmatic and semantic information.
Existing approaches to Amharic dialogue modeling have attempted to mitigate these limitations through multilingual fine-tuning [31] or translation-based pipelines. However, these strategies remain constrained by structural mismatches between source and target languages and frequently introduce semantic degradation during translation. More importantly, current models largely focus on response generation or intent classification rather than intent forecasting. Consequently, they do not address the central challenge of predicting, from dialogue turn n, the communicative objective of turn n+1 before the user explicitly expresses it. This limitation restricts the ability of conversational models to engage proactively and maintain coherent long-term interaction.
The existing literature reveals several research gaps that hinder progress in low-resource conversational AI. First, predictive latent-space learning frameworks based on JEPA principles have not been systematically applied to dialogue modeling in low-resource environments. Second, intent forecasting research remains overwhelmingly concentrated on high-resource languages such as English and Mandarin [23, 24], leaving African languages substantially underexplored. Third, large autoregressive generative models typically require billions of target-language tokens to produce coherent text representations, a requirement that is infeasible for most low-resource languages. Fourth, no publicly available benchmark currently evaluates the robustness of conversational representations against Amharic morphological variation and character-level noise. Fifth, self-supervised learning methods for African languages remain significantly understudied relative to the linguistic diversity represented across the continent [25, 30]. Collectively, these limitations expose a substantial gap between advances in predictive representation learning and the practical requirements of conversational AI for low-resource languages.
To address these gaps, this paper investigates whether predictive latent-space learning offers an effective and computationally efficient framework for intent forecasting in Amharic dialogue models. The study specifically examines whether JEPA-style architectures can learn semantically coherent conversational representations without relying on autoregressive text generation. Unlike conventional generative approaches, the proposed framework predicts representations in latent space, thereby emphasizing semantic abstraction, parameter efficiency, and robustness under linguistic variability.
This study is guided by the following three research questions:
RQ1: Does a JEPA-style latent predictor learn semantically coherent representations of Amharic dialogue turns without text generation?
RQ2: Do the learned latent representations encode intent information sufficient for downstream classification, and do they outperform autoregressive baselines under a parameter-efficiency comparison?
RQ3: How robust are the learned representations to Amharic morphological variation and character-level noise?
To answer these questions, this study sets three primary objectives. First, it evaluates whether latent predictive learning captures semantically coherent conversational structure. Second, it examines whether the learned representations encode sufficient intent information for downstream conversational reasoning. Third, it assesses the robustness of predictive embeddings under morphological variation and noisy input conditions common in Amharic dialogue.
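One way to make the third objective concrete is to perturb an utterance at the character level and measure how far its representation moves. The sketch below is illustrative and is not the evaluation protocol of this study: the noise operations, the `rate` parameter, and the character-bigram `bigram_embed` function (a stand-in for a learned encoder) are all assumptions.

```python
import math
import random
from collections import Counter


def add_char_noise(text: str, rate: float, rng: random.Random) -> str:
    """Delete, duplicate, or swap characters, each position with probability `rate`."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(("delete", "duplicate", "swap"))
            if op == "delete":
                i += 1
                continue
            if op == "duplicate":
                out.extend([chars[i], chars[i]])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
        # No noise applied (or a swap requested at the last character).
        out.append(chars[i])
        i += 1
    return "".join(out)


def bigram_embed(text: str) -> Counter:
    """Toy stand-in for an encoder: sparse counts of character bigrams."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def robustness(text: str, rate: float, seed: int = 0) -> float:
    """Similarity between the clean and the noised representation of `text`."""
    noisy = add_char_noise(text, rate, random.Random(seed))
    return cosine(bigram_embed(text), bigram_embed(noisy))


utterance = "ሰላም እንዴት ነህ"  # Amharic greeting, used only as sample input
print(robustness(utterance, rate=0.2))
```

Each Ethiopic character is a single Unicode code point, so Python string indexing operates on whole characters here. With a learned encoder in place of `bigram_embed`, a similarity that stays close to 1 as `rate` grows would indicate robustness to this kind of noise.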
By integrating predictive latent-space learning with low-resource conversational modeling, this work advances the development of semantically robust and computationally efficient dialogue models for underrepresented languages. The findings contribute to ongoing efforts to extend natural language processing research equitably beyond high-resource languages, and they establish a foundation for future predictive conversational AI models in low-resource languages.