From Prediction to Stewardship: Framing Educational Data Science in the Age of Generative AI

Danielle S. McNamara; Linh Huynh

doi:10.20944/preprints202605.1608.v1

Submitted: 24 May 2026

Posted: 25 May 2026

Abstract

As generative AI expands the technical frontiers of prediction, measurement, and design, a growing tension has emerged between algorithmic fluency and institutional trust. This paper proposes stewardship as a necessary fourth paradigm of educational data science. Stewardship represents the professional and epistemic work of governing judgment in an environment where analytic systems are increasingly generative and persuasive. Current research suggests that while AI excels at bounded analytic tasks, its capacity for systemic educational transformation remains unproven. Therefore, the field’s primary challenge is no longer technical performance, but the governance of interpretation, validation, and action. By centering on provenance, accountable oversight, and learner agency, stewardship provides the framework needed to anchor analytic innovation in responsible institutional improvement and human-centric purposes.

Keywords: generative AI; learning analytics; educational data science; AI stewardship

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

For more than a century, educational technologies have been animated by the hope that machines might not only scale instruction but also improve judgment about learning itself. This lineage is well established: from Pressey’s early teaching machine to later forms of computer-assisted instruction, each wave of innovation promised more adaptive, responsive, and individualized education [1,2]. Learning analytics (LA) emerged within this broader history of artificial intelligence in education (AIED), educational data mining (EDM), and data-intensive educational research, with a distinctive ambition: not merely to automate educational tasks, but to collect, analyze, and act on learner data in ways that improve teaching and learning. As several recent reviews note, LA matured during a period when digitization, online platforms, and trace data made it possible to detect patterns in learner behavior at scale, while also sharpening questions about validity, actionability, and ethics. In that sense, the arrival of large language models (LLMs) and generative AI (GenAI) did not create the field’s central tensions. It amplified them [3,4].

Much of the current discussion in the literature and media treats GenAI as a sudden rupture. Since the release of ChatGPT in late 2022, educational discourse has been saturated with claims about transformation: personalized tutoring, automated feedback, conversational analytics, synthetic learner data, multimodal dashboards, and AI-enabled intervention systems. Recent conceptual work in LA has argued that GenAI may affect every phase of the learning analytics cycle, from the identification of learners to the processing of unstructured data, to explanatory analytics, personalization, and adaptive intervention. Yan, Martinez-Maldonado, and Gašević [5], for example, position GenAI as a potential catalyst for analyzing discourse, generating synthetic data, enriching multimodal interaction data, and making analytics more interactive and interpretable. Misiejuk, López-Pernas, Kaliisa, and Saqr [4] similarly describe GenAI as opening new possibilities for the design of LA tools and for supporting teachers’ assessment and monitoring practices. Yet these same authors also caution that the evidence base remains uneven and that practical implications for real interventions are still underdeveloped.

This tension between expanding technical capability and constrained empirical validation reflects a deeper pattern in the development of educational data science. The field can be understood as evolving through three overlapping paradigms: prediction, measurement, and design. These paradigms do not represent discrete stages, but rather dominant orientations that continue to coexist and shape one another. Prediction focuses on identifying patterns in data to forecast outcomes such as student dropout, disengagement, or performance [6,7]. This work translates complex trace data into actionable signals, aligning with descriptive, diagnostic, predictive, and prescriptive forms of analytics [5]. While prediction enables large-scale insight into learning processes, it does not by itself determine how educational outcomes should be improved.

Measurement focuses on the validity of inferences drawn from learner data. It examines whether digital traces can serve as credible indicators of constructs such as reasoning, collaboration, self-regulation, and affect [8]. This perspective emphasizes that analytic outputs depend on theoretically grounded interpretations, and that translating learning into data necessarily involves assumptions that must be examined and validated. This concern remains central in current GenAI research, where much of the work focuses on coding, scoring, and classifying unstructured data, often with uneven validation practices [4].

Design focuses on how analytic insight is embedded within systems that shape teaching and learning. This includes dashboards, feedback systems, and intervention designs that translate data into action [9,10]. Within this paradigm, the field engages the challenge of “closing the loop”—ensuring that analytics inform practice in ways that lead to meaningful improvement.

Learning Engineering has emerged as a central development within this design paradigm. It provides an iterative, evidence-based approach that integrates learning science, human-centered design, and data-informed decision making to support continuous improvement [11,12,13]. Learning Engineering is not synonymous with analytics; rather, it is the process through which analytic insight is translated into intervention, evaluated in context, and refined over time. In this sense, if prediction helps the field anticipate and measurement helps it interpret, Learning Engineering enables it to act.

These paradigms remain essential. But in the age of LLMs and GenAI, they are no longer sufficient, because GenAI introduces new pressures across all three. In prediction, LLMs produce outputs, forecasts, and recommendations that are fluent, authoritative, and contextually persuasive, even when their causal grounding is uncertain. In measurement, they make it easier to label, summarize, and interpret unstructured data at scale while leaving the underlying construct validity unresolved. In design, and especially within Learning Engineering, generative systems make it possible to produce feedback, explanations, narratives, and adaptive supports in real time, even when institutions lack clear criteria for evaluating their pedagogical appropriateness. The ability to generate such responses often outpaces our theoretical and ethical frameworks for determining when those interventions are pedagogically appropriate, equitable, or safe. The resulting risk is not only error, but uncalibrated certainty: outputs that appear meaningful before their epistemic status has been established.

This shift alters the architecture of judgment in educational data science. Historically, analytics involved a layered process in which data were transformed into indicators, interpreted through theoretical frameworks, and translated into action through human decision making [9,10]. Generative systems compress this chain by producing fluent interpretations directly from data, often without making intermediate reasoning steps visible. As a result, outputs increasingly function as judgments rather than analytic inputs.

This development raises a question that prediction, measurement, and design do not fully address: how should increasingly generative, fluent, and consequential systems be governed once they begin to shape educational interpretation and action?

This article argues that the answer lies in a fourth paradigm: stewardship.

Stewardship refers to the disciplined governance of judgment in educational data science. It encompasses the commitments, practices, and institutional arrangements through which analytic outputs become educationally legitimate, uncertainty is represented, decisions remain accountable, and systems are revised when their consequences diverge from their intentions. It is not an external ethical overlay, but an organizing principle for how prediction, measurement, design, and Learning Engineering operate under conditions of generative analytics.

Recent work already points toward this need. Khosravi et al. (2023) [3] call for “GenAI analytics” that capture prompts, interaction context, and model parameters. Yan et al. (2024) [5] highlight the need to reconsider the learner in contexts where human and AI contributions are intertwined. Misiejuk et al. (2025) [4] show that while GenAI is expanding across the learning analytics cycle, validated instructional uses and evaluation standards remain underdeveloped. Together, these studies suggest that the central scarcity in the field is no longer computational capability, but disciplined judgment.

The argument advanced here is therefore not that GenAI should be resisted, but that it requires educational data science to mature. As AI expands the range of analytic outputs, the field’s contribution can no longer be defined primarily by generating those outputs; it must be defined by governing their interpretation, validation, and use. Learning Engineering remains essential as the process through which analytic insight is translated into iterative educational improvement [12]. Stewardship becomes essential as the framework that ensures such improvement remains epistemically grounded, institutionally accountable, and aligned with the purposes of education.

Stewardship does not begin from a blank slate. It extends prior work in responsible LA, which established that analytics are not merely technical systems, but sociotechnical practices shaped by accountability, reasonable care, and the obligation to act [14,15]. Many of the concerns raised here are not new. Educational technology research has also long cautioned against overreliance on automated systems, the displacement of professional judgment, and the risks of treating model outputs as authoritative [16,17].

What is different in the context of GenAI is not the existence of these risks, but their amplification and transformation. LLMs produce outputs that are not only predictive or descriptive, but fluent, contextually responsive, and rhetorically persuasive. They collapse analytic pipelines into conversational interfaces and reduce the visibility of uncertainty and intermediate reasoning. As a result, overreliance becomes easier to trigger and harder to detect, which is why stewardship now becomes necessary.

2. Why LLMs Change the Problem: Fluency, Delegation, and the Governance of Judgment

Large language models do not simply improve existing learning analytics workflows. They alter the conditions under which educational judgments are produced, interpreted, and acted upon. They do so by extending generative capabilities across multiple phases of the learning analytics cycle [4,5]. Earlier analytics systems typically produced bounded outputs: risk scores, classifications, visualizations, alerts, or dashboard indicators. Those outputs could still be misleading, reductive, or harmful, but they were usually constrained by clearer interfaces and narrower forms of interpretation [9,18]. LLMs change this by producing language itself as the analytic medium. They do not only calculate but also explain, summarize, recommend, and generate justification [19,20]. In doing so, they make analytic outputs more accessible and more persuasive, even when explanation faithfulness and appropriate reliance remain unsettled [3,21,22].

This shift matters because educational judgment has always been a layered activity. Learner activity is transformed into indicators, interpreted through theories of learning, and then translated into action by teachers, designers, advisors, or institutions [9,10]. Learning analytics has never merely found insights in data; it has always constructed usable interpretations through a chain of decisions about what to capture, how to model it, what counts as meaningful, and when intervention is warranted [9,10]. Learning Engineering makes this layered structure visible by framing educational improvement as an iterative process in which theory, data, design, implementation, and revision are tightly linked rather than separated into isolated stages [12]. What LLMs do is compress and partially obscure that chain. They can move directly from traces or prompts to polished explanations and recommendations without exposing the intermediate reasoning steps needed for inspection [19,23,24]. The result is not just efficiency. It is a reconfiguration of where judgment appears to reside.

Yan, Martinez-Maldonado, and Gašević [5] provide perhaps the clearest conceptual basis for understanding this shift. They argue that GenAI may shape every phase of the learning analytics cycle, including analysis of unstructured data, synthetic data generation, multimodal enrichment, interactive and explanatory analytics, and personalization or adaptive intervention. Embedded in that argument is a profound change in what analytics can look like. Analytics are no longer limited to static metrics or visual representations; they can become conversational, responsive, context-sensitive, and seemingly interpretive. This shift helps explain why LLM-based analytics feel transformative in educational settings: they promise to close the distance between raw data and pedagogically usable recommendations [19,24,25]. From a Learning Engineering perspective, that promise matters because it suggests that analytics may enter design and improvement cycles in more immediate and generative ways, shaping not only what is known about learning but how interventions are proposed, adapted, and refined [12].

But this apparent closing of distance creates a new problem. When explanation is generated fluently, it becomes easier to mistake plausibility for validity [21,26,27,28]. In other words, the first major way LLMs change educational data science is by making analytics rhetorically stronger before they are epistemically stronger.

2.1. Fluency as Epistemic Risk

The appeal of LLMs lies partly in fluency: they can present analytic outputs as coherent and contextually responsive explanations and recommendations rather than as charts or metrics alone. In learning analytics, this capability makes them especially attractive for dashboard narration, feedback generation, descriptive analytics, and stakeholder-facing explanation. Ochoa, Huang, and Shao [25] explicitly frame this promise in terms of making learning analytics more accessible to non-experts performing LA tasks with GenAI support. Recent research points in the same direction: LLM-powered chatbots can augment LA dashboards with contextualized and conversational explanation, improving non-experts’ comprehension of outputs without relying on deep technical expertise [20,29].

However, that promise should not be romanticized because ease of interaction is not the same as calibrated reliance. Ochoa et al. [25] emphasize that these systems are not yet sufficiently reliable for independent real-world use and that domain knowledge remains essential for interpreting and checking outputs. This caveat is not incidental. It reveals a central paradox of LLM-enabled analytics: the very features that make them usable also make them easy to overtrust [17,30]. A system that explains clearly, answers instantly, and produces seemingly thoughtful interpretations can persuade users that the underlying inference is stronger than it actually is [31,32,33]. In this sense, fluency is not merely a user-experience benefit. It is an epistemic risk.

This risk becomes even clearer within qualitative coding and text analysis. Liu et al. [34] found that GPT-4 can code a broad range of educational constructs, but its performance varies by construct, prompt strategy, and context. No single prompting method consistently performs best, and the constructs that challenge human coders also tend to challenge the model. That finding undercuts any simplistic claim that LLMs solve interpretation at scale. They can accelerate coding, but they do not remove ambiguity from the phenomena being coded. They reproduce, and sometimes disguise, the uncertainty that already resided within the construct itself. When those codes are later summarized in fluent prose, the uncertainty may disappear from view even though it remains in the analysis. This concern is consistent with recent work showing that natural-language explanations can be plausible or self-consistent without being faithful to the processes that generated them [22].

Misiejuk et al. [4] reinforce this point at the field level. Their synthesis indicates that discourse coding, scoring, and classification dominate current empirical work, but it also notes that some studies feed GenAI outputs into LA pipelines without sufficient validation. Here again, the problem is not merely technical error. Generative systems can produce usable-looking analytic artifacts before the field has established whether those artifacts are sufficiently valid to support downstream decisions [21]. Within the field of Learning Engineering, this is especially consequential, because iterative improvement depends on the quality of the evidence entering the cycle. If fluent but weakly validated outputs are treated as sound evidence, the improvement process itself can become distorted [12,35].

2.2. Delegation Without Visibility

The second major change introduced by LLMs is delegation. Educational data science has always delegated certain analytic tasks to algorithms, but LLMs expand both the range and the subtlety of what can be delegated. Systems can now summarize forum activity, classify discourse, generate personalized narratives from dashboards, answer stakeholder questions in plain language, explain visualizations, draft intervention suggestions, and synthesize multimodal observations [5,19,37]. Some of these delegations are desirable because they reduce labor, broaden access, and allow researchers or practitioners to work with data that would otherwise be too unstructured or voluminous to analyze effectively.

Yet delegation becomes more problematic when the analytic work being delegated is not merely procedural but interpretive [36]. Once a model is asked to explain why a learner appears disengaged, summarize a group’s collaborative dynamic, or suggest what kind of support a teacher should provide, it is no longer just processing data. It is participating in pedagogical judgment. That participation may still be partial and constrained, but it is substantive. The governance question is therefore not whether delegation occurs but whether the field has adequate ways to decide which judgments may be delegated, how uncertainty should be represented, and when human oversight must remain primary [37,38].

Yan et al. [5] argue that as the lines blur between learners and GenAI tools, the LA community must better understand human-AI collaboration and trace both human and AI contributions. That claim is often read as a matter of data capture or methodological innovation. But it is also a governance claim. If AI contributes to learning, sensemaking, or interaction, then educational data science must distinguish between human performance, AI mediation, and co-produced activity. Otherwise, institutions risk building analytics on increasingly unstable assumptions about authorship, effort, and learning itself.

The issue becomes even sharper in multimodal analytics. Whitehead, Nguyen, and Järvelä [39] demonstrate how Multimodal Large Language Models (MLLMs) can make complex non-verbal data more tractable through video analysis of posture in collaborative learning. But this is precisely the kind of domain where delegation can outrun interpretability. The model may annotate multimodal signals efficiently, but deciding what those annotations mean educationally still requires human and theoretical judgment [40,41]. As feature extraction becomes more powerful, the need for stewardship at the interpretation layer increases because these interpretations feed directly into downstream decisions.

In an iterative design and improvement framework, delegation is never solely about analytic efficiency; it also determines what counts as evidence for intervention. If models are delegated interpretive authority too early, then learning environments may be redesigned on the basis of outputs whose educational meaning has not been adequately established [12]. Delegation without visibility therefore threatens not only interpretation, but the integrity of improvement itself.

2.3. From Outputs to Consequences

A third way LLMs change the problem is by shifting attention from outputs to consequences. Traditional LA debates often focused on the quality of models or the interpretability of dashboards. With LLMs, the more pressing issue is increasingly what happens when model outputs circulate in educational settings as advice, explanation, or action. A generative summary is not merely information. It can shape how a teacher interprets a student, how a student understands their own progress, how an advisor prioritizes outreach, or how an institution allocates attention. LLM outputs are therefore consequential not only because they may be correct or incorrect, but because they can reorganize human judgment around them [28,42,43].

This shift toward consequence is also why Khosravi, Viberg, Kovanović, and Ferguson [3] call for robust GenAI analytics. Their argument is not only to analyze learners using GenAI, but also to analyze interactions with GenAI systems themselves: prompts, responses, model parameters, and the emerging forms of human-AI collaboration that these systems create. The field, then, is not simply incorporating a new tool, but encountering a new mediating layer in educational action. This change requires much richer attention to provenance, context, traceability, and outcome monitoring than conventional AI-in-education approaches typically assume.
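To make the provenance requirement concrete, the following minimal sketch shows one possible shape for a logged GenAI interaction. It is illustrative only: the class name, field names, and the fingerprinting approach are assumptions introduced here, not a schema proposed by Khosravi et al. [3].

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)
class GenAIInteractionRecord:
    """Hypothetical provenance record for one human-AI interaction."""
    prompt: str                 # what was asked of the model
    response: str               # what the model returned
    model_name: str             # placeholder identifier, not a real product name
    model_parameters: dict      # e.g. {"temperature": 0.2}
    actor_role: str             # e.g. "learner", "teacher", "system"
    context: dict = field(default_factory=dict)   # course, task, data source
    human_reviewed: bool = False                  # has a person checked the output?
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the content fields (timestamp excluded),
        so a downstream artifact can be traced back to this interaction."""
        payload = asdict(self)
        payload.pop("timestamp")
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = GenAIInteractionRecord(
    prompt="Summarize this week's forum activity for group 3.",
    response="(model output would be stored here)",
    model_name="example-model",
    model_parameters={"temperature": 0.2},
    actor_role="teacher",
    context={"course": "example-course"},
)
trace_id = record.fingerprint()
```

A record of this kind could be attached to any downstream artifact, such as a dashboard narrative, a code, or a recommendation, so that the prompt, model settings, and review status behind it remain inspectable.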

2.4. Why Stewardship Becomes Unavoidable

Fluency, delegation, and consequence collectively render stewardship unavoidable. Prediction, measurement, and design remain indispensable. But none of these, on their own, is sufficient for governing LLM-based educational systems. Prediction can estimate likelihoods, but it cannot decide which forms of uncertainty must remain visible to users. Measurement can refine constructs, but it cannot determine when a generative interpretation is too weak to enter a feedback or intervention pipeline. Design can create usable interfaces, but it cannot establish what institutional safeguards are needed when those interfaces begin generating recommendations in real time.

Stewardship becomes necessary because LLMs increase not only what analytics can do, but also how quickly weakly supported inference can become institutionalized and embedded in practice [44,45]. A model-generated code can become a dashboard category. A dashboard narrative can become a teacher’s impression. A recommendation can become an intervention norm. A conversational explanation can become a student’s understanding of their own ability. At each step, the issue is not simply whether the model worked, but whether sufficient epistemic and institutional discipline governs how those outputs are used [18,43,46].

The future of educational data science cannot be defined merely by better model performance. Under conditions of generative fluency, the field must address how uncertainty is communicated, how delegation is bounded, how provenance is documented, how iterative improvement remains evidence-based, and how institutions recognize when apparently helpful outputs begin to distort decision-making. The technical question is capability. The disciplinary question is stewardship.

3. Generative AI and Learning Analytics

Large language models have shifted the central empirical task for learning analytics from demonstrating technical possibility to examining how the field is evolving in practice. The current literature points to three concurrent patterns: areas of robust technical performance, limited evidence of broader pedagogical impact, and a recurrent tendency toward inflated interpretive claims. This framing is more analytically useful than a general discussion of opportunities and challenges because it distinguishes established methodological advances from educational conclusions that remain provisional, weakly supported, or overstated [4].

The risk introduced by generative systems is not simply additive but multiplicative. Earlier systems required interpretation: dashboards had to be read, models had to be understood, and outputs were often partial or fragmented. LLMs reduce this friction. They generate coherent explanations, recommendations, and narratives that appear complete and authoritative. This fluency increases the likelihood that users will accept outputs without interrogation, particularly in contexts where time, expertise, or institutional support for evaluation is limited [17,28]. In this sense, generative fluency does not merely support decision-making; it may reshape the threshold at which decisions are made.

The first pattern concerns domains in which empirical work now demonstrates consistent technical value. The second concerns the persistent gap between analytic or interface improvements and meaningful educational outcomes. The third concerns the tendency within the literature to translate promising technical results into stronger claims about pedagogical validity, objectivity, or transformation than the evidence can presently sustain. Taken together, these patterns reinforce the central argument of this article: the key challenge is no longer whether LLMs can produce useful outputs for learning analytics, but whether the field can govern how those outputs are interpreted, validated, and institutionalized [4].

3.1. Areas of Robust Technical Performance

The strongest evidence in the current literature concerns the use of GenAI to process unstructured educational data. This is not a trivial development. Much of the most educationally meaningful information in digital learning environments appears in text-rich forms that have historically been difficult to analyze at scale: discussion posts, peer feedback, reflective writing, collaborative discourse, tutoring dialogue, and other open-ended language. Misiejuk et al. [4] indicate that the dominant empirical uses of GenAI in LA are in discourse coding, scoring, and classification. In other words, the strongest current contribution of GenAI is not that it has already transformed intervention or redesign, but that it is making the measurement layer of analytics more tractable by converting text-rich data into usable analytic representations.

As discussed earlier, the study by Liu et al. [34] of GPT-4 and qualitative coding illustrates both the promise and the limitations of using GenAI to render text-rich educational data analytically tractable. Across three educational datasets, they found that GPT-4 could code a broad range of constructs with meaningful agreement with human coders, and that embeddings or carefully designed examples could improve performance for more difficult constructs. Importantly, however, their findings resist broad generalization: no single prompting or modeling strategy consistently performed best across tasks, and the constructs that proved most difficult for human coders were also the ones that most challenged the model.
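One way to keep this construct-level variation visible is to report human-model agreement separately for each construct rather than as a single pooled score. The sketch below does so with Cohen's kappa, a standard chance-corrected agreement statistic. The construct names, the toy labels, and the 0.6 review threshold are illustrative assumptions, not data or criteria from Liu et al. [34].

```python
from collections import Counter


def cohen_kappa(human, model):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(model) or not human:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(human)
    # Proportion of items on which the two raters agree.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, m_counts = Counter(human), Counter(model)
    expected = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_construct(codes, threshold=0.6):
    """codes maps construct name -> (human_labels, model_labels).
    Flags constructs whose agreement falls below the assumed threshold."""
    report = {}
    for construct, (human, model) in codes.items():
        kappa = cohen_kappa(human, model)
        report[construct] = {
            "kappa": round(kappa, 3),
            "needs_review": kappa < threshold,
        }
    return report


# Toy data: 1 = construct present, 0 = absent (hypothetical labels).
example = {
    "self_regulation": ([1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 1, 0]),
    "collaboration": ([1, 0, 1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 0, 1]),
}
report = agreement_by_construct(example)
```

In this toy example the first construct shows perfect agreement while the second shows agreement no better than chance, so only the second would be flagged for human review before its codes enter a downstream pipeline.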

Similarly, Long, Luo, and Zhang [24] show that GPT-4 can assist in classroom dialogue analysis with substantial time savings and high consistency relative to expert coding. In a game-based learning context, Acosta et al. [49] apply LLMs to analyze multi-party epistemic dialogue acts in collaborative game-based learning, providing teachers with actionable insights about group dynamics and student learning. The evidence here is strong, but its strength lies in bounded augmentation, not replacement of human methodological judgment. Related methodological work outside the LA-special-issue literature points in the same direction: generative models can assist qualitative analysis at scale, but their effectiveness remains highly sensitive to prompt structure, interpretive framing, and researcher oversight, reinforcing that their value lies in augmentation rather than autonomous judgment [34,47,48].

A second area of robust technical performance appears in the use of GenAI to support descriptive and explanatory analytics for non-experts. For example, Yan et al. [20] demonstrate how the multimodal generative chatbot VizChat can provide contextualized and personalized explanations for LA dashboards, offering comprehensive insights from multiple sources. While this suggests that GenAI can broaden access to aspects of analytic practice, successful use still depends on domain knowledge and on disciplined interpretive habits such as checking and evaluating outputs [25].

A third area concerns multimodal learning analytics, particularly feature extraction from non-verbal data. The case study by Whitehead et al. [39] suggests that MLLMs may be leveraged to extract postural behavior from video of collaborative learning. Zhou, Suraworachet, and Cukurova [41] also demonstrate how gaze and other non-verbal behaviors in group interaction can be automatically detected and linked to differences in collaborative learning outcomes. Together, these works show meaningful development because multimodal learning analytics has long required specialized pipelines, substantial technical expertise, and considerable manual effort to process non-verbal data streams. However, system capability does not resolve the interpretive challenge. Non-verbal signals become educationally meaningful only when they are mapped to constructs through theory-informed interpretation, rather than treated as self-explanatory features [50,51]. Reliability, data quality, prompt construction, and contextual sensitivity therefore remain central concerns. Here too, the evidence points to a meaningful methodological advance, not a solved interpretive problem [39,41].

3.2. Limited Pedagogical and Institutional Effects

GenAI is rapidly expanding the field’s analytic reach, especially where researchers and practitioners need to work with language, interaction, and multimodal data that are otherwise costly or difficult to process [4]. The evidence becomes substantially weaker, however, when it moves from analytic capability to pedagogical consequence. Technical success in coding, summarization, or conversational data analysis does not automatically translate into meaningful improvements in learning, teaching, or institutional decision making.

Learning analytics has encountered this problem before. For more than a decade, dashboards have served as a dominant interface to close the loop between data and action. Yet dashboards have yielded only limited gains: they have increased awareness and access to information more reliably than they have improved academic achievement, motivation, or deep learning behaviors. For example, a review of 38 empirical studies concluded that there is no evidence that dashboards have lived up to the promise of improving academic achievement, and that most reported effects were negligible or small, with limited evidence from well-powered controlled experiments [18].

GenAI may make such interfaces more conversational, personalized, eloquent, and explanatory, as recent work on GenAI-augmented dashboards demonstrates [20]. However, this advancement does not remove the underlying problem. Unless these systems are grounded in stronger theory and evaluated for actual learning impact, the field risks repeating the same pattern with more sophisticated tools [4,18,52]. More broadly, evidence from both experimental studies and systematic reviews of GenAI in education suggests that apparent utility does not consistently translate into improved learning outcomes. The impact of generative AI on learning is highly variable, depending heavily on instructional and task design, scaffolding, and how AI support is structured and integrated into the task [53,54].

Likewise, Misiejuk et al. [4] conclude that while students’ perceptions of GenAI are often positive and some studies report improvements in participation or task performance, evidence for actual learning outcomes remains limited. In short, the field has documented a meaningful expansion of analytic and interface capability, but not yet strong evidence of widespread pedagogical transformation. This distinction between analytic value and educational value is foundational for the stewardship argument. A model may classify discourse more efficiently, help an instructor inspect patterns more quickly, or give a user a more natural way to query data. Those are real advances. But they are not the same as demonstrating improved learning, effective self-regulation, or more defensible institutional action.

3.3. Inflation of Interpretive and Pedagogical Claims

A recurring pattern in the literature is the inflation of claims based on technically promising results. Overstatement in the field is not merely occasional; it follows a recognizable structure in which technical outputs are translated into stronger claims than the evidence supports. Coding performance is taken as evidence of educational understanding, conversational explanation as trustworthy pedagogy, automation as objectivity, and positive user response as proof of learning improvement.

One element of this pattern is the veneer of objectivity. LLMs generate fluent, confident, and apparently neutral language that can make weak inferences appear settled. This risk is evident in adjacent domains such as AI-supported language assessment, where automated systems can reproduce narrow and biased assumptions while appearing authoritative [55]. More broadly, research on human–AI interaction shows that users often rely on AI outputs without sufficient interrogation, particularly when those outputs are presented clearly and confidently [31,33].

A second element is the production of artificial authority. Research on LLM-as-a-judge shows that reliability, consistency, and bias remain unresolved, indicating that fluent evaluative output should not be treated as equivalent to robust judgment [56,57,58,59]. In practice, GenAI can support analysis effectively only when human oversight, checking, and interpretive discipline remain central [25,34]. The risks heighten when these human responsibilities are minimized and system outputs acquire unwarranted authority.

A third element is the inflation of pedagogical consequence. The field often moves too quickly from a methodological result (such as improved coding, easier querying, or a more accessible interface) to claims about personalization, adaptive learning, or transformation of practice. Misiejuk et al. [4] explicitly caution that validated classroom integrations and impacts on learning outcomes remain limited, while the dashboard literature shows that a previous wave of analytics research frequently celebrated increased awareness or access without corresponding evidence of deep educational improvement [18]. Seen in this light, inflation of claims is not only a GenAI problem, but a persistent tendency within the field that GenAI risks intensifying.

A fourth element concerns dependency and cognitive outsourcing. This area should still be handled cautiously, but it has more direct empirical grounding than a purely speculative concern. Recent higher-education evidence indicates that stronger GenAI dependency may be associated with lower academic achievement through mechanisms involving false self-efficacy, while perceived teacher caring moderates part of that relationship [60]. This finding does not resolve the question of long-term cognitive outsourcing, but it does provide preliminary evidence that systems that appear highly supportive may also shift learners toward forms of dependence that weaken metacognitive monitoring or distort self-assessment. At minimum, this is an area where the field should proceed more cautiously than many current claims suggest. Most importantly, it points to the need to embed GenAI within educational designs that strengthen learner agency and active processing rather than displace them.

3.4. Implications for the Present Argument

GenAI in learning analytics supports bounded augmentation more strongly than autonomous educational judgment. It has demonstrated clear value, particularly in processing unstructured text, extending descriptive analytics to non-experts, and enabling multimodal analysis. These are meaningful advances. However, pedagogical gains remain narrower than technical progress, and claims often exceed the available evidence. The work is most convincing when GenAI is positioned as augmenting analytic workflows under validated conditions, and least convincing when it is treated as a reliable, autonomous source of educational judgment.

These conditions necessitate a stewardship framework. The available evidence does not yet justify delegating educational decision making to generative systems. Rather, it reflects a stage in which models are increasingly capable, useful, and rhetorically persuasive, while norms for validation, interpretability, provenance, and accountable use remain underdeveloped [30,31,33]. The central challenge is therefore not only to advance technical capability, but to establish the evaluative, professional, and institutional disciplines required to govern how such systems are trusted and used.

3.5. Generative AI and the Transformation of Data Science Work

The implications of GenAI extend beyond educational judgment to the work of educational data science itself. Many of the tasks that have historically defined the field—coding qualitative data, constructing features, generating summaries, interpreting patterns, and communicating results—are increasingly automated or augmented by LLMs [19,24,25]. The compression of analytic workflows by generative systems does not eliminate the need for data science, but it changes where and how its expertise is exercised.

Earlier forms of learning analytics required visible stages of analytic work. Data had to be processed, models specified, outputs interpreted, and findings translated into actionable insight. These stages made the epistemic labor of the field legible: assumptions could be examined, uncertainty could be debated, and interpretations could be contested. Generative systems compress this pipeline. They can move from raw or semi-structured data to fluent explanations, recommendations, or narratives with minimal visibility into intermediate reasoning. As a result, parts of the analytic process that were previously sites of professional judgment risk becoming opaque or implicitly delegated.

This shift creates a tension for the field. On one hand, generative systems expand access to analytic capabilities, enabling non-experts to engage in forms of data interpretation that were previously restricted by technical expertise [25]. On the other hand, this same accessibility can obscure the distinction between generating an output and justifying an inference. When explanation becomes automated, the role of the data scientist may shift from analyst and interpreter to validator of system outputs.

This transformation has been noted more broadly in discussions of AI and data science. As generative models increasingly handle coding, feature extraction, and even aspects of analysis, the core contribution of data science shifts from producing outputs to governing their interpretation, validation, and use. In this sense, GenAI does not eliminate the need for data science, but it relocates it. The field’s value becomes less about performing analytic tasks and more about ensuring that those tasks remain epistemically sound.

Within educational data science, this shift is particularly consequential. If analytic outputs increasingly take the form of fluent explanations, recommendations, or feedback, then the risk is not only that systems may be wrong, but that their outputs may be accepted without sufficient scrutiny. The problem is therefore not only technical automation, but the displacement of interpretive responsibility.

Stewardship emerges in response to this shift. It defines the work of the field not in terms of generating analytic outputs, but in governing the conditions under which those outputs are treated as knowledge and used in practice. In the generative era, the central question for educational data science is no longer only how to produce insight, but how to maintain the integrity of insight when its production is increasingly automated.

4. Stewardship as a Paradigm for Educational Data Science

GenAI alters the conditions under which educational judgment is produced and acted upon, raising a problem that prediction, measurement, and design do not fully resolve. While these paradigms remain essential—prediction for estimating likelihoods, measurement for strengthening construct validity, and design—they do not address how increasingly generative, fluent, and consequential analytic systems should be governed in practice.

Stewardship is proposed as the paradigm required to address this gap. It refers to the disciplined governance of judgment in educational data science: the commitments, practices, and institutional arrangements through which analytic outputs become educationally legitimate, uncertainty is made visible, decisions remain accountable, and systems are revised when their consequences diverge from their intentions. In this sense, stewardship governs the movement from analytic possibility to educational consequence. Although related to broader work on AI governance in education, the argument here is disciplinary: stewardship is not an external checklist, but an organizing paradigm for educational data science itself [3,23,61].

This need is underscored by the current empirical pattern. Technical advances in coding, classification, descriptive analytics, and multimodal feature extraction are clear, while evidence of sustained pedagogical transformation or learning outcomes remains limited [4,25,34,39]. At the same time, claims about learning and pedagogical impact often exceed what the evidence supports. Stewardship addresses this imbalance by focusing on how analytic systems are interpreted, validated, and governed as their outputs become more fluent and persuasive.

4.1. Stewardship as the Governance of Judgment

At its core, stewardship begins from a simple premise: educational data science is valuable not because it produces outputs, but because it helps determine which outputs may legitimately guide educational action. That determination becomes especially urgent in the context of LLMs. Ochoa et al. [25] show that LLMs may lower the expertise barrier for users to engage in learning analytics, but successful use still depends on accountable human oversight: checking, evaluation, and domain knowledge. This finding highlights a central tension in the field: broadening access to producing and using analytic outputs also broadens the need for norms that distinguish usable assistance from unwarranted authority.

GenAI can operate across the learning analytics cycle, from unstructured data analysis and synthetic data generation to explanatory analytics and personalized intervention [5]. As these systems become more capable of explaining, summarizing, and recommending, their outputs may be more readily accepted as authoritative rather than critically evaluated. Research on human-AI interaction supports this concern. Buçinca, Malaya, and Gajos [31] show that users frequently over-rely on AI suggestions, even when they are wrong, and that explanations do not reliably reduce that overreliance. Similarly, Salvi, Ribeiro, Gallotti, and West [33] demonstrate that GPT-4 can be more persuasive than human opponents in controlled settings, particularly when responses are personalized.

The field must therefore govern not only model performance, but also the inferential pathways through which model outputs are translated into feedback, institutional decisions, student self-understanding, and teacher judgment. Stewardship thus reframes the value proposition of educational data science. In a pre-generative environment, the field could often define its contribution in terms of better prediction, better measurement, or better interfaces. In a generative environment, those are no longer sufficient markers of maturity. The central contribution must also include the capacity to calibrate uncertainty, document provenance, preserve human accountability, and ensure that educational action is not driven by outputs whose validity is weaker than their fluency suggests.

4.2. Core Commitments of a Stewardship Paradigm

A stewardship paradigm requires more than a general appeal to caution. It requires substantive commitments that orient research, design, Learning Engineering (LE) practice, and institutional governance. These commitments define not only how systems should be built, but how their outputs should be interpreted, evaluated, and used in educational contexts.

Epistemic discipline. The first commitment is epistemic discipline: the insistence that fluent or useful output must not be confused with warranted inference. This commitment is foundational because much of the risk introduced by LLMs lies in their ability to make uncertain interpretations appear settled. For example, Liu et al. [34] show that GPT-4 can assist with coding a range of educational constructs, but performance depends on the construct, prompt strategy, and context; the hardest constructs for human coders also remain difficult for the model. This limitation is a reminder that models do not resolve ambiguity in the underlying phenomenon. Stewardship requires that such ambiguity remain visible rather than being rhetorically smoothed away in dashboards, summaries, or interventions.
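One way to keep construct-level ambiguity visible is to report human-model agreement separately for each construct rather than as a single pooled figure. The following is a minimal sketch under stated assumptions: the construct names and codes are hypothetical illustrations, not data from Liu et al. [34], and Cohen's kappa is used here only as one common chance-corrected agreement statistic.

```python
from collections import Counter


def cohens_kappa(human, model):
    """Cohen's kappa for two equal-length lists of categorical codes."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Proportion of items on which the two coders agree.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Agreement expected by chance, from each coder's marginal frequencies.
    human_freq, model_freq = Counter(human), Counter(model)
    expected = sum(human_freq[c] * model_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes: 1 = construct present, 0 = construct absent.
codes = {
    "help_seeking": {
        "human": [1, 1, 0, 0, 1, 0, 1, 0],
        "model": [1, 1, 0, 0, 1, 0, 1, 0],
    },
    "metacognition": {
        "human": [1, 0, 1, 0, 1, 0, 1, 0],
        "model": [1, 1, 0, 0, 1, 0, 0, 1],
    },
}

kappa_by_construct = {
    name: cohens_kappa(c["human"], c["model"]) for name, c in codes.items()
}
```

In this illustrative data, agreement is perfect for one construct and no better than chance for the other. A pooled statistic would obscure that difference, which is the kind of smoothing the commitment warns against.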

Epistemic discipline also implies a shift in research standards. Studies should no longer move directly from technical feasibility to claims about learning, pedagogy, or educational improvement. A model that classifies discourse more efficiently or produces a more natural-language explanation has achieved something meaningful, but that achievement does not automatically justify claims about deeper understanding or improved learning. Research must therefore distinguish more carefully between technical performance, interpretive validity, and pedagogical consequence. In Learning Engineering contexts, this distinction is especially critical, because iterative improvement cycles depend on the quality of the evidence they incorporate. If weak inference is treated as established fact, those cycles risk scaling error rather than reducing it [12].

Provenance and traceability. The second commitment is provenance and traceability. As GenAI becomes embedded in analytics workflows, stakeholders must be able to understand not only what a system produces, but how that output was generated and how it can be audited—what data it draws on, what transformations were applied, what prompts or contextual inputs shaped the response, and where uncertainty enters the process. Khosravi et al. [3] emphasize the importance of capturing prompts, interaction context, and model parameters in “GenAI analytics”. The need to document such provenance information is both a methodological and governance concern. Without provenance, outputs risk functioning as persuasive but opaque artifacts that cannot be meaningfully audited, reconstructed, or contested.

This commitment has direct implications for infrastructure and research practice. Educational data science must move beyond reporting performance metrics toward documenting analytic pipelines, decision pathways, and model conditions. As systems become more composite—combining prompts, prior interactions, interface logic, and model updates—trust can no longer rest on output quality alone. It must also depend on the ability to trace how an output came to matter. This aligns with broader work on AI governance in education, which increasingly foregrounds transparency, explainability, and auditability as conditions for responsible deployment [3,23,61].
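A provenance record of the kind described above might be sketched as follows. This is an illustration under assumptions, not a schema proposed by Khosravi et al. [3]: the class name `ProvenanceRecord`, its field names, the model name, and the file name are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """Hypothetical record of how one generative output was produced."""
    model_name: str
    model_params: dict   # e.g. temperature, model version
    prompt: str
    context: list        # interaction context supplied to the model
    data_sources: list   # identifiers of the learner data that was used
    output: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Stable hash over everything except the timestamp, for audit lookups."""
        payload = asdict(self)
        payload.pop("created_at")
        canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = ProvenanceRecord(
    model_name="example-llm",                # hypothetical model identifier
    model_params={"temperature": 0.0},
    prompt="Summarize this student's forum posts.",
    context=["course: intro statistics"],
    data_sources=["forum_posts.csv"],        # hypothetical file name
    output="The student asks frequent clarifying questions.",
)

# One JSON line per output, suitable for an append-only audit log.
audit_line = json.dumps(
    {**asdict(record), "fingerprint": record.fingerprint()}, sort_keys=True
)
```

Storing the prompt, context, and parameters alongside the output is what allows a later reviewer to reconstruct or contest the result. The fingerprint simply makes identical generation conditions easy to detect across log entries.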

Accountable human oversight. The third commitment is accountable human oversight. The literature consistently supports human-AI collaboration more strongly than autonomous AI judgment. Misiejuk et al. [4] establish that current classroom implementations emphasize human-AI collaboration rather than fully automated systems, and Ochoa et al. [25] clarify why: even when non-experts perform well with GenAI, successful use still requires checking outputs, evaluating results, and applying domain knowledge. These findings suggest that stewardship should not be framed as a temporary precaution before full automation becomes possible. Rather, human oversight should be treated as a constitutive feature of educational judgment [31,36].

This has implications for both design and institutional practice. Oversight must be meaningful, not symbolic. It requires clarity about what humans are responsible for interpreting, what they are expected to question, and what decisions must remain reviewable or contestable. Systems should be designed to support this interpretive role by making assumptions, uncertainty, and evidence visible. In Learning Engineering contexts, this means that iterative design cycles must preserve points at which human judgment remains nondelegable, particularly when outputs influence assessment, feedback, or high-stakes decisions.

Institutional learning. The fourth commitment is institutional learning. Stewardship cannot end at deployment. Educational systems must be designed so that institutions can continuously monitor how systems function in practice and refine them over time: identifying where outputs are effective, where performance begins to drift, where users misunderstand or misuse responses, where inequities emerge, and where unintended consequences develop. The closed-loop ambition of learning analytics has traditionally focused on feeding data back into teaching and learning [9]. A stewardship paradigm extends this loop to the institution itself. Institutions must become capable of learning from the consequences of the systems they adopt.

This shifts the focus of research and evaluation. The field must study not only model performance, but also how outputs are interpreted, how they shape practice, and how they evolve over time. Many risks associated with generative systems—such as overreliance, narrowing of attention, or normalization of weak evidence—are systemic rather than technical [30,31,45]. Addressing them requires monitoring interpretive use, organizational incentives, and downstream effects on educational practice. Learning Engineering plays a critical role here by structuring iterative cycles of design and evaluation, but stewardship determines what must be monitored, when revision is required, and how institutional learning is achieved.
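Monitoring interpretive use can be made concrete with a simple indicator. The sketch below computes one possible overreliance measure: the share of incorrect AI suggestions that users nonetheless accepted. It assumes that the correctness of each suggestion is established later by independent review; the `Decision` structure and the log entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    ai_correct: bool     # assumed to come from later independent review
    user_accepted: bool  # whether the user adopted the AI suggestion


def overreliance_rate(decisions):
    """Share of incorrect AI suggestions that users accepted.

    Returns None when the log contains no incorrect suggestions,
    because the rate is undefined in that case.
    """
    wrong = [d for d in decisions if not d.ai_correct]
    if not wrong:
        return None
    return sum(d.user_accepted for d in wrong) / len(wrong)


# Hypothetical usage log.
log = [
    Decision(ai_correct=True, user_accepted=True),
    Decision(ai_correct=False, user_accepted=True),
    Decision(ai_correct=False, user_accepted=False),
    Decision(ai_correct=True, user_accepted=False),
    Decision(ai_correct=False, user_accepted=True),
]

rate = overreliance_rate(log)  # 2 of the 3 incorrect suggestions were accepted
```

Tracked over time and across contexts, an indicator of this kind gives an institution one signal of where outputs are being accepted without scrutiny. It would not by itself explain why.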

Protection of learner agency. The fifth commitment is the protection of learner agency. As generative systems become more adaptive, personalized, and conversational, there is a risk that learners are positioned less as active participants in knowledge construction and more as recipients of optimized support. Yan et al. [5] argue that the learning analytics community must rethink the learner in contexts where human and AI contributions increasingly blur, which is not only a methodological issue but also a normative one.

Stewardship requires that systems be evaluated not only in terms of efficiency or task completion, but in terms of their effects on self-regulation, critical reflection, and durable understanding. Emerging evidence suggests that stronger reliance on GenAI may be associated with lower academic achievement through mechanisms such as false self-efficacy [60], highlighting the need to design systems that support rather than displace learner cognition.

4.3. Operationalizing Stewardship

A central challenge for stewardship is that its core principles (uncertainty, oversight, and accountability) are already widely acknowledged yet inconsistently enacted. Simply restating these commitments is unlikely to change practice. Stewardship therefore requires not only normative commitments, but also conditions that make those commitments operational in practice.

At minimum, stewardship implies three forms of implementation:

First, design constraints on system outputs. Generative systems should not present outputs as singular or authoritative by default. Interfaces should make uncertainty visible, provide access to underlying data or alternative interpretations, and require users to engage with multiple perspectives before action is taken.

Second, structured human-in-the-loop processes. Oversight must be defined as a specific role with clear responsibilities, rather than a general expectation. This includes specifying when outputs must be reviewed, what constitutes adequate validation, and how disagreements between human and system judgment are resolved.

Third, institutional accountability mechanisms. Organizations must monitor how systems are used in practice, not only whether they perform accurately. This includes tracking overreliance, identifying contexts in which outputs are accepted without scrutiny, and establishing processes for revising or withdrawing systems when unintended consequences emerge.

These are not exhaustive solutions. Rather, they illustrate that stewardship is not achieved through awareness alone, but through the design of systems, roles, and institutions that make disciplined judgment more likely.
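The first two forms of implementation can be illustrated with a small routing rule. This is a sketch under assumptions rather than a recommended policy: the 0.8 threshold is arbitrary, the `confidence` value is assumed to be an externally validated estimate rather than a model's self-report, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SHOW_WITH_UNCERTAINTY = "show_with_uncertainty"
    HUMAN_REVIEW = "human_review"


@dataclass
class AnalyticOutput:
    text: str
    confidence: float   # assumed validated estimate in [0, 1]
    high_stakes: bool   # e.g. the output informs assessment or placement
    alternatives: list  # other plausible interpretations of the same data


def route_output(output, min_confidence=0.8):
    """Send high-stakes or low-confidence outputs to a named human reviewer."""
    if not 0.0 <= output.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if output.high_stakes or output.confidence < min_confidence:
        return Route.HUMAN_REVIEW
    return Route.SHOW_WITH_UNCERTAINTY


def render(output):
    """Present an output with its uncertainty and alternatives, never alone."""
    lines = [f"{output.text} (estimated confidence: {output.confidence:.0%})"]
    lines += [f"Alternative interpretation: {alt}" for alt in output.alternatives]
    return "\n".join(lines)


example = AnalyticOutput(
    text="Participation dropped after week 3.",
    confidence=0.85,
    high_stakes=False,
    alternatives=["Activity may have moved to an untracked platform."],
)
decision = route_output(example)
display = render(example)
```

The design choice worth noting is that review is triggered by stakes as well as by confidence, so a fluent and apparently confident output cannot bypass oversight when the decision is consequential.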

4.4. Implications for Design, Practice and the Field

Taken together, these commitments redefine what counts as rigor and contribution in educational data science by shifting the field’s focus from producing analytic outputs to governing their use. Research must extend beyond demonstrating model performance to examine how outputs function within educational systems—how they are interpreted, where they are overtrusted, and whether they support meaningful learning outcomes. Design must prioritize interpretability, traceability, and contestability, ensuring that systems support human judgment rather than replace it. Institutions must develop governance structures that specify where generative systems are appropriate, how outputs are reviewed, and how consequences are monitored over time. Educational data science must engage not only with models and methods, but with the epistemic, professional, and institutional conditions under which those models become educationally consequential.

At a broader level, this shift is not a resistance to innovation, but a marker of what maturity now requires. In a field where generative models can already classify discourse, support descriptive analysis, extract multimodal features, and generate persuasive explanations, the scarcity is no longer computational capability but disciplined judgment. Educational data science must now define itself not only by what it can build, but by what it can justify, govern, revise, and, when necessary, refuse.

Stewardship therefore becomes the paradigm through which prediction, measurement, and design remain educationally credible in the age of generative analytics. The field’s next advance will not come from treating AI outputs as self-authenticating evidence of progress. It will come from building the epistemic, professional, and institutional conditions under which those outputs can serve learning without displacing the human purposes that make education worth improving in the first place [3,12].

5. Conclusions

Generative AI marks an important development in learning analytics, but its significance lies not only in the introduction of new technical capabilities. It also reconfigures the conditions under which educational inferences are generated, communicated, and acted upon. LLMs and related systems expand the range of data that can be processed, classified, narrated, and adapted within educational settings. They increase the tractability of unstructured language data, broaden access to selected analytic practices, and create new possibilities for multimodal analysis, conversational interfaces, and generative feedback [3,4,5]. Thus far, however, technical progress has outpaced evidentiary, institutional, and professional adaptation. Areas of methodological promise coexist with limited evidence of broader pedagogical impact and with a recurrent tendency to extend interpretive and pedagogical claims beyond what current findings can support [4,18].

The challenge is not that the field lacks awareness of these risks, but that existing structures make them easy to ignore. This pattern has important implications for how the future of educational data science is conceptualized. The central issue is no longer whether generative systems can contribute to learning analytics; the available evidence indicates that they already do so in multiple areas of research and practice. Nor is the main question reducible to whether such systems are accurate enough to be useful in bounded tasks. The more consequential issue concerns how increasingly fluent, adaptive, and persuasive outputs should be governed once they begin to shape educational interpretation and action. Earlier phases of the field could often define progress in terms of improved prediction, more refined measurement, or more effective design. Those aims remain necessary. However, in the generative era they are no longer sufficient on their own. Additional attention must be directed toward the representation of uncertainty, the documentation of provenance, the preservation of meaningful human oversight, and the capacity of institutions to learn from the consequences of the systems they deploy [3,9,31].

The overarching objective of this article is to advance both a diagnostic account of the current literature and a constructive framework for the field. The diagnostic account has emphasized that the strongest evidence currently concerns bounded forms of analytic augmentation, particularly in the processing of unstructured data, support for descriptive analysis, and multimodal feature extraction [25,34,39]. The literature is less developed with respect to durable pedagogical effects, validated autonomous intervention, and institutionally robust deployment [4,18]. The constructive framework proposed here is stewardship. Stewardship names the governance of judgment in educational data science: the set of epistemic, professional, and institutional commitments through which analytic outputs become educationally legitimate, interpretive confidence is calibrated to evidence, and systems remain accountable to human purposes rather than merely technical possibility.

At the same time, the argument here has also underscored the continuing importance of Learning Engineering. If stewardship governs judgment, Learning Engineering provides the iterative infrastructure through which governed analytic insight is translated into educational action. It is the process that connects theory, data, design, implementation, and revision in order to improve learning environments over time. This distinction matters because stewardship and Learning Engineering are complementary rather than interchangeable. Stewardship asks whether an inference is strong enough, transparent enough, and accountable enough to guide action. Learning Engineering asks how that action should be designed, tested, enacted, and improved in context. Without stewardship, Learning Engineering risks accelerating poorly warranted interventions. Without Learning Engineering, stewardship risks remaining a normative stance without a pathway to educational improvement. Taken together, they provide a more complete response to the generative moment: stewardship governs the legitimacy of analytic judgment, while Learning Engineering catalyzes its responsible translation into practice.

The concept of stewardship is important because it clarifies that the central challenge introduced by GenAI is not simply one of performance, efficiency, or interface quality. It is a challenge of governance. As generative systems become more capable of explanation, recommendation, summarization, and adaptation, the field must determine what kinds of judgment may be delegated, what forms of evidence are required before outputs inform consequential decisions, and what forms of oversight and revision are necessary once these systems enter practice [5]. In this sense, stewardship is neither an ethical afterthought nor a general appeal to responsibility. It is the condition under which generative analytics can become educationally credible.

The broader implication is that the maturity of educational data science should no longer be assessed only by the extent to which intelligence can be automated. A more appropriate criterion is whether the field can develop the standards of evidence, epistemic infrastructure, design commitments, professional preparation, Learning Engineering capacity, and institutional governance needed to regulate the use of that intelligence in educational settings. Generative AI has expanded the horizon of what learning analytics may be able to do. Whether that expansion yields stronger educational understanding or more persuasive forms of analytic overreach will depend on the field’s capacity to steward judgment with greater rigor, transparency, and accountability, and to embed that judgment within iterative processes of educational design and improvement. The future of educational data science will depend not only on what generative systems can produce, but on whether the field can decide, justify, and refine what those systems should be allowed to mean and do.

Author Contributions

Conceptualization, D.S.M.; investigation, D.S.M.; resources, D.S.M.; writing (original draft preparation), D.S.M.; writing (review and editing), D.S.M. and L.H.; supervision, D.S.M.; project administration, D.S.M.; funding acquisition, D.S.M. All authors have read and agreed to the published version of the manuscript.

Funding

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grants R305N210041 and R305T240035 to Arizona State University and Grant NSF IIS 2153481 to Rice University and Arizona State University. The opinions expressed are those of the authors and do not represent views of the Institute of Education Sciences, the U.S. Department of Education, or the National Science Foundation.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

EDM	Educational Data Mining
AIED	Artificial Intelligence in Education
GenAI	Generative AI
LLM	Large Language Model
LA	Learning Analytics

References

Design Recommendations for Intelligent Tutoring Systems: Intelligent Tutoring Systems with Generative AI; Sinatra, A.M., Rus, V., Lawton, P., Graesser, A.C., Eds.; US Army Combat Capabilities Development Command - Soldier Center: Orlando, FL, USA, 2025; Volume 12.
Skinner, B.F. Teaching machines. Science 1958, 128, 969–977.
Khosravi, H.; Viberg, O.; Kovanović, V.; Ferguson, R. Generative AI and learning analytics. J. Learn. Anal. 2023, 10, 1–6.
Misiejuk, K.; López-Pernas, S.; Kaliisa, R.; Saqr, M. Mapping the landscape of generative artificial intelligence in learning analytics: A systematic literature review. J. Learn. Anal. 2025, 12, 12–31.
Yan, L.; Martinez-Maldonado, R.; Gašević, D. Generative Artificial Intelligence in Learning Analytics: Contextualising Opportunities and Challenges through the Learning Analytics Cycle. In Proceedings of the 14th Learning Analytics and Knowledge Conference, 2024; pp. 101–111.
Baker, R.S.; Inventado, P.S. Educational data mining and learning analytics. In Learning Analytics: From Research to Practice; Larusson, J.A., White, B., Eds.; Springer: New York, NY, USA, 2014; pp. 61–75.
Baker, R.S.; Siemens, G. Learning analytics and educational data mining. In The Cambridge Handbook of the Learning Sciences, 3rd ed.; Sawyer, R.K., Ed.; Cambridge University Press: Cambridge, UK, 2022; pp. 259–278.
Greller, W.; Drachsler, H. Translating learning into numbers: A generic framework for learning analytics. Educ. Technol. Soc. 2012, 15, 42–57.
Clow, D. The learning analytics cycle: Closing the loop effectively. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge; ACM: New York, NY, USA, 2012; pp. 134–138.
Wise, A.F. Designing pedagogical interventions to support student use of learning analytics. In Proceedings of the 4th International Conference on Learning Analytics and Knowledge; ACM: New York, NY, USA, 2014; pp. 203–211.
Azad, A.K.M.; Goodell, J.; Kessler, A.; Craig, S.D.; Saliah-Hassane, H. Learning Engineering - A System Design Approach for Engineering Education. In Proceedings of the 2025 ASEE Annual Conference & Exposition, Montreal, QC, Canada, June 2025.
Baker, R.S.; Boser, U.; Snow, E. Learning engineering: A view on where the field is at, where it is going, and the research needed. Technol. Mind Behav. 2022, 3.
Learning Engineering Toolkit: Evidence-Based Practices from the Learning Sciences, Instructional Design, and Beyond; Goodell, J., Kolodner, J.L., Eds.; Routledge: London, UK, 2023.
Pargman, T.C.; McGrath, C.; Viberg, O.; Knight, S. New vistas on responsible learning analytics: A data feminist perspective. J. Learn. Anal. 2023, 10, 133–148.
Prinsloo, P.; Slade, S. An elephant in the learning analytics room: The obligation to act. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference, 2017; pp. 46–55.
Amershi, S.; Weld, D.; Vorvoreanu, M.; Fourney, A.; Nushi, B.; Collisson, P.; et al. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019.
Vasconcelos, H.; Jörke, M.; Grunde-McLaughlin, M.; Gerstenberg, T.; Bernstein, M.S.; Krishna, R. Explanations can reduce overreliance on AI systems during decision-making. Proc. ACM Hum.-Comput. Interact. 2023, 7, 1–38.
Kaliisa, R.; Misiejuk, K.; López-Pernas, S.; Khalil, M.; Saqr, M. Have learning analytics dashboards lived up to the hype? A systematic review of 38 empirical studies. In Proceedings of the 14th Learning Analytics and Knowledge Conference, 2024; pp. 716–726.
Lekan, K.; Pardos, Z.A. AI-augmented advising: A comparative study of GPT-4 and advisor-based major recommendations. J. Learn. Anal. 2025, 12, 110–128.
Yan, L.; Zhao, L.; Echeverria, V.; Jin, Y.; Alfredo, R.; Li, X.; et al. VizChat: Enhancing learning analytics dashboards with contextualised explanations using multimodal generative AI chatbots. In Artificial Intelligence in Education; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 180–193.
Madsen, A.; Chandar, S.; Reddy, S. Are self-explanations from large language models faithful? Findings of the Association for Computational Linguistics: ACL 2024. 2024, pp. 295–337. Available online: https://aclanthology.org/2024.findings-acl.19/.
Parcalabescu, L.; Frank, A. On measuring faithfulness or self-consistency of natural language explanations. Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. 2024, Volume 1, 6048–6089.
Khosravi, H.; Shibani, A.; Jovanovic, J.; Pardos, Z.A.; Yan, L. Generative AI and learning analytics: Pushing boundaries, preserving principles. J. Learn. Anal. 2025, 12, 1–11.
Long, Y.; Luo, H.; Zhang, Y. Evaluating large language models in analysing classroom dialogue. npj Sci. Learn. 2024, 9, 60.
Ochoa, X.; Huang, X.; Shao, Y. Exploring the potential of generative AI to support non-experts in learning analytics practice. J. Learn. Anal. 2025, 12, 65–90.
Bender, E.M.; Koller, A. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020; pp. 5185–5198. Available online: https://aclanthology.org/2020.acl-main.463/.
Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021; pp. 610–623.
Si, C.; Goyal, N.; Wu, T.; Zhao, C.; Feng, S.; Daumé, H., III; Boyd-Graber, J. Large Language Models Help Humans Verify Truthfulness - Except When They Are Convincingly Wrong. Proc. NAACL 2024, 2024, 1459–1474.
Zhang, T.; Zhang, M.; Low, W.Y.; Yang, X.J.; Li, B.A. Conversational explanations: Discussing explainable AI with non-AI experts. In Proceedings of the 30th International Conference on Intelligent User Interfaces, 2025; pp. 409–424.
Schoeffer, J.; De-Arteaga, M.; Kuehl, N. Explanations, fairness, and appropriate reliance in human-AI decision-making. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024.
Buçinca, Z.; Malaya, M.B.; Gajos, K.Z. To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proc. ACM Hum.-Comput. Interact. 2021, 5, 1–21.
Klingbeil, A.; Grützner, C.; Schreck, P. Trust and reliance on AI - An experimental study on the extent and costs of overreliance on AI. Comput. Hum. Behav. 2024, 160, 108352.
Salvi, F.; Horta Ribeiro, M.; Gallotti, R.; West, R. On the conversational persuasiveness of GPT-4. Nat. Hum. Behav. 2025, 9, 1645–1653.
Liu, X.; Zambrano, A.F.; Baker, R.S.; Barany, A.; Ocumpaugh, J.; Zhang, J.; Pankiewicz, M.; Nasiar, N.; Wei, Z. Qualitative coding with GPT-4: Where it works better. J. Learn. Anal. 2025, 12, 169–185.
Chaleshtori, F.H.; Ghosal, A.; Gill, A.; Bambroo, P.; Marasović, A. On evaluating explanation utility for human-AI decision making in NLP. Find. Assoc. Comput. Linguist. EMNLP 2024, 7456–7504.
Holstein, K.; McLaren, B.M.; Aleven, V. Co-designing a real-time classroom orchestration tool to support teacher-AI complementarity. J. Learn. Anal. 2019, 6, 27–52.
Kasepalu, R.; Prieto, L.P.; Ley, T.; Chejara, P. Teacher artificial intelligence-supported pedagogical actions in collaborative learning coregulation: A wizard-of-oz study. Front. Educ. 2022, 7, 736194.
Olsen, J.K.; Rummel, N.; Aleven, V. Designing for the co-orchestration of social transitions between individual, small-group and whole-class learning in the classroom. Int. J. Artif. Intell. Educ. 2021, 31, 24–56.
Whitehead, R.; Nguyen, A.; Järvelä, S. Utilizing multimodal large language models for video analysis of posture in studying collaborative learning: A case study. J. Learn. Anal. 2025, 12, 186–200.
Sellberg, C.; Sharma, A. Toward multimodal learning analytics in simulation-based collaborative learning: A design ethnography of maritime training. Int. J. Comput.-Support. Collab. Learn. 2025, 20, 201–221.
Zhou, Q.; Suraworachet, W.; Cukurova, M. Detecting non-verbal speech and gaze behaviours with multimodal data and computer vision to interpret effective collaborative learning interactions. Educ. Inf. Technol. 2024, 29, 1071–1098.
Kizilcec, R.F. To advance AI use in education, focus on understanding educators. Int. J. Artif. Intell. Educ. 2024, 34, 12–19.
Lai, V.; Zhang, Y.; Chen, C.; Liao, Q.V.; Tan, C. Selective explanations: Leveraging human input to align explainable AI. Proc. ACM Hum.-Comput. Interact. 2023, 7, 1–35.
Baker, R.S.; Hawn, A. Algorithmic bias in education. Int. J. Artif. Intell. Educ. 2022, 32, 1052–1092.
Green, B.; Chen, Y. The principles and limits of algorithm-in-the-loop decision making. Proc. ACM Hum.-Comput. Interact. 2019, 3, 1–24.
Selbst, A.D.; Boyd, D.; Friedler, S.A.; Venkatasubramanian, S.; Vertesi, J. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019.
de la Iglesia, D.H.; Thomas, M.B.; Fuentes, C. AI-assisted qualitative analysis at scale: Opportunities and constraints for text-rich research. Quality & Quantity 2025.
Morris, W.; Holmes, L.; Choi, J.S.; Crossley, S. Automated scoring of constructed response items in math assessment using large language models. Int. J. Artif. Intell. Educ. 2025, 35, 559–586.
Acosta, H.; Lee, S.; Bae, H.; Feng, C.; Rowe, J.; Glazewski, K.; et al. Recognizing multi-party epistemic dialogue acts during collaborative game-based learning using large language models. Int. J. Artif. Intell. Educ. 2025, 35, 677–701.
Guerrero-Sosa, J.D.; Romero, F.P.; Menéndez-Domínguez, V.H.; Serrano-Guerrero, J.; Montoro-Montarroso, A.; Olivas, J.A. A comprehensive review of multimodal analysis in education. Appl. Sci. 2025, 15, 5896.
Schneider, B.; Worsley, M.; Martinez-Maldonado, R. Gesture and gaze: Multimodal data in dyadic interactions. In International Handbook of Computer-Supported Collaborative Learning; Springer International Publishing: Cham, Switzerland, 2021; pp. 625–641.
Alfredo, R.; Echeverria, V.; Jin, Y.; Yan, L.; Swiecki, Z.; Gašević, D.; Martinez-Maldonado, R. Human-centred learning analytics and AI in education: A systematic literature review. Comput. Educ. Artif. Intell. 2024, 6, 100215.
Lee, H.Y.; Chen, P.H.; Wang, W.S.; Huang, Y.M.; Wu, T.T. Empowering ChatGPT with guidance mechanism in blended learning: Effect of self-regulated learning, higher-order thinking skills, and knowledge construction. Int. J. Educ. Technol. High. Educ. 2024, 21, 16.
Wang, J.; Fan, W. The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: Insights from a meta-analysis. Humanit. Soc. Sci. Commun. 2025, 12, 1–21.
Curran, N.; de Leeuw, S.; Malyuga, E. AI and native speakerism: The intersections of technology, language assessment, and linguistic objectivity. Lang. Assess. Q. 2025.
Chen, G.H.; Chen, S.; Liu, Z.; Jiang, F.; Wang, B. Humans or LLMs as the judge? A study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 8301–8327. Available online: https://aclanthology.org/2024.emnlp-main.474/.
Gu, J.; Chen, H.; Feng, Y.; Chen, J.; Li, M.; Wu, Y. A survey on LLM-as-a-judge. arXiv 2024, arXiv:2411.15594.
Tan, S.; Zhuang, S.; Montgomery, K.; Tang, W.Y.; Cuadron, A.; Wang, C.; et al. Judgebench: A benchmark for evaluating LLM-based judges. arXiv 2024, arXiv:2410.12784.
Wataoka, K.; Takahashi, T.; Ri, R. Self-preference bias in LLM-as-a-judge. arXiv 2024, arXiv:2410.21819.
Sheng, Y.; Wang, C.; Chen, X. Effect of GenAI dependency on university students’ academic achievement: False self-efficacy and the moderating role of perceived teacher caring. Behav. Sci. 2025, 15, 1348.
Fitsilis, P.; Damasiotis, V.; Dervenis, C.; Kyriatzis, V.; Tsoutsa, P. Effective data stewardship in higher education: Skills, competences, and the emerging role of open data stewards. arXiv 2024, arXiv:2410.20361.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
