1. Introduction
In everyday classroom practice, teachers spend countless hours grading short-answer questions, a task that is not only time-consuming but also prone to inconsistency due to fatigue or subjective judgment. This burden limits teachers’ ability to provide timely, personalized feedback, which is critical for student learning. While automated short-answer grading (ASAG) offers a promising solution, existing methods struggle with three practical challenges that hinder their adoption in real classrooms. First, student answers are diverse: the same correct idea can be expressed in countless ways, yet most automated scoring models rely on limited reference answers, leading to frequent misclassifications (Tan et al., 2022). For example, in a biology short-answer question, some students may express “mRNA leaves the nucleus” as “mRNA goes out of the nucleus”, a phrasing that is semantically correct but easily flagged as incorrect by keyword-matching systems. Second, score boundaries are ambiguous: teachers often encounter partially correct or creatively worded answers that fall into the “gray area” between score levels, making precise classification difficult. Third, data distributions are imbalanced: in real classrooms, high-scoring answers are relatively rare, resulting in training data that biases models toward lower scores and reduces their ability to recognize excellent responses (Kaldaras et al., 2022). These challenges not only reduce scoring accuracy but also undermine teachers’ trust in automated grading systems.
To address these challenges, researchers have explored various approaches to improving ASAG. The following sections review prior work on text classification methods, reference answer set construction, and data augmentation, leading to the formulation of our research hypotheses.
1.1. Text Classification Methods and Their Educational Limitations
In automated scoring research, text classification serves as the core technical framework. Early studies employed machine learning methods that relied on manually engineered features to build scoring models (Kumar et al., 2019; Saha et al., 2018; Sultan et al., 2016). These approaches are not only labor-intensive but also struggle to capture complex semantic relationships in student answers. Deep learning techniques, such as convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), reduce the need for manual feature extraction but often fail to account for global semantic interactions (Alikaniotis et al., 2016; Huang et al., 2018; Riordan et al., 2017).
Recent studies have introduced graph convolutional networks (GCNs) to model global semantic structures in student answers (Tan et al., 2023). However, GCNs assume uniform importance among adjacent nodes, neglecting variations in how different words contribute to meaning (Gilmer et al., 2017). More importantly, traditional graphs are limited to pairwise connections, making it difficult to capture complex multi-word interactions in student answers. Hypergraphs address this limitation by allowing a single edge (hyperedge) to connect any number of nodes, enabling more effective modeling of high-order semantic relationships (Feng et al., 2019; Kim et al., 2020). Ding et al. (2020) proposed the HyperGraph Attention Network (HyperGAT), which achieved superior performance on text classification tasks compared to traditional methods, though its accuracy on the Ohsumed medical dataset remained modest at 0.69, indicating room for improvement.
Meanwhile, the emergence of pre-trained language models such as BERT has brought significant advances in text understanding. Unlike static word embeddings, BERT generates dynamic, context-dependent representations, effectively resolving polysemy (Devlin et al., 2019). However, BERT alone struggles to capture the structured semantic relationships between student answers and reference answers. Therefore, Study 1 proposes integrating BERT with HyperGAT to develop a HyperGAT-BERT framework, with the goal of improving text classification accuracy. We hypothesize that combining BERT’s contextual awareness with HyperGAT’s ability to model high-order relationships will better represent student answer semantics and thereby enhance automated scoring performance.
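To make the proposed integration concrete, the sketch below shows one way BERT token representations could serve as node features for a two-level hypergraph attention layer. This is a minimal illustration under our own simplifying assumptions (the layer sizes, the toy single-hyperedge incidence matrix, and the bert-base-uncased checkpoint are ours), not the framework’s actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class HypergraphAttentionLayer(nn.Module):
    """Two-level (node -> hyperedge -> node) attention, loosely following
    the aggregation scheme of HyperGAT (Ding et al., 2020)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.node_proj = nn.Linear(in_dim, out_dim)
        self.edge_proj = nn.Linear(out_dim, out_dim)
        self.att_node = nn.Linear(out_dim, 1)  # scores nodes within each hyperedge
        self.att_edge = nn.Linear(out_dim, 1)  # scores hyperedges around each node

    def forward(self, x, incidence):
        # x: (N, in_dim) node features; incidence: (N, E) binary membership matrix.
        # Assumes every node and every hyperedge has at least one membership;
        # otherwise the masked softmax below produces NaNs.
        h = self.node_proj(x)                                           # (N, D)
        # Node-level attention: aggregate member nodes into hyperedge features.
        a = self.att_node(h).expand(-1, incidence.size(1))              # (N, E)
        a = a.masked_fill(incidence == 0, float("-inf")).softmax(dim=0)
        edge_feat = torch.tanh(self.edge_proj(a.t() @ h))               # (E, D)
        # Edge-level attention: aggregate incident hyperedges back to each node.
        b = self.att_edge(edge_feat).t().expand(incidence.size(0), -1)  # (N, E)
        b = b.masked_fill(incidence == 0, float("-inf")).softmax(dim=1)
        return F.elu(b @ edge_feat)                                     # (N, D)

# BERT supplies contextual node features; the hypergraph layer refines them.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
enc = tokenizer("mRNA goes out of the nucleus", return_tensors="pt")
with torch.no_grad():
    tokens = bert(**enc).last_hidden_state.squeeze(0)  # (T, 768)

layer = HypergraphAttentionLayer(768, 128)
incidence = torch.ones(tokens.size(0), 1)  # toy: one hyperedge over all tokens
node_repr = layer(tokens, incidence)       # refined token representations
```

In a full model, hyperedges would group tokens by sentence, topic, or sliding window rather than by a single catch-all edge, so that the attention weights can reflect multi-word interactions.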
1.2. Reference Answer Sets
The coverage of reference answers is a critical factor affecting ASAG accuracy (Burrows et al., 2015; Valenti et al., 2003). In practice, the diversity of student answers often exceeds the coverage of predefined reference sets, leading to many correct answers being incorrectly flagged as wrong. This problem is particularly acute for open-ended questions (Tan et al., 2022).
To address this issue, researchers have explored ways to construct more comprehensive reference answer sets. Lan et al. (2015) proposed clustering student answers and having experts select representative examples from each cluster, streamlining the scoring process. Marvaniya et al. (2018) further clustered answers by score level, selecting representative answers from each level. Specifically, reference answer set construction involves two key steps (Tan et al., 2019): (1) clustering analysis to identify distinct answer patterns, with each cluster representing a unique response type; and (2) selecting one or more prototypical answers from each cluster to compile the reference answer set, as illustrated in Figure 1.
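As a concrete illustration of these two steps, the snippet below clusters precomputed answer embeddings and selects the answer nearest each centroid as that cluster’s prototype. KMeans, the cluster count, and the nearest-to-centroid selection rule are illustrative assumptions on our part rather than details taken from the cited studies.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def build_reference_set(answers, embeddings, n_clusters=5):
    """Step 1: cluster the answer embeddings; step 2: pick the answer
    closest to each centroid as that cluster's prototype reference answer."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    proto_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
    return [answers[i] for i in proto_idx]
```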
Furthermore, Siamese Neural Networks (SNNs), initially proposed for tasks such as face recognition, signature verification, and similarity learning (Bromley et al., 1993), have recently been adapted to educational applications, particularly in automated scoring of constructed-response items. The architecture comprises twin subnetworks that share identical parameters (e.g., weights and biases) but process two distinct input samples. As shown in Figure 2, this design enables SNNs to effectively quantify similarity or dissimilarity between paired inputs through comparative analysis.
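A minimal sketch of this shared-weight design follows. The encoder here is a generic feed-forward network standing in for any answer encoder (in Study 2 that role would be played by HyperGAT-BERT); the layer sizes and the cosine-similarity head are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class SiameseScorer(nn.Module):
    """Twin branches that reuse one encoder, so both inputs pass through
    identical weights before their similarity is computed."""

    def __init__(self, in_dim=768, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, student_vec, reference_vec):
        a = self.encoder(student_vec)    # both calls share the same parameters,
        b = self.encoder(reference_vec)  # which is what makes the network Siamese
        return F.cosine_similarity(a, b, dim=-1)
```

At scoring time, a student answer can be compared against every prototype in the reference answer set, with the most similar reference (or the similarity profile across score levels) informing the predicted score.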
Based on this analysis, Study 2 proposes constructing a Reference Answer Set (RAS) via clustering and integrating it with SNNs to form the HyperGAT-BERT-RAS framework. We hypothesize that the RAS will cover more diverse answer patterns, while the SNN will more precisely compute semantic similarity between student answers and reference answers. Ablation experiments will validate the contribution of RAS to scoring accuracy.
1.3. Data Augmentation
In real classrooms, the distribution of student scores is often skewed: most students cluster around middle or low scores, while high-scoring answers are relatively rare. This imbalance biases automated scoring models toward majority classes and reduces their ability to recognize minority classes (Kaldaras et al., 2022). While balanced data distribution is critical for model performance (Shorten & Khoshgoftaar, 2019), collecting more high-scoring answers in classroom settings is often impractical. Data augmentation thus emerges as a viable alternative (Cochran et al., 2022).
Early NLP data augmentation methods included synonym replacement, back-translation, and random deletion (Wei & Zou, 2019; Yu et al., 2018). However, these approaches risk semantic distortion or over-reliance on rule-based heuristics. The advent of generative language models has opened new avenues for data augmentation (Bayer et al., 2023). Notably, GPT-4, with its powerful language understanding and generation capabilities, has demonstrated exceptional performance in text augmentation tasks (OpenAI, 2023). Studies have shown that GPT-4-generated synthetic data can effectively improve model performance in low-resource classification tasks (Dai et al., 2023; Møller et al., 2023; Ubani et al., 2023). Cochran et al. (2023) further demonstrated that ChatGPT-augmented data improves automated essay scoring accuracy in low-data regimes.
Therefore, the latter phase of Study 2 employs GPT-4 for data augmentation, generating student answers for the underrepresented score levels 2 and 3 in the ASAP-5 dataset. By balancing the data distribution, we aim to further improve the accuracy of HyperGAT-BERT-RAS for automated short-answer scoring.
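The sketch below shows one way such targeted generation could be scripted with the openai Python SDK. The prompt wording, sampling temperature, and 0-3 rubric framing are illustrative assumptions, not the study’s exact prompt design.

```python
from openai import OpenAI  # assumes the openai>=1.0 SDK and a configured API key

client = OpenAI()

# Hypothetical prompt template; the study's actual prompts may differ.
PROMPT = (
    "You are simulating a student answering the following science question:\n"
    "{question}\n"
    "Write one answer that a rubric would place at score level {score} on a "
    "0-3 scale. Use natural, imperfect student wording and do not copy the "
    "reference answer."
)

def generate_synthetic_answers(question, score, n=5):
    """Generate n synthetic answers for an underrepresented score level."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": PROMPT.format(question=question, score=score)}],
            temperature=1.0,  # higher temperature encourages lexical diversity
        )
        answers.append(resp.choices[0].message.content.strip())
    return answers
```

Synthetic answers generated this way would still need the quality screening and leakage checks described in Section 1.4 before being added to the training set.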
1.4. Novel Contributions of This Work
Based on the above analysis, this study makes the following four contributions: (1) Technical integration. We are the first to jointly optimize BERT’s contextualized representations with HyperGAT’s hypergraph structure for ASAG, addressing the challenge of precise semantic matching. (2) RAS-guided Siamese scoring. Unlike prior work that uses the RAS only for answer selection, we embed the RAS directly into a Siamese architecture to compute fine-grained similarity scores. (3) Controlled LLM augmentation. We provide a systematic evaluation of GPT-4 augmentation for ASAG, including prompt design, quality assessment, and leakage prevention. (4) Educational significance. By improving scoring accuracy and consistency, this study aims to reduce teacher grading burden, support formative assessment implementation, and provide a reliable tool for AI-assisted assessment in real classrooms.