WindPower-SAFusion: A Sparse-Attention and Multi-Scale Fusion Model for Wind Power Forecasting

Xuegong Zhang; Yarou Li; Zhuo Shao; Huzi Qiu; Jiatai Shi; Jing Wang; Dongdong Zhang; Xuejing Zhao

doi:10.20944/preprints202605.1395.v1

Submitted: 20 May 2026

Posted: 20 May 2026


Abstract

With the increasing penetration of wind power, the uncertainty of wind power generation poses greater challenges to the secure operation of power grids. This paper proposes WindPower-SAFusion, an improved Informer-based model for wind power forecasting. The proposed model optimizes long-sequence modeling from three aspects. First, ProbSparse self-attention is adopted to reduce the computational complexity from $O(L^{2})$ to $O(L \log L)$. Second, a convolutional distillation encoder is introduced to compress the input sequence and highlight key temporal features. Third, a multivariate fusion and recursive multi-step forecasting framework is constructed. Using historical power and wind speed information, experiments are conducted on measured data from the Daliang Wind Farm in Guazhou, Gansu Province, China. The results show that the proposed model significantly outperforms several mainstream forecasting models in 1-day, 3-day, and 7-day forecasting tasks. Ablation experiments further demonstrate that each core module plays a critical role in improving forecasting accuracy and generalization performance. Therefore, the proposed method provides a technically feasible solution with promising engineering application potential for power grid dispatching and wind power management.

Keywords: wind power forecasting; deep learning; attention mechanism; Informer model; time series analysis

Subject: Computer Science and Mathematics - Other

1. Introduction

With the accelerated global transition toward a green and low-carbon energy structure, wind power, as an important component of clean and renewable energy, has experienced continuous growth in both installed capacity and grid-connected scale. However, the intermittent and fluctuating nature of wind energy leads to significant uncertainty in wind power generation, posing severe challenges to the secure and stable operation of power systems and real-time power balance [1]. In this context, improving the accuracy of wind power forecasting is not only a key step in providing forward-looking decision support for power grid dispatching, but also an important technical approach for enhancing wind power accommodation and reducing wind curtailment. From the perspective of power grid operation, wind power forecasting errors directly increase the demand for balancing reserves and raise imbalance assessment costs. Therefore, improving forecasting accuracy affects not only the accommodation level of wind power, but is also closely related to the allocation of frequency regulation and peak shaving resources, as well as the operational economy of the power system [2].

At present, wind power forecasting methods can be broadly classified into physical methods and statistical learning methods. Physical models, which are usually driven by numerical weather prediction, are based on fluid dynamics and atmospheric physical equations. These models use numerical weather prediction (NWP) data to simulate the meteorological conditions of wind farms and estimate the corresponding power generation [3]. Typical global forecasting models include the Global Forecast System (GFS) model developed by the National Weather Service of the United States [4] and the model of the European Centre for Medium-Range Weather Forecasts (ECMWF) [5]. Such methods rely heavily on the quality of meteorological data and are mainly suitable for medium- and long-term forecasting ranging from 6 h to several days [6]. Nevertheless, they often suffer from high computational complexity and have difficulty characterizing local microtopography and small-scale turbulence [7]. To address local wind field differences caused by complex terrain, previous studies have proposed finer-grained spatiotemporal modeling strategies, such as micro-spatiotemporal grids, to enhance the representation of terrain effects and improve short-term power forecasting accuracy in complex terrain scenarios [8].

Statistical models establish time-series relationships between inputs, such as wind speed and wind direction, and outputs, namely wind power, using historical data [9]. Common statistical approaches include autoregressive models (AR), autoregressive integrated moving average models (ARIMA), and exponential smoothing (ES), which are classical time-series forecasting methods [10,11]. However, these methods generally have limited capability in handling nonlinear and non-stationary sequences and often depend on a large number of high-quality samples. In recent years, machine learning methods such as support vector machines (SVMs) [12,13,14] and random forests [15] have improved fitting capability to some extent. However, they still have limitations in modeling long-term dependencies, and the problem that forecasting accuracy decreases significantly as the prediction horizon increases has not been fundamentally solved [16]. In particular, when single-step forecasting is extended to multi-step forecasting, recursive strategies are prone to error accumulation, whereas direct strategies often ignore the temporal correlations among different forecasting horizons, resulting in degraded prediction performance at multiple future time points.

Deep learning has promoted wind speed and wind power forecasting toward a data-driven paradigm. Nevertheless, it still needs to address several challenges, including noise, uncertainty, and multi-scale coupling [17]. Meanwhile, for dispatching and risk-constrained applications, single-valued point forecasts are often insufficient. Probabilistic forecasting can more directly characterize forecasting uncertainty and provide more useful inputs for reserve allocation, risk assessment, and other decision-making tasks [18]. In addition, physics-informed and hybrid neural forecasting models have recently been introduced to improve the physical consistency and generalization capability of renewable energy forecasting systems [19,20].

The remainder of this paper is organized as follows. Section 2 introduces the proposed WindPower-SAFusion framework, including sparse attention, the distillation encoder, multivariate fusion, and recursive multi-step forecasting. Section 3 presents the experimental study, where comparative experiments and ablation studies are conducted based on measured data to evaluate the engineering applicability of the proposed method. Section 4 concludes the paper and discusses future research directions.

2. Materials and Methods

WindPower-SAFusion is developed based on an improved Informer architecture and adopts an encoder–decoder structure for long-sequence wind power forecasting. The proposed model mainly consists of four components. First, ProbSparse attention is used to focus on critical time steps while reducing computational complexity. Second, a convolutional distillation encoder progressively compresses the sequence and extracts multi-scale temporal features. Third, multivariate fusion jointly exploits historical power and wind speed information. Finally, a recursive multi-step forecasting strategy is employed to generate future sequences step by step, thereby improving prediction stability. The overall workflow of the proposed model is illustrated in Figure 1.

2.1. ProbSparse Attention Mechanism

The full attention mechanism in the conventional Transformer [21] has a computational complexity of $O(L^{2})$ when processing long sequences, which makes it difficult to meet the requirements of industrial long-sequence forecasting tasks such as wind power prediction. To address this issue, WindPower-SAFusion introduces the ProbSparse attention mechanism, which was originally proposed in Informer [22]. Its core idea is based on an important observation: only a small number of query vectors play a dominant role in the final attention distribution. Based on this observation, a query sparsity measurement function is adopted to identify important query positions:

$$M(q_{i}, K) = \max_{j} \left\{ \frac{q_{i} k_{j}^{T}}{\sqrt{d_{k}}} \right\} - \frac{1}{L_{k}} \sum_{j=1}^{L_{k}} \frac{q_{i} k_{j}^{T}}{\sqrt{d_{k}}}, \tag{1}$$

where $q_{i}$ denotes the i-th query vector, $k_{j}$ denotes the j-th key vector, $d_{k}$ is the key dimension, and $L_{k}$ is the length of the key sequence. A large value of $M$ indicates a query whose scaled dot-product scores are dominated by a few keys, that is, a query whose attention distribution departs markedly from a uniform one.
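As an illustration of Eq. (1), the following minimal sketch computes the sparsity measure for two toy queries in plain Python. The function name `sparsity_measure` and the toy vectors are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

def sparsity_measure(q, keys):
    """M(q, K) from Eq. (1): max minus mean of the scaled dot products."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    return max(scores) - sum(scores) / len(scores)

# Toy key set (illustrative values only)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

peaked_query = [3.0, 0.0]  # scores differ strongly across keys
flat_query = [0.0, 0.0]    # scores every key identically

m_peaked = sparsity_measure(peaked_query, keys)
m_flat = sparsity_measure(flat_query, keys)
```

In this toy case the flat query receives a measure of zero, whereas the peaked query receives a positive value, so a selection rule based on $M$ would retain the peaked query.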

In the actual implementation, an efficient approximate Top-K selection strategy is adopted, in which only the most important attention connections for each query are retained. The sparse attention matrix is defined as

$$A_{sparse} = \mathrm{Mask}\left(\mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right), \mathrm{TopK}\right). \tag{2}$$

After sparsification, the computational complexity of self-attention is reduced from the original $O(L_{q} L_{k})$ to $O(L_{q} \log L_{k})$. This significantly improves computational efficiency in long-sequence forecasting tasks while reducing noise interference and maintaining the representation capability of the attention mechanism [22,23].
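The masking in Eq. (2) can be sketched as follows, under the assumption that the mask keeps the k largest softmax weights in each query row and zeroes the rest. The helper names and score values are hypothetical. The sketch computes the dense score matrix first, so it only illustrates the masking step and does not reproduce the complexity saving of an efficient implementation; whether the retained weights are renormalized is not stated in the text, and they are left unnormalized here.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def topk_masked_attention(scores, k):
    """Row-wise softmax, then zero all but the k largest weights in each row."""
    masked = []
    for row in scores:
        weights = softmax(row)
        order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        keep = set(order[:k])
        masked.append([w if j in keep else 0.0 for j, w in enumerate(weights)])
    return masked

# Already-scaled attention scores for two queries and four keys (toy values)
scores = [[2.0, 0.1, -1.0, 0.5], [0.0, 3.0, 0.2, 0.1]]
a_sparse = topk_masked_attention(scores, k=2)
```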

2.2. Encoder Module Design

The encoder consists of multi-layer ProbSparse attention and a convolutional feed-forward network (Conv-FFN), aiming to capture both long-range dependencies and local temporal patterns. The Conv-FFN replaces the conventional feed-forward network with convolutional operations to emphasize the correlations between adjacent time steps [24]. Its formulation is given by

$$\mathrm{ConvFFN}(Z) = W_{2} * (W_{1} * Z + b_{1}) + b_{2}, \tag{3}$$

where $*$ denotes the convolution operation, $Z$ is the input feature representation, and $W_{1}$, $W_{2}$, $b_{1}$, and $b_{2}$ are learnable parameters. This design enables the model to capture local temporal patterns more effectively and enhances its ability to identify short-term fluctuations [24].

Finally, a distillation compression mechanism is introduced at the end of each encoder layer. Specifically,

$$X_{distill}^{(j)} = \mathrm{ReLU}\left(\mathrm{Conv1D}\left(Z^{(j)}\right)\right), \tag{4}$$

where the convolution kernel size is set to 3, the stride is set to 2, and the padding size is set to 1. This mechanism progressively halves the sequence length while retaining key information, allowing the encoder to construct multi-level temporal features with lower computational overhead.
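With kernel size 3, stride 2, and padding 1, the standard output-length formula for a strided convolution gives $\lfloor (L + 2 \cdot 1 - 3)/2 \rfloor + 1$, which equals $L/2$ for even $L$. The following single-channel sketch checks this for an input of 96 steps; the kernel weights and input series are hypothetical placeholders, and a real layer would operate on multi-channel features with learned weights.

```python
def conv1d_relu(x, kernel, stride=2, padding=1):
    """Single-channel 1-D convolution with zero padding, followed by ReLU."""
    padded = [0.0] * padding + list(x) + [0.0] * padding
    k = len(kernel)
    out = []
    for start in range(0, len(padded) - k + 1, stride):
        value = sum(w * padded[start + i] for i, w in enumerate(kernel))
        out.append(max(0.0, value))  # ReLU
    return out

def output_length(length, kernel_size=3, stride=2, padding=1):
    """Output length of a strided 1-D convolution."""
    return (length + 2 * padding - kernel_size) // stride + 1

x = [float(t % 7) for t in range(96)]  # toy input of 96 time steps
kernel = [0.25, 0.5, 0.25]             # illustrative weights, not learned values

layer1 = conv1d_relu(x, kernel)        # 96 -> 48 steps
layer2 = conv1d_relu(layer1, kernel)   # 48 -> 24 steps
```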

2.3. Decoder Module Design

The decoder adopts a generative multi-step forecasting framework. First, the output at the end of the encoder is used as the initial condition:

$$T_{start} = \mathrm{Memory}[:, -1:, :]. \tag{5}$$

Then, the initial decoder input is constructed by repeated operations:

$$T_{input} = \mathrm{Repeat}(T_{start}, \mathrm{output}_{w}) + PE, \tag{6}$$

where $PE$ denotes the positional encoding information specially designed for the output window. This design ensures temporal continuity in the generated sequence.
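Eqs. (5) and (6) can be sketched for a single sample as follows. The function name `build_decoder_input`, the toy encoder memory, and the positional-encoding values are all hypothetical; the form of the output-window positional encoding is not specified in the text, so it is passed in as a plain argument.

```python
def build_decoder_input(memory, output_w, pe):
    """Repeat the last encoder state output_w times and add a positional encoding.

    memory: encoder output for one sample, indexed as [time][feature].
    pe: positional encoding for the output window, indexed as [step][feature].
    """
    t_start = memory[-1]  # corresponds to Memory[:, -1:, :] for one sample
    return [[s + p for s, p in zip(t_start, pe[step])] for step in range(output_w)]

memory = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]           # toy encoder output
pe = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]   # placeholder encoding
t_input = build_decoder_input(memory, output_w=4, pe=pe)
```

Every row of `t_input` starts from the same final encoder state and differs only through the positional term, which is what distinguishes the steps of the output window before decoding.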

The first layer of the decoder adopts a masked self-attention mechanism [21], ensuring that each position can only attend to previous positions. This effectively prevents future information leakage. In the encoder–decoder attention layer, the model performs deep fusion between the decoding process and the encoded features through

$$\mathrm{CrossAttn} = \mathrm{SparseAttn}(T_{dec}, \mathrm{Memory}, \mathrm{Memory}). \tag{7}$$

This operation realizes the interaction between decoder states and encoder memory representations. When the length of the decoder output sequence does not match the length of the encoder input, linear interpolation is further applied for alignment, ensuring effective interaction across features of different dimensions. This structure supports recursive generation of future sequences and improves the stability of multi-step forecasting.

2.4. Positional Encoding and Input Representation

The model uses absolute positional encoding to describe the temporal structure of the sequence. The positional encoding is defined as

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{8}$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right). \tag{9}$$

The multivariate input is mapped to the model dimension through a linear projection:

$$X_{embedded} = W_{e} X_{input} + b_{e}, \tag{10}$$

and the final input representation is obtained as

$$X_{final} = X_{embedded} \times \sqrt{d_{model}} + PE. \tag{11}$$

This representation preserves the numerical structure of the input features while incorporating temporal positional information, thereby providing suitable input conditions for subsequent attention computation.
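Eqs. (8) to (11) can be combined into a short sketch. The projection weights, the choice of $d_{model} = 4$, and the input values below are hypothetical and serve only to make the example runnable; in the model these weights would be learned.

```python
import math

def positional_encoding(length, d_model):
    """Sinusoidal encoding: sine on even feature indices, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(length)]
    for pos in range(length):
        for even in range(0, d_model, 2):  # even = 2i in Eqs. (8) and (9)
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe

def embed(x_input, w_e, b_e):
    """Linear projection of each time step, Eq. (10)."""
    return [[sum(w * v for w, v in zip(row, x)) + b for row, b in zip(w_e, b_e)]
            for x in x_input]

def final_representation(x_input, w_e, b_e):
    """Scale the embedding by sqrt(d_model) and add the encoding, Eq. (11)."""
    d_model = len(w_e)
    pe = positional_encoding(len(x_input), d_model)
    scale = math.sqrt(d_model)
    embedded = embed(x_input, w_e, b_e)
    return [[e * scale + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embedded, pe)]

# Two input features (power, wind speed) projected to d_model = 4
w_e = [[0.1, 0.0], [0.0, 0.1], [0.2, -0.1], [0.05, 0.05]]  # illustrative weights
b_e = [0.0, 0.0, 0.0, 0.0]
x_input = [[1.0, 2.0], [0.5, 1.5], [0.0, 1.0]]             # standardized toy inputs
x_final = final_representation(x_input, w_e, b_e)
```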

2.5. Training and Forecasting Strategy

During the training stage, the mean squared error is used as the loss function:

$$L = \frac{1}{N} \sum_{i=1}^{N} \left(y_{i} - \hat{y}_{i}\right)^{2}, \tag{12}$$

where $y_{i}$ and $\hat{y}_{i}$ denote the observed and predicted wind power values, respectively, and $N$ is the number of samples.

The learning rate is dynamically adjusted using a cosine annealing strategy [25]:

$$\eta_{t} = \eta_{min} + \frac{1}{2} \left(\eta_{max} - \eta_{min}\right) \left(1 + \cos\left(\frac{T_{cur}}{T_{max}} \pi\right)\right). \tag{13}$$

This dynamic learning-rate adjustment enables rapid convergence in the early training stage and fine-grained optimization in the later stage, which can help the optimizer avoid poor local optima.
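The schedule in Eq. (13) can be evaluated directly. In the sketch below, the values of `eta_max`, `eta_min`, and `t_max` are hypothetical, since the text does not report the settings used in the experiments.

```python
import math

def cosine_annealing_lr(t_cur, t_max, eta_max, eta_min=0.0):
    """Learning rate at step t_cur under the cosine annealing rule of Eq. (13)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))

# Illustrative settings: 50 epochs, decaying from 1e-3 to 1e-5
schedule = [cosine_annealing_lr(t, t_max=50, eta_max=1e-3, eta_min=1e-5)
            for t in range(51)]
```

The rate starts at `eta_max`, passes through the midpoint of the two bounds halfway through the cycle, and reaches `eta_min` at `t_max`.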

For multi-step forecasting, a recursive strategy is adopted to generate future power sequences:

$$\hat{y}_{t} = f\left(X_{t-\tau : t-1}\right), \tag{14}$$

$$X_{t} = \left[\hat{y}_{t}, \mathrm{wind\_speed}_{t}\right]. \tag{15}$$

By combining the predicted future power values with available wind speed features, this strategy effectively suppresses error accumulation and improves the stability of long-sequence forecasting.
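The recursion in Eqs. (14) and (15) can be sketched as a simple loop. The one-step model `toy_model` is a hypothetical stand-in for the trained network, and the sketch assumes that a wind speed value is available for every future step, for example from a separate forecast; the text does not specify its source.

```python
def recursive_forecast(model, history, future_wind_speed, window):
    """Roll a one-step model forward, feeding each prediction back as input.

    history: list of [power, wind_speed] rows, at least `window` long.
    future_wind_speed: wind speed assumed to be available for each future step.
    """
    buffer = [list(row) for row in history[-window:]]
    predictions = []
    for wind in future_wind_speed:
        y_hat = model(buffer)                   # Eq. (14)
        predictions.append(y_hat)
        buffer = buffer[1:] + [[y_hat, wind]]   # Eq. (15): append the new X_t
    return predictions

def toy_model(window_rows):
    # Placeholder predictor: mean of the power values in the window
    return sum(row[0] for row in window_rows) / len(window_rows)

history = [[10.0, 5.0], [12.0, 5.5], [14.0, 6.0]]
preds = recursive_forecast(toy_model, history, [6.2, 6.1], window=3)
```

Because each prediction re-enters the input window, any error in an early step propagates to later steps, which is why the quality of the accompanying wind speed feature matters for longer horizons.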

3. Results and Discussion

3.1. Data Description

The experimental data were collected from the Daliang Wind Farm in Guazhou, Gansu Province, China. The temporal resolution of the data is 15 min. The dataset contains two features, namely wind speed and theoretical power, and covers the period from November 2022 to June 2025. The data were divided into the training set, validation set, and test set in chronological order. Missing values were handled using forward and backward filling, and standardization was performed as part of data preprocessing. Specifically, the wind speed and power variables were standardized using the following formula:

$$x_{std} = \frac{x - \mu}{\sigma}, \tag{16}$$

where $x$ denotes the original value, $\mu$ and $\sigma$ denote the mean and standard deviation of the corresponding variable in the training set, respectively, and $x_{std}$ denotes the standardized value. The same scaling parameters estimated from the training set were applied to the validation and test sets to avoid information leakage.
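A minimal sketch of Eq. (16), with the statistics fitted on the training split only and then reused for later splits. The function names and sample values are hypothetical. The population standard deviation (divisor $N$) is used here, which matches the convention of scikit-learn's `StandardScaler`.

```python
import math

def fit_scaler(train_values):
    """Estimate the mean and population standard deviation from training data."""
    mean = sum(train_values) / len(train_values)
    var = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    return mean, math.sqrt(var)

def transform(values, mean, std):
    """Apply Eq. (16) with previously fitted statistics."""
    return [(v - mean) / std for v in values]

train_speed = [2.0, 4.0, 6.0, 8.0]  # toy training wind speeds
test_speed = [5.0, 10.0]            # toy test wind speeds

mean, std = fit_scaler(train_speed)
train_std = transform(train_speed, mean, std)
test_std = transform(test_speed, mean, std)  # reuses the training statistics
```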

3.2. Experimental Settings

In this study, the dataset was divided into the training set, validation set, and test set in chronological order. A four-fold rolling-window cross-validation strategy was adopted to evaluate the generalization ability of the model [26]. The input sequence length was set to 96, and the forecasting windows included 1 day, 3 days, and 7 days. During training, automatic mixed precision (AMP) and the float32 data type were used. Power and wind speed were standardized using StandardScaler. An early stopping strategy and recursive forecasting were adopted to achieve multi-step prediction. Finally, the model stability was evaluated over 10 training rounds using the mean absolute error (MAE) and the coefficient of determination ($R^{2}$).
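For reference, the two evaluation metrics can be computed as in the sketch below, using their standard definitions. The sample values are hypothetical and are not taken from the experiments.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]  # toy observed power (MW)
y_pred = [12.0, 18.0, 33.0, 39.0]  # toy predicted power (MW)

example_mae = mae(y_true, y_pred)
example_r2 = r2_score(y_true, y_pred)
```

MAE is expressed in the units of the target (MW here), whereas $R^{2}$ is dimensionless and depends on the variance of the observations in the evaluation window, so the two metrics need not move together across horizons.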

3.3. Forecasting Result Analysis

To comprehensively evaluate the forecasting performance of WindPower-SAFusion under different time scales, three representative forecasting horizons, namely 1 day, 3 days, and 7 days, were selected for detailed analysis. The forecasting results are shown in Figure 2.

For short-term forecasting with a 1-day horizon, WindPower-SAFusion achieves an $R^{2}$ value of 0.9077 and an MAE of 6.4629 MW. These results indicate that the model can accurately capture intraday power variations and maintain good tracking capability around sudden change points. This improvement is mainly attributed to the effective aggregation of key time steps by the sparse attention mechanism.

For medium-term forecasting with a 3-day horizon, the model still maintains high accuracy, with an $R^{2}$ value of 0.8189 and an MAE of 3.9934 MW. As the forecasting horizon increases, the overall trend fitting remains stable, with only slight deviations at individual extreme points. This suggests that the multi-layer distillation encoder has advantages in balancing short-term fluctuations and medium-term trend modeling.

For medium- to long-term forecasting with a 7-day horizon, the model obtains an $R^{2}$ value of 0.6880 and an MAE of 8.2465 MW. Although the forecasting accuracy decreases to some extent, the overall predicted trend remains consistent with the observed values. This indicates that the recursive multi-step forecasting strategy can maintain good temporal continuity in long-sequence forecasting.

Overall, the results at different time scales demonstrate that WindPower-SAFusion exhibits stable forecasting capability within the 1–7 day range. The sparse attention mechanism improves the extraction of key temporal features, the distillation mechanism enhances multi-scale feature fusion, and the multivariate input improves the adaptability and robustness of the model across different forecasting horizons. Through the collaborative effects of these modules, the proposed model achieves superior accuracy and stability in long-sequence wind power forecasting.

3.4. Experimental Design

3.4.1. Comparative Experimental Design

Several representative time-series and wind power forecasting models were selected as comparison models in this study. The selection principles were as follows: (1) the selected models should be representative or advanced in related fields; and (2) the models should cover different modeling frameworks, including basic deep learning models, economically driven models, and physical-information-fusion models.

Based on these principles, N-BEATS, Temporal Fusion Transformer (TFT), TS-CNN-Attention, and ARIMA were selected as comparison models. First, N-BEATS is a purely deep learning-based time-series forecasting model. It relies on a multi-layer fully connected structure for feature decomposition and has good interpretability [27]. Second, TFT is an interpretable multi-horizon time-series forecasting model that combines recurrent structures, attention mechanisms, and variable selection components [28]. Third, TS-CNN-Attention (TCA) combines convolutional sequence modeling and attention mechanisms to capture local temporal features and key temporal dependencies [21,24]. Fourth, ARIMA provides a classical linear statistical modeling baseline and has strong reference value in stationary sequence forecasting [11].

All models were evaluated under the same data partitioning and training procedure. The input features were standardized in a unified manner. For the deep learning models, the Adam optimizer was adopted, and an early stopping strategy was used. The comprehensive performance of WindPower-SAFusion was evaluated by comparing the MAE and $R^{2}$ metrics.

3.4.2. Comparative Experimental Result Analysis

To verify the performance of the proposed model, four comparison models were selected for evaluation. The corresponding results are presented in Table 1, where the metrics were calculated on the test set of the complete rolling window. In this study, significance levels are denoted by asterisks: * indicates $p < 0.05$, ** indicates $p < 0.01$, and *** indicates $p < 0.001$. Significance tests were conducted using two-sided t-tests to ensure the rigor and reliability of the evaluation results.

According to the comparative results in Table 1, WindPower-SAFusion achieves the best performance across all evaluation metrics. Its MAE is 4.3478, which is significantly lower than that of ARIMA (10.6463), N-BEATS (7.6895), TFT (8.2432), and TS-CNN-Attention (6.9182). The corresponding error reductions are 59.2%, 43.4%, 47.3%, and 37.1%, respectively. Meanwhile, the $R^{2}$ value of WindPower-SAFusion reaches 0.8977, showing a substantial improvement over all comparison models. This indicates that the proposed model has a stronger ability to fit wind power variations. The differences between the proposed model and the comparison models are statistically significant according to the significance tests ($p < 0.01$), further verifying the statistical reliability of the performance advantage of the proposed model.
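The relative error reductions follow from the reported MAE values as (baseline MAE − proposed MAE) / baseline MAE. The short sketch below recomputes them from the figures quoted in the text; the variable and function names are hypothetical. The recomputed values agree with the reported percentages to within 0.1 percentage point, and the absolute gap to the strongest baseline (TS-CNN-Attention) is 2.5704.

```python
proposed_mae = 4.3478
baseline_mae = {
    "ARIMA": 10.6463,
    "N-BEATS": 7.6895,
    "TFT": 8.2432,
    "TS-CNN-Attention": 6.9182,
}

def relative_reduction(baseline, proposed):
    """Percentage reduction in MAE relative to the baseline."""
    return 100.0 * (baseline - proposed) / baseline

reductions = {name: relative_reduction(value, proposed_mae)
              for name, value in baseline_mae.items()}

# Absolute MAE gap to the best-performing baseline
best_baseline_gap = min(baseline_mae.values()) - proposed_mae

for name, value in reductions.items():
    print(f"{name}: {value:.2f}% lower MAE")
```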

From the perspective of model structure, the advantage of WindPower-SAFusion mainly comes from the precise aggregation of key time steps by the sparse attention mechanism and the multi-scale feature extraction capability of the multi-layer distillation encoder. In contrast, N-BEATS has limitations in modeling long-sequence dependencies. TFT has limited ability to represent feature interactions in complex scenarios, whereas TS-CNN-Attention focuses more on economic loss weighting, but its overall forecasting accuracy remains inferior to that of the proposed model. Overall, WindPower-SAFusion demonstrates clear advantages in forecasting accuracy, stability, and structural adaptability.

3.5. Ablation Study

3.5.1. Ablation Study Design

To evaluate the effectiveness of each core module in WindPower-SAFusion, a controlled ablation study was designed in this research. The study focuses on the independent contributions of ProbSparse attention, the convolutional feed-forward network, and the distillation mechanism. Three model variants were constructed. Variant A removes the convolutional feed-forward network to examine the local feature extraction capability. Variant B removes the ProbSparse attention mechanism to evaluate its role in selecting key time steps in long sequences. Variant C removes the distillation mechanism to analyze the influence of sequence compression and multi-scale feature extraction. The evaluation metrics are consistent with those used in the main experiment, namely MAE and $R^{2}$, and significance tests were conducted to verify the reliability of the differences.

3.5.2. Ablation Study Result Analysis

Table 2 presents the performance changes after removing different modules. The results show that all three core components play important roles in improving the overall forecasting accuracy.

Specifically, removing the convolutional feed-forward network, namely Variant A, leads to a clear increase in MAE and a decrease in $R^{2}$. This indicates that the convolutional structure is crucial for extracting local temporal dependencies and characterizing rapid fluctuations. Removing the ProbSparse attention mechanism, namely Variant B, results in the most significant performance degradation, with the largest increase in MAE. This demonstrates that the mechanism can effectively focus on key time steps in long sequences and reduce noise interference, making it a core factor for achieving high forecasting accuracy. Although removing the distillation mechanism, namely Variant C, does not lead to an extreme decline in accuracy, the MAE still worsens and the $R^{2}$ value fluctuates slightly. This suggests that the distillation mechanism plays a positive role in reducing redundant information, strengthening trend features, and preventing overfitting, while also improving computational efficiency.

In summary, the optimal performance of WindPower-SAFusion is achieved through the collaborative effects of the three modules. The sparse attention mechanism is responsible for selecting key time steps, the convolutional feed-forward network extracts local temporal patterns, and the distillation mechanism provides multi-scale feature compression and regularization. The absence of any one module affects the stability and accuracy of the model, which verifies the rationality and necessity of the proposed design.

Taking Figure 3 as an example, different model variants are used to forecast the same 3-day time period. From top to bottom, the figure presents the forecasting results of the original model, the model without the convolutional feed-forward network, the model without the ProbSparse attention mechanism, and the model without the distillation mechanism, respectively. It can be observed that the original model achieves the best fitting performance for the continuous 3-day data, with the predicted values almost overlapping the true values. After removing the convolutional feed-forward network, the overall predicted values become lower, and the fitting performance decreases significantly. When the ProbSparse attention mechanism is removed, the predicted values show an overall upward bias, indicating that the forecasting performance is also affected. After removing the distillation mechanism, the forecasting performance decreases slightly but not significantly. However, the model without the distillation mechanism exhibits a slower running speed and a significantly longer running time.

3.6. Evaluation of Practical Application Value

From the perspective of practical application, WindPower-SAFusion can meet the operational requirements of wind farms at different time scales. Short-term forecasting can support power control and electricity market trading. Medium-term forecasting is stable and reliable, which is helpful for generation planning and maintenance scheduling. Long-term forecasting can provide trend information and support risk assessment. In wind power forecasting, every 1 MW reduction in MAE has practical economic value. Compared with the best baseline model, WindPower-SAFusion reduces MAE by approximately 2.5704 MW, indicating its potential to reduce dispatching costs and improve wind power accommodation capacity.

Meanwhile, while maintaining high forecasting accuracy, the proposed model also achieves good computational efficiency and satisfies engineering deployment requirements. Overall, the design of sparse attention, multi-scale distillation encoding, and multivariate fusion jointly improves forecasting performance and efficiency, making WindPower-SAFusion a technically feasible solution for wind power forecasting in engineering applications.

4. Conclusions

This paper proposes WindPower-SAFusion, a long-sequence wind power forecasting model. By introducing ProbSparse attention, the proposed model effectively reduces the computational complexity of long-sequence modeling. A convolutional distillation encoder is used to extract multi-scale temporal features, and a multivariate fusion mechanism is further incorporated to integrate historical power with key meteorological factors such as wind speed. As a result, efficient and high-accuracy wind power forecasting is achieved.

Extensive experimental results show that the proposed model significantly outperforms existing mainstream methods in short-term, medium-term, and longer-term forecasting tasks. It demonstrates stronger stability and generalization capability, especially under complex wind conditions and rapid fluctuation scenarios. These advantages indicate that WindPower-SAFusion can better capture key temporal dependencies and multi-scale variation patterns in wind power sequences.

Future work will further investigate model lightweighting, edge deployment under resource-constrained conditions, and cross-scenario transferability across different meteorological conditions and wind farm regions. These studies are expected to further improve the usability and application value of the proposed model in practical wind power dispatching and operation management.

Data Availability Statement

The data used in this study are not publicly available due to confidentiality restrictions of the wind farm operator. The data may be available from the corresponding author upon reasonable request and with permission from the data provider.

References

1. Zhang, B.; Chen, J.; Wu, W. Review of key technologies for large-scale wind power integration into power systems. Autom. Electr. Power Syst. 2019, 43, 2–15.
2. Lee, K.; Park, B.; Kim, J.; et al. Day-ahead wind power forecasting based on feature extraction integrating vertical layer wind characteristics in complex terrain. Energy 2024, 288, 129713.
3. Zhao, P.; Wang, J.; Xia, J.; et al. Performance evaluation and accuracy enhancement of a day-ahead wind power forecasting system in China. Renew. Energy 2012, 43, 234–241.
4. Alessandrini, S.; Sperati, S.; Pinson, P. A comparison between the ECMWF and COSMO Ensemble Prediction Systems applied to short-term wind power forecasting on real data. Appl. Energy 2013, 107, 271–280.
5. Ignatev, E.; Deriugina, G.; Suslov, K.; et al. Development of a hybrid model for medium-term wind farm power output forecasting. Renew. Energy 2025, 249, 123200.
6. Peña, A.; Mirocha, J.D. One-year-long turbulence measurements and modeling using large-eddy simulation domains in the Weather Research and Forecasting model. Appl. Energy 2024, 363, 123069.
7. Cheng, R.; Yang, D.; Liu, D.; et al. A reconstruction-based secondary decomposition-ensemble framework for wind power forecasting. Energy 2024, 308, 132895.
8. Chen, Y.; Lin, C.; Zhang, Y.; et al. Proactive failure warning for wind power forecast models based on volatility indicators analysis. Energy 2024, 305, 132310.
9. Malhan, P.; Mittal, M. A novel ensemble model for long-term forecasting of wind and hydro power generation. Energy Convers. Manag. 2022, 251, 114983.
10. Cadenas, E.; Jaramillo, O.A.; Rivera, W. Analysis and forecasting of wind velocity in Chetumal, Quintana Roo, using the single exponential smoothing method. Renew. Energy 2010, 35, 925–930.
11. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control, 5th ed.; John Wiley & Sons: New York, NY, USA, 2015; pp. 12–45.
12. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
13. Wu, Q.; Peng, C. A least squares support vector machine optimized by cloud-based evolutionary algorithm for wind power generation prediction. Energies 2016, 9, 585.
14. Tu, C.S.; Hong, C.M.; Huang, H.S.; Chen, C.H. Short term wind power prediction based on data regression and enhanced support vector machine. Energies 2020, 13, 6319.
15. Jamii, J.; Trabelsi, M.; Mansouri, M.; Kouadri, A.; Mimouni, M.F.; Nounou, M. Medium-term wind power forecasting using reduced principal component analysis based random forest model. Wind Eng. 2024, 48, 597–616.
16. Jursa, R.; Rohrig, K. Short-term wind power forecasting using evolutionary algorithms for the automated specification of artificial intelligence models. Int. J. Forecast. 2008, 24, 694–709.
17. Liu, H.; Zhang, Z. Development and trending of deep learning methods for wind power predictions. Artif. Intell. Rev. 2024, 57, 112.
18. Pinson, P. Wind energy: Forecasting challenges for its operational management. Stat. Sci. 2013, 28, 564–585.
19. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
20. Long, H.; Li, K.X.; Wu, Y.H.; et al. A hybrid model for ultra-short-term wind power forecasting based on TCN and physics-informed network. Energy 2025, 318, 138678.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; et al. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008.
22. Zhou, H.; Zhang, S.; Peng, J.; et al. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; AAAI Press: Menlo Park, CA, USA, 2021; Volume 35, pp. 11106–11115.
23. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509.
24. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271.
25. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
26. Bergmeir, C.; Hyndman, R.J.; Koo, B. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Comput. Stat. Data Anal. 2018, 120, 70–83.
27. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
28. Lim, B.; Arik, S.O.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764.

Figure 1. Framework of the proposed WindPower-SAFusion model. The model integrates multivariate input embedding, ProbSparse attention, convolutional feed-forward feature extraction, distillation compression, and a decoder-based recursive multi-step forecasting strategy to generate wind power predictions. The final results are evaluated using MAE and $R^{2}$ and visualized through multi-horizon time-series plots.

Figure 2. Multi-horizon wind power forecasting results of WindPower-SAFusion. The blue solid lines denote the observed power values, and the red dashed lines denote the predicted values. From top to bottom, the subplots correspond to the 1-day, 3-day, and 7-day forecasting horizons, respectively.

Figure 3. Ablation study comparison of WindPower-SAFusion. From top to bottom, the subplots show the forecasting results of the complete model, Variant A without the convolutional feed-forward network, Variant B without ProbSparse attention, and Variant C without distillation compression. The blue solid lines denote observed wind power, and the red dashed lines denote predicted wind power.

Table 1. Comparative experimental results.

Model	MAE	$R^{2}$	p-value	Significance
ARIMA	10.6463	0.4544	< 0.001	***
N-BEATS	7.6895	0.6464	0.0008	***
TFT	8.2432	0.6882	0.0011	**
TCA	6.9182	0.6325	0.0006	***
WindPower-SAFusion	4.3478	0.8977	–	–
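
The gaps in Table 1 can also be read as relative MAE reductions. The following is a minimal sketch that only reuses the MAE values reported in the table; the function name `relative_mae_reduction` and the dictionary layout are illustrative assumptions and are not taken from the original study.

```python
# MAE values copied from Table 1 (units as reported in the paper).
baseline_mae = {
    "ARIMA": 10.6463,
    "N-BEATS": 7.6895,
    "TFT": 8.2432,
    "TCA": 6.9182,
}
proposed_mae = 4.3478  # WindPower-SAFusion row of Table 1


def relative_mae_reduction(baseline: float, proposed: float) -> float:
    """Return the fractional MAE reduction of `proposed` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline - proposed) / baseline


# Hypothetical helper mapping: baseline name -> fractional MAE reduction.
reductions = {
    name: relative_mae_reduction(mae, proposed_mae)
    for name, mae in baseline_mae.items()
}

for name, reduction in reductions.items():
    print(f"{name}: MAE reduced by {reduction:.1%}")
```

Under these values, the reduction ranges from roughly 37% (against TCA) to roughly 59% (against ARIMA). This is only a restatement of the reported numbers, not an additional experiment.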

Table 2. Ablation study results.

Model	MAE	$R^{2}$
WindPower-SAFusion	4.5610	0.8923
Variant A	6.8128	0.7187
Variant B	7.2439	0.7989
Variant C	4.9815	0.8832

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
