2.2. Machine Learning
Machine Learning (ML) [25,26] is a branch of artificial intelligence (AI). It originated in the 1950s in Hanover [27] and focuses on systems that perform tasks which, to an external observer, would appear to be exclusively within the domain of human intelligence. ML emerged as a subset of AI following various schools of thought that defined an "intelligent" system as one capable of learning from experience and improving its performance [28]. This distinctive trait sets ML algorithms apart from conventional computer programming, as they can operate even under conditions for which they have not been explicitly programmed.
The main objective of testing the physical model with different configurations (frequency, load, number of power augmenters, and power augmenter distance) is to maximize the power output of the turbine: for the same input power (fan), the aim is to maximize the output power generated by the turbine. As mentioned earlier, one of the fluid parameters measured by dedicated sensors during the various tests is the pressure difference (ΔP) created between the upstream and downstream stages of the turbine. The investigation aimed to identify which parameters, when varied within the different system configurations, had the greatest impact on the generation of ΔP. Understanding which parameters most significantly influence the value of ΔP would allow the setup to be modified coherently to maximize the turbine power output. To recap, the parameters varied during the different tests are the applied load, the presence/absence of power augmenters, the number of power augmenters, the flow feed frequency, and the distance between the power augmenters. Among these, the presence/absence of power augmenters and the number of power augmenters are categorical variables, whereas all the others are continuous variables.
Table 3 summarizes the setup variables along with a description and range of variation for each.
ΔP is evidently a continuous quantity, with a variability range of [−151.9606, 85.0614] Pa. This range is determined by computing the pressure difference between measurements taken downstream and upstream of the turbine; the negative sign therefore denotes instances where the downstream pressure exceeds the upstream one (reflecting possible changes in fluid flow direction). However, for power generation considerations, the absolute value of ΔP is paramount, given the self-aligning property of the Savonius turbine, which rotates unidirectionally regardless of fluid flow direction. Therefore, ΔP is evaluated in absolute terms, as power production is unaffected by fluid flow direction.
With the available dataset, derived from the measurements taken, it was decided to train a Machine Learning (ML) algorithm for two main reasons. First, as is well known, an ML algorithm is capable of learning the relationship between the features and the output even under conditions it has never encountered before, unlike traditional programming. Second, an interpretable ML model makes it possible to quantify the weight of each feature on the output, which is precisely what the investigation described above requires.
The dataset consists of 5 input variables (referred to as "features") and one output variable (ΔP), comprising 1044 instances, or observations. This dataset was then split into three distinct datasets, one for each of the three phases required to develop a robust ML model: training, validation, and testing. Approximately 10% of the dataset (100 instances selected randomly) was reserved for the testing phase, while the remaining data underwent 10-fold cross-validation.
Table 4 provides an overview of the datasets and the number of instances each contains.
This practice is employed to enhance the model's performance, especially when the dataset is not extensive. The remaining 90% of the dataset is divided into 10 sections, and ten training-and-validation sessions are conducted, each using 9 sections for training and 1 section for validation.
In each session, a different section is used for validation. During validation, the model's performance is evaluated and improved. The distinctive benefit of cross-validation is that it not only estimates the performance of the trained model but also provides a measure of how accurate its predictions are (through the standard deviation across folds) and how reproducible they are. Finally, the test dataset simulates a real application of the model, in order to observe its behavior on data never seen during training. The test results provide a measure of the model's goodness.
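A minimal sketch of this splitting scheme, assuming the measurements are collected in a 1044-row MATLAB table named data with a response column dP (variable names hypothetical, not the authors' code):

```matlab
% Sketch of the hold-out plus 10-fold cross-validation split described above.
rng(1);                                   % fix the seed for reproducibility
data.dP = abs(data.dP);                   % ΔP is evaluated in absolute terms

% Hold out roughly 10% (about 100 instances) for the final test phase
holdout  = cvpartition(height(data), 'HoldOut', 0.10);
trainVal = data(training(holdout), :);
testSet  = data(test(holdout), :);

% 10-fold cross-validation on the remaining ~90%
cv = cvpartition(height(trainVal), 'KFold', 10);
for k = 1:cv.NumTestSets
    foldTrain = trainVal(training(cv, k), :);   % 9 sections for training
    foldVal   = trainVal(test(cv, k), :);       % 1 section for validation
    % ... train a regression tree on foldTrain and evaluate on foldVal ...
end
```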
The choice of ML model fell on the decision tree, a predictive model based on a tree structure for decision-making. The advantages of this model for the problem addressed in this work will be discussed later.
A decision tree model is a structure composed of nodes and leaves. Each node represents a question or condition about a data attribute, and each leaf represents a class or output value, depending on whether it is a classification or regression problem.
At the top of the tree is the root node, which contains the entire training dataset. From here, the tree branches into child nodes, each representing a possible response to the question or condition posed by the parent node. Each child node is connected to its parent by a branch, indicating the hierarchical relationship between them. Descending along the tree, one encounters intermediate nodes, which continue to pose questions or conditions on the data based on the previous answers. This process continues until the tree's leaves are reached, where a final decision is made or a predictive output is provided.
Figure 8 shows a schematization of a decision tree.
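To make the traversal concrete, a hypothetical regression path for the present problem could look as follows (thresholds, features, and predicted values are invented purely for illustration):

```matlab
% Purely illustrative traversal of a small regression tree for this problem.
% All thresholds and leaf values are hypothetical.
function dP = predictDeltaP(frequency, nAugmenters)
    if frequency <= 0.8                 % root node: condition on a feature
        if nAugmenters == 0             % intermediate node
            dP = 12.5;                  % leaf: predicted ΔP in Pa
        else
            dP = 30.1;                  % leaf
        end
    else
        dP = 55.7;                      % leaf
    end
end
```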
The algorithm underlying the operation of decision trees is called Classification and Regression Tree (CART) and is used for both classification and regression problems. The problem addressed in this work falls within the realm of regression: the output variable is continuous (unlike in a classification problem, where the output is a class). Therefore, in this case, the model predicts continuous values, and the model's performance is assessed by measuring the error in estimating these values.
CART constructs a decision tree by splitting the dataset into two subsets of data according to a feature ($k$) and a certain threshold ($t_k$). The split is chosen by carefully evaluating the $k$–$t_k$ pairs that minimize a certain cost function. In the case of regression, the goal is not to determine a class but to obtain a value. The cost function to be minimized is based on the Mean Square Error (MSE) and is as follows [29]:

$$J(k, t_k) = \frac{m_{\mathrm{left}}}{m}\,\mathrm{MSE}_{\mathrm{left}} + \frac{m_{\mathrm{right}}}{m}\,\mathrm{MSE}_{\mathrm{right}}$$

Where:

$\mathrm{MSE}_{\mathrm{node}}$ is calculated as the average of the squares of the differences between the value predicted by the tree for the node ($\hat{y}_{\mathrm{node}}$, the mean of the targets in that node) and the corresponding actual values in the training data ($y^{(i)}$), i.e. $\mathrm{MSE}_{\mathrm{node}} = \frac{1}{m_{\mathrm{node}}} \sum_{i \in \mathrm{node}} \left(\hat{y}_{\mathrm{node}} - y^{(i)}\right)^2$. Minimizing the MSE during tree construction helps find optimal splits that reduce the overall prediction error.

$m$ is the total number of instances in the training dataset.

$m_{\mathrm{node}}$ represents the number of instances contained in the node of interest.

The numbers of instances to the right ($m_{\mathrm{right}}$) and left ($m_{\mathrm{left}}$) refer to the number of training samples ending up in the right and left subtree, respectively, during the tree splitting process.
This aspect is crucial because during the tree construction phase, the goal is to find splits that minimize predictive error while simultaneously avoiding overfitting. Therefore, the splits must be chosen to minimize the MSE and ensure that each subtree has a sufficient number of instances to make accurate predictions.
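To make the criterion concrete, the following sketch (an illustration, not the authors' implementation) evaluates the cost $J(k, t_k)$ of one candidate split in MATLAB, given a feature column x, targets y, and threshold t:

```matlab
% Illustrative evaluation of the CART regression split cost J(k, t_k)
% for one candidate feature column x, targets y, and threshold t.
function J = splitCost(x, y, t)
    left  = x <= t;                       % instances going to the left child
    right = ~left;                        % instances going to the right child
    m = numel(y);
    J = (nnz(left)/m)  * nodeMSE(y(left)) + ...
        (nnz(right)/m) * nodeMSE(y(right));
end

function mse = nodeMSE(yNode)
    if isempty(yNode)
        mse = 0;                          % an empty child contributes no error
    else
        yHat = mean(yNode);               % node prediction: mean of its targets
        mse  = mean((yHat - yNode).^2);   % average squared error in the node
    end
end
```

Evaluating this cost over all feature–threshold pairs and keeping the minimum is exactly how each split in the tree is chosen.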
Overfitting occurs when a model fits the training data too closely, capturing the noise present in the data rather than only the relevant patterns. In this case, the decision tree risks becoming too complex, with many splits and nodes, to the point of effectively memorizing the training data. As a result, the model may fail to generalize, i.e., it cannot make reliable predictions on new, unseen data.
The choice of a decision tree-based ML algorithm to address the problem proposed in this work is supported by a series of motivations outlined below.
This ML model is capable of working with mixed variables (continuous and categorical) without needing to convert the categorical ones into binary variables through a process called one-hot encoding. Moreover, decision trees can achieve good performance even with relatively small datasets (on the order of thousands of data points, as in the case of the dataset used in this work). The computational burden required for training a decision tree model is very low, since such models are quick to implement and do not require data normalization. The most important characteristic of decision trees, for the purposes of this work, is that they behave like white boxes. Unlike many other ML methods, such as neural networks, which behave like black boxes, a decision tree model is highly interpretable. It is therefore possible to understand how the model makes its predictions, which allows the most influential input variables on the output to be identified.
To analyze the model's performance, that is, to obtain a quantitative measure of the trained model's goodness, various metrics are calculated from the validation and test results. In particular:
MSE, as already defined, calculates the average of the squares of the differences between the model's predicted values and the actual values in the validation (or test) dataset. This metric is particularly sensitive to outliers.
Root Mean Square Error (RMSE) is the square root of the MSE and provides a measure of the average prediction error in the units of the output variable. It corresponds to the Euclidean norm [29]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}$$

where $\hat{y}^{(i)}$ is the predicted value and $y^{(i)}$ the corresponding actual value.
The coefficient $R^2$, also known as the coefficient of determination, provides a measure of how well the model fits the data. $R^2$ typically varies between 0 and 1 and represents the proportion of the variation in the output variable explained by the model; a value closer to 1 indicates a better fit.
Mean Absolute Error (MAE) calculates the average of the absolute differences between the predicted values and the actual values and is less sensitive to outliers than the MSE [29]:

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|\hat{y}^{(i)} - y^{(i)}\right|$$
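Assuming a vector of model predictions yHat and the corresponding measured values y (names hypothetical), these metrics could be computed in MATLAB as follows:

```matlab
% Sketch: computing the evaluation metrics from predictions (yHat) and
% actual values (y) of a validation or test set; variable names hypothetical.
err  = yHat - y;
MSE  = mean(err.^2);                            % sensitive to outliers
RMSE = sqrt(MSE);                               % in the units of ΔP (Pa)
MAE  = mean(abs(err));                          % less sensitive to outliers
R2   = 1 - sum(err.^2)/sum((y - mean(y)).^2);   % coefficient of determination
```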
The ML model was trained using the "Regression Learner" app of MATLAB (MathWorks, v. R2023b). The chosen algorithm type is "Optimizable Tree", which explores different combinations of hyperparameters to achieve the best performance of the model.
Hyperparameters are configuration settings external to the model that cannot be directly estimated from the data. They are set before the training process and govern the behavior of the learning algorithm. Unlike parameters, which are learned during training, hyperparameters are typically chosen based on heuristics, prior knowledge, or through a process of trial and error [30,31]. In particular, the hyperparameters used are the minimum leaf size, the maximum number of splits, and the minimum parent size. The minimum leaf size is the minimum number of observations per tree leaf; smaller values may lead to more complex trees, potentially prone to overfitting. The maximum number of splits is the maximum number of splits allowed in each tree, and controls the complexity of the tree. The minimum parent size is the minimum number of observations per tree parent; smaller values may lead to more splits and potentially more complex trees.
Regarding the interpretation of the results, with the aim of understanding the weight of each feature (or predictor) on the output, a predictor importance analysis was conducted. Predictor importance is evaluated by looking at how much the risk of a node changes when a split is made on that predictor. This change in risk is measured as the difference between the risk at the parent node and the combined risk at the two child nodes created by the split. For example, if the tree splits a parent node (say, node 1) into two child nodes (nodes 2 and 3), the importance of the predictor involved in this split is increased by:
$$\Delta I = \frac{R_{1} - R_{2} - R_{3}}{N_{b}}$$

where $R_{i}$ represents the node risk of node $i$, and $N_{b}$ is the total number of branch nodes. The node risk is determined by multiplying the node probability ($P_{i}$) by the related MSE:

$$R_{i} = P_{i} \cdot \mathrm{MSE}_{i}$$

in which the node probability ($P_{i}$) is the proportion of observations in the original dataset that meet the conditions for the node.
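As a sketch of how such an analysis can be run on the trained model (continuing the hypothetical naming used in the earlier sketches):

```matlab
% Sketch: ranking the setup variables by predictor importance.
imp = predictorImportance(tree);          % one importance value per feature

% Sort and display the features from most to least influential on ΔP
[impSorted, order] = sort(imp, 'descend');
names = tree.PredictorNames(order);
table(names(:), impSorted(:), 'VariableNames', {'Predictor', 'Importance'})
```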