DEBoost: A Python Library for Weighted Distance Ensembling in Machine Learning

In this paper, we introduce deboost, a Python library devoted to weighted distance ensembling of predictions for regression and classification tasks. It builds on the scikit-learn library for its default models and data-preprocessing functions. It offers flexible choices of models for the ensemble, as long as they implement the predict method, like the models available in scikit-learn. deboost is released under the MIT open-source license and can be downloaded from the Python Package Index (PyPI) at https://pypi.org/project/deboost. The source scripts are also available on a GitHub repository at https://github.com/weihao94/DEBoost.


Introduction
Ensemble learning usually refers to methods that combine several models to perform a prediction, in either classification or regression problems. In many cases, ensembles perform better than a single model. They also reduce the likelihood of selecting a model with poor performance [Dietterich (2000)]. In recent years, most of the research on ensemble learning has been done on classification problems, and its results are unfortunately not entirely applicable to regression problems [Mendes-Moreira et al. (2012)].
Some commonly used ensemble algorithms are bagging (bootstrap aggregating), boosting (a sequential ensemble in which models are trained on reweighted or resampled data and then combined, for example by weighted voting) and stacking (a combination of models via a meta-classifier or meta-regressor). There has been recent research on ensembling model predictions via spatial and statistical techniques such as Bayesian model averaging, geostatistical output perturbation and spatial Bayesian model averaging (a combination of the two) [Berrocal et al. (2007)]. Distance weighting measures on predictions have also been studied, for example the use of inverse distance weighting to improve predictions in one-dimensional time series analysis with singular spectrum analysis [Awichi and Müller (2013)]. In our deboost Python library, we utilize existing distance metrics to obtain weighted ensembles of model predictions for classification and regression tasks. The library also provides well-known regression and classification models as defaults, and users can configure the set of default models used in the ensemble.
In the subsequent sections, we first introduce the distance metrics available in the initial release of the library, describe the computations of the weighted ensemble, introduce the library and finally present experimental results on some publicly available datasets.

Distance Metrics
In the library's initial release, the available distance metrics for computing the ensembles of predictions in regression, and of prediction class probabilities in classification, are: Bray-Curtis, Canberra, Chebyshev, City Block (Manhattan), correlation, Cosine, Euclidean, Hamming and Jaccard-Needham dissimilarity. Also made available are the mean, the median and the Bhattacharyya statistical distance. We now formally define each of the available spatial and statistical distance metrics.
Suppose, in a regression context, that we have $m$ regression models $m_1, \dots, m_m \in M$, whose predictions are $n \times 1$ matrices $Y_1, \dots, Y_m$ respectively. Here, the $k$th observation of $Y_i$ is denoted $Y_{ik}$, where $k \in \{1, \dots, n\}$. Define $d(Y_i, Y_j)$ as the number of elements in $Y_i$ and $Y_j$ that differ at the same index. Also define $A_{11}, A_{01}, A_{10}, A_{00}$ respectively as the total number of attributes where $Y_i$ and $Y_j$ both contain the value 1, where the attribute of $Y_i$ is 0 and the attribute of $Y_j$ is 1, where the attribute of $Y_i$ is 1 and the attribute of $Y_j$ is 0, and where $Y_i$ and $Y_j$ both have the value 0. Then, between any $Y_i$ and $Y_j$ for $i \neq j$, the Bray-Curtis distance, Canberra distance, Chebyshev distance, City Block (Manhattan) distance, correlation distance, Cosine distance, Euclidean distance, Hamming distance and Jaccard-Needham dissimilarity are respectively:
$$d_{BC}(Y_i, Y_j) = \frac{\sum_{k=1}^{n} |Y_{ik} - Y_{jk}|}{\sum_{k=1}^{n} |Y_{ik} + Y_{jk}|}, \qquad d_{Can}(Y_i, Y_j) = \sum_{k=1}^{n} \frac{|Y_{ik} - Y_{jk}|}{|Y_{ik}| + |Y_{jk}|},$$
$$d_{Cheb}(Y_i, Y_j) = \max_{k} |Y_{ik} - Y_{jk}|, \qquad d_{CB}(Y_i, Y_j) = \sum_{k=1}^{n} |Y_{ik} - Y_{jk}|,$$
$$d_{corr}(Y_i, Y_j) = 1 - \frac{(Y_i - \bar{Y}_i) \cdot (Y_j - \bar{Y}_j)}{\|Y_i - \bar{Y}_i\|_2 \, \|Y_j - \bar{Y}_j\|_2}, \qquad d_{cos}(Y_i, Y_j) = 1 - \frac{Y_i \cdot Y_j}{\|Y_i\|_2 \, \|Y_j\|_2},$$
$$d_{E}(Y_i, Y_j) = \sqrt{\sum_{k=1}^{n} (Y_{ik} - Y_{jk})^2}, \qquad d_{H}(Y_i, Y_j) = \frac{d(Y_i, Y_j)}{n}, \qquad d_{J}(Y_i, Y_j) = \frac{A_{01} + A_{10}}{A_{01} + A_{10} + A_{11}}.$$
Next, we have the Bhattacharyya distance between $Y_i$ and $Y_j$ defined as
$$d_{B}(Y_i, Y_j) = -\ln\left( \sum_{k=1}^{2n} \sqrt{p(Y_k)\, q(Y_k)} \right),$$
where $2n$ is the total number of observations in $Y_i$ and $Y_j$ combined, $p(\cdot)$ and $q(\cdot)$ are the histogram probabilities of the distributions of $Y_i$ and $Y_j$ respectively, and $p(Y_k)$, $q(Y_k)$ are the histogram probabilities of the $k$th observation in the sequence of values, in ascending order, formed by concatenating the prediction arrays of $Y_i$ and $Y_j$, which we denote as $Y$.
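As an illustration of the spatial metrics above, the following sketch computes each of them between two hypothetical prediction vectors using SciPy, which the library relies on; the prediction values here are made up for demonstration.

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical model prediction vectors (n = 5 observations).
y_i = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_j = np.array([1.5, 2.5, 2.5, 4.0, 6.0])

metrics = {
    "braycurtis": distance.braycurtis(y_i, y_j),
    "canberra": distance.canberra(y_i, y_j),
    "chebyshev": distance.chebyshev(y_i, y_j),    # max |Y_ik - Y_jk|
    "cityblock": distance.cityblock(y_i, y_j),    # sum |Y_ik - Y_jk|
    "correlation": distance.correlation(y_i, y_j),
    "cosine": distance.cosine(y_i, y_j),
    "euclidean": distance.euclidean(y_i, y_j),
    "hamming": distance.hamming(y_i, y_j),        # fraction of differing entries
}
```

Note that SciPy's `hamming` returns the proportion of disagreeing entries, i.e. $d(Y_i, Y_j)/n$, matching the normalized definition above.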
Finally, we have the mean and median of the predictions, defined respectively as
$$\bar{Y} = \frac{1}{m} \sum_{i=1}^{m} Y_i \qquad \text{and} \qquad \tilde{Y} = m(Y_1, \dots, Y_m),$$
where $m(\cdot)$ is a function that finds the median of the values, applied elementwise across the models' predictions.
Note that for the task of classification, the outputs of the predictions are the class probabilities. The spatial and statistical distance metrics introduced above are applied in a similar fashion for the classification task, but to each class across the models.

Weighted Ensemble
In this section, we describe the process of obtaining weighted ensembles using the distance metrics introduced in the previous section. There are two types of weighted ensembles in the initial release of the library: an assignment of higher weights to model predictions with a smaller sum of distances to the other models' predictions, and conversely an assignment of smaller weights. The mean and median are excluded from weighted ensembling as they are not computed via distance similarity methods.
For each model $i$'s prediction $Y_i$, without loss of generality, suppose that a distance metric $d(\cdot)$ is used to compute the (spatial or statistical) distance between $Y_i$ and $Y_j$ (model $j$'s prediction) for all $j = 1, \dots, m$; denote the distance computed as $d_{ij}$. For each model $i$, $i = 1, \dots, m$, obtain the sum of distances $D_i$ of its predictions to those of all other models $j = 1, \dots, i-1, i+1, \dots, m$. This sum can be computed by
$$D_i = \sum_{j \neq i} d_{ij}.$$
At this juncture, there are two methods by which the weights can be assigned. The first method assigns a higher weight to model predictions with smaller $D_i$. The weight of model $i$'s prediction $Y_i$ is given by
$$w_i = \frac{1/D_i}{\sum_{j=1}^{m} 1/D_j},$$
and the ensembled prediction (for the regression case) is
$$\hat{Y} = \sum_{i=1}^{m} w_i Y_i.$$
The second method, on the other hand, assigns lower weights to model predictions with smaller $D_i$. The weight of model $i$'s prediction $Y_i$ for this method is thus
$$w_i = \frac{D_i}{\sum_{j=1}^{m} D_j},$$
and the ensembled prediction (for the regression case) is again $\hat{Y} = \sum_{i=1}^{m} w_i Y_i$. The weighted distance ensemble for the classification task is carried out similarly, except that in place of the distances $D_i$ we have $D_{ic}$ for each class $c = 1, \dots, C$, i.e. the sum of distances of model $i$'s class-$c$ prediction probabilities to all the other models' class-$c$ prediction probabilities.
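The two weighting schemes can be sketched in a few lines of NumPy. This is an illustrative toy example, not the library's internal code: three made-up prediction vectors, one deliberately far from the other two, with Euclidean distance as the metric.

```python
import numpy as np
from scipy.spatial import distance

# Predictions from m = 3 hypothetical models on n = 4 test points.
preds = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [5.0, 6.0, 7.0, 8.0],   # an outlier model, far from the other two
])
m = preds.shape[0]

# D_i: sum of distances from model i's predictions to all other models'.
D = np.array([
    sum(distance.euclidean(preds[i], preds[j]) for j in range(m) if j != i)
    for i in range(m)
])

# Method 1: smaller total distance -> higher weight.
w_high = (1.0 / D) / np.sum(1.0 / D)

# Method 2: smaller total distance -> lower weight.
w_low = D / np.sum(D)

# Weighted ensemble prediction (regression case), using method 1.
y_hat = w_high @ preds
```

Under method 1, the outlier model receives the smallest weight, since its predictions have the largest summed distance to the rest of the ensemble.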

The deboost Library
The library utilizes SciPy [Virtanen et al. (2020)] for computing the spatial distances and Scikit-learn [Pedregosa et al. (2011)] for its models and evaluation metrics. The code for computing the Bhattacharyya distance was taken from Eric P. Williamson's GitHub repository at https://github.com/EricPWilliamson/bhattacharyya-distance. In the initial release of the library, only the continuous distribution method for computing the Bhattacharyya distance was made available.
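For concreteness, a minimal histogram-based Bhattacharyya distance can be written as follows. This is our own illustrative sketch of the general technique, not the code from the repository cited above: both arrays are binned over their combined range, and the distance is the negative log of the Bhattacharyya coefficient.

```python
import numpy as np

def bhattacharyya(y_i, y_j, bins=10):
    """Histogram-based Bhattacharyya distance between two prediction arrays.

    Bins both arrays over their combined range, then returns
    -ln(sum_k sqrt(p_k * q_k)), the negative log Bhattacharyya coefficient.
    """
    combined = np.concatenate([y_i, y_j])
    edges = np.histogram_bin_edges(combined, bins=bins)
    p, _ = np.histogram(y_i, bins=edges)
    q, _ = np.histogram(y_j, bins=edges)
    p = p / p.sum()   # histogram probabilities of Y_i
    q = q / q.sum()   # histogram probabilities of Y_j
    bc = np.sum(np.sqrt(p * q))   # Bhattacharyya coefficient in (0, 1]
    return -np.log(bc)
```

Identical distributions give a coefficient of 1 and hence a distance of 0; the distance grows as the two histograms diverge.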
The models available as defaults in the program for regression are Ridge, Lasso, Elastic net, AdaBoost Regressor, Gradient Boosting Regressor, Random Forest Regressor, Support Vector Machine Regressor, LightGBM Regressor and XGBoost Regressor. For the classification task, the models are AdaBoost Classifier, Gradient Boosting Classifier, Gaussian Naive Bayes, K-Nearest Neighbors Classifier, Logistic Regression, Random Forest Classifier, Support Vector Machine Classifier, Decision Tree Classifier, LightGBM Classifier and XGBoost Classifier. These are also the models that had their ensembles evaluated in our experiments with a select few datasets.
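An end-to-end sketch of the overall workflow is shown below, using a small subset of the scikit-learn models named above on synthetic data. This is an illustration of the ensembling procedure with generic estimators, not the deboost API itself; any estimator exposing a predict method could take the place of the models chosen here.

```python
import numpy as np
from scipy.spatial import distance
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data, split into train and test sets.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Any estimator with a predict method can join the ensemble.
models = [Ridge(), Lasso(), RandomForestRegressor(n_estimators=50, random_state=0)]
preds = np.array([mdl.fit(X_tr, y_tr).predict(X_te) for mdl in models])

# Weighting: smaller summed Euclidean distance -> higher weight.
m = len(models)
D = np.array([
    sum(distance.euclidean(preds[i], preds[j]) for j in range(m) if j != i)
    for i in range(m)
])
weights = (1.0 / D) / np.sum(1.0 / D)
ensemble_pred = weights @ preds
```

The ensemble prediction is simply the weight vector applied across the stacked per-model prediction matrix.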

Experimental Results
In our experiments, the datasets used for regression are the Boston housing prices dataset from Scikit-learn and the red/white wine quality datasets from the University of California, Irvine (UCI) Machine Learning Repository [Pedregosa et al. (2011), Cortez et al. (2009)]. The datasets used for classification are the aggregated Titanic dataset from Kaggle 1 , breast cancer 2 and heart disease 3 datasets from UCI's Machine Learning Repository [Dua and Graff (2017)]. As the objective of the experiments was to illustrate the performance gains from using the library's distance metrics to ensemble model predictions, the default hyperparameters of each model were used.
The results obtained in our experiments can be found in the Appendix, in Tables 1 & 2 for regression and classification respectively. The Mode column indicates which method of assigning weights to a model's prediction was used. If the value 'SDHW' is present, it means that the test assigned higher weights to predictions with smaller distances. The error metric used for regression is mean squared error (MSE); for classification it is classification accuracy (the accuracy score in Scikit-learn). For experiments excluding the mean and median as metrics, it can be observed that most test cases with 'SDHW' in the regression task have lower MSE than those without 'SDHW'. The results are similar for the classification task, though the difference is much smaller or in many cases negligible.

Future Work
The available distance metrics in the library in its initial release are by no means the only ones that can be used in the weighted ensemble. Over time, we will continuously update the library to contain more distance metrics and possibly include additional features that will be beneficial to its users.
Appendix A.