SimulaD: A Novel Feature Selection Heuristics For Discrete Data

Conventional machine learning methods cannot be applied effectively to discrete big data with a limited range of values, because such data are cluttered: many data points from different classes overlap. In this paper we introduce a solution to this problem through a novel heuristic method. By applying a running average (with window size d), we transform discrete data into broad-range, continuous values. When there are more than two columns and one of them holds the classification tags (the Class Column), we can compare and sort the features (the non-class columns) by the R² coefficient of the regression on their running averages. Tuning the parameters helps us select the best features (the non-class columns that correlate best with the Class Column). Two parameters can be tuned: "Window size" and "Ordering". This optimization problem is hard, so we need an algorithm (or heuristic) to simplify the tuning. We present a novel heuristic, called Simulated Distillation (SimulaD), which yields reasonably good results for this optimization problem.


Introduction
There are numerous existing heuristics for feature selection [1]. Heuristics are based on human intuition for solving technical problems [2]. Here we present a novel heuristic for feature selection.
Conventional machine learning methods cannot be applied effectively to discrete big data with a limited range of values, because such data are cluttered: many data points from different classes overlap. In this paper, we overcome this problem by applying a novel feature selection heuristic. We use the moving average filter [3] and linear regression [4] to achieve this goal.

The Problem Description
Suppose that we have two columns of discrete data (Column A and Column B) about software quality from the users' perspective, with Likert-scale values [5] [6] [7]. We wish to find any correlation between these two columns. If the range of the discrete values is limited, the data points collapse onto a few locations, and we cannot form a sufficiently rich locus of points to run regression algorithms (see Figure 1).
We need a scheme that maps these limited-range, discrete values to broad-range, continuous values. This scheme must preserve the characteristics of the initial values that are needed to reveal any correlation between them.
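The collapse of points described above can be illustrated with a minimal Python sketch. The data are hypothetical (randomly generated 5-point Likert answers), and the helper name distinct_points is ours, introduced only for illustration. However many records are collected, two 5-point columns can occupy at most 5 × 5 = 25 distinct locations in the plane.

```python
import random


def distinct_points(col_a, col_b):
    """Number of distinct (a, b) pairs, i.e. visible locations in a scatter plot."""
    return len(set(zip(col_a, col_b)))


# Hypothetical data: 1000 users answer two 5-point Likert items.
rng = random.Random(0)
col_a = [rng.randint(1, 5) for _ in range(1000)]
col_b = [rng.randint(1, 5) for _ in range(1000)]

# 1000 records, but never more than 25 distinct locations.
print(len(col_a), "records,", distinct_points(col_a, col_b), "distinct points")
```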

The Proposed Method
By applying a running average [8] (with window size d), we can transform the data into broad-range, continuous values (see Figure 2). This can be considered a kind of continuous measurement of discrete data. We can then apply regression algorithms to investigate the inherent correlation between the two sets of values (see Figure 3). A real-world example is provided in Figure 4. Each point of the resulting continuous locus can be regarded as representing a micro-community with a population of d users.
Varying the window size (d) changes the regression coefficient R². For different datasets, we can plot different d-R² diagrams. The extremum points of these plots reflect an inherent characteristic of the dataset (see Figure 5).
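The transformation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: the data are hypothetical, the function names running_average and r_squared are ours, and R² is computed for a simple least-squares line, where it equals the squared Pearson correlation.

```python
import random


def running_average(values, d):
    """Mean of every window of d consecutive values (len(values) - d + 1 outputs)."""
    if not 1 <= d <= len(values):
        raise ValueError("window size d must satisfy 1 <= d <= len(values)")
    window_sum = sum(values[:d])
    out = [window_sum / d]
    for i in range(d, len(values)):
        window_sum += values[i] - values[i - d]  # slide the window by one record
        out.append(window_sum / d)
    return out


def r_squared(x, y):
    """R^2 of the least-squares line through (x, y): the squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy * sxy / (sxx * syy)


# Hypothetical data: 200 users, two 5-point Likert items, B tends to follow A.
rng = random.Random(0)
col_a = [rng.randint(1, 5) for _ in range(200)]
col_b = [min(5, max(1, a + rng.choice([-1, 0, 1]))) for a in col_a]

d = 10  # each smoothed point summarizes a micro-community of d users
smooth_a = running_average(col_a, d)
smooth_b = running_average(col_b, d)
print(len(set(zip(col_a, col_b))), "distinct raw points")
print(len(set(zip(smooth_a, smooth_b))), "distinct smoothed points")
print("R^2 on running averages:", round(r_squared(smooth_a, smooth_b), 3))
```

Each smoothed value is a multiple of 1/d, so the running averages can take many more distinct values than the raw Likert levels, which is what makes the regression step feasible.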
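A d-R² curve of this kind could be computed as in the following sketch. It is an illustration on hypothetical data; the helper r2_by_window and the chosen range of window sizes are assumptions made for the example.

```python
import random


def running_average(values, d):
    """Mean of every window of d consecutive values."""
    return [sum(values[i:i + d]) / d for i in range(len(values) - d + 1)]


def r_squared(x, y):
    """R^2 of the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def r2_by_window(col_a, col_b, window_sizes):
    """Map each window size d to the R^2 between the two running averages."""
    return {d: r_squared(running_average(col_a, d), running_average(col_b, d))
            for d in window_sizes}


# Hypothetical data: 300 users, two 5-point Likert items, B tends to follow A.
rng = random.Random(1)
col_a = [rng.randint(1, 5) for _ in range(300)]
col_b = [min(5, max(1, a + rng.choice([-1, 0, 1]))) for a in col_a]

curve = r2_by_window(col_a, col_b, range(2, 61))
best_d = max(curve, key=curve.get)  # one extremum of the d-R^2 diagram
print("window size with the highest R^2:", best_d, "R^2 =", round(curve[best_d], 3))
```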

Comparison with a Random Baseline
We can examine the level of correlation by comparing the R² coefficient in two settings: 1) the columns are filled with the running averages of the data under study, and 2) the columns are filled with the running averages of randomly generated baseline data.
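The comparison can be sketched as follows. The baseline generator is an assumption made for illustration (independent, uniformly distributed Likert answers), and random_baseline_r2 is a hypothetical helper. A baseline is informative because smoothing makes neighbouring points share most of their window, so even unrelated columns can show a nonzero R².

```python
import random


def running_average(values, d):
    """Mean of every window of d consecutive values."""
    return [sum(values[i:i + d]) / d for i in range(len(values) - d + 1)]


def r_squared(x, y):
    """R^2 of the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def random_baseline_r2(n, d, levels=5, trials=200, seed=0):
    """Mean R^2 between running averages of two independent random Likert columns."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.randint(1, levels) for _ in range(n)]
        b = [rng.randint(1, levels) for _ in range(n)]
        total += r_squared(running_average(a, d), running_average(b, d))
    return total / trials


# Hypothetical data under study: B tends to follow A.
rng = random.Random(2)
col_a = [rng.randint(1, 5) for _ in range(200)]
col_b = [min(5, max(1, a + rng.choice([-1, 0, 1]))) for a in col_a]

d = 10
observed = r_squared(running_average(col_a, d), running_average(col_b, d))
baseline = random_baseline_r2(len(col_a), d)
print("observed R^2:", round(observed, 3), "random baseline R^2:", round(baseline, 3))
```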

Ordering of the Data Affects the Correlation Results
Each ordering of the data yields a different R² coefficient. So, after the window size, the ordering is another parameter for tuning the results. An appropriate ordering can reveal the inherent correlation between two columns (albeit after applying the regression). A set of N data items has N! different orderings, so we cannot check every one. We therefore need an algorithm (or heuristic) to select an appropriate ordering of the data. As a simple one, we can average (or take the optimum of) the results for a number of randomly selected samples from the "Ordering Space" of the data.
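The simple sampling approach can be sketched as follows. This is an illustration on hypothetical data; the helper r2_over_orderings and the number of samples are assumptions. Note that whole records are permuted, so each (A, B) pair stays together and only the composition of the windows changes.

```python
import random


def running_average(values, d):
    """Mean of every window of d consecutive values."""
    return [sum(values[i:i + d]) / d for i in range(len(values) - d + 1)]


def r_squared(x, y):
    """R^2 of the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def r2_over_orderings(col_a, col_b, d, n_samples=100, seed=0):
    """R^2 of the running averages for n_samples random orderings of the records."""
    rng = random.Random(seed)
    rows = list(zip(col_a, col_b))  # keep each record's (A, B) pair together
    scores = []
    for _ in range(n_samples):
        rng.shuffle(rows)  # one random sample from the N! orderings
        a, b = zip(*rows)
        scores.append(r_squared(running_average(a, d), running_average(b, d)))
    return scores


# Hypothetical data: 200 users, two 5-point Likert items, B tends to follow A.
rng = random.Random(3)
col_a = [rng.randint(1, 5) for _ in range(200)]
col_b = [min(5, max(1, a + rng.choice([-1, 0, 1]))) for a in col_a]

scores = r2_over_orderings(col_a, col_b, d=10, n_samples=50)
print("mean R^2 over sampled orderings:", round(sum(scores) / len(scores), 3))
print("best R^2 over sampled orderings:", round(max(scores), 3))
```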

Feature Selection Method
When there are more than two columns and one of them holds the classification tags (the Class Column), we can compare and sort the features (the non-class columns) by the R² coefficient of the regression on their running averages.
Tuning the parameters helps us select the best features (the non-class columns that correlate best with the Class Column). Two parameters can be tuned: "Window size" and "Ordering".
Again, this optimization problem is hard, so we need an algorithm (or heuristic) to simplify the tuning. We present a novel heuristic, called Simulated Distillation (SimulaD), which yields reasonably good results for this optimization problem.
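The ranking step can be sketched as follows. This sketch does not implement SimulaD: as a stand-in for the tuning step it uses a plain random search over a few window sizes and randomly sampled orderings, and it scores each feature by the best R² found. The function rank_features, the feature names, and the data are hypothetical.

```python
import random


def running_average(values, d):
    """Mean of every window of d consecutive values."""
    return [sum(values[i:i + d]) / d for i in range(len(values) - d + 1)]


def r_squared(x, y):
    """R^2 of the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def rank_features(features, class_col, window_sizes, n_orderings=20, seed=0):
    """Rank features by the best R^2 found over sampled window sizes and orderings.

    Plain random search, used here only as a stand-in for the tuning heuristic.
    """
    rng = random.Random(seed)
    n = len(class_col)
    # The same sampled orderings are used for every feature, for a fair comparison.
    orderings = [rng.sample(range(n), n) for _ in range(n_orderings)]
    best = {}
    for name, column in features.items():
        best[name] = 0.0
        for order in orderings:
            x = [column[i] for i in order]
            y = [class_col[i] for i in order]
            for d in window_sizes:
                score = r_squared(running_average(x, d), running_average(y, d))
                best[name] = max(best[name], score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: a binary Class Column and two 5-point Likert features.
rng = random.Random(4)
class_col = [rng.randint(0, 1) for _ in range(120)]
features = {
    "related": [min(5, max(1, 2 + 2 * c + rng.choice([-1, 0, 1]))) for c in class_col],
    "unrelated": [rng.randint(1, 5) for _ in class_col],
}

for name, score in rank_features(features, class_col, window_sizes=[5, 10, 20]):
    print(name, round(score, 3))
```

Taking the maximum over the sampled settings is one possible choice; averaging over them, as mentioned for the ordering samples, is an equally simple alternative.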

Definition 2
We used the Iranian National Computing Grid services [1] (80 computing cores) to perform the computations.

Computational Complexity
For a dataset with N features and M records, the computational complexity of the proposed heuristic algorithm is O(k1 · k2 · k3 · N · M).

Discussion On Application Domain
When we want to study a number of quality variables of a system, we can use different methods of quality quantification: Likert-scale quality questionnaires [9], fuzzy logic [10] [11], totaling over system segments, statistical distribution approximation [12], continuous signal approximation from discrete samples [13], and so on.
Our proposed method, which is suitable for many systems and especially for complex socio-technical systems, is as follows: use Likert-scale quality questionnaires to collect discrete quantitative data about the system's quality variables, then convert these discrete data into continuous data suitable for machine learning (by applying the SimulaD algorithm to them).

The Advantage Over Other Methods
Our proposed algorithm uses the concept of a micro-community [5]. Each micro-community is like a field of probability around each of the records. So our approach in fact uses the mathematical concept of a "probability field", albeit conceptually rather than technically, instead of other mathematical objects (such as sets in fuzzy logic, signals in signal processing, totals in statistics, and so on). The stochastic nature of probability fields matches the "quality fluctuations" of the real world well. It could therefore be a suitable quantitative model for describing the quality variables of complex systems.
Focusing on micro-communities could reveal hidden topological structures in the data (for an example, see Figure 7). It could therefore be useful in topological data analysis.

Conclusion
By applying a running average (with window size d), we can transform the data into broad-range, continuous values. This can be considered a kind of continuous measurement of discrete data. We can compare and sort the features (the non-class columns) by the R² coefficient of the regression on their running averages. We have presented a novel heuristic, called Simulated Distillation (SimulaD), which yields reasonably good results for the optimization problem of "Window Size" and "Ordering".

Declarations
• Ethics approval and consent to participate: Not Applicable or Yes.
• Consent for publication: Yes.
• Availability of data and materials: Yes, in Data Availability section.
• Funding: Yes, in Funding section.
• Authors' contributions: The first author was involved in idea generation, text writing, data gathering and processing, chart preparation, and text editing. The second author was involved in ideation, guidance and supervision, evaluation, text editing, and research process management.
• Acknowledgements: Yes, in Acknowledgement section.

Acknowledgement
A preprint has previously been published [15]. Thanks to Dr. Alireza Talebpour and his fellow researchers for their comments.

Figures
Figure 1
Figure 2
Figure 3
Figure 4