Incremental Learning for Large Scale Churn Prediction

Modern companies accumulate vast amounts of customer data that can be used to create a personalized experience. Analyzing this data is difficult, and most business intelligence tools cannot cope with its volume. One example is churn prediction, which matters because retaining existing customers costs less than acquiring new ones. Several data mining and machine learning approaches can be used, but there is still little information about which algorithm settings to use when the dataset does not fit into the memory of a single computer. Because of the difficulties of applying feature selection techniques at a large scale, Incremental Probabilistic Component Analysis (IPCA) is proposed as a data preprocessing technique. We also present a new approach to large scale churn prediction problems based on the mini-batch Stochastic Gradient Descent (SGD) algorithm. Compared to other techniques, the new method facilitates training with large data volumes using a small memory footprint while achieving good prediction results.


Introduction
Data is one of the most important assets for any type of company, which is why large volumes of data are becoming more common every day. As a result, large scale data analysis and management are becoming increasingly complex. Churn prediction is one such example, where well-known data mining models can be used to predict whether a customer is likely to leave a company [1].
Whenever the signals that directly influence the churn phenomenon are not known (e.g. customer satisfaction, competitive market strategies, etc.), it becomes difficult to achieve accurate prediction results, even if large volumes of transactional data are available. Most business operations are recorded and can therefore be used to create features that relate customer behavior to the churn probability. However, because of the size of business datasets and the number of different operations that can be performed, preprocessing techniques such as feature selection and dimensionality reduction must be applied.
Principal Component Analysis (PCA) is a well-known technique for dimensionality reduction that projects the full dataset onto a subset of the eigenvectors of the covariance matrix [2]. The method is robust and can be used to reduce the dimensionality of high dimensional transactional data for the churn prediction problem; however, it is computationally expensive for large datasets. An incremental version of PCA (IPCA) was proposed in order to build the data projection sequentially, without an explicit pass over the whole dataset each time a new data point arrives [3]. Because IPCA is a variation of the original PCA that performs dimensionality reduction from partial subsets of the data, it can be used to extract low-dimensional features from monthly aggregated data [4].
Predictive analytics aims to identify an event before it takes place. To predict customer churn, data mining algorithms such as Support Vector Machines (SVM), Boosting and Decision Trees (DT) have traditionally been preferred over other modeling techniques [1,5]. The computational complexity of SVMs scales poorly with large datasets, but DTs combined with ensemble techniques such as boosting can deliver state-of-the-art prediction results [6][7][8]. Due to their weak structure, DTs are usually chosen over SVMs as base learners for boosting; however, the final learner is a weighted average of the base learners and therefore the model does not make efficient use of the full training dataset. On the other hand, Stochastic Gradient Descent (SGD) is an online technique that operates on a single data point at a time and thus does not limit the volume of the training data.
In this paper we present a complete data pipeline for large scale datasets using an incremental learning approach. The paper is organized as follows. We first review the methodology and the proposed algorithm in Section 3, and then the experimental evaluation is shown in Section 4. Finally, we conclude the paper in Section 6.

Data Mining for Churn Prediction
Data mining is defined as the process of obtaining insight and identifying new knowledge from databases. From a methodological point of view, the Cross-Industry Standard Process for Data Mining (CRISP-DM, see Figure 1) provides a common framework for delivering data mining projects. The CRISP-DM methodology includes the following five stages, which in the case of churn prediction can be summarized as:
• Business Understanding: The goal of this stage is to gain insight into the business goals. We are interested in predicting which active customers can potentially become inactive in the near future. In this case, the business goal is to use the transactional features of any month to decrease the number of customers that might leave the company in a 3-month period.
• Data Understanding: This stage consists of getting insight into the available data and how it relates to the business goals. In our case, we have customer transactional data from 25 months. The total number of records is 62,740,535, and each record contains 162 features. Table 1 summarizes the type and number of features for each customer record.
• Data Preparation: Data cleaning and preparation involve removing missing data and creating a binary label that indicates whether a customer will be inactive (not performing any transaction) over the next 3 months. For each customer ID, the transactional data of any given month is paired with the inactivity label for the following 3-month period. The resulting dataset consists of 25 sliding windows. Figure 2 shows the resulting number of records for each class, where class 1 represents customers that remain active and class 0 represents customers that become inactive.
• Modeling: Data modeling is the process of encapsulating the business and data knowledge into a model that explains the training data. This model is also used to predict unseen labels from a test set. Given the size and complexity of the dataset, in this paper we propose an incremental learning pipeline. The whole pipeline and the different algorithm settings are explained in Section 3.
• Evaluation: Performance evaluation is the last stage of the process and involves partitioning the dataset into one or more training and testing datasets. A performance criterion must also be defined in order to assess the validity of the overall results. These metrics and the full cross-validation results are shown in Section 5.
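The label construction described in the Data Preparation stage can be sketched with Pandas as follows. This is only an illustrative sketch: the column names customer_id, month (an integer month index) and n_transactions are hypothetical and do not come from the original dataset, and the helper add_inactivity_label is not part of the authors' code.

```python
import pandas as pd


def add_inactivity_label(df, horizon=3):
    """Pair each (customer, month) row with a label for the next `horizon` months.

    label = 1: the customer makes at least one transaction in the window.
    label = 0: the customer is inactive for the whole window.
    Column names are hypothetical, not taken from the original dataset.
    """
    # (customer, month) pairs in which the customer was active.
    active = df.loc[df["n_transactions"] > 0, ["customer_id", "month"]]

    # Activity in month m + k counts as future activity for month m.
    shifted = [active.assign(month=active["month"] - k)
               for k in range(1, horizon + 1)]
    future = pd.concat(shifted).drop_duplicates()
    future["label"] = 1

    out = df.merge(future, on=["customer_id", "month"], how="left")
    out["label"] = out["label"].fillna(0).astype(int)

    # Months whose look-ahead window is not fully observed cannot be labeled.
    last_month = df["month"].max()
    return out[out["month"] <= last_month - horizon].reset_index(drop=True)
```

Under these assumptions, each labeled month corresponds to one sliding window, and the last `horizon` months are dropped because their look-ahead period is not observed.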

Incremental Learning for Predictive Analytics
Obtaining highly informative variables that are neither correlated nor missing is difficult, especially in high dimensional settings. This is the case in most machine learning problems, so dimensionality reduction techniques such as PCA are used to ease the construction of such models. However, the cost of projecting the original features onto a lower dimensional manifold depends on the size of the dataset, and therefore PCA cannot be directly applied to large scale problems [9]. Although some implementations of distributed machine learning techniques for big data analytics exist (e.g. Apache Mahout/Spark), there is a communication overhead when using a computer cluster. Moreover, significant energy savings could be achieved by using an incremental learning strategy for medium-sized datasets on a single computer node [10].

Incremental PCA
Input data vectors may be very high dimensional but typically lie close to a manifold of much lower dimension, meaning that the distribution of the data is heavily constrained. PCA is a linear dimensionality reduction technique that can be obtained through a low-rank factorization such as the Singular Value Decomposition (SVD). Given a zero-mean input matrix $X \in \mathbb{R}^{n \times d}$, we can write the SVD as:

$$\hat{X} = U D V^{T}$$

where $\hat{X}$ is the optimal low-rank reconstruction of the original data matrix $X$, $U$ is an $n \times r$ matrix whose columns are the $r$ eigenvectors of $X X^{T}$, $V$ is a $d \times r$ matrix whose columns are the $r$ eigenvectors of $X^{T} X$, and $D$ is a square diagonal $r \times r$ matrix.
When large datasets with $0 \ll d \ll n$ are considered, the covariance matrix $X X^{T}$ becomes a computational bottleneck and might not fit into memory. However, it is possible to divide the entire dataset into blocks and process the blocks sequentially or in parallel [11]. This approach is also appealing for the churn prediction problem, where each block can represent monthly aggregated data and several blocks can be used for parameter estimation and for cross-validating results.
As the name suggests, Incremental Probabilistic Component Analysis (IPCA) incrementally learns a subspace representation of the data from partial subsets of the dataset [3]. The method requires specifying the number of eigenvalues used for the representation (the PCA components), but the complete data does not need to fit into memory. This is particularly important for big data analytics, when the full covariance matrix cannot be computed efficiently.
If we now write the input data matrix as the concatenation of two matrices, $X = [A \; B]^{T}$, we can first compute the SVD of $A$ to obtain a partial fit $\hat{A} = U D V^{T}$ and then use this result to compute $[\hat{A} \; B]^{T} = U' D' V'^{T}$. More technical details can be found in [3].
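The block-wise update can be illustrated with a short NumPy sketch. It is a simplified version written under stated assumptions, not the algorithm of [3]: the blocks are stacked as rows, the data is already zero-mean (the mean update is ignored), and the helper name update_svd is hypothetical. When the retained rank r equals the number of columns, the update reproduces the singular values of the full matrix, because the stacked matrix has the same Gram matrix as the full data.

```python
import numpy as np


def update_svd(D, Vt, B, r):
    """Update a rank-r SVD (singular values D, right vectors Vt) with new rows B."""
    # D V^T summarizes the rows seen so far; stack the new block underneath.
    stacked = np.vstack([D[:, None] * Vt, B])
    _, D_new, Vt_new = np.linalg.svd(stacked, full_matrices=False)
    return D_new[:r], Vt_new[:r]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)            # zero-mean input, as assumed in the text
A, B = X[:120], X[120:]        # two row blocks (e.g. two months of data)

_, D, Vt = np.linalg.svd(A, full_matrices=False)   # partial fit on A
D_inc, Vt_inc = update_svd(D, Vt, B, r=5)          # fold in block B
```

With r smaller than the number of columns the update becomes an approximation, which is the trade-off that keeps the memory footprint independent of the number of rows.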

Stochastic Gradient Descent
As in the dimensionality reduction step, we now need to consider algorithms with online learning capabilities. SGD can be used to perform one pass over a data block (mini-batch) and update the model according to the direction of the gradient [2]. The objective function $F(\cdot)$ with parameter $w \in \mathbb{R}^{r+1}$ is the sum of a loss function $J(w) = \frac{1}{b} \sum_{j=1}^{b} J(w; x_j, y_j)$ and a regularization term, where the parameter $b$ is the size of the data block. SGD performs partial updates to the unknown parameter $w$ using the following update rule:

$$w_{i+1} = w_i - \eta_i \nabla F(w_i)$$

where $\eta_i$ is a (possibly non-stationary) learning rate and $\nabla F(w_i)$ is the gradient of the empirical loss function. In the case of logistic regression classifiers, $J(w)$ becomes the negative log-likelihood of the binomial distribution; when the hinge loss is used instead, the model corresponds to a linear SVM, and $w_i$ must be scaled by a factor of $\min\{1, \frac{1}{\sqrt{\lambda} \, \|w_i\|}\}$.
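A minimal NumPy sketch of one such mini-batch update for the hinge loss is given below. It assumes an L2 regularization term $\frac{\lambda}{2}\|w\|^2$, a subgradient of the form $\lambda w - \frac{1}{b}\sum_{j: y_j w^{T} x_j < 1} y_j x_j$, labels in {-1, +1} and a learning rate $\eta_i = 1/(\lambda i)$. These choices follow the common Pegasos-style formulation and are assumptions, since the exact expressions are not given in the text. The function names are hypothetical.

```python
import numpy as np


def hinge_sgd_step(w, X, y, lam, eta):
    """One mini-batch step on the L2-regularized hinge loss; y must be in {-1, +1}."""
    violated = y * (X @ w) < 1                       # points inside the margin
    grad = lam * w - (y[violated][:, None] * X[violated]).sum(axis=0) / len(y)
    w = w - eta * grad
    norm = np.linalg.norm(w)
    if norm > 0:                                     # scaling step from the text
        w = w * min(1.0, 1.0 / (np.sqrt(lam) * norm))
    return w


def train_on_blocks(blocks, lam=0.01, epochs=50):
    """Run mini-batch SGD over a list of (X, y) blocks with eta_i = 1 / (lam * i)."""
    w = np.zeros(blocks[0][0].shape[1])
    i = 0
    for _ in range(epochs):
        for X, y in blocks:
            i += 1
            w = hinge_sgd_step(w, X, y, lam, eta=1.0 / (lam * i))
    return w
```

Only one block has to be held in memory at a time, and the scaling step keeps the norm of w bounded by $1/\sqrt{\lambda}$. To match $w \in \mathbb{R}^{r+1}$, a constant column can be appended to each block so that the last entry of w acts as the bias.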

Technical Equipment
The experiments were run on a machine with an Intel Xeon E5-2620 v2 processor, 1 TB of storage, two Intel Xeon Phi coprocessors (7120P and 5120P), 32 GB of RAM and the CentOS 7 operating system. The programming language was Python (https://www.python.org), Pandas (http://pandas.pydata.org) was used for data pre-processing, and all algorithms were implemented with the scikit-learn library [12].

Data
The dataset contains monthly customer transactions, and the churn label was calculated as inactivity in any of the given 19 months. The full dataset contains more than 150 predictors corresponding to different types of transactions, with a total size of 23 GB and a total number of n = data rows. Data pre-processing included data cleaning (removing null, missing and constant values), transformation (normalization, representing categorical variables as dummy indicator variables, standardization) and label calculation (pairing transactional data with a churn label 3 months ahead). As a result of the data processing task, a total of 19 labeled data files were produced.
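The cleaning and transformation steps listed above could look roughly like the following Pandas sketch for a single data block. The function name preprocess_block and the column handling are hypothetical, and the block is standardized with its own mean and standard deviation, which is an assumption made for brevity (a real pipeline would more likely reuse statistics estimated on the training blocks).

```python
import pandas as pd


def preprocess_block(df, categorical):
    """Clean and transform one block; `categorical` lists the categorical columns."""
    df = df.dropna()                                  # remove null / missing rows
    df = df.loc[:, df.nunique() > 1]                  # remove constant columns
    cats = [c for c in categorical if c in df.columns]
    df = pd.get_dummies(df, columns=cats, dtype=float)  # dummy indicator variables
    return (df - df.mean()) / df.std(ddof=0)          # standardization
```

Constant columns are dropped before standardization, so no column has zero variance when the division is applied.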

Evaluation
To evaluate the performance of the proposed approach we must adjust several design parameters according to an error criterion. Therefore, in order to validate the model parameters, a k-fold cross-validation scheme is applied over different configurations, whose performance is measured according to the accuracy $A$ and precision $P$ for classification:

$$A_i = \frac{|TP|_i + |TN|_i}{|TP|_i + |FP|_i + |TN|_i + |FN|_i}, \qquad P_i = \frac{|TP|_i}{|TP|_i + |FP|_i}$$

where $|TP|_i$ corresponds to the number of true positives, $|FP|_i$ the number of false positives, $|TN|_i$ the number of true negatives and $|FN|_i$ the number of false negatives in the validation set. Figure 3 shows the k-fold cross-validation scheme.
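For reference, accuracy and precision can be computed from the confusion counts of one validation fold as in the sketch below. The function name and the `positive` argument are hypothetical. Which class is treated as positive is an assumption that has to be fixed by the analyst, since the text does not state it.

```python
import numpy as np


def accuracy_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for one validation fold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Precision is undefined when nothing is predicted as positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return accuracy, precision
```

For example, with true labels [1, 1, 0, 0, 1] and predictions [1, 0, 0, 1, 1] the counts are 2 true positives, 1 false positive, 1 true negative and 1 false negative, giving an accuracy of 0.6 and a precision of 2/3.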

Dimensionality Reduction
Dimensionality reduction is first applied through IPCA over the $k - 1$ training data blocks. For each fold, IPCA produces two new datasets with reduced dimensionality. The validation block is projected with the projection matrix obtained from all other blocks in order to create the test dataset for the held-out block. The final number of predictors depends on the number of PCA components, which are the eigenvectors that capture most of the variation of the full dataset.
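One way to realize this step with scikit-learn is sketched below, using IncrementalPCA and its partial_fit method. The helper project_fold and the representation of the data as a list of in-memory arrays are assumptions made for illustration. In practice each block would be loaded from its own file.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def project_fold(blocks, held_out, n_components):
    """Fit IPCA on all blocks except `held_out`, then project every block."""
    ipca = IncrementalPCA(n_components=n_components)
    for i, X in enumerate(blocks):
        if i != held_out:
            ipca.partial_fit(X)          # only one block is processed at a time
    train = [ipca.transform(X) for i, X in enumerate(blocks) if i != held_out]
    test = ipca.transform(blocks[held_out])
    return train, test
```

Note that partial_fit requires each block to contain at least n_components rows.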

Classifier training
SGD uses a loss function J(w) to calculate the gradient of the error, so the classification results depend on this function. Typical choices are the Hinge loss, the logarithmic (Log) loss and the modified Huber loss. Moreover, the regularization term can use either the L1 or the L2 penalty, with different values for the α parameter.
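These options map onto the loss, penalty and alpha arguments of scikit-learn's SGDClassifier, which can be trained block by block with partial_fit. The sketch below is illustrative only: the helper train_incrementally is hypothetical and the default values are not the authors' final settings. The name of the logistic loss option differs between scikit-learn versions, so only "hinge" and "modified_huber" are used here.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def train_incrementally(blocks, loss="hinge", penalty="l2", alpha=0.01):
    """Train a linear classifier with SGD, one (X, y) block at a time."""
    clf = SGDClassifier(loss=loss, penalty=penalty, alpha=alpha, random_state=0)
    classes = np.array([0, 1])           # all labels must be declared up front
    for X, y in blocks:
        clf.partial_fit(X, y, classes=classes)
    return clf
```

Each call to partial_fit performs a single pass over the given block, so the memory footprint is bounded by the size of the largest block.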

Grid search cross-validation
The best configuration was determined through a grid search procedure, and the performance was measured with the k-fold cross-validation procedure. The following configurations were tested:
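A grid search of this kind, combined with the block-wise hold-out scheme, can be sketched as follows. The grid values in the usage note are hypothetical placeholders rather than the configurations tested in the paper, and the helper grid_search_blocks is not part of scikit-learn. Accuracy is used as the selection criterion for simplicity.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid


def grid_search_blocks(blocks, grid):
    """Return (best mean accuracy, best params) over leave-one-block-out folds."""
    results = []
    for params in ParameterGrid(grid):
        scores = []
        for held_out in range(len(blocks)):
            clf = SGDClassifier(random_state=0, **params)
            for i, (X, y) in enumerate(blocks):
                if i != held_out:        # train on every other block
                    clf.partial_fit(X, y, classes=np.array([0, 1]))
            X_val, y_val = blocks[held_out]
            scores.append(clf.score(X_val, y_val))
        results.append((float(np.mean(scores)), params))
    return max(results, key=lambda item: item[0])
```

A call such as grid_search_blocks(blocks, {"loss": ["hinge", "modified_huber"], "penalty": ["l1", "l2"], "alpha": [0.001, 0.01]}) would evaluate eight configurations, each with one model per held-out block.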

Conclusions
An incremental learning approach for large scale churn prediction has been presented. The proposed approach uses an incremental version of PCA that can efficiently compute the eigenvalues of large matrices, so it can be used when the dataset does not fit into memory. Furthermore, a fully incremental learning model was also delivered using SGD, which performs online training using data batches. The proposed approach was evaluated with several months of data: each month was used as a hold-out sample for validation, while the rest of the data was used to incrementally train the model. Although the model was built using a large sample, the overall results show an average success rate of 64% for the churn class, which is still lower than previously reported results using other learning techniques (such as non-linear SVM). Because of the difficulty of training more complex learning models at a large scale, it is still difficult to compare these results, so training and data pre-processing should be accelerated in order to allow feature engineering aimed at improving the classification rates.

Figure 3. k-fold cross-validation scheme. Each data block is randomly used as a held-out sample and the full statistical model is trained and fitted using all other data blocks.

Figure 4. Cross-validation results using different numbers of PCA components and model parameters. The Hinge loss function shows improved training accuracy when compared to the Log and Modified Huber loss functions. Also, training the model with the L2 penalty and the regularization term α = 0.01 delivers good performance in the validation set.

Table 1. Customer Features

Table 2. Parameter Setting