Customer Churn-Prevention Model

The strategy of any organization is based on the growth of its customer base, and one of its guiding principles is that selling a product to an existing customer is far more profitable than acquiring a new one. This approach, however, leaves room for improvement: it is usually entirely reactive, which gives the areas specialized in customer experience and recovery no opportunity to respond effectively, since the customer is already gone by the time of the intervention. This happens because the usual practice is a diagnostic analysis of customers who have stopped buying products or services over a defined period, commonly three (3) months. This thesis work challenges that way of facing the problem and proposes a complete solution that does not focus exclusively on churn prediction, as is usually done in the state of the art, but intervenes in the different interactions that can be carried out with customers. The aim is not only to prevent customer churn, but also to add the value of continuous improvement in sales processes and increased customer penetration, leading to a better customer experience and, consequently, greater customer loyalty. Dataset License: CC BY-NC


Summary
The strategy of any organization is based on the growth of its customer base, and one of its guiding principles is that selling a product to an existing customer is far more profitable than acquiring a new one. It is not surprising that companies pay close attention to the analysis and impact of churn on their business strategies. This approach, however, leaves room for improvement: it is usually entirely reactive, which gives the areas specialized in customer experience and recovery no opportunity to respond effectively, since the customer is already gone by the time of the intervention. This happens because the usual practice is a diagnostic analysis of customers who have stopped buying products or services over a defined period, commonly three (3) months. This paper challenges that way of facing the problem and proposes a complete solution that does not focus exclusively on churn prediction, as is usually done in the state of the art, but intervenes in the different interactions that can be carried out with customers. The aim is not only to prevent customer churn, but also to add the value of continuous improvement in sales processes and increased customer penetration, leading to a better customer experience and, consequently, greater customer loyalty.
The preventive solution to customer churn starts from the premise that the first step in preventing a customer from leaving an organization is to improve the customer experience and the customer relationship. To this end, we propose a series of new data tools for sales consultants. When they approach a completely new customer, they can present a value offer based on dynamic content built from the products most purchased by current customers. For customers who already have a business relationship, the tools present an analysis of products that may be interesting and possibly unknown to them, generated from the buying behavior of similar customers, which drives the penetration of articles and services among the end users of the business. This step strengthens the relationship with the customer through the inclusion of new items, directly reducing churn by deepening the relationship with the end customer.
As a second step, it is necessary to deepen the understanding of the customer: how can customers be analyzed from a point of view that has not yet been taken into account? This answer can be generated through an unsupervised approach. By implementing this type of algorithm, customer groupings are obtained. The first benefit is an improvement in marketing strategies, since strategies can be tailored to each of these populations specifically; the second is that business analysts can identify opportunities and risks in these customer groups based on this new layer of information.
Finally, artificial intelligence algorithms are used to predict customer churn, using as learning input the key characteristics of the customer, among which is the churn flag itself.
In case this kind of flag does not exist, an information model is proposed that analyzes the customer's purchasing behavior and, from these results, derives the churn flag.
With all of the above, this master's thesis in artificial intelligence aims to deliver an integrated solution that positively impacts the finances of an organization by directly decreasing customer loss and increasing both customer loyalty and the sales volume of the final customer.

Data Description (required)
For the evaluation of the preventive model, a series of premises have been selected to evaluate the consistency, results, and usefulness of the model with real data from an organization.
During this stage, data models were analyzed and evaluated against the following conditions: real customer transactionality, customer information without churn markers, and public-domain availability, in order to ensure that the model was not adjusted to the current experiments. Once these data were identified, the experiments and their results were developed in the order presented below. During the research, about twenty public-domain datasets focused on customer transactionality were evaluated against the following criteria:
- Data completeness: Data with an acceptable amount of null information for analysis, less than or equal to 5% of the total records.
- Churn flags: The data should not have churn marking, due to the need to test the marking model.
- Data volume: Information greater than 100,000 transactions.
- Data source: Real information, not pre-created by software vendors or from courses generated by universities.
Upon completion of this review, the public-domain data selected is the Brazilian E-Commerce Public Dataset by Olist. Retrieved June 10, 2019, from https://www.kaggle.com/olistbr/brazilian-ecommerce, and described as "This dataset was generously provided by Olist, the largest department store in the Brazilian markets. Olist connects small businesses throughout Brazil with seamless, single-contract channels. These merchants can sell their products through the Olist Store and ship directly to customers using Olist's logistics partners. See more on our website: www.olist.com."

Transactional data model
The data is divided into multiple datasets for better understanding and organization; the flow of transactional information through them is described below.

Customer Data
This dataset contains information about the customer and their location. It allows identifying unique customers in the order dataset and finding the order delivery location.
In the system, each order is assigned a unique customer_id. This means that the same customer will get different identifiers for different orders. The purpose of the customer_unique_id in the dataset is to allow you to identify customers who made repurchases in the store; otherwise, you would find that each order has a different customer associated with it. The information it contains is as follows:
- customer_id: Key to the order dataset. Each order has a unique customer_id.
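The customer_id / customer_unique_id distinction described above can be sketched as follows; this is an illustrative snippet with invented identifiers, not the paper's code.

```python
import pandas as pd

# Toy frame mimicking the Olist customers file: one customer_id per order,
# one customer_unique_id per real person (identifiers invented).
customers = pd.DataFrame({
    "customer_id":        ["a1", "b2", "c3"],   # one id per order
    "customer_unique_id": ["u1", "u1", "u2"],   # one id per person
})

# Orders per real person: u1 placed two orders under different customer_ids.
orders_per_person = customers.groupby("customer_unique_id")["customer_id"].nunique()
repeat_buyers = orders_per_person[orders_per_person > 1]
```

Without grouping on customer_unique_id, every order would appear to belong to a different customer, and repurchase behavior would be invisible.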

Multidimensional data model
This model complies with the characteristics of a star schema, which allows us to perform multidimensional analysis on the existing information, as well as to integrate, consolidate, and distribute into metrics all transactional data, thus generating a single model for the purposes of the current paper.

Customer Dimension
This dataset consolidates the single view of the customer, providing the essential information needed to analyze the customer through the characteristics associated with them.
The information it contains is the following:
- customer_id: Unique key for customers.
- customer_zip_code_prefix: First five digits of the customer's zip code.
- customer_city: Name of the customer's city.
- customer_state: State of the customer.

Geolocation Dimension
This dataset consolidates the single view of geography, in order to be able to perform geopositioning analysis of customers, their transactions and the products being purchased.
The information it contains is as follows:
- geolocation_zip_code_prefix: First five digits of the zip code.

Vendor Dimension
This dataset consolidates the single view of the seller, making it possible to carry out specialized analyses of the seller, their performance, location, and other analyses that may be developed from this view. The information it contains is the following:
- seller_id: Unique key for the sellers.
- seller_zip_code_prefix: First five digits of the seller's zip code.
- seller_city: Name of the seller's city.
- seller_state: State of the seller.

Time Dimension
This dataset specializes in the management and consolidation of time information. This dimension is a key axis when analyzing data over time.
The information it contains is as follows:

Transaction Fact
This dataset is the core of customer transactionality. It allows multidimensional analysis across product, customer, seller, geolocation, and time. This fact consolidates the information of the entire history of customer transactions.
The information it contains is the following:
- order_id: Unique key of the order.
- order_item_id: Order-product relationship.
- product_id: Unique key of the product.
- customer_id: Customer key.
- customer_zip_code_prefix: First five digits of the zip code.
- seller_id: Unique seller key.
- shipping_limit_date: Shows the seller's deadline for delivering the product.
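As a sketch of how such a fact table could be assembled, the following joins toy versions of the Olist order, order-item, and customer files on their keys; the frame contents are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the Olist transactional files (values invented).
orders      = pd.DataFrame({"order_id": ["o1"], "customer_id": ["c1"]})
order_items = pd.DataFrame({"order_id": ["o1"], "order_item_id": [1],
                            "product_id": ["p1"], "seller_id": ["s1"],
                            "price": [10.0]})
customers   = pd.DataFrame({"customer_id": ["c1"],
                            "customer_zip_code_prefix": ["01310"]})

# One row per order item, enriched with customer attributes: the grain of
# the transaction fact described above.
fact = (order_items
        .merge(orders, on="order_id")
        .merge(customers, on="customer_id"))
```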

Recommendation engine
The experiments developed with the recommendation engine are generated based on the normalization of the customer-product data and a reference dummy dataset.
- Normalized dataset: A customer-product pivot table is generated to identify which products have been consumed by which customers. Once this information is obtained, the data is normalized to make it comparable. This matrix contains customer, product, and normalized purchase-history information.
- Reference dataset: A purchase definition is made for the products associated with the customer, from which a stable frequency can be obtained, since it is arbitrarily defined that everyone buys a product. This dataset is used as a contrast with the normalized one to evaluate the results of the algorithm.
All the experiments have been defined with a subset of customers and a limit of five recommended items, so that the results can be compared and evaluated with the same criteria.
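A minimal sketch of the normalized dataset described above, assuming purchase counts are scaled per customer so that rows become comparable (the text does not specify the exact normalization, so per-row scaling is an assumption; the data is invented):

```python
import pandas as pd

# Toy transactions (invented identifiers).
tx = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2"],
    "product_id":  ["p1", "p1", "p2", "p2"],
})

# Customer-product pivot of purchase counts.
pivot = pd.crosstab(tx["customer_id"], tx["product_id"])

# Normalize each row so customers with different purchase volumes
# become comparable (each row sums to 1).
normalized = pivot.div(pivot.sum(axis=1), axis=0)
```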
The test customers are:
- 871766c5855e863f6eccc05f988b23cb
- eb28e67c4c0b83846050ddfb8a35d051
- 3818d81c6709e39d06b2738a8d3a2474

Content-based recommendation engine
This data tool is not tied to the individual user but to the products most purchased across all users, so its result should be consistent with the items with the highest number of sales among customers. As can be seen, the result is effectively aligned with the initial definition: all customers receive the same products, in this case, the five best sellers. With the reference dataset, the same result holds: all customers receive the same five best-selling products.
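The content-based behavior described above can be sketched as a popularity recommender; the data and names here are invented for illustration:

```python
import pandas as pd

# Toy purchase history (invented identifiers).
purchases = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c3", "c3", "c3"],
    "product_id":  ["p1", "p2", "p1", "p3", "p1", "p2", "p4"],
})

# Up to five best-selling products overall.
top5 = purchases["product_id"].value_counts().head(5).index.tolist()

# Every customer receives the same list, matching the result observed above.
recommendations = {c: top5 for c in purchases["customer_id"].unique()}
```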

Collaborative recommendation engine
This data tool is related to the user and the products they have purchased in their history. The approach presented is to identify how similar customers are, based on the products they have already purchased.
Unlike the previous one, this approach offers two ways of measuring the similarity between users: the cosine distance and the Pearson correlation. The closer the resulting value is to one (1), the more similar the customers are; the closer it is to zero (0), the less similar they are. Since these results are generated by specialized models, it is necessary to include quality metrics in order to define which is the best recommender. The metrics are precision, recall, and RMSE.
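The two similarity measures named above can be sketched on toy purchase vectors (the vectors are invented, not taken from the dataset):

```python
import numpy as np

# Two customers' purchase vectors over the same four products (invented).
u = np.array([1.0, 0.0, 2.0, 1.0])
v = np.array([1.0, 0.0, 1.0, 1.0])

# Cosine similarity: 1 means identical direction, 0 means no overlap.
cosine_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Pearson correlation between the two vectors.
pearson_sim = np.corrcoef(u, v)[0, 1]
```

Both values lie near 1 here, reflecting two customers with largely overlapping baskets.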
Before analyzing the results, it is necessary to contextualize the metrics used during this evaluation:
- Precision: Its function is to analyze whether the recommended results were actually used by the customers. For example, if 10 items are recommended and the customer buys only 3 of them, the precision is 30%, a value that in this context indicates a significant impact of the model.


- Recall: Analyzes whether the products purchased by the customer are related to the recommended ones. For example, if a customer buys 10 items and 2 of them were among the recommendations, the recall is 20%.
- RMSE: Measures the error of the recommended products; the lower the value of this indicator, the better the results.
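The three metrics can be computed as follows, using the same numbers as the examples above (the RMSE ratings are invented, since the text gives no numeric RMSE example):

```python
import math

# Precision: 10 items recommended, the customer bought 3 of them -> 0.30.
recommended = {"i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9", "i10"}
bought = {"i1", "i2", "i3"}
precision = len(recommended & bought) / len(recommended)

# Recall: the customer bought 10 items, 2 were among the recommendations -> 0.20.
purchased = {f"j{k}" for k in range(10)}
recommended2 = {"j0", "j1", "x1", "x2"}
recall = len(purchased & recommended2) / len(purchased)

# RMSE on invented predicted vs. actual ratings.
predicted = [4.0, 3.0]
actual = [5.0, 3.0]
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```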

Unsupervised Classification -Customer
The experiments have been designed with an unsupervised machine-learning algorithm, with the objective of generating a customer segmentation and thus providing the business with new information tools to support the strategies and monitoring of results that may be proposed.

Elbow Method
The first step is to define the number of clusters; we start by applying the elbow method.
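A minimal sketch of the elbow method on synthetic customer features (an illustration; the paper applies it to the real customer dimension):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with four well-separated groups (invented, for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 3, 6, 9)])

# Fit K-means for a range of k and record the inertia (within-cluster
# sum of squares); the "elbow" where the drop flattens suggests k.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
```

In practice the inertia values are plotted against k and the elbow is read off visually.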

Silhouette Method
This method is applied to verify that the clustering defined as k=4 is the best for the K-means algorithm. With this information, it can be concluded that four (4) clusters are the most suitable for grouping customers.
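A sketch of the silhouette check on synthetic data with four well-separated groups, which peaks at k=4 as in the experiment above (the data is invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (invented).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 3, 6, 9)])

# The silhouette score (closer to 1 is better) is compared across k;
# the k with the highest score is chosen.
scores = {}
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```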

Churn Prediction
Once all the customer retention strategies have been implemented, it is necessary to know which customers are potential losses, so that the loyalty teams can execute their strategies and avoid this foreseeable loss. This requires a two-pronged approach:
- The first is focused on identifying which customers have already definitively left the organization, using the customer's purchase behavior for this marking, since no such classification currently exists.
- As a next step, once the customers marked as churn have been identified, this input is used to train supervised algorithms that predict customer churn, and the algorithm that gives the best result is selected.

Customer Churn -Mark
Experiments are carried out to mark customers as lost, using the available information on purchases, frequency, seasonality, and transaction amounts. This is done with the RFM model, of which three components are used:
- Recency: The time that has passed since the last purchase, equal to the duration between a customer's first purchase and their last purchase. (Thus, if they have made only one purchase, the recency is 0.)
- Frequency: The number of repeat purchases the customer has made, that is, the count of days on which the customer made a purchase.
- T: The age of the customer in days, equal to the duration between a customer's first purchase and the end of the period.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 2 June 2021 doi:10.20944/preprints202106.0063.v1
Based on the definition of RFM, the following hypothesis for marking churn is applied:
- Recency: Given its focus on customer-seniority analysis, any value less than 1 can be marked as churn.
- Frequency: The customer is considered to have a purchase frequency of less than 1 in a month.
- T: From the value 400 onwards the values appear purple, meaning the customer has been practically inactive in the last year, since the measurement is in days.
By applying this hypothesis, it was found that 18% of the historical customer database is churn.
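The marking hypothesis can be sketched as follows, assuming the three conditions are combined with a logical AND and that the RFM components are derived from order dates as defined above (the combination rule, the column names, and the data are assumptions for illustration):

```python
import pandas as pd

# Toy order history (invented): u1 is a recent repeat buyer, u2 bought
# once two years ago.
orders = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u2"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-03-01", "2016-06-01"]),
})
today = pd.Timestamp("2018-06-01")

g = orders.groupby("customer_unique_id")["order_purchase_timestamp"]
rfm = pd.DataFrame({
    "recency":   (g.max() - g.min()).dt.days,  # first-to-last purchase, in days
    "frequency": g.nunique() - 1,              # repeat purchases
    "T":         (today - g.min()).dt.days,    # customer age in days
})

# Thresholds from the hypothesis above: recency < 1, frequency < 1, T > 400.
rfm["churn"] = (rfm["recency"] < 1) & (rfm["frequency"] < 1) & (rfm["T"] > 400)
```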

Variable selection -First Strategy
In these stages, with the customers marked as lost, we proceed to identify the relevant information variables, which allow us to perform a series of experiments and thus identify the best artificial intelligence algorithm for the prediction of customer loss.
Before running the AI algorithms, it is necessary to find the relevant features of the entire dataset. For this purpose, four (4) prioritization approaches are applied, which make it possible to select with greater confidence the most appropriate variables for the supervised learning algorithms. The techniques used for this feature prioritization are:
- Feature importance with Extra Trees
- Univariate selection
- Recursive feature elimination
- PCA
The features selected as most relevant are, in order:
1. seller_id: Appears in three results.
2. order_status: Appears in two results, in a ranking of 1st and 2nd.
3. DATE: Appears in two results, in a ranking of 1st and 3rd.
4. customer_zip_code_prefix: Appears in two results, in a ranking of 1st and 4th.
5. product_id: Appears in two results, in a ranking of 3rd and 4th.
6. price: Appears in two results, in a ranking of 3rd and 4th.
7. customer_city: Appears in two results, in a ranking of 3rd and 2nd.
8. customer_state: Appears in two results, in a ranking of 2nd and 4th.
As a result, a ranking is made with the criteria initially described, and it is concluded that the four key fields for this experiment are seller_id, order_status, DATE, and customer_zip_code_prefix.
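The four prioritization approaches named above can be sketched with scikit-learn on synthetic data (the tooling and data here are assumptions; the paper applies the methods to the encoded Olist features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded feature matrix and churn labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# 1) Feature importance with Extra Trees.
trees = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
top_trees = np.argsort(trees.feature_importances_)[::-1][:4]

# 2) Univariate selection (ANOVA F-test).
top_univ = SelectKBest(f_classif, k=4).fit(X, y).get_support(indices=True)

# 3) Recursive feature elimination.
top_rfe = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=4).fit(X, y).get_support(indices=True)

# 4) PCA: features with the largest loadings on the first component.
pca = PCA(n_components=4).fit(X)
top_pca = np.argsort(np.abs(pca.components_[0]))[::-1][:4]
```

The features returned by each method can then be ranked by how often they appear, as done in the text.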

Variable selection -Second strategy
In this experiment, we arbitrarily chose to eliminate the temporal information in order to have a reference against the previous experiment, and thus identify the changes in the prioritized variables.
The results are as follows:
- Feature importance with Extra Trees: identified the following features as relevant: price, seller_id, product_id, customer_state, customer_zip_code_prefix. The last three variables had the same value, which is why exactly four values are not given.
- Univariate selection: identified the following features as relevant: order_status, customer_state, customer_city, product_id.
- Recursive feature elimination: identified the following features as relevant: seller_id, order_status, customer_state, customer_city.
- PCA: identified the following features as relevant: customer_zip_code_prefix, customer_city, price, product_id.
From the values generated by the four methods, the four (4) most relevant variables are prioritized according to the number of times they were selected and the hierarchical position in which they were proposed. The most relevant features are, in order:
1. product_id: Appears in three results.
2. customer_state: Appears in three results.
3. seller_id: Appears in two results, in a ranking of 1st and 2nd.
4. order_status: Appears in two results, in a ranking of 1st and 2nd.
5. customer_zip_code_prefix: Appears in two results, in a ranking of 1st and 2nd.
6. price: Appears in two results, in a ranking of 1st and 3rd.
7. customer_city: Appears in two results, in a ranking of 3rd and 2nd.
As a result, a ranking is made with the criteria initially described, and it is concluded that the four key fields for this experiment are product_id, customer_state, seller_id, and order_status.
At the end of these two experiments, the subsets of information to be used in the supervised learning algorithms are defined:
- Group 1: seller_id, order_status, DATE, customer_zip_code_prefix.
- Group 2 (control 1): product_id, customer_state, seller_id, order_status.
- Group 3 (control 2): no prioritization.

Algorithms application
In this stage, experiments are carried out with each of the algorithms described below, using the datasets defined in the previous stage. The following algorithms are implemented:
- Random Forest
- Linear Regression
- Gradient Boosting Classifier
- Support Vector Machine
- Neural Networks
In summary, the algorithms with the best results are:
- Table 1: Random Forest - Gradient Boosting Classifier
- Table 2: Random Forest - Gradient Boosting Classifier
- Table 3: Random Forest - Gradient Boosting Classifier
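A bake-off between the two best-performing algorithms can be sketched as follows on synthetic stand-in data (the paper's real inputs are the prioritized Olist feature groups; data and split here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the feature groups with churn labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Train each candidate and record its held-out accuracy.
results = {}
for name, model in [
    ("random_forest", RandomForestClassifier(random_state=0)),
    ("gradient_boosting", GradientBoostingClassifier(random_state=0)),
]:
    model.fit(Xtr, ytr)
    results[name] = accuracy_score(yte, model.predict(Xte))
```

The same loop extends naturally to the remaining algorithms and the other feature groups.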

Methods (required)
For research purposes, a public dataset has been used so that the work can be reproduced and improved by other researchers. The following parameters were defined for the dataset used for this model:
- Data completeness: Data with an acceptable amount of null information for analysis, less than or equal to 5% of the total records.
- Churn flags: The data should not have churn marking, due to the need to test the marking model.
- Data volume: Information greater than 100,000 transactions.

Conclusions
- One of the most common mistakes in the data-management industry is to assume that applying artificial-intelligence algorithms is enough to solve any problem, and this is a totally false hypothesis. It was therefore essential to understand the churn problem beyond its end result, the loss of customers, and beyond how to avoid it through algorithms that predict this situation.
- A hybrid solution not only produces a working tool for organizations, but provides added value from the first moment, since it integrates technology, algorithms, and statistics and makes them available to organizations as part of data-driven business strategies.
- The ensemble methods generated the best churn-prediction values, unlike others found in the literature such as SVM. This can be understood in light of the binary classification required in this solution, the number of features, and the amount of data available for training and testing.
- For historical data models that have no customer-churn marking, the RFM data model allows the marking to be generated from customer purchase behavior. Its scope could be extended: beyond marking churn, it could proactively detect whether customers are about to churn.
- As a premise for the evolution of this work, it is necessary to define a robust business process that ensures the chaining of the three stages of churn prevention, so that it can be formalized and trained at different levels of the organization, ensuring that the entire solution is used properly and improves continuously.
- From the customer-experience point of view, it is essential to build a pleasant and easy-to-use visual environment for the solution. With existing technologies, this environment can be created through business-intelligence platforms such as Power BI or Tableau, the latter acting as the front end of the solution.
With this product, the sales force and the business can be directly impacted, turning it into a key tool to support them in preventing customer churn and improving customer loyalty.
- The solution must be designed for mobile devices, so that any salesperson can have all their knowledge tools prepared and ready on their cell phone.
- The cloud, as a fundamental part of the evolution of this project, can be applied from different angles. One of these is the deployment of a web service exposing the trained machine-learning model, which allows a salesperson, finance or marketing professional, or similar, to predict from a customer's characteristics whether that customer is likely to churn, and thus proactively initiate an intervention.
- The areas of opportunity from the data point of view are practically infinite; what is needed to advance is for this area to learn from the business and support it in generating new tools that improve the final customer experience and the internal processes of any organization.
- There are current technologies that allow programming languages such as Python to be embedded in databases, which would improve processing times and provide native integration between the two main pillars of this paper: the data and the application of artificial-intelligence algorithms.