Customer Churn – Prevention and Prediction Model

The strategy of any organization is based on the growth of its customer base, and one of its principles is that selling a product to an existing customer is far more profitable than acquiring a new one. However, this approach has several opportunities for improvement, since it is usually entirely reactive: the areas specialized in customer experience and recovery have no opportunity to respond effectively at the right moment, because the customer is already gone by the time of the intervention. This happens because the usual practice is a diagnostic analysis of customers who have stopped buying products or services within a defined window, commonly three (3) periods or months. This paper challenges the way this problem is approached and proposes a complete solution that does not focus exclusively on churn prediction, as is common in state-of-the-art research, but intervenes in the different interactions that can be carried out with customers. The goal is not only to prevent customer churn, but also to add continuous-improvement value to sales processes and increase customer penetration, leading to a better customer experience and, consequently, greater customer loyalty.


Summary
The strategy of any organization is based on the growth of its customer base, and selling a product to an existing customer is far more profitable than acquiring a new one, so it is not surprising that companies pay close attention to the analysis and impact of churn on their business strategies. The usual treatment, however, is entirely reactive: a diagnostic analysis of customers who have stopped buying within a defined window, commonly three (3) periods or months, by which time the customer is already gone and the areas specialized in customer experience and recovery can no longer respond effectively.
The focus of this research is how different concepts and techniques from this sub-branch of artificial intelligence can support a mixed solution, integrating the data perspective with the business approach.
Customer data
This dataset contains information about the customer and their location. It makes it possible to identify unique customers in the order dataset and to find the order delivery location.
In the system, each order is assigned a unique customer_id, which means the same customer gets a different identifier for each order. The purpose of the customer_unique_id field is to allow the identification of customers who made repurchases in the store; otherwise, every order would appear to belong to a different customer. The information it contains is as follows:
• customer_id: Key to the order dataset. Each order has a unique customer_id.
• customer_unique_id: Unique identifier of the customer across orders.

Product data
This dataset includes data about the products sold by Olist. The information it contains is as follows:
• product_id: Unique identifier of the product.
• product_category_name: Category name in Portuguese.
• product_name_lenght: Number of characters in the product name.
• product_description_lenght: Number of characters in the product description.
• product_photos_qty: Number of published photos of the product.
• product_weight_g: Product weight measured in grams.
• product_length_cm: Product length measured in centimeters.
• product_height_cm: Product height measured in centimeters.
• product_width_cm: Product width measured in centimeters.
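The relationship between customer_id and customer_unique_id can be illustrated with a minimal pandas sketch. The toy rows below only mimic the schema described above; they are not taken from the actual Olist files.

```python
import pandas as pd

# Toy data mimicking the Olist schema: each order carries its own
# customer_id, while customer_unique_id identifies the actual person.
customers = pd.DataFrame({
    "customer_id":        ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
})
orders = pd.DataFrame({
    "order_id":    ["o1", "o2", "o3", "o4"],
    "customer_id": ["c1", "c2", "c3", "c4"],
})

# Join orders to customers so each order is tied to the real person.
merged = orders.merge(customers, on="customer_id", how="left")

# Customers with more than one order are the repurchasers.
orders_per_person = merged.groupby("customer_unique_id")["order_id"].count()
repurchasers = orders_per_person[orders_per_person > 1].index.tolist()
print(repurchasers)  # u1 placed two orders under different customer_id values
```

Without the join on customer_unique_id, every order would look like a one-time customer, which is exactly the pitfall the dataset documentation warns about.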

Vendor data

Category name translation data
This dataset translates the product_category_name into English. The information it contains is as follows:
• product_category_name: Category name in Portuguese.
• product_category_name_english: Name of the category in English.

Geolocation data
This dataset contains information on Brazilian postal codes and their lat/long coordinates. The information it contains is as follows:
• geolocation_zip_code_prefix: First 5 digits of the zip code.
• geolocation_lat: Latitude.
• geolocation_lng: Longitude.
• geolocation_city: City name.
• geolocation_state: State where the city is located.

2.1. Transactional data model

Experiments
Once all the customer retention strategies have been implemented, it is necessary to know which customers are potential losses, so that the loyalty teams can execute their strategies and avoid this announced loss. This requires a two-pronged approach:
• The first is focused on identifying which customers have already definitively left the organization, using the customer's purchase behavior for this marking, since no such classification currently exists.
• As a next step, once the customers marked as churn have been identified, this input is used to train supervised algorithms that predict the loss of these customers, and the algorithm with the best result is selected.
In parallel, the recommendations developed with the recommendation engine are generated based on the normalization of the customer-product data and a reference dummy dataset.

Customer Churn – Marking
Experiments will be carried out to mark customers as lost, using the information available on purchases: frequency, seasonality and transaction amounts. This is done with the RFM model, of which three components will be used:
• Recency: The time that has passed since the last purchase, equal to the duration between a customer's first purchase and their last purchase. (Thus, if they have made only one purchase, the recency is 0.)
• Frequency: The number of repeat purchases the customer has made, i.e., the count of days on which the customer made a purchase beyond the first.
• T: The age of the customer in days, equal to the duration between the customer's first purchase and the end of the observation period.
Based on this definition of RFM, the following hypothesis for marking churn is applied:
• Recency: Given its focus on customer-tenure analysis, any value below 1 can be marked as churn.
• Frequency: Customers with a purchase frequency of less than 1 per month are considered churn.
• T: From roughly 400 days onwards the plotted values are purple, meaning the customer has been practically inactive over the last year, since the measurement is in days.
By applying this hypothesis, 18% of the historical customer database was marked as churn. As can be seen, the result of the reference dummy recommender is effectively aligned with the initial definition that all customers would receive the same products, in this case the five best sellers.
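The RFM components and the churn hypothesis above can be sketched in pandas. This is a minimal illustration on toy transactions, not the authors' actual code; the inactivity rule (more than 400 days of silence, computed as T minus recency) is one interpretation of the T criterion described in the text.

```python
import pandas as pd

# Toy transaction log: one row per purchase (customer, date).
tx = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u1", "u2", "u3", "u3"],
    "order_purchase_timestamp": pd.to_datetime([
        "2018-01-05", "2018-03-10", "2018-06-20",   # u1: active buyer
        "2017-01-15",                               # u2: single old purchase
        "2017-02-01", "2017-02-20",                 # u3: two old purchases
    ]),
})
period_end = pd.Timestamp("2018-08-01")

g = tx.groupby("customer_unique_id")["order_purchase_timestamp"]
rfm = pd.DataFrame({
    # Recency: days between first and last purchase (0 for one-time buyers).
    "recency": (g.max() - g.min()).dt.days,
    # Frequency: repeat purchase days (distinct purchase days minus the first).
    "frequency": g.nunique() - 1,
    # T: customer age in days, from first purchase to the end of the period.
    "T": (period_end - g.min()).dt.days,
})

# Churn hypothesis from the text: recency below 1 day (one-time buyers),
# or roughly a year of silence since the last purchase (T - recency > 400).
rfm["churn"] = (rfm["recency"] < 1) | ((rfm["T"] - rfm["recency"]) > 400)
print(rfm)
```

On this toy data, u2 (single purchase) and u3 (long inactive) are flagged as churn, while u1 is not.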

Variable selection – First strategy
In this stage, with the customers marked as lost, we proceed to identify the relevant variables, which allow us to perform a series of experiments and thus identify the best artificial intelligence algorithm for predicting customer loss.
Before running the AI algorithms, it is necessary to find the relevant features of the entire dataset. For this purpose, four (4) prioritization approaches are applied, which allow the most appropriate variables for the supervised learning algorithms to be selected with greater confidence: feature importance with Extra Trees, univariate selection, Recursive Feature Elimination, and PCA.
The features selected as most relevant are, in order:
1. seller_id: Appears in three results.
2. order_status: Appears in two results, ranked 1st and 2nd.
3. DATE: Appears in two results, ranked 1st and 3rd.
4. customer_zip_code_prefix: Appears in two results, ranked 1st and 4th.
5. product_id: Appears in two results, ranked 3rd and 4th.
6. price: Appears in two results, ranked 3rd and 4th.
7. customer_city: Appears in two results, ranked 3rd and 2nd.
8. customer_state: Appears in two results, ranked 2nd and 4th.
Since the previous results are generated by a specialized model, it is necessary to include quality metrics in the results to define which is the best recommender. The metrics are precision, recall and RMSE.
As a result, a ranking is made with the criteria initially described, and it is concluded that the four key fields for this experiment are seller_id, order_status, DATE, and customer_zip_code_prefix.

Variable selection – Second strategy
In this experiment, we chose to eliminate the temporal information in order to obtain a reference against the result of the previous experiment, and thus identify the changes in the set of prioritized variables.
The results are as follows:
• Feature importance with Extra Trees: identified the following features as relevant: price, seller_id, product_id, customer_state, customer_zip_code_prefix. The last three variables had the same score, which is why a result of exactly four values is not given.
• Univariate selection: identified the following features as relevant: order_status, customer_state, customer_city, product_id.
• Recursive Feature Elimination: identified the following features as relevant: seller_id, order_status, customer_state, customer_city.
• PCA: identified the following features as relevant: customer_zip_code_prefix, customer_city, price, product_id.
The most relevant features are, in order:
1. product_id: Appears in three results.
2. customer_state: Appears in three results.
3. seller_id: Appears in two results, ranked 1st and 2nd.
4. order_status: Appears in two results, ranked 1st and 2nd.
5. customer_zip_code_prefix: Appears in two results, ranked 1st and 2nd.
6. price: Appears in two results, ranked 1st and 3rd.
7. customer_city: Appears in two results, ranked 3rd and 2nd.
As a result, a ranking is made with the criteria initially described, and it is concluded that the four key fields for this experiment are product_id, customer_state, seller_id and order_status.
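The four prioritization techniques named above can be sketched with scikit-learn. The snippet below runs all four on synthetic data standing in for the encoded transactional features; the feature names are taken from the document, but the data and the choice of logistic regression as the RFE estimator are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# Synthetic stand-in for the encoded transactional features.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=42)
names = ["seller_id", "order_status", "DATE",
         "customer_zip_code_prefix", "product_id", "price"]

# 1. Feature importance with Extra Trees: rank by impurity-based importance.
et = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X, y)
rank_et = [names[i] for i in np.argsort(et.feature_importances_)[::-1][:4]]

# 2. Univariate selection: keep the 4 features with the best ANOVA F-score.
skb = SelectKBest(score_func=f_classif, k=4).fit(X, y)
rank_uni = [n for n, keep in zip(names, skb.get_support()) if keep]

# 3. Recursive Feature Elimination on a linear model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
rank_rfe = [n for n, keep in zip(names, rfe.support_) if keep]

# 4. PCA: features with the largest loadings on the first component.
pca = PCA(n_components=1).fit(X)
rank_pca = [names[i] for i in np.argsort(np.abs(pca.components_[0]))[::-1][:4]]

print(rank_et, rank_uni, rank_rfe, rank_pca)
```

Aggregating how often each feature appears across the four shortlists yields the kind of ranking used in both strategies above.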
At the end of these two experiments, two subsets of information to be used in the supervised learning algorithms are obtained:
• Group 1: seller_id, order_status, DATE, customer_zip_code_prefix.
• Group 2: product_id, customer_state, seller_id, order_status.

Algorithms application
In this stage, experiments are carried out with each of the algorithms described below, using the data sets defined in the previous stage. The following algorithms are implemented:
• Random Forest
• Linear Regression
• Gradient Boosting Classifier
• Support Vector Machine
• Neural Networks
In summary, the algorithms with the best results are:
• Table 1: Random Forest and Gradient Boosting Classifier
• Table 2: Random Forest and Gradient Boosting Classifier
• Table 3: Random Forest and Gradient Boosting Classifier
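The experimental loop of training each candidate classifier on a feature group and comparing precision and recall can be sketched as follows. The synthetic data (with an 18% positive class mirroring the churn rate found earlier) and the three models shown are illustrative; the document's full scope also includes linear regression and neural networks.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

# Synthetic churn-like data standing in for one of the feature groups:
# four features, roughly 18% positive (churn) class.
X, y = make_classification(n_samples=1000, n_features=4,
                           weights=[0.82, 0.18], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(random_state=0),
}
results = {}
for name, model in models.items():
    # Train on the training split, score churn detection on the held-out split.
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = (precision_score(y_te, pred, zero_division=0),
                     recall_score(y_te, pred, zero_division=0))
    print(f"{name}: precision={results[name][0]:.2f} "
          f"recall={results[name][1]:.2f}")
```

Collecting these per-model scores for each feature group produces comparison tables like Tables 1 to 3, from which the best algorithm is selected.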

Methods
For research purposes, a public dataset has been used so that the work can be reproduced and improved by other researchers. The following selection criteria were defined for the dataset used in this model:
• Data completeness: Data with an acceptable amount of null information for analysis, less than or equal to 5% of the total records.
• Churn flags: The data should not include a churn marking, since the marking model needs to be tested.
• Data volume: More than 100,000 transactions.
• Data source: Real information, not pre-created by software vendors or generated for university courses.
Upon completion of this review, the public-domain data selected is the Brazilian E-Commerce Public Dataset by Olist, retrieved June 10, 19 from https://www.kaggle.com/olistbr/brazilian-ecommerce, and described as "This dataset was generously provided by Olist, the largest department store in the Brazilian markets. Olist connects small businesses throughout Brazil with seamless, single-contract channels. These merchants can sell their products through the Olist Store and ship directly to customers using Olist's logistics partners. See more on our website: www.olist.com."

Conclusions
• One of the most common mistakes in the data management industry is to assume that applying artificial intelligence algorithms is enough to solve any problem, and this is a totally false hypothesis. It was therefore essential to understand the churn problem beyond its outcome, the loss of customers, and beyond the algorithms that predict it.
• A hybrid solution not only produces a working tool for organizations, but provides added value from the first moment, since it integrates technology, algorithms and statistics and makes them available to organizations as part of data-driven business strategies.
• The ensemble methods generated the best churn prediction values, unlike others found in the literature such as SVM. This can be understood given the binary classification required in this solution, the number of features, and the amount of data available for training and testing.
• For historical data models that lack a customer churn marking, the RFM model allows the marking to be generated from customer purchase behavior. Its scope could be extended: beyond marking churn, it could proactively detect whether customers are about to churn.
• As a premise for the evolution of this work, it is necessary to define a robust business process that ensures the chaining of the three stages of churn prevention, so that it can be formalized and trained at different levels of the organization, ensuring that the entire solution is used properly and improves continuously.
• From the customer experience point of view, it is essential that the solution provide a pleasant and easy-to-use visual environment. With existing technologies, this environment can be created through business intelligence platforms such as Power BI or Tableau, the latter serving as the front end of the solution.
With this product, the sales force and the business can be impacted directly, making it a key tool to support them in preventing customer churn and improving customer loyalty.
• The solution must be designed for mobile devices, so that any salesperson can have all of their knowledge tools prepared and ready on their cell phone.
• The cloud, as a fundamental part of this project's evolution, can be applied from different angles. One of these is the deployment of a web service exposing the trained machine learning model, which allows a salesperson, financial or marketing professional, or similar role to predict from a customer's characteristics whether that customer is likely to churn, and thus proactively initiate an intervention. This can also be used to generate new tools that improve the final customer experience and the internal processes of any organization.
• There are current technologies that allow programming languages such as Python to run inside databases, which would improve processing times and provide native integration between the two main points of this paper: the data and the application of artificial intelligence algorithms.