Customer Churn-Prevention Model – Recomendation Engine

The strategy of any organization is based on the growth of its customer base, and one of its principles is that selling a product to an existing customer is much more profitable than acquiring a new customer. However, this approach has several opportunities for improvement, since it usually has a totally reactive approach, which does not give opportunity to the areas specialized in customer experience and recovery, to give an effective response for that moment, since the customer is gone at the time of the intervention. This happens because usually a diagnostic analysis of customers who have stopped buying products or services in a defined period, commonly three (3) periods or months, is performed. This thesis work challenges the way to face this problem, and proposes the development of a complete solution, which does not focus exclusively on the prediction of churn, as is usually done in the state of the art research, but to intervene in different interactions that can be carried out with customers. The above focused not only to prevent customer churn, but to generate an added value of continuous improvement in sales processes, increase customer penetration, leading to an improvement in customer experience and consequently, an increase in customer loyalty. Dataset License: CC BY-NC


Summary
The strategy of any organization is based on the growth of its customer base, and one of its principles is that selling a product to an existing customer is much more profitable than acquiring a new customer. It is not surprising that companies pay close attention to the analysis and impact of churn on their business strategies. However, this approach has several opportunities for improvement, since it usually has a totally reactive approach, which does not give opportunity to the areas specialized in customer experience and recovery, to give an effective response for that moment, since the customer is gone at the time of the intervention. This happens because usually a diagnostic analysis of customers who have stopped buying products or services in a defined period, commonly three (3) periods or months, is performed.
The focus of this research is how different concepts and techniques related to the artificial intelligence sub-branch can possibly address a mixed solution, not only from the data perspective but also integrating it with the business approach..

Transactional data model
Data is divided into multiple data sets for better understanding and organization, data flow is visualized in transactional information: This dataset contains information about the customer and their location. It allows to identify unique customers in the order dataset and to find the order delivery location.
In the system, each order is assigned to a unique customer_id. This means that the same customer will get different identifiers for different orders. The purpose of having a customer_unique_id in the data set is to allow you to identify customers who made repurchases in the store. Otherwise, you will find that each order has a different customer associated with it. The information it contains is as follows: • customer_id: Key of the order dataset. Each order has a unique customer_id. This dataset includes data about the products sold by Olist. The information it contains is as follows: • product_id: Unique identifier of the product • product_category_name: Category name in Portuguese.

Vendor data
This dataset translates the product_category_name into English. The information it contains is as follows : • product_category_name: Category name in Portuguese.
• product_category_name_english: Name of the category in English.

Category name translation data
This dataset contains information on Brazilian postal codes and their lat/long coordinates. The information it contains is the following:

Experiments
The experiments developed with the recommendation engine are generated based on the normalization of the customer-product data and a reference dummy dataset.
• Normalized Dataset: A Customer-Product pivot table is generated in order to identify which products have been consumed by the customers. Once this information is obtained, the data is normalized to make it comparable. This matrix contains customer, product, and purchase history normalization information. • Reference Dataset: A purchase definition is made for the products associated with the customer, with which a stable frequency can be obtained, since it is arbitrarily defined that everyone buys a product. This Dataset is taken as a contrast to the normalized one to evaluate the results of the algorithm. All the experiments have been defined with a subset of customers and a limit of five recommended items, so that the results can be compared and evaluated with the same criteria.
The test customers are: This data tool is not related to the user, but to the products that are most purchased by all users, so its result should be consistent with the items with the highest number of sales among customers.

Figure 1. Normalized Dataset -Content-based
As can be seen, the result is effectively aligned with the initial definition, that all customers would have the same products, in this case, the five best sellers.

Figure 2. Reference Dataset -Content-based
In the case of the reference dataset, the same results are maintained, all customers have the same products, the five best-selling products

Collaborative recommendation engine
This data tool is related to the user and the products they have purchased in their history. The approach presented is to identify how similar customers are, based on the products they have already purchased.
This approach, unlike the previous one, presents two approaches to measure the similarity between users, one of these is the cosine distance and the other, the Pearson correlation. The value resulting from this measurement shows that the closer it is to one (1), the more similar the customers are, and the closer it is to zero (0), the less similar they are.   The previous results are generated by specialized model, it is necessary to include quality metrics in the results to define which is the best recommender. The metrics are precision, recall and rsme.
Before generating a analyze, it is necessary to contextualize the metrics used during this evaluation, such as: • Accuracy: Its function is to analyze whether the results obtained have really been used by users of information, an example of this is, if 10 items are recommended and of these, the customer only buys 3, we can say that we have an accuracy of 30%, which indicates that it is very good value and the model has great impact.
• Recall: It is analyzed if the products purchased by the customer is related to the recommended ones, an example of this is, if a customer buys 10 items and among the recommended ones there were 2 of them, then the recall would be 20%.
• RSME: Measures the error of the recommended products, the lower the value of this indicator, the better the results.

Methods (required)
For research purposes, a public dataset has been used so that it can be improved by other researchers. Similarly, the following parameters were defined for the data set used for this model.
• Data completeness: Data with an acceptable amount of null information for analysis, less than or equal to 5% of the total records. • Churn Flags: Should not have churn marking, due to the need to test the marking model. • Data volume: Information greater than 100,000 transactions.
• Data source: Real information, not pre-created by software vendors or from courses generated by universities.
Upon completion of this review, the public domain data selected is Brazilian E-Commerce Public Dataset by Olist. Retrieved June 10, 19 from https://www.kaggle.com/olistbr/brazilian-ecommerce, and described as "This dataset was generously provided by Olist, the largest department store in the Brazilian markets. Olist connects small businesses throughout Brazil with seamless, single-contract channels. These merchants can sell their products through the Olist Store and ship directly to customers using Olist's logistics partners. See more on our website: www.olist.com."

Conclusions
• The content-based recommendation engine fulfills its objective of recommending the best-selling products to customers. • When analyzing the results obtained, the recommendations are strong, the same recommendation is always given to all customers selected from the chosen Dataset. • It is necessary to clarify that, if we compare the recommended products between the normalized Dataset and tests, they are totally different, and this is correct, since their recommendation is different, due to the same nature of the data.
• It can be detailed that the best model to be used is the Cosine distance model, with the normalized Dataset, since it presents the lowest squared error and the best precision and recall values.
• It is also suggested that for future experiments, these tests could be extended, since the current ones have not been good.