Deep Learning for Churn Prediction

Deep learning is not such a new research area, but with the development of computing power and data availability, its applications have been on a sustained growth path since 2012. Those applications are endless and very diverse: automated driving, health checks, home devices like Alexa, and even the study of the universe, where a deep learning algorithm was used to produce the first image of a black hole. Nonetheless, there are high expectations around deep learning, and some myths as well. This article aims to show some deep learning techniques, presenting the code of a practical example on a not-so-large dataset.

This project was delivered as an evaluation for the Deep Learning class at Nova - IMS and it was developed by Carlota Reis (https://www.linkedin.com/in/carlota-reis/), Guilherme Miranda (https://www.linkedin.com/in/guilhermedmiranda/), Mariana Garcia (https://www.linkedin.com/in/mariana-cunha-garcia/) and Carlos Cardoso (https://www.linkedin.com/in/carlos-cardoso-9188a4103/).

We applied some deep learning techniques in a sample of 2,711 clients of an investment advisory company. The goal of this project is to predict whether a client will deposit or withdraw funds from their portfolio – a binary classification problem.

By predicting these outcomes, the investment company can tailor its customer retention strategy according to each client’s probability of churn, and possibly prevent clients from withdrawing funds from the company’s accounts. Also, even if the outcome of the model is positive – that is, the client is predicted not to withdraw money – the company can act to persuade the client to increase their wallet share.

Data Preparation

The first step in preparing the data was to treat the missing values in the variable Account Status. Since the clients with blank values had deposits in their accounts, they were considered active, and these missing values were therefore replaced with the number 1.

For the 125 missing values in the variables Age, Birth Date, Gender, Marital Status, and Occupation, no personal information about these 125 clients was found anywhere in the dataset, so the corresponding rows were removed.

In the Client Profile variable, the missing values represented an unknown client profile, so the most conservative profile was assumed for these clients.
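
These three treatments can be sketched in pandas as follows. The column names and example values are assumptions for illustration; the article does not show the original field names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the client dataset (hypothetical columns).
df = pd.DataFrame({
    "AccountStatus": [1.0, np.nan, 0.0, np.nan],
    "Age":           [34.0, np.nan, 51.0, 42.0],
    "ClientProfile": ["Aggressive", np.nan, "Balanced", np.nan],
})

# Blank Account Status entries had deposits, so treat them as active (1).
df["AccountStatus"] = df["AccountStatus"].fillna(1)

# Rows with no personal information at all are removed.
df = df.dropna(subset=["Age"])

# Unknown client profiles default to the most conservative category.
df["ClientProfile"] = df["ClientProfile"].fillna("Conservative")
```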

After treating the missing values, the variables Gender, Marital Status, Occupation, and City were converted into categorical variables. To explore them, some graphs were plotted to assess the relationship between Net Amount and each categorical variable, and we then treated these variables accordingly.

The Account Number was dropped from the dataset since it was not considered relevant for predicting customer churn. The same logic was applied to Birth Date, since the Age variable already delivered the relevant information about the clients. Occupation was dropped because it had over 100 different values, and other variables, such as Annual Income, delivered more relevant information for our project. City was likewise not considered a variable that added value to the dataset or to the development of the project.
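
A minimal sketch of the type conversion and column dropping, again with assumed column names:

```python
import pandas as pd

# Hypothetical slice of the client dataset.
df = pd.DataFrame({
    "AccountNumber": [101, 102],
    "BirthDate": ["1980-01-01", "1975-06-30"],
    "Age": [44, 49],
    "Gender": ["F", "M"],
    "MaritalStatus": ["Single", "Married"],
    "Occupation": ["Engineer", "Teacher"],
    "City": ["Lisbon", "Porto"],
})

# Cast the demographic columns to the pandas categorical dtype.
for col in ["Gender", "MaritalStatus", "Occupation", "City"]:
    df[col] = df[col].astype("category")

# Drop the columns judged irrelevant or redundant for churn prediction.
df = df.drop(columns=["AccountNumber", "BirthDate", "Occupation", "City"])
```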

To treat outliers in the Annual Income, a function was created to find the outliers based on percentile.
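A percentile-based outlier finder could look like the sketch below. The 1st/99th percentile bounds are assumptions; the article does not state which cutoffs were used.

```python
import numpy as np

def find_outliers(values, lower_pct=1, upper_pct=99):
    """Flag values outside the chosen percentile band (bounds are assumed)."""
    values = np.asarray(values)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return (values < lo) | (values > hi)

# Example: one extreme annual income stands out from the rest.
income = np.array([30_000, 45_000, 52_000, 61_000, 2_000_000])
mask = find_outliers(income)
```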

For the Marital Status variable, the categories were converted to numerical values, and the categories “legally separated” and “divorced” were merged, since their meanings are essentially identical and, according to the plot, they have similar values.

The last step of the data preparation was to transform the variable “Net Amount” into a binary variable, since it was selected as our target variable.
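
These last two steps could be sketched as below. The exact cutoff for binarizing Net Amount is an assumption (here, positive net amount counts as a deposit):

```python
import pandas as pd

# Hypothetical sample of the two columns being treated.
df = pd.DataFrame({
    "MaritalStatus": ["married", "divorced", "legally separated", "single"],
    "NetAmount": [1500.0, -300.0, 0.0, 250.0],
})

# Merge the two near-identical categories before encoding to codes.
df["MaritalStatus"] = df["MaritalStatus"].replace("legally separated", "divorced")
df["MaritalStatus"] = df["MaritalStatus"].astype("category").cat.codes

# Binary target: 1 for a net deposit, 0 otherwise (cutoff is an assumption).
df["Target"] = (df["NetAmount"] > 0).astype(int)
```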

Modeling: Initial Model

For the model creation, we chose to add two intermediate layers with 16 hidden units each, both with the rectified linear activation function. The input dimension represents the number of columns in our dataset. A third layer was added to output the scalar prediction of the customer’s churn probability, with a sigmoid activation function so that the output is a probability, a score between 0 and 1.

When training the network, the most suitable loss function for our project is the binary_crossentropy since it is used to solve a binary classification problem and the output will represent a probability.
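
A sketch of this architecture in Keras is shown below. The number of input features is a placeholder, and the optimizer choice (rmsprop) is an assumption, as the article does not name one:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 10  # placeholder for the number of columns in the prepared dataset

# Two 16-unit ReLU layers, then a sigmoid output for the churn probability.
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
```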

From the plotted graphs, we can observe that the accuracy of the training and validation sets increases with every epoch, while the loss of the training and validation sets decreases. After testing different possibilities for the number of epochs, 200 was considered the most suitable for the model in order to prevent overfitting.

Looking at the overall evaluation metrics, this model achieves strong results, above 89%, on both the training and the test set.

Regularization Techniques

In order to explore treatments for overfitting, which is the most central problem in machine learning, and hence to construct a better, more generalized model, several regularization techniques were implemented.

Although the best scenario would be to add more data to the training set, given the current limitation on the number of observations, we considered the following regularization techniques appropriate for improving the model’s performance on unseen data.

Simple Hold-out Validation

This was the first regularization technique applied to our model, and it led to slightly better results in terms of precision, recall, and F1-score than the first model. As the test results are similar to the training ones, one can infer that overfitting is being reduced by this technique.
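
A sketch of hold-out validation, using synthetic stand-in data (the real dataset, feature count, and split ratio are not shown in the article, so these are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in for the prepared client data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10)).astype("float32")
y = rng.integers(0, 2, size=300).astype("float32")

# Hold out 20% of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])

# Monitor validation loss/accuracy while training on the held-in portion.
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=5, batch_size=32, verbose=0)
```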

Adding Weight Regularization

The weight regularization approach penalizes the model’s complexity by adding a cost to the loss function, forcing the weights of the network to be smaller and more regular. Two such regularization approaches were applied, differing only in the penalty used. In Keras these penalties are applied per layer and summed into the loss function.

In the first case, the L2 regularization method was considered, which adds a cost proportional to the square of the weight coefficients. On each layer, an L2 regularizer of 0.001 was added, as we saw in class, meaning that 0.001 × weight_coefficient_value² is added to the loss function for every coefficient in the layer’s weight matrix.
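
In Keras this amounts to passing a `kernel_regularizer` to each hidden layer, as in the sketch below (the input size is a placeholder):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Same two-layer architecture, now with an L2 penalty of 0.001 per layer.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
```

Swapping in `regularizers.l1_l2(l1=0.001, l2=0.001)` would give the combined L1 and L2 penalty instead; the exact coefficients used in the project are not stated.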

With this regularization, the accuracy obtained was 94% and the precision was 97%. In terms of accuracy, this means the model correctly predicted 469 (185+284) withdrawals and deposits out of the 507 observations that comprise the test set. Precision measures the actual positives out of all predicted positives; in this case, the model misclassified only 28 observations as withdrawals when they were actually deposits.

It is possible to observe the loss on both the training and validation sets. From the figure below, one can infer that the steep descent lasts until somewhere between 50 and 100 epochs. As expected, the loss on the validation set was higher than on the training set, roughly twice as high, which is an indicator of overfitting.

The same methodology was applied with simultaneous L1 and L2 weight regularization. By combining both penalties, the model’s performance decreased substantially: the precision dropped to 56%, a significant change compared to the initial model and even to the model with only L2 regularization.

Adding Dropout

Following this rationale, the dropout technique was applied. The evaluation measures obtained were better than the original model’s, although still slightly lower than those of the simple hold-out validation technique.
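
Dropout randomly zeroes a fraction of each layer’s activations during training. A sketch with a 0.5 rate (the rate is an assumption; the article does not state it):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Same architecture with a Dropout layer after each hidden Dense layer.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),  # drop half the activations during training
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
```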

Cross Validation

As seen in class, the cross validation procedure was also applied to the dataset, in order to avoid information leakage between the different sets of data and to achieve a more generalized model. Here there is no need for a separate validation set, since the data is partitioned into smaller sets, in this case 10, each with its own training and test portions.

By splitting the data and fitting the observations, an average accuracy of 94% was achieved, with a standard deviation of 2%.
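
The procedure can be sketched with a stratified split and a fresh model per fold. Synthetic data stands in for the real dataset, and only 3 folds are used here to keep the sketch fast (the project used 10):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)).astype("float32")
y = rng.integers(0, 2, size=200).astype("float32")

def build_model():
    m = keras.Sequential([
        keras.Input(shape=(10,)),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
    return m

# Train a fresh model on each fold and collect the held-out accuracies.
scores = []
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(X, y):
    model = build_model()
    model.fit(X[train_idx], y[train_idx], epochs=3, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)

print(f"mean accuracy = {np.mean(scores):.2f} +/- {np.std(scores):.2f}")
```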

K-fold CV

After the cross validation, K-fold cross validation was applied, which is similar to the previous procedure. In this case, the data was split into 4 folds, which is akin to having 4 “different” models to train.

By assessing the confusion matrix, it is possible to see a significant decrease in the performance measures, with the accuracy achieving a value of 53%. In this case, there is clear evidence of overfitting the data given the discrepancy between the train and test results. The model is memorizing unwanted patterns and hence is not generalizing as well as before.

Tuning of the Parameters

In order to reduce the overfitting of the data, the parameters of the model were tuned using the GridSearch cross validation technique. The results obtained are in fact better than those of the initial model and are also similar between the train and test sets.
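
As a sketch of grid-search tuning with scikit-learn’s `GridSearchCV`: since the article does not show the Keras wrapper used, `MLPClassifier` stands in for the network here, and the parameter grid is entirely hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

# Hypothetical grid over architecture and regularization strength.
param_grid = {
    "hidden_layer_sizes": [(16,), (16, 16)],
    "alpha": [1e-4, 1e-3],
}
search = GridSearchCV(
    MLPClassifier(max_iter=200, random_state=0),
    param_grid, cv=3, scoring="accuracy")
search.fit(X, y)  # evaluates every parameter combination with 3-fold CV
```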

Keras Callback

Lastly, the Keras callback and TensorBoard tools were also considered. Since modeling a machine learning problem involves a lot of back-and-forth (for instance, rearranging parameters), the callback tool comes in handy in those situations. There are many applications of this tool, but here only the ModelCheckpoint and EarlyStopping callbacks were applied, which are usually used in combination. Applied to our model, they make training stop when the accuracy on the validation set stops improving for more than 1 epoch.
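
A sketch of the two callbacks working together, on synthetic stand-in data (the checkpoint filename and validation split are assumptions):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)).astype("float32")
y = rng.integers(0, 2, size=200).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Stop when validation accuracy fails to improve for more than 1 epoch.
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=1),
    # Keep only the best weights seen so far.
    keras.callbacks.ModelCheckpoint("best_model.keras",
                                    monitor="val_accuracy",
                                    save_best_only=True),
]
history = model.fit(X, y, validation_split=0.2, epochs=50,
                    callbacks=callbacks, verbose=0)
```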

The model’s accuracy and loss are represented in the graphs below, with the training set in orange and the test set in blue. As observed in the previous graphs, the accuracy is superior in the training set (although some peaks of the test set surpass the training set).

Conclusion/Challenges

One of the main challenges when computing a deep learning model is the possible presence of overfitting. So, several regularization techniques were implemented in order to decrease this effect.

As explored above, almost all of these techniques produced similar results between the train and test sets, thus reducing the overfitting of the data.

The obtained evaluation measures of the model, using the different regularization techniques, were good (all above 92% on all evaluation ratios) except for the K-fold validation and weight regularization (L1 and L2).

Since the problem we are trying to solve is a binary classification one, accuracy is one of the most relevant metrics, as it represents the percentage of correct predictions. Although this measure alone doesn’t indicate whether a model is good, we also obtained high values on other evaluation metrics such as precision, F1-score, and AUC.

One further improvement would be to apply this model to a larger dataset and obtain the same high evaluation metrics that were obtained with the one used in this project.

Louise Cardoso
