Can Machine Learning fail?
Machine learning is a big trend right now. It is hyped, and maybe overrated? In fact, machine learning is a very good technique that can be applied to a very wide range of problems. I believe that the success of social media and e-commerce companies in predicting patterns and giving very accurate recommendations has created high expectations for it.
In practical terms, machine learning is a great technique for certain tasks. It is not magic; like everything else, it is only a very good technique if it is applied and used correctly.
Machine learning is an instrument for solving problems, so the most important thing is to understand exactly what the problem is. We can split the tasks into four big groups:
Classification: will the market go up or down, will the client default or not.
Regression, applied when there is a continuous output: what will inflation be next year, how much will salaries grow next year.
Clustering, applied to understand the characteristics of a group: is the investor conservative or aggressive, what are the characteristics of a group that is likely to save more than $10k a year; and
Association: this is the job that social media and streaming platforms such as YouTube, Instagram and Netflix do when they suggest what you might like.
So machine learning is a tool for solving certain problems, and to use it we must understand exactly the problem we have in front of us.
I would like to share a project done in the Machine Learning class at NOVA-IMS, where we applied regression techniques such as Linear Regression and Decision Tree. The goal of the project was to develop a predictive data analytics solution for a French insurance company, following the CRISP-DM methodology (in which arguably the most important part is Business Understanding, that is, understanding exactly what the problem is). The model seeks to forecast the number of claims each policyholder will have in the following year. With this information, the insurance company could adjust its pricing model for the next year’s premiums according to the predicted number of claims.
After going through the steps of the CRISP-DM methodology (Data Understanding, Data Preparation, Modelling and Evaluation) for both Linear Regression and Decision Tree, we arrived at the results discussed below.
Although the results obtained by the Linear Regression model on the training set are similar to those obtained on the test set, every measure shows that the model's predictive power can be considered weak. One possible reason is that the initial dataset is extremely imbalanced (many more observations reporting zero claims than reporting one or more).
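To make the train-versus-test comparison concrete, here is a minimal sketch in Python. The data is a made-up toy example, not the insurance dataset from the project, and `fit_line` and `r_squared` are hypothetical helper names written only for this illustration. It fits a one-feature least-squares line on a small training set and scores it on both sets with R².

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, preds):
    """Share of the variance in ys explained by preds (can be negative)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented toy data: one feature and mostly-zero claim counts.
feature = list(range(10))
train_claims = [0, 0, 1, 0, 0, 0, 0, 2, 0, 0]
test_claims = [0, 1, 0, 0, 0, 0, 2, 0, 0, 0]

# Fit on the training set only, then score the same line on both sets.
intercept, slope = fit_line(feature, train_claims)
preds = [intercept + slope * x for x in feature]

r2_train = r_squared(train_claims, preds)
r2_test = r_squared(test_claims, preds)
print(f"R2 train: {r2_train:.3f}  R2 test: {r2_test:.3f}")
```

On this toy data both scores land close to zero. Similar numbers on the training and test sets only tell us the model is not overfitting, not that it predicts well.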
As with the Linear Regression model, the Decision Tree results on the training set are similar to those on the test set. Even so, once again, every measure shows that the model's predictive power can be considered weak.
Initially, one might expect the Decision Tree model to return more robust results. However, this was not the case. One possible reason is, again, that the initial dataset is extremely imbalanced (many more observations reporting zero claims than reporting one or more). In addition, a larger dataset, with more observations, could lead to better results.
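A quick way to see how imbalance affects the measures is to compare against trivial baselines. The sketch below uses invented claim counts (90 policyholders with zero claims, 8 with one, 2 with two), not the project data, and the helper names are hypothetical. It scores two "models" that ignore every feature: one always predicts zero claims, the other always predicts the average.

```python
def zero_share(claims):
    """Fraction of observations with zero claims."""
    return sum(1 for c in claims if c == 0) / len(claims)


def mean_absolute_error(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


def r_squared(ys, preds):
    """Share of the variance in ys explained by preds (can be negative)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented, heavily imbalanced claim counts (assumption for illustration).
claims = [0] * 90 + [1] * 8 + [2] * 2

# Two baselines that use no information about the policyholder.
always_zero = [0] * len(claims)
always_mean = [sum(claims) / len(claims)] * len(claims)

share = zero_share(claims)
mae_zero = mean_absolute_error(claims, always_zero)
r2_zero = r_squared(claims, always_zero)
r2_mean = r_squared(claims, always_mean)

print(f"share of zero claims: {share:.2f}")
print(f"always zero -> MAE {mae_zero:.2f}, R2 {r2_zero:.3f}")
print(f"always mean -> R2 {r2_mean:.3f}")
```

Always predicting zero already gives a small mean absolute error, which is why a low error on its own says little on data like this. A model is only useful if it clearly beats these baselines.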
This exercise shows some of the problems involved in applying machine learning. It does not work like magic, and it cannot fit and solve every problem. Sometimes good data understanding and data preparation can tell us more than the model itself. Machine learning is a great technique, but it takes hard work and knowledge to use it correctly, and every task needs its specific tool. In real life, we must also remember that the tool will be used by many people, so it is generally better to keep it simple. Keep it simple!