Text Mining for Investments
I am a great enthusiast of Natural Language Processing (NLP). With the huge amount of information produced daily, it is impossible for humans to keep up with it all. Reading everything we find interesting could take all day and still not be enough. When we think about the stock market and investment decisions, this goes to another scale. The GameStop case is an example of how social media and words can move assets in the stock market. I believe that a diligent investment management firm must have algorithms monitoring the assets it holds and tracks, and extracting signals about them.
In this field, I would like to share some techniques and evaluations from a project presented for the Text Mining course of the postgraduate program at NOVA-IMS.
The goal of this project was to develop an NLP model capable of predicting the daily closing values of a stock market index based on news text. I used Python 3 with the NLTK and scikit-learn libraries.
The project's structure is: 1. Exploratory Data Analysis, 2. Data Preprocessing, 3. Feature Engineering, 4. Classification Models, and 5. Evaluation.
Regarding the techniques applied, I tested four different models: Gaussian Naïve Bayes, Logistic Regression, the Multi-layer Perceptron (MLP) classifier, and LSTM networks.
1. Data Exploration
In this step, we analyze the corpora and provide some conclusions and visual information (bar charts, word clouds, etc.) that contextualize the data.
The data provided for the analysis came in two CSV files, “train.csv” and “test.csv”. The train.csv file has 27 columns, as follows:
The test.csv file follows the same structure, but without the Closing Status column, which is the target to be predicted.
The text in the headline columns was merged into a new column, Headlines, containing all the text from the individual headline columns.
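As an illustration, here is a minimal pandas sketch of that merge, assuming the headline columns are named Top1 through Top25 (the actual column names in the files may differ):

import pandas as pd

# Load the training data (hypothetical layout with columns Top1..Top25).
train = pd.read_csv("train.csv")

# Collect the headline columns and join them into one text field per day.
headline_cols = [c for c in train.columns if c.startswith("Top")]
train["Headlines"] = train[headline_cols].fillna("").astype(str).agg(" ".join, axis=1)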
The word count of the new column and the descriptive statistics are below:
The word counts in a histogram follow a distribution close to normal. Individually, the headlines present a positive skew; putting them all together, they present the following distribution:
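A short sketch of how the word counts and the histogram could be produced, reusing the hypothetical train frame from the previous snippet:

import matplotlib.pyplot as plt

# Words per day in the merged Headlines column, plus descriptive statistics.
train["word_count"] = train["Headlines"].str.split().str.len()
print(train["word_count"].describe())

# Histogram of the daily word counts.
train["word_count"].plot(kind="hist", bins=30)
plt.xlabel("words per day")
plt.show()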
We then proceeded to find the words with the highest frequency; the 15 most frequent were:
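One way to get this list with NLTK (a sketch; the exact tokenization used in the project may differ):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")

# Tokenize all merged headlines and count token frequencies.
tokens = word_tokenize(" ".join(train["Headlines"]).lower())
print(nltk.FreqDist(tokens).most_common(15))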
2. Data Preprocessing
For data preprocessing, three methods were applied: stop-word removal, stemming, and lemmatization. Lemmatization tries to find the lemma of each word, reducing inflected forms to linguistically valid lemmas. Stemming is a cruder normalization that cuts words down to their root form without considering context or grammatical variation.
Stop-word removal filters out words that add little meaning to a sentence.
A function named “clean” combining those three methods was built and applied to the split datasets.
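The original function is not reproduced here, but a minimal sketch of what clean could look like with NLTK:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean(text):
    # Keep letters only and lowercase everything.
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Drop stop words, then lemmatize and stem what remains.
    words = [w for w in text.split() if w not in stop_words]
    words = [stemmer.stem(lemmatizer.lemmatize(w)) for w in words]
    return " ".join(words)

train["Headlines"] = train["Headlines"].apply(clean)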
3. Classification Models
Gaussian Naïve Bayes is a probabilistic classification algorithm based on the assumption of conditional independence between every pair of features. To predict the tag of a text, Naïve Bayes computes the probability of each tag given the text and outputs the tag with the highest probability.
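A sketch of this model with scikit-learn, assuming TF-IDF features over the cleaned headlines and the Closing Status column as the binary target (the project's exact features may differ):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical split of the cleaned headlines against the binary target.
X_text, X_val_text, y_train, y_val = train_test_split(
    train["Headlines"], train["Closing Status"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_text).toarray()  # GaussianNB needs dense arrays
X_val = vectorizer.transform(X_val_text).toarray()

gnb = GaussianNB().fit(X_train, y_train)
print("validation accuracy:", gnb.score(X_val, y_val))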
Logistic Regression is the baseline supervised machine learning algorithm for classification and has a very close relationship with neural networks. Logistic regression can be used to classify an observation into one of two classes (like ‘positive sentiment’ and ‘negative sentiment’), or into one of many classes.
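Reusing the TF-IDF features from the sketch above:

from sklearn.linear_model import LogisticRegression

# Logistic regression on the same TF-IDF features; it handles sparse or dense input.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", logreg.score(X_val, y_val))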
The Multi-layer Perceptron (MLP) classifier is an artificial neural network. It consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. It is trained with supervised learning; here it is used as a binary classifier.
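A sketch with scikit-learn's MLPClassifier; the layer sizes here are illustrative, not the values from the original project:

from sklearn.neural_network import MLPClassifier

# One hidden layer of 100 units, trained with backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)
print("validation accuracy:", mlp.score(X_val, y_val))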
LSTM networks are a type of neural network capable of learning long-term dependencies. LSTMs were designed to avoid the long-term dependency problem; remembering information for long periods is practically their default behavior.
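A minimal Keras sketch of such a network, reusing the split from the earlier snippets (the project's exact architecture is not shown; TensorFlow/Keras and all hyperparameters here are assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Turn the cleaned headlines into fixed-length integer sequences.
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_text)
seq_train = pad_sequences(tokenizer.texts_to_sequences(X_text), maxlen=300)
seq_val = pad_sequences(tokenizer.texts_to_sequences(X_val_text), maxlen=300)

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),
    LSTM(64),
    Dense(1, activation="sigmoid"),  # probability that the index closes up
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(seq_train, y_train, validation_data=(seq_val, y_val), epochs=4)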
4. Evaluation and Results
A simple evaluation approach was implemented, using accuracy to measure the percentage of inputs in the test set that the classifier labeled correctly.
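For example, with scikit-learn, reusing the logistic regression sketch from above:

from sklearn.metrics import accuracy_score

# Share of held-out inputs the classifier labeled correctly.
y_pred = logreg.predict(X_val)
print("accuracy:", accuracy_score(y_val, y_pred))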
The results are below:
LSTM: we could observe that the model gained accuracy with more epochs; the best result was at the fourth epoch.
Conclusion
Clearly, these results could be improved: better techniques could be applied, with more processing power and larger datasets, or the same techniques could target not the overall index but specific companies. Another use is to apply them not to the news but to companies' earnings reports, to gauge whether a report is positive or negative, combining that with reports from sell-side and buy-side research.
The purpose here is to bring this technique to the attention of investment managers and to show how, in my view, it is essential nowadays: it can bring more focus to the team, saving time, avoiding unseen risks, and complementing the investment strategy framework.