Text Mining for Investments

I am a great enthusiast of Natural Language Processing (NLP). With the huge amount of information produced daily, it is impossible for humans to keep up with it all. Reading everything of interest could take all day and still not be enough. When we think about the stock market and investment decisions, this goes to another scale. The GameStop case is an example of how social media and words can move assets in the stock market. I believe that a diligent investment management firm must have algorithms monitoring and extracting signals about the assets in its coverage universe.

Related to this field, I would like to share some techniques and evaluations from a project presented for the Text Mining course in the postgraduate program at NOVA-IMS.


The goal of this project was to develop an NLP model capable of predicting the daily closing values of a stock market index based on news text. I used Python 3 with the NLTK and scikit-learn libraries.

The project's structure is: 1. Exploratory Data Analysis, 2. Data Preprocessing, 3. Feature Engineering, 4. Classification Model and 5. Evaluation.

Regarding the techniques applied, I tested four different models: Gaussian Naïve Bayes, Logistic Regression, Multi-layer Perceptron classifier (MLP) and LSTM networks.

1.  Data Exploration

In this step, the corpora are analyzed to provide some conclusions and visual information (bar charts, word clouds, etc.) that contextualize the data.

The data provided for the analysis came in two CSV files named “train.csv” and “test.csv”. The train.csv file has 27 columns, as follows:

The test.csv file follows the same structure, but without the Closing Status column, which constitutes the target to be predicted.

The text in the headline columns was merged into a new column, Headlines, which contains all the text found in the previous headline columns.
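A minimal sketch of this merge step with pandas; the column names (Top1, Top2, ...) and the sample headlines are assumptions for illustration, not the project's actual data:

```python
import pandas as pd

# Toy stand-in for train.csv: a few hypothetical headline columns.
df = pd.DataFrame({
    "Top1": ["Stocks rally on earnings"],
    "Top2": ["Oil prices slide"],
    "Top3": ["Fed holds rates steady"],
})

# Concatenate all headline columns into a single Headlines column.
headline_cols = [c for c in df.columns if c.startswith("Top")]
df["Headlines"] = df[headline_cols].fillna("").agg(" ".join, axis=1)

print(df["Headlines"].iloc[0])
```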

The word count of the new column and the descriptive analysis are below:

The histogram of word counts follows a distribution close to normal. It was observed that the headlines individually present a positive skew; however, when put together they present the following distribution:



Then we proceed to find the words with the highest frequency, the 15 most frequent:
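A frequency count like this can be sketched with the standard library's `Counter`; the two sample headlines here are made up for illustration:

```python
from collections import Counter

# Hypothetical merged headlines standing in for the Headlines column.
headlines = [
    "stocks rally as fed holds rates",
    "oil prices fall as stocks slip",
]

# Tokenize on whitespace and count, keeping the 15 most frequent words.
tokens = " ".join(headlines).split()
top_words = Counter(tokens).most_common(15)

print(top_words[:3])
```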

2.  Data Preprocessing

In Data Preprocessing, three methods were applied: stop-word removal, stemming and lemmatization. Lemmatization reduces words to their lemma, i.e., to linguistically valid base forms. Stemming is a cruder normalization that truncates words to their root, so that inflected variations of the same word are mapped together.

Stop-word removal filters out words that do not add much meaning to a sentence.

A function named “clean” was built with these three methods and applied to the split datasets.
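A minimal sketch of such a "clean" function. The original project used NLTK's stop-word corpus and lemmatizer; here a tiny inline stop-word set stands in (an assumption for illustration, so the example runs without corpus downloads), and only the stemming step uses NLTK's `PorterStemmer`:

```python
from nltk.stem import PorterStemmer

# Assumed miniature stop-word list; the project would use nltk.corpus.stopwords.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "of", "and", "to"}
stemmer = PorterStemmer()

def clean(text: str) -> list[str]:
    tokens = text.lower().split()                         # simple whitespace tokenizer
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stemmer.stem(t) for t in tokens]              # stemming

print(clean("The markets are rallying on strong earnings"))
```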

3.  Classification Models

Gaussian Naïve Bayes is a probabilistic classification algorithm based on the assumption that there is conditional independence between every pair of features. Naïve Bayes predicts the tag of a text by computing the probability of each tag for that text and then outputting the tag with the highest probability.

Logistic Regression is the baseline supervised machine learning algorithm for classification and has a very close relationship with neural networks. It can be used to classify an observation into one of two classes (like ‘positive sentiment’ and ‘negative sentiment’) or into one of many classes.

The Multi-layer Perceptron classifier (MLP) is an artificial neural network. It consists of at least three layers of nodes: an input layer, one or more hidden layers and an output layer, and it is trained with supervised learning.

LSTM networks are a type of neural network capable of learning long-term dependencies. LSTMs were designed to avoid the long-term dependency problem; remembering information for long periods is practically their default behavior.
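A hedged sketch of an LSTM text classifier in Keras, in the spirit of the model described above; the vocabulary size, sequence length and layer widths are assumptions for illustration, not the project's actual values:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

VOCAB_SIZE, SEQ_LEN = 10_000, 100  # assumed values, for illustration only

model = Sequential([
    Input(shape=(SEQ_LEN,)),            # one padded token-id sequence per day
    Embedding(VOCAB_SIZE, 32),          # learn dense word embeddings
    LSTM(32),                           # capture long-term dependencies in the headlines
    Dense(1, activation="sigmoid"),     # probability that the index closes up
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The sigmoid output and binary cross-entropy loss match the binary Closing Status target.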

4.  Evaluation and Results

A simple evaluation approach was implemented, using accuracy to measure the percentage of inputs in the test set that the classifier labeled correctly.
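The three scikit-learn models described above, evaluated by accuracy, can be sketched end to end on a toy corpus; the four headlines and their up/down labels are invented for illustration (note that GaussianNB needs a dense feature matrix, hence `.toarray()`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Toy corpus standing in for the merged headlines; labels mimic the
# binary Closing Status target (1 = index closed up, 0 = down).
texts = ["stocks rally strong earnings", "markets fall weak data",
         "shares surge after fed decision", "index drops amid recession fears"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts).toarray()  # dense for GaussianNB

models = {
    "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X, labels)  # train on the full toy set (no split, for brevity)
    acc = accuracy_score(labels, model.predict(X))
    print(f"{name}: accuracy = {acc:.2f}")
```

In the real project the accuracy would of course be computed on the held-out test set, not the training data.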

The results are below:

Gaussian NB

Logistic Regression

MLP

LSTM: We could observe that the model gains accuracy with more epochs; the best result was reached in the fourth one.

Conclusion

Clearly, those results could be improved: better techniques could be applied, with more processing power and larger datasets, or the same techniques could be applied not to the overall index but to specific companies. Another use is to apply them not to the news but to companies’ earnings reports, to assess whether a report is positive or negative, combining this with reports from sell-side and buy-side research.

The purpose here is to bring this technique to the attention of investment managers and to show how, in my view, it is essential nowadays: it can bring more focus to the team, save time, avoid unseen risks and complement the investment strategy framework.

Louise Cardoso
