Sentiment Analysis
For Predicting Stock Market Movements
Nikolaj Vinzentz Gade, Sebastian Gregers Winkel Hansen, Francesco Cocozza, Patrick Staeckmann, Christina Stolz, Peter Thyssen
Introduction
Predicting stock market movements is a well-known and highly complex problem. News outlets and social media have a great impact on these movements, but the flood of available information is impossible to keep up with, for individual readers and professional investors alike. We therefore provide a potential aid: a model that analyses the sentiment of a range of articles, giving traders a fast decision-making tool.
The goal of our project is to develop an algorithm that analyses financial news or social media data about an index or a specific stock and suggests actions based on it. We settled on the stocks of large tech companies, specifically the FAANG companies. This choice allowed us to focus on a single industry, reducing the influence of other factors that affect stock prices, and the tech industry receives extensive news coverage, so plenty of data is available.
Project Work
Our method of working was to divide the project into three working groups corresponding to our digital shaper tracks: the deep learning model, data extraction and cleaning, and the website.
Data Collection
What we assumed to be the simplest and swiftest part of this project took the longest and presented numerous problems, stalling our progress considerably.
We chose to feed our model news articles rather than social media content or a hybrid of both. A major difficulty we faced was obtaining the news articles from the web. We had heard that APIs are an efficient and fast means of pulling data off the internet, but after trying out various API keys we ran into multiple issues. While the APIs did provide the metadata of relevant articles, they did not provide the article text itself. In the rarer cases where a section of the article was returned, its length was not sufficient for the sentiment analysis model to work accurately. Beyond this, the number of articles needed for the model could not be covered either. Accordingly, we turned to web scraping.
Web Scraping
Web scraping is an effective way of gathering data from websites: anything visible on a website can be extracted and exported into a data file that can be fed into a model. While scraping can be done manually, automated tools are preferred as they work faster and save time when dealing with multiple websites and pages. There are multiple Python packages for web scraping, but we settled on Scrapy.
Methodology
Websites have different structures and different underlying HTML/XML, so it is not possible to write one piece of code that fits all websites. We therefore chose to scrape articles only from Reuters, although it would be possible to scrape multiple sources to get a more accurate sentiment for a stock. Scrapy is a web crawling framework written in Python, used to extract data from web pages with the help of selectors based on CSS or XPath.
The first step is to import the needed packages for this part of the project as listed below.
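As a minimal sketch (the exact imports in our project may have differed slightly), the scraping part relies on Scrapy for crawling, its CrawlerProcess helper for running the spider from a script, and pandas for the later clean-up:

import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd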
To define exactly which data our web crawler should extract, we had to define a spider with Scrapy. Spiders are classes which define how a certain site will be scraped, including how to perform the crawl. We start by naming this spider "Reuters_test". Requests are used to crawl the URLs (start_urls), which are the links to the different companies' news pages on Reuters. The first requests are obtained by calling the start_requests() method, which generates a request for each of the defined start_urls and sets the parse method as the callback function for these requests, as sketched below.
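A simplified sketch of such a spider is shown here; the company URLs are illustrative placeholders rather than the exact list we used:

class ReutersSpider(scrapy.Spider):
    name = "Reuters_test"

    # Placeholder URLs standing in for the FAANG companies' news pages on Reuters.
    start_urls = [
        "https://www.reuters.com/companies/AAPL.O/news",
        "https://www.reuters.com/companies/AMZN.O/news",
    ]

    def start_requests(self):
        # Generate one request per start URL and register parse() as the callback.
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)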
In the callback function, we parse the response from the webpage and return it as a dictionary. The function goes through all the links and into the individual articles on each page. We use selectors to extract exactly the data we need, which in our case is the article date and the article text. The web pages come in HTML format; by right-clicking on the part of the website we want to extract and pressing 'Inspect', we can see which CSS expressions we need to extract the article's date and body text.
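Continuing the spider sketch above, the callback could look roughly like this; the CSS selectors are placeholders standing in for the expressions found via 'Inspect':

    def parse(self, response):
        # Follow every article link found on the company news page.
        for link in response.css("a.article-link::attr(href)").getall():
            yield response.follow(link, callback=self.parse_article)

    def parse_article(self, response):
        # Return the fields we need (date and body text) as a dictionary item.
        yield {
            "date": response.css("time::attr(datetime)").get(),
            "text": " ".join(response.css("div.article-body p::text").getall()),
        }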
We then initiate our crawler process, and Scrapy downloads the articles into a dictionary. We transform this into a data frame, clean it, and convert it into a CSV file that can be used as data for our model.
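One way to run this end to end is to let Scrapy write the items to a feed file and then load, clean and export them with pandas; the file names below are illustrative, and the feed export shown requires Scrapy 2.1 or newer:

process = CrawlerProcess(settings={
    "FEEDS": {"reuters_articles.json": {"format": "json"}},
})
process.crawl(ReutersSpider)
process.start()  # blocks until the crawl is finished

df = pd.read_json("reuters_articles.json")
df = df.dropna(subset=["text"])           # drop articles with no body text
df.to_csv("reuters_faang_articles.csv", index=False)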
Machine Learning Modelling
Sentiment Model
The problem at hand is predicting stock market movement based on sentiment analysis of news articles, and therefore we look to the field of Natural Language Processing (NLP). NLP explores how to process and analyse large amounts of natural language data using algorithms. In our case, we want to follow the evolving sentiment towards the FAANG companies expressed in financial articles and use the results as a helpful indicator of how the companies' stocks will perform.
To build our initial model, we decided to use the text module of the fastai library. This module contains many helpful functions for Natural Language Processing, and would allow us to efficiently create a model to help us determine the sentiment expressed in the Reuters articles.
We used multiple sources of data in the creation of our sentiment model, specifically:
Wikipedia:
A dataset (WikiText) containing over 100 million tokens extracted from Wikipedia. This dataset was used for pre-training our model and gave it a general understanding of the English language.
AG News:
A dataset containing 496,835 categorized news articles from more than 2,000 news sources. This was used for fine-tuning our model to better understand newspaper English. About 25% of this data consists of business articles.
IMDB:
A dataset containing 50,000 highly polarized movie reviews for training and testing. This data was used to provide our model with an understanding of sentiment in text.
During all stages of model training on the aforementioned datasets, we performed 3 epochs, meaning the data was fed to the model three times for three rounds of learning. This was done to improve the model's performance.
After having completed training, the sentiment classifier model was exported and made ready for use on up-to-date news articles from Reuters.
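As a rough illustration of the classifier-training step, assuming fastai v2 and a hypothetical labelled DataFrame train_df with 'text' and 'label' columns (the intermediate fine-tuning stage on AG News is omitted here), the code could look like this:

from fastai.text.all import *

# Build DataLoaders from the hypothetical labelled DataFrame (20% held out for validation).
dls = TextDataLoaders.from_df(train_df, text_col="text", label_col="label", valid_pct=0.2)

# AWD_LSTM ships pretrained on WikiText-103, providing the general
# understanding of English described above.
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

# Three passes over the data, i.e. three epochs.
learn.fine_tune(3)

# Export the trained classifier for use on fresh Reuters articles.
learn.export("sentiment_classifier.pkl")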
When the model is fed a new dataset containing FAANG-related articles from Reuters, it assesses each article and scores it in binary form: either 1 (positive) or 0 (negative).
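Scoring the scraped articles could then look like the following sketch, reusing the illustrative file and column names from the scraping section:

from fastai.text.all import load_learner
import pandas as pd

learn = load_learner("sentiment_classifier.pkl")
articles = pd.read_csv("reuters_faang_articles.csv")

# learn.predict returns (decoded label, label index, class probabilities);
# we keep the label index as the 1/0 sentiment score.
articles["sentiment"] = [int(learn.predict(text)[1]) for text in articles["text"]]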
It is important to note that since our model's understanding of sentiment comes from training on IMDB movie reviews, it is optimised for that domain. Its performance on financial articles would likely have been stronger had it been trained on financial articles with predetermined sentiment scores/labels.
Web Site
Introduction
The idea was to deploy an econometric model to a website that lets users simply choose stocks and a time period for which they want to know the sentiment of the picked stocks. This amounts to a simple two-page setup: a landing page explaining the main parts of the project and how it works, and a portfolio page displaying the sentiment of the picked stocks over the picked time period. This would require either deploying the model using Flask or a simple database connection through which model output is displayed on the webpage. Ideally, we wanted to deploy the model to provide a complete solution that could fetch real-time articles from a database and then predict their sentiment within the web application, as sketched below.
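A minimal sketch of what such a Flask deployment could have looked like, assuming the exported classifier from the model section and a simple JSON endpoint called by the portfolio page (none of this was part of the finished product):

from flask import Flask, request, jsonify
from fastai.text.all import load_learner

app = Flask(__name__)
learn = load_learner("sentiment_classifier.pkl")

@app.route("/sentiment", methods=["POST"])
def sentiment():
    # Expects JSON like {"text": "...article body..."} from the front-end.
    text = request.get_json()["text"]
    label, idx, probs = learn.predict(text)
    return jsonify({"sentiment": int(idx), "confidence": float(probs[idx])})

if __name__ == "__main__":
    app.run(debug=True)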
Challenges
The development of the front-end was relatively simple, as it only required a two-page setup built with core front-end plugins and a standard configuration. The back-end development became more complicated, as building the core product, the model, proved more difficult than initially anticipated. The model was therefore not finished until late in the process, which is why the back-end development effort was not prioritized. Towards the end of the project phase, integrating the model, or a database with its results, seemed out of reach. Because of this, the focus was put on developing the front-end to showcase the intended user functionality instead of the completed product.
Methodology and Solutions
To produce the desired website, HTML, CSS and JavaScript were used, with Bootstrap as the primary front-end design framework. This made the front-end development significantly more efficient and produced better results. The landing page was built in a simple "container" setup with three sections: "Landing head", "How it works" and "About us". The landing head is a masthead that includes a two-page "carousel slider", which initially shows the landing head and allows switching to what is considered the second page, "Portfolio".
The portfolio page allows users to select a time period and the stocks whose sentiment they want to investigate. The selected variables are then displayed in a Bootstrap chart setup. The overall website was built following the responsive methodology, so it works on desktop, tablet and mobile devices.
Final Project Results
Our sentiment classifier achieved an accuracy of roughly 80%.
However, as of writing this article, we were not able to provide an assessment of how well our sentiment model's predictions match the actual stock movements of the FAANG companies.
Evaluation
There are further considerations and weaknesses that our group may address independently in the future. These include the weighting of different news articles and the assumption of perfect information in the market. A further improvement would be expanding the source of our data so that it is drawn from multiple news sites, which would in turn improve the accuracy of the model. Furthermore, we need to continue working on making the model run in real time.
Conclusion & Learnings
In terms of getting the data, one of the biggest challenges was figuring out which outlet was suitable and usable for the project. APIs are convenient and easy to use, but we found that free APIs come with a range of limitations, for example a time lag and a limited article length. These limitations were so severe that the resulting model would add no real value to investors, so web scraping was chosen instead.
One personal highlight for one of the group members was when all the different parts were put together. Seeing all the hard work come together after long days of fighting code brought a great feeling of joy.
We faced many hurdles with the online format during the coronavirus pandemic and our theses. Nevertheless, we were able to reach our goal in the end. For that, we would like to thank the Copenhagen TechLabs team for guiding us through our coding experience, and the mentors who gave us much-needed support on this project. We hope our article gave some insight into our programming experience and uncovered some of the problems the new Techie generations may face. We wish you all the best on your individual coding journeys! Should you have any further interest or queries about the project, or would like to get in touch, do not hesitate to reach out.
GitHub Repository and team members LinkedIn profiles
- Nikolaj Vinzentz Gade , Data Science Track
- Sebastian Gregers Winkel Hansen, Data Science Track
- Francesco Cocozza, Artificial Intelligence Track
- Patrick Staeckmann, Web Development Track
- Christina Stolz, Data Science Track
- Peter Thyssen, Data Science Track