Fake Check — Spot fake news!
This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2020/21).
Abstract
Nowadays, we are constantly bombarded with a flood of news with ever more eye-catching headlines and content across various channels. It has therefore never been more challenging to verify news, especially viral news on social media. The aim of our project “Fake Check” is to examine news on subjects ranging from politics to world news and to detect fake news among them. To achieve this goal, we conducted descriptive and sentiment analyses of around 45,000 news articles published between 2015 and 2018 to gauge the differences in word usage between fake and true news. We then built a deep learning CNN (convolutional neural network) model to spot fake news in new datasets and created a simple website where users can check news of their own interest.
Introduction
As global news becomes increasingly present through social media and the growing connectivity of our everyday life, fake news becomes a more and more pressing problem. Although fake news shows many recurring attributes, it is often difficult to differentiate between true and fake news, and we sometimes only realize an article could be fake after having read half of it.
We aimed to develop a tool that makes it easier for users to examine news based on characteristics found in historical datasets of fake and true news. Concretely, the tool provides users with a probability that an article is fake news, so that they can confirm their own assessment or decide whether they want to read the article further. By tackling fake news with a “tech4good” project, we want to make the evaluation of news easier for everyone equally.
Methods
Our team consists of three people from three tracks: the artificial intelligence track, the data science track and the web development track. We worked on the project in parallel and met via Zoom several times.
a) Data Science
In the field of data science, we first had to prepare our dataset, which consisted of one file containing fake news and one file containing real news; each record (about 40,000 for each type of news) consists of the title of the article, its text and its date. Since we wanted to do textual analysis including word counts and sentiment analysis, we first had to clean up the dataset. We removed numbers as well as special characters from the texts to be left only with words, and we lowercased all words to make them better comparable. Having prepared the data, we did some statistical analysis by comparing the mean and the standard deviation of the number of words in fake news and real news, and we analyzed which words were used most frequently in both types of news. To gain more insight into word usage in the different types of news, we performed a sentiment analysis. For this, we used the Loughran-McDonald word lists, which contain positive, negative, uncertain and litigious words. For each of these word groups, we also analyzed which words were used most frequently in which type of news. Additionally, we looked at the use of exclamation marks as well as the use of uppercase letters. For all results, we created visualizations.
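As an illustration, the cleaning step can be sketched in Python (a minimal version; the exact regular expressions used in our notebook may differ):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase a news text and strip everything except plain words,
    as done before the word-count and sentiment analysis."""
    text = text.lower()                       # make words comparable
    text = re.sub(r"[^a-z\s]", " ", text)     # drop numbers and special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
    return text

clean_text("BREAKING: 5 Shocking Facts!!!")  # -> "breaking shocking facts"
```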
The graph below shows the distribution of our true and fake news dataset by subject.
b) Artificial Intelligence
With respect to the application of artificial intelligence, our main goal was to spot fake news based on a trained deep learning model on the existing dataset of news (retrieved from Kaggle). The process of building a deep learning model can be divided into three steps as follows:
First, we conducted data preparation and data cleaning. Concretely, we labeled the two separate datasets of fake and true news by adding a new column, category, where 1 stands for fake news and 0 for true news. Thereafter, we merged the two datasets and combined the title and text columns. A textual cleaning function was then defined to lowercase all tokens and remove URLs, non-alphanumeric characters, and extra whitespace. Afterwards, a word-based tokenizer from the tensorflow.keras module was applied to tokenize the textual data and convert it into sequences of word indices. We then split the dataset into training and test data (80:20 ratio). Meanwhile, we computed an embedding index by mapping words to the pre-trained GloVe embeddings, which are vector representations of words trained on different corpora such as Wikipedia, Twitter, and Common Crawl (https://nlp.stanford.edu/projects/glove/). For our project, the embeddings with 6B tokens and 300 dimensions were chosen, as they cover a large number of tokens and a large vocabulary. Based on the above-mentioned word indices and the embedding index, the embedding matrix was computed to prepare the first layer, an embedding layer, of the neural network model.
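The assembly of the embedding matrix can be sketched as follows. This is a toy version: the hand-rolled `word_index` stands in for the output of the Keras tokenizer (which additionally orders words by frequency), and two made-up vectors stand in for the real 300-dimensional GloVe file.

```python
import numpy as np

# Toy corpus; in the project, the Keras word-based tokenizer builds word_index
texts = ["fake news spreads fast", "true news spreads slowly"]
word_index = {}
for text in texts:
    for word in text.split():
        word_index.setdefault(word, len(word_index) + 1)  # indices start at 1

# Toy stand-in for GloVe: the real vectors come from glove.6B.300d.txt
embedding_dim = 3
embeddings_index = {
    "news": np.array([0.1, 0.2, 0.3]),
    "fake": np.array([0.4, 0.5, 0.6]),
}

# Embedding matrix: row i holds the vector of the word with index i;
# words without a pre-trained vector keep a zero row
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```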
Secondly, the architecture of the neural network model was built. The first layer is an embedding layer initialized with the computed embedding matrix; its weights were frozen so that they are not updated during training. The next layer is a convolutional layer, which extracts features from the data. This is followed by a max pooling layer and a long short-term memory (LSTM) layer; a dropout regularization mechanism was also included to improve the classification accuracy on the test dataset.
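A sketch of this architecture in tensorflow.keras (the layer sizes here are illustrative assumptions, not the exact values we used):

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     LSTM, MaxPooling1D)

vocab_size, embedding_dim, max_len = 10000, 300, 500  # illustrative sizes

model = Sequential([
    Input(shape=(max_len,)),
    # In the project, this layer is initialised with the GloVe embedding
    # matrix and frozen (trainable=False) so the vectors stay unchanged
    Embedding(vocab_size, embedding_dim, trainable=False),
    Conv1D(64, 5, activation="relu"),  # extract local n-gram features
    MaxPooling1D(pool_size=4),
    LSTM(64),                          # capture longer-range context
    Dropout(0.2),                      # regularization against overfitting
    Dense(1, activation="sigmoid"),    # probability that the article is fake
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```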
Thirdly, model accuracy and model loss were plotted over the training epochs. The trained model was then saved so that it can later be used to predict whether news entered or requested by users is fake or true.
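The plotting step can be sketched like this; the `history` object is what `model.fit(...)` returns, and the key names assume accuracy was tracked as a metric with a validation split:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training/validation accuracy and loss per epoch and save them."""
    for metric in ("accuracy", "loss"):
        plt.figure()
        plt.plot(history.history[metric], label=f"training {metric}")
        plt.plot(history.history[f"val_{metric}"], label=f"validation {metric}")
        plt.xlabel("epoch")
        plt.ylabel(metric)
        plt.legend()
        plt.savefig(f"model_{metric}.png")
        plt.close()
# afterwards the trained network can be stored, e.g. model.save("fake_check.h5")
```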
c) Web Development
The task for web development was to develop a website that simplifies the process of checking a given news article and shows an uncomplicated output. We concentrated on a website rather than an app because it is more useful for our purpose. For building the website I used HTML, CSS and JavaScript.
To enter news on the website, a textarea was built into which users can copy and paste an article or any other form of news they would like to check. The ‘Check’ button reads the value of the textarea and sends it to the backend with the trained AI model. With a finished backend, the server will send back a value, in this case a percentage, which is shown in a wrapper. To make the process of loading the percentage more interesting for the user, it has a loading animation. A disclaimer at the bottom of the website is meant to prevent users from interpreting the website’s results as absolute.
Above you can see a screenshot of our website. To switch between the different subpages, a navigation bar was installed that sticks to the top of the website. Users can visit the ‘Check’ link (described above), the ‘Info’ link, which leads to a subpage with information about fake news and our dataset, and the ‘Contact’ link, where our team is presented.
Results
As an outcome of our data science work, we succeeded in cleaning about 40,000 records of each type of news. We performed statistical analyses and identified some characteristics of fake and true news. For example, we figured out which words were used most in fake and true news and what kinds of words (e.g., positive, negative, uncertain) were used. Furthermore, we looked at the use of exclamation marks and uppercase letters as well as the number of words fake and true news articles consist of.
As a result of our artificial intelligence work, a neural network model with four layers and regularization was successfully trained. This model reaches 97.90% accuracy on the test dataset after only 2 epochs.
With respect to web development work, we were able to create a user-friendly interface that will facilitate the examination of potential fake news. The layout of the website is designed to be modern and minimalistic.
All in all, we successfully applied and even deepened the knowledge gained from our tracks in the “Fake Check” project. In practice, however, it is crucial to connect the trained model in the Python script with the input from the website and to return the output to the website in order to make the designed website functional. Due to personnel and time constraints, we did not manage to completely finish this part of the project. Nevertheless, our team members expressed their interest in carrying on with the integration of the Python script and the website, since only then can our trained deep learning model be put to use.
The next step will be to solve the connection issues regarding the linkage of the Python script to the website. The project could also be extended to check news in other languages besides English. In this way, our project could hopefully make a small contribution to the fight against fake news.
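One possible way to close this gap between the Python script and the website is a small HTTP endpoint that the ‘Check’ button could call. This is a hypothetical sketch, not our finished backend; `predict_fake_probability` stands in for loading the saved model and running it on the cleaned, tokenized text.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_fake_probability(text: str) -> float:
    # Placeholder: the real function would clean and tokenize `text`
    # and call model.predict(...) on the saved Keras network.
    return 0.5

@app.route("/check", methods=["POST"])
def check():
    text = request.get_json(force=True).get("text", "")
    probability = predict_fake_probability(text)
    # The website displays this value as a percentage
    return jsonify({"fake_probability": round(probability * 100, 1)})
```

The JavaScript on the ‘Check’ page would then POST the textarea content to `/check` and show the returned percentage in the result wrapper.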
The Team
Ann Kristin Borchert Data Science Track (with Python)
Lingling Tong Artificial Intelligence Track (LinkedIn)
Fritz Banke Web Development
Mentor
Felix Kleine Bösing