F.A.C.T Fake-News Automated Checking Tool

Inside.TechLabs
7 min read · Jan 28, 2021


This project was part of “the Digital Shaper Program” hosted by TechLabs Copenhagen in the autumn semester 2020/2021.

Introduction

In the last decade, social media platforms have become major channels for the diffusion of news and information, giving everyone the possibility to share their thoughts and opinions with the world. Not only do most people not check the source of material before sharing it online, but identifying the original source of a news story has also become harder, which can make it difficult to assess its accuracy. Social media has also become an attractive target for abuse and manipulation — false information is sometimes purposely prepared and spread by hostile foreign actors, particularly during elections (Politico, 2017). Consequently, the rise of social media has resulted in an increase in the prevalence of fake news — false or misleading information presented as news.

According to the Washington Post, researchers at Ohio State University found that fake news was most likely instrumental in diminishing Hillary Clinton’s support on Election Day. The study offers a first look at how fake news affected voter choices, suggesting that about 4% of President Barack Obama’s 2012 supporters were discouraged from voting for Clinton in 2016 because they believed fake news stories. A few examples of false stories, along with the percentage of Obama supporters who believed each was at least “probably” true (in parentheses), are as follows:

  • Clinton was in “very poor health due to a serious illness” (12 percent)
  • Pope Francis endorsed Trump (8 percent)
  • Clinton approved weapons sales to Islamic jihadists, “including ISIS” (20 percent)

Moreover, fake news creators are becoming increasingly convincing in the articles they publish. Fake news has helped many populist parties gain popularity in the EU. Some social media platforms have tried to reduce the spread of misinformation. For example, Twitter introduced new labels and warning messages that provide additional context and information on some tweets containing disputed or misleading information related to COVID-19. Even though Twitter’s effort is considerable, it is doubtful whether it will be sufficient to significantly improve the situation worldwide.

The growing issue of fake news and its repercussions is the reason we decided to create a website that evaluates the veracity of news based on text extracted from multiple articles — F.A.C.T. The website provides the public with a tool to test a piece of news for genuineness, telling the reader to what extent a piece of information is reliable. This will help people feel safer when reading information online and will hopefully help contain the spread of fake news.

Methodology

We used several tools from our TechLabs tracks to address our problems, ranging from the basics, such as understanding which packages to use, to more complex tools such as machine learning models. The work was split between two groups (front end and back end). The back-end team took care of data gathering, processing, and model creation. The front-end team, on the other hand, took care of the visual part of the app: their goal was to design the user interface and bring the code to life.

On the back end, a labelled dataset from Kaggle served as the basis for our analysis. The dataset provided us with example text data classified as either fake news or not fake news. In the first step, we applied data pre-processing techniques: we cleaned the data by dropping missing values and removing duplicates. Next, we applied typical text-processing functions to bring the data into the right form for our models. First, we removed punctuation and special characters. Furthermore, we removed typical stop-words, which should not influence the categorization of the text frames. In addition, we applied a lemmatization function to remove inflectional endings and return only the base or dictionary form of each word (referred to as the “lemma”).
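The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project’s actual code: the stop-word list here is a tiny hand-rolled stand-in (a real pipeline would use a full English stop-word list, e.g. NLTK’s), and the lemmatizer is passed in as an optional callable so the sketch stays self-contained (the project presumably used something like NLTK’s WordNetLemmatizer).

```python
import re
import string

# Tiny stand-in stop-word list for illustration only; the real pipeline
# would use a full English stop-word list (e.g. nltk.corpus.stopwords).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "or", "of", "to", "in"}

def clean_text(text: str, lemmatize=lambda tok: tok) -> str:
    """Lowercase, strip punctuation and special characters, drop stop-words,
    and lemmatize each remaining token (identity lemmatizer by default)."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # punctuation
    text = re.sub(r"[^a-z\s]", " ", text)                           # digits, other special chars
    tokens = [lemmatize(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("BREAKING: The senators were debating 3 new bills!!!"))
# -> breaking senators debating new bills
```

Plugging in a real lemmatizer is then a one-line change, e.g. `clean_text(s, lemmatize=WordNetLemmatizer().lemmatize)`.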

To make our text data usable for the models applied later, we used the CountVectorizer tool provided in the scikit-learn library in Python. This is a way of tokenizing our data to make machine learning and classification models applicable. We applied three different classification models to our data with the goal of identifying the most suitable one for our case: logistic regression, a linear SVC, and a multinomial Naïve Bayes classifier.
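The vectorization and model-comparison steps can be sketched as follows. The texts and labels below are invented toy stand-ins for the Kaggle dataset (1 = fake, 0 = real), and the hyperparameters are defaults, not the project’s actual settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled Kaggle dataset (1 = fake, 0 = real).
texts = ["aliens endorse candidate", "pope endorses candidate",
         "government releases budget report", "court rules on trade case",
         "secret cure hidden by doctors", "stocks close higher on friday"] * 10
labels = [1, 1, 0, 0, 1, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# Tokenize the cleaned text into word-count vectors.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Fit the three classifiers and compare their accuracies.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVC": LinearSVC(),
    "multinomial NB": MultinomialNB(),
}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    print(name, model.score(X_test_vec, y_test))
```

On the toy data all three models score perfectly; on the real dataset the scores differ, which is what the comparison below is about.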

After applying those three models, we compared their accuracies to identify the most suitable one. These diagnostics showed that logistic regression was the most accurate.

However, all three models provided very accurate results, so we decided to base our final categorization of the input data on a combination of the three. We applied a majority vote of the three models to detect whether the input text should be considered fake news or not.
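The majority vote amounts to labelling a text as fake when at least two of the three classifiers say so. A minimal sketch of the mechanics, using stub classifiers in place of the three fitted models (which are not reproduced here):

```python
import numpy as np

def majority_vote(models, X):
    """Label a sample as fake (1) when more than half of the models say so."""
    preds = np.array([m.predict(X) for m in models])   # shape: (n_models, n_samples)
    return (preds.sum(axis=0) * 2 > len(models)).astype(int)

# Stand-ins for the three fitted classifiers, just to show the voting logic.
class Stub:
    def __init__(self, answers):
        self.answers = answers
    def predict(self, X):
        return np.array(self.answers)

models = [Stub([1, 0, 1]), Stub([1, 1, 0]), Stub([0, 0, 1])]
print(majority_vote(models, X=None))  # -> [1 0 1]
```

With the real models, `X` would be the CountVectorizer output for the user’s input text; scikit-learn’s `VotingClassifier(voting="hard")` implements the same idea out of the box.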

Regarding the UX design part of this project, the track enabled the successful development of a low-fidelity prototype of the web UI in Figma. This was then discussed with and handed over to the web developer for further processing. However, the scope as well as the strong back-end focus of this project resulted in a rather simple, easy-to-use, and minimalistic low-fidelity prototype. The front end would have been built with HTML and CSS to match the simple, minimalistic interface, but unfortunately, as discussed below, this step was not reached.

Our main line of communication was Slack. We kept each other updated in our group and visualized the progress in Jupyter notebooks.

Conclusion and Learnings

We used the project time not only to get to know Data Science, AI, UX, and Web Development and their applications but also to expand our horizons and connect with fellow team members. The different backgrounds and nationalities made the collaboration engaging and interesting. This diversity definitely contributed to the lively teamwork sessions and was one of the highlights of the project work.

Most team members were relatively new to the various applications and topics within Data Science and Web Development but made great progress throughout the semester by applying the contents learned on DataCamp as well as the input from our skilled mentors to the project. While it was at times challenging to apply the fairly theoretical Data Science knowledge from DataCamp to a “real-life” problem, the mentors helped every step of the way and served as great resources for additional knowledge, while always making sure everyone was able to follow the teamwork. This dedication was another highlight of the project phase. In regard to user-experience design, especially the hands-on tasks and learning exercises with Figma were highly relevant and enabled fast and valuable progress and learnings.

As mentioned, this project was the first of its type for most of us, hence a project suitable for beginners was chosen. This included using a high-quality, already-labelled dataset from Kaggle, making it easy to apply ML models. The accompanying restriction was that standard and rather “simple” models were used for the predictions. While these were easily graspable by the team and still ensured high accuracy on the test data, we are aware that more advanced models could have led to a more flexible solution with wider usage.

The biggest challenge when working with fake news is the variability of the facets fake news can take on. This identification problem is widely discussed in the academic literature and makes a definite distinction almost impossible. This rather fluid nature yields a spectrum of fake news, reaching from satire, to sloppy or false reporting, to intentionally misleading and deceptive news. The labelled data we analysed focussed on false reporting and manipulated information, not including the other categories. Hence, when applied to unlabelled data in the future, our model will have limited applicability to satire or unintentionally false reporting. Furthermore, the model has not been tested on unlabelled data, which may further limit its applicability and accuracy in the future.

Furthermore, the teamwork came with several natural challenges. In line with the guidelines and restrictions of the Danish authorities, most meetings were conducted online, which especially in the beginning presented difficulties for the beginners. Moreover, as all group members are full-time students, many with student jobs, coordinating and aligning available time did not come without the usual stress factors. With the team split between front-end and back-end developers, this point became especially visible. However, the team managed to stay active on Slack and schedule regular calls to keep each other updated on the project’s progress.

Despite successful individual learning experiences and tracks, the team did suffer from one stark limitation. Due to unforeseen challenges inside the project team and personal commitments, the Web Development part of the project could not be completed. This means that the application did not get a website or “front face” to present the underlying model and work. Further, this unfortunately meant that the gathered knowledge of User Experience, Web Development, and Data Science, and especially the learnings about their sensitive interplay, was not fully exploited. Despite this drawback, the team members were still able to achieve their individual learning goals and will continue to expand their skillset in the future.

The project team

  • Joanna Zaslonka (AI)
  • Jacob Hornberger (User Experience)
  • Chris Lin (Web Development)
  • Philipp Abt (Data Science)
  • Badr Lafif (Data Science)
  • Nadine Birkhahn (Data Science)

Reference

Politico (2017): https://www.politico.eu/article/fake-news-busters-germany-ben-scott/
