This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2020/21).


With the increasing use of social media over the past decade, it comes as no surprise that the overwhelming flood of information on Twitter (over 6,000 tweets every second) contains valuable information about emergencies and natural disasters. The timely retrieval and analysis of this information can be integral to managing real-time crisis response activities and of great value for NGOs, emergency response services and news agencies. Unfortunately, a keyword-based analysis is not feasible: important keywords are often used out of context, so this approach returns a large number of irrelevant tweets. Our project aims to solve this issue by providing an NLP-based algorithm that categorizes tweets as “disaster tweets” or “non-disaster tweets” and thereby contributes to better and more efficient disaster management in the future.


For a lot of us, social media is a major part of our lives. This may even go as far as posting on Twitter about a car crash that just happened or about a natural disaster. Some of those tweets may actually help fire or police departments notice disasters and figure out where their assistance is needed the most. However, in the daily flood of tweets (500 million per day as of 2020), those crucial posts are hard to find. Machine learning can enable administrations to organise themselves both more efficiently and more effectively.

By using Natural Language Processing, our project Burning Skies is meant to identify those tweets that contain information relevant to public safety administrations.


After starting out by sourcing and labeling disaster and non-disaster tweets ourselves, we quickly realized that manually acquiring a sufficiently large number of tweets would be a tedious and lengthy process. We therefore looked for existing datasets and found two datasets with pre-labeled disaster tweets and one dataset that contained disaster tweets without labels. Although the datasets were sourced from different websites, they share the same basic layout consisting of the tweet, the location where the tweet was published and a keyword. The keyword is the disaster-related catchword of the tweet, an abstraction of its topic. With 222 unique keywords, the datasets cover a broad range of topics ranging from “aftershock” to “emergency” or “wildfire”. After thoroughly analyzing the distributions of keywords and locations, we were able to validate that the variance among disaster and non-disaster tweets was acceptable.

Although the keyword and location columns would have provided helpful additional information to train our model in the following steps, we decided to disregard these columns as this information would not be readily available in a realistic disaster scenario and we did not want to introduce a location-based bias to the model. After merging the labeled datasets, we were left with 19,924 labeled tweets for further data processing and modelling.

Data exploration

When inspecting the available datasets, the differences between disaster and non-disaster tweets are especially important and interesting for our use case. The labeled datasets consist of 28.4% disaster and 71.6% non-disaster tweets (see Fig. 1). As expected, non-disaster tweets use a lot more colloquial language and emojis, while disaster tweets are generally more descriptive. Among other causes, this is because disaster tweets are more often published by news agencies.

Fig. 1 Distribution of disaster and non-disaster tweets.

The number of characters in a tweet is similar among both categories with disaster tweets using slightly more characters on average (disaster: 110.75; non-disaster: 102.73). Although the average number of words is almost identical between both categories, the average word length is distinctly longer in disaster tweets (see Fig. 2), which is the consequence of disaster tweets referring to longer and more unique names more frequently, e.g. names of agencies, events or locations.

Fig. 2 Word count and average word length distributions.
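The per-category averages quoted above (characters, words, word length) can be computed with a few lines of standard-library Python. This is only a sketch: the function name `text_stats` and the whitespace-based word splitting are assumptions, not the project's actual analysis code.

```python
from statistics import mean


def text_stats(tweets: list[str]) -> dict[str, float]:
    """Average characters, words, and word length over a list of tweets."""
    # Assumption: a "word" is anything separated by whitespace.
    words_per_tweet = [tweet.split() for tweet in tweets]
    all_words = [word for words in words_per_tweet for word in words]
    return {
        "chars": mean(len(tweet) for tweet in tweets),
        "words": mean(len(words) for words in words_per_tweet),
        "word_length": mean(len(word) for word in all_words),
    }


# Two invented tweets, just to show the call.
stats = text_stats(["ab cd", "abcd"])
print(stats)
```

Calling the function once for the disaster tweets and once for the non-disaster tweets would give the two sets of averages that are compared in the text.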

When looking at the most common words in the dataset, it comes as no surprise that they are stop words and punctuation (see Fig. 3), showing the need for further data cleaning to build a conclusive NLP model.

Fig. 3 Most commonly used words in raw dataset.

Data Cleaning

The biggest challenge in the initial stage of data cleaning was the encoding errors in the CSV files, which resulted from inconsistent encoding of the base file. These errors ranged from falsely formatted characters (e.g. “&” or “ÌÔ”) to the inclusion of HTML tags. We fixed them with regular expressions and Python packages like unidecode. Emojis posed another challenge as they communicate valuable contextual information that needs to be extracted. For this purpose, we wrote a script to translate emojis into plain text (e.g. 😄 → grinning face with smiling eyes). Duplicates were removed in a two-step process: first we removed identical duplicates, and then we used a string-matching algorithm to find and remove partial duplicates. The dataset contained a lot of partial duplicates, which were often the result of different news outlets posting an article multiple times or people spamming tweets. Among other things, we also removed contractions, punctuation, all non-ASCII characters and cut-off words. Other data processing attempts were not as viable. We used the Python package TextBlob to autocorrect spelling mistakes, but the result was too inaccurate because of the frequent use of colloquial language.

Fig. 4 Wordcloud of Disaster and Non-Disaster Tweets after data cleaning.
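Some of the cleaning steps described above can be sketched with the Python standard library alone. This is an illustration under assumptions, not the project's script: it swaps in `html.unescape` and `unicodedata` where the project used unidecode, uses `difflib.SequenceMatcher` as a stand-in for the unnamed string-matching algorithm, and the function names and the 0.9 similarity threshold are invented. Emoji translation is left out; it would have to run before the ASCII step, which otherwise discards emojis.

```python
import html
import re
import unicodedata
from difflib import SequenceMatcher


def clean_tweet(text: str) -> str:
    """Normalize one raw tweet (illustrative steps only)."""
    text = html.unescape(text)                    # "&amp;" -> "&"
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
    # Decompose accented characters, then drop everything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", " ", text.lower())  # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace


def drop_near_duplicates(tweets: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a tweet only if it is not too similar to one already kept.

    Exact duplicates have a similarity of 1.0 and are dropped as well.
    Compares every tweet with every kept tweet, so it is slow on large data.
    """
    kept: list[str] = []
    for tweet in tweets:
        if all(SequenceMatcher(None, tweet, other).ratio() < threshold for other in kept):
            kept.append(tweet)
    return kept


# Invented example tweets.
raw = [
    "Fire &amp; smoke near <b>Main St</b>!",
    "Fire &amp; smoke near Main St!!",
    "Café flooded after the storm",
]
cleaned = drop_near_duplicates([clean_tweet(t) for t in raw])
print(cleaned)
```

The first two tweets become identical after cleaning, so only one of them survives, which mirrors the case of the same article being posted several times.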

The last step of our data preprocessing was the generation of training and validation datasets for our model. Because we only had a limited number of labeled tweets, we created smaller subsets from the merged dataset by utilizing both a representative selection algorithm based on the keyword distribution and a random selection algorithm. These subsets were then used to train our NLP model.


We decided to try out different algorithms to determine which model would provide the highest accuracy. The algorithms we tried fall into two different categories: a) neural networks, which tend to be more computationally expensive, and b) “classic” statistical methods that are well established and less complex than their deep learning cousins.

Our candidates from the second category were Logistic Regression, the more computationally demanding Random Forest, and Naive Bayes, which is considered well suited to NLP applications for mathematical reasons. From the first category, we used an approach that is extensively covered by the course material (specifically, the workflow described in chapter 10), as well as a combination of Word2vec and Long Short-Term Memory (LSTM).

N-Gram + TF-IDF + Logistic Regression

The most accurate approach we used turned out to be N-Gram + TF-IDF text representation combined with Logistic Regression.

Term frequency-inverse document frequency, or TF-IDF for short, is a vectorizer that turns every text sample (a tweet in our case) into a vector. The vector contains information on how often every token (in the simplest case, a token is a word) appears in this tweet. The advantage of TF-IDF vectorization over a simple Count Vectorizer is that it does not simply count how often a token appears, but compares that frequency to the token's usual frequency in other tweets. By doing so, it reduces the weight of very common tokens that do not contain much information relevant for classifying tweets. In fact, TF-IDF beat the Count Vectorizer in accuracy by about one percent. For our dataset, we went with the TF-IDF Vectorizer provided by the Python package scikit-learn (sklearn) and, after some testing, filtered for tokens that occurred at least five times and accounted for at most 10% of the text.
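To make the weighting idea concrete, here is a small sketch of the textbook formula, term frequency times log(N / document frequency), in plain Python. It is not what scikit-learn computes exactly (its `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default), and the function name and example tweets are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many tweets does each token occur at least once?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


tweets = [
    "the earthquake hit the city".split(),
    "the puppy is fluffy".split(),
    "the city was evacuated".split(),
]
weights = tf_idf(tweets)
# "the" occurs in every tweet, so log(3 / 3) = 0 and its weight vanishes,
# while "earthquake" occurs in only one tweet and keeps a positive weight.
print(weights[0]["the"], weights[0]["earthquake"])
```

A plain count vector would give "the" the largest value in the first tweet; with TF-IDF it contributes nothing, which is the down-weighting of common tokens described above.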

One can imagine that information about how often certain words appear is useful for a model that has to decide on whether a tweet deals with a real emergency. For example, words like “evacuated” and “earthquake” are probably very specific to disaster tweets. On the other hand, we can presume that words like “puppy” or “fluffy” show an opposing pattern.

However, the limits of this approach soon become obvious. When you count only the relative frequency of the words, the tweets “This is a totally dangerous situation and absolutely not harmless” and “This is not a dangerous situation and absolutely harmless” result in almost the same vector, because the word order is lost. To capture the difference between “dangerous” and “not dangerous”, we worked with bigrams. With this technique, a token consists of two consecutive words instead of one. Extending this concept to trigrams did not improve the result. In our application, a combination of unigrams and bigrams proved to be most effective.
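The effect of bigrams on the two example tweets can be checked with a short standard-library sketch. The helper `ngrams` is a hypothetical name and splits on whitespace only, which is simpler than a real tokenizer.

```python
from collections import Counter


def ngrams(text: str, n: int) -> Counter:
    """Count all n-grams (runs of n consecutive words) in a text."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


a = "This is a totally dangerous situation and absolutely not harmless"
b = "This is not a dangerous situation and absolutely harmless"

# Unigrams: the two tweets use the same words except for "totally".
print(set(ngrams(a, 1)) ^ set(ngrams(b, 1)))

# Bigrams: the tweets now differ in several tokens, for example
# "not harmless" (only in a) versus "absolutely harmless" (only in b).
print(sorted(set(ngrams(a, 2)) - set(ngrams(b, 2))))
print(sorted(set(ngrams(b, 2)) - set(ngrams(a, 2))))
```

With unigrams the opposite meanings are nearly indistinguishable, while the bigram sets give a classifier distinct features to work with.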

The best result was achieved by following these preprocessing steps with a Logistic Regression, which reached an accuracy of 81.8%, fairly closely followed by the more computationally demanding Random Forest. You can see an overview of the accuracies of the other models we used in the chart.

Fig. 5: Accuracy of different token representations + machine learning models.
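A pipeline of this kind could be put together with scikit-learn roughly as follows. This is a sketch under assumptions rather than the project's code: mapping the filters mentioned earlier to `min_df=5` and `max_df=0.1` is a guess (both parameters count tweets containing a token, not total occurrences), `build_model` is a hypothetical helper, `max_iter=1000` is an arbitrary choice, and the toy tweets are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_model(min_df: int = 5, max_df: float = 0.1):
    """TF-IDF over unigrams and bigrams, followed by Logistic Regression."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),  # unigrams and bigrams
        min_df=min_df,       # ignore tokens found in fewer than min_df tweets
        max_df=max_df,       # ignore tokens found in more than this share of tweets
    )
    return make_pipeline(vectorizer, LogisticRegression(max_iter=1000))


# Toy data (invented). The filters are relaxed here because six tweets
# are far too few for min_df=5 and max_df=0.1.
texts = [
    "earthquake hits city thousands evacuated",
    "wildfire spreads homes evacuated overnight",
    "earthquake aftershock damages bridge",
    "my fluffy puppy loves the park",
    "this puppy is so fluffy",
    "lovely day at the park",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = disaster, 0 = non-disaster

model = build_model(min_df=1, max_df=1.0).fit(texts, labels)
print(model.predict(["earthquake evacuated"]))
```

On real data, the model would be fitted on the training subset and scored on held-out tweets, for example with `model.score(test_texts, test_labels)`.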

GloVe + LSTM

The much more complex combination of pretrained GloVe word embeddings and long short-term memory (LSTM) culminated in a comparatively meager 78% accuracy on unseen test data after four epochs, while accuracy on our validation set stagnated at around 92%. This discrepancy can be attributed to overfitting and may be explained by suboptimal hyperparameters, which could be improved with a grid search. When visualizing the accuracy on both the training and validation sets after every epoch, we can see that the validation accuracy decreases after the second epoch while the training accuracy continues to increase.

Changing parameters like the learning rate, the batch size, or the optimizer used when compiling the model may allow the model to generalize much better.

Fig. 6: Accuracy development of the train and validation datasets across multiple epochs when using Word2vec and LSTM.

The approach covered by the course material, which a) has its own special tokens for, e.g., string starters and unknown characters, b) uses an embedding matrix as its text representation, and c) works with AWD LSTM, got us a solid 81% accuracy on the validation set. On an unseen test dataset, however, it labeled 94% of the samples correctly, thereby surpassing our combination of TF-IDF and Logistic Regression. We are not sure why the accuracy was so much better on the test dataset than on the validation set.

Conclusion & outlook

Our most accurate model sorted tweets correctly into disaster tweets and non-disaster tweets about 81.8% of the time. That is definitely a lot better than the baseline (just saying “not disaster” every time, which will be right about 57% of the time). But it is also far from perfect. What are the reasons for this, and what could further improve our model?

First, we can take a look at the tweets that our model misclassified. The list of false positives contains tweets like:

  • hmm scratched my ear now its bleeding
  • bloody insomnia again! grrrr!! #insomnia
  • crazy mom threw teen daughter a nude twister sex party according to her friend50 ⇒
  • when your body’s like ‘go to fuck to sleep sami’ and your mind’s like ‘make an emergency plan for every natural disaster go’

These tweets either describe unfortunate circumstances or actually deal with the topic of natural disasters while not talking about actual disasters. In the case of tweets belonging to the latter category, the algorithm rightly classified them as disaster tweets, but as the label was wrong, this dragged the score down.

Another reason the result was not better is that some tweets cannot be put neatly into one of the categories (even by us) due to a lack of context.

Having an algorithm that can sort tweets into tweets about disasters and tweets about anything else is all well and good, but the question arises how this really improves the world. One use case could be to add a web scraper that scans the web for tweets (maybe filtering for certain hashtags, considering the overwhelming number of tweets that get written every day) and have an AI scan them for the most suspicious-looking ones. In this case, you have to think about what a classification into suspicious and unsuspicious actually means. It does not seem reasonable to simply put everything that has a probability of over 50% of being about something dangerous into one box and everything else into another. But it also is not obvious in which direction you should move the needle. Is it better for the machine to be more sensitive and have a possibly disastrous tweet reviewed by a human? Or is it better to flag a tweet for review only if the machine is absolutely sure, as there is going to be enough data for the human to deal with anyway? To make these decisions, the different versions would have to be tested to find out what is most practicable in a real-world application.
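The trade-off behind "moving the needle" can be illustrated by flagging tweets at different probability thresholds and comparing precision (how many flagged tweets are real disasters) with recall (how many real disasters get flagged). The sketch below uses invented probabilities and labels, not outputs of our models, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall when every tweet with prob >= threshold is flagged."""
    flagged = [p >= threshold for p in probs]
    tp = sum(f and y == 1 for f, y in zip(flagged, labels))        # correctly flagged
    fp = sum(f and y == 0 for f, y in zip(flagged, labels))        # false alarms
    fn = sum((not f) and y == 1 for f, y in zip(flagged, labels))  # missed disasters
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Invented model outputs: probability of "disaster" and the true label.
probs = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(probs, labels, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy example, the low threshold catches every disaster tweet but sends more false alarms to the human reviewer, while the high threshold produces no false alarms but misses half of the disaster tweets.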

Our GitHub Repository

The team

Lara Kösters Artificial Intelligence

David Kösters Data Science: Python

Marco Stoever Artificial Intelligence


Jason O’Reilly
