Disaster Forecast — Machine Learning for Flood Prediction
This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2021/22).
The flood of July 2021 caught Germany and other parts of Central and Western Europe by surprise. It caused unprecedented destruction, met with little to no adequate reaction and insufficient assistance from the German government, leaving the homes of most affected citizens in ruins. As our title suggests, we prepared weather data and trained machine learning models (logistic regression and random forest), together with a feature importance analysis and hyperparameter optimization, to predict such flood events. Our aim was to raise awareness and help prevent another catastrophe of a similar scale, which seems more and more likely to occur in our generation due to climate change.
Our ultimate goal was to create a model that could predict flood events in Germany based on (preferably) real-time data from the DWD (German Weather Service) or other sources. For orientation, we looked at existing “Forest Fire Forecast” models, which are widely used in the USA to simulate and forecast forest fires, a comparably relevant problem caused by environmental change.
We started our project with a couple of team meetings to divide the problem into two main parts: data preparation, handled by the graduates of our Data Science track, and the actual model training, done by the Deep Learning Techies. After several weeks of fruitless searching for a dataset that met our requirements, namely flood occurrences linked to weather data for Germany, we decided to train our model for a different country, which ultimately led us to Bangladesh.
From there on, the data scientists prepared the CSV files using the Python packages pandas and NumPy, carrying out some statistical analysis along the way in order to eliminate outliers and verify the usefulness of the dataset. The Deep Learning Techies then picked up and trained the model with weather data from the Bangladesh dataset, which contained 20,000 data points for different weather stations from 1949 to 2013. Two models for predicting floods from weather data were used and compared: first logistic regression and second random forest.
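The outlier elimination mentioned above could look roughly like the following sketch. It is not the code from our repository: the IQR rule, the helper `drop_iqr_outliers`, the column names and the sample values are all assumptions made for illustration.

```python
import pandas as pd


def drop_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] in every listed column.

    Hypothetical helper; rows with missing values in a listed column are dropped too.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Made-up sample; the real dataset's columns may be named differently.
weather = pd.DataFrame({
    "rainfall_mm": [120.0, 135.0, 128.0, 140.0, 131.0, 5000.0],
    "max_temp_c": [30.1, 31.4, 29.8, 30.6, 31.0, 30.2],
    "flood": [0, 1, 0, 1, 0, 1],
})

cleaned = drop_iqr_outliers(weather, ["rainfall_mm", "max_temp_c"])
print(cleaned)
```

Since extreme rainfall can be exactly the signal that precedes a flood, a filter like this is best checked against the flood label before any rows are removed.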
As suggested by the results of the Data Science Team, the most important variable in our model is the amount of rainfall, while others like cloud coverage, temperature and hours of bright sunshine play a smaller role. This was investigated by a feature importance analysis.
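A feature importance analysis of this kind can be sketched with scikit-learn's random forest. The data below is synthetic and built so that the flood label depends only on rainfall; the column names and distributions are assumptions, not values from the Bangladesh dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the weather features named in the text.
X = pd.DataFrame({
    "rainfall": rng.gamma(2.0, 100.0, n),
    "cloud_coverage": rng.uniform(0, 8, n),
    "temperature": rng.normal(28, 3, n),
    "bright_sunshine": rng.uniform(0, 10, n),
})
# Assumed label: a "flood" whenever rainfall exceeds its median.
y = (X["rainfall"] > X["rainfall"].median()).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

On real data the ranking is of course not given by construction, and impurity-based importances can be cross-checked with permutation importance.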
To find the best model parameters without manual trial and error, we used hyperparameter optimization. For this, we first split the data into a training set and a test set such that the training set consists of older data points than the test set. The best hyperparameters were then sought by Grid Search or Randomized Search, with the training set split again into training and validation sets using the rolling method, so that the training data were again older than the validation data. The models were trained with the parameters found, and their performance was measured on the test set using the accuracy score and the ROC AUC score. Analogously to testing, the trained model can be applied to a new dataset of current or forecast weather data to predict whether floods will occur.
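The procedure described above can be sketched as follows. This is a minimal illustration on synthetic, chronologically ordered data, not our actual training script: the parameter grid, the 80/20 split, the use of scikit-learn's `TimeSeriesSplit` for the rolling validation and all data values are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(42)
n = 600

# Synthetic data, assumed to be sorted from oldest to newest observation.
rainfall = rng.gamma(2.0, 100.0, n)
temperature = rng.normal(28.0, 3.0, n)
X = np.column_stack([rainfall, temperature])
# Assumed label: floods become likely with high rainfall, plus some noise.
y = (rainfall + rng.normal(0, 50, n) > 250).astype(int)

# Chronological split: the oldest 80% for training, the newest 20% for testing.
split = int(n * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Rolling validation: each fold trains on older rows and validates on newer ones.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    cv=TimeSeriesSplit(n_splits=4),
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training set.
best_model = search.best_estimator_
accuracy = accuracy_score(y_test, best_model.predict(X_test))
roc_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(search.best_params_, accuracy, roc_auc)

# Applying the trained model to new (here: made-up) weather observations.
new_weather = np.array([[400.0, 29.0], [50.0, 27.0]])
flood_prediction = best_model.predict(new_weather)
```

`RandomizedSearchCV` accepts the same `cv` argument, so the rolling split carries over unchanged if the grid becomes too large to search exhaustively.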
We managed to create a model with an accuracy of 80% for the region of Bangladesh. Unfortunately, we did not have enough time to train it on a dataset for Germany, but in our GitHub repository you will find all the necessary tools and a dataset provided by the DWD (which still needs to be modified) if you want to give the code a run on your own!
In its current state, the model is not able to reliably predict flood events for regions outside of Bangladesh. With some work (e.g. the use of real-time data, which did not make it into our version of the model), the model could serve as an indicator for regions susceptible to future flood events while adjusting to a wide range of changing environmental conditions.
Constantin Michel Data Science (LinkedIn)
Lars Böke Data Science
Yannik Sacherer Data Science (LinkedIn)
Marvin Kohnen Data Science (LinkedIn)
Helena Monke Deep Learning (LinkedIn)
Tyll Röver Deep Learning
Roles inside the team
As previously described, we split our team into two groups to work more effectively and to spare members from attending meetings to which they could not contribute.
That being said, the Data Science Team went on the hunt for a suitable dataset together. Upon completing this task, they split the different variables of the dataset among themselves and performed some exploratory analysis in order to detect outliers, examine positive and negative correlations between the different variables and the occurrence of a flood, and guarantee a smooth handoff to the Deep Learning Team. It certainly came in handy that Yannik Sacherer had prior experience from working as a data analyst and could help out his fellow Techies when needed.
Because the Deep Learning Team had only two members, they tackled their tasks mostly together. First, each of them set out on their own to find suitable methods; they then tried them out together and adjusted them to fit the problem described above.