COVID! Do you really know about it?
This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Summer Term 2020).
Authors: Johannes Ehls, Yongpeng Zhao
Project summary:
The coronavirus pandemic has been going on for more than six months, and people are already starting to forget how it began and how it spread across Germany. What is also often overlooked is how the disease affects our society, let alone why some countries suffer fewer infections and deaths while others experience quite the opposite. Do you really want to understand the pandemic? If so, you have come to the right place, and we hope our blog article is to your taste!
Introduction:
The COVID-19 pandemic is an ongoing global pandemic of coronavirus disease. It has not only cost many lives and caused long-term health damage to many people, but has also had a broad impact on many areas, including society, the economy, culture, ecology and politics. Globally, as of the end of September 2020, there have been more than 30,000,000 confirmed cases of COVID-19, including more than 1,000,000 deaths. On the one hand, the project aims to show visually the dynamic development of the pandemic and its economic and social impacts. On the other hand, it aims to analyze the correlations between infection rate and mortality and numerous socio-economic indexes. With the results of the project we want to remind everyone of the seriousness of the pandemic and, furthermore, identify the features that are critical for reducing infection rate and mortality.
Method:
1. Visualization of the dynamic development of the pandemic and its economic and social impacts
In the first part of our data analysis of COVID-19, we asked ourselves how the spread of the disease across Germany changes over time. There are many dashboards and maps on the internet, but most of them display only the current infection numbers and offer no way to view the development over time.
So, we decided to build such a map to get an overview of the development of the infection numbers reported by the authorities in Germany. For this, we used a dataset from the Robert-Koch-Institut (RKI) which contains data on every COVID-19 case reported to the authorities in Germany. The data categories include federal state (‘Bundesland’), district (‘Landkreis/Kreisfreie Stadt’), sex, age group and date of report to the authorities.
In addition, we used a GeoJSON file that mainly contains the borders of the 402 districts of Germany, described as polygons, along with some metadata such as the population of each district. Because of territorial reorganizations and the size of the file, we had to revise it: we looked for redundant and missing districts and reduced its size. For this we used FuzzyWuzzy, a Python package for string matching. The package was also very useful for matching the districts in the RKI data to the districts in the GeoJSON file, because in some cases they were named slightly differently. All of this took a lot of time because the data was not always consistent, and some entries even had to be edited manually.
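The name matching described above can be sketched with Python's standard library alone. The snippet is a minimal illustration, not the project's actual code: it swaps FuzzyWuzzy for `difflib.SequenceMatcher`, and the district spellings, the helpers `normalize` and `best_match` and the cutoff of 0.8 are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and replace punctuation with spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def best_match(name, candidates, min_score=0.8):
    """Return (candidate, score) for the closest candidate.

    The candidate is None when even the best score is below min_score
    (an assumed cutoff, to be tuned on real data).
    """
    scored = [
        (SequenceMatcher(None, normalize(name), normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return (candidate, score) if score >= min_score else (None, score)


# Illustrative spellings only, not taken from the real files.
geojson_names = ["Städteregion Aachen", "Heinsberg", "Gütersloh", "Warendorf"]
rki_names = ["StädteRegion Aachen", "LK Heinsberg", "Gütersloh", "Musterkreis"]

matches = {name: best_match(name, geojson_names)[0] for name in rki_names}
unmatched = [name for name, hit in matches.items() if hit is None]
```

Names that stay below the cutoff, like the made-up `Musterkreis`, end up in `unmatched` and would have to be checked by hand, which mirrors the manual editing mentioned above.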
We then calculated the number of infections and the infections per 100,000 inhabitants per week and per district. Together with the name of the district, a unique identifier and the population of the district, these numbers were stored in a new pandas DataFrame. To get the desired map with a slider for stepping through the weeks, we subset the dataset by week and created a so-called trace for every week we wanted to display. For this we used Plotly, a Python package for visualizing data as plots, charts and maps. After setting a few parameters and adding some styling, we finally got the map figure.
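The weekly calculation can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: the case records, the district identifiers `A` and `B` and the population figures are invented, and grouping by ISO calendar week is one possible definition of "week", not necessarily the one used in the project.

```python
from collections import defaultdict
from datetime import date

# Invented case records: (district id, report date, number of cases).
cases = [
    ("A", date(2020, 2, 26), 3),
    ("A", date(2020, 2, 28), 7),
    ("A", date(2020, 3, 4), 40),
    ("B", date(2020, 6, 17), 150),
]
# Invented population figures, for illustration only.
population = {"A": 250_000, "B": 360_000}

# Sum the reported cases per district and ISO calendar week.
weekly_cases = defaultdict(int)
for district, reported, count in cases:
    iso = reported.isocalendar()
    weekly_cases[(district, iso[0], iso[1])] += count

# One row per district and week, with the incidence per 100,000 inhabitants.
rows = [
    {
        "district": district,
        "year": year,
        "week": week,
        "cases": n,
        "cases_per_100k": n * 100_000 / population[district],
    }
    for (district, year, week), n in sorted(weekly_cases.items())
]
```

A list of dictionaries like `rows` can be turned into a pandas DataFrame with `pandas.DataFrame(rows)` and then filtered by week for plotting.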
We visualized the economic and social impact of the COVID-19 pandemic in Germany with Python, choosing several key indexes as representatives. For the economic impact we selected the gross domestic product (GDP), the Truck Toll Mileage Index and the manufacturing index, which stand for the short-term development of the economy, the development of industrial production and the state of production in selected branches, respectively. For the social impact we selected the number of deaths, migration and mobility trends. The visualization follows a standard data processing workflow: import the data, handle NaN values, build a clean table and plot it. The resulting charts are straightforward and easy to interpret.
2. Building a machine learning model to analyze correlations and make predictions
In the second part of the project we tried to build a machine learning model to find relationships between infection rate and mortality on the one hand and many socio-economic indexes on the other, and then to predict these two pandemic indicators from the socio-economic indexes. The analysis used data from the 40 countries with the highest COVID-19 mortality worldwide. We gathered raw data on various aspects, from economic indexes such as GDP and income, to social indexes such as hospital beds and life expectancy, to government response indicators based on research from the University of Oxford, which evaluate lockdown restrictions and their strictness, testing policy, economic support, etc. [1]
First, we normalized the data so that features measured on very different scales would not exert undue influence on the model.
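The article does not say which normalization was used. A common choice is z-score standardization, which centers every feature on zero and scales it to unit standard deviation. The sketch below assumes that method; the feature names and values are invented.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center a feature on zero and scale it to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Invented feature columns on very different scales.
features = {
    "gdp_per_capita": [46_000.0, 9_800.0, 31_500.0, 63_000.0],
    "hospital_beds_per_1000": [8.0, 2.1, 3.2, 2.9],
}
scaled = {name: standardize(column) for name, column in features.items()}
```

After scaling, both columns have mean 0 and standard deviation 1, so a regression coefficient no longer depends on whether a feature was recorded in dollars or in beds per 1,000 inhabitants.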
After obtaining the normalized (centered) data, we carried out a feature selection to keep the features that correlate significantly with the pandemic indexes and to filter out those of little relevance. We combined lasso regression and random forest regression and took the union of the features that either method rated as important. At this point we could already draw some conclusions about which indexes contribute to a lower infection rate and mortality.
Then we tried to analyze the data with regularized regression, which captures linear correlations between variables while penalizing large coefficients to avoid overfitting. Unfortunately, the model score did not meet our expectations.
Since it was now clear that the relationships are not simply linear, we finally selected several tree-based models, which can uncover non-linear relationships. We chose three methods, Decision Tree, Gradient Boosting and Random Forest, and aggregated them into an averaged result with a voting regressor, an ensemble learning technique. The hyperparameters of the models were tuned with grid search and cross-validation. The resulting model performs far better than the linear model and can be used for prediction.
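The averaging step of a voting regressor is simple enough to show in plain Python. The sketch below only illustrates the idea: the three prediction values are invented, the helper `voting_predict` is hypothetical, and in practice a library class such as scikit-learn's `VotingRegressor` would fit the base models and average their predictions.

```python
def voting_predict(predictions, weights=None):
    """Combine the predictions of several models into a (weighted) average."""
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per model")
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


# Invented predictions of three fitted models for a single country.
predictions = {
    "decision_tree": 90.0,
    "gradient_boosting": 120.0,
    "random_forest": 105.0,
}

ensemble = voting_predict(list(predictions.values()))
weighted = voting_predict(list(predictions.values()), weights=[1, 2, 1])
```

With equal weights the ensemble returns the plain mean of the three models. Averaging tends to smooth out the errors of any single tree-based model, which is the motivation for combining them.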
Project results:
1. Visualization of the dynamic development of the pandemic and its economic and social impacts
1.1 Interactive map
First, we have a look at the results of the interactive map we created.
The heat legend on the right shows the range of values and the colors they map to. As the maximum of the color scale we selected 50 infections per week per 100,000 inhabitants in a district, because this is the threshold above which the government imposes stricter restrictions on public life.
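A map with one trace per week, a slider and a color scale capped at 50 could be specified roughly as follows. This is a sketch, not the project's code: it builds the figure as plain dictionaries in Plotly's figure format, and the function `build_figure`, the GeoJSON property name `id`, the color scale and the sample values are assumptions.

```python
def build_figure(geojson, weekly_values, cap=50):
    """Return a Plotly figure specification with one choropleth trace per week.

    weekly_values maps a week number to {district id: infections per 100k}.
    """
    weeks = sorted(weekly_values)
    traces = []
    for i, week in enumerate(weeks):
        values = weekly_values[week]
        traces.append({
            "type": "choropleth",
            "geojson": geojson,
            "featureidkey": "properties.id",  # assumed property name
            "locations": list(values),
            "z": list(values.values()),
            "zmin": 0,
            "zmax": cap,  # values above the cap share the darkest color
            "colorscale": "Reds",
            "visible": i == 0,  # only the first week is shown initially
            "name": f"Week {week}",
        })
    # Each slider step makes exactly one weekly trace visible.
    steps = [
        {
            "method": "update",
            "label": f"Week {week}",
            "args": [{"visible": [j == i for j in range(len(weeks))]}],
        }
        for i, week in enumerate(weeks)
    ]
    layout = {
        "sliders": [{"active": 0, "steps": steps}],
        "geo": {"fitbounds": "locations", "visible": False},
    }
    return {"data": traces, "layout": layout}


# Invented single-district GeoJSON and weekly values.
geojson = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"id": "A"},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[6.0, 50.9], [6.3, 50.9], [6.3, 51.1], [6.0, 50.9]]],
        },
    }],
}
weekly_values = {9: {"A": 4.0}, 10: {"A": 16.0}}
figure = build_figure(geojson, weekly_values)
```

With Plotly installed, the dictionary can be rendered with `plotly.graph_objects.Figure(figure).show()`.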
In the pictures we can clearly see the peak of newly reported infections per week in Germany (as of 2 October 2020) (Figure 1). We can also identify local outbreaks of the disease, such as the one in Heinsberg (Figure 2) and the one in Gütersloh and Warendorf associated with the Tönnies slaughterhouse (Figure 3). Overall, this visualization gives us a decent overview of the spread of COVID-19 across the German districts over the weeks and offers a solid foundation for further analysis.
As a follow-up analysis we considered investigating whether there is a link between the number of commuters in each district and rising infection numbers in Germany, as an earlier study did [2]. We would have used machine learning techniques for this, but due to a shortage of time we had to cut it short. Keep an eye on our GitHub to stay up to date.
1.2 Economic impacts
a) GDP
As shown in the graph, the gross domestic product of Germany fell drastically, by 10% compared with the same quarter of the previous year.
b) Truck Toll Mileage Index
The Truck Toll Mileage Index, which indicates the development of industrial production, kept pace with the development of the pandemic and reached its lowest point in early April. It has now almost returned to its original level.
c) Index of Production by selected branches
As shown above, the indexes for chemicals and chemical products, for the manufacture of machinery and for the manufacture of vehicles all declined to varying degrees, with the chemical industry index declining the least. This can probably be attributed to the massive demand for chemicals in the fight against the coronavirus.
1.3 Social impacts
a) Deaths
We can clearly see that although infections have increased again, the number of deaths has been kept under control, unlike earlier this year. This is probably related to the experience accumulated in treating acute symptoms.
b) Migration
As displayed in the chart, the migration balance in Germany already turned negative in April, rather than in December as in previous years.
c) Mobility trends
Thanks to the relaxation of restrictions and closures, mobility has gradually recovered to its previous level.
2. Correlation analysis and prediction based on machine learning
a) Correlation analysis
Combining lasso regression and random forest regression, we can conclude the following:
Infection rate:
Conducting more tests is the most important factor in preventing a higher infection rate. Other relevant features are the HDI (Human Development Index) and the Stringency Index (how strict the restrictions are). The share of the population aged 65 and above and life expectancy also influence the outbreak of the disease markedly, probably because older people are more susceptible to the virus.
Mortality:
Higher life expectancy indicates a healthier population, which logically leads to lower mortality. The Containment and Health Index and the Economic Support Index both evaluate the government response. They combine lockdown restrictions and closures with measures such as testing policy and contact tracing, short-term investment in healthcare, and investment in vaccines, all of which can contribute to fewer deaths.
b) Prediction of infection rate and mortality
With the machine learning model, it is possible to predict the infection rate and the mortality of a target country, each based on four parameters.
c) Other interesting COVFACTS
The heatmap reveals other interesting correlations in the pandemic data. For example, contrary to common belief, more developed countries conduct both fewer total tests and fewer tests per capita, despite their higher income and greater medical resources. Economically less developed countries responded more aggressively to the pandemic, as reflected in more restrictions and closures.
In this article we gave an overview of our project work on the COVID-19 pandemic. Would you like to know more? Check our GitHub repository for further information and updates! [3]
References:
[1] https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker
[2] https://www.wirtschaftsdienst.eu/inhalt/jahr/2020/heft/6/beitrag/raeumliche-ausbreitung-von-covid-19-durch-interregionale-verflechtungen.html
[3] The GitHub address will be posted soon.
TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact aachen@techlabs.org so that the corresponding item can be removed.