Early Prediction of Sepsis from Clinical Data

Inside.TechLabs
11 min read · Nov 2, 2020


This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Summer Term 2020).

Mentor: Jöran Rixen

Techies: Asa Walberg, Gerriet-Maximilian Goldschmidt, Hannes Drescher, Raphael Jahn

1. Introduction:

In our project we participated in the early prediction of sepsis using the datasets of the PhysioNet/Computing in Cardiology Challenge 2019 (physionet.org). Sepsis, severe sepsis and septic shock are different terms used to describe disorders characterised by a host response to physiological, pathologic and biochemical abnormalities caused by infection. Sepsis is defined as a “life-threatening organ dysfunction caused by a dysregulated host response to infection”, and organ dysfunction is defined by an increase in the Sequential Organ Failure Assessment (SOFA) score.

Sepsis is assumed to be one of the leading causes of mortality for hospitalised patients, with nearly 6 million deaths each year. However, due to the diverse nature of the possible infections and host reactions, diagnosing sepsis remains a difficult task for physicians. Early prediction of sepsis therefore enables early treatment of at-risk patients and reduces the likelihood of severe sepsis and septic shock. It also provides the means to develop more effective preventive methods for treating patients with sepsis. This study aims to provide a numerical methodology to automatically predict sepsis in ICU patients at least 6 hours in advance. The dataset used in this study was published by PhysioNet as part of the 2019 Computing in Cardiology Challenge.

Our goal was to write a classifier whose Sepsis Score* would place among the best 200 participants. *The Sepsis Score is a number calculated by a utility function that rewards classifiers for early predictions of sepsis and penalizes them for late or missed predictions and for predicting sepsis in non-sepsis patients.

2. The Data:

The dataset is structured into two folders (training_A, training_B), each containing 20,000 .psv files. Each .psv file (a pipe-separated variant of a .csv file) contains the data of one patient, with columns for vital signs (1–8), laboratory values (9–34), demographics (35–39) and the outcome (41, Sepsis Label). The specific column names and their descriptions are shown in Fig. 1. Each row describes one hour of measurements; records span a minimum of 8 hours and a maximum of 2 weeks. The last column is the Sepsis Label: for sepsis patients it is 1.0 once sepsis is detected within the next 6 hours, and 0.0 before that. Non-sepsis patients always have a Sepsis Label of 0.0.
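For illustration, a patient file can be loaded with pandas simply by changing the separator. The column subset and values below are mocked, not taken from the real data:

```python
import io
import pandas as pd

# Tiny mock of a patient file; the real files have 41 columns and one
# row per hour. In practice each patient would be loaded with e.g.
# pd.read_csv("training_A/p000001.psv", sep="|")
sample = "HR|O2Sat|Temp|SepsisLabel\n80|97|36.5|0\n82|96|NaN|0\n"
patient = pd.read_csv(io.StringIO(sample), sep="|")

print(patient.shape)                       # (2, 4)
print(int(patient["Temp"].isna().sum()))   # 1 missing value
```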

The most important vital parameters are updated hourly, while the laboratory values are updated only daily. The dataset therefore contains a lot of NaN values (Fig. 2). This was the biggest challenge for our predictions, so the data had to be cleaned and optimized. In addition, the columns come in a variety of different units (for example Celsius, years, mg/dL or mmol/L).

Fig. 1: Description of the column names
Fig. 2: Percentage of missing values for each column.

3. Data Cleaning:

In the cleaning process our main focus was on replacing the NaN values. We achieved the best results by doing this for each patient separately first, and afterwards for the combined dataset of all patients.

We start by loading each patient into a list, where each patient is one entry. This allows us to clean all patients in a loop, where the number of existing and missing values is determined first. Depending on the result, the NaN values are then replaced in different ways. If the number of existing values is greater than the order of interpolation (we chose 3), the missing values are calculated by Akima interpolation. If the number is only greater than 0, the missing values are calculated by linear interpolation. Because interpolation only works between two existing values, values at the beginning and at the end of a record may still be missing. Therefore, forward fill and backward fill are used to replace these NaNs: they copy the first/last available value onto the missing entries before/after it.
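A minimal sketch of this per-patient cleaning loop with pandas (the Akima method requires scipy under the hood; the column names and values are mocked):

```python
import pandas as pd

ORDER = 3  # order of interpolation, as chosen above

def clean_patient(df):
    """Replace NaNs per column: Akima interpolation if more than ORDER
    values exist, linear interpolation if at least one value exists,
    then forward-/backward-fill the remaining edge NaNs."""
    df = df.copy()
    for col in df.columns:
        n_valid = df[col].count()
        if n_valid > ORDER:
            df[col] = df[col].interpolate(method="akima")   # needs scipy
        elif n_valid > 0:
            df[col] = df[col].interpolate(method="linear")
        # interpolation cannot fill values before the first / after the
        # last measurement, so copy the nearest available value instead
        df[col] = df[col].ffill().bfill()
    return df

# Mock patient: hourly heart rate with gaps, one lone lactate measurement
patient = pd.DataFrame({
    "HR":      [None, 80.0, 81.0, None, 83.0, 85.0, 84.0, 86.0, None],
    "Lactate": [None, None, 2.0, None, None, None, None, None, None],
})
cleaned = clean_patient(patient)
print(cleaned["HR"].tolist())
```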

In the next step, all cleaned patients are saved into one dataframe. This is necessary because the remaining NaNs are replaced by randomly generated values based on the values of all patients. These values lie in the range mean ± standard deviation; three quarters are normally (Gaussian) distributed and one quarter uniformly.
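One way to implement this imputation step (a sketch; the exact sampling scheme of our pipeline may differ in detail, and the column name is illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def impute_from_population(df):
    """Replace remaining NaNs of each column with random values around the
    population mean: ~3/4 Gaussian(mean, std), ~1/4 uniform in mean ± std."""
    df = df.copy()
    for col in df.columns:
        mask = df[col].isna()
        n = int(mask.sum())
        mu, sd = df[col].mean(), df[col].std()
        if n == 0 or pd.isna(mu) or pd.isna(sd):
            continue
        gaussian = rng.normal(mu, sd, n)
        uniform = rng.uniform(mu - sd, mu + sd, n)
        choose_gauss = rng.random(n) < 0.75   # three quarters Gaussian
        df.loc[mask, col] = np.where(choose_gauss, gaussian, uniform)
    return df

pooled = pd.DataFrame({"Lactate": [1.0, 2.0, 3.0, None, None]})
filled = impute_from_population(pooled)
print(filled["Lactate"].tolist())
```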

As an additional option we can drop columns whose percentage of missing values exceeds a specific threshold. We chose 95% missing values as the threshold, because with this many imputed values there could be a great difference between the distributions before and after the cleaning process.
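This option fits in a few lines (the threshold follows the text; the column names are illustrative):

```python
import pandas as pd

THRESHOLD = 0.95  # drop columns with more than 95 % missing values

def drop_sparse_columns(df, threshold=THRESHOLD):
    """Keep only columns whose fraction of missing values is <= threshold."""
    return df.loc[:, df.isna().mean() <= threshold]

df = pd.DataFrame({"HR": range(100), "Bilirubin_direct": [1.0] + [None] * 99})
print(list(drop_sparse_columns(df).columns))
```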

In the figures below, the distributions of the original and the cleaned data are shown. In Fig. 3, one of our first strategies (Akima interpolation only) was used to generate the replacement values; its distribution deviates strongly from the original one. With the final strategy, as described above, the distribution of the cleaned data is much closer to the original. It has to be mentioned that the PaCO2 column has a very high percentage of missing values.

Fig. 3: Distribution of the original data and the data from the first tries of cleaning
Fig. 4: Distribution of the original data and the final status of the cleaned data

3.1 Unbalanced dataset:

After we finished cleaning the data, we started training prediction algorithms and were quite shocked by the initial results. Our Sepsis Score was poor (sometimes even negative) despite quite good standard statistical measures. With a bit of investigation we found out that our models predicted (nearly) only negative labels, due to a high imbalance of positive and negative labels: in the original dataset only roughly 1.8% of the rows carry a positive Sepsis Label (a fraction we refer to as the Sepsis Ratio from now on). Therefore we had to come up with a way to obtain a more balanced training dataset.

Fig. 5: Sepsis Score in dependence of the Sepsis Ratio

Our first approach to this problem was to simply drop negative rows until we reached the desired ratio. However, this reduced our training dataset size drastically. The complete dataset has 1.55 million rows; with 80% of that as a training set, we had about 1.25 million rows for training. But for a Sepsis Ratio of 50% (50% positive and 50% negative Sepsis Labels) only about 55,000 rows were left.
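This undersampling approach can be sketched as follows (the label name and ratio follow the text; the data is mocked with the dataset's rough 1:50 imbalance):

```python
import pandas as pd

def undersample(df, label="SepsisLabel", ratio=0.5, seed=0):
    """Drop random negative rows until positives make up `ratio` of the data."""
    pos = df[df[label] == 1]
    neg = df[df[label] == 0]
    n_neg = int(len(pos) * (1 - ratio) / ratio)
    return pd.concat([pos, neg.sample(n=min(n_neg, len(neg)), random_state=seed)])

data = pd.DataFrame({"SepsisLabel": [1] * 10 + [0] * 90, "HR": range(100)})
balanced = undersample(data)
print(balanced["SepsisLabel"].value_counts().to_dict())
```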

As performance generally scales with training dataset size, a better solution is to oversample the minority class. The simplest approach duplicates examples of the minority class, although these copies add no new information to the model. Instead, new examples can be synthesized from the existing ones. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. A synthetic instance is then created by choosing one of those k neighbors, b, at random and connecting a and b to form a line segment in feature space; the synthetic instance is a convex combination of the two chosen instances a and b.
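The core of SMOTE can be sketched in a few lines of numpy. This is only meant to illustrate the convex combination; in practice one would use a library implementation such as the one in imbalanced-learn:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(minority, k=5):
    """Create one synthetic minority instance: pick a at random, pick one of
    its k nearest minority neighbors b, return a point on the segment a-b."""
    a_idx = rng.integers(len(minority))
    a = minority[a_idx]
    dist = np.linalg.norm(minority - a, axis=1)
    dist[a_idx] = np.inf                    # exclude a itself
    neighbors = np.argsort(dist)[:k]        # k nearest minority neighbors
    b = minority[rng.choice(neighbors)]
    lam = rng.random()                      # random position on the segment
    return a + lam * (b - a)                # convex combination of a and b

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(minority, k=2)
print(synthetic)
```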

A general downside of the approach is that synthetic examples are created without considering the majority class, possibly resulting in ambiguous examples if the classes overlap strongly. This is probably why we did not see much improvement from a larger training dataset (i.e. more synthesized instances) beyond a certain threshold. We found that 100,000 rows was the sweet spot between diminishing returns and training time (although we could have trained with up to 2.55 million rows).

4. Algorithms we used to predict Sepsis:

4.1 Gaussian Naive Bayes:

Description: A classifier based on Bayes’ theorem. It assumes that each attribute depends only on the class attribute. This often does not correspond to reality, but Naive Bayes nevertheless often achieves good results as long as the attributes do not correlate too strongly.

How it works: With Gaussian Naive Bayes, the mean of each attribute is calculated per class. Assuming that each attribute is normally distributed, the standard deviation is calculated as well and the class-conditional distribution is determined.

These distributions are used to determine the probability of each attribute value occurring in each class. When an example is classified, the algorithm compares the resulting class probabilities and chooses the class with the highest one.
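As a toy illustration with scikit-learn (the two features below are made up; the real model was trained on all cleaned columns):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy features: (temperature, heart rate)
X = np.array([[36.5, 80], [37.0, 85], [39.5, 120], [39.0, 115]])
y = np.array([0, 0, 1, 1])  # 0 = no sepsis, 1 = sepsis

clf = GaussianNB().fit(X, y)        # estimates per-class mean and variance
print(clf.predict([[39.2, 118]]))   # → [1]
```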

4.2 Random Forest:

A random forest is an ensemble of multiple, largely uncorrelated decision trees. For classification each decision tree in the forest makes a prediction, and the majority vote is the final prediction of the random forest (for regression, the individual predictions are averaged).

A decision tree is a tree in which each internal node represents a “test” on one or more attributes, chosen so that the resulting groups are as different from each other as possible (and the members of each resulting subgroup are as similar to each other as possible).

Fig. 6: Random Forest, Decision Trees

When a random forest is trained, two mechanisms ensure that the decision trees stay mostly uncorrelated.

1. Bagging (Bootstrap Aggregation): Decision trees are very sensitive to the data they are trained on; small changes in the training data can result in significantly different tree structures. That is why each decision tree in a random forest samples a random subset of the dataset (with duplicates allowed) for training.

2. Feature Randomness — In a normal decision tree, when it is time to split a node, we consider every possible feature and pick the one that produces the most separation between the observations in the left node vs. those in the right node. In contrast, each tree in a random forest can pick only from a random subset of features. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification.
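With scikit-learn, a random forest using both mechanisms looks like this (the data and parameter values are illustrative, not our tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the cleaned, balanced training data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# bootstrap=True enables bagging; max_features limits the features
# each split may consider (feature randomness)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)
print(rf.score(X, y))
```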

4.3 K-Nearest Neighbors:

K-Nearest Neighbors (KNN) is a non-parametric, lazy learning algorithm. Non-parametric means that it makes no assumptions about the underlying data: it classifies based on proximity to other data points, regardless of what the numerical values represent. The input consists of the k closest training examples in feature space, and the output is a class membership decided by a plurality vote of those neighbors. The distance between data points is measured by a function that takes all attributes in the n-dimensional feature space into account.

The problem with this algorithm is that the attributes have to be weighted for the distance measurement. By choosing weights manually, we would effectively tell the algorithm in advance which attributes are decisive for sepsis and thereby bias the result. We did not find a satisfactory weighting, and the results remained insignificant. Furthermore, the runtime turned out to be very high for the given dataset (> 4 h), since the distances from each data point to every other data point are computed over all attributes. KNN is a solid algorithm that suits standard cases thanks to its simplicity, but it is not applicable to our specific, large dataset.
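For completeness, a minimal scikit-learn sketch (toy data; note that this ignores the attribute-weighting problem discussed above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy features: (temperature, heart rate)
X = np.array([[36.5, 80], [36.8, 82], [39.5, 120], [39.2, 118]])
y = np.array([0, 0, 1, 1])

# classify by plurality vote of the 3 nearest training points
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[39.0, 117]]))   # → [1]
```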

4.4 Support Vector Machines:

Support Vector Machine (SVM) is a machine learning algorithm based on a supervised learning approach, mainly used for classification and regression analysis. A dataset is classified based on a hyperplane that separates the data points in the most efficient way. In 2 dimensions the data points can be split by a line, as shown in the image on the left; in n dimensions a hyperplane is required, as shown on the right. Data points located close to the hyperplane act as ‘support vectors’ that determine the orientation and position of the hyperplane. The distance of the support vectors to the hyperplane, known as the margin, should be as large as possible for a robust classification.
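A toy example with scikit-learn’s SVC (the data is illustrative, chosen so the two classes are linearly separable):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy classes
X = np.array([[0, 0], [1, 1], [4, 4], [5, 5]])
y = np.array([0, 0, 1, 1])

svm = SVC(kernel="linear").fit(X, y)
print(svm.support_vectors_)        # the points closest to the hyperplane
print(svm.predict([[4.5, 4.5]]))   # → [1]
```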

Fig. 7: Support Vector Machine [source: https://towardsdatascience.com/svm-feature-selection-and-kernels-840781cc1a6c ]

5. Hyperparameter Tuning:

Hyperparameters are settings of an algorithm that can be adjusted to optimize performance. While model parameters are learned during training (such as the slope and intercept in a linear regression), hyperparameters must be set before training. In the case of a random forest, hyperparameters include the number of decision trees in the forest and the number of features considered by each tree when splitting a node.

Hyperparameter tuning relies more on experimental results than on theory, so the best way to determine the optimal settings is to try many different combinations and evaluate the performance of each model. Using Scikit-Learn’s RandomizedSearchCV method, we can define a grid of hyperparameter ranges, randomly sample from that grid, and evaluate the Sepsis Score for each combination. We let the randomized search run for about five to six hours (note that not all combinations were tested in this time, only about 300 randomly chosen ones). This narrowed down the range of possible (and reasonable) hyperparameters while increasing the score slightly (about 2–3%).

Knowing where to concentrate our search, we could then explicitly specify every combination of settings to try. We did this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. This last step provided hyperparameters which increased the Sepsis Score of our model by more than 5%.
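The two-stage search can be sketched like this (the data, parameter ranges and fold counts are illustrative placeholders, not the values we actually searched over):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Step 1: coarse random search over wide ranges
wide = {"n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20],
        "max_features": ["sqrt", "log2"]}
rs = RandomizedSearchCV(RandomForestClassifier(random_state=0), wide,
                        n_iter=5, cv=3, random_state=0).fit(X, y)

# Step 2: exhaustive grid search in the neighbourhood of the best result
narrow = {"n_estimators": [rs.best_params_["n_estimators"]],
          "max_depth": [rs.best_params_["max_depth"]],
          "max_features": ["sqrt", "log2"]}
gs = GridSearchCV(RandomForestClassifier(random_state=0), narrow, cv=3).fit(X, y)
print(gs.best_params_)
```

In the real pipeline each candidate was evaluated with the challenge’s Sepsis Score rather than plain accuracy.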

6. Conclusion:

The algorithm developed in this study combines clinical knowledge with numerical and statistical approaches in order to provide a reliable tool for predicting sepsis in ICU patients 6 hours in advance. The proposed methodology is able to overcome the challenges of missing values and the unbalanced dataset. We found it to be a promising method for predicting septic patients.

Compared to the other participants in the PhysioNet/Computing in Cardiology Challenge 2019, we reached 69th place with a Sepsis Score of 39.44% using the random forest algorithm.

TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact aachen@techlabs.org so that the corresponding item can be removed.
