Water Quality Prediction via DOC (Dissolved Organic Carbon) in the Swiss Rhine

TechLabs
Published in Inside.TechLabs
5 min read · Apr 4, 2020

This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2019/20).

Abstract
A data set containing water quality parameters from different stations along the river Rhine in Switzerland was used to predict the parameter DOC (Dissolved Organic Carbon) from parameters that are easier to measure. The explorative data analysis showed effects of Lake Constance (Bodensee) on the Rhine. Furthermore, an agricultural influence on the water quality became visible.

Initially, the idea was to predict water quality at a river bathing site based on earlier measurements of chemical parameters upstream, to help ensure a safe and hygienic bathing experience. The continuous flow of rivers makes water quality hard to predict, and conditions can vary heavily within a day. We therefore narrowed the original idea to a smaller goal: predicting the hard-to-measure parameter DOC, short for dissolved organic carbon, from other measured parameters. The second aspect of our project was an explorative analysis of the data along the river over time.

We used a data set from the Swiss River Monitoring and Survey Programme (NADUF). The stations are precisely defined and the data covers many years. Figure 1 shows the station locations. The data was split into separate Excel sheets, one per station. We combined the two values of each month per station into one data frame with a datetime index.
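The combining step could look roughly like the following sketch. It is not the project's actual code: the column names (`date`, `nitrate`, `doc`), the values and the helper `combine_stations` are assumptions made for illustration, and small in-memory frames stand in for the per-station Excel sheets.

```python
import pandas as pd

# Hypothetical stand-ins for two per-station sheets.
# Column names and values are illustrative assumptions, not NADUF data.
sheets = {
    "DI": pd.DataFrame({
        "date": ["2015-01-07", "2015-01-21", "2015-02-04", "2015-02-18"],
        "nitrate": [0.52, 0.50, 0.49, 0.51],
        "doc": [0.9, 1.1, 1.0, 1.2],
    }),
    "RE": pd.DataFrame({
        "date": ["2015-01-07", "2015-01-21", "2015-02-04", "2015-02-18"],
        "nitrate": [1.40, 1.38, 1.35, 1.37],
        "doc": [1.8, 2.0, 1.9, 2.1],
    }),
}


def combine_stations(sheets):
    """Stack per-station tables into one frame with a datetime index."""
    frames = []
    for station, frame in sheets.items():
        frame = frame.copy()
        frame["date"] = pd.to_datetime(frame["date"])  # parse date strings
        frame["station"] = station                     # remember the origin
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    return combined.set_index("date").sort_index()


combined = combine_stations(sheets)
print(combined.head())
```

With real files, a dictionary of this shape can be obtained from `pd.read_excel(path, sheet_name=None)`, which returns one data frame per sheet; the path and sheet layout would have to match the actual NADUF export.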

Figure 1: Measurement stations along the Rhine from the NADUF data set. Based on: https://www.pinterest.at/pin/290552613431378896/

Besides the analytic focus on DOC, we did not want to miss out on the other measurements in the data set. Using Seaborn, we discovered some interesting developments over time in our explorative data analysis. A comparison of the two stations Diepoldsau (“DI”) and Rekingen (“RE”) showed different effects. The influence of Lake Constance (Bodensee), which lies between these two stations, becomes obvious in figure 2, where nitrate is chosen as a representative parameter. It is heavily used in agriculture and can convert to nitrite, a compound that can be deadly for aquatic organisms even at low concentrations. The difference between the stations has been declining over the last 10 years. The values for DI remain stable; since the values for RE are declining, we can conclude that the nitrate loadings getting into the lake have been reduced. Other quality parameters show the same pattern.

Figure 2: Nitrate values over time for stations Rekingen (RE, blue) and Diepoldsau (DI, orange).
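A station comparison of this kind can be reduced to a yearly gap between the two stations. The sketch below assumes a long-format frame with `date`, `station` and `nitrate` columns; the numbers are made up for illustration and the helper `yearly_station_gap` is hypothetical.

```python
import pandas as pd

# Illustrative values only, not NADUF measurements.
records = pd.DataFrame({
    "date": pd.to_datetime([
        "2008-03-01", "2008-09-01", "2009-03-01", "2009-09-01",
        "2008-03-01", "2008-09-01", "2009-03-01", "2009-09-01",
    ]),
    "station": ["DI"] * 4 + ["RE"] * 4,
    "nitrate": [0.50, 0.52, 0.51, 0.49, 1.50, 1.46, 1.40, 1.36],
})


def yearly_station_gap(df, upstream="DI", downstream="RE", column="nitrate"):
    """Yearly mean per station plus the downstream-minus-upstream gap."""
    yearly = df.assign(year=df["date"].dt.year).pivot_table(
        index="year", columns="station", values=column, aggfunc="mean"
    )
    yearly["gap"] = yearly[downstream] - yearly[upstream]
    return yearly


yearly = yearly_station_gap(records)
print(yearly)
```

The same long-format frame can be passed to Seaborn for a line plot, for example `sns.lineplot(data=records, x="date", y="nitrate", hue="station")`.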

Prediction

In our literature research we found different approaches to predicting water quality parameters with artificial neural networks (ANNs).

We decided to focus on the prediction of DOC. The parameter describes organic matter smaller than one micrometre that is dissolved in the water. The particles result from the decomposition of organic matter such as plants and animals. Water washes a high proportion of this organic material out of the soils of terrestrial areas. This direct runoff contains more DOC than water originating from groundwater. DOC is an indicator of organic loadings in streams as well as of the terrestrial processing (e.g., within soil, forests, and wetlands) of organic matter. As changes occur in land use, atmospheric deposition, and climate, response variables such as DOC will become even more critical for documenting the effect of those changes.

The conventional way to determine this parameter is rather complex and involves filtration and several chemical reactions. Newer approaches use combustion with synthetic air at high temperatures (~720°C). We aimed for a faster estimate of DOC based on other chemical parameters that are easier to measure. A calculated estimate of the DOC value is available sooner and can give a first indication while still in the field.

We used different network architectures and parameters and compared their resulting RMSE (root mean squared error) values. To justify the use of ANNs, we also applied simple time series models like ARIMA. The resulting RMSE values, between 0.432667 and 0.568238, were higher than the results of our ANNs.
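For reference, the RMSE used for all comparisons is the square root of the mean squared difference between observed and predicted values. A minimal sketch in plain Python, with made-up numbers and a hypothetical helper name `rmse`:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up DOC values, only to show the call.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
score = rmse(observed, predicted)
print(score)
```

Because the errors are squared before averaging, a single large miss weighs more heavily than several small ones, and the result is in the same unit as the predicted parameter.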

To compare the classical statistical results with machine learning models, we decided to use some high-performance and widely used libraries for tabular data. The four main libraries in this project are Fastai, which provides the neural network, and the gradient boosting frameworks CatBoost, XGBoost and LightGBM. As recommended in many forums and tutorials, we optimized our models by changing specific parameters (e.g. the number of epochs or iterations, the depth of the neural network, or the learning rate).

We prepared the data so that it could be fed to the models. Specifically, we cleaned the data of measuring errors and deleted the TOC (total organic carbon) value, which is very similar to the DOC value that we want to predict. Next, we processed the date of each measurement and included it in the data as an additional feature. The resulting data frame was split into a training set and a validation set, with the validation set making up 20 % to 25 % of the whole data set.
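The preparation steps described above could be sketched as follows. The column names (`date`, `TOC`, `DOC`), the derived date features, the random split and the helper `prepare` are assumptions for illustration; the project may have split or encoded the data differently.

```python
import numpy as np
import pandas as pd


def prepare(df, target="DOC", leak_cols=("TOC",), valid_frac=0.2, seed=0):
    """Drop near-duplicates of the target, add date features, split the data."""
    df = df.drop(columns=list(leak_cols), errors="ignore")
    df = df.dropna(subset=[target]).copy()
    # Turn the timestamp into numeric features a model can use.
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["dayofyear"] = df["date"].dt.dayofyear
    df = df.drop(columns=["date"])
    # Random split (an assumption): the first share of the shuffled rows
    # becomes the validation set, the rest the training set.
    order = np.random.default_rng(seed).permutation(len(df))
    n_valid = int(round(valid_frac * len(df)))
    return df.iloc[order[n_valid:]], df.iloc[order[:n_valid]]


# Made-up demo data with the assumed column layout.
demo = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=20, freq="14D"),
    "station": ["DI", "RE"] * 10,
    "nitrate": np.linspace(0.5, 1.5, 20),
    "TOC": np.linspace(1.0, 3.0, 20),
    "DOC": np.linspace(0.9, 2.8, 20),
})
train, valid = prepare(demo, valid_frac=0.2)
print(len(train), len(valid))
```

Removing TOC matters because a feature that nearly equals the target would dominate the model and defeat the goal of estimating DOC from easier measurements.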

The important physical and chemical features we used for training our models are temperature, pH value, conductivity, oxygen, oxygen saturation, alkalinity, nitrite, nitrate, ammonium, nitrogen, phosphor, chloride, phosphorus, silica, potassium, sodium, iron, sulfate, nickel, mercury, lead, copper, zinc, magnesium and calcium. Additional categorical features are the date and the measurement station.

After defining the learner object, we trained our model. We optimized each model by changing its parameters to find a low RMSE value. While analysing the results, we had to make sure that our models did not overfit. This can be done by stopping the training early or by reducing the number of iterations.
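The idea of stopping training to avoid overfitting can be illustrated with a generic sketch. It uses plain gradient descent on a linear model with synthetic data as a stand-in, not any of the libraries from this project, and the function name, learning rate and patience value are assumptions. Training stops once the validation RMSE has not improved for a set number of iterations, and the best weights seen so far are kept.

```python
import numpy as np


def fit_with_early_stopping(X_tr, y_tr, X_va, y_va,
                            lr=0.05, max_iter=2000, patience=20):
    """Gradient descent that stops when validation RMSE stops improving."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_rmse, best_iter = w.copy(), np.inf, 0
    for i in range(1, max_iter + 1):
        # Gradient of the mean squared error on the training set.
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        val_rmse = float(np.sqrt(np.mean((X_va @ w - y_va) ** 2)))
        if val_rmse < best_rmse - 1e-9:
            best_w, best_rmse, best_iter = w.copy(), val_rmse, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` iterations
    return best_w, best_rmse, best_iter


# Synthetic data, only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

baseline_rmse = float(np.sqrt(np.mean(y_va ** 2)))  # RMSE of predicting 0
best_w, best_rmse, best_iter = fit_with_early_stopping(X_tr, y_tr, X_va, y_va)
print(best_iter, best_rmse)
```

The gradient boosting libraries and Fastai ship their own early stopping mechanisms, which would normally be preferred over a hand-written loop like this one.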

The first library we used for training our model was XGBoost. With this gradient boosting library, our lowest RMSE value for the validation set was 0.430105. This result lies just below the range of the conventional statistical methods (0.432667 to 0.568238). The next machine learning algorithm we applied to our data was the LightGBM gradient boosting framework. Its results are similar to those of XGBoost. With the Fastai library, which is used in many lessons of our TechLabs learning journey, we achieved an RMSE of 0.399569. The fourth library we used in this project is CatBoost. Compared to the other libraries, this one achieved our best RMSE value: applying the CatBoost framework decreased the RMSE by 42.02 % compared to the XGBoost library we used first. The RMSE values mentioned are shown in table 1.

Table 1: Libraries used and the resulting RMSE.
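The percentage comparison between libraries is a relative decrease with respect to a reference RMSE. The sketch below applies it to the two RMSE values stated in the text. The last value is only what the stated 42.02 % decrease would imply arithmetically for CatBoost; it is not a figure reported in this article. The helper name `relative_decrease` is hypothetical.

```python
def relative_decrease(reference, new):
    """Decrease of `new` relative to `reference`, in percent."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return (reference - new) / reference * 100.0


xgboost_rmse = 0.430105  # stated in the text
fastai_rmse = 0.399569   # stated in the text

fastai_vs_xgboost = relative_decrease(xgboost_rmse, fastai_rmse)

# Derived value (assumption): the RMSE implied by a 42.02 % decrease
# relative to XGBoost, not a reported measurement.
implied_catboost_rmse = xgboost_rmse * (1 - 42.02 / 100.0)

print(round(fastai_vs_xgboost, 2), round(implied_catboost_rmse, 4))
```

By the same measure, Fastai improves on XGBoost by roughly 7 %, which puts the much larger CatBoost improvement into perspective.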

In this project we had to scale back our earlier ambition of predicting water quality in general. With our models we focused on one parameter and got results good enough for a quick estimate of the DOC value based on parameters that are easier to measure. This can be helpful in the field but cannot replace the chemical measurement. In the project we were able to apply our newly learned skills in data cleaning, visualization and prediction via ANNs for tabular data.

GitHub-Link: https://github.com/techlabsms/project-ws-19-11-Fruehwarnsystem_Gewaesserqualitaet

The Team:
Raimund Koop AI-Track (LinkedIn)
Jan-Eric Müller AI-Track (LinkedIn)
Oliver Jochmaring AI-Track
Kieran Didi Data Science-Track

Mentor:
Felix Kleine Bösing
