This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2020/21).
Anyone who travels frequently with Deutsche Bahn (DB) is probably familiar with this scenario: You access DB’s ticket booking platform, enter all the desired filter parameters and choose the route that best fits your requirements. So far so good.
However, when you look up the ticket price for your desired connection a day or even just an hour later, you are given a different price. Unfortunately, DB’s ticket pricing system seems to be extremely complex, opaque, and hard to understand even for regular customers.
Therefore, we asked ourselves (as frequent customers) how to consistently find reasonable prices when traveling with DB and we decided to put our newly acquired Data Science skills to the test. Hence, we built a recommender system to identify whether to book a certain connection at a given point in time or to wait until the ticket price drops to the optimal amount.
To develop our solution in Python, we:
- built a web scraper.
- used the libraries NumPy, Pandas, and scikit-learn for data cleaning and preparation.
- used Plotly/Dash to develop the frontend in a web app.
- used TensorFlow and Keras for the creation of our backend as a neural network.
Focusing on just one connection makes DB’s ticket system problem even more obvious. On any given day, there are around 40 possible connections for a trip from Muenster to Berlin alone, with prices ranging from 26.90 € to 139.90 €. These prices are determined by various categories: 1st class, 2nd class, number of transfers, travel time, Super Sparpreis, Sparpreis, Flexpreis, Flexpreis Plus, and more.
Given this situation, we think that it is unacceptable that DB, considering its dominance in German rail transport, prices tickets in an obscure way, forcing customers to basically accept any amount. Therefore, we asked ourselves how one may consistently find fair prices when traveling with DB.
Our answer to that question is a system that, based on a trained neural network, compares the price for a certain connection and its specific parameters against historical values. Based on this comparison, it recommends whether the current price is reasonable or whether one should wait before booking.
Web scraping, data cleaning & team setup
We started with a web scraper and spent weeks screening DB’s website for connections and price parameters. This analysis was limited to connections from Muenster central station to Hamburg, Cologne, Dortmund, Berlin, Duisburg, Aachen, Dresden, Nuremberg, Munich, and Stuttgart main station. Thereupon, we cleaned this collected set of travel data primarily with the help of the Python libraries NumPy and Pandas.
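To give an idea of what the scraping step looks like, here is a minimal sketch of parsing connection data from a results page with BeautifulSoup. The HTML snippet and the CSS class names are stand-ins: DB’s real page structure differs and changes over time, so the selectors must be adapted.

```python
from bs4 import BeautifulSoup

# Stand-in markup; DB's actual results page uses different, changing markup.
SAMPLE_HTML = """
<div class="connection">
  <span class="departure">08:02</span>
  <span class="arrival">11:35</span>
  <span class="transfers">1</span>
  <span class="price">37,90 €</span>
</div>
"""

def parse_connections(html):
    """Extract one dict per listed connection from a results page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for conn in soup.select("div.connection"):
        price_text = conn.select_one("span.price").get_text(strip=True)
        rows.append({
            "departure": conn.select_one("span.departure").get_text(strip=True),
            "arrival": conn.select_one("span.arrival").get_text(strip=True),
            "transfers": int(conn.select_one("span.transfers").get_text(strip=True)),
            # normalize the German price format ("37,90 €") to a float
            "price": float(price_text.replace("€", "").strip().replace(",", ".")),
        })
    return rows

rows = parse_connections(SAMPLE_HTML)
print(rows)
```

In the real project, such a parser runs on pages fetched repeatedly over weeks, so the collected prices form the historical data set the model is later trained on.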
After we had finished the data cleaning, the frontend and the backend were developed simultaneously by two different parts of the team. With this setup, the frontend provided an ideal “beginner friendly” project for our programming newbies, and the backend offered the chance for our coding veterans to put all their knowledge to the test and gain some exciting insights.
Due to this division, each member of our team felt challenged and had the opportunity to learn new things, while not feeling completely overwhelmed with tasks that in no way corresponded to one’s level of expertise. Therefore, we were able to balance the knowledge differences in the team and motivated everyone to contribute to the success of the project.
Hereby, we built the foundation of our project and henceforth began the simultaneous development of both the frontend, used to request the specific travel inquiries from the user as input and afterwards display our recommendation to the user, as well as the underlying backend in the form of a neural network.
Due to our selected TechLabs track (Data Science with Python) the usability and not the design of the frontend was our main focus. Hence, for the development of the frontend we had a basic web application in mind in order for the user to input the specific parameters of the desired connection including the departure and arrival destination as well as travel date and time. During our research to find a suitable development tool, we came across Plotly/Dash, a popular open-source framework for building Data Science web apps directly tied to our Python code. Although Plotly/Dash was not part of our TechLabs course, familiarizing ourselves with its logic did not take long as it is very easy and intuitive to use.
The Dash Core Components in particular, a set of modern UI and HTML elements, were especially helpful for developing a simple and clear interface. For this purpose, https://dash.plotly.com/dash-core-components provides an excellent overview of the most important functions. With this summary, we could easily adopt the Plotly/Dash components we specifically needed in Python and code the desired frontend.
In the final web app, the user can select the departure station (always Muenster) and the arrival station (variable) via drop down fields. Furthermore, a date and time picker offers the user the possibility to select the desired travel date and time. Only after actively pressing the submit button is the entered data transferred. This “delayed” transfer is made possible by a state function. Once the data is submitted, the following two processes are initiated.
On the one hand, the web crawler starts its work on the DB’s website. Based on the parameters of the searched connection entered by the user, it crawls the matching connection. On the other hand, this connection with its specific parameters (e.g. price, number of transfers, etc.) is directed through the previously trained neural network, which then provides a recommendation as to whether the identified connection will likely become more expensive or cheaper in the future. This is accomplished as the neural network is capable of predicting an optimal price and then compares that to the price that DB currently offers to the user. Ultimately, this recommendation is displayed in the frontend.
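The final comparison step can be reduced to a few lines: the price predicted by the network is treated as the “optimal” price and compared against the price currently offered by DB. This is a simplified sketch; the tolerance threshold and wording are illustrative, not taken from the project.

```python
def recommend(current_price: float, predicted_price: float,
              tolerance: float = 0.05) -> str:
    """Compare DB's current offer against the model's predicted optimal price."""
    if current_price <= predicted_price * (1 + tolerance):
        return "Book now - the current price is at or below the expected optimum."
    return "Wait - the price will likely drop closer to the predicted optimum."

print(recommend(29.90, 31.50))  # current offer already below the prediction
print(recommend(59.90, 31.50))  # current offer well above the prediction
```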
Needless to say, using a buzzword such as “neural network”, as we did above, always sounds pretty fancy and impressive. However, it is not always clear what exactly it is supposed to mean and how it works in a given context or project. Therefore, in the next section we will describe the development process and setup of our backend (the neural network) in more detail.
After its cleaning, our data set was well structured but not yet ready to be used in a neural network. Since the parameters of each train connection are mainly categorical values, we applied the categorical encoding from the scikit-learn library. This allowed us to change these parameters into numerical values.
We also transformed the time values of departure time, arrival time and travel duration, into integer values. In addition, to obtain a superior neural network result, we converted these numbers to values between zero and one as this makes it easier for the neural network to properly process them. In order to do so, we employed the MinMaxScaler from scikit-learn. At this point, the data was finally ready to be processed by the neural network.
Furthermore, we applied TensorFlow and Keras for the creation of the neural network. Having defined six input variables corresponding to our six different parameters per train connection, we designed only one output variable, because we ultimately wanted to derive only one price recommendation per connection. For this purpose, we used a sequential Keras model. Moreover, the design of the network included five dense layers and 150 neurons in each layer in combination with a “relu” activation function. We used a dropout layer for random deactivation of neurons after each dense layer to prevent any overfitting occurring on the training data. Moreover, for the model compiler, we set the mean squared error function using the Adam optimizer and a learning rate of 0.001.
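Putting the architecture described above into code yields roughly the following Keras model: six inputs, five dense layers of 150 “relu” units each with a dropout layer after each one, and a single linear output, compiled with mean squared error and Adam at a learning rate of 0.001. The dropout rate is an assumption, as the text does not state it.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(keras.Input(shape=(6,)))          # six connection parameters
for _ in range(5):
    model.add(layers.Dense(150, activation="relu"))
    model.add(layers.Dropout(0.2))          # assumed rate; not stated in the text
model.add(layers.Dense(1))                  # single price recommendation

model.compile(
    loss="mean_squared_error",
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
)
model.summary()
```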
Having all these settings in place, we trained the model over 5000 epochs. For this purpose, we divided our data into training and test data, trained the model with all possible connections and sequentially with the different data sets. Ultimately, we saved the best performing model. The final neural network was connected to the frontend and is able to communicate with the frontend in order to receive the data of a new inquiry and output the recommendation.
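The training loop with a train/test split and “keep the best model” logic can be sketched with scikit-learn and a Keras `ModelCheckpoint` callback. The data here is synthetic, the file name is illustrative, and the epoch count is cut down from the project’s 5000 so the sketch runs quickly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in for the preprocessed connection data (6 features, 1 price)
X = np.random.rand(500, 6).astype("float32")
y = np.random.rand(500, 1).astype("float32")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = keras.Sequential([
    keras.Input(shape=(6,)),
    layers.Dense(150, activation="relu"),
    layers.Dense(1),
])
model.compile(loss="mean_squared_error", optimizer="adam")

# ModelCheckpoint keeps only the weights with the best validation loss
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_model.keras", monitor="val_loss", save_best_only=True)

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=5,               # the project trained for 5000 epochs
    verbose=0,
    callbacks=[checkpoint],
)
```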
Results of the project
Our current system that allows the interaction between the frontend and backend provides the following solution. A user decides for a DB connection from Muenster to one of the above-mentioned cities on a certain day and at a certain time. After entering the different parameters in our frontend, the current price of this connection is retrieved from DB’s website via a web scraper. Once acquired, this price and the parameters of the specific connection are automatically fed into the already trained neural network that acts as our backend.
Subsequently, the network independently identifies the optimal price for this connection and decides whether the ticket for the selected connection is likely to become cheaper or more expensive. This recommendation is then displayed in the frontend within seconds. With this information the user can decide whether to wait or to purchase the ticket right away at the offered price. Hereby, our solution provides the user a transparent perspective on the price development of the sought-after DB connection, enabling a better decision.
Moreover, without revealing the algorithm behind DB’s pricing, we managed to predict the likely price evolution of each connection and thus provide DB’s customers with a better basis for decision making. Unfortunately, we were not able to increase transparency, as we are still unable to determine to what extent the price development is influenced by certain parameters.
Finally, we would like to thank the entire TechLabs team, especially the dedicated folks in Muenster. Only because of you, we were given the opportunity to work on this exciting project and perhaps make the world a tiny bit fairer. To conclude, we would like to share our biggest learnings (in no specific order), hoping that they will help you with your next Tech4Good project:
- Open and honest communication and organization helped us to collaborate better.
- To motivate the entire team over the course of the project, it is important not to lose sight of the goal and to find relevant tasks for everyone depending on their level of knowledge.
- Data cleaning and preparation should not be underestimated: it takes a lot of time and is the essential foundation for the success of a coding project.
- A lot of helpful information is freely available on the internet e.g., via GitHub, Stack Overflow or YouTube.
- Experienced mentors are especially helpful for difficult technical but also interpersonal challenges.
Thank you for attentively reading and your interest. If you have any questions, feel free to contact us anytime.
The whole team — A new Hope for Personenfernverkehr