This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2020/21).
With our project, we aim to achieve more transparency in the housing market. With the help of our tool, prospective tenants can get a first impression of what a reasonable base rent for a particular flat looks like. For this purpose, we trained a linear regression model with housing data from the German real estate platform ImmobilienScout24 to predict the cold rent of existing apartments based on many more features than just the living space. In this way, our forecast supports the tenant in his or her decision whether or not to rent a flat by reducing the information asymmetry between the homeowner and the tenant.
With this project, we (the Miet’se Katzen) are targeting people who are looking for an apartment and want to compare a potential apartment to the market. It is easy to just look at the average price per square meter, but a new or renovated apartment with, e.g., two balconies, a new kitchen and huge rooms is most likely more expensive per square meter than an apartment that does not have these advantages. So that nobody has to rely on the simple average rent per square meter, we wanted to create a prediction that includes soft factors such as the condition of an apartment. All in all, the goal is to make the real estate market more transparent.
After defining our scope of work, we had to start researching suitable datasets that we could use as a basis for our work. After lots of brainstorming and researching, we found a dataset on Kaggle. Since all our attempts to get data directly from websites like ImmobilienScout24 or Immonet failed, we decided that we were better off using the Kaggle dataset: it contained more than enough data for our needs and had actually been scraped from ImmobilienScout24.
Because the dataset is so extensive, we had to evaluate which features would be the most relevant for our purposes. The data consists of 269k listings (rows) with 49 features.
Preprocessing our data
In order to use the huge amount of data we got from Kaggle we needed to clean it. We quickly realized that this task would take more time than we originally thought it would. Because there was so much data it even took us some time to figure out how we could push our data to GitHub for everyone in our group to use.
As a first step, we eliminated all columns that contained descriptions in continuous text. After that, the dataset was small enough to push to our group repository. Yay! We then went on to eliminate all features that were only relevant for supplementary costs like electricity and heating. As a result, our model does not take into account the availability of different tariffs in different regions.
The aim of our project was to forecast the base rent of small flats that are suitable for students. For that reason, we excluded all flats that cost more than 1,500 € per month.
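The rent cutoff can be sketched in a few lines of pandas. This is only an illustration on made-up rows: "baseRent" is the target name used in this post, while the "livingSpace" column name and all values are assumptions.

```python
import pandas as pd

# Toy listings; the values are invented for illustration.
listings = pd.DataFrame({
    "baseRent": [450.0, 980.0, 1500.0, 2300.0],
    "livingSpace": [28.0, 62.0, 95.0, 140.0],  # assumed column name
})

MAX_BASE_RENT = 1500  # euros per month, the cutoff described in the text

# Keep only flats whose base rent does not exceed the cutoff.
affordable = listings[listings["baseRent"] <= MAX_BASE_RENT].reset_index(drop=True)
print(affordable)
```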
After this elimination process, we had a dataset that was small enough to work with but still big enough to give meaningful results.
As a second preprocessing step, we visualized the impact of different features on the target variable “baseRent”. For this, we mostly used scatter plots and (Felix’s favourite 😉) violin plots.
After visualizing the impact of the different features, we decided on a set of around ten variables to use for our model. Nearly all of the categorical columns we used as predictor variables had some missing data. To impute the missing values, we mostly resampled from the empirical distribution of the observed values. In some cases, we simply inserted the most frequent value.
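Both imputation strategies could look roughly like the sketch below. It is not our original code: the helper names `impute_from_empirical` and `impute_most_frequent` and the toy "condition" values are made up for illustration.

```python
import numpy as np
import pandas as pd

def impute_from_empirical(series, rng):
    """Fill missing values by sampling from the observed values of the column."""
    observed = series.dropna()
    missing = series.isna()
    filled = series.copy()
    # Sampling with replacement roughly preserves the observed category frequencies.
    filled[missing] = rng.choice(observed.to_numpy(), size=int(missing.sum()))
    return filled

def impute_most_frequent(series):
    """Fill missing values with the most frequent observed value."""
    return series.fillna(series.mode().iloc[0])

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
condition = pd.Series(["new", None, "old", "new", None, "normal", "new"])

resampled = impute_from_empirical(condition, rng)
mode_filled = impute_most_frequent(condition)
print(resampled.tolist())
print(mode_filled.tolist())
```

Resampling keeps the spread of categories intact, whereas filling in the most frequent value inflates the share of that one category, which is why the mode is the simpler but cruder option.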
Furthermore, we grouped values into broader categories. For example, our model uses decades instead of years, so we had to bin the “year constructed” feature. For the “interior quality” feature, we merged the categories “normal” and “simple” and the categories “sophisticated” and “luxury”.
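A minimal sketch of this grouping is shown below. The column names ("yearConstructed", "interiorQual") and the labels of the merged groups are assumptions chosen for the example, not necessarily the ones in our notebook.

```python
import pandas as pd

# Toy rows; column names and values are illustrative assumptions.
flats = pd.DataFrame({
    "yearConstructed": [1958, 1974, 1999, 2003, 2018],
    "interiorQual": ["simple", "normal", "sophisticated", "luxury", "normal"],
})

# Integer division by 10 drops the last digit, so 1958 becomes 1950.
flats["decadeConstructed"] = (flats["yearConstructed"] // 10) * 10

# Merge four quality levels into two broader groups (group labels are hypothetical).
quality_groups = {
    "simple": "simple_or_normal",
    "normal": "simple_or_normal",
    "sophisticated": "sophisticated_or_luxury",
    "luxury": "sophisticated_or_luxury",
}
flats["interiorQualGrouped"] = flats["interiorQual"].map(quality_groups)
print(flats)
```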
In addition, we imputed data for some more features, e.g., “number of rooms” and “floor”, but their impact on our model was not big enough to take them into account. For the “number of rooms” feature, this may sound surprising, but it makes sense when one considers that this feature adds relatively little information beyond the size of the flat, which is already captured in the “living space” feature.
As a last preprocessing step, we had to scale our feature and target variables. During this step, we frequently used the get_dummies() function from pandas to generate dummy variables for all our categorical variables. Finally, we created some additional polynomial columns of our most important feature: “living space”.
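The following sketch illustrates these steps on toy data. The column names, the polynomial degree (up to three) and the use of standardization as the scaling method are assumptions made for the example; the post does not pin them down.

```python
import pandas as pd

# Toy rows; column names and values are illustrative assumptions.
flats = pd.DataFrame({
    "livingSpace": [28.0, 62.0, 95.0, 45.0],
    "city": ["Leipzig", "Hamburg", "Leipzig", "Berlin"],
})

# One 0/1 column per category. drop_first=True omits the first category
# ("Berlin" here) to avoid perfectly collinear dummies in a linear model.
features = pd.get_dummies(flats, columns=["city"], drop_first=True, dtype=float)

# Polynomial terms let a linear model capture a non-linear rent/size relation.
features["livingSpace_sq"] = features["livingSpace"] ** 2
features["livingSpace_cu"] = features["livingSpace"] ** 3

# Standardize the numeric columns to zero mean and unit variance.
numeric_cols = ["livingSpace", "livingSpace_sq", "livingSpace_cu"]
features[numeric_cols] = (
    features[numeric_cols] - features[numeric_cols].mean()
) / features[numeric_cols].std()
print(features)
```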
Building the model
At the beginning of our model building phase, we tried different models to investigate our data in the best possible way and to draw conclusions about the rent level of the apartments. During our track on edyoucated and DataCamp, we came across the following models: linear regression, random forest and AdaBoost. We tried and tested all three, one after the other, to decide which one best fits our requirements. All three models gave very similar results. However, the linear regression took less time to compute, which was noticeable because of the amount of data we analysed. So we decided to use linear regression for our purposes, but kept the random forest model for a feature importance analysis.
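A comparison like this can be sketched with scikit-learn. The example below is an assumption-laden illustration, not our actual experiment: it uses synthetic data, default hyperparameters and only two made-up features ("livingSpace", "hasBalcony").

```python
import time

import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: rent grows with living space, plus a balcony bonus and noise.
rng = np.random.default_rng(0)
living_space = rng.uniform(20, 120, size=500)
has_balcony = rng.integers(0, 2, size=500)
base_rent = 9.0 * living_space + 40.0 * has_balcony + rng.normal(0, 50, size=500)
X = np.column_stack([living_space, has_balcony])

X_train, X_test, y_train, y_test = train_test_split(
    X, base_rent, test_size=0.2, random_state=0
)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    mae = mean_absolute_error(y_test, model.predict(X_test))
    results[name] = {"mae": mae, "fit_seconds": fit_seconds}
    print(f"{name}: MAE = {mae:.2f} EUR, fit time = {fit_seconds:.3f} s")

# The fitted random forest also provides impurity-based feature importances.
forest = models["random forest"]
for feature_name, importance in zip(["livingSpace", "hasBalcony"], forest.feature_importances_):
    print(f"{feature_name}: importance = {importance:.3f}")
```

On real data the timings and errors will of course differ; the point of the sketch is only the shape of the comparison: same split, same metric, fit time measured per model.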
The following table illustrates the improvements we have made by including more and more features in our model.
As we can see, the amount of living space and the city are the two most important features. This was also reflected in the feature importance analysis we carried out with the random forest model. By adding all the other features, however, we were still able to reduce the mean absolute error by another 13 €. Overall, our model explains 82.73% of the variance in the test set, and our forecasts deviate on average by 85.66 € from the true value of the base rent. The mean absolute percentage error amounts to 16.9%.
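For readers unfamiliar with these three metrics, here is a small sketch of how they are defined, written with plain NumPy. The function name `regression_metrics` and the four example rents are made up; the numbers are not taken from our test set.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, mean absolute error and mean absolute percentage error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))                   # average deviation in euros
    mape = np.mean(np.abs(errors / y_true)) * 100   # average deviation in percent
    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot                        # share of variance explained

    return {"r2": r2, "mae": mae, "mape_percent": mape}

# Invented true and predicted base rents in euros.
metrics = regression_metrics([400, 600, 800, 1000], [450, 550, 800, 900])
print(metrics)
```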
Results of the project
As described in the introduction, our goal was to develop a tool that forecasts the cold rent of an apartment as accurately as possible, so that users get a better sense of what rent is appropriate and fair for an apartment, for example when looking for a new flat.
At the end of the project, we managed to reach our goal and developed a tool that is able to forecast the cold rent for an apartment based on certain criteria.
The user of our tool quickly gets a first impression of whether the cold rent offered in an apartment ad is fair and can thus transparently compare apartment offers and better assess the attractiveness of the offers. In this way, our tool can support the apartment search, which is usually characterized by subjective assessments, by providing an objective rent price forecast.
As a limitation it should be mentioned that the usefulness of our tool depends significantly on the data entered or the quality of the information someone has about an apartment.
The selected features for forecasting the rental price could definitely be extended so that the forecast would become more accurate. For some criteria, the available dataset allowed only very coarse distinctions. For example, in the case of the condition of an apartment, it was only possible to distinguish between the three subjective categories “old”, “new” and “normal”.
Furthermore, our original goal was to perform the forecast as accurately as possible for different zip codes. Unfortunately, this could not be implemented because in some cases there was not enough data available. However, the forecast could be implemented for individual cities.
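One way to check whether a grouping level has enough data is to count listings per group, as in the sketch below. The column names "zipCode" and "city", the threshold `MIN_LISTINGS` and the toy rows are all hypothetical.

```python
import pandas as pd

# Toy listings; column names and values are illustrative assumptions.
listings = pd.DataFrame({
    "city": ["Muenster"] * 5 + ["Bonn"] * 4,
    "zipCode": ["48143", "48143", "48145", "48147", "48149",
                "53111", "53111", "53113", "53115"],
})

MIN_LISTINGS = 4  # hypothetical minimum number of listings per group

def share_of_sparse_groups(frame, column, min_listings):
    """Fraction of groups in `column` with fewer than `min_listings` rows."""
    counts = frame[column].value_counts()
    return float((counts < min_listings).mean())

zip_sparse = share_of_sparse_groups(listings, "zipCode", MIN_LISTINGS)
city_sparse = share_of_sparse_groups(listings, "city", MIN_LISTINGS)
print(f"sparse zip codes: {zip_sparse:.0%}, sparse cities: {city_sparse:.0%}")
```

In this toy example every zip code falls below the threshold while every city passes it, which mirrors why a city-level forecast can work where a zip-code-level forecast does not.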
Despite the limitations mentioned above, we were able to develop a powerful tool that can bring more transparency to the real estate market and thus support tenants in their search for housing.
Felix Albert Data Science: Python
Max Heimsath Data Science: Python
Max Risau Data Science: Python
Kevin Woszczyna Data Science: Python
Kathrin Sandhaus Data Science: Python