Citinator App

Inside.TechLabs
7 min read · Apr 20, 2021

This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2020/21).

Abstract

The Citinator app provides its users with a solution for identifying neighborhoods that match their personalities. It thereby addresses a common problem: when considering a move to a new city, people usually don't know the character of its districts and therefore face a great deal of uncertainty when making renting decisions. We have focused on the target group of students looking for shared apartments. The solution uses geosocial data scraped from the apartment portal wg-gesucht.de, which has been analyzed and categorized using basic Natural Language Processing. The result is a taxonomy of 51 characteristics grouped into 8 comparable groups, differentiated into location characteristics and quality of community life. Through a web interface, users can communicate their preferences along these groups, and the system suggests the best match between user and neighborhood.

Introduction

Local culture can be a tricky matter. Probably everyone has at some point sat in front of a computer, browsing websites for holiday apartments, unable to make an informed decision about which neighborhood might be the best fit for their personal interests and needs.

However, this is not only a question of how to spend your time and money on leisure activities. At the heart of the phenomenon lies the inherent difficulty of capturing, quantifying and communicating the dense, impenetrable jungle of implicit localized cultures. Unless personally immersed in the realities on site for a certain time, and given a fair amount of linguistic and intercultural skill, human beings have yet to find an easy way to transport the cultural realities of urban life through the narrow tunnel of a fiber-optic cable.

Enter the Citinator App.

By far the most widespread, most ancient, and so far only reliable means of educating another human being about the culture of a place is storytelling; here, in the form of written texts, which, as we know, pose a great challenge to analysis and quantification by statistical means.

In our pursuit of a solution that helps people learn about the hidden characteristics of student life in certain neighborhoods, we identified the internet portal wg-gesucht.de as a promising source of knowledge. On wg-gesucht.de, inhabitants of shared apartments try to find new subtenants for soon-to-be-vacant rooms. They do this by writing ads describing the room, the apartment, community life and the neighborhood they are located in.

With their goal in mind (finding a perfect match who suits their needs and personalities and who has a high likelihood of not moving out again right away), the writers of an ad have an inherent interest in making the cultural realities of community life as well as the local neighborhood as transparent as possible and, while of course advertising their offer, staying as close to reality as possible. Nobody gains anything from a new subtenant who grows crankier by the hour after realizing the obvious lies he or she was fed before deciding to move in.

On the upside, since the site is publicly available, collecting all the ads for a mid-sized city takes less than a day, ready for analysis. On the downside, the data consists of qualitative, narrative text, not exactly statistical spreadsheet information about local culture. But since a human being can read and “feel” the cultural information that is woven into the texts, why shouldn't a computer be able to do so? Welcome to Natural Language Processing, the branch of data science and artificial intelligence that does exactly that.

Methodology

In order to gather the information on the site, we first built a web scraper using Beautiful Soup and the Selenium WebDriver, both Python packages. Such a bot accesses a webpage like a human being would and downloads the HTML code, which is used to display the content of the webpage, to the local machine.

wg-gesucht.de has a main page, where you can see all the cities for which ads are available. It leads to a subpage for each city containing a paginated directory of all the ads that have ever been posted and not actively deleted afterwards. The amount of information is substantial: for the city of Münster, for instance, this directory contains 782 directory pages with 20 ads each, dating back to 2017.
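As a rough sketch of this step (the URL pattern, page count, and wait time below are illustrative assumptions, not the portal's actual structure), such a bot could look like this:

```python
import time
from selenium import webdriver

# Hypothetical URL pattern for the paginated city directory; the real
# wg-gesucht.de URLs differ, this only illustrates the mechanics.
LISTING_URL = "https://www.wg-gesucht.de/wg-zimmer-in-Muenster.{page}.html"

driver = webdriver.Firefox()  # any Selenium-supported browser works

for page in range(100):  # e.g. the first 100 of the directory pages
    driver.get(LISTING_URL.format(page=page))
    time.sleep(2)  # be polite and give dynamic content time to load
    with open(f"listing_{page}.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)

driver.quit()
```

Selenium drives a real browser here, so content rendered by JavaScript is captured as well; the downloaded HTML can then be parsed offline with Beautiful Soup.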

Using this method, we downloaded 2,000 ads for Münster, and 1,000 ads each for Dortmund, Düsseldorf and Köln, starting from the latest and going chronologically backwards. Using the HTML tags and CSS classes included in the downloaded information, the content of the pages was then converted into a CSV spreadsheet containing for each ad (amongst other information): city, postcode, street name, house number, neighborhood description, and community life description. Not all ads provide street names and house numbers, however, so we quickly abandoned our hopes for an analysis with exact locations and focused on an aggregated analysis at the postcode-area level.
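A minimal sketch of the parsing step, assuming one downloaded HTML file per ad; the CSS selectors below are placeholders, the real class names would be read off the downloaded wg-gesucht.de markup:

```python
import csv
import glob
from bs4 import BeautifulSoup

def field(soup, selector):
    # Return the text of the first matching element, or "" if absent
    # (not all ads provide every field, e.g. street and house number).
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else ""

rows = []
for path in glob.glob("ad_*.html"):  # one downloaded HTML file per ad
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    rows.append({
        "city": field(soup, ".ad-city"),          # placeholder selector
        "postcode": field(soup, ".ad-postcode"),  # placeholder selector
        "street": field(soup, ".ad-street"),      # placeholder selector
        "neighborhood_text": field(soup, "#neighborhood"),
        "community_text": field(soup, "#community-life"),
    })

with open("ads.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```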

Using the Python package NLTK, a Natural Language Processing library, we could then further process the textual information in the descriptions of neighborhood and community life. In order to analyze the texts, they usually have to be tokenized and lemmatized first. This means breaking the continuous strings apart into disjunct words, and then converting each word into its stem or base form: for instance, “walked” becomes “walk”, “trees” becomes “tree”.
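As a minimal illustration of both steps, using NLTK's English WordNet lemmatizer for the examples above (German lemmatization is covered next):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)    # tokenizer models (one-time download)
nltk.download("wordnet", quiet=True)  # lemmatizer dictionary

# Tokenization: break the continuous string into disjunct words
tokens = nltk.word_tokenize("He walked past the old trees.")
print(tokens)  # ['He', 'walked', 'past', 'the', 'old', 'trees', '.']

# Lemmatization: convert each word into its base form
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("walked", pos="v"))  # 'walk'
print(lemmatizer.lemmatize("trees"))            # 'tree'
```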

In comparison to English, many languages, and certainly German, have much more complex conjugation and declension systems. Successful lemmatization therefore requires specialized, language-specific models, which use probabilistic algorithms to guess the most likely stem from the context of the entire sentence a word appears in, a very resource-intensive procedure. For our analysis we were able to use the HanoverTagger, a probabilistic morphology model for German lemmatization, which can be downloaded for free.
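A short sketch of how the HanoverTagger can be used, assuming the current interface of the HanTa package and its pre-trained German model:

```python
import nltk
from HanTa import HanoverTagger  # pip install HanTa

# Load the pre-trained German morphology model shipped with HanTa
tagger = HanoverTagger.HanoverTagger("morphmodel_ger.pgz")

tokens = nltk.word_tokenize(
    "Die Mitbewohner kochen abends oft zusammen.", language="german")

# tag_sent returns (word form, lemma, part-of-speech) triples
for word, lemma, pos in tagger.tag_sent(tokens):
    print(f"{word} -> {lemma} ({pos})")
```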

In consultation with our mentor, we reverted to a much more manual analysis process: using NLTK again, we counted and sorted the occurrences of the individual words used in the ads, differentiating between the descriptions of neighborhood and community life for each ad. The 2,000 most common of these words were then manually evaluated and assigned as either (1) a stop word, i.e. a word to be dropped from the analysis, (2) a word to be used for comparative categorization, or (3) a word that reflects local particularities, which should therefore be kept but cannot be used for the standardized categorization and comparison of neighborhoods.
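The counting itself is straightforward with NLTK's FreqDist; a minimal sketch with stand-in tokens:

```python
import nltk

# Stand-in for the lemmatized words from all neighborhood descriptions,
# concatenated over every ad (in the project these come from the steps above)
tokens = ["park", "ruhig", "park", "bus", "zentral", "park", "bus"]

freq = nltk.FreqDist(tokens)

# The most frequent words (the top 2,000 in the project), ready for manual
# labelling as (1) stop word, (2) comparison word, or (3) local particularity
for word, count in freq.most_common(2000):
    print(word, count)
```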

The words selected for comparative categorization were then grouped into relatively homogeneous thematic word groups, and these in turn into comparison groups: sets of thematic word groups that users would likely want to compare against one another in order to learn more about the local culture of a place. All of these steps were performed separately for the neighborhood description and the community life description of each ad.

We then aggregated the words and their counts over the postcode areas and assigned the cumulative word counts to their respective thematic word groups. To ensure maximum comparability, the results were normalized by the number of data points per postcode area as well as by the average word count per thematic word group. We also applied a threshold of at least 30 ads per postcode area, in order to prevent distortion due to lack of data.
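A sketch of the thresholding and normalization logic, assuming a long-format table with illustrative column names and made-up numbers (the project's actual data layout may differ):

```python
import pandas as pd

# One row per (postcode area, thematic word group) with the summed word
# count and the number of ads in that area; the numbers are invented.
counts = pd.DataFrame({
    "postcode":   ["48143", "48143", "48155", "48155"],
    "word_group": ["infrastructure", "traffic", "infrastructure", "traffic"],
    "count":      [120, 80, 40, 95],
    "n_ads":      [210, 210, 35, 35],
})

# Threshold: keep only postcode areas with at least 30 ads
counts = counts[counts["n_ads"] >= 30]

# Normalize per ad in the area, then relative to each group's overall mean
counts["score"] = counts["count"] / counts["n_ads"]
counts["score"] /= counts.groupby("word_group")["score"].transform("mean")
```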

The result is the following taxonomy of groups:

  • Neighborhood Description (infrastructure, location, traffic, quality)
  • Community Life Description (studies, interests, community, lifestyles)
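Expressed as a data structure, this taxonomy might look like the following nested mapping; the member words shown are illustrative examples, not the project's actual 51 characteristics:

```python
# Illustrative excerpt; the real word lists were compiled manually
# from the most frequent words in the ads.
TAXONOMY = {
    "neighborhood": {
        "infrastructure": ["park", "supermarkt", "bäcker"],
        "location":       ["zentral", "innenstadt", "stadtrand"],
        "traffic":        ["bus", "bahnhof", "fahrrad"],
        "quality":        ["ruhig", "grün", "lebendig"],
    },
    "community_life": {
        "studies":    ["uni", "studium", "semester"],
        "interests":  ["musik", "sport", "kochen"],
        "community":  ["gemeinsam", "abendessen", "filmabend"],
        "lifestyles": ["vegetarisch", "entspannt", "party"],
    },
}
```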

Results of the project

Exploring the data: Using these categories, you can now explore the neighborhoods and cities for which sufficient ads have been posted on wg-gesucht.de.

Figure 1: Green infrastructure, e.g. parks or forests
Figure 2: Modes of transport

Here you can see a couple of insights about the cities of Münster, Dortmund, Düsseldorf and Köln based on our data.

Remember that each count represents the number of occurrences of words that have been assigned to the respective group. The method does not perform any logical analysis of the texts: meanings that are expressed using other words, or that are hidden between the lines, cannot be captured by this form of analysis.

The Citinator Web App: Using the hosting service Heroku, we have launched a web interface where users are asked to enter their preferences for the thematic word groups. The system then compares the values given by the user with the scores in the database and outputs the postcode area that best matches the user's needs and interests. The selection process is supported by a graphical representation of the user data overlaid with the scores of the postcode area in colored spiderweb (radar) charts.
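The article does not specify the matching metric, so here is a minimal sketch of the matching step under the assumption of a simple Euclidean distance over the group scores:

```python
import numpy as np

def best_match(user_prefs, area_scores):
    """Return the postcode area whose group scores lie closest to the
    user's preferences (Euclidean distance is an assumption here)."""
    groups = sorted(user_prefs)
    u = np.array([user_prefs[g] for g in groups])
    dists = {
        plz: np.linalg.norm(u - np.array([scores[g] for g in groups]))
        for plz, scores in area_scores.items()
    }
    return min(dists, key=dists.get)

# Hypothetical usage with two of the eight comparison groups
prefs = {"infrastructure": 0.9, "traffic": 0.3}
areas = {"48143": {"infrastructure": 1.2, "traffic": 0.4},
         "48155": {"infrastructure": 0.5, "traffic": 1.1}}
print(best_match(prefs, areas))  # -> 48143
```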

Current status: By the end of the project, we had gathered data for four major student cities in North Rhine-Westphalia, selected with the goal of covering the broadest possible diversity of urban cultures: Münster, Dortmund, Düsseldorf and Köln. Additionally, we completed the web interface, which is able to overlay the user-entered preferences with a neighborhood selected by the system. The next step would be to connect the database to the web app, so that users can tap into the actual data analyzed.

Our GitHub Repository

The Team

Andreas Putlitz Data Science (LinkedIn)

Sina Mertens Data Science (LinkedIn)

Benjamin Reuting Artificial Intelligence (LinkedIn)

Finn Christopher Petersen Artificial Intelligence

Mentor

Marcus Cramer (LinkedIn)
