This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2020/21).
Kaggle’s Kickstarter dataset provides an overview of different crowdfunding projects that vary in background and degree of achieved success. A project is successful if the targeted amount is reached through crowdfunding; it fails if there is too little support. In our project, we explored the relationships within the dataset to identify relevant indicators of project success. After an initial data cleaning process, we analyzed certain aspects of the dataset and visualized the results. With a subsequent machine learning algorithm (a k-NN classification model), we are now able to predict whether a Kickstarter project is more likely to fail or to succeed, given its specific characteristics.
Our project team did not start with a concrete idea of what exactly we wanted to create. Instead, we all shared the same goal that we wanted to accomplish by participating in the project. Since all members were part of the Data Science track, we wanted to learn how to handle big datasets and generate new insights from that data with the help of Python. After searching the web for suitable datasets and brainstorming some ideas, we narrowed our options down to three projects.
- churn analysis on a Kaggle dataset
- hotspot map that shows whether there is a lot of traffic at certain restaurants / shops
- machine learning model to predict the success of a Kickstarter campaign
The first project would probably have been the easiest, with the most documentation available on the web, but it is also very generic, so we decided to leave it out. The second project was fascinating, but it would have required additional work on a user interface, and gathering large amounts of real-time location data could have been a problem. In the end, we agreed that the scope of creating a machine learning model would best fit our team size of three people and give us just the right mix of examples and challenges.
After clearing the first hurdle by deciding on the project that we would be working on for the next weeks, we structured the project into four main parts.
- exploratory data analysis — first steps
- data preprocessing
- exploratory data analysis — visualization
- machine learning model
Exploratory data analysis — first steps
Before we could really begin any work on our dataset, we first had to understand it. Our data analysis began on kaggle.com because the website offers great insight into the dataset. After this first impression, we downloaded the dataset and set up our development environments. The IDEs of choice were PyCharm and Visual Studio Code. After we had tinkered with the dataset for a while by ourselves and applied our newly gained knowledge of Pandas dataframes, we initialized the git repository and started collaborating on the next major step.
During our tinkering phase, we could already see that the data had some missing values and that the formatting of some columns was not optimal. Thanks to our Pythonista training on Datacamp and some research on Google and especially Stack Overflow, we were able to complete the desired tasks. We removed rows with missing values and filtered the dataframe for projects with the state “failed” or “successful”. We wanted to see whether the crowdfunding duration had an effect on the outcome, so we computed the difference between the start date of the project and the end of the funding period. The difference of these two datetimes is a timedelta, which we converted to a float and added to our dataframe as the duration. We also added a new column for the number of characters in the project name, and a boolean column that indicates whether the category and the main category of the project differ.
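The cleaning steps described above can be sketched as follows. This is a minimal, illustrative version that uses a tiny hand-made sample in place of the full Kaggle CSV; the column names (`name`, `category`, `main_category`, `launched`, `deadline`, `state`) are assumptions based on the dataset’s structure.

```python
import pandas as pd

# Tiny stand-in for the Kickstarter dataset (column names are assumptions).
df = pd.DataFrame({
    "name": ["Cool Gadget", "Indie Film", None, "Board Game"],
    "category": ["Gadgets", "Drama", "Apps", "Tabletop Games"],
    "main_category": ["Technology", "Film & Video", "Technology", "Games"],
    "launched": ["2017-01-01", "2017-03-15", "2017-05-01", "2017-06-10"],
    "deadline": ["2017-01-31", "2017-04-14", "2017-05-31", "2017-08-09"],
    "state": ["successful", "failed", "canceled", "successful"],
})

# Remove rows with missing values and keep only finished campaigns.
df = df.dropna()
df = df[df["state"].isin(["failed", "successful"])]

# Funding duration: the difference of two datetimes is a timedelta,
# converted here to a float number of days.
df["launched"] = pd.to_datetime(df["launched"])
df["deadline"] = pd.to_datetime(df["deadline"])
df["duration"] = (df["deadline"] - df["launched"]).dt.days.astype(float)

# Number of characters in the project name.
df["name_length"] = df["name"].str.len()

# Boolean flag: does the subcategory differ from the main category?
df["category_differs"] = df["category"] != df["main_category"]

print(df[["duration", "name_length", "category_differs"]])
```

On the real dataset the same chain of `dropna`, `isin` filtering, and datetime arithmetic applies; only the `read_csv` call replaces the hand-made frame.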
Exploratory data analysis — visualization
After finishing the preprocessing and preparation, we were able to dive into the dataset and create insights. For this, we visualized certain aspects of the data using matplotlib and seaborn. We started by discussing which data relations we wanted to visualize and created a Kanban board to schedule the tasks. These packages allowed us to visualize the desired relations conveniently and quickly, but a lot of time went into fine-tuning the formatting and labeling: we wanted to make sure that the rendered graphs were comprehensible. This also strengthened our matplotlib skills.
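A sketch of the kind of plot we produced, including the labeling fine-tuning mentioned above. The data here is a small stand-in (in the project, the prepared Kickstarter dataframe was used), and the specific plot type is an illustrative assumption:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in data: campaign outcome vs. funding duration in days.
df = pd.DataFrame({
    "state": ["successful", "failed", "failed", "successful", "failed"],
    "duration": [30.0, 60.0, 45.0, 28.0, 90.0],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=df, x="state", y="duration", ax=ax)

# Fine-tuning: comprehensible axis labels and a descriptive title.
ax.set_xlabel("Campaign outcome")
ax.set_ylabel("Funding duration (days)")
ax.set_title("Funding duration by campaign outcome")
fig.tight_layout()
fig.savefig("duration_by_state.png")
```

Most of the effort, as noted, goes into the `set_xlabel`/`set_ylabel`/`set_title` polishing rather than into the one-line seaborn call itself.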
Machine learning model
Finally, we could start working on our main goal, creating a model to predict the success of Kickstarter campaigns. Or so we thought. Although we had already spent a lot of time preprocessing the data by dropping NaN values and adding additional columns, there was still some work left before we could actually feed it to the machine learning algorithm. We had to drop all columns containing values that are not known at the start of a campaign; we did not want to include information like the reached amount, since that would distort our model. On top of that, we had to remap some columns, since the model only accepts numerical values. Then we finally scripted our first ML model, a k-nearest-neighbors classifier. Since time was running out, we decided not to spend much time on fine-tuning the hyperparameters: we ran two compute-intensive GridSearch scripts and used the best-performing parameters. As the last step, we created an input script that guides the user through entering the campaign information and prints out whether the campaign has a high chance of being successful or is more likely to fail.
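The modelling steps can be sketched as follows. The feature set here is a small, invented stand-in (in the project, the features came from the preprocessed Kickstarter dataframe with post-launch columns such as the pledged amount already dropped), and the hyperparameter grid is deliberately tiny:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the preprocessed dataframe (feature names are assumptions).
df = pd.DataFrame({
    "duration": [30.0, 60.0, 45.0, 28.0, 90.0, 35.0, 50.0, 25.0],
    "goal": [1000, 50000, 20000, 800, 100000, 1500, 30000, 500],
    "main_category": ["Games", "Film", "Tech", "Games", "Tech", "Music", "Film", "Games"],
    "state": ["successful", "failed", "failed", "successful", "failed",
              "successful", "failed", "successful"],
})

# Remap the categorical column to numerical codes; k-NN needs numbers.
df["main_category"] = df["main_category"].astype("category").cat.codes
X = df.drop(columns="state")
y = (df["state"] == "successful").astype(int)

# Small grid search over the main k-NN hyperparameters.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3], "weights": ["uniform", "distance"]},
    cv=2,
)
grid.fit(X, y)
print(grid.best_params_)

# Predict the outcome for a hypothetical new campaign.
new_campaign = pd.DataFrame([{"duration": 32.0, "goal": 1200, "main_category": 0}])
print("likely successful" if grid.predict(new_campaign)[0] == 1 else "likely to fail")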
In the end, our model only achieved a performance of roughly 60% but that still is a lot better than 50/50 and we gained valuable knowledge along this fascinating journey.
Roles inside the team
Since we all chose the Data Science Track, the project tasks could be completed by everybody. Therefore, we did not divide our group into different roles, but instead worked as a team to collaboratively achieve our goals and find solutions for issues that came up during the process.
Felix Kleine Bösing