This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (summer term 2021).
The goal was to give an overview of the competitive landscape and consumer sentiment for a haircare and styling brand across three eCommerce shops. To achieve this goal, product-related data such as prices, sizes, subcategories, brand name, ratings and reviews was collected. The necessary data was scraped with the help of Python, while the analysis was continued in R to obtain price ranges, product portfolio width and depth for the main competitor brands. Further, sentiment analysis and topic modelling delivered insights into consumer perception. The final result was a competitor grid depicting price levels, portfolio size and consumer sentiment for the focus brand and its main competitors.
Our project idea came from a cooperation between the course “Data Science” at the Marketing Center Münster and a local consumer cosmetics company. The company was interested in an overview of the competitive landscape for one of its main brands. More precisely, we were asked to give an overview of product catalogues, pricing, and customer reviews in three eCommerce shops to understand consumer perception of the brand’s competitors. This led to three research questions guiding our project:
- How broad is the product portfolio of competitors in comparison to the brand?
- What is the pricing strategy of competitors in comparison to the brand?
- How do customers perceive the brand’s competitors in the product categories Hair Care and Styling?
To achieve our goal, we first had to collect the necessary data. We scraped the three shops with Python, using Beautiful Soup 4 to parse the pages and Selenium where content was loaded dynamically.
First, we scraped the product names and product page links from the main category pages Hair Care and Styling. We used this data to scrape further product information.
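The first scraping step can be sketched as follows. This is a minimal illustration, not the project's actual scraper: it uses Python's built-in `html.parser` as a dependency-free stand-in for Beautiful Soup 4, and the markup, the `product-link` class name and the `shop.example` URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical category-page markup; each real shop has its own structure.
CATEGORY_HTML = """
<ul class="product-list">
  <li><a class="product-link" href="/p/shampoo-a">Shampoo A</a></li>
  <li><a class="product-link" href="/p/hair-spray-b">Hair Spray B</a></li>
  <li><a class="nav-link" href="/help">Help</a></li>
</ul>
"""


class ProductLinkParser(HTMLParser):
    """Collect (name, url) pairs from anchors with the class 'product-link'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.products = []
        self._current_href = None  # set while we are inside a product anchor
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "product-link" in classes:
            self._current_href = attrs.get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            name = "".join(self._text_parts).strip()
            url = urljoin(self.base_url, self._current_href)
            self.products.append((name, url))
            self._current_href = None


def extract_product_links(html, base_url):
    """Return the product names and absolute product-page URLs found in html."""
    parser = ProductLinkParser(base_url)
    parser.feed(html)
    return parser.products


products = extract_product_links(CATEGORY_HTML, "https://shop.example")
```

The resulting list of name and URL pairs is what a second pass would iterate over to fetch the individual product pages.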
On the product pages we scraped seven product attributes, with the customer reviews being the most important for our project. We also collected product ratings, the subcategories of both Hair Care and Styling, the brands, product names, all available product sizes (excluding sets), and the corresponding prices. Overall, we scraped 18,712 products, 551 brands, more than 40,000 reviews and around 23,000 prices and sizes across all shops. This left us with 19 variables for further analysis.
The main building blocks of our Python code were for loops, if/else statements and HTML tags. The major challenges were the differing structures of the three shops, dynamic content that forced us to switch from Beautiful Soup 4 to Selenium for one shop, and data errors such as duplicate reviews and duplicate product pages. After overcoming these challenges, we obtained three data sets, which we then loaded into R for merging, cleaning, and further analysis.
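Duplicate reviews and product pages of the kind mentioned above can be filtered with a simple "seen keys" pass. The sketch below is an assumption about how such a step might look; the field names `product_url` and `review_text` and the normalisation rule are hypothetical.

```python
def deduplicate(records, key_fields):
    """Keep only the first record for each combination of key_fields."""
    seen = set()
    unique = []
    for record in records:
        # Collapse whitespace and ignore case so trivial variations still match.
        key = tuple(" ".join(str(record[f]).split()).lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical scraped reviews; the second entry repeats the first.
reviews = [
    {"product_url": "/p/shampoo-a", "review_text": "Tolles Shampoo!"},
    {"product_url": "/p/shampoo-a", "review_text": "Tolles  shampoo!"},
    {"product_url": "/p/shampoo-a", "review_text": "Riecht gut."},
    {"product_url": "/p/hair-spray-b", "review_text": "Tolles Shampoo!"},
]

unique_reviews = deduplicate(reviews, ["product_url", "review_text"])
```

The same function removes duplicate product pages when called with `["product_url"]` as the key.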
The first step in R was to remove duplicate products within the individual shops. Next, we merged the three data sets with the help of pair blocking. Here we used the brand as the blocking variable and the product name and subcategory as comparison variables, with an LCS threshold of 0.95.
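The idea behind blocking and thresholded string comparison can be illustrated in Python, although the project performed this step in R. The sketch makes several assumptions: LCS is read as longest common subsequence, the similarity is normalised as 2 * LCS / (len(a) + len(b)), and a pair counts as a match only if both the name and the subcategory reach the threshold. The R implementation used in the project may differ on each of these points, and the sample products are hypothetical.

```python
from collections import defaultdict


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical (assumed normalisation)."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def match_products(shop_a, shop_b, threshold=0.95):
    """Compare only products of the same brand (blocking), then apply the threshold."""
    blocks = defaultdict(list)
    for item in shop_b:
        blocks[item["brand"].lower()].append(item)

    matches = []
    for item_a in shop_a:
        for item_b in blocks.get(item_a["brand"].lower(), []):
            name_sim = lcs_similarity(item_a["name"].lower(), item_b["name"].lower())
            sub_sim = lcs_similarity(
                item_a["subcategory"].lower(), item_b["subcategory"].lower()
            )
            if name_sim >= threshold and sub_sim >= threshold:
                matches.append((item_a["name"], item_b["name"]))
    return matches


# Hypothetical products from two shops.
shop_a = [{"brand": "BrandX", "name": "Repair Shampoo 250ml", "subcategory": "Shampoo"}]
shop_b = [
    {"brand": "brandx", "name": "Repair Shampoo 250 ml", "subcategory": "Shampoo"},
    {"brand": "BrandX", "name": "Volume Conditioner", "subcategory": "Conditioner"},
    {"brand": "BrandY", "name": "Repair Shampoo 250ml", "subcategory": "Shampoo"},
]

matches = match_products(shop_a, shop_b)
```

Blocking on the brand keeps the number of comparisons manageable: the BrandY product is never compared with the BrandX product even though the names are identical.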
To prepare the data for our textual analyses, we used the tm package to clean the review text by removing punctuation and numbers and converting the text to lower case. We then tokenized the data at the word level and created a document-term matrix (DTM) for the topic modelling and the word frequency clouds.
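The project did this cleaning in R with the tm package. For readers who want to see the steps spelled out, here is a rough Python equivalent using only the standard library; the function names and the sample reviews are hypothetical.

```python
import re
from collections import Counter


def clean_and_tokenize(text):
    """Lower-case the text, drop numbers and punctuation, split into words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = re.sub(r"[^\w\s]|_", " ", text)   # remove punctuation, keep umlauts
    return text.split()


def build_dtm(documents):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    tokenized = [clean_and_tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical German reviews.
reviews = ["Tolles Shampoo, riecht gut!", "Shampoo riecht nicht gut... 2 Sterne."]
vocabulary, dtm = build_dtm(reviews)
```

Each row of `dtm` corresponds to one review and each column to one term of `vocabulary`, which is the structure that topic models and word frequency counts operate on.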
To answer our first two research questions, we used descriptive statistics in R. Here, we summed up the number of products in each subcategory on a brand-level and highlighted the number of overall subcategories each brand is active in.
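The portfolio counts described here amount to a grouped tally. The project computed them in R; the Python sketch below shows the same logic on hypothetical data, with made-up brand names and field names.

```python
from collections import Counter, defaultdict

# Hypothetical merged product table.
products = [
    {"brand": "Brand A", "subcategory": "Shampoo"},
    {"brand": "Brand A", "subcategory": "Shampoo"},
    {"brand": "Brand A", "subcategory": "Conditioner"},
    {"brand": "Brand B", "subcategory": "Hair Spray"},
]


def portfolio_summary(products):
    """Count products per subcategory and active subcategories for each brand."""
    counts = defaultdict(Counter)
    for item in products:
        counts[item["brand"]][item["subcategory"]] += 1
    return {
        brand: {
            "products_per_subcategory": dict(sub_counts),
            "n_subcategories": len(sub_counts),      # portfolio width
            "n_products": sum(sub_counts.values()),  # overall portfolio size
        }
        for brand, sub_counts in counts.items()
    }


summary = portfolio_summary(products)
```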
To get a comprehensive understanding of the pricing strategies of the different brands, we normalized prices to a price per 100 ml and extracted the median and the price range per subcategory for each focus brand. We visualized the results with box plots in R.
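Normalising to a price per 100 ml makes products of different sizes comparable. A minimal sketch of that calculation, assuming prices in euros and sizes in millilitres (the field names and figures are hypothetical, and the project itself did this in R):

```python
from collections import defaultdict
from statistics import median


def price_per_100ml(price_eur, size_ml):
    """Scale a price to a 100 ml reference size."""
    return price_eur * 100 / size_ml


# Hypothetical price and size observations.
offers = [
    {"brand": "Brand A", "subcategory": "Shampoo", "price_eur": 5.00, "size_ml": 250},
    {"brand": "Brand A", "subcategory": "Shampoo", "price_eur": 9.00, "size_ml": 300},
    {"brand": "Brand A", "subcategory": "Shampoo", "price_eur": 4.00, "size_ml": 100},
    {"brand": "Brand B", "subcategory": "Shampoo", "price_eur": 12.00, "size_ml": 200},
]


def price_stats(offers):
    """Median, minimum and maximum price per 100 ml for each (brand, subcategory)."""
    grouped = defaultdict(list)
    for offer in offers:
        key = (offer["brand"], offer["subcategory"])
        grouped[key].append(price_per_100ml(offer["price_eur"], offer["size_ml"]))
    return {
        key: {"median": median(values), "min": min(values), "max": max(values)}
        for key, values in grouped.items()
    }


stats = price_stats(offers)
```

The three values per group are a subset of what a box plot displays for each brand and subcategory.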
For our third research question, we performed a sentiment analysis to understand consumer perceptions of the brand’s competitors.
Here we used the sentiment lexicon SentiWS, since it is the only German lexicon which is license free. The lexicon contains words that are each assigned a positive or negative polarity score ranging from -1 to 1.
To be able to analyze our data, we created different data subsets. For instance, we created subsets for all major competitors, for single subcategories, and for single brands.
The results of our sentiment analysis were polarity scores for the different brands, which helped us understand positive and negative sentiment on a brand level. More precisely, we compared the mean sentiment scores on a brand level and looked at the top 10 negative and positive sentiment words for each brand.
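Lexicon-based scoring of this kind can be sketched in a few lines. The example below is illustrative only: the mini-lexicon and its scores are invented and are not SentiWS values, and averaging word scores per review and then review scores per brand is an assumed aggregation, not necessarily the one used in the project.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical mini-lexicon with made-up polarity scores in [-1, 1].
LEXICON = {"gut": 0.4, "toll": 0.5, "schlecht": -0.8, "klebrig": -0.3}


def review_score(text, lexicon=LEXICON):
    """Mean polarity of the lexicon words found in a review (0.0 if none)."""
    tokens = re.findall(r"\w+", text.lower())
    scores = [lexicon[token] for token in tokens if token in lexicon]
    return mean(scores) if scores else 0.0


def mean_sentiment_by_brand(reviews):
    """Average the review scores for each brand."""
    by_brand = defaultdict(list)
    for review in reviews:
        by_brand[review["brand"]].append(review_score(review["text"]))
    return {brand: mean(scores) for brand, scores in by_brand.items()}


# Hypothetical reviews.
reviews = [
    {"brand": "Brand A", "text": "Riecht gut und wirkt toll"},
    {"brand": "Brand A", "text": "Leider klebrig"},
    {"brand": "Brand B", "text": "Schlecht, sehr schlecht"},
]

brand_sentiment = mean_sentiment_by_brand(reviews)
```

Filtering `reviews` before the call gives the per-subcategory or per-brand subsets described above.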
Further, we analyzed the top and flop products according to sentiment for each focus brand as well as in selected subcategories.
To get an even deeper understanding of what consumers care about in the haircare and styling segment, we also explored the most important topics consumers talk about when discussing the products they bought.
Our approach here was to fit topic models with the LDA algorithm to detect underlying topics in the consumer reviews.
Our first goal was to identify the main topics in the haircare and styling segment. Additionally, we looked at the most prominent topics in negative reviews and the topics per subcategory, and we compared the topics of the main brand with those of its competitors.
As a result, we found that consumer discussion is not strictly linked to the product’s subcategory or product line, but that individual consumer perception can unmask unique topics and products of interest.
To conclude, we created a competitor grid with the brand sentiment on the x-axis, the mean price level on the y-axis and bubble sizes corresponding to product portfolio sizes.
Even though the overall goal of our project was to give a descriptive overview of the competitive landscape in our three shops, we concluded our analysis with some managerial implications for the company.
Laura Dahmen Data Science: R
Marco Feye Data Science: R
Kristina Jungmann Data Science: R