User Generated Content as a Brand Insight Tool to Reveal the Drivers of User Engagement and Evaluate which Topics are Perceived Positive and which Negative

Oct 1, 2021


This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (summer term 2021).


The goal of this project was to gain brand insights by analyzing the sentiment of users on social media regarding topics that are connected to the brand. Positive topics could be promoted, while insights about negative topics helped to shape communication.

Furthermore, we identified drivers of user engagement and built a model that can help make a brand’s social media presence more successful.


We chose two approaches for our analysis. Firstly, we looked at the engagement of consumers with the brands:

  • How strong is consumers’ engagement with the brands’ content?
  • What do the brands post, and do the topics differ across brands?
  • Which factors (e.g. topics) drive engagement with brand posts?

In our second approach we analyze the user generated content to derive a sentiment for the brands:

  • What is the general perception of the brand among consumers?
  • How does it compare to the perception of the competitors?
  • What topics are criticized or seen as favorable?

This leads to two guiding questions: how can a brand increase engagement, and how can it improve its perception?

Data Basis

Regarding the scope of our analysis, we identified six brands in the FMCG segment that we investigated more closely. As for the platform, we looked at user generated content (UGC) from Instagram, as it is probably the most prominent platform for the target group of the chosen brands.

We identified four types of posts: brand posts, the comments under the brand posts, posts related to a hashtag and tagged posts. Utilizing two ready-made Instagram scrapers and their underlying functions (with some modifications) we scraped the posts concerning all relevant brands and their hashtags.

After extracting the relevant metadata and applying some cleaning functions, we structure the cleaned data into two datasets for our approaches.

As a result, we are left with 5,702 brand posts for the engagement analysis and 919,715 hashtag posts, tagged posts and comments for the sentiment analysis.

Approaches and Derived Recommendation for Action

Regression Analysis

In the first part of our analysis we aim to find influences on users’ engagement with brand-owned content using a regression analysis. User engagement is a form of user generated content and a very basic way to measure sentiment, where more engagement is seen as more positive.

As we compare multiple brands, our first step is to investigate whether there are actually differences in the engagement level. To do so, we compare the average number of likes and the average number of comments for each brand in a barplot using ggplot. Looking at the graphs, we indeed find considerable differences. Thus, we next aim to find out what drives these differences among the brands.

We consider the number of followers a brand has to be the most apparent reason. However, to analyze the relationship between follower count and engagement, we need the number of followers at the date when the content was posted. Unfortunately, the scraping process described previously only yields the follower count at the time of scraping. There is also no “typical” development of this metric.

Therefore, we must make an assumption: we presume that gaining followers is rather slow in the beginning but becomes more rapid once the account gets more popular. We thus approximate follower development with an exponential function that starts at the first post’s number of likes and ends at the current number of followers. Using this function, we can assign a follower count to every post depending on the date it was published. The correlation between engagement and this approximated follower count is 0.5, indicating that there is indeed a relationship. We can already account for the number of followers before conducting the regression analysis by calculating the engagement rate, which is the sum of likes and comments divided by the follower count.
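To illustrate the follower approximation, here is a minimal sketch in Python (the project itself was done in R). The exact functional form is an assumption: an exponential curve fixed at the two anchor points described above. The function names and the guard against zero likes are hypothetical.

```python
from datetime import date


def estimated_followers(post_date, first_date, scrape_date,
                        first_post_likes, current_followers):
    """Exponentially interpolate the follower count between two anchor points.

    Assumed form: f(t) = f0 * (f_end / f0) ** progress, where progress runs
    from 0 (first post) to 1 (day of scraping).
    """
    total_days = (scrape_date - first_date).days
    if total_days <= 0:
        return float(current_followers)
    # Guard against a first post with zero likes (hypothetical safeguard).
    start = max(first_post_likes, 1)
    progress = (post_date - first_date).days / total_days
    return start * (current_followers / start) ** progress


def engagement_rate(likes, comments, followers):
    """Sum of likes and comments relative to the (estimated) follower count."""
    return (likes + comments) / followers


# Usage with made-up numbers: a post halfway between first post and scraping.
followers = estimated_followers(
    post_date=date(2020, 1, 6),
    first_date=date(2020, 1, 1),
    scrape_date=date(2020, 1, 11),
    first_post_likes=100,
    current_followers=10_000,
)
rate = engagement_rate(likes=180, comments=20, followers=followers)
```

With this form the estimate grows by a constant percentage per day, which matches the assumption of slow early growth and faster growth later.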

Since we still find considerable deviations among the brands with regard to the engagement rate, we postulate that there must be other influences on engagement besides the follower count. Thus, we next create the variable word count, which indicates the number of words used in a post’s caption. Plotting the data in a boxplot using ggplot, we find differences that might be a determinant of our dependent variable. Additionally, we create a categorical variable with five categories from the time stamp. We will also consider this variable in the regression analysis.

Besides these frequency variables we also want to consider the content of a post. We start exploring the posts’ content by first finding common words using a commonality wordcloud. Second, we aim to find brand-specific words, which we identify by splitting the text into tokens and calculating TF-IDF scores. From these two steps we conclude that the brands overlap to some extent but also post brand-specific content.
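As an illustration of how TF-IDF surfaces brand-specific words, here is a minimal Python sketch (the project itself was done in R). It assumes each brand’s captions are treated as one document and uses the common definition tf = term count / total terms and idf = ln(number of brands / number of brands using the term). The brand names and tokens are made up.

```python
import math
from collections import Counter


def tf_idf(tokens_by_brand):
    """Return {brand: {term: tf-idf score}} with each brand as one document."""
    n_brands = len(tokens_by_brand)
    counts = {brand: Counter(tokens) for brand, tokens in tokens_by_brand.items()}
    # Number of brands whose captions contain each term at least once.
    brands_using_term = Counter(term for c in counts.values() for term in c)
    scores = {}
    for brand, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[brand] = {
            term: (n / total) * math.log(n_brands / brands_using_term[term])
            for term, n in term_counts.items()
        }
    return scores


# Hypothetical tokens for two hypothetical brands.
scores = tf_idf({
    "brand_a": ["summer", "glow", "glow", "skin"],
    "brand_b": ["summer", "fresh", "skin"],
})
top_term_a = max(scores["brand_a"], key=scores["brand_a"].get)
```

Words that every brand uses get a score of zero, so only terms that set a brand apart rank highly.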

Since we cannot use single words in the regression analysis, we apply topic modeling to our dataset using the LDA algorithm. We find 11 distinct topics. To include them in the regression analysis, we assign each post to a topic such that the topic with the highest gamma value receives a 1 while every other topic receives a 0. This way we create another categorical variable.
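The topic assignment can be sketched as follows in Python (the project itself was done in R). The sketch assumes the gamma values, i.e. the per-post topic proportions from the LDA model, are already available as one list per post. The post ids and values are made up, and ties go to the first topic, which is an arbitrary choice.

```python
def dominant_topic_dummies(gamma_by_post, n_topics):
    """Turn per-post topic proportions into 0/1 indicators for the top topic."""
    dummies = {}
    for post_id, gammas in gamma_by_post.items():
        # Index of the topic with the highest gamma (first one wins on ties).
        best = max(range(n_topics), key=lambda k: gammas[k])
        dummies[post_id] = [1 if k == best else 0 for k in range(n_topics)]
    return dummies


# Hypothetical gamma values for three topics (the project found 11).
dummies = dominant_topic_dummies(
    {"post_1": [0.1, 0.7, 0.2], "post_2": [0.6, 0.3, 0.1]},
    n_topics=3,
)
```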

Finally, we can run our regression analysis using the caret package’s train function with engagement rate as the dependent variable and word count, time of day and topic as the independent variables. To validate the model we use 10-fold cross-validation and find a significant model with an adjusted R² of 0.1136. Although the R² is rather low, we find several significant coefficients and can derive some valuable insights.
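To show what 10-fold cross-validation does, here is a simplified Python sketch (the project itself used caret in R). As an assumption made for brevity, it fits a least-squares line on a single numeric predictor instead of the full set of variables and reports the out-of-fold root mean squared error. The function names and data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(xs, ys, k=10, seed=0):
    """Fit on k-1 folds, predict the held-out fold, and pool the errors."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared_errors.extend(
            (ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out
        )
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


# Made-up, perfectly linear data: the held-out error should be close to zero.
word_counts = list(range(50))
engagement = [2 * x + 1 for x in word_counts]
rmse = cross_validated_rmse(word_counts, engagement, k=10)
```

Because every post is predicted by a model that never saw it, the pooled error is a fairer estimate of how the model would perform on new posts than the in-sample fit.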

Sentiment Analysis

Our second analytical approach uses sentiment analysis to gain insights into how the brand is perceived by Instagram users. Building on these sentiments, we then derive topics that are more likely to be perceived as positive or negative. The goal is to give managerial advice on which topics the communication should focus on in order to improve consumers’ perception of the brand. The data we used for this approach were comments under the brand-owned posts and captions of pictures that users uploaded themselves and that contained the brand as a tagged account or the hashtag corresponding to the brand.

Conducting the sentiment analysis and the subsequent topic modeling required a few steps of data preparation.

  1. First, we removed stopwords such as “and”, “the”, “to”… These occur often but carry no useful information and would therefore not deliver proper insights.
  2. Secondly, we removed punctuation from the texts. This will become important in a second.
  3. Then, we tokenized the texts. This means that each case containing a text (a combination of multiple words) was split into multiple cases, each containing only one word. Those cases still shared the same identification variable, so we were able to restore the original text.
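The three preparation steps can be sketched as follows in Python (the project itself was done in R). The stopword set only contains the three examples named above, while a real list would be much longer. For convenience the sketch removes punctuation first and filters stopwords after splitting, which differs from the order described above. Only ASCII punctuation is stripped.

```python
import string

# Only the examples from the text; a real stopword list is far longer.
STOPWORDS = {"and", "the", "to"}


def tokenize(texts_by_id, stopwords=STOPWORDS):
    """Return one (text_id, word) row per remaining word of each text."""
    strip_punctuation = str.maketrans("", "", string.punctuation)
    rows = []
    for text_id, text in texts_by_id.items():
        for word in text.lower().translate(strip_punctuation).split():
            if word not in stopwords:
                # The id travels with every word, so texts can be rebuilt.
                rows.append((text_id, word))
    return rows


# Hypothetical comment with id 1.
rows = tokenize({1: "I love the taste, and the smell!"})
```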

The tokenized text now allowed us to calculate sentiment scores. We did so using the German sentiment dictionary SentiWS, which assigns each word a positive or negative value depending on the sentiment the word indicates. We joined the tokenized texts with the dictionary and thereby received a sentiment value for each word in our data that appears in the dictionary. Because we had removed the punctuation beforehand, words were not missed in the join due to attached punctuation marks. With the help of our identification variable, we could then summarize each post by calculating its sentiment score as the sum of the sentiment values of its words.
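A minimal Python sketch of this join-and-sum step (the project itself was done in R). The lexicon below uses made-up weights, not actual SentiWS values, and words missing from the lexicon are assumed to contribute zero.

```python
from collections import defaultdict


def sentiment_scores(token_rows, lexicon):
    """Sum the lexicon value of every word, grouped by text id."""
    scores = defaultdict(float)
    for text_id, word in token_rows:
        # Words missing from the lexicon add nothing to the score.
        scores[text_id] += lexicon.get(word, 0.0)
    return dict(scores)


# Made-up weights for illustration only (not taken from SentiWS).
lexicon = {"gut": 0.5, "schlecht": -0.75}
token_rows = [(1, "gut"), (1, "lecker"), (2, "schlecht"), (2, "gut")]
text_scores = sentiment_scores(token_rows, lexicon)
```

Unlike a strict inner join, this version keeps texts without any dictionary match and gives them a score of zero.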

Besides calculating sentiment scores for the text itself, we also wanted to assess the sentiment of the emojis used. Those are equally important to express one’s perception about a subject.

In general, the procedure was similar, but it required an additional step of data preparation: the emojis had to be encoded in UTF-8 so that they could be joined with the emoji sentiment dictionary.

Finally, we combined both scores and thereby derived an overall sentiment for each text we considered in the analysis.

Topic Modeling

After conducting the sentiment analysis, we split the underlying data set into two for each brand: one for positive sentiment and one for negative sentiment. To do so, we used the overall sentiment score, aggregated from the sentiment analyses of text and emojis, with a cutoff value of 0.
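The split can be sketched as follows in Python (the project itself was done in R). Two points are assumptions, since the article does not specify them: the overall score is taken as the sum of text and emoji score, and texts with a score of exactly 0 are left out of both sets. The ids and scores are made up.

```python
def split_by_sentiment(text_scores, emoji_scores, cutoff=0.0):
    """Return (positive_ids, negative_ids) based on the combined score."""
    positive, negative = [], []
    for text_id in text_scores.keys() | emoji_scores.keys():
        # Assumption: overall sentiment = text score + emoji score.
        total = text_scores.get(text_id, 0.0) + emoji_scores.get(text_id, 0.0)
        if total > cutoff:
            positive.append(text_id)
        elif total < cutoff:
            negative.append(text_id)
        # Scores exactly at the cutoff are treated as neutral and skipped.
    return sorted(positive), sorted(negative)


# Hypothetical scores per text id.
positive_ids, negative_ids = split_by_sentiment(
    text_scores={1: 0.5, 2: -0.25, 3: 0.0},
    emoji_scores={2: 0.125, 4: -0.5},
)
```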

The goal was to find the topics that drive positive and negative consumer sentiment. To find the best possible number of topics and the most precise description, we used three different topic modeling approaches: Cluster Analysis, Latent Dirichlet Allocation (LDA) and Structural Topic Modeling. With these we were able to identify positive topics covering the effects and features that set the product apart from the competition. Regarding the negative topics, some consumers complain about side effects that come with using the product. Also, catfishing by the packaging and the newsletter sign-up are perceived as negative.

From the issues highlighted in the positive and negative topics, the following recommendations could be derived for the brand:

  1. Decrease postings about generic topics and instead try to draw consumers’ attention to the brand’s unique selling proposition.
  2. The negative side effects could be addressed by offering a money-back guarantee, increasing interaction with unsatisfied customers and offering free samples.

The team

Lucas Fischenich Data Science: R

Oliver Nowak Data Science: R

Niemah Reuning Data Science: R

Sara Widders Data Science: R


Fabian Kraut



