News Aggregator Project Report
Anders Holck Hartvig, Daniel Hedemann Hansen, Jana Hammerer
1. Introduction
The initial idea was to develop a program able to scrutinise all the news articles provided by the major news agencies worldwide for a given period of time and transform the content into easily readable focal points. Eventually, the program should be able to
- analyse a gigantic body of text often exceeding millions of characters,
- bring about country-dependent focal points on the basis of word frequencies, repetitions and natural emphasis in the written language, and
- subsequently, visualise the findings in a sophisticated and explanatory way.
This gives the user an otherwise time-consuming insight in just a second: a quick overview that makes it possible to target topics of personal, local or global interest more specifically.
News Aggregator is the name of a product able to do just this, developed by programming novices, and this post will take you through the considerations, challenges and solutions we faced in the development process.
2. The Approach and Programming Goals
Initially, we realised none of us had the slightest programming experience, and our visions and approaches were all heavily affected by ignorance. Any goal, vision, or strategy was doubtful, and the largest struggle was to unify everyone's personal conditions of satisfaction. These conditions of satisfaction would ensure that everybody had a rewarding and fruitful programming journey.
In terms of strategy, the first challenge was twofold: How do we define stepping-stones to foster the advancement of the product without any knowledge of coding, and should we take a brute-force or orchestrated approach? To emphasise the latter concern: should we jump into the programming, get our hands dirty and deal with the challenges as we go along, or should we establish a thorough understanding beforehand? In other words, how do we familiarise ourselves with a completely new discipline as fast as possible?
Programming is often considered an iterative process, and on many levels. (1) In the coding itself, iteration is the actual process of repeating a sequence of commands, (2) but it can also describe the constant trial-and-error method employed in the programming strategy. In our case, not only was the programming iterative, (3) but the very strategy was subject to constant reconsideration as we gained knowledge and insight during the development process. What we thought was the easiest part could turn out to be the most difficult, and we had to adjust our ambitions. Good lord! Only a novice programmer knows the struggle associated with obscure error descriptions and syntax errors.
3. Web Scraping
The very first step in our journey was to determine a news source, and after a thorough consideration of potential political bias, language barriers, and other difficulties, we went with Reuters.com, a formerly independent, international organisation providing worldwide news in English: an obvious and reliable source, but not a perfect one. The choice of news source heavily affects the results and focal points our program is likely to provide, and preferably we would source from a long list of trustworthy bureaus covering every country. However, given our capabilities and resources, the News Aggregator will for the time being present a solution with potential for upscaling.
We targeted all the regions where Reuters operates, but we soon realised that Reuters’ webpages vary in structure, coding, and layout, as shown in Picture 1. This complicated the text-gathering process because we wanted a program able to do all of the following automatically:
- navigate, scroll, and interact with their pages,
- locate the body text of each article,
- note down the concerned region of the article,
- size up the publication date, and
- organise the data in useful text files.
In fact, the above-listed steps constitute the process of our web scraping (data gathering), and it was all made possible through the same iterative approach.
We had to analyse how the webpages differed and isolate the specific discrepancies in their coding. This analysis was done manually, and once the discrepancies were identified, a list of technical criteria was specified so our program could automatically determine what type of webpage it was facing.
The number of articles analysed is determined by how far we allow our program to go back in time. As you probably know from your own experience, older page content is recalled either by scrolling to the bottom of the webpage or by clicking a button requesting older content. In other words, our program has to simulate a human scrolling a webpage or interacting with a button requesting older content. We will give a thorough presentation of the scraping below, but Illustration 1 illustrates the main idea.
3.1 Selenium and Chrome Driver
This simulation is carried out with Selenium and its Chrome driver, which can navigate and interact with webpages through Google Chrome. Selenium is a Python package that uses Chrome as an interface, and through our code it simulates a human interacting with Reuters’ webpages.
3.2 Specify Regions
In order to instruct Selenium which pages to interact with, we have beforehand specified a list containing the links directing to all Reuters’ different regions:
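A minimal sketch of such a list is shown below; the URL paths are placeholders, not Reuters' actual archive addresses, which are defined in our code.

```python
# Placeholder URLs only -- the real region links are specified in our code.
urls = [
    "https://www.reuters.com/news/archive/us",      # United States (hypothetical path)
    "https://www.reuters.com/news/archive/china",   # China (hypothetical path)
    "https://www.reuters.com/news/archive/india",   # India (hypothetical path)
    # ... and so on for the remaining regions (Euro Zone, Middle East, Japan,
    # Mexico, Brazil, Africa, Russia)
]
```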
3.3 Cookies
As you have probably experienced, webpages require your acceptance to store cookies, and this case is no different. The first time Selenium interacts with a webpage, the webpage requires cookie acceptance. Thus, we had to figure out how to make Selenium accept cookies, but only for the first webpage accessed. If Selenium cannot determine whether a link is the very first in the list, it will keep looking for a cookie button even when none exists. This is made possible by an if-statement that carries out the cookie-clicking command only if a webpage’s link is equal to the first one listed in our urls list of links.
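A minimal sketch of this idea, assuming Selenium's Chrome driver and a hypothetical button locator (the real selector depends on Reuters' cookie banner), could look like this:

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

for url in urls:
    driver.get(url)
    # Only the very first page in the list shows the cookie banner.
    if url == urls[0]:
        time.sleep(2)  # give the banner time to appear
        # Hypothetical locator -- the real one must match Reuters' cookie button.
        driver.find_element(By.ID, "onetrust-accept-btn-handler").click()
    # ... the scraping of this page continues here
```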
3.4 Determine Webpage Type
From another predefined set containing all the regions sharing the same webpage type, our program can determine to which group of webpages a given region belongs. This is simply done by identifying the URL for a given webpage and checking whether the URL satisfies a criterion. From the URL list, it can be seen that all the links contain a region code, and the if-statement checks whether this region code is in our set:
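As a sketch (with made-up region codes), the check could look like this:

```python
# Hypothetical grouping -- the real sets contain the region codes we identified manually.
scroll_regions = {"us", "china", "india"}  # Type 1: older content is loaded by scrolling
click_regions = {"brazil", "mexico"}       # Type 2: older content is loaded via a button

region_code = url.rstrip("/").split("/")[-1]  # e.g. ".../archive/india" -> "india"

if region_code in scroll_regions:
    pass  # run the Type 1 (scrolling) loop described in section 3.5.1
elif region_code in click_regions:
    pass  # run the Type 2 (button-clicking) loop described in section 3.5.2
```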
3.5 Scrolling or Clicking
The above if-statement initiates the loop responsible for the whole scraping process of this webpage type. This distinction is extremely important because the following series of loops depends on the type of webpage. The biggest difference between the webpage types is whether older content is recalled by scrolling or by clicking a button, and the coding required to carry out the two processes is quite different. First, we will go through the scraping of those pages which require Selenium to scroll (Type 1), and secondly, we will go through the webpages requiring Selenium to click a button (Type 2). Regardless of the webpage type, we need to locate the HTML of any given page, and we assign it the name elm (element):
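With Selenium, one way to do this (a sketch, not necessarily our exact line) is to grab the page's root element:

```python
from selenium.webdriver.common.by import By

# The root <html> element; keystrokes sent to it scroll the whole page.
elm = driver.find_element(By.TAG_NAME, "html")
```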
3.5.1 Scraping Type 1
Initially, we specify how many times we want Selenium to scroll the webpage, which is done with a simple variable. Afterwards, Selenium carries out the scrolling with a delay of 1 second to avoid any breakdown:
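A sketch of that scrolling loop, assuming a simple variable n_scrolls controls how far back in time we go:

```python
import time
from selenium.webdriver.common.keys import Keys

n_scrolls = 20  # how many times Selenium scrolls, i.e. how far back in time we go

for _ in range(n_scrolls):
    elm.send_keys(Keys.END)  # jump to the bottom of the page, which triggers older content
    time.sleep(1)            # 1-second delay so the new content can load without breaking
```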
After the scrolling is conducted, the loaded webpage will now contain all the articles of interest as shown in Illustration 1.
However, to make the program able to identify the articles of interest and not various advertisements, toolbars, Facebook hyperlinks and the like, we specify in detail where to look for the links directing to each article with the following code. The code draws upon the Python library Beautiful Soup, which is highly effective for the HTML parsing required:
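A sketch of this step is shown below; the tag and class names are assumptions about Reuters' article-list markup and differ from page to page:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")

links = []
# "story-content" is an assumed class name -- the real selector matches Reuters' markup.
for story in soup.find_all("div", class_="story-content"):
    a = story.find("a", href=True)
    if a:
        links.append("https://www.reuters.com" + a["href"])
```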
The list links now contains the hyperlinks to all the articles for a given period. With the Python library requests, we are able to make an HTTP request, and we run the requests.get() method to retrieve data from every hyperlink directing to an article:
For every article, data are retrieved, and these data contain everything: the header, body text, publication date, author and so on. The next step is to navigate, locate and save the useful information from the soup in an easily accessible way for later analysis. The articles for every region are saved in two different DataFrames:
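A minimal sketch of that loop:

```python
import requests
from bs4 import BeautifulSoup

article_soups = []
for link in links:
    response = requests.get(link)  # fetch the article's HTML
    article_soups.append(BeautifulSoup(response.text, "html.parser"))
```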
- a region-specific DataFrame, and
- an aggregate DataFrame to which the text of all regions is appended.
After thorough consideration and trials, we found that the only elements of interest were the body text and the header.
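A sketch of how the header and body text can be pulled out of each article's soup and stored with pandas (the tag choices are assumptions about Reuters' article markup):

```python
import pandas as pd

all_df = pd.DataFrame()  # aggregate DataFrame -- created once, before looping over the regions

rows = []
for soup in article_soups:
    h1 = soup.find("h1")  # assumed to hold the header
    header = h1.get_text(strip=True) if h1 else ""
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))  # assumed body text
    rows.append({"header": header, "text": body})

region_df = pd.DataFrame(rows)                              # region-specific DataFrame
all_df = pd.concat([all_df, region_df], ignore_index=True)  # appended to the aggregate DataFrame
```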
3.5.2 Scraping Type 2
A similar scraping process is conducted for the webpages of Type 2. However, instead of specifying a variable indicating scroll times, we specify a variable determining how many times the “earlier content” button is clicked. Working with Type 1, we carried out all the scrolls before saving any data; however, that is not an option for Type 2, where we lose all the loaded articles every time we click the “earlier content” button. Thus, we must save the data before we load a new page with older content. Except for that distinction, the body text and headers are saved in a way identical to that of Type 1.
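A sketch of the Type 2 loop, with a hypothetical save_articles() helper standing in for the same header/body extraction as in Type 1 and an assumed button locator:

```python
import time
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

n_clicks = 10  # how many times the "earlier content" button is pressed

for _ in range(n_clicks):
    # Save the currently loaded articles first -- they disappear once the button is clicked.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    save_articles(soup)  # hypothetical helper wrapping the Type 1 extraction

    # Hypothetical locator -- the real one must match the "earlier content" button.
    driver.find_element(By.CLASS_NAME, "control-nav-next").click()
    time.sleep(1)  # let the older content load
```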
4. Text Analysis
All articles of each region are saved in a region-specific DataFrame, and simultaneously, the articles of all regions are saved in an aggregate DataFrame allowing us to perform analysis on a global level. As an example of the DataFrame for India, we have Picture 2.
All DataFrames are analysed by the same two processes initiated by a for-loop: single-word analysis and n-gram analysis.
The first process comprises, in chronological order: (1) punctuation removal, (2) removal of trivial words, (3) tokenization, (4) removal of stopwords, (5) omission of words of three characters or fewer, (6) while assuring that region abbreviations like US, EU, DK and UK are not omitted, (7) lower-casing of all words, (8) lemmatization, (9) stemming, (10) removal of trivial words for a second time, (11) transformation into a new DataFrame, and (12) visualisation by bar charts.
Overall, the n-gram analysis is similar, but instead of analysing single words, we use n-grams. N-grams are sequences of words, and the length of each sequence (how many words it contains) is a parameter we decide.
In order to count the words in the single-word analysis, we have to remove all suffixes and stem the words to their roots. This way of analysing makes all the words easily countable at the expense of context and sentence structure. Thus, we have also chosen to conduct an n-gram analysis, in which we count the frequency of frequently occurring word sequences like ”united states”, ”hong kong”, ”impeachment trial”, ”president donald trump”, ”world economic forum davos” and similar.
In our n-gram analysis, we do not run the risk of losing the words’ suffixes. The idea of n-grams is to maintain as much structure and intended meaning as possible. In the single-word analysis, we are cognitively capable of comprehending familiar words without suffixes, but longer sequences without suffixes increase the cognitive effort required to understand them. As an example of the difference between the single-word analysis and a 2-gram analysis, Picture 3 shows the 20 most frequently occurring words/sequences and their frequencies.
It is quite obvious that both analyses have their own advantages. In the 2-gram analysis, the context and intended meaning are preserved, but only in the single-word analysis does the word ”Cipollon” occur, which is the stem of the surname of the American lawyer Pat Cipollone. In fact, the name Pat Cipollone is a magnificent specimen of the challenges we faced. As you might have guessed, the reason he does not occur in the 2-gram analysis is that his first name was omitted due to its shortness. Nonetheless, the single-word analysis does yield valuable insight, especially when we look at 50 or more words/sequences and not just the 20 most frequent. In fact, the 2-gram analysis starts to struggle and lack applicability as soon as the sequence frequencies approach 0. Obviously, an n-gram analysis with an even higher parameter, like a 5-gram analysis, will approach 0 even faster, and in order to make n-gram analyses like 3-gram, 4-gram, and 5-gram useful, we have to analyse a really large body of text. Luckily, that’s just what we got!
To illustrate the difference between n-gram analyses at different parameter values, Picture 4 shows the n-gram analyses of a larger DataFrame comprising articles from the whole world in a given period.
Both the single-word analysis and the n-gram analyses were implemented through a trial-and-error approach. As already mentioned, the difference between the two types of analyses is the stemming. Thus, the words in the n-gram analysis have not been stemmed to their roots, but only subjected to a lemmatization process. Lemmatization is related to stemming, but not as ”rough”, and will only return the lemma of a word. The lemma is the dictionary entry of the word, and thereby the intended meaning of the sentences is preserved.
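The difference is easy to see with NLTK's stemmer and lemmatizer (a small illustration, not a quote from our code; the lemmatizer requires the WordNet corpus via nltk.download("wordnet")):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("economies"))          # "economi" -- the root, not a dictionary word
print(lemmatizer.lemmatize("economies"))  # "economy" -- the lemma, i.e. the dictionary entry
```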
Code 10 displays the single-word analysis. The n-gram analysis is similar, but the stemming process is substituted by the n-gram grouping shown in Code 11.
The commands in the box are those removed in the n-gram analysis and replaced by the n-gram command displayed in Code 11. The n-gram parameter is n.
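As a rough sketch of what Code 10 and Code 11 do, using NLTK (the exact steps, order and word lists in our code differ, and the NLTK resources punkt, stopwords and wordnet must be downloaded first):

```python
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

keep = {"us", "eu", "dk", "uk"}  # region abbreviations that must survive the length filter
stop = set(stopwords.words("english"))
lemmatizer, stemmer = WordNetLemmatizer(), PorterStemmer()

def single_word_counts(text):
    """Sketch of Code 10: clean, tokenize, lemmatize, stem, and count single words."""
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = [t for t in nltk.word_tokenize(text)
              if t not in stop and (len(t) > 3 or t in keep)]
    return Counter(stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens)

def ngram_counts(text, n=2):
    """Sketch of Code 11: the stemming step is replaced by n-gram grouping."""
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(text) if t not in stop]
    return Counter(ngrams(tokens, n))
```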
In coding terms, what we have accomplished is known as natural language processing (NLP), and to show our successes and the trial-and-error approach to language processing, Illustration 3 sums up the progression in the analysis results. The steps visualised in the illustration are those specified earlier.
As a concluding comment regarding the text analysis, we saved the most frequent words and their frequencies, and similarly, we saved the most frequent n-grams and their frequencies.
5. Visualisation
Following the text analysis, we concluded that it had been far more time-consuming than expected. Hence, we adjusted our goals and agreed upon a new visualisation strategy. Initially, we had strived to visualise the importance and emphasis of each word and each n-gram by plotting the words directly on a world map with a font size corresponding to their occurrence, as illustrated by Picture 5. However, our experience and resources were inadequate, and we decided to go for a more traditional visualisation with bar charts.
We simply decide that we want the 30 most frequent words or n-grams displayed, and we set the n-gram parameter determining whether we use 2-grams, 3-grams, 4-grams, or 5-grams. With the data obtained for Brazil, the program provides us with the following five bar charts. Keep in mind that the program automatically makes corresponding bar charts for all ten concerned regions: US, Euro Zone, Middle East, China, Japan, Mexico, Brazil, Africa, Russia and India.
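The plotting code itself is not reproduced here; a minimal matplotlib sketch of one such bar chart, assuming the frequencies are held in a Counter as above, could look like this:

```python
import matplotlib.pyplot as plt

def plot_top_terms(counts, region, top=30):
    """Bar chart of the `top` most frequent words or n-grams for one region."""
    terms, freqs = zip(*counts.most_common(top))
    labels = [" ".join(t) if isinstance(t, tuple) else t for t in terms]  # n-grams come as tuples
    plt.figure(figsize=(12, 6))
    plt.bar(labels, freqs)
    plt.xticks(rotation=75, ha="right")
    plt.title(f"{region}: {top} most frequent terms")
    plt.tight_layout()
    plt.show()
```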
As can be seen from the bar charts, the higher the order of the n-grams, the fewer occurrences we have. To counteract the uncertainty associated with so few occurrences, we could extend the period over which we scrape. However, that comes at the cost of less sensational news: the longer the period, the more ”older” content is scraped. Bearing that in mind, we suggest you go with either 2- or 3-grams, and to be honest, we think these give a reasonable insight into a volatile world of news articles.
On behalf of the team behind News Aggregator, we hope you enjoyed reading about our programming experience and our acquaintance with a new world of coding. Should you have any further interest in the code itself, or would you like to get in touch, you are more than welcome to find us on LinkedIn.
Team Members:
Anders Holck Hartvig
Daniel Hedemann Hansen
Jana Hammerer