Smart Cross-Country Price Recommendation Project
Michele Scarperi, Daniel Kiss, Aleksandra Zajaczkowska, Julius Altenburger, Fredrik Martin
Our project addressed a problem in the field of price comparison and price recommendation services. The tools currently on the market lack many features and, crucially, the accurate information consumers need to compare prices across borders and find the best possible price on a global scale. The goal of our project was therefore to build a tool that takes availability, shipping fees and currency differences into account (converting prices to a selected currency) in order to increase price transparency.
At the project kick-off we decided to focus on a single product category, given the time frame and scope of the project. We chose smartphones because it is a popular category with plenty of available data. This let us work with real data and concentrate our efforts on developing the price comparison and recommendation functionality itself. Our aim was to develop a Minimum Viable Product (MVP) that could later be extended with more categories and enriched with additional data.
2.1 How to get reliable pricing data and perform currency conversion?
The reliability of our recommendation functionality depended mainly on the quality of the data. To present valid search results we wanted real-world prices for the desired products. Early on we realised it would be close to impossible to cover a wide variety of vendors from different countries: vendor data varies greatly from country to country and is not easily scrapable, and each vendor and product category would have needed its own scraping script. To keep the workload manageable we narrowed the scope and decided to start with smartphones from Media Markt only, across different countries. Currency conversion, needed to compare prices properly, became another issue: conversion rates change continuously, and using fixed rates would bias the results.
2.2 How to structure this turmoil, and where to store it?
For the purpose of comparison, we decided that information such as product type, specifications, price, availability, country and shipping cost should be displayed. However, while scraping the data we realised that the datasets from different countries were structured in completely different ways, and product names were given in the local languages. This posed another challenge: data cleaning and restructuring became our most time-consuming task. In addition, we had to decide on a database solution that would be easiest to connect to a front-end.
2.3 What about a UI?
At the beginning we wanted to develop a simple web interface. However, since none of us was on the web development track and we faced tight time constraints, we decided to limit the functionality of the final product. This forced us to cut down on features and develop a simple application that could be extended with a proper front-end in the future.
All these challenges fed into our milestone planning, which helped us keep track of intermediate goals while building the project.
3. Methodology & Solutions
3.1 Internal Organization — Scrum
We decided to utilize the strengths of the agile framework Scrum for our internal project management. This decision was made very early on, before technical planning began. To visualize our tasks and understand the bigger picture, we used the free online tool Trello and, after an initial brainstorming session on what would be needed to achieve the desired outcome, started filling it with tasks. We clustered the tasks into several main categories: Backend (Data Science), Frontend (WebDev), Project Management (PM) and Milestones. The Product Backlog contained the Epics (what do we want to build?) and the Sprint Backlog contained the Stories and Tasks (how do we achieve it in smaller steps?).
3.2 Getting the data — Scraping
Initial data structure
We had an idea of what kind of data a price recommendation feature needs, so we created an overview to visualize the initial structure of our database. Initially we also planned to include availability, shipping cost, vendor name and historical price. However, the websites of the selected vendors lacked many of these datapoints. For our MVP we decided to get smartphone data from Media Markt websites in 12 different countries, including Spain, Hungary, Italy, Poland and the Netherlands. The data found on these websites did not match our initial plan: we were able to get product links and listing texts with major features such as capacity, but much of the information was in foreign languages, which brought new challenges for us.
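To make the gap concrete, the shape of one raw scraped record versus the structure we were aiming for can be sketched as follows. The field names and values are our own illustrations, not the exact scraper output:

```python
# Illustrative shape of one raw scraped record (field names and values are
# invented for illustration). Note the local language and local currency.
raw_record = {
    "product_name": "HUAWEI P30 128 GB Fekete",   # Hungarian listing text
    "product_url": "https://www.mediamarkt.hu/example-product",
    "price": "199 999,- Ft",                      # local currency, unparsed
    "stock": "Készleten",                         # availability in Hungarian
}

# Target structure after cleaning and currency conversion (the EUR price
# here is a made-up placeholder, not a real conversion):
clean_record = {
    "brand": "Huawei",
    "model": "P30",
    "capacity": "128 GB",
    "country": "Hungary",
    "price_eur": 545.20,
    "available": True,
    "url": raw_record["product_url"],
}
```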
Methods for Scraping
Using the knowledge gained throughout the Data Science track, we chose scraping as our method. We started by developing a BeautifulSoup script, which did not bring the results we wanted: the naming of products on the different Media Markt websites was inconsistent, and one country's site could not be scraped with a script at all. Due to this unexpected complexity of the source websites (Media Markt sites do not share the same design across countries), plus a lack of time and of experience with Python scraping libraries, we decided to use UiPath Studio to scrape smartphone product data from the 12 Media Markt websites where this was possible. The attributes obtained were: Product Name, Product URL, Price (in local currency) and Stock information. The scraping outputs showed that additional manual pre-processing of the data would be required.
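For a site with regular enough markup, a BeautifulSoup approach along these lines would have worked. This is a simplified sketch against a static HTML snippet; the CSS class names are invented for illustration and the real Media Markt markup differs per country:

```python
from bs4 import BeautifulSoup

# Static snippet standing in for a product-list page; the class names
# are invented and do not match the real Media Markt markup.
html = """
<div class="product">
  <a class="product-link" href="/phone/p30-128gb">Huawei P30 128 GB</a>
  <span class="price">799,- &euro;</span>
  <span class="stock">In stock</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
products = []
for card in soup.select("div.product"):
    link = card.select_one("a.product-link")
    products.append({
        "product_name": link.get_text(strip=True),
        "product_url": link["href"],
        "price": card.select_one("span.price").get_text(strip=True),
        "stock": card.select_one("span.stock").get_text(strip=True),
    })

print(products[0]["product_name"])  # Huawei P30 128 GB
```

In practice this breaks as soon as the class names or page layout change, which is exactly the inconsistency we ran into across countries.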
3.3 Data cleaning and pre-processing — Excel
The scraped datasets included phone names and all specifications. To simplify the initial search, we decided to keep only the features crucial for setting the price range. Besides price, country, availability, product link and shipment info, features such as product brand, model type and capacity were kept in the product description. Unfortunately, the completely different data structures would have required separate code to extract the desired features from each of them. To clean and restructure the data we therefore used Excel, which saved us a huge amount of time since the sample dataset was not too big. It enabled a simple, fast, semi-manual cleaning setup. The data was first delimited and parsed with the Text to Columns wizard. Using the Visual Basic for Applications (VBA) programming language, we implemented a function that split sentences and extracted the desired words. For more complicated, unstructured sentences we created separate formulas combining the TRIM, LEFT, SUBSTITUTE, MID and MAX functions, which let us extract words containing specific text, such as a capacity like 64 GB, from each sentence.
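In Python, the capacity extraction we assembled from Excel formulas could be done with a single regular expression. A small sketch of that idea:

```python
import re

def extract_capacity(description):
    """Pull a storage capacity like '64 GB' or '128GB' out of a free-text
    product description, normalising it to the form 'NN GB'."""
    match = re.search(r"(\d+)\s*GB", description, flags=re.IGNORECASE)
    return "{} GB".format(match.group(1)) if match else None

print(extract_capacity("SAMSUNG Galaxy S10 Dual SIM 128GB fekete"))  # 128 GB
print(extract_capacity("Apple iPhone XR 64 GB Blanco"))              # 64 GB
```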
3.4 Live Currency — API Development
Prices on each Media Markt website are displayed in the respective country's currency. To give the consumer reliable results, we wanted to convert the prices in each dataset to a common currency, e.g. Euro. To get up-to-date conversion we retrieved exchange rates with the use of an API, specifically this one. The API is not in fact real-time, but uses the latest available exchange rate data, so the EUR-converted prices give a very good estimate of the actual price.
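Once the rates are fetched, the conversion itself is straightforward. A minimal sketch with hard-coded example rates standing in for the API response (the real code fetches the latest rates instead; the rate values below are illustrative only):

```python
# EUR rates per unit of local currency; in the project these came from the
# API rather than being hard-coded, and these values are illustrative.
rates_to_eur = {"HUF": 0.0025, "PLN": 0.22, "SEK": 0.094, "EUR": 1.0}

def to_eur(amount, currency):
    """Convert a local-currency amount to EUR using the latest known rate."""
    return round(amount * rates_to_eur[currency], 2)

print(to_eur(199_999, "HUF"))  # 500.0
print(to_eur(100, "EUR"))      # 100.0
```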
3.5 Search functionality development
To perform a search across all countries, the Excel files were each imported as a Pandas DataFrame. As the data formats (in particular the price) were not consistent across countries, some string operations first had to be applied to unify the format (removing ",-", "€", etc.) and establish consistency across the different DataFrames. The country DataFrames were then merged into one master DataFrame, on which the search is performed.
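The unification step can be sketched as follows; the column names and cleanup rules are simplified stand-ins for what the real datasets needed:

```python
import pandas as pd

# Two per-country frames with inconsistent price formats, as they might
# look when read from the cleaned Excel files (columns simplified).
df_nl = pd.DataFrame({"product": ["Huawei P30 128 GB"], "price": ["799,-"], "country": ["NL"]})
df_es = pd.DataFrame({"product": ["Huawei P30 128 GB"], "price": ["779 €"], "country": ["ES"]})

def clean_price(series):
    """Strip currency symbols and trailing ',-' notation, then parse as float."""
    return (series.str.replace("€", "", regex=False)
                  .str.replace(",-", "", regex=False)
                  .str.strip()
                  .astype(float))

# Stack the per-country frames into one master frame and unify the prices.
master = pd.concat([df_nl, df_es], ignore_index=True)
master["price"] = clean_price(master["price"])
print(master["price"].tolist())  # [799.0, 779.0]
```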
The search function has the signature search_query(df, column, search, search_leeway=0), where df is the DataFrame to be searched, column is the name of the column to search, search is the list of search keywords, and search_leeway is an integer (default 0) describing the "flexibility" of the search. For example, with a leeway of zero, a search for "Hubawei p30 128GB midnight black" returns no results because of the typo. With a leeway of one, the search also shows results for just "p30 128GB midnight black". With a leeway of 2, results for "p30 128GB black", "128gb midnight black" and so on are shown as well.
For each word in the search keyword list, the function iterates through the search column of the DataFrame; if a match is found in the string, it increments a "score" value for that row, stored in a new column. The function then returns the rows whose score is high enough, with the leeway lowering the required score.
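A minimal implementation of this scoring logic might look like the following. This is a sketch matching the behaviour described above; the actual function in the repository may differ in details:

```python
import pandas as pd

def search_query(df, column, search, search_leeway=0):
    """Score each row by how many search keywords occur in `column`
    (case-insensitive substring match), then keep rows whose score is
    within `search_leeway` of the number of keywords."""
    scores = df[column].str.lower().apply(
        lambda text: sum(word.lower() in text for word in search)
    )
    threshold = len(search) - search_leeway
    return df[scores >= threshold]

phones = pd.DataFrame({"product": [
    "Huawei P30 128GB midnight black",
    "Huawei P30 64GB breathing crystal",
]})

# With leeway 0 the typo 'Hubawei' kills the match; with leeway 1 the
# four remaining keywords are enough to hit the first row.
query = ["Hubawei", "p30", "128GB", "midnight", "black"]
print(len(search_query(phones, "product", query, search_leeway=0)))  # 0
print(len(search_query(phones, "product", query, search_leeway=1)))  # 1
```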
A simple Tkinter UI was created to host the search function and make the experience slightly more user-friendly.
4. Project Results
Finally, due to the lack of a front-end developer in the team, we developed a simple application based solely on Python. The user can type a query and search for desired products. The search engine displays all features from the database, such as brand name, model, capacity, country and prices converted to Euro. Moreover, the search result highlights in which country Media Markt offers the cheapest and the most expensive phone. The user is also given availability information and a link back to the product's original website. Due to time limitations the shipping cost is not available as of now, since this type of data was the hardest to obtain; implementing it remains a step for the future.
5. Conclusion & Learning
Breaking big problems down into smaller, solvable parts and actionable steps: this was the main learning we took away from the project. If you want to develop something new, first focus on getting it working before considering all the fancy details and features. That does not mean you should not think big, but start with a Minimum Viable Product (MVP), because there will be plenty of challenges, some you planned for and some you did not. If you try to tackle everything at once, you will eventually get nowhere and become frustrated. Focus on core functionality first, then add more features if time allows.
Being agile as a team included the learning described above, but also several other things that helped us make progress. Especially when you are new to the domain (Data Science, programming, etc.), you have limited capabilities and resources, so it is important to be flexible and to accept that sometimes you must scrap or change the current plan due to surrounding constraints. One improvement to our rather simple task workflow (Product Backlog -> Sprint Backlog -> To-Do -> In Progress -> Done -> Other) would have been a Dependencies step between In Progress and Done, holding tasks that could not be completed because of some restriction (e.g. needing to set up a server while currently lacking administrator access).
This project gave us a great opportunity to learn how Python can be used in practice for Data Science. It showed us the struggles faced by companies dealing with data and trying to obtain it. Although we all started our Python journey from scratch, we tried to approach the price recommendation engine task as realistically as we could. It was a great learning experience for us. Aware of the importance of the Big Data phenomenon, we believe this project helped us develop useful skills that will be valuable in our future roles.
6. Details about the project team
Michele Scarperi: Data Science
Daniel Kiss: Data Science
Aleksandra Zajaczkowska: Data Science
Julius Altenburger: Data Science
Fredrik Martin: Data Science
The project described in the article can be found in this GitHub repository.