4YearElect U.S. — The development of a diverse country
This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 2020/21).
Abstract
In 2020, the global pandemic revealed how strongly the dynamics of a country can be influenced by external factors. Ever-changing political, socio-cultural, economic and institutional conditions yield a variety of KPIs for the 50 US states. To understand trends and structures in these areas and their influence on election outcomes, we outline and analyze differences between the states. Furthermore, evaluating and predicting KPIs helps to detect trends, react to them adequately and achieve long-term goals.
Around the US election and state characteristics — a visual analysis with Python
To understand whether certain characteristics of the 50 US states influence presidential election results, it is essential to observe not only monetary KPIs like GDP but also socio-cultural ones like educational level, the health sector and the crime rate. By monitoring these, the political orientation of a state can be estimated more accurately and predicted for election campaigns. With this aim, we combined both tracks, Data Science and AI, on the basis of one language: Python.
We began by searching for suitable data, which is where the first challenge came up. Because datasets for our purpose were hard to find, the original project topic (predicting elections and visualizing exactly who votes for which party) had to be adjusted slightly. As a result we used somewhat different KPIs, which at first glance do not seem to correlate with the election but are significant for the result in each state. Our aim was now to use KPIs that describe a state, look for patterns in the election outcomes and later use these patterns to predict a new election outcome.
Since we are examining the US states, a prerequisite for selecting the data was that all 50 states were included. The state also serves as the merge key later in the process. In total, we used 10 different data sets that were carefully screened and cleaned. For the later prediction it was important to build one data frame containing all KPIs for every state. During cleaning, the 10 data sets were therefore brought into the same pattern so that they could be merged with Python into one table covering sociological, educational, political, demographic and economic aspects. To visualize the resulting 238 columns for the 50 states, we used the packages matplotlib and seaborn.
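The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the three small tables, their column names and all numbers are hypothetical placeholders, and the helper normalize_state() is our own assumption about what "bringing the data sets into the same pattern" could look like.

```python
from functools import reduce

import pandas as pd

# Hypothetical placeholder tables (values are NOT real statistics).
# The project merged 10 cleaned data sets; three are enough to show the idea.
gdp = pd.DataFrame({"state": ["Alabama", "Florida", "Oregon"],
                    "gdp": [1.0, 2.0, 3.0]})
education = pd.DataFrame({"state": [" alabama", "FLORIDA", "Oregon "],
                          "graduate_rate": [0.1, 0.2, 0.3]})
crime = pd.DataFrame({"state": ["Oregon", "Alabama", "Florida"],
                      "crime_rate": [10.0, 20.0, 30.0]})


def normalize_state(df):
    """Give every table the same key format so the state column can be the merge key."""
    out = df.copy()
    out["state"] = out["state"].str.strip().str.title()
    return out


tables = [normalize_state(t) for t in (gdp, education, crime)]

# Merge all tables on the state column, one after another.
# validate="one_to_one" raises an error if a state appears twice in a table.
merged = reduce(
    lambda left, right: left.merge(right, on="state", how="inner",
                                   validate="one_to_one"),
    tables,
)
print(merged)
```

With an inner join, a state that is missing from any one table would silently drop out, so checking that the merged frame still has 50 rows is a cheap sanity test.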
The following two stacked bar plots, created with matplotlib, show the distribution of the election results by state from 1976 to 2016. Some states clearly vote for the same party repeatedly, such as Alabama (Republican), Florida (Republican), Minnesota (Democratic) and New York (Democratic). But there are also states where it is hard to say which party dominates (e.g., Oregon). Hence, observing trends and predicting results is crucial for campaign strategies.
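A stacked bar plot of this kind could be produced along these lines. The sketch assumes that each bar counts how many of the 11 elections between 1976 and 2016 a party won in a state, which is only one possible reading of the plots; the state labels and counts are invented placeholders, not real election results.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data: number of elections won per party (11 elections, 1976-2016).
wins = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "Democratic": [9, 2, 6],
    "Republican": [2, 9, 5],
})

fig, ax = plt.subplots(figsize=(6, 3))
# First series sits on the x-axis, second series is stacked on top of it.
ax.bar(wins["state"], wins["Democratic"], label="Democratic", color="tab:blue")
ax.bar(wins["state"], wins["Republican"], bottom=wins["Democratic"],
       label="Republican", color="tab:red")
ax.set_ylabel("Elections won (1976-2016)")
ax.legend()
fig.tight_layout()
```

The bottom argument is what turns two ordinary bar series into a stacked plot: the second series starts where the first one ends.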
Since swing states are decisive for the electoral result, we selected the 5 states (NV, FL, NC, NH, PA) that were most often classified as swing states in recent years. The line graph illustrates the total votes in these 5 states.
We can see a strong upward trend in total votes since 1996/2000, especially in Florida and Pennsylvania. Combining this line graph with the aforementioned bar plots, we see that states like Pennsylvania are especially crucial for elections since their vote is split nearly evenly between Republicans and Democrats. A high number of votes cast there therefore strongly influences the end result.
A further factor that influences political views is education. The top 5 states by graduate degree rate are Massachusetts, New York, Maryland, Virginia and Connecticut, and in these states people mainly vote for the Democratic Party. States like Nevada, Idaho, Maine, Oklahoma and Mississippi have the smallest share of students with a graduate degree. These states tend to vote for the Republican Party, which suggests a connection between voting behavior and educational attainment.
Especially during a global pandemic, having affordable health insurance, or any health insurance at all, is crucial for every individual. The number of insured people in each US state, shown here for the 5 swing states, clearly increased after Barack Obama signed the Affordable Care Act into law in 2010. After Donald J. Trump was elected, however, it shows a slight decrease.
To visualize the monetary KPIs of each state we used a boxplot. Looking at the distribution of US GDP, we can see that the biggest part is generated by California, Texas, New York and Florida. Each of the remaining states contributes a comparatively small part of US GDP.
Future outlook into elections — an algorithmic analysis with Python
With this information as our base, the question arose of how to predict future election outcomes. The goal was to predict the outcome in all 50 states in the presidential election. To simplify the model, primary elections and caucuses are not included, and the prediction refers only to whether a state voted Republican or Democratic. Because this is a binary classification problem, we built a model with Python's machine learning library scikit-learn.
To build and train the model, the election results of 2016 were selected as the labels and the remaining columns, such as GDP, crime and bachelor's degree rate, as the features (inputs). train_test_split() was used to split the data into training and testing sets with a test size of 25%.
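The split can be sketched as follows. The feature names and the synthetic values are placeholders for illustration only; stratify and random_state are our own additions and were not necessarily used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the merged table: 50 rows (states), 3 hypothetical KPI columns.
X = pd.DataFrame(rng.normal(size=(50, 3)),
                 columns=["gdp", "crime_rate", "bachelor_rate"])
# Binary label per state: 0 = Democratic, 1 = Republican (placeholder values).
y = pd.Series([0, 1] * 25, name="party_2016")

# test_size=0.25 as in the text; stratify keeps the party ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```

With only 50 rows, a 25% test set contains 13 states, so a single misclassified state already changes the test accuracy by several percentage points.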
Different supervised learning classifiers, namely K-Nearest Neighbors (KNN), Random Forest, Support Vector Machines (SVM) and Decision Trees, were fitted and evaluated. The Random Forest algorithm showed the best accuracy score of 90% with only one mislabeled element, as shown in the confusion matrix to the right. SVM followed with 83.5% accuracy, Decision Tree with 81.5% and KNN with 60.5%. All accuracy scores were obtained with cross-validation to get a more reliable estimate and guard against overfitting.
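A comparison of this kind could look roughly like the sketch below. It runs on synthetic data from make_classification, so the printed scores will not match the numbers reported above. The 5-fold setting, the default hyperparameters and the scaling step for KNN and SVM are our assumptions, not details taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 50 states with 10 placeholder KPI features.
X, y = make_classification(n_samples=50, n_features=10, n_informative=4,
                           random_state=0)

# KNN and SVM are distance based, so they are wrapped with a scaler here.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each classifier.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Putting the scaler inside the pipeline means it is fitted only on the training folds, which avoids leaking information from the validation fold into the model.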
Random Forest is an ensemble algorithm: it combines multiple classifiers of the same kind. It builds multiple decision trees from subsets of the training data and merges their votes, which yields a more accurate and stable prediction. The accuracy can be improved by tuning n_estimators (the number of trees in the forest), min_samples_split and other hyperparameters. After tuning, the model settled at 90% accuracy, which is a successful result considering the rather small data frame used as input.
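One common way to tune these hyperparameters is a grid search with cross-validation. The text does not say which tuning method was used, so the following is only a sketch of one option; the grid values and the synthetic data are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 50 states with 10 placeholder KPI features.
X, y = make_classification(n_samples=50, n_features=10, n_informative=4,
                           random_state=0)

# Small illustrative grid over the two hyperparameters named in the text.
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_split": [2, 4],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On a data set this small, differences between parameter combinations are often within the noise of the folds, so the best score is better read as a rough indication than as a precise ranking.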
To evaluate these results and return to our original goal of analyzing the structure behind voting outcomes, we used feature importance (classi.feature_importances_), which assigns a score to each input feature. Feature importance scores provide deeper insight into the dataset and the model. As seen on the right, the scores range from 0% to 6%.
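Reading the importances from a fitted Random Forest can be sketched as follows. The variable name classi follows the text; the feature names kpi_0 to kpi_5 and the synthetic data are hypothetical, so the resulting ranking says nothing about the real KPIs.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 50 rows, 6 placeholder KPI columns.
X_arr, y = make_classification(n_samples=50, n_features=6, n_informative=3,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=[f"kpi_{i}" for i in range(6)])

classi = RandomForestClassifier(n_estimators=100, random_state=0)
classi.fit(X, y)

# feature_importances_ holds one impurity-based score per feature; the scores sum to 1.
importances = pd.Series(classi.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the scores sum to 1 across all features, a table with 238 columns naturally yields small individual values, which fits the 0% to 6% range mentioned above.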
Even though the strength of influence varies between the KPIs, a general influence is detectable. Measuring the 50 states by different values helped to build an accurate classification model. In conclusion, election outcomes are influenced by many factors, and the KPIs presented above show only a fraction of them. Through evaluations and predictions we can come closer to understanding the structures behind them.
The Team
Annika Loos: Data Science with Python
Theresa Wild: Data Science with Python
Caren Vietor: Artificial Intelligence with Python
Mentors
Marcus