Francisca Dias
In this analysis I want to see the relationship between 6 Worldwide Governance Indicators and their impact on GDP per capita.
Is there a relationship between any of those indicators and GDP per capita?
How strong is that relationship?
What is the effect of each indicator in GDP per capita?
Given any indicator, can GDP per capita be predicted?
I want to know the impact of each one of these 6 Worldwide Governance Indicators separately (univariate model), and later include all of them in the model (multivariate). The sample includes all countries, and the indicators are from the year 2016. The dataset was taken from the World Bank.
I am applying two different python libraries for estimating the relationship between the dependent value (Log GDP per capita) and the independent variables (features) in the model: stats model and scikit learn. They will both lead to the same results.
I will interpret the results for both Univariate and Multivariate model, introduce some concepts and problems that may arise. I will approach multicollinearity and finally test/split the data so I can validate the model.
Data Preprocessing
Simple Linear Regression
Univariate Regression with Stats Model
Linear Regression with Stats Model
Linear Regression with Scikit-Learn
Multivariate Regression with Stats Model
Multivariate Regression with Scikit-Learn
Model Optimization
Dataset was taken from the World Bank Database.
Control of Corruption: Percentile Rank Control of Corruption captures perceptions of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as "capture" of the state by elites and private interests. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.
Government Effectiveness: Percentile Rank Government Effectiveness captures perceptions of the quality of public services, the quality of the civil service and the degree of its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government's commitment to such policies. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.
Political Stability and Absence of Violence/Terrorism Political Stability and Absence of Violence/Terrorism measures perceptions of the likelihood of political instability and/or politically-motivated violence, including terrorism. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.
Regulatory Quality: Percentile Rank Regulatory Quality captures perceptions of the ability of the government to formulate and implement sound policies and regulations that permit and promote private sector development. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.
Rule of Law: Percentile Rank Rule of Law captures perceptions of the extent to which agents have confidence in and abide by the rules of society, and in particular the quality of contract enforcement, property rights, the police, and the courts, as well as the likelihood of crime and violence. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.
Voice and Accountability: Percentile Rank Voice and Accountability captures perceptions of the extent to which a country's citizens are able to participate in selecting their government, as well as freedom of expression, freedom of association, and a free media. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.
GDP per capita (constant 2010 US$): GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2010 U.S. dollars.
This dataset can be found here.
Import the libraries.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
Import the dataset.
gov_data = pd.read_csv('World_Bank_Governance_Data.csv')
Rename the columns.
gov_data.rename(columns={'Country Name': 'Name',
'Country Code': 'Code',
'Series Name': 'Series',
'Series Code': 'Series_Code',
'2016 [YR2016]': 'Values' }, inplace=True)
Convert the datatype from 'Object' to Numeric.
gov_data = gov_data.convert_objects(convert_numeric=True)
Delete unnecessary columns.
del gov_data['Name']
del gov_data['Series_Code']
Map the dataset to new features names.
gov_data['Series'] = gov_data['Series'].map({'Control of Corruption: Percentile Rank': 'CC',
'Voice and Accountability: Percentile Rank':'VA',
'Government Effectiveness: Percentile Rank':'GE',
'Regulatory Quality: Percentile Rank':'RQ',
'Rule of Law: Percentile Rank':'RL',
'Political Stability and Absence of Violence/Terrorism: Percentile Rank':'PS'})
What are the features?
What is the response?
Transform the dataset in order to have the features in columns so I can perform the analysis.
gov_data_2 = gov_data.set_index('Code')
gov_data_3 = pd.pivot_table(gov_data_2,index=["Code"],values=["Values"],columns=["Series"], aggfunc=[np.sum])
gov_data_4 = pd.DataFrame(gov_data_3.to_records())
gov_data_4.columns.values[1] = 'CC'
gov_data_4.columns.values[2] = 'GE'
gov_data_4.columns.values[3] = 'PS'
gov_data_4.columns.values[4] = 'RL'
gov_data_4.columns.values[5] = 'RQ'
gov_data_4.columns.values[6] = 'VA'
df1 = gov_data_4
Import the dataset related to GDP per capita.
gdp_data = pd.read_csv('/Users/FranciscaDias/Kaggle_Temporary/***Kaggle_Competions***/8.Data_Extract_From_Global_Economic_Prospects/World_Bank_Governance/World_GDP_constant.csv')
gdp_data.head()
Rename the columns so I can merge with the first table.
gdp_data.rename(columns={'Country Name': 'Name',
'Country Code': 'Code',
'Series Name': 'Series',
'Series Code': 'Series_Code',
'2016 [YR2016]': 'GDP_Values' }, inplace=True)
del gdp_data['Name']
del gdp_data['Series']
del gdp_data['Series_Code']
gdp_data = gdp_data.convert_objects(convert_numeric=True)
gdp_data["log_gdp"] = np.log(gdp_data['GDP_Values'])
complete = pd.merge(df1, gdp_data, on='Code', how='outer')
complete.head(3)
How can we measure the impact of governance indicators in GDP per Capita?
I will start by plotting a heatmap where I can see the correlation coefficient between variables.
Next I will show the difference between GDP per capita and the log of gdp per capita.
I will plot the linear relatinoship between Government Effectiveness (GE) and log gdp, first aggregate, and then separate so we can see the countries.
I will measure the relationship between GE and log gdp throught OLS and intepret the results.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="white")
# Compute the correlation matrix
corr = complete.corr()
f, ax = plt.subplots(figsize=(10, 7))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, annot=True, cmap=cmap, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
As we would expect, most of Governance Indicators are positively correlated with gdp per capita. We can clearly see a redish square in the heatmap and it shows how significant the correlation is between these variables. This correlation is so strong that it can indicate a situation of multicollinearity. It is very likely that these variables, mainly CC, GE, PS, RL, RQ and VA give almost the same information, so multicollinearity really occurs.
total = complete.isnull().sum().sort_values(ascending=False)
percent = (complete.isnull().sum()/complete.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head()
We have to make sure there is no missing data.
complete=complete.dropna(axis=0)
Below I am calculating correlation using two libraries: scipy and numpy
from scipy.stats.stats import pearsonr
from numpy import corrcoef
a = complete['GE']
b = complete['log_gdp']
print(pearsonr(a,b))
print(np.corrcoef(a,b))
Our correlation estimation between target (log_gdp) and the GE indicator is 0.86 which is positive and strong.
This is also called partial correlation since we are trying to model the response (target) using just one predictor, in this case, GE.
sns.distplot(complete['GDP_Values']);
mean larger than the median.
sns.distplot(complete['log_gdp']);
complete.GE.describe()
complete.plot(x='GE', y='log_gdp', kind='scatter')
plt.show()
This plot shows a positive relationship between GE and log GDP per capita.
Higher Government Effectiveness appears to be positively correlated with wealthier economies.
In order to describe this relationship I chooose the linear model, that can be translated to:
log gdp = β0 + β1GE + ui
β0 is the intercept on the y axis
β1 is the line´s slope
ui, random error, are the deviations of observations from the line due to factors that were not considered in the model
Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:
y=β0+β1x
What does each term represent?
y is the response
x is the feature
β0 is the intercept
β1 is the coefficient for x
Together, β0 and β1 are called the model coefficients. To create your model, you must "learn" the values of these coefficients. And once we've learned these coefficients, we can use the model to predict log GDP!
X = complete['GE']
y = complete['log_gdp']
f, ax = plt.subplots(figsize=(13, 9))
labels = complete['Code']
plt.scatter(X, y, marker='')
for i, label in enumerate(labels):
plt.annotate(label, (X.iloc[i], y.iloc[i]))
plt.xlabel('Government Effectiveness: Percentile Rank in 2016')
plt.ylabel('Log GDP (constant 2010 US$)')
plt.title('OLS relationship between Government Effectiveness and Income')
sns.regplot(x="GE", y="log_gdp", data=complete);
By sorting the dataframe in descending order by log gdp values, if we call the head and tail methods we can see what are the countries with highest an lowest GPD, respectively.
Countries with highest gdp per capita:
Luxembourg
Norway
Switzerland
Ireland
Qatar
Countries with lowest gdp per capita:
Niger
Congo, Dem. Rep.
Liberia
Central African Republic
Burundi
highest = complete.sort_values(['log_gdp'], ascending=[False])
highest.head(5)
highest.tail(5)
To estimate the constant term β0, we need to add a column of 1’s to our dataset (consider the equation if β0 was replaced with β0x and xi=1)
The X variable needs to be extended by a constant value(); the bias will be calculated accordingly. As we might remember, the formula of linear regression is y = bX + b
complete.columns
The steps to building and using a model are:
Define: This is where we choose the model.
Fit : Capture patterns from provided data.
Predict:
Evaluate: here we evauate model´s accuracy
import statsmodels.api as sm
complete['const'] = 1
reg1 = sm.OLS(complete['log_gdp'],complete[['const', 'GE']])
results = reg1.fit()
results.summary()
import statsmodels.formula.api as smf
reg2 = smf.ols(formula = 'log_gdp ~ GE', data = complete)
results = reg2.fit()
results.summary()
We can now write our estimated relationship as
log gdp = 0.04 GE + 6.357
This equation describes the line that best fits our data.
Let´s calculate the average GE in our dataset:
complete.GE.mean()
Let us estimated the expected log of gdp per capita with an average GE of 49:
y = 0.04 * 49 + 6.357
y
Just a reminder that the beta of each feature becomes its unit change measure, which corresponds to the change the outcome will have if the feature increases one unit.
For instance, let us see what happens to log_gdp if we increase one point on GE, from 49 to 50:
y = 0.04 * 50 + 6.357
y
y0=8.317
y1=8.357
(y1-y0)*100
This result can be interpreted as the following:
An increase of unit change in GE, leads to an increase in log GDP by 4%.
This is the same as calling the predict() method:
If we hit the command results.predict() we will get an array of all predicted log gdp for every value of GE.
We can compare the observed and predicted values of log GDP by plotting them on the graph below.
f, ax = plt.subplots(figsize=(12, 9))
sns.regplot(x="GE", y = results.predict(), data = complete, label='predicted')
sns.regplot(x="GE", y = complete['log_gdp'], data = complete,label='observed')
plt.xlabel('Government Effectiveness: Percentile Rank in 2016')
plt.ylabel('Log GDP (constant 2010 US$)')
plt.title('OLS relationship between Government Effectiveness and Income')
plt.legend();
from sklearn import linear_model
linear_regression = linear_model.LinearRegression()
linear_regression
complete.head()
feature_cols = ['GE', 'const']
X = complete[feature_cols]
y = complete.log_gdp
linear_regression.fit(X, y)
# print intercept and coefficients
print(linear_regression.intercept_)
print(linear_regression.coef_)
We can add more variables to our model. So far we only focused on the impact of GE on log gdp, but we still have other variables that we can include in the model. For this purpose we go from a bivariate model to a multivariate model that reflects the additional variables.
X1 = ['const', 'GE', 'CC', 'PS', 'RL', 'RQ', 'VA']
# Estimate an OLS regression for all set of variables
reg3 = sm.OLS(complete['log_gdp'], complete[X1])
results = reg3.fit()
results.summary()
reg4 = smf.ols(formula = 'log_gdp ~ GE + CC + PS + RL + RQ + VA', data = complete)
results = reg4.fit()
results.summary()
Notes to the Results:
Remember the correlation matrix from the beginning?
correlation_matrix = complete.iloc[:, 1:7]
corr = correlation_matrix.corr()
corr
We can see that there's strong correlation between variables, they are all above 0.5.
Another way to see associations among variables is to use the eigenvectors. They recombine the variance among the variables, therefore it is easier to spot the multicollinearity.
# Consider the all columns except code, Value GDP and log gdp
variables = complete.columns[1:-3]
variables
eigenvalues, eigenvectors = np.linalg.eig(corr)
eigenvalues
eigenvectors
Let's investigate the eigenvector on last column, index 5:
eigenvectors[:, 5]
print(variables[1], variables[3])
Removing these two columns would be the best solution.
Now we want our model to generalize well on new data. Therefore we need to test it in that situation.
feature_cols = ['CC', 'PS','RQ', 'VA']
X = complete[feature_cols]
y = complete.log_gdp
Models' practical value come from making predictions on new data, so we should measure performance on data that wasn't used to build the model. Therefore we should split the data and test the model accuracy on data it hasn't seen before - validation data.
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)
from sklearn import linear_model
linear_regression = linear_model.LinearRegression()
linear_regression.fit(train_X, train_y)
# get predicted prices on validation data
val_predictions = linear_regression.predict(val_X)
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(val_y, val_predictions))