Francisca Dias
The only thing that's predictable, it seems, is oil's unpredictability.
I agree with the statement above, that the only thing that is predictable is oil's price unpredictability.
Therefore I have ahead a very difficult task of trying to predict oil prices in Ecuador.
Oil prices have significant impact on Ecuador's economy and it would be nice to see what the model has to tell us.
I am applying ARIMA since this is a time series data set. Oil prices range from 2013 till 2017.
I will try to explain as much as I can all steps I have taken in this analysis.
This dataset can be found here.
import warnings
warnings.filterwarnings('ignore')
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
oil = pd.read_csv("oil_ecuador.csv")
oil.head(3)
oil.tail(3)
# convert to datetime
oil['date'] = pd.to_datetime(oil['date'])
oil = oil.rename(columns={'dcoilwtico': 'oilprice'})
# set date as index
oil.set_index('date', inplace=True)
# Daily data can be difficult to work with since it’s a short amount of time, so I will use monthly averages instead.
oil = oil['oilprice'].resample('MS').mean()
plt.title("Ecuador's oil price Evolution")
plt.xlabel("Date")
plt.ylabel("Oil Price")
oil.plot(figsize=(15, 6))
plt.show()
There are patterns that can be noticed by looking at the graphic.
The time series has a modest seasonality pattern, where the oil price is higher in the middle of each year. But this my be insignificant.
And there is clearly an overall decreasing trend in oil prices.
For time series analysis, ARIMA model in one of the best methods that can be used to forecast.
It incorporates three parameters that take into account the properties of time series: seasonality, trend, and noise.
I will next describe how to automate the process of identifying the optimal set of parameters for the ARIMA model.
In order to do a correct parametrization of ARIMA models, I will make use of Python code to select the optimal parameter values for our ARIMA(p,d,q)(P,D,Q)s model.
I will use a "grid search" (or hyperparameter optimization) that will iterate over different combinations of parameters, and I will use the one that has the lowest AIC value.
The AIC (Akaike Information Criterion) measures how well a model fits the data while taking into account the overall complexity of the model.
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
Please note that I am using s= 12 since this dataset was converted to a monthly basis.
Therefore I have 12 observations for each year.
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
# 12 for yearly periods
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
warnings.filterwarnings("ignore") # specify to ignore warning messages
for param in pdq:
for param_seasonal in seasonal_pdq:
try:
mod = sm.tsa.statespace.SARIMAX(oil,
order=param,
seasonal_order=param_seasonal,
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
except:
continue
#ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:196.28174700310274
# This output yields the lowest AIC and I will be using these parameters to fit the model
mod = sm.tsa.statespace.SARIMAX(oil,
order=(1, 1, 0),
seasonal_order=(1, 1, 0, 12),
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
The coef column shows the weight of each feature and its impact on the data.
The P>|z| column informs us of the significance of each feature weight.
Here, 2 features have a p-value higher than 0.05, so it would be wise to remove them from the model since they appear to be insignificant.
After fitting the Arima model it is important to ensure that none of the assumptions have been violated.
The graphics below allows us to a generate model diagnostics and investigate for any unusual behavior.
results.plot_diagnostics(figsize=(16, 10))
plt.show()
First thing we have to look for is the residuals of our model.They have to be uncorrelated and normally distributed.
If it is not the case, we should further improve the model.
This is the time where we will use the model to predict future oil prices.
We will compare the predicted oil price with the real ones, which wil help us understand the accuracy of our forecasts
The attributes get_prediction() and conf_int() allow us to obtain the values and associated confidence intervals for forecasts of our dataset.
# the code below requires the forecasts to start at June 2015.
# the dynamic=False argument ensures that we produce one-step ahead forecasts,
# meaning that forecasts at each point are generated using the full history up to that point.
pred = results.get_prediction(start=pd.to_datetime('2015-06-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = oil['2013':].plot(label='observed', figsize=(15, 6))
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7)
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Ecuador Oil Prices')
plt.legend()
plt.show()
The forecasts align with the true values well, but don't show any obvious trend.
We should now quantify the accuracy of our forecasts.
I will use the MSE (Mean Squared Error), that tell us the average error of our forecasts.
For each predicted value, we compute its distance to the true value and square the result.
Please note that we need to square the errors so that positive and negative values don't cabcel each other.
y_forecasted = pred.predicted_mean
y_truth = oil['2015-06-01':]
# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The MSE of our one-step ahead forecasts yields a value of 35.06, which is a little high.
But there is a better representation of our predictive results can be obtained using dynamic forecasts, that is, setting dynamic=True.
This means that we use information from the data up to a certain point and then the forecasts will be generated using values from previous forecasted values
In the code chunk below, we specify to start computing the dynamic forecasts and confidence intervals from June 2015 onwards.
pred_dynamic = results.get_prediction(start=pd.to_datetime('2015-06-01'), dynamic=True)
pred_dynamic_ci = pred_dynamic.conf_int()
ax = oil['2013':].plot(label='observed', figsize=(15, 6))
pred_dynamic.predicted_mean.plot(label='Dynamic Forecast', ax=ax)
ax.fill_between(pred_dynamic_ci.index,
pred_dynamic_ci.iloc[:, 0],
pred_dynamic_ci.iloc[:, 1], color='k', alpha=.25)
ax.fill_betweenx(ax.get_ylim(), pd.to_datetime('2015-01-01'), oil.index[-1],
alpha=.1, zorder=-1)
ax.set_xlabel('Date')
ax.set_ylabel('Ecuador Oil Prices')
plt.legend()
plt.show()
Once again, we verify the performance of our forecasts by computing the MSE:
# Extract the predicted and true values of our time series
y_forecasted = pred_dynamic.predicted_mean
y_truth = oil['2015-06-01':]
# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Here we are relying on less historical data than before, but even that, the MSE is high enough that at this point I should be thinking about applying a different model to predict oil prices.
Oil prices are a tricky thing to predict and the natural volatilty attached to it makes it hard to get a reasonable result.
I will finally describe how to forecast future value using Arima.
The get_forecast() attribute of our time series object can compute forecasted values for a specified number of steps ahead, in my case, 20 steps.
# Get forecast 20 steps ahead in future
pred_uc = results.get_forecast(steps=20)
# Get confidence intervals of forecasts
pred_ci = pred_uc.conf_int()
ax = oil.plot(label='observed', figsize=(15, 6))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Ecuador Oil Prices')
plt.legend()
plt.show()