- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series Analysis
In this part of the study, we will continue to use Lake Erie water level data to build ARIMA models and forecast into the future.
In this notebook, we will:
# General Imports
import pandas as pd
import itertools
import time
# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Time Series Imports
from statsmodels.tsa.stattools import adfuller #stationarity test
from statsmodels.tsa.statespace.tools import diff #differencing
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf #acf, pacf plots
from statsmodels.tsa.arima_model import ARIMA #ARIMA
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
We'll need to do some quick processing to read the data and give it a time series index. We will read the Lake Erie water level dataset from the previous notebook.
erie_data = pd.read_csv('erie_data')
erie_data.head()
# Convert date to datetime column
erie_data['date'] = pd.to_datetime(erie_data['date'])
erie_data.dtypes
# Set date as index
erie_data.set_index('date', inplace=True)
Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly etc.). Since observations occur at the start of each month, we'll use MS. A full list of time series offset aliases can be found here.
# Set Frequency
erie_data.index.freq = 'MS'
erie_data.index
ARIMA (Autoregressive Integrated Moving Average) [1] [5] models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.
ARIMA [6] is actually a combination of 3 models:
ARIMA [6] combines the ideas of autoregression and differencing. Instead of using autoregression with actual observed data, we use autoregression on the differences. An ARIMA(p,d,q) model can be written as: \begin{array}{l} y_{t}^{\prime}=c+\phi_{1} y_{t-1}^{\prime}+\cdots+\phi_{p} y_{t-p}^{\prime}+\theta_{1} \varepsilon_{t-1}+\cdots+\theta_{q} \varepsilon_{t-q}+\varepsilon_{t} \end{array} where $y_{t}^{\prime}$ is the differenced series. The “predictors” on the right hand side include both lagged values of $y_{t}$ and lagged errors.
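As a concrete example, substituting p = 1, d = 1 and q = 1 into the general form above gives the ARIMA(1,1,1) model: \begin{array}{l} y_{t}^{\prime}=c+\phi_{1} y_{t-1}^{\prime}+\theta_{1} \varepsilon_{t-1}+\varepsilon_{t} \end{array} where $y_{t}^{\prime}=y_{t}-y_{t-1}$ is the first-differenced series, so a single AR term and a single MA term act on the first differences.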
We will follow this 7 step process to build, validate various ARIMA(p,d,q) models and forecast for future periods:
Step 1 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can use ARIMA to forecast.
Step 2 — Difference the data: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference, then check for stationarity; take additional differences as needed. Make sure to check seasonal differencing as well.
Step 3 — Select AR and MA terms: Use the ACF and PACF to decide whether to include an AR term(s), MA term(s), or both.
Step 4 — Split the data into train and test sets.
Step 5 — Build the model: Build the model and obtain predictions on test data.
Step 6 — Validate model: Compare the predicted values to the actuals in the test sample.
Step 7 — Forecasting: Retrain Model on full dataset and forecast for future time periods.
To effectively use ARIMA models, we need to understand if our data is stationary. A stationary series [2] has constant mean and variance over time. A time series that shows seasonality or has trend is not stationary. In general, a stationary time series will have no predictable patterns in the long-term.
A test for stationarity usually involves a hypothesis test, where the null hypothesis $H_0$ is that the series is nonstationary. The alternate hypothesis $H_1$ supports stationarity. To determine whether a series is stationary we can use the augmented Dickey-Fuller Test.
# Run stationarity test
from statsmodels.tsa.stattools import adfuller
dftest = adfuller(erie_data['Erie'], autolag='AIC')
dftest
The test returns a tuple of values with no indication of what they mean. We can run help(adfuller) to understand what the values represent. Better yet, we will define a function that creates more informative output from the test. We use AIC for the autolag because it penalizes higher-order models relative to simpler models.
# Define the adf-test function
def aug_dftest(series, title=''):
    print(f'Augmented Dickey-Fuller Test for: {title}')
    result = adfuller(series.dropna(), autolag='AIC')  # dropna() handles differenced data
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    output = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        output[f'critical value ({key})'] = val
    print(output.to_string())
    if output['p-value'] <= 0.05:
        print('Reject the null hypothesis; data is stationary')
    else:
        print('Fail to reject the null hypothesis; data is non-stationary')
# Run the Test
aug_dftest(erie_data['Erie'], title='Lake Erie Data')
From the results above, we can see that the p-value > 0.05, which means we fail to reject the null hypothesis of non-stationarity; our data is non-stationary.
Differencing [3] is a method of transforming a non-stationary time series into a stationary one. Often the data is not stationary but differences in data might be stationary. If there is an obvious upward or downward linear trend, a first difference may be needed. A quadratic trend might need a 2nd order difference. Often (not always) a first difference (non-seasonal) will “detrend” the data.
The first differencing value ${y}^\prime_{t}$ is the difference between the observation at the current time period ${y}_{t}$ and the previous time period ${y}_{t-1}$. If the differenced values still do not fluctuate around a constant mean and variance, we take the second difference using the values of the first difference.
First Order Differencing \begin{aligned} y_{t}^{\prime} &= y_{t}-y_{t-1} \end{aligned}
Second Order Differencing \begin{aligned} y_{t}^{\prime \prime} &= y_{t}^{\prime}-y_{t-1}^{\prime} \\ &= \left(y_{t}-y_{t-1}\right)-\left(y_{t-1}-y_{t-2}\right) \\ &= y_{t}-2 y_{t-1}+y_{t-2} \end{aligned}
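As a quick sanity check, the same differences can be produced directly with pandas. This is only a sketch; apart from the leading NaN rows that pandas keeps, it should agree with the statsmodels diff() helper used below.
# Sketch: first and second differences with pandas (matching the formulas above)
first_diff = erie_data['Erie'].diff()            # y'_t = y_t - y_{t-1}
second_diff = erie_data['Erie'].diff().diff()    # y''_t = y'_t - y'_{t-1}
print(first_diff.dropna().head())
print(second_diff.dropna().head())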
Let's difference the data to make it stationary.
# Difference the data
from statsmodels.tsa.statespace.tools import diff
diff1_data = diff(erie_data['Erie'], k_diff=1)
# Plot differenced data
fig, axes = plt.subplots(2,1, figsize=(15,10))
erie_data['Erie'].plot(title='Original Data', ax=axes[0])
diff1_data.plot(title='1st Differenced Data', ax=axes[1])
axes[0].set(xlabel='')
axes[1].set(xlabel='')
fig.tight_layout(pad=3.0)
plt.show()
The plot above shows the original data and the first-differenced data. The first-differenced data appears to be stationary. Let's run the Augmented Dickey-Fuller Test to check for stationarity.
# Run the Test
aug_dftest(diff1_data, title='1st Differenced Data')
Test results confirm that we reached stationarity after the first difference (d = 1).
After a time series has been stationarized by differencing, the next step in fitting an ARIMA model is to determine whether AR or MA terms are needed to correct any autocorrelation that remains in the differenced series [1]. A PACF Plot can reveal recommended AR(p) orders, and an ACF Plot can do the same for MA(q) orders.
Autocorrelation Function (ACF) - the correlation of a series with itself lagged by x time units.
Partial Autocorrelation Function (PACF) - the partial correlation coefficients between the series and lags of itself, after removing the effects of the intermediate lags.
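For a numeric view of the same quantities, the acf and pacf functions in statsmodels return the correlation values directly. A small sketch (the plots below convey the same information graphically):
# Sketch: first few ACF/PACF values of the first-differenced data
from statsmodels.tsa.stattools import acf, pacf
print('ACF :', acf(diff1_data, nlags=5))
print('PACF:', pacf(diff1_data, nlags=5))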
Let's run these plots on the first order differenced data to identify the AR(p) and MA(q) orders for ARIMA model.
# Run ACF and PACF plots to determine order
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(2,2, figsize=(15,10))
erie_data['Erie'].plot(title='Original Data', ax=axes[0,0])
diff1_data.plot(title='1st Differenced Data', ax=axes[0,1])
plot_acf(diff1_data, lags=36, ax=axes[1,0], title='ACF - Differenced Data')
plot_pacf(diff1_data, lags=36, ax=axes[1,1], title='PACF - Differenced Data')
axes[0,0].set(xlabel='')
axes[0,1].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plot above shows the original data, the first-order differenced data, and the ACF and PACF plots of the differenced data. While differencing appears to remove the trend, as shown in the plot of the first-order differenced data, there are still signs of seasonality: the ACF plot displays a repeating pattern. A higher order of differencing is needed for the pattern to become mostly insignificant.
We will perform second- and third-order differencing of the data and compare the ACF and PACF plots for seasonality.
Note: Seasonality is reduced more quickly by taking a first-order seasonal difference together with a regular difference. We will discuss Seasonal Differencing and the Seasonal ARIMA model in the next section.
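As a brief preview only (the details are left to the next part), the same diff helper accepts seasonal arguments, so a regular plus a lag-12 seasonal difference could be checked like this:
# Sketch: one regular plus one seasonal (lag-12) difference, then re-test stationarity
seasonal_diff = diff(erie_data['Erie'], k_diff=1, k_seasonal_diff=1, seasonal_periods=12)
aug_dftest(seasonal_diff, title='Regular + Seasonal (12) Differenced Data')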
Let's difference the data to create second and third order differences.
# Difference the data
# Second Order Difference
diff2_data = diff(erie_data['Erie'], k_diff=2)
# Third Order Difference
diff3_data = diff(erie_data['Erie'], k_diff=3)
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,3, figsize=(15,15))
diff1_data.plot(title='First Order Differences', ax=axes[0,0])
plot_acf(diff1_data, lags=36, ax=axes[0,1], title='ACF')
plot_pacf(diff1_data, lags=36, ax=axes[0,2], title='PACF')
axes[0,0].set(xlabel='')
diff2_data.plot(title='Second Order Differences', ax=axes[1,0])
plot_acf(diff2_data, lags=36, ax=axes[1,1], title='ACF')
plot_pacf(diff2_data, lags=36, ax=axes[1,2], title='PACF')
axes[1,0].set(xlabel='')
diff3_data.plot(title='Third Order Differences', ax=axes[2,0])
plot_acf(diff3_data, lags=36, ax=axes[2,1], title='ACF')
plot_pacf(diff3_data, lags=36, ax=axes[2,2], title='PACF')
axes[2,0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plot above shows first, second and third order differenced data along with their ACF and PACF plots. It appears as though three orders of differencing are required to remove both trend and seasonality. From the 'Third Order Difference - ACF Plot', we can see that the pattern is now removed and most lags are within the bounds.
We will examine the second and third order differenced plots for further analysis.
Let's examine the Third Order Difference plots.
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,1, figsize=(15,12))
diff3_data.plot(title='Third Order Differences', ax=axes[0])
plot_acf(diff3_data, lags=48, ax=axes[1], title='ACF')
plot_pacf(diff3_data, lags=48, ax=axes[2], title='PACF')
axes[0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
From the 'Third Order Difference' ACF Plot, we can see that the pattern is now removed and most lags are within the bounds.
Notice that:
(a) ACF shows a sharp cutoff from lag 1 to lag 2.
(b) Lag 1 autocorrelation is negative as seen in the ACF plot.
(c) PACF shows an exponential decay over various lags.
(d) PACF cuts off at lag 15.
The rules for identifying AR and MA orders [1] are as follows:
If the PACF of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is positive--i.e., if the series appears slightly "underdifferenced"--then consider adding an AR term to the model. The lag at which the PACF cuts off is the indicated number of AR terms.
If the ACF of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is negative--i.e., if the series appears slightly "overdifferenced"--then consider adding an MA term to the model. The lag at which the ACF cuts off is the indicated number of MA terms.
As a rule of thumb, a stationarized series displays an AR, MA, or ARMA signature as follows:

|      | AR(p)                | MA(q)                | ARMA(p,q) |
|------|----------------------|----------------------|-----------|
| ACF  | Tails off            | Cuts off after lag q | Tails off |
| PACF | Cuts off after lag p | Tails off            | Tails off |
Given the cutoff at lag 15 on the third-order differenced PACF plot, and the lack of significant lags after lag 1 on the third-order differenced ACF plot, this would suggest starting with an ARIMA(15,3,1) model.
ARIMA(15,3,1) is a very complex model to start with. Also, the statsmodels package does not support an ARIMA model with d > 2; we would need SARIMAX to run a model as complex as ARIMA(15,3,1).
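For completeness, a d = 3 model could still be specified through SARIMAX. This is a sketch only; we do not fit it here.
# Sketch: specifying ARIMA(15,3,1) through SARIMAX, which allows d > 2
from statsmodels.tsa.statespace.sarimax import SARIMAX
complex_model = SARIMAX(erie_data['Erie'], order=(15, 3, 1))
# complex_results = complex_model.fit()  # left commented out; a 15-lag AR with d=3 is expensive to fit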
Let's explore the second order differenced data to see if we can find a simpler ARIMA model.
Let's examine the Second Order Difference Plots of the data.
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,1, figsize=(15,12))
diff2_data.plot(title='Second Order Differences', ax=axes[0])
plot_acf(diff2_data, lags=48, ax=axes[1], title='ACF')
plot_pacf(diff2_data, lags=48, ax=axes[2], title='PACF')
axes[0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plot above shows second order differenced data along with ACF and PACF plots.
Notice that:
(a) Lag 1 autocorrelation is negative as seen in the ACF plot.
(b) ACF shows an exponential decay over various lags.
(c) Lack of significant lags on both the ACF and PACF plots.
Given the lack of significant lags on the second-order differenced ACF and PACF plots, we may start with a simpler ARIMA(0,2,1) model and improve from there.
Our goal here is to find the optimal set of parameters (p,d,q) that yields the best performance for our model. There are multiple ways to determine the order (p,d,q) for an ARIMA(p,d,q) model. We will run a grid search with different values of the AR(p), I(d) and MA(q) terms.
Let's create:
The function will evaluate all the models created during the stepwise operation and report the best ARIMA model along with its AIC. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC allows us to compare how well models fit the data while accounting for model complexity, so a model that fits well using fewer parameters receives a better (lower) AIC score than a similar model that uses more parameters.
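For reference, $\mathrm{AIC}=2k-2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model; the $2k$ term is what penalizes additional complexity.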
Alternatively, the auto_arima tool from the pmdarima package can be used, which returns the optimal order for an ARIMA model.
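A minimal sketch of that alternative, assuming pmdarima is installed (it is not used elsewhere in this notebook):
# Sketch: letting pmdarima's auto_arima search for the (p,d,q) order
import pmdarima as pm
auto_model = pm.auto_arima(erie_data['Erie'], seasonal=False, stepwise=True,
                           trace=True, suppress_warnings=True)
print(auto_model.order)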
# Create a list of ARIMA model orders
p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))
print(pdq)
from statsmodels.tsa.arima_model import ARIMA
# Function to iterate over ARIMA orders
def stepwise_arima(data):
    model_res = {}
    for param in pdq:
        try:
            model = ARIMA(data, order=param)
            model_fit = model.fit(solver='nm')
            # print(f'ARIMA{param} - AIC: {model_fit.aic}')
            model_res[param] = model_fit.aic
        except Exception:
            # some (p,d,q) combinations fail to converge; skip them
            continue
    # Print model with lowest AIC
    low_key = min(model_res, key=model_res.get)
    print('\n')
    print('---------------------')
    print('MODEL WITH LOWEST AIC')
    print('---------------------')
    print(f'ARIMA{low_key} - AIC: {model_res[low_key]}')
# Run the model
stepwise_arima(erie_data['Erie'])
We can see how an optimal model of ARIMA(2,1,2) was determined by looking at the stepwise results. The recommended order is the one with the lowest Akaike information criterion or AIC score. Note that the recommended model may not be the one with the closest fit. The AIC score takes complexity into account, and tries to identify the best forecasting model.
Now that we have identified ARIMA(2,1,2) as the optimal model, we will split the data and fit this model on training and test sets.
We will split the data into train and test sets.
# Split the data
# TRAIN Data
train_data = erie_data.iloc[:1124]
train_data = train_data.astype('float64') # casting as float to avoid errors in prediction
# TEST Data
test_data = erie_data.iloc[1124:]
test_data = test_data.astype('float64')
We will build an ARIMA(2,1,2) model on the training data and fit the model.
from statsmodels.tsa.arima_model import ARIMA
ar212_model = ARIMA(train_data['Erie'], order=(2,1,2))
ar212_results = ar212_model.fit(solver='nm')
ar212_results.summary()
The table in the middle is the coefficients table, where the values under 'coef' are the weights of the respective terms. Notice here that the p-value in the 'P>|z|' column is <= 0.05 and hence significant. Let's obtain predictions from this model.
start = len(train_data)
end = len(train_data) + len(test_data)-1
ar212_predictions = ar212_results.predict(start=start, end=end, dynamic=False,
                                           typ='levels')
Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values). Passing typ='levels' predicts the levels of the original endogenous variable. If we had used the default typ='linear', we would have seen linear predictions in terms of the differenced endogenous variable. For more information on these arguments, check the statsmodels documentation here.
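Concretely, with $d=1$ the level forecasts are obtained by adding the predicted differences back onto the last observed level: $\hat{y}_{T+h}=y_{T}+\sum_{i=1}^{h} \hat{y}_{T+i}^{\prime}$. This is, in effect, the bookkeeping statsmodels performs when typ='levels' is passed.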
Let’s reorganize this set of predictions by creating a dataframe that contains our model predictions and then concatenating that with the test data.
ar212_pred = pd.DataFrame(ar212_predictions.values,
                          index=test_data.index,
                          columns=['ar212_predictions'])
test_data = pd.concat([test_data, ar212_pred], axis=1)
test_data[['Erie', 'ar212_predictions']].head()
To help us understand the accuracy of ARIMA model, we will plot predictions from the model against test data and calculate loss.
Let's plot Predictions against Test data.
with plt.style.context('fivethirtyeight'):
    ax = test_data['Erie'].plot(label='Test Data', legend=True, figsize=(15,6))
    ar212_predictions.plot(label='ARIMA(2,1,2) Predictions', legend=True)
    ax.autoscale(axis='x', tight=True)
    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Model Predictions');
From this plot, we can see that the model predictions are not close to the test data. The model tends to underestimate or overestimate the data. It seems to capture the seasonality but is not able to capture the trend in the test data.
We will evaluate our model predictions using RMSE.
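RMSE is the square root of the average squared difference between the actual and predicted values: \begin{array}{l} \mathrm{RMSE}=\sqrt{\frac{1}{n} \sum_{t=1}^{n}\left(y_{t}-\hat{y}_{t}\right)^{2}} \end{array} Lower values indicate predictions closer to the actuals, expressed in the same units as the data.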
from statsmodels.tools.eval_measures import rmse
error = rmse(test_data['Erie'], ar212_predictions)
print(f'ARIMA(2,1,2) RMSE: {error}')
The RMSE of the ARIMA(2,1,2) model is 0.294, which is much lower than the RMSE of the Double Exponential Smoothing model (2.2324).
We will now retrain the model on the full dataset and then use it to forecast 100 months into the future.
We will now retrain our model on the full data.
from statsmodels.tsa.arima_model import ARIMA
Arima_Model = ARIMA(erie_data['Erie'], order=(2,1,2))
Arima_results = Arima_Model.fit(solver='nm')
Arima_results.summary()
# Forecast for 100 months
start = len(erie_data)
end = len(erie_data) + 99
Arima_pred = Arima_results.predict(start=start, end=end, dynamic=False, typ='levels')
We will plot the forecasts for 100 months generated by ARIMA model.
with plt.style.context('fivethirtyeight'):
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel = 'Mean Water Levels (metric units)'
    ax = erie_data['Erie'].plot(legend=True, title=title, figsize=(18,8))
    Arima_pred.plot(label='ARIMA(2,1,2) Forecast', legend=True)
    ax.autoscale(axis='x', tight=True)
    ax.set(ylabel=ylabel)
    ax.set(xlabel='');
Let's zoom into the forecast time period to understand the trend and seasonality of the forecasts.
with plt.style.context('fivethirtyeight'):
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel = 'Mean Water Levels (metric units)'
    ax = erie_data['01-01-2016':]['Erie'].plot(legend=True, title=title, figsize=(18,8))
    Arima_pred.plot(label='ARIMA(2,1,2) Forecast', legend=True)
    ax.autoscale(axis='x', tight=True)
    ax.set(ylabel=ylabel)
    ax.set(xlabel='');
The plot shows the original data from 2016 - 2019 and the forecasts generated by the model for the next 100 months. The forecasts show seasonality in the future periods; however, there is no increasing or decreasing trend.
To summarize, in this notebook we:
We have seen that ARIMA(2,1,2) is the best-fit model, with a lower RMSE than the Double Exponential Smoothing model. The model seems to capture the seasonality but is not able to capture the trend in the test data. The forecasts also show seasonality in the future periods; however, there is no increasing or decreasing trend.
Seasonality is reduced more quickly by taking first-order seasonal and regular differences. We will discuss Seasonal Differencing and Seasonal ARIMA models in the next part of the study.
[1] https://people.duke.edu/~rnau/411arim.htm
[2] https://people.duke.edu/~rnau/411diff.htm
[3] Hyndman, R. J., & Athanasopoulos, G. (2014). Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp2/
[4] https://faculty.fuqua.duke.edu/~rnau/Decision411_2007/seasarim.htm
[5] Sokol, J. (2018). Module 7: Time Series Models [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6501-intro-analytics-modeling
[6] Serban, N. (2018). Unit 2: Auto Regressive and Moving Average (ARMA) Model [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis