- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series Analysis
In this study, we will continue to use Lake Erie water level data to build Seasonal ARIMA models and forecast into the future.
In this notebook, we will:
- Read and prepare the Lake Erie water level data as a time series
- Test the series for stationarity and apply seasonal and non-seasonal differencing
- Select Seasonal ARIMA orders using ACF/PACF plots and an AIC-based grid search
- Build candidate models, validate them against a test set, and compare RMSE
- Retrain the best model on the full dataset and forecast future water levels
# General Imports
import pandas as pd
import itertools
import time
# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Time Series Imports
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.tools import diff
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
We’ll need to do some quick processing to read the data and convert it to have a time series index. We will read the Erie water level dataset from the previous notebook.
erie_data = pd.read_csv('erie_data')
erie_data.head()
# Convert date to datetime column
erie_data['date'] = pd.to_datetime(erie_data['date'])
erie_data.dtypes
# Set date as index
erie_data.set_index('date', inplace=True)
Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly, etc.). Since observations occur at the start of each month, we'll use 'MS'. A full list of time series offset aliases can be found in the pandas documentation.
# Set Frequency
erie_data.index.freq = 'MS'
erie_data.index
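As a quick standalone illustration (a sketch, not part of the analysis), the 'MS' alias generates month-start timestamps:
# 'MS' stands for Month Start frequency
pd.date_range('2020-01-01', periods=4, freq='MS')
# DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01'], dtype='datetime64[ns]', freq='MS')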
Represented as: ARIMA(p,d,q) × (P,D,Q)[S]
A Seasonal ARIMA model [1] is formed by including additional seasonal terms in the ARIMA models. The seasonal part of the model consists of terms that are similar to the non-seasonal components of the model, but involve backshifts of the seasonal period.
In a Seasonal ARIMA model [2] [5], seasonal AR and MA terms predict using data values and errors at times with lags that are multiples of S (the span of the seasonality). With monthly data (and S = 12), a seasonal first order autoregressive model would use $x_{t-12}$ to predict $x_{t}$.
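Written out with $S = 12$, such a seasonal first order autoregressive model is simply:
$$x_{t} = \Phi_{1} x_{t-12} + w_{t}$$
where $w_t$ is white noise.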
A seasonal model is represented as ARIMA(p,d,q) × (P,D,Q)[S], where P is the number of seasonal autoregressive (SAR) terms, D is the number of seasonal differences, Q is the number of seasonal moving average (SMA) terms, and S is the span of the seasonality.
The non-seasonal components are:
$$\text{AR: } \phi(B) = 1 - \phi_{1} B - \ldots - \phi_{p} B^{p} \qquad \text{MA: } \theta(B) = 1 + \theta_{1} B + \ldots + \theta_{q} B^{q}$$
The seasonal components are:
$$\text{Seasonal AR: } \Phi(B^{S}) = 1 - \Phi_{1} B^{S} - \ldots - \Phi_{P} B^{PS} \qquad \text{Seasonal MA: } \Theta(B^{S}) = 1 + \Theta_{1} B^{S} + \ldots + \Theta_{Q} B^{QS}$$
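Putting the pieces together with the backshift operator $B$, the full ARIMA(p,d,q) × (P,D,Q)[S] model combines these polynomials with the differencing operators:
$$\phi(B)\,\Phi(B^{S})\,(1-B)^{d}\,(1-B^{S})^{D}\, y_{t} = \theta(B)\,\Theta(B^{S})\, w_{t}$$
where $w_t$ is white noise.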
We will follow this 7-step process to build and validate various Seasonal ARIMA models and forecast future periods:
Step 1 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can use ARIMA to forecast.
Step 2 — Difference: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference (seasonal and non-seasonal), then check for stationarity.
Step 3 — Select AR and MA terms: Use the ACF and PACF to decide whether to include an AR term(s), MA term(s), or both.
Step 4 — Split the data into train and test sets
Step 5 — Build the model: Build the model and obtain predictions on test data.
Step 6 — Validate model: Compare the predicted values to the actuals in the test sample.
Step 7 — Forecast: Retrain the model on the full dataset and forecast future time periods.
To effectively use ARIMA models, we need to understand if our data is stationary. A stationary series [2] has constant mean and variance over time. A time series that shows seasonality is not stationary.
A test for stationarity usually involves a hypothesis test, where the null hypothesis $H_0$ is that the series is nonstationary. The alternate hypothesis $H_1$ supports stationarity. To determine whether a series is stationary we can use the augmented Dickey-Fuller Test.
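As a minimal sketch of the test itself (a fuller helper function with formatted output is defined in the differencing section below):
# Run the ADF test on the raw Erie series; H0 = series is nonstationary
adf_stat, p_value, *_ = adfuller(erie_data['Erie'].dropna(), autolag='AIC')
print(f'ADF statistic: {adf_stat:.4f}, p-value: {p_value:.4f}')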
From the stationarity test results above in section 2.1, we can see that the p-value > 0.05, which means we fail to reject the null hypothesis of nonstationarity: our data is non-stationary.
Differencing [3] is a method of transforming a non-stationary time series into a stationary one. When both trend and seasonality are present, we may need to apply both a non-seasonal first difference and a seasonal difference. A key step in identifying a seasonal model is to determine whether a seasonal difference is needed, in addition to or perhaps instead of a non-seasonal difference. We must look at ACF and PACF plots for all possible combinations of 0 or 1 non-seasonal and seasonal differences [4].
A seasonal difference is the difference between an observation and the corresponding observation from the previous season:
$$y_{t}^{\prime} = y_{t} - y_{t-m}$$
where $m$ is the number of seasons (for monthly data, $m = 12$). These are also called “lag-m differences”, as we subtract the observation after a lag of $m$ periods.
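For intuition, a lag-12 seasonal difference is easy to compute by hand with pandas (a quick sketch using the erie_data frame loaded above):
# The seasonal difference subtracts the observation from the same month a year earlier
manual_sdiff = erie_data['Erie'] - erie_data['Erie'].shift(12)
# pandas' built-in diff(12) computes exactly the same lag-12 difference
assert manual_sdiff.equals(erie_data['Erie'].diff(12))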
We will plot first order difference plots along with ACF and PACF plots for both Seasonal and Non-seasonal differencing.
# Difference the data
from statsmodels.tsa.statespace.tools import diff
# Non-Seasonal First Order Difference
diff1_ns = diff(erie_data['Erie'], k_diff=1)
# Seasonal First Order Difference
diff1_s = diff(erie_data['Erie'], k_seasonal_diff=1)
# Seasonal and Non-Seasonal First Order Difference
diff1_sns = diff(erie_data['Erie'], k_diff=1, k_seasonal_diff=1)
# Run ACF and PACF plots of differenced data
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(3,3, figsize=(15,15))
diff1_ns.plot(title='Non-Seasonal 1st Difference', ax=axes[0,0])
plot_acf(diff1_ns, lags=36, ax=axes[0,1], title='ACF')
plot_pacf(diff1_ns, lags=36, ax=axes[0,2], title='PACF')
diff1_s.plot(title='Seasonal 1st Difference', ax=axes[1,0])
plot_acf(diff1_s, lags=36, ax=axes[1,1], title='ACF')
plot_pacf(diff1_s, lags=36, ax=axes[1,2], title='PACF')
diff1_sns.plot(title='Seasonal-Non-Sea. 1st Difference', ax=axes[2,0])
plot_acf(diff1_sns, lags=36, ax=axes[2,1], title='ACF')
plot_pacf(diff1_sns, lags=36, ax=axes[2,2], title='PACF')
fig.tight_layout(pad=1.0)
plt.show()
The plots above show the Non-Seasonal, Seasonal, and combined Seasonal and Non-Seasonal First Order Differenced series. We can see that the ACF and PACF plots for the Seasonal First Order Difference are very similar to the Seasonal and Non-Seasonal First Order Difference plots. We will test both the Seasonal, and the Seasonal and Non-Seasonal, First Order Differenced data for stationarity.
We will use the function defined in the ARIMA Models part of this study. This function creates more informative output from the test. We use AIC for the autolag because it penalizes higher-order models over simpler models.
# Define the adf-test function
from statsmodels.tsa.stattools import adfuller
def aug_dftest(series, title=''):
    print(f'Augmented Dickey-Fuller Test for: {title}')
    result = adfuller(series.dropna(), autolag='AIC')  # dropna() handles differenced data
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    output = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        output[f'critical value ({key})'] = val
    print(output.to_string())
    if output['p-value'] <= 0.05:  # label-based lookup is safer than positional output[1]
        print('Reject the Null hypothesis; Data is stationary')
    else:
        print('Fail to reject the Null hypothesis; Data is non-stationary')
# Run the Test
aug_dftest(diff1_s, title='Seasonal First Order Differenced Data')
# Run the Test
aug_dftest(diff1_sns, title='Seasonal and Non-Seasonal First Order Differenced Data')
Test results confirm that we reached stationarity after the first seasonal difference (D=1). The Seasonal and Non-Seasonal First Order Differenced data (d=1 and D=1) is also stationary.
A model with only Seasonal First Order Differencing will be simpler than a model with both Seasonal and Non-Seasonal First Order Differencing. However, when both trend and seasonality are present, we may need to apply both a non-seasonal first difference and a seasonal difference. We will look at the Seasonal, and the Seasonal and Non-Seasonal, First Order Differenced plots in detail to get an optimal ARIMA(p,d,q) × (P,D,Q) order.
The ACF and PACF plots for both the Seasonal, and the Seasonal and Non-Seasonal, First Order Differenced data look nearly identical. Let's examine the Seasonal First Order Difference plots of the data.
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,1, figsize=(15,12))
diff1_s.plot(title='Seasonal First Order Difference', ax=axes[0])
plot_acf(diff1_s, lags=48, ax=axes[1], title='ACF')
plot_pacf(diff1_s, lags=48, ax=axes[2], title='PACF')
axes[0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
Let's examine the Seasonal and Non-Seasonal First Order Difference Plots of the data.
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,1, figsize=(15,12))
diff1_sns.plot(title='Seasonal and Non-Seasonal First Order Difference', ax=axes[0])
plot_acf(diff1_sns, lags=48, ax=axes[1], title='Seasonal and Non-Seasonal First Order Difference - ACF')
plot_pacf(diff1_sns, lags=48, ax=axes[2], title='Seasonal and Non-Seasonal First Order Difference - PACF')
axes[0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plots above show the Seasonal, and the Seasonal and Non-Seasonal, First Order Differenced data along with their ACF and PACF plots.
Notice that:
(a) The ACF shows spikes at the seasonal lags 12, 24, etc., while the PACF cuts off after lags 12 and 24.
(b) There are some spikes in the ACF and PACF plots at non-seasonal lags 1 and 2, but they are not very significant.
The rules for identifying Seasonal AR and MA orders [4] [5] are as follows:
If the series has a strong and consistent seasonal pattern, then you should use an order of seasonal differencing, but never use more than one order of seasonal differencing or more than two orders of total differencing (seasonal + non-seasonal).
If the autocorrelation at the seasonal period is positive, consider adding an SAR term to the model. If the autocorrelation at the seasonal period is negative, consider adding an SMA term to the model. Do not mix SAR and SMA terms in the same model, and avoid using more than one of either kind.
Given the lack of significant lags in the non-seasonal ACF and PACF plots and the spikes at the seasonal ACF lags, we might start with a simpler Seasonal ARIMA(1,0,1)(1,1,0)[12] model and improve from there. Instead of hand-tuning, we will run a grid search (the seasonal_arima function defined below) to determine an optimal order.
From our analysis so far, we know that:
- The raw series is non-stationary.
- A seasonal first difference (D=1) makes the series stationary.
- Adding a non-seasonal first difference (d=1) also yields a stationary series.
Our goal here is to find the optimal set of parameters (p,d,q) x (P,D,Q) that yield the best performance for our model.
We will define a function seasonal_arima that runs a grid search over the various Seasonal and Non-Seasonal orders under the following differencing scenarios:
- Non-Seasonal Difference (d=0) and Seasonal Difference (D=1)
- Non-Seasonal Difference (d=1) and Seasonal Difference (D=1)
The function will evaluate all models created during the grid search and list the best Seasonal ARIMA model for each differencing scenario above, along with its AIC. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC allows us to compare how well models fit the data while accounting for complexity, so a model that fits well using fewer parameters receives a better (lower) AIC score than a similar model that uses more parameters.
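For reference, the AIC balances goodness of fit against model complexity:
$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$
where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model; lower values are better.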
Note that the results from all models can be printed by uncommenting the commented-out line in the function definition.
Non-Seasonal Difference (d=0) and Seasonal Difference (D=1)
# Create a list of Non-Seasonal orders
p = q = range(0,3)
d = range(0, 1)
ns_diff0_pdq = list(itertools.product(p, d, q))
print(ns_diff0_pdq)
# Create a list of Seasonal orders
seasonal_pdq = [(x[0], 1, x[2], 12) for x in ns_diff0_pdq]
print(seasonal_pdq)
# Function to iterate over Seasonal and Non-Seasonal orders
from statsmodels.tsa.statespace.sarimax import SARIMAX
def seasonal_arima(data, pdq):
    model_order = [0, 0]
    min_aic = float('inf')  # safer starting point than an arbitrary constant
    for param in pdq:
        for seas_param in seasonal_pdq:  # uses the seasonal order list defined above
            try:
                model = SARIMAX(data, order=param, seasonal_order=seas_param)
                model_fit = model.fit(method='nm')  # Nelder-Mead optimizer
                aic = model_fit.aic
                # print(f'SARIMA{param}x{seas_param} - AIC: {aic}')
                if aic < min_aic:
                    min_aic = aic
                    model_order[0] = param
                    model_order[1] = seas_param
            except Exception:
                continue
    # Print model with lowest AIC
    print('---------------------')
    print('MODEL WITH LOWEST AIC')
    print('---------------------')
    print(f'SARIMA{model_order[0]}x{model_order[1]} - AIC: {min_aic}')
Based on the number of Non-Seasonal (9) and Seasonal (9) orders this function will iterate over 81 (9x9) different models to return a model with the lowest AIC.
%%time
# Apply function on Erie data
seasonal_arima(erie_data['Erie'], ns_diff0_pdq)
Non-Seasonal Difference (d=1) and Seasonal Difference (D=1)
# Create a list of Non-Seasonal orders
p = q = range(0, 3)
d = range(1, 2)
ns_diff1_pdq = list(itertools.product(p, d, q))
print(ns_diff1_pdq)
# Create a list of Seasonal orders
seasonal_pdq = [(x[0], 1, x[2], 12) for x in ns_diff1_pdq]  # built from the d=1 order list
print(seasonal_pdq)
Based on the number of Non-Seasonal (9) and Seasonal (9) orders this function will iterate over 81 (9x9) different models to return a model with the lowest AIC.
%%time
# Apply function on Erie data
seasonal_arima(erie_data['Erie'], ns_diff1_pdq)
The results above show that the best model with Non-Seasonal Difference (d=0) and Seasonal Difference (D=1) is an ARIMA(1,0,1)(0,1,1,12) model (NS_Diff0_Model) with an AIC of -3247.453. This model has a lower AIC than the best Non-Seasonal Difference (d=1) and Seasonal Difference (D=1) model, an ARIMA(2,1,1)(0,1,2,12) (NS_Diff1_Model) with an AIC of -3236.416.
As mentioned above, the recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC for NS_Diff0_Model is lower, and it is a simpler model than NS_Diff1_Model. However, we will fit both models on our data to see which one fits better.
We will split the data into train and test sets.
# Split the data
# TRAIN Data
train_data = erie_data.iloc[:1124]
train_data = train_data.astype('float64') # casting as float to avoid errors in prediction
# TEST Data
test_data = erie_data.iloc[1124:]
test_data = test_data.astype('float64')
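(As an aside, a hypothetical alternative sketch: deriving the split point from the desired test-window length avoids hardcoding a row number. The names below are illustrative only.)
# Hypothetical alternative: hold out the final 36 months as the test set
split_point = len(erie_data) - 36
train_alt = erie_data.iloc[:split_point].astype('float64')
test_alt = erie_data.iloc[split_point:].astype('float64')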
We will build both SARIMA (1, 0, 1)x(0, 1, 1, 12) model and SARIMA (2, 1, 1)x(0, 1, 2, 12) model.
SARIMA (1,0,1)(0,1,1,12) Model (NS_Diff0_Model)
We will build a Seasonal ARIMA (1,0,1)(0,1,1,12) model on the training data and fit the model. This model has Non-Seasonal Difference (d=0) and Seasonal Difference (D=1).
# Build NS_Diff0_Model
NS_Diff0_Model = SARIMAX(train_data['Erie'], order=(1,0,1), seasonal_order=(0,1,1,12))
NS_Diff0_results = NS_Diff0_Model.fit(method='nm')  # SARIMAX.fit takes `method`, not `solver`
NS_Diff0_results.summary()
start = len(train_data)
end = len(train_data) + len(test_data)-1
NS_Diff0_predictions = NS_Diff0_results.predict(start=start, end=end,
dynamic=False, typ='levels')
Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values). Passing typ='levels' predicts the levels of the original endogenous variable; with the default typ='linear' we would have seen linear predictions in terms of the differenced endogenous variable. For more information on these arguments, see the statsmodels documentation.
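For contrast (a sketch only; not used in this analysis), passing dynamic=True would feed the model's own earlier forecasts back in as lagged inputs once past the start point, which typically widens errors over the horizon:
# Sketch: "dynamic" multi-step forecasts reuse predicted values as lagged inputs
dyn_predictions = NS_Diff0_results.predict(start=start, end=end,
                                           dynamic=True, typ='levels')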
Let’s reorganize this set of predictions by creating a dataframe that contains our model predictions and then concatenating that with the test data.
NS_Diff0 = pd.DataFrame(NS_Diff0_predictions.values,
index = test_data.index,
columns = ['NS_Diff0_predictions'])
test_data = pd.concat([test_data, NS_Diff0], axis=1)
test_data[['Erie', 'NS_Diff0_predictions']].head()
SARIMA (2,1,1)(0,1,2,12) Model (NS_Diff1_Model)
We will build a Seasonal ARIMA (2,1,1)(0,1,2,12) model on the training data and fit the model. This model has Non-Seasonal Difference (d=1) and Seasonal Difference (D=1).
# Build NS_Diff1_Model
NS_Diff1_Model = SARIMAX(train_data['Erie'], order=(2,1,1), seasonal_order=(0,1,2,12))
NS_Diff1_results = NS_Diff1_Model.fit(method='nm')
NS_Diff1_results.summary()
start = len(train_data)
end = len(train_data) + len(test_data)-1
NS_Diff1_predictions = NS_Diff1_results.predict(start=start, end=end,
dynamic=False, typ='levels')
As before, passing dynamic=False generates forecasts at each point using the full history up to that point, and typ='levels' returns predictions on the level of the original series (see the explanation for NS_Diff0_Model above).
Let’s reorganize this set of predictions by creating a dataframe that contains our model predictions and then concatenating that with the test data.
NS_Diff1 = pd.DataFrame(NS_Diff1_predictions.values,
index = test_data.index,
columns = ['NS_Diff1_predictions'])
test_data = pd.concat([test_data, NS_Diff1], axis=1)
test_data[['Erie', 'NS_Diff1_predictions']].head()
To help us understand the accuracy of our models, we will plot predictions from NS_Diff0_Model and NS_Diff1_Model against the test data and calculate the loss.
Let's plot Predictions from NS_Diff0_Model and NS_Diff1_Model against Test data.
with plt.style.context('fivethirtyeight'):
    fig, axes = plt.subplots(2,1, figsize=(18,15))
    # NS_Diff0_Model
    title = 'SARIMA (1,0,1)(0,1,1,12) Predictions (NS_Diff0_Model)'
    label = 'SARIMA(1,0,1)(0,1,1,12) Predictions'
    test_data['Erie'].plot(label='Test Data', legend=True, ax=axes[0], title=title)
    NS_Diff0_predictions.plot(label=label, legend=True, ax=axes[0])
    axes[0].set(xlabel='', ylabel='Mean Water Level (metric units)')
    # NS_Diff1_Model
    title = 'SARIMA(2,1,1)(0,1,2,12) Predictions (NS_Diff1_Model)'
    label = 'SARIMA(2,1,1)(0,1,2,12) Predictions'
    test_data['Erie'].plot(label='Test Data', legend=True, ax=axes[1], title=title)
    NS_Diff1_predictions.plot(label=label, legend=True, ax=axes[1])
    axes[1].set(xlabel='', ylabel='Mean Water Level (metric units)');
The plots above show the model predictions from NS_Diff0_Model and NS_Diff1_Model along with the observed test data. We can see that both models capture the seasonality of the data well. NS_Diff1_Model seems to capture the trend a little better than NS_Diff0_Model.
We will calculate the losses for both models to confirm performance.
We will evaluate our model predictions using RMSE.
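RMSE is the square root of the average squared prediction error over the $n$ test points:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_{t} - \hat{y}_{t}\right)^{2}}$$
where $y_t$ are the observed values and $\hat{y}_t$ the model predictions.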
from statsmodels.tools.eval_measures import rmse
NS_Diff0_error = rmse(test_data['Erie'], NS_Diff0_predictions)
NS_Diff1_error = rmse(test_data['Erie'], NS_Diff1_predictions)
print(f'SARIMA(1,0,1)(0,1,1,12) `NS_Diff0_Model` RMSE: {NS_Diff0_error}')
print(f'SARIMA(2,1,1)(0,1,2,12) `NS_Diff1_Model` RMSE: {NS_Diff1_error}')
The results above show that the RMSE of the SARIMA(2,1,1)(0,1,2,12) model (NS_Diff1_Model) is lower than that of NS_Diff0_Model, as it captures the trend and seasonality of the test data well. We will now use this model for forecasting.
We will now retrain the SARIMA(2,1,1)(0,1,2,12) model on the full dataset.
# Retrain on full dataset
SArima_Model = SARIMAX(erie_data['Erie'], order=(2,1,1), seasonal_order=(0,1,2,12))
SArima_results = SArima_Model.fit(method='nm')
SArima_results.summary()
We will run model diagnostics to investigate any unusual behavior.
with plt.style.context('fivethirtyeight'):
    SArima_results.plot_diagnostics(figsize=(16,8))
    plt.show()
The properties of the residuals should reflect the properties of the underlying process. The model diagnostics suggest [5] that:
- The residuals do not show a pattern.
- The correlogram equals 1 only at the initial lag; all other lags are small and within the confidence band.
- The model residuals are approximately normally distributed.
Top left: The residual errors seem to fluctuate around a mean of zero but have a non-uniform variance.
Top right: The density plot suggests a normal distribution with mean zero.
Bottom left: In the Q-Q plot, most points fall on the line, with some deviation toward the right, implying a slightly skewed distribution.
Bottom right: The correlogram (ACF plot) shows that the residual errors are not autocorrelated. Any autocorrelation would imply a pattern in the residual errors that is not explained by the model.
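As an optional numerical check on residual autocorrelation (a sketch using statsmodels' Ljung-Box test; not part of the original workflow):
# Ljung-Box test on the residuals; H0 = no autocorrelation up to the given lag
from statsmodels.stats.diagnostic import acorr_ljungbox
print(acorr_ljungbox(SArima_results.resid, lags=[12, 24]))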
We will forecast for 100 months into the future. Let's get the forecast results along with confidence intervals.
pred_uc = SArima_results.get_forecast(steps=100)
print(pred_uc.summary_frame())
with plt.style.context('fivethirtyeight'):
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel = 'Mean Water Levels (metric units)'
    pred_ci = pred_uc.conf_int()  # get confidence intervals
    ax = erie_data['Erie'].plot(label='Observed', figsize=(14, 7), title=title)
    pred_uc.predicted_mean.plot(ax=ax, label='Forecast')  # plot predictions
    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.25)
    ax.set_xlabel('')
    ax.set(ylabel=ylabel)
    plt.legend()
    plt.show()
Let's zoom into the forecast time period to understand the trend and seasonality of the forecasts.
with plt.style.context('fivethirtyeight'):
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel = 'Mean Water Levels (metric units)'
    pred_ci = pred_uc.conf_int()  # get confidence intervals
    ax = erie_data['01-01-2017':]['Erie'].plot(label='Observed', figsize=(14, 7), title=title)
    pred_uc.predicted_mean.plot(ax=ax, label='Forecast')  # plot predictions
    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.25)
    ax.set_xlabel('')
    ax.set(ylabel=ylabel)
    plt.legend()
    plt.show()
Our model clearly captured the seasonality and trend in water levels. The forecasts show some consistency in trend, with water levels not increasing as sharply as they have been for the last few years. As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.
To summarize, in this notebook we:
- Prepared the Lake Erie water level data as a monthly time series
- Tested for stationarity and applied seasonal and non-seasonal differencing
- Ran an AIC-based grid search over Seasonal ARIMA orders
- Built and validated two candidate models against a test set using RMSE
- Retrained the best model on the full dataset and forecast 100 months into the future
We have seen that the SARIMA(2,1,1)(0,1,2,12) model is the best model, as it captures the trend and seasonality of the data well and has the lowest RMSE. The model forecasts show consistency in trend, with water levels not increasing as sharply as they have been for the last few years. As we forecast into the future, we become less confident in our values, as reflected by the confidence intervals generated by our model, which grow larger as we move further out.
So far, we have looked at water level data for Lake Erie and forecast water levels into the future. Factors such as precipitation, air temperature, and snow cover can also impact water levels in Lake Erie. To consider these other variables when forecasting water levels, we will look at vector-based modeling techniques such as Vector AutoRegressive Moving Average (VARMA) models in the next part of this study.
[1] https://faculty.fuqua.duke.edu/~rnau/Decision411_2007/seasarim.htm
[2] https://online.stat.psu.edu/stat510/lesson/4/4.1
[3] Hyndman, R. J., & Athanasopoulos, G. (2014). Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp2/
[4] https://faculty.fuqua.duke.edu/~rnau/Decision411_2007/seasarim.htm
[5] Serban, N. (2018). Unit 2: Auto Regressive and Moving Average (ARMA) Model [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis