- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series Analysis
In this study, we will continue to use Lake Erie water level data to build Seasonal ARIMA models and forecast into the future.
In this notebook, we will:
- Read and prepare the Lake Erie water level data as a time series
- Test the series for stationarity and apply seasonal and non-seasonal differencing
- Select Seasonal ARIMA orders using ACF/PACF plots and an AIC-based grid search
- Build candidate models, validate them against a test set, and compare RMSE
- Retrain the best model on the full dataset and forecast future water levels
# General Imports
import pandas as pd
import itertools
import time
# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Time Series Imports
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.tools import diff
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
We’ll need to do some quick processing to read the data and convert it to have a time series index. We will read the Erie water level dataset from the previous notebook.
erie_data = pd.read_csv('erie_data')
erie_data.head()
# Convert date to datetime column
erie_data['date'] = pd.to_datetime(erie_data['date'])
erie_data.dtypes
# Set date as index
erie_data.set_index('date', inplace=True)
Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly, etc.). Since observations occur at the start of each month, we'll use 'MS'. A full list of time series offset aliases can be found in the pandas documentation.
# Set Frequency
erie_data.index.freq = 'MS'
erie_data.index
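As a quick standalone illustration (a sketch, not part of the analysis), the 'MS' alias generates month-start timestamps:
# 'MS' stands for Month Start frequency
pd.date_range('2020-01-01', periods=4, freq='MS')
# DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01'], dtype='datetime64[ns]', freq='MS')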
Represented as: ARIMA(p,d,q) × (P,D,Q)[S]
A Seasonal ARIMA model [1] is formed by including additional seasonal terms in the ARIMA models. The seasonal part of the model consists of terms that are similar to the non-seasonal components of the model, but involve backshifts of the seasonal period.
In a Seasonal ARIMA model [2] [5], seasonal AR and MA terms predict using data values and errors at times with lags that are multiples of S (the span of the seasonality). With monthly data (and S = 12), a seasonal first order autoregressive model would use $x_{t-12}$ to predict $x_{t}$.
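Written out with $S = 12$, such a seasonal first order autoregressive model is simply:
$$x_{t} = \Phi_{1} x_{t-12} + w_{t}$$
where $w_t$ is white noise.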
A seasonal model is represented as ARIMA(p,d,q) × (P,D,Q)[S], where P is the number of seasonal autoregressive (SAR) terms, D is the number of seasonal differences, Q is the number of seasonal moving average (SMA) terms, and S is the span of the seasonality.
The non-seasonal components are:
$$\text{AR: } \phi(B) = 1 - \phi_{1} B - \ldots - \phi_{p} B^{p} \qquad \text{MA: } \theta(B) = 1 + \theta_{1} B + \ldots + \theta_{q} B^{q}$$
The seasonal components are:
$$\text{Seasonal AR: } \Phi(B^{S}) = 1 - \Phi_{1} B^{S} - \ldots - \Phi_{P} B^{PS} \qquad \text{Seasonal MA: } \Theta(B^{S}) = 1 + \Theta_{1} B^{S} + \ldots + \Theta_{Q} B^{QS}$$
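Putting the pieces together with the backshift operator $B$, the full ARIMA(p,d,q) × (P,D,Q)[S] model combines these polynomials with the differencing operators:
$$\phi(B)\,\Phi(B^{S})\,(1-B)^{d}\,(1-B^{S})^{D}\, y_{t} = \theta(B)\,\Theta(B^{S})\, w_{t}$$
where $w_t$ is white noise.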
We will follow this 7-step process to build and validate various Seasonal ARIMA models and forecast future periods:
Step 1 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can use ARIMA to forecast.
Step 2 — Difference: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference (seasonal and non-seasonal), then check for stationarity.
Step 3 — Select AR and MA terms: Use the ACF and PACF to decide whether to include an AR term(s), MA term(s), or both.
Step 4 — Split the data into train and test sets
Step 5 — Build the model: Build the model and obtain predictions on test data.
Step 6 — Validate model: Compare the predicted values to the actuals in the test sample.
Step 7 — Forecast: Retrain the model on the full dataset and forecast future time periods.
To effectively use ARIMA models, we need to understand if our data is stationary. A stationary series [2] has constant mean and variance over time. A time series that shows seasonality is not stationary.
A test for stationarity usually involves a hypothesis test, where the null hypothesis $H_0$ is that the series is nonstationary. The alternate hypothesis $H_1$ supports stationarity. To determine whether a series is stationary we can use the augmented Dickey-Fuller Test.
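As a minimal sketch of the test itself (a fuller helper function with formatted output is defined in the differencing section below):
# Run the ADF test on the raw Erie series; H0 = series is nonstationary
adf_stat, p_value, *_ = adfuller(erie_data['Erie'].dropna(), autolag='AIC')
print(f'ADF statistic: {adf_stat:.4f}, p-value: {p_value:.4f}')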
From the stationarity test results above in section 2.1, we can see that the p-value > 0.05, which means we fail to reject the null hypothesis of nonstationarity: our data is non-stationary.
Differencing [3] is a method of transforming a non-stationary time series into a stationary one. When both trend and seasonality are present, we may need to apply both a non-seasonal first difference and a seasonal difference. A key step in identifying a seasonal model is to determine whether a seasonal difference is needed, in addition to or perhaps instead of a non-seasonal difference. We must look at ACF and PACF plots for all possible combinations of 0 or 1 non-seasonal and seasonal differences [4].
A seasonal difference is the difference between an observation and the corresponding observation from the previous season:
$$y_{t}^{\prime} = y_{t} - y_{t-m}$$
where $m$ is the number of seasons (for monthly data, $m = 12$). These are also called “lag-m differences”, as we subtract the observation after a lag of $m$ periods.
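For intuition, a lag-12 seasonal difference is easy to compute by hand with pandas (a quick sketch using the erie_data frame loaded above):
# The seasonal difference subtracts the observation from the same month a year earlier
manual_sdiff = erie_data['Erie'] - erie_data['Erie'].shift(12)
# pandas' built-in diff(12) computes exactly the same lag-12 difference
assert manual_sdiff.equals(erie_data['Erie'].diff(12))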
We will plot first order difference plots along with ACF and PACF plots for both Seasonal and Non-seasonal differencing.
# Difference the data
from statsmodels.tsa.statespace.tools import diff
# Non-Seasonal First Order Difference
diff1_ns = diff(erie_data['Erie'], k_diff=1)
# Seasonal First Order Difference
diff1_s = diff(erie_data['Erie'], k_seasonal_diff=1)
# Seasonal and Non-Seasonal First Order Difference
diff1_sns = diff(erie_data['Erie'], k_diff=1, k_seasonal_diff=1)
# Run ACF and PACF plots of differenced data
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(3,3, figsize=(15,15))
diff1_ns.plot(title='Non-Seasonal 1st Difference', ax=axes[0,0])
plot_acf(diff1_ns, lags=36, ax=axes[0,1], title='ACF')
plot_pacf(diff1_ns, lags=36, ax=axes[0,2], title='PACF')
diff1_s.plot(title='Seasonal 1st Difference', ax=axes[1,0])
plot_acf(diff1_s, lags=36, ax=axes[1,1], title='ACF')
plot_pacf(diff1_s, lags=36, ax=axes[1,2], title='PACF')
diff1_sns.plot(title='Seasonal-Non-Sea. 1st Difference', ax=axes[2,0])
plot_acf(diff1_sns, lags=36, ax=axes[2,1], title='ACF')
plot_pacf(diff1_sns, lags=36, ax=axes[2,2], title='PACF')
fig.tight_layout(pad=1.0)
plt.show()
The plots above show the Non-Seasonal, Seasonal, and combined Seasonal and Non-Seasonal First Order Differenced series. We can see that the ACF and PACF plots for the Seasonal First Order Difference are very similar to the Seasonal and Non-Seasonal First Order Difference plots. We will test both the Seasonal, and the Seasonal and Non-Seasonal, First Order Differenced data for stationarity.
We will use the function defined in the ARIMA Models part of this study. This function creates more informative output from the test. We use AIC for the autolag because it penalizes higher-order models over simpler models.
# Define the adf-test function
from statsmodels.tsa.stattools import adfuller
def aug_dftest(series, title=''):
    print(f'Augmented Dickey-Fuller Test for: {title}')
    result = adfuller(series.dropna(), autolag='AIC')  # dropna() handles differenced data
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    output = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        output[f'critical value ({key})'] = val
    print(output.to_string())
    if output['p-value'] <= 0.05:  # label-based lookup is safer than positional output[1]
        print('Reject the Null hypothesis; Data is stationary')
    else:
        print('Fail to reject the Null hypothesis; Data is non-stationary')
# Run the Test
aug_dftest(diff1_s, title='Seasonal First Order Differenced Data')
# Run the Test
aug_dftest(diff1_sns, title='Seasonal and Non-Seasonal First Order Differenced Data')
Test results confirm that we reached stationarity after the first seasonal difference (D=1). The Seasonal and Non-Seasonal First Order Differenced data (d=1 and D=1) is also stationary.
A model with only Seasonal First Order Differencing will be simpler than a model with both Seasonal and Non-Seasonal First Order Differencing. However, when both trend and seasonality are present, we may need to apply both a non-seasonal first difference and a seasonal difference. We will look at the Seasonal, and the Seasonal and Non-Seasonal, First Order Differenced plots in detail to get an optimal ARIMA(p,d,q) × (P,D,Q) order.
The ACF and PACF plots for both the Seasonal, and the Seasonal and Non-Seasonal, First Order Differenced data look nearly identical. Let's examine the Seasonal First Order Difference plots of the data.
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,1, figsize=(15,12))
diff1_s.plot(title='Seasonal First Order Difference', ax=axes[0])
plot_acf(diff1_s, lags=48, ax=axes[1], title='ACF')
plot_pacf(diff1_s, lags=48, ax=axes[2], title='PACF')
axes[0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
Let's examine the Seasonal and Non-Seasonal First Order Difference Plots of the data.
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,1, figsize=(15,12))
diff1_sns.plot(title='Seasonal and Non-Seasonal First Order Difference', ax=axes[0])
plot_acf(diff1_sns, lags=48, ax=axes[1], title='Seasonal and Non-Seasonal First Order Difference - ACF')
plot_pacf(diff1_sns, lags=48, ax=axes[2], title='Seasonal and Non-Seasonal First Order Difference - PACF')
axes[0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plots above show the Seasonal, and the Seasonal and Non-Seasonal, First Order Differenced data along with their ACF and PACF plots.
Notice that:
(a) The ACF shows spikes at the seasonal lags 12, 24, etc., while the PACF cuts off after lags 12 and 24.
(b) There are some spikes in the ACF and PACF plots at non-seasonal lags 1 and 2, but they are not very significant.
The rules for identifying Seasonal AR and MA orders [4] [5] are as follows:
If the series has a strong and consistent seasonal pattern, then you should use an order of seasonal differencing, but never use more than one order of seasonal differencing or more than two orders of total differencing (seasonal + non-seasonal).
If the autocorrelation at the seasonal period is positive, consider adding an SAR term to the model. If the autocorrelation at the seasonal period is negative, consider adding an SMA term to the model. Do not mix SAR and SMA terms in the same model, and avoid using more than one of either kind.
Given the lack of significant lags in the non-seasonal ACF and PACF plots and the spikes at the seasonal ACF lags, we might start with a simpler Seasonal ARIMA(1,0,1)(1,1,0)[12] model and improve from there. Instead of hand-tuning, we will run a grid search (the seasonal_arima function defined below) to determine an optimal order.
From our analysis so far, we know that:
- The raw series is non-stationary.
- A seasonal first difference (D=1) makes the series stationary.
- Adding a non-seasonal first difference (d=1) also yields a stationary series.
Our goal here is to find the optimal set of parameters (p,d,q) x (P,D,Q) that yield the best performance for our model.
We will define a function seasonal_arima that runs a grid search over the various Seasonal and Non-Seasonal orders under the following differencing scenarios:
- Non-Seasonal Difference (d=0) and Seasonal Difference (D=1)
- Non-Seasonal Difference (d=1) and Seasonal Difference (D=1)
The function will evaluate all models created during the grid search and list the best Seasonal ARIMA model for each differencing scenario above, along with its AIC. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC allows us to compare how well models fit the data while accounting for complexity, so a model that fits well using fewer parameters receives a better (lower) AIC score than a similar model that uses more parameters.
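For reference, the AIC balances goodness of fit against model complexity:
$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$
where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model; lower values are better.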
Note that the results from all models can be printed by uncommenting the commented-out line in the function definition.
Non-Seasonal Difference (d=0) and Seasonal Difference (D=1)
# Create a list of Non-Seasonal orders
p = q = range(0,3)
d = range(0, 1)
ns_diff0_pdq = list(itertools.product(p, d, q))
print(ns_diff0_pdq)
# Create a list of Seasonal orders
seasonal_pdq = [(x[0], 1, x[2], 12) for x in ns_diff0_pdq]
print(seasonal_pdq)
# Function to iterate over Seasonal and Non-Seasonal orders
from statsmodels.tsa.statespace.sarimax import SARIMAX
def seasonal_arima(data, pdq):
    model_order = [0, 0]
    min_aic = float('inf')  # safer starting point than an arbitrary constant
    for param in pdq:
        for seas_param in seasonal_pdq:  # uses the seasonal order list defined above
            try:
                model = SARIMAX(data, order=param, seasonal_order=seas_param)
                model_fit = model.fit(method='nm')  # Nelder-Mead optimizer
                aic = model_fit.aic
                # print(f'SARIMA{param}x{seas_param} - AIC: {aic}')
                if aic < min_aic:
                    min_aic = aic
                    model_order[0] = param
                    model_order[1] = seas_param
            except Exception:
                continue
    # Print model with lowest AIC
    print('---------------------')
    print('MODEL WITH LOWEST AIC')
    print('---------------------')
    print(f'SARIMA{model_order[0]}x{model_order[1]} - AIC: {min_aic}')
Based on the number of Non-Seasonal (9) and Seasonal (9) orders this function will iterate over 81 (9x9) different models to return a model with the lowest AIC.
%%time
# Apply function on Erie data
seasonal_arima(erie_data['Erie'], ns_diff0_pdq)
Non-Seasonal Difference (d=1) and Seasonal Difference (D=1)
# Create a list of Non-Seasonal orders
p = q = range(0, 3)
d = range(1, 2)
ns_diff1_pdq = list(itertools.product(p, d, q))
print(ns_diff1_pdq)
# Create a list of Seasonal orders
seasonal_pdq = [(x[0], 1, x[2], 12) for x in ns_diff1_pdq]  # built from the d=1 order list
print(seasonal_pdq)
Based on the number of Non-Seasonal (9) and Seasonal (9) orders this function will iterate over 81 (9x9) different models to return a model with the lowest AIC.
%%time
# Apply function on Erie data
seasonal_arima(erie_data['Erie'], ns_diff1_pdq)
The results above show that the best model with Non-Seasonal Difference (d=0) and Seasonal Difference (D=1) is an ARIMA(1,0,1)(0,1,1,12) model (NS_Diff0_Model) with an AIC of -3247.453. This model has a lower AIC than the best Non-Seasonal Difference (d=1) and Seasonal Difference (D=1) model, an ARIMA(2,1,1)(0,1,2,12) (NS_Diff1_Model) with an AIC of -3236.416.
As mentioned above, the recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC for NS_Diff0_Model is lower, and it is a simpler model than NS_Diff1_Model. However, we will fit both models on our data to see which one fits better.
We will split the data into train and test sets.
# Split the data
# TRAIN Data
train_data = erie_data.iloc[:1124]
train_data = train_data.astype('float64') # casting as float to avoid errors in prediction
# TEST Data
test_data = erie_data.iloc[1124:]
test_data = test_data.astype('float64')
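(As an aside, a hypothetical alternative sketch: deriving the split point from the desired test-window length avoids hardcoding a row number. The names below are illustrative only.)
# Hypothetical alternative: hold out the final 36 months as the test set
split_point = len(erie_data) - 36
train_alt = erie_data.iloc[:split_point].astype('float64')
test_alt = erie_data.iloc[split_point:].astype('float64')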
We will build both SARIMA (1, 0, 1)x(0, 1, 1, 12) model and SARIMA (2, 1, 1)x(0, 1, 2, 12) model.
SARIMA (1,0,1)(0,1,1,12) Model (NS_Diff0_Model)
We will build a Seasonal ARIMA (1,0,1)(0,1,1,12) model on the training data and fit the model. This model has Non-Seasonal Difference (d=0) and Seasonal Difference (D=1).
# Build NS_Diff0_Model
NS_Diff0_Model = SARIMAX(train_data['Erie'], order=(1,0,1), seasonal_order=(0,1,1,12))
NS_Diff0_results = NS_Diff0_Model.fit(method='nm')  # SARIMAX.fit takes `method`, not `solver`
NS_Diff0_results.summary()
start = len(train_data)
end = len(train_data) + len(test_data)-1
NS_Diff0_predictions = NS_Diff0_results.predict(start=start, end=end,
dynamic=False, typ='levels')
Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values). Passing typ='levels' predicts the levels of the original endogenous variable; with the default typ='linear' we would have seen linear predictions in terms of the differenced endogenous variable. For more information on these arguments, see the statsmodels documentation.
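For contrast (a sketch only; not used in this analysis), passing dynamic=True would feed the model's own earlier forecasts back in as lagged inputs once past the start point, which typically widens errors over the horizon:
# Sketch: "dynamic" multi-step forecasts reuse predicted values as lagged inputs
dyn_predictions = NS_Diff0_results.predict(start=start, end=end,
                                           dynamic=True, typ='levels')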
Let’s reorganize this set of predictions by creating a dataframe that contains our model predictions and then concatenating that with the test data.
NS_Diff0 = pd.DataFrame(NS_Diff0_predictions.values,
index = test_data.index,
columns = ['NS_Diff0_predictions'])
test_data = pd.concat([test_data, NS_Diff0], axis=1)
test_data[['Erie', 'NS_Diff0_predictions']].head()
SARIMA (2,1,1)(0,1,2,12) Model (NS_Diff1_Model)
We will build a Seasonal ARIMA (2,1,1)(0,1,2,12) model on the training data and fit the model. This model has Non-Seasonal Difference (d=1) and Seasonal Difference (D=1).
# Build NS_Diff1_Model
NS_Diff1_Model = SARIMAX(train_data['Erie'], order=(2,1,1), seasonal_order=(0,1,2,12))
NS_Diff1_results = NS_Diff1_Model.fit(method='nm')
NS_Diff1_results.summary()
start = len(train_data)
end = len(train_data) + len(test_data)-1
NS_Diff1_predictions = NS_Diff1_results.predict(start=start, end=end,
dynamic=False, typ='levels')
As before, passing dynamic=False generates forecasts at each point using the full history up to that point, and typ='levels' returns predictions on the level of the original series (see the explanation for NS_Diff0_Model above).
Let’s reorganize this set of predictions by creating a dataframe that contains our model predictions and then concatenating that with the test data.
NS_Diff1 = pd.DataFrame(NS_Diff1_predictions.values,
index = test_data.index,
columns = ['NS_Diff1_predictions'])
test_data = pd.concat([test_data, NS_Diff1], axis=1)
test_data[['Erie', 'NS_Diff1_predictions']].head()
To help us understand the accuracy of our models, we will plot predictions from NS_Diff0_Model and NS_Diff1_Model against the test data and calculate the loss.
Let's plot Predictions from NS_Diff0_Model and NS_Diff1_Model against Test data.
with plt.style.context('fivethirtyeight'):
    fig, axes = plt.subplots(2,1, figsize=(18,15))
    # NS_Diff0_Model
    title = 'SARIMA (1,0,1)(0,1,1,12) Predictions (NS_Diff0_Model)'
    label = 'SARIMA(1,0,1)(0,1,1,12) Predictions'
    test_data['Erie'].plot(label='Test Data', legend=True, ax=axes[0], title=title)
    NS_Diff0_predictions.plot(label=label, legend=True, ax=axes[0])
    axes[0].set(xlabel='', ylabel='Mean Water Level (metric units)')
    # NS_Diff1_Model
    title = 'SARIMA(2,1,1)(0,1,2,12) Predictions (NS_Diff1_Model)'
    label = 'SARIMA(2,1,1)(0,1,2,12) Predictions'
    test_data['Erie'].plot(label='Test Data', legend=True, ax=axes[1], title=title)
    NS_Diff1_predictions.plot(label=label, legend=True, ax=axes[1])
    axes[1].set(xlabel='', ylabel='Mean Water Level (metric units)');
The plots above show the model predictions from NS_Diff0_Model and NS_Diff1_Model along with the observed test data. We can see that both models capture the seasonality of the data well. NS_Diff1_Model seems to capture the trend a little better than NS_Diff0_Model.
We will calculate the losses for both models to confirm performance.
We will evaluate our model predictions using RMSE.
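RMSE is the square root of the average squared prediction error over the $n$ test points:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_{t} - \hat{y}_{t}\right)^{2}}$$
where $y_t$ are the observed values and $\hat{y}_t$ the model predictions.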
from statsmodels.tools.eval_measures import rmse
NS_Diff0_error = rmse(test_data['Erie'], NS_Diff0_predictions)
NS_Diff1_error = rmse(test_data['Erie'], NS_Diff1_predictions)
print(f'SARIMA(1,0,1)(0,1,1,12) `NS_Diff0_Model` RMSE: {NS_Diff0_error}')
print(f'SARIMA(2,1,1)(0,1,2,12) `NS_Diff1_Model` RMSE: {NS_Diff1_error}')
The results above show that the RMSE of the SARIMA(2,1,1)(0,1,2,12) model (NS_Diff1_Model) is lower than that of NS_Diff0_Model, as it captures the trend and seasonality of the test data well. We will now use this model for forecasting.
We will now retrain the SARIMA(2,1,1)(0,1,2,12) model on the full dataset.
# Retrain on full dataset
SArima_Model = SARIMAX(erie_data['Erie'], order=(2,1,1), seasonal_order=(0,1,2,12))
SArima_results = SArima_Model.fit(method='nm')
SArima_results.summary()
We will run model diagnostics to investigate any unusual behavior.
with plt.style.context('fivethirtyeight'):
    SArima_results.plot_diagnostics(figsize=(16,8))
    plt.show()
The properties of the residuals should reflect the properties of the underlying process. The model diagnostics suggest [5] that:
- The residuals do not show a pattern.
- The correlogram equals 1 only at the initial lag; all other lags are small and within the confidence band.
- The model residuals are approximately normally distributed.
Top left: The residual errors seem to fluctuate around a mean of zero but have a non-uniform variance.
Top right: The density plot suggests a normal distribution with mean zero.
Bottom left: In the Q-Q plot, most points fall on the line, with some deviation toward the right, implying a slightly skewed distribution.
Bottom right: The correlogram (ACF plot) shows that the residual errors are not autocorrelated. Any autocorrelation would imply a pattern in the residual errors that is not explained by the model.
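As an optional numerical check on residual autocorrelation (a sketch using statsmodels' Ljung-Box test; not part of the original workflow):
# Ljung-Box test on the residuals; H0 = no autocorrelation up to the given lag
from statsmodels.stats.diagnostic import acorr_ljungbox
print(acorr_ljungbox(SArima_results.resid, lags=[12, 24]))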
We will forecast for 100 months into the future. Let's get the forecast results along with confidence intervals.
pred_uc = SArima_results.get_forecast(steps=100)
print(pred_uc.summary_frame())
with plt.style.context('fivethirtyeight'):
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel = 'Mean Water Levels (metric units)'
    pred_ci = pred_uc.conf_int()  # get confidence intervals
    ax = erie_data['Erie'].plot(label='Observed', figsize=(14, 7), title=title)
    pred_uc.predicted_mean.plot(ax=ax, label='Forecast')  # plot predictions
    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.25)
    ax.set_xlabel('')
    ax.set(ylabel=ylabel)
    plt.legend()
    plt.show()
Let's zoom into the forecast time period to understand the trend and seasonality of the forecasts.
with plt.style.context('fivethirtyeight'):
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel = 'Mean Water Levels (metric units)'
    pred_ci = pred_uc.conf_int()  # get confidence intervals
    ax = erie_data['01-01-2017':]['Erie'].plot(label='Observed', figsize=(14, 7), title=title)
    pred_uc.predicted_mean.plot(ax=ax, label='Forecast')  # plot predictions
    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.25)
    ax.set_xlabel('')
    ax.set(ylabel=ylabel)
    plt.legend()
    plt.show()
Our model clearly captured the seasonality and trend in water levels. The forecasts show some consistency in trend, with water levels not increasing as sharply as they have been for the last few years. As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.
To summarize, in this notebook we:
- Prepared the Lake Erie water level data as a monthly time series
- Tested for stationarity and applied seasonal and non-seasonal differencing
- Ran an AIC-based grid search over Seasonal ARIMA orders
- Built and validated two candidate models against a test set using RMSE
- Retrained the best model on the full dataset and forecast 100 months into the future
We have seen that the SARIMA(2,1,1)(0,1,2,12) model is the best model, as it captures the trend and seasonality of the data well and has the lowest RMSE. The model forecasts show consistency in trend, with water levels not increasing as sharply as they have been for the last few years. As we forecast into the future, we become less confident in our values, as reflected by the confidence intervals generated by our model, which grow larger as we move further out.
So far, we have looked at water level data for Lake Erie and forecast water levels into the future. Factors such as precipitation, air temperature, and snow cover can also impact water levels in Lake Erie. To consider these other variables when forecasting water levels, we will look at vector-based modeling techniques such as Vector AutoRegressive Moving Average (VARMA) models in the next part of this study.
[1] https://faculty.fuqua.duke.edu/~rnau/Decision411_2007/seasarim.htm
[2] https://online.stat.psu.edu/stat510/lesson/4/4.1
[3] Hyndman, R. J., & Athanasopoulos, G. (2014). Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp2/
[4] https://faculty.fuqua.duke.edu/~rnau/Decision411_2007/seasarim.htm
[5] Serban, N. (2018). Unit 2: Auto Regressive and Moving Average (ARMA) Model [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis