Lake Erie Water Levels - A Time Series Analysis (Part 2/4)

  • 👟 Ready To Run!
  • 🚀 Data Exploration and Cleaning
  • 🤖 Machine Learning
  • ⏱️ Time Series Analysis

In this part of the study, we will continue to use Lake Erie water level data to build ARIMA models and forecast into the future.

In this notebook, we will:

  • Read Lake Erie water level data and index it as a time series
  • Discuss Stationarity of data
  • Difference the data to make it stationary
    • First order differencing
    • Second order differencing
    • Third order differencing
  • Use the ACF and PACF plots to decide AR term(s) and MA term(s) for ARIMA model
  • Identify the optimal ARIMA order using grid search
  • Split the data
  • Build and fit the optimal ARIMA model
    • Obtain Predictions
  • Evaluate Model performance
    • Plot predictions and calculate loss
  • Forecast for 100 months into the future

Part 2: ARIMA Models

In [1]:
# General Imports
import pandas as pd
import itertools
import time

# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Time Series Imports
from statsmodels.tsa.stattools import adfuller  #stationarity test
from statsmodels.tsa.statespace.tools import diff  #differencing
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf  #acf, pacf plots
from statsmodels.tsa.arima_model import ARIMA  #ARIMA

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

Read and Preprocess Data

We’ll need to do some quick processing to read the data and convert it to a time series index. We will read the Lake Erie water level dataset saved from the previous notebook.

In [3]:
erie_data = pd.read_csv('erie_data')
erie_data.head()
Out[3]:
date Erie DES12 TES12
0 1918-01-01 173.90 173.881745 173.859612
1 1918-02-01 173.82 173.749419 173.880169
2 1918-03-01 174.01 174.160248 174.171219
3 1918-04-01 174.02 174.051395 174.139210
4 1918-05-01 173.98 173.950891 174.031776
In [4]:
# Convert date to datetime column
erie_data['date'] = pd.to_datetime(erie_data['date'])
erie_data.dtypes
Out[4]:
date     datetime64[ns]
Erie            float64
DES12           float64
TES12           float64
dtype: object
In [5]:
# Set date as index
erie_data.set_index('date', inplace=True)

Setting a DatetimeIndex Frequency

Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly, etc.). Since observations occur at the start of each month, we'll use 'MS'. A full list of time series offset aliases can be found in the pandas documentation.

In [6]:
# Set Frequency
erie_data.index.freq = 'MS'
erie_data.index
Out[6]:
DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',
               '1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
               '1918-09-01', '1918-10-01',
               ...
               '2019-03-01', '2019-04-01', '2019-05-01', '2019-06-01',
               '2019-07-01', '2019-08-01', '2019-09-01', '2019-10-01',
               '2019-11-01', '2019-12-01'],
              dtype='datetime64[ns]', name='date', length=1224, freq='MS')

ARIMA(p,d,q) Models

ARIMA (Autoregressive Integrated Moving Average) [1] [5] models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

ARIMA [6] is actually a combination of 3 models:

  • AR(p) Autoregression - a regression model that utilizes the dependent relationship between a current observation and observations over a previous period. An AR(p) model can be written as: $y_{t}=c+\phi_{1} y_{t-1}+\phi_{2} y_{t-2}+\cdots+\phi_{p} y_{t-p}+\varepsilon_{t}$, where the $\phi_i$ are the autoregressive parameters and $\varepsilon_{t}$ is the error term.
  • I(d) Integration - uses differencing of observations (subtracting an observation from the observation at the previous time step) to make the time series stationary. First order differencing can be written as: $y_{t}^{\prime}=y_{t}-y_{t-1}$.
  • MA(q) Moving Average - a model that uses previous errors as predictors in a regression-like model. An MA(q) model can be written as: $y_{t}=c+\theta_{1} \varepsilon_{t-1}+\theta_{2} \varepsilon_{t-2}+\cdots+\theta_{q} \varepsilon_{t-q}+\varepsilon_{t}$, where the $\theta_i$ are the moving average parameters and $\varepsilon_{t}$ is the error term.

ARIMA [6] combines the ideas of autoregression and differencing. Instead of applying autoregression to the actual observed data, we apply it to the differences. An ARIMA(p,d,q) model can be written as: $y_{t}^{\prime}=c+\phi_{1} y_{t-1}^{\prime}+\cdots+\phi_{p} y_{t-p}^{\prime}+\theta_{1} \varepsilon_{t-1}+\cdots+\theta_{q} \varepsilon_{t-q}+\varepsilon_{t}$, where $y_{t}^{\prime}$ is the differenced series. The “predictors” on the right-hand side include both lagged values of $y_{t}^{\prime}$ and lagged errors.


We will follow this 7-step process to build and validate various ARIMA(p,d,q) models and forecast for future periods:

Step 1 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can use ARIMA to forecast.
Step 2 — Difference the data: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference, then check for stationarity. Take as many differences as it takes. Make sure you check seasonal differencing as well.
Step 3 — Select AR and MA terms: Use the ACF and PACF to decide whether to include an AR term(s), MA term(s), or both.
Step 4 — Split the data into train and test sets.
Step 5 — Build the model: Build the model and obtain predictions on test data.
Step 6 — Validate model: Compare the predicted values to the actuals in the test sample.
Step 7 — Forecasting: Retrain Model on full dataset and forecast for future time periods.

Step 1 - Test for Stationarity

To effectively use ARIMA models, we need to understand if our data is stationary. A stationary series [2] has constant mean and variance over time. A time series that shows seasonality or has trend is not stationary. In general, a stationary time series will have no predictable patterns in the long-term.

A test for stationarity usually involves a hypothesis test, where the null hypothesis $H_0$ is that the series is non-stationary. The alternative hypothesis $H_1$ supports stationarity. To determine whether a series is stationary we can use the augmented Dickey-Fuller test.

In [7]:
# Run stationarity test
from statsmodels.tsa.stattools import adfuller

dftest = adfuller(erie_data['Erie'], autolag='AIC')
dftest
Out[7]:
(-2.2215353279105003,
 0.19852241818344096,
 23,
 1200,
 {'1%': -3.435811119579282,
  '5%': -2.8639515467824075,
  '10%': -2.5680539506944444},
 -3053.729018688542)

The test returns several values with no indication of what they mean. We could run help(adfuller) to see what each value represents. Better yet, we will define a function that creates more informative output from the test. We pass autolag='AIC' so that the number of lags is chosen by minimizing the AIC, which penalizes more complex models over simpler ones.

In [8]:
# Define the adf-test function

def aug_dftest(series, title=''):
    
    print(f'Augmented Dickey Fuller Test for: {title}')
    result = adfuller(series.dropna(), autolag='AIC') # dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    output = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        output[f'critical value ({key})'] = val
        
    print(output.to_string())
        
    if output['p-value'] <= 0.05:
        print('Reject the Null hypothesis; Data is stationary')
    else:
        print('Fail to reject the Null hypothesis; Data is non-stationary')
          
In [9]:
# Run the Test
aug_dftest(erie_data['Erie'], title='Lake Erie Data')
Augmented Dickey Fuller Test for: Lake Erie Data
ADF test statistic        -2.221535
p-value                    0.198522
# lags used               23.000000
# observations          1200.000000
critical value (1%)       -3.435811
critical value (5%)       -2.863952
critical value (10%)      -2.568054
Fail to reject the Null hypothesis; Data is non-stationary

From the results above, we can see that the p-value > 0.05, which means we fail to reject the null hypothesis of non-stationarity; our data is non-stationary.

Step 2 - Differencing

Differencing [3] is a method of transforming a non-stationary time series into a stationary one. Often the data is not stationary but differences in data might be stationary. If there is an obvious upward or downward linear trend, a first difference may be needed. A quadratic trend might need a 2nd order difference. Often (not always) a first difference (non-seasonal) will “detrend” the data.

The first differencing value ${y}^\prime_{t}$ is the difference between the observation at current time period ${y}_{t}$ and the previous time period ${y}_{t-1}$. If these values fail to revolve around a constant mean and variance then we find the second differencing using the values of the first differencing.

First Order Differencing: $y_{t}^{\prime}=y_{t}-y_{t-1}$

Second Order Differencing: $y_{t}^{\prime\prime}=y_{t}^{\prime}-y_{t-1}^{\prime}=\left(y_{t}-y_{t-1}\right)-\left(y_{t-1}-y_{t-2}\right)=y_{t}-2 y_{t-1}+y_{t-2}$
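
As a quick sanity check, the same quantities can be computed with pandas' built-in .diff() method (a minimal sketch, assuming erie_data has been loaded as above; the variable names are illustrative):

# Minimal sketch: pandas' .diff() applies the formulas directly; the
# statsmodels diff() helper used below additionally drops the leading NaNs.
first_diff = erie_data['Erie'].diff()          # y'_t  = y_t  - y_{t-1}
second_diff = erie_data['Erie'].diff().diff()  # y''_t = y'_t - y'_{t-1}
print(first_diff.dropna().head(3))
print(second_diff.dropna().head(3))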

Let's difference the data to make it stationary.

In [10]:
# Difference the data
from statsmodels.tsa.statespace.tools import diff

diff1_data = diff(erie_data['Erie'], k_diff=1)
In [12]:
# Plot differenced data
fig, axes = plt.subplots(2,1, figsize=(15,10))

erie_data['Erie'].plot(title='Original Data', ax=axes[0])
diff1_data.plot(title='1st Differenced Data', ax=axes[1])
axes[0].set(xlabel='')
axes[1].set(xlabel='')

fig.tight_layout(pad=3.0)
plt.show()

The plot above shows original data and First differenced data. The first differenced data seems to be stationary. Let's run the Augmented Dickey-Fuller Test to check for stationarity.

Test Stationarity for First Order Differenced Data

In [13]:
# Run the Test
aug_dftest(diff1_data, title='1st Differenced Data')
Augmented Dickey Fuller Test for: 1st Differenced Data
ADF test statistic     -7.142271e+00
p-value                 3.300614e-10
# lags used             2.300000e+01
# observations          1.199000e+03
critical value (1%)    -3.435816e+00
critical value (5%)    -2.863954e+00
critical value (10%)   -2.568055e+00
Reject the Null hypothesis; Data is stationary

Test results confirm that we reach stationarity after a single difference (d = 1).

Step 3 - Determine the order of ARIMA(p,d,q) Model

After a time series has been stationarized by differencing, the next step in fitting an ARIMA model is to determine whether AR or MA terms are needed to correct any autocorrelation that remains in the differenced series [1]. A PACF Plot can reveal recommended AR(p) orders, and an ACF Plot can do the same for MA(q) orders.

Autocorrelation Function (ACF) - the correlation of a series with a lagged copy of itself, as a function of the lag.
Partial Autocorrelation Function (PACF) - the correlation between the series and a lagged copy of itself after removing the effects of the intermediate lags.
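
The numbers behind these plots can also be computed directly; a minimal sketch using the first order differenced series from above (acf and pacf return the correlation values at each lag):

# Minimal sketch: numeric ACF/PACF values for the first few lags of the
# first order differenced series (complements the plots below).
from statsmodels.tsa.stattools import acf, pacf

print(acf(diff1_data, nlags=5))
print(pacf(diff1_data, nlags=5))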

Let's run these plots on the first order differenced data to identify the AR(p) and MA(q) orders for ARIMA model.

In [14]:
# Run ACF and PACF plots to determine order
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2,2, figsize=(15,10))

erie_data['Erie'].plot(title='Original Data', ax=axes[0,0])
diff1_data.plot(title='1st Differenced Data', ax=axes[0,1])
plot_acf(diff1_data, lags=36, ax=axes[1,0], title='ACF - Differenced Data')
plot_pacf(diff1_data, lags=36, ax=axes[1,1], title='PACF - Differenced Data')
axes[0,0].set(xlabel='')
axes[0,1].set(xlabel='')

fig.tight_layout(pad=1.0)
plt.show()

The plot above shows the original data, the first order differenced data, and the ACF and PACF plots of the differenced data. While differencing appears to remove the trend, as shown in the plot of the first order differenced data, there are still signs of seasonality. The seasonality appears as a repeating pattern in the ACF plot. A higher order of differencing is needed for the pattern to become mostly insignificant.

We will perform second and third order differencing of the data and compare the ACF and PACF plots for seasonality.

Note: Seasonality is reduced more quickly by taking a first order seasonal difference along with a regular difference. We will discuss seasonal differencing and the Seasonal ARIMA model in the next part of the study; a quick preview appears below.
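
As a brief preview, the same diff() helper can combine a regular and a seasonal difference; a minimal sketch assuming a 12-month seasonal period (seasonal_diff_data is an illustrative name):

# Minimal sketch (assumption: a 12-month seasonal period). One regular plus
# one seasonal difference, checked with the aug_dftest() function from above.
seasonal_diff_data = diff(erie_data['Erie'], k_diff=1,
                          k_seasonal_diff=1, seasonal_periods=12)
aug_dftest(seasonal_diff_data, title='Regular + Seasonal Differenced Data')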

Second and Third Order Differencing

Let's difference the data to create second and third order differences.

In [15]:
# Difference the data

# Second Order Difference
diff2_data = diff(erie_data['Erie'], k_diff=2)

# Third Order Difference
diff3_data = diff(erie_data['Erie'], k_diff=3)
In [16]:
# Run ACF and PACF plots of differenced data

fig, axes = plt.subplots(3,3, figsize=(15,15))

diff1_data.plot(title='First Order Differences', ax=axes[0,0])
plot_acf(diff1_data, lags=36, ax=axes[0,1], title='ACF')
plot_pacf(diff1_data, lags=36, ax=axes[0,2], title='PACF')
axes[0,0].set(xlabel='')

diff2_data.plot(title='Second Order Differences', ax=axes[1,0])
plot_acf(diff2_data, lags=36, ax=axes[1,1], title='ACF')
plot_pacf(diff2_data, lags=36, ax=axes[1,2], title='PACF')
axes[1,0].set(xlabel='')

diff3_data.plot(title='Third Order Differences', ax=axes[2,0])
plot_acf(diff3_data, lags=36, ax=axes[2,1], title='ACF')
plot_pacf(diff3_data, lags=36, ax=axes[2,2], title='PACF')
axes[2,0].set(xlabel='')

fig.tight_layout(pad=1.0)
plt.show()

The plot above shows the first, second, and third order differenced data along with their ACF and PACF plots. It appears that three orders of differencing are required to remove both trend and seasonality. From the ACF plot of the third order differences, we can see that the pattern is now removed and most lags fall within the significance bounds.

We will examine the second and third order differenced plots for further analysis.

Examine Third Order Difference Plots

Let's examine the Third Order Difference Plots

In [17]:
# Run ACF and PACF plots of differenced data

fig, axes = plt.subplots(3,1, figsize=(15,12))

diff3_data.plot(title='Third Order Differences', ax=axes[0])
plot_acf(diff3_data, lags=48, ax=axes[1], title='ACF')
plot_pacf(diff3_data, lags=48, ax=axes[2], title='PACF')
axes[0].set(xlabel='')

fig.tight_layout(pad=1.0)
plt.show()

From the 'Third Order Difference' ACF Plot, we can see that the pattern is now removed and most lags are within the bounds.

Notice that:
(a) The ACF shows a sharp cutoff from lag 1 to lag 2.
(b) The lag-1 autocorrelation is negative, as seen in the ACF plot.
(c) The PACF shows an exponential decay over various lags.
(d) The PACF cuts off at lag 15.

The rules for identifying AR and MA orders [1] are as follows:

If the PACF of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is positive--i.e., if the series appears slightly "underdifferenced"--then consider adding an AR term to the model. The lag at which the PACF cuts off is the indicated number of AR terms.

If the ACF of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is negative--i.e., if the series appears slightly "overdifferenced"--then consider adding an MA term to the model. The lag at which the ACF cuts off is the indicated number of MA terms.

As a rule of thumb, a stationarized series displays an AR, MA, or ARMA signature as follows:

       AR(p)                  MA(q)                  ARMA(p,q)
ACF    Tails off              Cuts off after lag q   Tails off
PACF   Cuts off after lag p   Tails off              Tails off

Given the cutoff at lag 15 on the third order-differenced PACF plot, and the lack of significant lags after 1 on the third-differenced ACF plot, this would suggest starting with an ARIMA(15,3,1) model.

ARIMA(15,3,1) is a rather complex model to start with. Also, the statsmodels ARIMA class does not support d > 2; we would need SARIMAX to fit a model such as ARIMA(15,3,1), as sketched below.
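
For reference, a minimal sketch of how such a model could be fit with SARIMAX, which accepts any differencing order (the variable names are illustrative, and fitting a model this large can be slow, so we do not pursue it here):

# Minimal sketch (not pursued here): SARIMAX accepts any differencing order d.
from statsmodels.tsa.statespace.sarimax import SARIMAX

big_model = SARIMAX(erie_data['Erie'], order=(15, 3, 1))
big_results = big_model.fit(disp=False)  # may be slow / warn about convergence
print(big_results.aic)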

Let's explore the second order differenced data to see if we can find a simpler ARIMA model.

Examine Second Order Difference Plots

Let's examine the Second Order Difference Plots of the data.

In [18]:
# Run ACF and PACF plots of differenced data

fig, axes = plt.subplots(3,1, figsize=(15,12))

diff2_data.plot(title='Second Order Differences', ax=axes[0])
plot_acf(diff2_data, lags=48, ax=axes[1], title='ACF')
plot_pacf(diff2_data, lags=48, ax=axes[2], title='PACF')
axes[0].set(xlabel='')

fig.tight_layout(pad=1.0)
plt.show()

The plot above shows second order differenced data along with ACF and PACF plots.

Notice that:
(a) The lag-1 autocorrelation is negative, as seen in the ACF plot.
(b) The ACF shows an exponential decay over various lags.
(c) There is a lack of significant lags on both the ACF and PACF plots.

Given the lack of significant lags on the second order differenced ACF and PACF plots, we may start with a simpler ARIMA(0,2,1) model and improve from there.

Optimal ARIMA(p,d,q) Order

Our goal here is to find the optimal set of parameters (p,d,q) that yields the best performance for our model. There are multiple ways to determine the order (p,d,q) for an ARIMA(p,d,q) model. We will run a grid search over different values of the AR(p), I(d) and MA(q) terms.

Let's create:

  1. A list of different ARIMA orders to iterate over.
  2. A function that iterates through these orders and reports the best model, i.e., the one with the lowest AIC.

The function will evaluate all models created during the grid search and report the best ARIMA model along with its AIC. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC value allows us to compare how well models fit the data while accounting for model complexity, so models that fit well while using fewer parameters receive a better (lower) AIC score than similar models that use more parameters.
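
For reference, the AIC balances goodness of fit against model complexity: $\mathrm{AIC} = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model. Fewer parameters and a higher likelihood both lower the score.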

Alternatively, the auto_arima tool from the pmdarima package can be used, which returns the optimal order for an ARIMA model.
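
A minimal sketch of that alternative (assuming pmdarima is installed; auto_model is an illustrative name, and this cell is not run in this notebook):

# Minimal sketch (assumes the pmdarima package is installed).
# auto_arima performs a similar search and reports the lowest-AIC order.
from pmdarima import auto_arima

auto_model = auto_arima(erie_data['Erie'], start_p=0, max_p=2,
                        start_q=0, max_q=2, max_d=2, seasonal=False,
                        trace=True, error_action='ignore',
                        suppress_warnings=True)
print(auto_model.order)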

In [20]:
# Create a list of ARIMA model orders

p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))
print(pdq)
[(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 1, 2), (0, 2, 0), (0, 2, 1), (0, 2, 2), (1, 0, 0), (1, 0, 1), (1, 0, 2), (1, 1, 0), (1, 1, 1), (1, 1, 2), (1, 2, 0), (1, 2, 1), (1, 2, 2), (2, 0, 0), (2, 0, 1), (2, 0, 2), (2, 1, 0), (2, 1, 1), (2, 1, 2), (2, 2, 0), (2, 2, 1), (2, 2, 2)]
In [21]:
from statsmodels.tsa.arima_model import ARIMA

# Function to iterate over ARIMA orders and report the best fit
def stepwise_arima(data):
    model_res = {}
    for param in pdq:
        try:
            model = ARIMA(data, order=param)
            model_fit = model.fit(solver='nm')
#             print(f'ARIMA{param} - AIC: {model_fit.aic}')
            model_res[param] = model_fit.aic
        except:
            # skip orders that fail to converge or are invalid
            continue

    # Print model with lowest AIC
    low_key = min(model_res, key=model_res.get)
    print('\n')
    print('---------------------')
    print('MODEL WITH LOWEST AIC')
    print('---------------------')
    print(f'ARIMA{low_key} - AIC: {model_res[low_key]}')
In [22]:
# Run the model
stepwise_arima(erie_data['Erie'])
Optimization terminated successfully.
         Current function value: 0.403685
         Iterations: 18
         Function evaluations: 36
Optimization terminated successfully.
         Current function value: -0.195560
         Iterations: 36
         Function evaluations: 68
Optimization terminated successfully.
         Current function value: -0.906966
         Iterations: 1
         Function evaluations: 2
Optimization terminated successfully.
         Current function value: -1.070825
         Iterations: 20
         Function evaluations: 40
Optimization terminated successfully.
         Current function value: -1.090631
         Iterations: 43
         Function evaluations: 82
Optimization terminated successfully.
         Current function value: -0.957149
         Iterations: 1
         Function evaluations: 2
Optimization terminated successfully.
         Current function value: -0.965268
         Iterations: 11
         Function evaluations: 23
Optimization terminated successfully.
         Current function value: -1.067860
         Iterations: 181
         Function evaluations: 347
Optimization terminated successfully.
         Current function value: -0.914925
         Iterations: 33
         Function evaluations: 65
Optimization terminated successfully.
         Current function value: -1.084694
         Iterations: 49
         Function evaluations: 92
Optimization terminated successfully.
         Current function value: -1.109327
         Iterations: 70
         Function evaluations: 127
Optimization terminated successfully.
         Current function value: -1.085368
         Iterations: 22
         Function evaluations: 42
Optimization terminated successfully.
         Current function value: -1.094674
         Iterations: 50
         Function evaluations: 92
Optimization terminated successfully.
         Current function value: -1.090155
         Iterations: 75
         Function evaluations: 135
Optimization terminated successfully.
         Current function value: -0.962895
         Iterations: 61
         Function evaluations: 120
Optimization terminated successfully.
         Current function value: -1.116869
         Iterations: 53
         Function evaluations: 97
Optimization terminated successfully.
         Current function value: -1.120923
         Iterations: 71
         Function evaluations: 128
Optimization terminated successfully.
         Current function value: -1.121017
         Iterations: 111
         Function evaluations: 191
Optimization terminated successfully.
         Current function value: -1.097312
         Iterations: 45
         Function evaluations: 79
Optimization terminated successfully.
         Current function value: -1.169312
         Iterations: 120
         Function evaluations: 209
Warning: Maximum number of iterations has been exceeded.
Optimization terminated successfully.
         Current function value: -0.974263
         Iterations: 25
         Function evaluations: 49
Warning: Maximum number of iterations has been exceeded.


---------------------
MODEL WITH LOWEST AIC
---------------------
ARIMA(2, 1, 2) - AIC: -3049.447695911407

We can see how the optimal ARIMA(2,1,2) model was determined by looking at the grid search results. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. Note that the recommended model may not be the one with the closest fit; the AIC score takes complexity into account and tries to identify the model that will forecast best.

Now that we have identified ARIMA(2,1,2) as the optimal model, we will split the data and fit this model on training and test sets.

Step 4 - Split the Data

We will split the data into train and test sets, holding out the last 100 months (September 2011 through December 2019) as the test set.

In [25]:
# Split the data

# TRAIN Data
train_data = erie_data.iloc[:1124]
train_data = train_data.astype('float64') # casting as float to avoid errors in prediction

# TEST Data
test_data = erie_data.iloc[1124:]
test_data = test_data.astype('float64')

Step 5 - Build the Model

We will build an ARIMA(2,1,2) model on the training data and fit the model.

In [26]:
from statsmodels.tsa.arima_model import ARIMA

ar212_model = ARIMA(train_data['Erie'], order=(2,1,2))
ar212_results = ar212_model.fit(solver='nm')
ar212_results.summary()
Optimization terminated successfully.
         Current function value: -1.258087
         Iterations: 345
         Function evaluations: 612
Out[26]:
ARIMA Model Results
Dep. Variable: D.Erie No. Observations: 1123
Model: ARIMA(2, 1, 2) Log Likelihood 1412.832
Method: css-mle S.D. of innovations 0.068
Date: Sat, 29 Aug 2020 AIC -2813.664
Time: 12:29:27 BIC -2783.521
Sample: 02-01-1918 HQIC -2802.272
- 08-01-2011
coef std err z P>|z| [0.025 0.975]
const 0.0002 5.68e-06 32.046 0.000 0.000 0.000
ar.L1.D.Erie 1.7317 9.27e-05 1.87e+04 0.000 1.731 1.732
ar.L2.D.Erie -1.0000 nan nan nan nan nan
ma.L1.D.Erie -1.7325 0.005 -374.461 0.000 -1.742 -1.723
ma.L2.D.Erie 1.0000 0.005 187.835 0.000 0.990 1.010
Roots
Real Imaginary Modulus Frequency
AR.1 0.8658 -0.5003j 1.0000 -0.0834
AR.2 0.8658 +0.5003j 1.0000 0.0834
MA.1 0.8662 -0.4996j 1.0000 -0.0833
MA.2 0.8662 +0.4996j 1.0000 0.0833

The table in the middle is the coefficient table, where the values under ‘coef’ are the weights of the respective terms. Notice that, where reported, the p-values in the ‘P>|z|’ column are <= 0.05 and hence the terms are significant. Let's obtain predictions from this model.

Obtain Predictions

In [27]:
start = len(train_data)
end = len(train_data) + len(test_data)-1

ar212_predictions = ar212_results.predict(start=start, end=end, dynamic=False,
                                          typ='levels')

Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values). Passing typ='levels' predicts the levels of the original endogenous variables. If we'd used the default typ='linear' we would have seen linear predictions in terms of the differenced endogenous variables.

For more information on these arguments, see the statsmodels documentation.
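
To see how the two relate, the 'levels' output can be approximately reconstructed from the default differenced ('linear') predictions by undifferencing (a minimal sketch using the objects defined above; linear_preds and approx_levels are illustrative names):

# Minimal sketch: with d=1, the 'levels' output is essentially the last
# observed level plus the cumulative sum of the differenced predictions.
linear_preds = ar212_results.predict(start=start, end=end, dynamic=False)
approx_levels = train_data['Erie'].iloc[-1] + linear_preds.cumsum()
print(approx_levels.head())
print(ar212_predictions.head())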

Let’s reorganize this set of predictions by creating a dataframe that contains our model predictions and then concatenating that with the test data.

In [28]:
ar212_pred = pd.DataFrame(ar212_predictions.values,
                       index = test_data.index,
                       columns = ['ar212_predictions'])
test_data = pd.concat([test_data, ar212_pred], axis=1)
test_data[['Erie', 'ar212_predictions']].head()
Out[28]:
Erie ar212_predictions
date
2011-09-01 174.30 174.284671
2011-10-01 174.27 174.191203
2011-11-01 174.21 174.124725
2011-12-01 174.40 174.103124
2012-01-01 174.39 174.132243

Step 6 - Evaluate Model

To help us understand the accuracy of ARIMA model, we will plot predictions from the model against test data and calculate loss.

Plot Predictions

Let's plot Predictions against Test data.

In [29]:
with plt.style.context('fivethirtyeight'):

    ax = test_data['Erie'].plot(label='Test Data', legend=True, figsize=(15,6))
    ar212_predictions.plot(label='ARIMA(2,1,2) Predictions', legend=True)
    ax.autoscale(axis='x',tight=True)

    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Model Predictions');

From this plot, we can see that the model predictions are not close to the test data. The model tends to underestimate or overestimate the data. It seems to capture the seasonality but is not able to capture the trend in the test data.

Calculate Loss

We will evaluate our model predictions using RMSE.

In [30]:
from statsmodels.tools.eval_measures import rmse

error = rmse(test_data['Erie'], ar212_predictions)
print(f'ARIMA(2,1,2) RMSE: {error}')
ARIMA(2,1,2) RMSE: 0.29388421456122293

The RMSE of the ARIMA(2,1,2) model is 0.294. This RMSE is very low compared to the RMSE of the Double Exponential Smoothing model (2.2324).

Step 7 - Forecasting into the Future

We will now retrain the model on the full dataset and then forecast 100 months into the future.

Retrain on Full Data

We will now retrain our ARIMA(2,1,2) model on the full dataset.

In [31]:
from statsmodels.tsa.arima_model import ARIMA

Arima_Model = ARIMA(erie_data['Erie'], order=(2,1,2))
Arima_results = Arima_Model.fit(solver='nm')
Arima_results.summary()
Warning: Maximum number of iterations has been exceeded.
Out[31]:
ARIMA Model Results
Dep. Variable: D.Erie No. Observations: 1223
Model: ARIMA(2, 1, 2) Log Likelihood 1530.724
Method: css-mle S.D. of innovations 0.069
Date: Sat, 29 Aug 2020 AIC -3049.448
Time: 12:35:46 BIC -3018.793
Sample: 02-01-1918 HQIC -3037.911
- 12-01-2019
coef std err z P>|z| [0.025 0.975]
const 0.0006 4.9e-06 130.808 0.000 0.001 0.001
ar.L1.D.Erie 1.7317 8.52e-05 2.03e+04 0.000 1.732 1.732
ar.L2.D.Erie -1.0000 nan nan nan nan nan
ma.L1.D.Erie -1.7332 0.004 -438.833 0.000 -1.741 -1.726
ma.L2.D.Erie 1.0000 0.005 222.147 0.000 0.991 1.009
Roots
Real Imaginary Modulus Frequency
AR.1 0.8659 -0.5003j 1.0000 -0.0834
AR.2 0.8659 +0.5003j 1.0000 0.0834
MA.1 0.8666 -0.4990j 1.0000 -0.0831
MA.2 0.8666 +0.4990j 1.0000 0.0831
In [37]:
# Forecast for 100 months

start = len(erie_data)
end = len(erie_data) + 99

Arima_pred = Arima_results.predict(start=start, end=end, dynamic=False, typ='levels')

Plot Forecasts

We will plot the forecasts for 100 months generated by ARIMA model.

In [38]:
with plt.style.context('fivethirtyeight'):

    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel='Mean Water Levels (metric units)'

    ax = erie_data['Erie'].plot(legend=True, title=title, figsize=(18,8))
    Arima_pred.plot(label='ARIMA(2,1,2) Forecast', legend=True)
    ax.autoscale(axis='x',tight=True)
    ax.set(ylabel=ylabel)
    ax.set(xlabel='');

Let's zoom into the forecast time period to understand the trend and seasonality of the forecasts.

In [40]:
with plt.style.context('fivethirtyeight'):
    
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel='Mean Water Levels (metric units)'

    ax = erie_data['01-01-2016':]['Erie'].plot(legend=True, title=title, figsize=(18,8))
    Arima_pred.plot(label='ARIMA(2,1,2) Forecast', legend=True)
    ax.autoscale(axis='x',tight=True)
    ax.set(ylabel=ylabel)
    ax.set(xlabel='');

The plot shows the original data from 2016 through 2019 followed by the model's forecasts for the next 100 months. The forecasts show seasonality in the future periods; however, there is no increasing or decreasing trend.

Summary

To summarize, in this notebook we:

  1. Read Lake Erie water level data and indexed it as a time series.
  2. Checked stationarity of data.
  3. Differenced the data to make it stationary.
  4. Used the ACF and PACF plots to decide AR term(s) and MA term(s).
  5. Identified optimal ARIMA order using grid search.
  6. Split the data, built and fit optimal ARIMA model on the data.
  7. Validated model performance.
  8. Performed forecasting for future periods.

We have seen that ARIMA(2,1,2) is the best-fitting model (lowest AIC) among the orders searched. The model seems to capture the seasonality but is not able to capture the trend in the test data. The forecasts also show seasonality in the future periods; however, there is no increasing or decreasing trend.

Seasonality is reduced more quickly by taking first order seasonal and regular differences. We will discuss seasonal differencing and Seasonal ARIMA models in the next part of the study.

References

[1] https://people.duke.edu/~rnau/411arim.htm
[2] https://people.duke.edu/~rnau/411diff.htm
[3] Hyndman, R.J., & Athanasopoulos, G. (2014). Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp2/
[4] https://faculty.fuqua.duke.edu/~rnau/Decision411_2007/seasarim.htm
[5] Sokol, J. (2018). Module 7: Time Series Models [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6501-intro-analytics-modeling
[6] Serban, N. (2018). Unit 2: Auto Regressive and Moving Average (ARMA) Model [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis