- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series Analysis
In this part of the study, we will continue to use Lake Erie water level data to build ARIMA models and forecast into the future.
In this notebook, we will:
# General Imports
import pandas as pd
import itertools
import time
# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Time Series Imports
from statsmodels.tsa.stattools import adfuller #stationarity test
from statsmodels.tsa.statespace.tools import diff #differencing
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf #acf, pacf plots
from statsmodels.tsa.arima_model import ARIMA #ARIMA
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
We'll need to do some quick processing to read the data and give it a time series index. We will read the Lake Erie water level dataset from the previous notebook.
erie_data = pd.read_csv('erie_data')
erie_data.head()
# Convert date to datetime column
erie_data['date'] = pd.to_datetime(erie_data['date'])
erie_data.dtypes
# Set date as index
erie_data.set_index('date', inplace=True)
Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly etc.). Since observations occur at the start of each month, we'll use MS. A full list of time series offset aliases can be found here.
# Set Frequency
erie_data.index.freq = 'MS'
erie_data.index
ARIMA (Autoregressive Integrated Moving Average) [1] [5] models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.
ARIMA [6] is actually a combination of 3 models:
ARIMA [6] combines the ideas of autoregression and differencing. Instead of using autoregression with actual observed data, we use autoregression on the differences. An ARIMA(p,d,q) model can be written as: \begin{array}{l} y_{t}^{\prime}=c+\phi_{1} y_{t-1}^{\prime}+\cdots+\phi_{p} y_{t-p}^{\prime}+\theta_{1} \varepsilon_{t-1}+\cdots+\theta_{q} \varepsilon_{t-q}+\varepsilon_{t} \end{array} where $y_{t}^{\prime}$ is the differenced series. The “predictors” on the right hand side include both lagged values of $y_{t}$ and lagged errors.
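As a concrete example, substituting p = 1, d = 1 and q = 1 into the general form above gives the ARIMA(1,1,1) model: \begin{array}{l} y_{t}^{\prime}=c+\phi_{1} y_{t-1}^{\prime}+\theta_{1} \varepsilon_{t-1}+\varepsilon_{t} \end{array} where $y_{t}^{\prime}=y_{t}-y_{t-1}$ is the first-differenced series, so a single AR term and a single MA term act on the first differences.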
We will follow this 7 step process to build, validate various ARIMA(p,d,q) models and forecast for future periods:
Step 1 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can use ARIMA to forecast.
Step 2 — Difference the data: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference, then check for stationarity; take additional differences as needed. Make sure to check seasonal differencing as well.
Step 3 — Select AR and MA terms: Use the ACF and PACF to decide whether to include an AR term(s), MA term(s), or both.
Step 4 — Split the data into train and test sets.
Step 5 — Build the model: Build the model and obtain predictions on test data.
Step 6 — Validate model: Compare the predicted values to the actuals in the test sample.
Step 7 — Forecasting: Retrain Model on full dataset and forecast for future time periods.
To effectively use ARIMA models, we need to understand if our data is stationary. A stationary series [2] has constant mean and variance over time. A time series that shows seasonality or has trend is not stationary. In general, a stationary time series will have no predictable patterns in the long-term.
A test for stationarity usually involves a hypothesis test, where the null hypothesis $H_0$ is that the series is nonstationary. The alternate hypothesis $H_1$ supports stationarity. To determine whether a series is stationary we can use the augmented Dickey-Fuller Test.
# Run stationarity test
from statsmodels.tsa.stattools import adfuller
dftest = adfuller(erie_data['Erie'], autolag='AIC')
dftest
The test returns a tuple of values with no indication of what they mean. We can run help(adfuller) to understand what the values represent. Better yet, we will define a function that creates more informative output from the test. We use AIC for the autolag because it penalizes higher-order models relative to simpler models.
# Define the adf-test function
def aug_dftest(series, title=''):
    print(f'Augmented Dickey-Fuller Test for: {title}')
    result = adfuller(series.dropna(), autolag='AIC')  # dropna() handles differenced data
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    output = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        output[f'critical value ({key})'] = val
    print(output.to_string())
    if output['p-value'] <= 0.05:
        print('Reject the null hypothesis; data is stationary')
    else:
        print('Fail to reject the null hypothesis; data is non-stationary')
# Run the Test
aug_dftest(erie_data['Erie'], title='Lake Erie Data')
From the results above, we can see that the p-value > 0.05, which means we fail to reject the null hypothesis of non-stationarity; our data is non-stationary.
Differencing [3] is a method of transforming a non-stationary time series into a stationary one. Often the data is not stationary but differences in data might be stationary. If there is an obvious upward or downward linear trend, a first difference may be needed. A quadratic trend might need a 2nd order difference. Often (not always) a first difference (non-seasonal) will “detrend” the data.
The first differencing value ${y}^\prime_{t}$ is the difference between the observation at the current time period ${y}_{t}$ and the previous time period ${y}_{t-1}$. If the differenced values still do not fluctuate around a constant mean and variance, we take the second difference using the values of the first difference.
First Order Differencing \begin{aligned} y_{t}^{\prime} &= y_{t}-y_{t-1} \end{aligned}
Second Order Differencing \begin{aligned} y_{t}^{\prime \prime} &= y_{t}^{\prime}-y_{t-1}^{\prime} \\ &= \left(y_{t}-y_{t-1}\right)-\left(y_{t-1}-y_{t-2}\right) \\ &= y_{t}-2 y_{t-1}+y_{t-2} \end{aligned}
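As a quick sanity check, the same differences can be produced directly with pandas. This is only a sketch; apart from the leading NaN rows that pandas keeps, it should agree with the statsmodels diff() helper used below.
# Sketch: first and second differences with pandas (matching the formulas above)
first_diff = erie_data['Erie'].diff()            # y'_t = y_t - y_{t-1}
second_diff = erie_data['Erie'].diff().diff()    # y''_t = y'_t - y'_{t-1}
print(first_diff.dropna().head())
print(second_diff.dropna().head())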
Let's difference the data to make it stationary.
# Difference the data
from statsmodels.tsa.statespace.tools import diff
diff1_data = diff(erie_data['Erie'], k_diff=1)
# Plot differenced data
fig, axes = plt.subplots(2,1, figsize=(15,10))
erie_data['Erie'].plot(title='Original Data', ax=axes[0])
diff1_data.plot(title='1st Differenced Data', ax=axes[1])
axes[0].set(xlabel='')
axes[1].set(xlabel='')
fig.tight_layout(pad=3.0)
plt.show()
The plot above shows the original data and the first-differenced data. The first-differenced data appears to be stationary. Let's run the Augmented Dickey-Fuller Test to check for stationarity.
# Run the Test
aug_dftest(diff1_data, title='1st Differenced Data')
Test results confirm that we reached stationarity after the first difference (d = 1).
After a time series has been stationarized by differencing, the next step in fitting an ARIMA model is to determine whether AR or MA terms are needed to correct any autocorrelation that remains in the differenced series [1]. A PACF Plot can reveal recommended AR(p) orders, and an ACF Plot can do the same for MA(q) orders.
Autocorrelation Function (ACF) - the correlation of a series with itself lagged by x time units.
Partial Autocorrelation Function (PACF) - the partial correlation coefficients between the series and lags of itself, after removing the effects of the intermediate lags.
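For a numeric view of the same quantities, the acf and pacf functions in statsmodels return the correlation values directly. A small sketch (the plots below convey the same information graphically):
# Sketch: first few ACF/PACF values of the first-differenced data
from statsmodels.tsa.stattools import acf, pacf
print('ACF :', acf(diff1_data, nlags=5))
print('PACF:', pacf(diff1_data, nlags=5))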
Let's run these plots on the first order differenced data to identify the AR(p) and MA(q) orders for ARIMA model.
# Run ACF and PACF plots to determine order
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(2,2, figsize=(15,10))
erie_data['Erie'].plot(title='Original Data', ax=axes[0,0])
diff1_data.plot(title='1st Differenced Data', ax=axes[0,1])
plot_acf(diff1_data, lags=36, ax=axes[1,0], title='ACF - Differenced Data')
plot_pacf(diff1_data, lags=36, ax=axes[1,1], title='PACF - Differenced Data')
axes[0,0].set(xlabel='')
axes[0,1].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plot above shows the original data, the first-order differenced data, and the ACF and PACF plots of the differenced data. While differencing appears to remove the trend, as shown in the plot of the first-order differenced data, there are still signs of seasonality: the ACF plot displays a repeating pattern. A higher order of differencing is needed for the pattern to become mostly insignificant.
We will perform second- and third-order differencing of the data and compare the ACF and PACF plots for seasonality.
Note: Seasonality is reduced more quickly by taking a first-order seasonal difference together with a regular difference. We will discuss Seasonal Differencing and the Seasonal ARIMA model in the next section.
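As a brief preview only (the details are left to the next part), the same diff helper accepts seasonal arguments, so a regular plus a lag-12 seasonal difference could be checked like this:
# Sketch: one regular plus one seasonal (lag-12) difference, then re-test stationarity
seasonal_diff = diff(erie_data['Erie'], k_diff=1, k_seasonal_diff=1, seasonal_periods=12)
aug_dftest(seasonal_diff, title='Regular + Seasonal (12) Differenced Data')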
Let's difference the data to create second and third order differences.
# Difference the data
# Second Order Difference
diff2_data = diff(erie_data['Erie'], k_diff=2)
# Third Order Difference
diff3_data = diff(erie_data['Erie'], k_diff=3)
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,3, figsize=(15,15))
diff1_data.plot(title='First Order Differences', ax=axes[0,0])
plot_acf(diff1_data, lags=36, ax=axes[0,1], title='ACF')
plot_pacf(diff1_data, lags=36, ax=axes[0,2], title='PACF')
axes[0,0].set(xlabel='')
diff2_data.plot(title='Second Order Differences', ax=axes[1,0])
plot_acf(diff2_data, lags=36, ax=axes[1,1], title='ACF')
plot_pacf(diff2_data, lags=36, ax=axes[1,2], title='PACF')
axes[1,0].set(xlabel='')
diff3_data.plot(title='Third Order Differences', ax=axes[2,0])
plot_acf(diff3_data, lags=36, ax=axes[2,1], title='ACF')
plot_pacf(diff3_data, lags=36, ax=axes[2,2], title='PACF')
axes[2,0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plot above shows first, second and third order differenced data along with their ACF and PACF plots. It appears as though three orders of differencing are required to remove both trend and seasonality. From the 'Third Order Difference - ACF Plot', we can see that the pattern is now removed and most lags are within the bounds.
We will examine the second and third order differenced plots for further analysis.
Let's examine the Third Order Difference plots.
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,1, figsize=(15,12))
diff3_data.plot(title='Third Order Differences', ax=axes[0])
plot_acf(diff3_data, lags=48, ax=axes[1], title='ACF')
plot_pacf(diff3_data, lags=48, ax=axes[2], title='PACF')
axes[0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
From the 'Third Order Difference' ACF Plot, we can see that the pattern is now removed and most lags are within the bounds.
Notice that:
(a) ACF shows a sharp cutoff from lag 1 to lag 2.
(b) Lag 1 autocorrelation is negative as seen in the ACF plot.
(c) PACF shows an exponential decay over various lags.
(d) PACF cuts off at lag 15.
The rules for identifying AR and MA orders [1] are as follows:
If the PACF of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is positive--i.e., if the series appears slightly "underdifferenced"--then consider adding an AR term to the model. The lag at which the PACF cuts off is the indicated number of AR terms.
If the ACF of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is negative--i.e., if the series appears slightly "overdifferenced"--then consider adding an MA term to the model. The lag at which the ACF cuts off is the indicated number of MA terms.
As a rule of thumb, a stationarized series displays an AR, MA, or ARMA signature as follows:

|      | AR(p)                | MA(q)                | ARMA(p,q) |
|------|----------------------|----------------------|-----------|
| ACF  | Tails off            | Cuts off after lag q | Tails off |
| PACF | Cuts off after lag p | Tails off            | Tails off |
Given the cutoff at lag 15 on the third-order differenced PACF plot, and the lack of significant lags after lag 1 on the third-order differenced ACF plot, this would suggest starting with an ARIMA(15,3,1) model.
ARIMA(15,3,1) is a very complex model to start with. Also, the statsmodels package does not support an ARIMA model with d > 2; we would need SARIMAX to run a model as complex as ARIMA(15,3,1).
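For completeness, a d = 3 model could still be specified through SARIMAX. This is a sketch only; we do not fit it here.
# Sketch: specifying ARIMA(15,3,1) through SARIMAX, which allows d > 2
from statsmodels.tsa.statespace.sarimax import SARIMAX
complex_model = SARIMAX(erie_data['Erie'], order=(15, 3, 1))
# complex_results = complex_model.fit()  # left commented out; a 15-lag AR with d=3 is expensive to fit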
Let's explore the second order differenced data to see if we can find a simpler ARIMA model.
Let's examine the Second Order Difference Plots of the data.
# Run ACF and PACF plots of differenced data
fig, axes = plt.subplots(3,1, figsize=(15,12))
diff2_data.plot(title='Second Order Differences', ax=axes[0])
plot_acf(diff2_data, lags=48, ax=axes[1], title='ACF')
plot_pacf(diff2_data, lags=48, ax=axes[2], title='PACF')
axes[0].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plot above shows second order differenced data along with ACF and PACF plots.
Notice that:
(a) Lag 1 autocorrelation is negative as seen in the ACF plot.
(b) ACF shows an exponential decay over various lags.
(c) Lack of significant lags on both the ACF and PACF plots.
Given the lack of significant lags on the second-order differenced ACF and PACF plots, we may start with a simpler ARIMA(0,2,1) model and improve from there.
Our goal here is to find the optimal set of parameters (p,d,q) that yields the best performance for our model. There are multiple ways to determine the order (p,d,q) for an ARIMA(p,d,q) model. We will run a grid search with different values of the AR(p), I(d) and MA(q) terms.
Let's create:
The function will evaluate all the models created during the stepwise operation and report the best ARIMA model along with its AIC. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC allows us to compare how well models fit the data while accounting for model complexity, so a model that fits well using fewer parameters receives a better (lower) AIC score than a similar model that uses more parameters.
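For reference, $\mathrm{AIC}=2k-2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model; the $2k$ term is what penalizes additional complexity.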
Alternatively, the auto_arima tool from the pmdarima package can be used, which returns the optimal order for an ARIMA model.
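A minimal sketch of that alternative, assuming pmdarima is installed (it is not used elsewhere in this notebook):
# Sketch: letting pmdarima's auto_arima search for the (p,d,q) order
import pmdarima as pm
auto_model = pm.auto_arima(erie_data['Erie'], seasonal=False, stepwise=True,
                           trace=True, suppress_warnings=True)
print(auto_model.order)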
# Create a list of ARIMA model orders
p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))
print(pdq)
from statsmodels.tsa.arima_model import ARIMA
# Function to iterate over ARIMA orders
def stepwise_arima(data):
    model_res = {}
    for param in pdq:
        try:
            model = ARIMA(data, order=param)
            model_fit = model.fit(solver='nm')
            # print(f'ARIMA{param} - AIC: {model_fit.aic}')
            model_res[param] = model_fit.aic
        except Exception:
            # some (p,d,q) combinations fail to converge; skip them
            continue
    # Print model with lowest AIC
    low_key = min(model_res, key=model_res.get)
    print('\n')
    print('---------------------')
    print('MODEL WITH LOWEST AIC')
    print('---------------------')
    print(f'ARIMA{low_key} - AIC: {model_res[low_key]}')
# Run the model
stepwise_arima(erie_data['Erie'])
We can see how an optimal model of ARIMA(2,1,2) was determined by looking at the stepwise results. The recommended order is the one with the lowest Akaike information criterion or AIC score. Note that the recommended model may not be the one with the closest fit. The AIC score takes complexity into account, and tries to identify the best forecasting model.
Now that we have identified ARIMA(2,1,2) as the optimal model, we will split the data and fit this model on training and test sets.
We will split the data into train and test sets.
# Split the data
# TRAIN Data
train_data = erie_data.iloc[:1124]
train_data = train_data.astype('float64') # casting as float to avoid errors in prediction
# TEST Data
test_data = erie_data.iloc[1124:]
test_data = test_data.astype('float64')
We will build an ARIMA(2,1,2) model on the training data and fit the model.
from statsmodels.tsa.arima_model import ARIMA
ar212_model = ARIMA(train_data['Erie'], order=(2,1,2))
ar212_results = ar212_model.fit(solver='nm')
ar212_results.summary()
The table in the middle is the coefficients table, where the values under 'coef' are the weights of the respective terms. Notice here that the p-value in the 'P>|z|' column is <= 0.05 and hence significant. Let's obtain predictions from this model.
start = len(train_data)
end = len(train_data) + len(test_data)-1
ar212_predictions = ar212_results.predict(start=start, end=end, dynamic=False,
                                           typ='levels')
Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values). Passing typ='levels' predicts the levels of the original endogenous variable. If we had used the default typ='linear', we would have seen linear predictions in terms of the differenced endogenous variable. For more information on these arguments, check the statsmodels documentation here.
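Concretely, with $d=1$ the level forecasts are obtained by adding the predicted differences back onto the last observed level: $\hat{y}_{T+h}=y_{T}+\sum_{i=1}^{h} \hat{y}_{T+i}^{\prime}$. This is, in effect, the bookkeeping statsmodels performs when typ='levels' is passed.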
Let’s reorganize this set of predictions by creating a dataframe that contains our model predictions and then concatenating that with the test data.
ar212_pred = pd.DataFrame(ar212_predictions.values,
                          index=test_data.index,
                          columns=['ar212_predictions'])
test_data = pd.concat([test_data, ar212_pred], axis=1)
test_data[['Erie', 'ar212_predictions']].head()
To help us understand the accuracy of ARIMA model, we will plot predictions from the model against test data and calculate loss.
Let's plot Predictions against Test data.
with plt.style.context('fivethirtyeight'):
    ax = test_data['Erie'].plot(label='Test Data', legend=True, figsize=(15,6))
    ar212_predictions.plot(label='ARIMA(2,1,2) Predictions', legend=True)
    ax.autoscale(axis='x', tight=True)
    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Model Predictions');
From this plot, we can see that the model predictions are not close to the test data. The model tends to underestimate or overestimate the data. It seems to capture the seasonality but is not able to capture the trend in the test data.
We will evaluate our model predictions using RMSE.
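RMSE is the square root of the average squared difference between the actual and predicted values: \begin{array}{l} \mathrm{RMSE}=\sqrt{\frac{1}{n} \sum_{t=1}^{n}\left(y_{t}-\hat{y}_{t}\right)^{2}} \end{array} Lower values indicate predictions closer to the actuals, expressed in the same units as the data.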
from statsmodels.tools.eval_measures import rmse
error = rmse(test_data['Erie'], ar212_predictions)
print(f'ARIMA(2,1,2) RMSE: {error}')
The RMSE of the ARIMA(2,1,2) model is 0.294, which is much lower than the RMSE of the Double Exponential Smoothing model (2.2324).
We will now retrain the model on the full dataset and then use it to forecast 100 months into the future.
We will now retrain our model on the full data.
from statsmodels.tsa.arima_model import ARIMA
Arima_Model = ARIMA(erie_data['Erie'], order=(2,1,2))
Arima_results = Arima_Model.fit(solver='nm')
Arima_results.summary()
# Forecast for 100 months
start = len(erie_data)
end = len(erie_data) + 99
Arima_pred = Arima_results.predict(start=start, end=end, dynamic=False, typ='levels')
We will plot the forecasts for 100 months generated by ARIMA model.
with plt.style.context('fivethirtyeight'):
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel = 'Mean Water Levels (metric units)'
    ax = erie_data['Erie'].plot(legend=True, title=title, figsize=(18,8))
    Arima_pred.plot(label='ARIMA(2,1,2) Forecast', legend=True)
    ax.autoscale(axis='x', tight=True)
    ax.set(ylabel=ylabel)
    ax.set(xlabel='');
Let's zoom into the forecast time period to understand the trend and seasonality of the forecasts.
with plt.style.context('fivethirtyeight'):
    title = 'Monthly Mean Water Levels (metric units) - Lake Erie'
    ylabel = 'Mean Water Levels (metric units)'
    ax = erie_data['01-01-2016':]['Erie'].plot(legend=True, title=title, figsize=(18,8))
    Arima_pred.plot(label='ARIMA(2,1,2) Forecast', legend=True)
    ax.autoscale(axis='x', tight=True)
    ax.set(ylabel=ylabel)
    ax.set(xlabel='');
The plot shows the original data from 2016 - 2019 and the forecasts generated by the model for the next 100 months. The forecasts show seasonality in the future periods; however, there is no increasing or decreasing trend.
To summarize, in this notebook we:
We have seen that ARIMA(2,1,2) is the best-fit model, with a lower RMSE than the Double Exponential Smoothing model. The model seems to capture the seasonality but is not able to capture the trend in the test data. The forecasts also show seasonality in the future periods; however, there is no increasing or decreasing trend.
Seasonality is reduced more quickly by taking first-order seasonal and regular differences. We will discuss Seasonal Differencing and Seasonal ARIMA models in the next part of the study.
[1] https://people.duke.edu/~rnau/411arim.htm
[2] https://people.duke.edu/~rnau/411diff.htm
[3] Hyndman, R. J., & Athanasopoulos, G. (2014). Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp2/
[4] https://faculty.fuqua.duke.edu/~rnau/Decision411_2007/seasarim.htm
[5] Sokol, J. (2018). Module 7: Time Series Models [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6501-intro-analytics-modeling
[6] Serban, N. (2018). Unit 2: Auto Regressive and Moving Average (ARMA) Model [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis