Lake Erie Water Levels - A Time Series Analysis (Part 4/4)

So far in this study, we have looked at univariate time series modeling techniques, where a single variable, Water Level, varies over time. Next, we will look at multivariate [1] time series analysis, where each variable depends not only on its own past values but also on past values of the other variables. We will model how Water Level in Lake Erie varies over time together with other variables, namely Precipitation and Air Temperature over the lake.

We will continue to use Mean Water Level (m) data for Lake Erie. We will also bring in Precipitation (mm) data for the Erie basin (land + lake) and Mean Air Temperature (${^\circ}C$) over Lake Erie.

In this notebook, we will:

  • Read Water Level, Precipitation and Air Temperature data for Lake Erie and index it as a time series
  • Discuss stationarity of the data
  • Difference the data to make it stationary
  • Identify the optimal VARMA(p,q) order using grid search
  • Split the data
  • Build and fit the optimal VARMA model
    • Obtain predictions
  • Evaluate model performance
    • Plot predictions and calculate loss

Part 4: VARMA Models

In [202]:
# General Imports
import pandas as pd
import itertools
import time

# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Time Series Imports
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.tools import diff
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.varmax import VARMAX, VARMAXResults
from statsmodels.tools.eval_measures import rmse


# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

Read and Preprocess Data

We will use Mean Water Level data for Lake Erie, Precipitation data for the Erie basin (land + lake) and Mean Air Temperature data over Lake Erie. We'll need to do some quick processing to read the data and give it a time series index.

Mean Air Temperature Data

The air temperature data is available for the period Jan'1948 - Dec'2019 and is measured in ${^\circ}C$. The data is in a table format with Year as rows and each Month as a column. It needs to be converted to the correct format so that each Year-Month combination is a row. We will read the data and process it for use.

In [112]:
# Read data
air_temp = pd.read_csv('Erie_AirTemp.csv')
air_temp.head()
Out[112]:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1948 -7.08 -3.82 1.70 9.68 12.30 18.82 22.42 21.23 18.65 9.92 7.73 0.58
1 1949 0.38 0.03 1.83 7.12 14.80 21.99 23.92 22.19 15.73 14.15 4.50 1.10
2 1950 1.36 -2.74 -1.11 4.81 13.62 18.84 20.56 20.65 16.66 13.15 3.13 -3.35
3 1951 -2.20 -2.41 2.10 7.15 14.24 19.04 21.59 20.26 16.91 13.00 1.79 -1.09
4 1952 -1.21 -1.45 1.35 8.67 12.65 20.96 23.76 21.13 17.76 9.04 6.39 1.19
In [113]:
# Convert rows to columns
air_temp = air_temp.melt(id_vars='Year', var_name='Month', value_name='Temp(C)')
In [114]:
# Create date column from Year and Month
air_temp['date'] = pd.to_datetime(air_temp.Year.astype(str) + '-' + air_temp.Month.astype(str))
In [115]:
# Drop Year and Month cols
air_temp.drop(columns = ['Year', 'Month'], inplace=True)
In [116]:
# Sort dataframe by date
air_temp.sort_values(by=['date'], inplace=True)
air_temp.head()
Out[116]:
Temp(C) date
0 -7.08 1948-01-01
72 -3.82 1948-02-01
144 1.70 1948-03-01
216 9.68 1948-04-01
288 12.30 1948-05-01
In [117]:
# Print Min and Max values
print(min(air_temp['date']))
print(max(air_temp['date']))
1948-01-01 00:00:00
2019-12-01 00:00:00
In [118]:
# Set date as index
air_temp.set_index('date', inplace=True)
air_temp.head()
Out[118]:
Temp(C)
date
1948-01-01 -7.08
1948-02-01 -3.82
1948-03-01 1.70
1948-04-01 9.68
1948-05-01 12.30

Precipitation Data

The precipitation data is available for the period Jan'1940 - Dec'2019 and is measured in mm. This data is also in a table format with Year as rows and each Month as a column. It needs to be converted to the correct format so that each Year-Month combination is a row. We will read the data and process it for use.

In [119]:
# Read data
prec_data = pd.read_csv('erie_precipitation.csv')
prec_data.head()
Out[119]:
YYYY Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1940 43.33 56.34 60.84 85.20 105.98 115.51 51.04 117.42 56.04 58.01 83.60 86.72
1 1941 47.12 24.49 33.44 42.03 65.40 88.56 73.08 69.69 34.59 105.22 55.24 44.83
2 1942 44.83 67.85 85.49 60.02 112.47 84.91 113.05 78.22 92.14 76.74 92.73 74.12
3 1943 50.39 41.97 67.62 84.96 171.98 83.87 139.35 67.10 68.60 61.48 45.46 19.64
4 1944 26.54 53.74 82.09 104.72 93.48 90.67 43.93 77.09 70.31 32.26 54.35 67.89
In [120]:
# Convert rows to columns
prec_data = prec_data.melt(id_vars='YYYY', var_name='Month', value_name='Prec(mm)')
In [121]:
# Create date column from Year and Month
prec_data['date'] = pd.to_datetime(prec_data.YYYY.astype(str) + '-' + prec_data.Month.astype(str))
In [122]:
# Drop Year and Month cols
prec_data.drop(columns = ['YYYY', 'Month'], inplace=True)
In [123]:
# Sort dataframe by date
prec_data.sort_values(by=['date'], inplace=True)
prec_data.head()
Out[123]:
Prec(mm) date
0 43.33 1940-01-01
80 56.34 1940-02-01
160 60.84 1940-03-01
240 85.20 1940-04-01
320 105.98 1940-05-01
In [124]:
# Print Min and Max values
print(prec_data.shape)
print(min(prec_data['date']))
print(max(prec_data['date']))
(960, 2)
1940-01-01 00:00:00
2019-12-01 00:00:00

Since air temperature data is available starting from Jan'1948, we will subset precipitation data for the same time period.

In [125]:
# Subset data
prec_data = prec_data[prec_data['date']>='1948-01-01']
prec_data.head()
Out[125]:
Prec(mm) date
8 45.46 1948-01-01
88 63.61 1948-02-01
168 110.12 1948-03-01
248 81.60 1948-04-01
328 102.88 1948-05-01
In [126]:
# Set date as index
prec_data.set_index('date', inplace=True)
prec_data.head()
Out[126]:
Prec(mm)
date
1948-01-01 45.46
1948-02-01 63.61
1948-03-01 110.12
1948-04-01 81.60
1948-05-01 102.88

Mean Water Level Data

The mean water level data is available for the period Jan'1918 - Dec'2019 and is measured in "m". We will continue to use the Lake Erie water level data we have used so far in this study.

In [127]:
erie_data = pd.read_csv('erie_data')
erie_data.head()
Out[127]:
date Erie SES12 DES12 TES12
0 1918-01-01 173.90 173.900000 173.881745 173.859612
1 1918-02-01 173.82 173.887692 173.749419 173.880169
2 1918-03-01 174.01 173.906509 174.160248 174.171219
3 1918-04-01 174.02 173.923969 174.051395 174.139210
4 1918-05-01 173.98 173.932589 173.950891 174.031776
In [128]:
# Convert date to datetime column
erie_data['date'] = pd.to_datetime(erie_data['date'])
erie_data.dtypes
Out[128]:
date     datetime64[ns]
Erie            float64
SES12           float64
DES12           float64
TES12           float64
dtype: object
In [129]:
# Drop cols
erie_data.drop(columns=['SES12','DES12','TES12'], inplace=True)
In [130]:
# Rename Erie col to Water level
erie_data = erie_data.rename(columns={'Erie':'Water_level(m)'})
In [131]:
# Print Min and Max values
print(erie_data.shape)
print(min(erie_data['date']))
print(max(erie_data['date']))
(1224, 2)
1918-01-01 00:00:00
2019-12-01 00:00:00

Since air temperature data is available starting from Jan'1948, we will subset water level data for the same time period.

In [132]:
# Subset Data
erie_data = erie_data[erie_data['date']>='1948-01-01']
erie_data.head()
Out[132]:
date Water_level(m)
360 1948-01-01 174.08
361 1948-02-01 173.99
362 1948-03-01 174.14
363 1948-04-01 174.38
364 1948-05-01 174.50
In [133]:
# Set date as index
erie_data.set_index('date', inplace=True)

Merge Data

We will merge water level, precipitation and air temperature data into a single dataframe.

In [134]:
erie_data = erie_data.join(air_temp)
erie_data.head()
Out[134]:
Water_level(m) Temp(C)
date
1948-01-01 174.08 -7.08
1948-02-01 173.99 -3.82
1948-03-01 174.14 1.70
1948-04-01 174.38 9.68
1948-05-01 174.50 12.30
In [135]:
erie_data = erie_data.join(prec_data)
erie_data.head()
Out[135]:
Water_level(m) Temp(C) Prec(mm)
date
1948-01-01 174.08 -7.08 45.46
1948-02-01 173.99 -3.82 63.61
1948-03-01 174.14 1.70 110.12
1948-04-01 174.38 9.68 81.60
1948-05-01 174.50 12.30 102.88

Setting a DatetimeIndex Frequency

Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly, etc.). Since observations occur at the start of each month, we'll use MS. A full list of time series offset aliases can be found in the pandas documentation.

In [136]:
# Set Frequency
erie_data.index.freq = 'MS'
erie_data.index
Out[136]:
DatetimeIndex(['1948-01-01', '1948-02-01', '1948-03-01', '1948-04-01',
               '1948-05-01', '1948-06-01', '1948-07-01', '1948-08-01',
               '1948-09-01', '1948-10-01',
               ...
               '2019-03-01', '2019-04-01', '2019-05-01', '2019-06-01',
               '2019-07-01', '2019-08-01', '2019-09-01', '2019-10-01',
               '2019-11-01', '2019-12-01'],
              dtype='datetime64[ns]', name='date', length=864, freq='MS')
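
As an optional sanity check (pd.infer_freq is a standard pandas helper, not part of the original flow), we can also let pandas infer the frequency directly from the index:

# For complete month-start data this should print 'MS'
print(pd.infer_freq(erie_data.index))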

VARMA Models

Represented as: VARMA (p,q)

VARMA (Vector Autoregressive Moving Average) models [1] [2] [3] jointly model multiple variables, where each variable depends not only on its own past values but also on past values of the other variables.

A $K$-dimensional VARMA model of order $(p,q)$ expresses each variable $y_k$, $k = 1, \dots, K$, in terms of up to $p$ lags of all $K$ variables and up to $q$ lags of all $K$ error terms.

For example, the system of equations for a 2-dimensional VARMA(1,1) model is:

    $y_{1,t} = c_1 + \phi_{11,1}y_{1,t-1} + \phi_{12,1}y_{2,t-1} + \theta_{11,1}\varepsilon_{1,t-1} + \theta_{12,1}\varepsilon_{2,t-1} + \varepsilon_{1,t}$
    $y_{2,t} = c_2 + \phi_{21,1}y_{1,t-1} + \phi_{22,1}y_{2,t-1} + \theta_{21,1}\varepsilon_{1,t-1} + \theta_{22,1}\varepsilon_{2,t-1} + \varepsilon_{2,t}$

where:
coefficient $\phi_{ii,l}$ captures the influence of the $l$th lag of $y_i$ on itself,
coefficient $\phi_{ij,l}$ captures the influence of the $l$th lag of $y_j$ on $y_i$,
coefficient $\theta_{ii,l}$ captures the influence of the $l$th lag of the error $\varepsilon_i$ on $y_i$,
coefficient $\theta_{ij,l}$ captures the influence of the $l$th lag of the error $\varepsilon_j$ on $y_i$,
and $\varepsilon_{1,t}$ and $\varepsilon_{2,t}$ are white-noise residuals.
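
To make the recursion concrete, here is a minimal simulation sketch of the 2-dimensional VARMA(1,1) system above. The coefficient values are arbitrary choices for illustration, not estimates from the Lake Erie data.

# Minimal VARMA(1,1) simulation sketch (illustrative coefficients only)
import numpy as np

rng = np.random.default_rng(42)

c     = np.array([0.1, -0.2])          # intercepts c_1, c_2
Phi   = np.array([[0.5, 0.2],          # AR coefficients phi_{ij,1}
                  [0.1, 0.4]])
Theta = np.array([[0.3, 0.0],          # MA coefficients theta_{ij,1}
                  [0.0, 0.3]])

T   = 200
y   = np.zeros((T, 2))
eps = rng.standard_normal((T, 2))      # white-noise residuals

# y_t = c + Phi @ y_{t-1} + Theta @ eps_{t-1} + eps_t
for t in range(1, T):
    y[t] = c + Phi @ y[t-1] + Theta @ eps[t-1] + eps[t]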


We will follow this 8-step process to build and validate the VARMA model and forecast future periods:

Step 1 — Visualize the data.
Step 2 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can model the data.
Step 3 — Difference: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference, then re-check for stationarity.
Step 4 — Select the optimal p and q terms for the model order.
Step 5 — Split the data into train and test sets.
Step 6 — Build the model: Build the model and obtain predictions on the test data.
Step 7 — Invert the transformation to return forecasts to the original scale.
Step 8 — Validate the model: Compare the predicted values to the actuals in the test sample.

Step 1 - Visualize the Data

In [142]:
fig, axes = plt.subplots(3,1, figsize=(15,12))

erie_data['Water_level(m)'].plot(title='Mean Water Levels (in metres)', ax=axes[0])
erie_data['Temp(C)'].plot(title='Avg. Monthly Temperature (in Celsius)', ax=axes[1])
erie_data['Prec(mm)'].plot(title='Mean Precipitation (in mm)', ax=axes[2])
axes[0].set(xlabel='')
axes[1].set(xlabel='')
axes[2].set(xlabel='')

fig.tight_layout(pad=1.0)
plt.show()

The plot above shows the variation of Mean Water Level, Temperature and Precipitation for Lake Erie. The Water Level series shows an increasing and then decreasing trend. The Average Monthly Temperature series shows no trend but strong seasonality. The Mean Precipitation series shows a slight trend with seasonality and cyclic variations. Let's test the data for stationarity.

Step 2 - Test for Stationarity

To effectively use time series models, we need to understand if our data is stationary. A stationary series [2] has constant mean and variance over time. A time series that shows seasonality is not stationary.

A test for stationarity usually involves a hypothesis test, where the null hypothesis $H_0$ is that the series is nonstationary. The alternate hypothesis $H_1$ supports stationarity. To determine whether a series is stationary we can use the augmented Dickey-Fuller Test.

We will define a function that creates more informative output from the adfuller() test. We will use AIC for the autolag because it penalizes higher order models over simpler models.

In [144]:
# Define the adf-test function

def aug_dftest(series, title=''):
    
    print(f'Augmented Dickey Fuller Test for: {title}')
    result = adfuller(series.dropna(), autolag='AIC') # dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    output = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        output[f'critical value ({key})'] = val
        
    print(output.to_string())
        
    if output['p-value'] <= 0.05:
        print('Reject the Null hypothesis; Data is stationary')
    else:
        print('Fail to reject the Null hypothesis; Data is non-stationary')
          
In [232]:
# Run Test on each variable
for col in erie_data:
    aug_dftest(erie_data[col], title=col)
    print()
Augmented Dickey Fuller Test for: Water_level(m)
ADF test statistic       -2.557847
p-value                   0.102026
# lags used              18.000000
# observations          845.000000
critical value (1%)      -3.438112
critical value (5%)      -2.864966
critical value (10%)     -2.568595
Fail to reject the Null hypothesis; Data is non-stationary

Augmented Dickey Fuller Test for: Temp(C)
ADF test statistic     -5.856578e+00
p-value                 3.488693e-07
# lags used             2.100000e+01
# observations          8.420000e+02
critical value (1%)    -3.438140e+00
critical value (5%)    -2.864979e+00
critical value (10%)   -2.568601e+00
Reject the Null hypothesis; Data is stationary

Augmented Dickey Fuller Test for: Prec(mm)
ADF test statistic      -24.788931
p-value                   0.000000
# lags used               0.000000
# observations          863.000000
critical value (1%)      -3.437950
critical value (5%)      -2.864895
critical value (10%)     -2.568556
Reject the Null hypothesis; Data is stationary

Both the temperature and precipitation variables are stationary; however, the water level variable is not. Let's difference the data to make it stationary.

Step 3 - Differencing

Differencing [3] is a method of transforming a non-stationary time series into a stationary one. Often the data itself is not stationary, but its differences are. If there is an obvious upward or downward linear trend, a first difference may be needed; a quadratic trend might need a second-order difference. Often (though not always) a first (non-seasonal) difference will "detrend" the data.

The first difference ${y}^\prime_{t}$ is the difference between the observation at the current time period ${y}_{t}$ and the previous time period ${y}_{t-1}$. If the first differences still fail to exhibit a constant mean and variance, we take the second difference of the first-differenced values.

First-order differencing: $y^{\prime}_{t} = y_{t} - y_{t-1}$

Second-order differencing: $y^{\prime\prime}_{t} = y^{\prime}_{t} - y^{\prime}_{t-1} = \left(y_{t} - y_{t-1}\right) - \left(y_{t-1} - y_{t-2}\right) = y_{t} - 2y_{t-1} + y_{t-2}$
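
As a quick numeric sanity check (on a toy series, not the Erie data), pandas' diff() reproduces these formulas:

# Verify the differencing identities on a toy series
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 19.0])

first  = s.diff()          # y'_t  = y_t - y_{t-1}
second = s.diff().diff()   # y''_t = y_t - 2*y_{t-1} + y_{t-2}

print(second.iloc[3])                        # 1.0
print(s.iloc[3] - 2*s.iloc[2] + s.iloc[1])   # 1.0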

We'll take a first order difference of the entire DataFrame and re-run the augmented Dickey-Fuller tests.

In [218]:
# Difference the data
erie_transformed = erie_data.diff()
In [219]:
# Drop NA values
erie_transformed = erie_transformed.dropna()

# Re-run the test on each differenced variable
for col in erie_transformed:
    aug_dftest(erie_transformed[col], title=col)
    print()
Augmented Dickey Fuller Test for: Water_level(m)
ADF test statistic     -8.467925e+00
p-value                 1.504803e-13
# lags used             1.900000e+01
# observations          8.430000e+02
critical value (1%)    -3.438131e+00
critical value (5%)    -2.864975e+00
critical value (10%)   -2.568599e+00
Reject the Null hypothesis; Data is stationary

Augmented Dickey Fuller Test for: Temp(C)
ADF test statistic     -1.201978e+01
p-value                 3.042569e-22
# lags used             2.100000e+01
# observations          8.410000e+02
critical value (1%)    -3.438149e+00
critical value (5%)    -2.864983e+00
critical value (10%)   -2.568603e+00
Reject the Null hypothesis; Data is stationary

Augmented Dickey Fuller Test for: Prec(mm)
ADF test statistic     -1.074121e+01
p-value                 2.815082e-19
# lags used             2.100000e+01
# observations          8.410000e+02
critical value (1%)    -3.438149e+00
critical value (5%)    -2.864983e+00
critical value (10%)   -2.568603e+00
Reject the Null hypothesis; Data is stationary

Test results confirm that we reached stationarity after the first difference (d = 1).

Step 4 - Optimal Order (p,q)

Our goal here is to find the set of model orders (p,q) that yields the best-performing model. We will run a grid search over different values of the AR(p) and MA(q) terms.

We will:

  1. Create a list of different model orders to iterate over.
  2. Define a function that iterates through these orders to find the VARMA model with the lowest AIC.

The function will evaluate all models created during the grid search and report the best VARMA model along with its AIC score. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC lets us compare how well models fit the data while taking model complexity into account, so a model that fits well using fewer parameters receives a better (lower) AIC score than a similar model that uses more parameters.
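
For reference, the AIC of a fitted model is $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized value of the likelihood function; lower values are better.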

In [224]:
# Create a list of VARMA model orders

p = q = range(0, 6)
pq = list(itertools.product(p, q))
print(pq)
[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (5, 0), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5)]
In [225]:
# Function to grid search over VARMA orders (VARMAX was imported above)
def grid_search_varma(data):
    model_res = {}
    for param in pq:
        try:
            model = VARMAX(data, order=param, trend='c')
            model_fit = model.fit(maxiter=1000, disp=False)
            print(f'VARMA{param} - AIC: {model_fit.aic}')
            model_res[param] = model_fit.aic
        except Exception:
            continue

    # Print model with lowest AIC
    low_key = min(model_res, key=model_res.get)
    print('\n')
    print('---------------------')
    print('MODEL WITH LOWEST AIC')
    print('---------------------')
    print(f'VARMA{low_key} - AIC: {model_res[low_key]}')
In [226]:
# Run the grid search
grid_search_varma(erie_transformed)
VARMA(0, 1) - AIC: 10927.584294474686
VARMA(0, 2) - AIC: 10817.83428773417
VARMA(0, 3) - AIC: 10917.28923733559
VARMA(0, 4) - AIC: 10767.155111840013
VARMA(0, 5) - AIC: 10703.46873777406
VARMA(1, 0) - AIC: 10812.857441568016
VARMA(1, 1) - AIC: 10287.693015779276
VARMA(1, 2) - AIC: 10169.255144627745
VARMA(1, 3) - AIC: 10202.068136827704
VARMA(1, 4) - AIC: 10266.197629393806
VARMA(1, 5) - AIC: 10470.947493650605
VARMA(2, 0) - AIC: 10443.261607580705
VARMA(2, 1) - AIC: 10134.888598894773
VARMA(2, 2) - AIC: 9619.139106435141
VARMA(2, 3) - AIC: 9741.637181771464
VARMA(2, 4) - AIC: 9893.122845734946
VARMA(2, 5) - AIC: 9992.482279574771
VARMA(3, 0) - AIC: 10111.425344015883
VARMA(3, 1) - AIC: 9627.765420040938
VARMA(3, 2) - AIC: 9830.42421112183
VARMA(3, 3) - AIC: 9790.19141276467
VARMA(3, 4) - AIC: 9700.559540037539
VARMA(3, 5) - AIC: 9746.210152406711
VARMA(4, 0) - AIC: 9867.816602597377
VARMA(4, 1) - AIC: 9283.210210268599
VARMA(4, 2) - AIC: 9500.298701218588
VARMA(4, 3) - AIC: 9564.51839635842
VARMA(4, 4) - AIC: 9448.638452724328
VARMA(4, 5) - AIC: 9540.271510594474
VARMA(5, 0) - AIC: 9589.60613089416
VARMA(5, 1) - AIC: 9139.176849660413
VARMA(5, 2) - AIC: 9164.110682415027
VARMA(5, 3) - AIC: 9250.73093253131
VARMA(5, 4) - AIC: 9404.102129263021
VARMA(5, 5) - AIC: 9323.898187956052


---------------------
MODEL WITH LOWEST AIC
---------------------
VARMA(5, 1) - AIC: 9139.176849660413

The function identified VARMA(5,1) as the optimal model. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. Note that the recommended model may not be the one with the closest fit; the AIC score takes complexity into account and tries to identify the best forecasting model.

Now that we have identified VARMA(5,1) as the optimal model, we will split the data, fit the model on the training set, and evaluate it on the test set.

Step 5 - Split the Data

It will be useful to define a number-of-observations variable for our test set. For this analysis, let's use the last 60 months of data as the test set.

In [168]:
nobs=60
train, test = erie_transformed[0:-nobs], erie_transformed[-nobs:]
In [169]:
print(train.shape)
print(test.shape)
(803, 3)
(60, 3)

Step 6 - Build the Model

We will build and fit the VARMA(5, 1) model.

In [237]:
model = VARMAX(train, order=(5, 1), trend='c')
results = model.fit(maxiter=1000, disp=False)
results.summary()
Out[237]:
Statespace Model Results
Dep. Variable: ['Water_level(m)', 'Temp(C)', 'Prec(mm)'] No. Observations: 803
Model: VARMA(5,1) Log Likelihood -4195.825
+ intercept AIC 8517.649
Date: Sun, 13 Sep 2020 BIC 8813.015
Time: 19:25:04 HQIC 8631.095
Sample: 02-01-1948
- 12-01-2014
Covariance Type: opg
Ljung-Box (Q): 118.51, 81.96, 57.95 Jarque-Bera (JB): 82.69, 12.97, 20.76
Prob(Q): 0.00, 0.00, 0.03 Prob(JB): 0.00, 0.00, 0.00
Heteroskedasticity (H): 1.10, 0.98, 0.96 Skew: 0.47, -0.21, 0.31
Prob(H) (two-sided): 0.45, 0.84, 0.75 Kurtosis: 4.25, 3.47, 3.48
Results for equation Water_level(m)
coef std err z P>|z| [0.025 0.975]
intercept -0.0002 0.002 -0.096 0.923 -0.004 0.004
L1.Water_level(m) 0.1063 0.238 0.446 0.656 -0.361 0.574
L1.Temp(C) 0.0039 0.002 2.488 0.013 0.001 0.007
L1.Prec(mm) 0.0016 0.001 2.876 0.004 0.001 0.003
L2.Water_level(m) 0.0017 0.052 0.033 0.974 -0.100 0.104
L2.Temp(C) -0.0036 0.002 -2.107 0.035 -0.007 -0.000
L2.Prec(mm) 0.0013 0.001 2.558 0.011 0.000 0.002
L3.Water_level(m) 0.1099 0.052 2.098 0.036 0.007 0.212
L3.Temp(C) -0.0068 0.001 -5.405 0.000 -0.009 -0.004
L3.Prec(mm) 0.0011 0.000 2.835 0.005 0.000 0.002
L4.Water_level(m) 0.1293 0.062 2.098 0.036 0.009 0.250
L4.Temp(C) -0.0046 0.002 -2.292 0.022 -0.009 -0.001
L4.Prec(mm) 0.0007 0.000 2.203 0.028 7.27e-05 0.001
L5.Water_level(m) 0.0345 0.063 0.544 0.586 -0.090 0.159
L5.Temp(C) -0.0042 0.002 -2.676 0.007 -0.007 -0.001
L5.Prec(mm) 0.0003 0.000 1.783 0.075 -2.79e-05 0.001
L1.e(Water_level(m)) -0.1588 0.237 -0.671 0.502 -0.623 0.305
L1.e(Temp(C)) -0.0004 0.002 -0.210 0.834 -0.004 0.003
L1.e(Prec(mm)) -0.0005 0.001 -0.878 0.380 -0.002 0.001
Results for equation Temp(C)
coef std err z P>|z| [0.025 0.975]
intercept 0.0055 0.072 0.076 0.939 -0.135 0.146
L1.Water_level(m) -4.5112 11.764 -0.383 0.701 -27.568 18.546
L1.Temp(C) 0.3984 0.080 4.971 0.000 0.241 0.556
L1.Prec(mm) 0.0598 0.031 1.911 0.056 -0.002 0.121
L2.Water_level(m) 6.0320 3.036 1.987 0.047 0.082 11.982
L2.Temp(C) 0.0415 0.092 0.452 0.651 -0.138 0.221
L2.Prec(mm) 0.0529 0.027 1.928 0.054 -0.001 0.107
L3.Water_level(m) 2.1178 3.099 0.683 0.494 -3.957 8.192
L3.Temp(C) -0.1544 0.068 -2.269 0.023 -0.288 -0.021
L3.Prec(mm) 0.0451 0.022 2.023 0.043 0.001 0.089
L4.Water_level(m) 14.4403 3.205 4.505 0.000 8.158 20.722
L4.Temp(C) -0.2614 0.105 -2.489 0.013 -0.467 -0.056
L4.Prec(mm) 0.0251 0.016 1.541 0.123 -0.007 0.057
L5.Water_level(m) -2.5946 3.260 -0.796 0.426 -8.984 3.794
L5.Temp(C) -0.3925 0.083 -4.725 0.000 -0.555 -0.230
L5.Prec(mm) 0.0029 0.009 0.307 0.759 -0.016 0.021
L1.e(Water_level(m)) -0.5461 11.511 -0.047 0.962 -23.108 22.016
L1.e(Temp(C)) -0.9064 0.068 -13.319 0.000 -1.040 -0.773
L1.e(Prec(mm)) -0.0641 0.031 -2.059 0.040 -0.125 -0.003
Results for equation Prec(mm)
coef std err z P>|z| [0.025 0.975]
intercept -0.0822 1.078 -0.076 0.939 -2.195 2.031
L1.Water_level(m) -60.9691 73.318 -0.832 0.406 -204.669 82.731
L1.Temp(C) 1.7739 0.639 2.777 0.005 0.522 3.026
L1.Prec(mm) -0.7268 0.195 -3.728 0.000 -1.109 -0.345
L2.Water_level(m) -46.7644 23.215 -2.014 0.044 -92.265 -1.264
L2.Temp(C) 1.3350 0.625 2.137 0.033 0.111 2.559
L2.Prec(mm) -0.5320 0.178 -2.987 0.003 -0.881 -0.183
L3.Water_level(m) -52.6401 22.981 -2.291 0.022 -97.683 -7.597
L3.Temp(C) 1.2115 0.492 2.463 0.014 0.248 2.175
L3.Prec(mm) -0.3644 0.144 -2.523 0.012 -0.647 -0.081
L4.Water_level(m) -21.4982 24.555 -0.876 0.381 -69.626 26.629
L4.Temp(C) -0.1265 0.676 -0.187 0.852 -1.452 1.199
L4.Prec(mm) -0.2408 0.104 -2.313 0.021 -0.445 -0.037
L5.Water_level(m) -43.3006 20.073 -2.157 0.031 -82.642 -3.959
L5.Temp(C) -1.0387 0.509 -2.040 0.041 -2.037 -0.041
L5.Prec(mm) -0.1004 0.062 -1.625 0.104 -0.222 0.021
L1.e(Water_level(m)) 6.8105 71.951 0.095 0.925 -134.212 147.833
L1.e(Temp(C)) -0.1609 0.838 -0.192 0.848 -1.804 1.482
L1.e(Prec(mm)) -0.0484 0.189 -0.257 0.797 -0.418 0.321
Error covariance matrix
coef std err z P>|z| [0.025 0.975]
sqrt.var.Water_level(m) 0.0566 0.001 43.699 0.000 0.054 0.059
sqrt.cov.Water_level(m).Temp(C) 0.2842 0.074 3.867 0.000 0.140 0.428
sqrt.var.Temp(C) 1.8833 0.049 38.741 0.000 1.788 1.979
sqrt.cov.Water_level(m).Prec(mm) 12.5972 0.988 12.754 0.000 10.661 14.533
sqrt.cov.Temp(C).Prec(mm) 1.4881 1.066 1.396 0.163 -0.601 3.578
sqrt.var.Prec(mm) 24.8267 0.626 39.686 0.000 23.601 26.053


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Model Diagnostics

We will run model diagnostics to investigate any unusual behavior.

In [238]:
with plt.style.context('fivethirtyeight'):
    results.plot_diagnostics(figsize=(16,8))
    plt.show()

The properties of the residuals should reflect the properties of the underlying process. The model diagnostics suggest that:

  1. The residuals do not show a pattern.
  2. The ACF plot shows a value of 1 only at the initial lag; all other lags are small and within the confidence band.
  3. The model residuals are approximately normally distributed.

Top left: The residual errors fluctuate around a mean of zero but show non-uniform variance.

Top right: The density plot suggests a normal distribution with mean zero.

Bottom left: In the Q-Q plot, most of the dots fall on the line, with some deviation towards the right, implying a slightly skewed distribution.

Bottom right: The correlogram (ACF plot) shows that the residual errors are not autocorrelated. Any autocorrelation would imply a pattern in the residual errors that is not explained by the model.
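
As an optional numerical companion to the correlogram (a sketch using statsmodels' acorr_ljungbox helper, which is not part of the original flow), we could run a Ljung-Box test on each equation's residuals; large p-values suggest no remaining autocorrelation.

# Ljung-Box test on each equation's residuals (optional check)
from statsmodels.stats.diagnostic import acorr_ljungbox

# Wrap residuals in a DataFrame in case the results object returns an array
resid = pd.DataFrame(results.resid, columns=train.columns)

for col in resid.columns:
    lb = acorr_ljungbox(resid[col], lags=[12], return_df=True)
    print(f"{col}: Ljung-Box p-value at lag 12 = {lb['lb_pvalue'].iloc[0]:.4f}")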

Predict for the next 60 months

We will use the VARMA(5,1) model to forecast for the next 60 months.

In [183]:
erie_forecast = results.forecast(60)
erie_forecast.head()
Out[183]:
Water_level(m) Temp(C) Prec(mm)
2015-01-01 -0.009948 -3.454743 20.633181
2015-02-01 0.043344 0.537420 0.224909
2015-03-01 0.061179 2.491800 6.027532
2015-04-01 0.089384 6.309388 4.976766
2015-05-01 0.069648 4.993909 -2.845398

Step 7 - Invert the Transformation

Remember that the forecasted values represent first-order differences. To compare them to the original data we have to roll back each difference. To roll back a first-order difference we take the most recent value on the training side of the original series, and add it to a cumulative sum of forecasted values.
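
In symbols, if $y_T$ is the last observed value before the forecast period and $\hat{y}^\prime_{T+1}, \dots, \hat{y}^\prime_{T+h}$ are the forecasted first differences, the level forecast is $\hat{y}_{T+h} = y_T + \sum_{i=1}^{h} \hat{y}^\prime_{T+i}$, which is exactly what the cumsum() calls below compute.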

In [184]:
# Build the forecast values from the first difference set
erie_forecast['Water_Forecast'] = (erie_data['Water_level(m)'].iloc[-nobs-1]
                                   + erie_forecast['Water_level(m)'].cumsum())
In [185]:
# Do the same for temperature
erie_forecast['Temp_Forecast'] = erie_data['Temp(C)'].iloc[-nobs-1] + erie_forecast['Temp(C)'].cumsum()
In [186]:
# And for precipitation
erie_forecast['Prec_Forecast'] = erie_data['Prec(mm)'].iloc[-nobs-1] + erie_forecast['Prec(mm)'].cumsum()
In [187]:
erie_forecast.head()
Out[187]:
Water_level(m) Temp(C) Prec(mm) Water_Forecast Temp_Forecast Prec_Forecast
2015-01-01 -0.009948 -3.454743 20.633181 174.160052 -1.384743 50.173181
2015-02-01 0.043344 0.537420 0.224909 174.203396 -0.847323 50.398091
2015-03-01 0.061179 2.491800 6.027532 174.264574 1.644476 56.425622
2015-04-01 0.089384 6.309388 4.976766 174.353959 7.953864 61.402388
2015-05-01 0.069648 4.993909 -2.845398 174.423607 12.947773 58.556990

Combine Forecasts

We will combine the forecasts with the original data in a single dataframe.

In [188]:
test_df = pd.concat([erie_data.iloc[-nobs:], erie_forecast[['Water_Forecast','Temp_Forecast','Prec_Forecast']]], axis=1)
In [189]:
test_df.head()
Out[189]:
Water_level(m) Temp(C) Prec(mm) Water_Forecast Temp_Forecast Prec_Forecast
date
2015-01-01 174.16 -3.9 45.28 174.160052 -1.384743 50.173181
2015-02-01 174.03 -3.4 50.56 174.203396 -0.847323 50.398091
2015-03-01 173.99 1.6 25.33 174.264574 1.644476 56.425622
2015-04-01 174.22 7.7 86.07 174.353959 7.953864 61.402388
2015-05-01 174.31 13.7 126.41 174.423607 12.947773 58.556990

Step 8 - Validate Model Results

Plot the results

We will plot the VARMA(5,1) predictions along with the test set.

In [235]:
with plt.style.context('fivethirtyeight'):
    fig, axes = plt.subplots(3,1, figsize=(18,18))

    # Water Level
    title = 'VARMA (5,1) Predictions (Water Level)'
    label = 'VARMA (5,1) Predictions'
    test_df['Water_level(m)'].plot(label='Water Level Test Data', legend=True, ax=axes[0], title=title)
    test_df['Water_Forecast'].plot(label=label, legend=True, ax=axes[0])
    axes[0].set(xlabel="",ylabel="Mean Water Level (m)")

    # Temperature
    title = 'VARMA (5,1) Predictions (Temperature)'
    label = 'VARMA (5,1) Predictions'
    test_df['Temp(C)'].plot(label='Temperature Test Data', legend=True, ax=axes[1], title=title)
    test_df['Temp_Forecast'].plot(label=label, legend=True, ax=axes[1])
    axes[1].set(xlabel="",ylabel="Avg. Temperature (C)")
    
    # Precipitation
    title = 'VARMA (5,1) Predictions (Precipitation)'
    label = 'VARMA (5,1) Predictions'
    test_df['Prec(mm)'].plot(label='Precipitation Test Data', legend=True, ax=axes[2], title=title)
    test_df['Prec_Forecast'].plot(label=label, legend=True, ax=axes[2])
    axes[2].set(xlabel="",ylabel="Mean Precipitation (mm)")  

The plots above show the VARMA(5,1) model predictions:
For Water Level, the predictions follow the test data well. The model captures the seasonality but lags in capturing the exact trend.
For Temperature, the model captures the test data fairly well. It seems to slightly underfit the data but is overall a pretty good model.
For Precipitation, the model is not a good fit; it is unable to capture the variations in the test data.

Calculate Loss

We will calculate the RMSE for each variable and compare it to the range of variation in the actual test data.
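
For reference, for $n$ test observations $y_t$ with predictions $\hat{y}_t$: $RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}$.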

In [242]:
RMSE1 = rmse(test_df['Water_level(m)'], test_df['Water_Forecast'])
actual1 = max(test_df['Water_level(m)']) - min(test_df['Water_level(m)'])
print(f'Water Level RMSE: {RMSE1:.3f}')
print(f'Water Level Test Data Variation: {actual1:.2f}')
Water Level RMSE: 0.354
Water Level Test Data Variation: 1.15
In [243]:
RMSE2 = rmse(test_df['Temp(C)'], test_df['Temp_Forecast'])
actual2 = max(test_df['Temp(C)']) - min(test_df['Temp(C)'])
print(f'Temperature RMSE: {RMSE2:.3f}')
print(f'Temperature Test Data Variation: {actual2:.2f}')
Temperature RMSE: 3.413
Temperature Test Data Variation: 27.93
In [244]:
RMSE3 = rmse(test_df['Prec(mm)'], test_df['Prec_Forecast'])
actual3 = max(test_df['Prec(mm)']) - min(test_df['Prec(mm)'])
print(f'Precipitation RMSE: {RMSE3:.3f}')
print(f'Precipitation Test Data Variation: {actual3:.2f}')
Precipitation RMSE: 34.251
Precipitation Test Data Variation: 134.18

RMSE values give us an estimate of how much the model predictions deviate from the actual test data. We can see that the VARMA(5,1) model does worse for Water Level (RMSE: 0.354) than the ARIMA(2,1,2) model (RMSE: 0.294); however, it tells us that there is interdependence between Water Level, Temperature and Precipitation, at least for the timespan we investigated. Similarly, the RMSE for Temperature is fairly low relative to the variation in the test data, so this model fits that series reasonably well. The RMSE for Precipitation, however, is high compared to the variation in the original values, and the model does not fit that series well.

Summary

To summarize, in this notebook we utilized a multivariate time series modeling technique to:

  1. Read Water level, Precipitation and Air Temperature data for Lake Erie and indexed it as a time series.
  2. Checked stationarity of data and differenced the data to make it stationary.
  3. Identified optimal VARMA(p,q) order using grid search.
  4. Split the data, built and fit optimal VARMA model on the data.
  5. Inverted the transformation to return forecasts to the original scale.
  6. Generated predictions and validated model performance.

Here is a summary of the models created throughout the study, along with their RMSE values.

Model RMSE
Double Exponential Smoothing 2.232
Triple Exponential Smoothing 0.258
ARIMA(2,1,2) 0.294
SARIMA(1,0,1)(2,1,0,12) 0.321
SARIMA(2,1,1)(0,1,2,12) 0.299
VARMA(5,1) for Water Level 0.354
VARMA(5,1) for Temperature 3.413
VARMA(5,1) for Precipitation 34.251

Conclusion

Time series, in which data is collected at constant time intervals, are one of the most common data types encountered in daily life. Weather, stock prices, electricity usage, daily burger sales and even blood pressure variations are all examples of data collected at regular intervals.

In this study, we explored one such example: forecasting water levels in Lake Erie. Various reports [4] [5] have shown that in recent years high water levels have wreaked havoc on communities bordering the Great Lakes, leading to widespread damage in coastal cities.

  • We started with Univariate modeling techniques such as Exponential Smoothing Models (Single, Double and Triple) to capture trend and seasonality in data.
  • Next, we moved to ARIMA (Autoregressive Integrated Moving Average) models which combine the ideas of autoregression and differencing and aim to describe the autocorrelations in the data.
  • Next, we explored Seasonal ARIMA models which are formed by including additional seasonal terms in the ARIMA models. We saw that SARIMA was the best model for forecasting Water Levels as it captured the trend and seasonality of data well.
  • Finally, we utilized the multivariate time series modeling technique VARMA (Vector Autoregressive Moving Average) to forecast Water Level using Precipitation and Air Temperature data.

The results from this study can be used by state government officials, residents and small business owners to minimize the impacts of rising water levels on their daily lives and livelihoods, and to better prepare for the coming months.

References

[1] Serban, N. (2018). Unit 3: Multivariate Time Series Modeling [Lecture recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis
[2] Penn State STAT 510: Applied Time Series Analysis, Lesson 11.2. https://online.stat.psu.edu/stat510/lesson/11/11.2
[3] Lütkepohl, H. (2005). Estimation of VARMA Models. In: New Introduction to Multiple Time Series Analysis. Springer, Berlin, Heidelberg.
[4] The Guardian (Sep 2019). Great Lakes region flooding and the climate crisis. https://www.theguardian.com/environment/2019/sep/03/great-lakes-region-flooding-climate-crisis
[5] EurekAlert! (Jul 2020). https://www.eurekalert.org/pub_releases/2020-07/uot-srm070720.php