- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series Analysis
So far in this study, we have looked at univariate time series modeling techniques, where a single variable, Water Level, varied over time. Next, we will look at multivariate [1] time series analysis, where each variable depends not only on its own past values but also on past values of other variables. We will predict how Water Level in Lake Erie varies over time along with other variables, such as Precipitation and Air Temperature over the lake.
We will continue to use Mean Water Level (m) data for Lake Erie. We will also bring in Precipitation (mm) data for the Erie basin (land + lake) and Mean Air Temperature (${^\circ}C$) data over Lake Erie.
In this notebook, we will:
- read and process water level, precipitation and air temperature data for Lake Erie
- check the data for stationarity and difference it where needed
- select the optimal VARMA model order, then build and validate the model
- forecast all three variables and compare the predictions against a test set
# General Imports
import pandas as pd
import itertools
import time
# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Time Series Imports
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.tools import diff
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.varmax import VARMAX, VARMAXResults
from statsmodels.tools.eval_measures import rmse
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
We will use Mean Water Level data for Lake Erie, Precipitation data for the Erie basin (land + lake) and Mean Air Temperature data over Lake Erie. We'll need to do some quick processing to read the data and convert it to have a time series index.
The air temperature data is available for the period Jan 1948 - Dec 2019 and is measured in ${^\circ}C$. The data is in a wide table format, with Year as rows and each Month as a column. It needs to be converted so that each Year-Month combination is a row. We will read the data and process it for use.
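To see concretely what this reshape does, here is a minimal illustration of pandas melt() on a small hypothetical two-month table:
# Hypothetical wide table: Year as rows, one column per month
toy = pd.DataFrame({'Year': [1948, 1949],
                    'Jan': [-4.0, -3.1],
                    'Feb': [-3.5, -2.8]})
# melt() turns each Year-Month combination into its own row
print(toy.melt(id_vars='Year', var_name='Month', value_name='Temp(C)'))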
# Read data
air_temp = pd.read_csv('Erie_AirTemp.csv')
air_temp.head()
# Convert rows to columns
air_temp = air_temp.melt(id_vars='Year', var_name='Month', value_name='Temp(C)')
# Create date column from Year and Month
air_temp['date'] = pd.to_datetime(air_temp.Year.astype(str) + '-' + air_temp.Month.astype(str))
# Drop Year and Month cols
air_temp.drop(columns = ['Year', 'Month'], inplace=True)
# Sort dataframe by date
air_temp.sort_values(by=['date'], inplace=True)
air_temp.head()
# Print Min and Max values
print(min(air_temp['date']))
print(max(air_temp['date']))
# Set date as index
air_temp.set_index('date', inplace=True)
air_temp.head()
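Before moving on, a quick structural check helps confirm the reshape worked. This sketch assumes the stated Jan 1948 - Dec 2019 coverage with no missing months:
# Sanity check (assumes complete coverage): Jan 1948 - Dec 2019 is 72 years x 12 months
assert air_temp.shape[0] == 72 * 12
# Each Year-Month combination should appear exactly once
assert air_temp.index.is_unique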
The precipitation data is available for the period Jan 1940 - Dec 2019 and is measured in mm. This data is also in a wide table format, with Year as rows and each Month as a column. It needs to be converted so that each Year-Month combination is a row. We will read the data and process it for use.
# Read data
prec_data = pd.read_csv('erie_precipitation.csv')
prec_data.head()
# Convert rows to columns
prec_data = prec_data.melt(id_vars='YYYY', var_name='Month', value_name='Prec(mm)')
# Create date column from Year and Month
prec_data['date'] = pd.to_datetime(prec_data.YYYY.astype(str) + '-' + prec_data.Month.astype(str))
# Drop Year and Month cols
prec_data.drop(columns = ['YYYY', 'Month'], inplace=True)
# Sort dataframe by date
prec_data.sort_values(by=['date'], inplace=True)
prec_data.head()
# Print Min and Max values
print(prec_data.shape)
print(min(prec_data['date']))
print(max(prec_data['date']))
Since air temperature data is available starting from Jan'1948, we will subset precipitation data for the same time period.
# Subset data
prec_data = prec_data[prec_data['date']>='1948-01-01']
prec_data.head()
# Set date as index
prec_data.set_index('date', inplace=True)
prec_data.head()
The mean water level data is available for the period Jan 1918 - Dec 2019 and is measured in metres. We will continue to use the Lake Erie water level data we have used so far in this study.
erie_data = pd.read_csv('erie_data')
erie_data.head()
# Convert date to datetime column
erie_data['date'] = pd.to_datetime(erie_data['date'])
erie_data.dtypes
# Drop the smoothing model columns carried over from earlier notebooks
erie_data.drop(columns=['SES12','DES12','TES12'], inplace=True)
# Rename Erie col to Water level
erie_data = erie_data.rename(columns={'Erie':'Water_level(m)'})
# Print Min and Max values
print(erie_data.shape)
print(min(erie_data['date']))
print(max(erie_data['date']))
Since air temperature data is available starting from Jan'1948, we will subset water level data for the same time period.
# Subset Data
erie_data = erie_data[erie_data['date']>='1948-01-01']
erie_data.head()
# Set date as index
erie_data.set_index('date', inplace=True)
We will merge water level, precipitation and air temperature data into a single dataframe.
erie_data = erie_data.join(air_temp)
erie_data.head()
erie_data = erie_data.join(prec_data)
erie_data.head()
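Since all three frames were subset to the same Jan 1948 - Dec 2019 monthly index, the joins should introduce no missing values; a quick check:
# Count missing values per column; all should be 0 if the indexes aligned
print(erie_data.isna().sum())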
Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly, etc.). Since observations occur at the start of each month, we'll use MS (month start). A full list of time series offset aliases can be found in the pandas documentation.
# Set Frequency
erie_data.index.freq = 'MS'
erie_data.index
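As a quick sanity check, pandas can also infer the frequency directly from the index; a minimal sketch, assuming the index is complete with no gaps:
# Should print 'MS' (month start) for a complete monthly index
print(pd.infer_freq(erie_data.index))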
VARMA (Vector Autoregressive Moving Average) models [1] [2] [3], represented as VARMA(p,q), evaluate multiple variables where each variable depends not only on its own past values but also on past values of the other variables. A $K$-dimensional VARMA model of order $(p,q)$ models each of the $K$ variables $y_k$ in the data.
For example, the system of equations for a 2-dimensional VARMA(1,1) model is:
$y_{1,t} = c_1 + \phi_{11,1}y_{1,t-1} + \phi_{12,1}y_{2,t-1} + \theta_{11,1}\varepsilon_{1,t-1} + \theta_{12,1}\varepsilon_{2,t-1} + \varepsilon_{1,t}$
$y_{2,t} = c_2 + \phi_{21,1}y_{1,t-1} + \phi_{22,1}y_{2,t-1} + \theta_{21,1}\varepsilon_{1,t-1} + \theta_{22,1}\varepsilon_{2,t-1} + \varepsilon_{2,t}$
where:
- coefficient $\phi_{ii,l}$ captures the influence of the $l$th lag of variable $y_i$ on itself,
- coefficient $\phi_{ij,l}$ captures the influence of the $l$th lag of variable $y_j$ on $y_i$,
- coefficient $\theta_{ii,l}$ captures the influence of the $l$th lag of error $\varepsilon_i$ on $y_i$,
- coefficient $\theta_{ij,l}$ captures the influence of the $l$th lag of error $\varepsilon_j$ on $y_i$,
- and $\varepsilon_{1,t}$ and $\varepsilon_{2,t}$ are white noise residuals.
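The same system can be written compactly in matrix form, which is how the model generalizes to $K$ dimensions:

$$\mathbf{y}_t = \mathbf{c} + \Phi_1 \mathbf{y}_{t-1} + \Theta_1 \boldsymbol{\varepsilon}_{t-1} + \boldsymbol{\varepsilon}_t, \quad \Phi_1 = \begin{bmatrix} \phi_{11,1} & \phi_{12,1} \\ \phi_{21,1} & \phi_{22,1} \end{bmatrix}, \quad \Theta_1 = \begin{bmatrix} \theta_{11,1} & \theta_{12,1} \\ \theta_{21,1} & \theta_{22,1} \end{bmatrix}$$

where $\mathbf{y}_t = (y_{1,t}, y_{2,t})^\top$, $\boldsymbol{\varepsilon}_t = (\varepsilon_{1,t}, \varepsilon_{2,t})^\top$ and $\mathbf{c} = (c_1, c_2)^\top$. A general VARMA($p$,$q$) model adds one $K \times K$ matrix $\Phi_l$ per AR lag and one $\Theta_l$ per MA lag.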
We will follow this 8-step process to build and validate the VARMA model, and to forecast for future periods:
Step 1 — Visualize the data.
Step 2 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can model the data.
Step 3 — Difference: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference, then check for stationarity.
Step 4 — Select the optimal p and q terms for model order.
Step 5 — Split the data into train and test sets.
Step 6 — Build the model: Build the model and obtain predictions on test data.
Step 7 — Invert transform the data.
Step 8 — Validate model: Compare the predicted values to the actuals in the test sample.
fig, axes = plt.subplots(3,1, figsize=(15,12))
erie_data['Water_level(m)'].plot(title='Mean Water Levels (in metres)', ax=axes[0])
erie_data['Temp(C)'].plot(title='Avg. Monthly Temperature (in Celsius)', ax=axes[1])
erie_data['Prec(mm)'].plot(title='Mean Precipitation (in mm)', ax=axes[2])
axes[0].set(xlabel='')
axes[1].set(xlabel='')
axes[2].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plots above show the mean water level, temperature and precipitation variation for Lake Erie. The water level data shows an increasing and then decreasing trend. The average monthly temperature shows no trend, but clear seasonality. The mean precipitation shows a slight trend with seasonality and cyclic variations. Let's test the data for stationarity.
To effectively use time series models, we need to understand if our data is stationary. A stationary series [2] has constant mean and variance over time. A time series that shows seasonality is not stationary.
A test for stationarity usually involves a hypothesis test, where the null hypothesis $H_0$ is that the series is nonstationary. The alternate hypothesis $H_1$ supports stationarity. To determine whether a series is stationary we can use the augmented Dickey-Fuller Test.
We will define a function that produces more informative output from the adfuller() test. We will use AIC for the autolag parameter because it penalizes higher-order models relative to simpler models.
# Define the ADF test helper function
def aug_dftest(series, title=''):
    print(f'Augmented Dickey-Fuller Test for: {title}')
    result = adfuller(series.dropna(), autolag='AIC')  # dropna() handles differenced data
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    output = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        output[f'critical value ({key})'] = val
    print(output.to_string())
    if output['p-value'] <= 0.05:
        print('Reject the null hypothesis; data is stationary')
    else:
        print('Fail to reject the null hypothesis; data is non-stationary')
# Run the test on each variable
for col in erie_data:
    aug_dftest(erie_data[col], title=col)
    print()
Both the temperature and precipitation variables are stationary; however, the water level variable is not. Let's difference the data to make it stationary.
Differencing [3] is a method of transforming a non-stationary time series into a stationary one. Often the data itself is not stationary, but its differences are. If there is an obvious upward or downward linear trend, a first difference may be needed; a quadratic trend might need a second-order difference. Often (though not always) a first, non-seasonal difference will "detrend" the data.
The first difference ${y}^\prime_{t}$ is the difference between the observation at the current time period ${y}_{t}$ and the previous time period ${y}_{t-1}$. If the first differences still fail to exhibit a constant mean and variance, we take the second difference, i.e., the difference of the first differences.
First Order Differencing:
$$y_{t}^{\prime} = y_{t} - y_{t-1}$$
Second Order Differencing:
$$y_{t}^{\prime\prime} = y_{t}^{\prime} - y_{t-1}^{\prime} = \left(y_{t}-y_{t-1}\right) - \left(y_{t-1}-y_{t-2}\right) = y_{t} - 2y_{t-1} + y_{t-2}$$
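In pandas, .diff() implements exactly these definitions; here is a minimal sketch verifying them on a small hypothetical series:
import numpy as np
# Hypothetical series to verify the differencing definitions above
s = pd.Series([10.0, 12.0, 15.0, 19.0])
first = s.diff()           # y'_t  = y_t - y_{t-1}
second = s.diff().diff()   # y''_t = y'_t - y'_{t-1} = y_t - 2*y_{t-1} + y_{t-2}
assert np.allclose(first.iloc[1:], (s - s.shift(1)).iloc[1:])
assert np.allclose(second.iloc[2:], (s - 2*s.shift(1) + s.shift(2)).iloc[2:])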
We'll take a first order difference of the entire DataFrame and re-run the augmented Dickey-Fuller tests.
# Difference the data
erie_transformed = erie_data.diff()
# Drop NA values
erie_transformed = erie_transformed.dropna()
# Run the test on each differenced variable
for col in erie_transformed:
    aug_dftest(erie_transformed[col], title=col)
    print()
Test results confirm that we achieved stationarity after the first difference (d = 1).
Our goal here is to find the optimal set of model orders (p,q) that yields the best performance for our model. We will run a grid search over different values of the AR(p) and MA(q) terms.
We will:
- create a list of candidate (p,q) orders,
- fit a VARMAX model for each order and record its AIC score,
- select the order with the lowest AIC.
The function will evaluate all models created during this stepwise operation and report the best VARMA model along with its AIC score. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC lets us compare how well models fit the data while taking model complexity into account, so models that fit well using fewer features receive a better (lower) AIC score than similar models that use more features.
# Create a list of candidate VARMA model orders (p, q)
p = q = range(0, 6)
pq = list(itertools.product(p, q))
print(pq)
# Function to iterate over VARMA orders
def stepwise_varma(data):
    model_res = {}
    for param in pq:
        try:
            model = VARMAX(data, order=param, trend='c')
            model_fit = model.fit(maxiter=1000, disp=False)
            print(f'VARMA{param} - AIC: {model_fit.aic}')
            model_res[param] = model_fit.aic
        except Exception:
            # Orders that fail to estimate (e.g. (0,0)) are skipped
            continue
    # Print the model with the lowest AIC
    low_key = min(model_res, key=model_res.get)
    print('\n')
    print('---------------------')
    print('MODEL WITH LOWEST AIC')
    print('---------------------')
    print(f'VARMA{low_key} - AIC: {model_res[low_key]}')
# Run the grid search on the differenced data
stepwise_varma(erie_transformed)
The function determined VARMA(5,1) to be the optimal model. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. Note that the recommended model may not be the one with the closest fit; the AIC score takes complexity into account and tries to identify the best forecasting model.
Now that we have identified VARMA(5,1) as the optimal model, we will split the data into training and test sets and fit the model on the training set.
It will be useful to define a number-of-observations variable for our test set. For this analysis, let's use the data for the last 60 months as the test set.
nobs=60
train, test = erie_transformed[0:-nobs], erie_transformed[-nobs:]
print(train.shape)
print(test.shape)
We will build and fit the VARMA(5, 1) model.
model = VARMAX(train, order=(5, 1), trend='c')
results = model.fit(maxiter=1000, disp=False)
results.summary()
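Since we imported VARMAXResults earlier, we can optionally persist the fitted results so a long fit does not have to be repeated; a minimal sketch (the file name varma_erie.pkl is arbitrary):
# Persist the fitted results to disk (pickle-based)
results.save('varma_erie.pkl')
# Reload them later without refitting
loaded = VARMAXResults.load('varma_erie.pkl')
print(loaded.aic)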
We will run model diagnostics to investigate any unusual behavior.
with plt.style.context('fivethirtyeight'):
    results.plot_diagnostics(figsize=(16,8))
    plt.show()
The properties of the residuals should reflect the properties of the underlying process. The model diagnostics suggest [5] that:
- the residuals do not show a pattern,
- apart from the lag-0 value of 1, all lags in the ACF plot are small and within the confidence band,
- the model residuals are approximately normally distributed.
- Top Left: The residual errors fluctuate around a mean of zero but show non-uniform variance.
- Top Right: The density plot suggests a normal distribution with mean zero.
- Bottom Left: In the Q-Q plot, most points fall along the line, with some deviation towards the right, implying a slightly skewed distribution.
- Bottom Right: The correlogram (ACF plot) shows that the residual errors are not autocorrelated. Any autocorrelation would imply a pattern in the residual errors that the model has not explained.
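As a complementary, quantitative check on residual autocorrelation, we can run a Ljung-Box test on each residual series; a minimal sketch (testing at lag 12, an arbitrary choice for monthly data):
from statsmodels.stats.diagnostic import acorr_ljungbox
# Large p-values suggest no remaining autocorrelation in the residuals
for col in results.resid.columns:
    print(col)
    print(acorr_ljungbox(results.resid[col].dropna(), lags=[12], return_df=True))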
We will use the VARMA(5,1) model to forecast for the next 60 months.
erie_forecast = results.forecast(60)
erie_forecast.head()
Remember that the forecasted values represent first-order differences. To compare them to the original data we have to roll back each difference. To roll back a first-order difference we take the most recent value on the training side of the original series, and add it to a cumulative sum of forecasted values.
# Roll back the first difference for Water Level: start from the most recent
# training value and add the cumulative sum of the forecasted differences
erie_forecast['Water_Forecast'] = erie_data['Water_level(m)'].iloc[-nobs-1] + erie_forecast['Water_level(m)'].cumsum()
# Do the same for Temperature
erie_forecast['Temp_Forecast'] = erie_data['Temp(C)'].iloc[-nobs-1] + erie_forecast['Temp(C)'].cumsum()
# ...and for Precipitation
erie_forecast['Prec_Forecast'] = erie_data['Prec(mm)'].iloc[-nobs-1] + erie_forecast['Prec(mm)'].cumsum()
erie_forecast.head()
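As a sanity check on this inversion logic, we can verify on the original series that adding the first value back to the cumulative sum of its first differences reconstructs it:
import numpy as np
# Un-differencing the original series should reproduce it
orig = erie_data['Water_level(m)']
reconstructed = orig.iloc[0] + orig.diff().cumsum()
assert np.allclose(reconstructed.iloc[1:], orig.iloc[1:])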
We will combine forecasts with original data in a dataframe.
# Combine the last nobs months of actuals with the rolled-back forecasts
test_df = pd.concat([erie_data.iloc[-nobs:], erie_forecast[['Water_Forecast','Temp_Forecast','Prec_Forecast']]], axis=1)
test_df.head()
We will plot the VARMA(5,1) predictions along with the test set.
with plt.style.context('fivethirtyeight'):
    fig, axes = plt.subplots(3, 1, figsize=(18,18))
    label = 'VARMA (5,1) Predictions'
    # Water Level
    title = 'VARMA (5,1) Predictions (Water Level)'
    test_df['Water_level(m)'].plot(label='Water Level Test Data', legend=True, ax=axes[0], title=title)
    test_df['Water_Forecast'].plot(label=label, legend=True, ax=axes[0])
    axes[0].set(xlabel="", ylabel="Mean Water Level (m)")
    # Temperature
    title = 'VARMA (5,1) Predictions (Temperature)'
    test_df['Temp(C)'].plot(label='Temperature Test Data', legend=True, ax=axes[1], title=title)
    test_df['Temp_Forecast'].plot(label=label, legend=True, ax=axes[1])
    axes[1].set(xlabel="", ylabel="Avg. Temperature (C)")
    # Precipitation
    title = 'VARMA (5,1) Predictions (Precipitation)'
    test_df['Prec(mm)'].plot(label='Precipitation Test Data', legend=True, ax=axes[2], title=title)
    test_df['Prec_Forecast'].plot(label=label, legend=True, ax=axes[2])
    axes[2].set(xlabel="", ylabel="Mean Precipitation (mm)")
    plt.show()
The plots above show the VARMA(5,1) model predictions:
- For Water Level, the model predictions follow the test data well. The model captures the seasonality but lags in capturing the exact trend.
- For Temperature, the model captures our test data fairly well. It appears to slightly underfit the data, but overall it is a pretty good model.
- For Precipitation, the model is not a good fit; it is not able to capture the variations in the test data.
We will calculate the RMSE for each variable and compare it to the variation in the actual test data.
RMSE1 = rmse(test_df['Water_level(m)'], test_df['Water_Forecast'])
actual1 = max(test_df['Water_level(m)']) - min(test_df['Water_level(m)'])
print(f'Water Level RMSE: {RMSE1:.3f}')
print(f'Water Level Test Data Variation: {actual1:.2f}')
RMSE2 = rmse(test_df['Temp(C)'], test_df['Temp_Forecast'])
actual2 = max(test_df['Temp(C)']) - min(test_df['Temp(C)'])
print(f'Temperature RMSE: {RMSE2:.3f}')
print(f'Temperature Test Data Variation: {actual2:.2f}')
RMSE3 = rmse(test_df['Prec(mm)'], test_df['Prec_Forecast'])
actual3 = max(test_df['Prec(mm)']) - min(test_df['Prec(mm)'])
print(f'Precipitation RMSE: {RMSE3:.3f}')
print(f'Precipitation Test Data Variation: {actual3:.2f}')
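The three cells above repeat the same computation; an equivalent compact loop over the (actual, forecast) column pairs:
# Equivalent loop over (actual, forecast) column pairs
pairs = [('Water_level(m)', 'Water_Forecast'),
         ('Temp(C)', 'Temp_Forecast'),
         ('Prec(mm)', 'Prec_Forecast')]
for actual, forecast in pairs:
    err = rmse(test_df[actual], test_df[forecast])
    span = test_df[actual].max() - test_df[actual].min()
    print(f'{actual} RMSE: {err:.3f} (test data variation: {span:.2f})')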
RMSE values give us an estimate of how much the model predictions deviate from the actual test data. We can see that the VARMA(5,1) model does somewhat worse for Water Level (RMSE: 0.354) compared to the ARIMA(2,1,2) model (RMSE: 0.294); however, it tells us that there is interdependence between Water Level, Temperature and Precipitation, at least for the timespan we investigated. Similarly, the RMSE for Temperature (3.413) is fairly low relative to the variation in the test data, so this model fits fairly well. The RMSE for Precipitation (34.251), however, is high compared to the original values, and the model does not fit that data well.
To summarize, in this notebook we utilized a multivariate time series modeling technique to:
- process water level, precipitation and air temperature data into a single monthly time series,
- test the variables for stationarity and difference the data,
- select, build and validate a VARMA(5,1) model,
- forecast all three variables for the last 60 months and evaluate the predictions with RMSE.
Here is a summary of the models created throughout the study, along with their RMSE (reported in metres for the water level models, in ${^\circ}C$ for the VARMA temperature row and in mm for the VARMA precipitation row).
Model | RMSE |
---|---|
Double Exponential Smoothing | 2.232 |
Triple Exponential Smoothing | 0.258 |
ARIMA(2,1,2) | 0.294 |
SARIMA(1,0,1)(2,1,0,12) | 0.321 |
SARIMA(2,1,1)(0,1,2,12) | 0.299 |
VARMA(5,1) for Water Level | 0.354 |
VARMA(5,1) for Temperature | 3.413 |
VARMA(5,1) for Precipitation | 34.251 |
Time series are one of the most common data types encountered in daily life, with data collected at constant time intervals. Weather, stock prices, electricity usage, daily burger sales and even blood pressure variations are some examples of such data.
In this study, we explored one such example: forecasting water levels in Lake Erie. Various reports [4] [5] have shown that in recent years, high water levels have wreaked havoc on communities bordering the Great Lakes, leading to widespread damage in coastal cities.
The results from this study can be utilized by state government officials, residents and small business owners to minimize the impacts of rising water levels on their daily lives and livelihoods, and to better prepare for the coming months.
[1] Serban, N. (2018). Unit 3: Multivariate Time Series Modeling [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis
[2] Penn State STAT 510: Applied Time Series Analysis, Lesson 11.2. https://online.stat.psu.edu/stat510/lesson/11/11.2
[3] Lütkepohl, H. (2005). Estimation of VARMA Models. In: New Introduction to Multiple Time Series Analysis. Springer, Berlin, Heidelberg.
[4] The Guardian (September 2019). https://www.theguardian.com/environment/2019/sep/03/great-lakes-region-flooding-climate-crisis
[5] EurekAlert! (July 2020). https://www.eurekalert.org/pub_releases/2020-07/uot-srm070720.php