- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series Analysis
So far in this study, we have looked at univariate time series modeling techniques, where a single variable, Water Level, varied over time. Next, we will look at multivariate [1] time series analysis, where each variable depends not only on its own past values but also on past values of other variables. We will predict how Water Level in Lake Erie varies over time along with other variables, such as Precipitation and Air Temperature over the lake.
We will continue to use Mean Water Level (m) data for Lake Erie. We will also bring in Precipitation (mm) data for the Erie basin (land + lake) and Mean Air Temperature (${^\circ}C$) data over Lake Erie.
In this notebook, we will:
- read and process water level, precipitation and air temperature data for Lake Erie
- check the data for stationarity and difference it where needed
- select the optimal VARMA model order, then build and validate the model
- forecast all three variables and compare the predictions against a test set
# General Imports
import pandas as pd
import itertools
import time
# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Time Series Imports
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.tools import diff
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.varmax import VARMAX, VARMAXResults
from statsmodels.tools.eval_measures import rmse
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
We will use Mean Water Level data for Lake Erie, Precipitation data for the Erie basin (land + lake) and Mean Air Temperature data over Lake Erie. We'll need to do some quick processing to read the data and convert it to have a time series index.
The air temperature data is available for the period Jan 1948 - Dec 2019 and is measured in ${^\circ}C$. The data is in a wide table format, with Year as rows and each Month as a column. It needs to be converted so that each Year-Month combination is a row. We will read the data and process it for use.
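To see concretely what this reshape does, here is a minimal illustration of pandas melt() on a small hypothetical two-month table:
# Hypothetical wide table: Year as rows, one column per month
toy = pd.DataFrame({'Year': [1948, 1949],
                    'Jan': [-4.0, -3.1],
                    'Feb': [-3.5, -2.8]})
# melt() turns each Year-Month combination into its own row
print(toy.melt(id_vars='Year', var_name='Month', value_name='Temp(C)'))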
# Read data
air_temp = pd.read_csv('Erie_AirTemp.csv')
air_temp.head()
# Convert rows to columns
air_temp = air_temp.melt(id_vars='Year', var_name='Month', value_name='Temp(C)')
# Create date column from Year and Month
air_temp['date'] = pd.to_datetime(air_temp.Year.astype(str) + '-' + air_temp.Month.astype(str))
# Drop Year and Month cols
air_temp.drop(columns = ['Year', 'Month'], inplace=True)
# Sort dataframe by date
air_temp.sort_values(by=['date'], inplace=True)
air_temp.head()
# Print Min and Max values
print(min(air_temp['date']))
print(max(air_temp['date']))
# Set date as index
air_temp.set_index('date', inplace=True)
air_temp.head()
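Before moving on, a quick structural check helps confirm the reshape worked. This sketch assumes the stated Jan 1948 - Dec 2019 coverage with no missing months:
# Sanity check (assumes complete coverage): Jan 1948 - Dec 2019 is 72 years x 12 months
assert air_temp.shape[0] == 72 * 12
# Each Year-Month combination should appear exactly once
assert air_temp.index.is_unique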
The precipitation data is available for the period Jan 1940 - Dec 2019 and is measured in mm. This data is also in a wide table format, with Year as rows and each Month as a column. It needs to be converted so that each Year-Month combination is a row. We will read the data and process it for use.
# Read data
prec_data = pd.read_csv('erie_precipitation.csv')
prec_data.head()
# Convert rows to columns
prec_data = prec_data.melt(id_vars='YYYY', var_name='Month', value_name='Prec(mm)')
# Create date column from Year and Month
prec_data['date'] = pd.to_datetime(prec_data.YYYY.astype(str) + '-' + prec_data.Month.astype(str))
# Drop Year and Month cols
prec_data.drop(columns = ['YYYY', 'Month'], inplace=True)
# Sort dataframe by date
prec_data.sort_values(by=['date'], inplace=True)
prec_data.head()
# Print Min and Max values
print(prec_data.shape)
print(min(prec_data['date']))
print(max(prec_data['date']))
Since air temperature data is available starting from Jan'1948, we will subset precipitation data for the same time period.
# Subset data
prec_data = prec_data[prec_data['date']>='1948-01-01']
prec_data.head()
# Set date as index
prec_data.set_index('date', inplace=True)
prec_data.head()
The mean water level data is available for the period Jan 1918 - Dec 2019 and is measured in metres. We will continue to use the Lake Erie water level data we have used so far in this study.
erie_data = pd.read_csv('erie_data')
erie_data.head()
# Convert date to datetime column
erie_data['date'] = pd.to_datetime(erie_data['date'])
erie_data.dtypes
# Drop the smoothing model columns carried over from earlier notebooks
erie_data.drop(columns=['SES12','DES12','TES12'], inplace=True)
# Rename Erie col to Water level
erie_data = erie_data.rename(columns={'Erie':'Water_level(m)'})
# Print Min and Max values
print(erie_data.shape)
print(min(erie_data['date']))
print(max(erie_data['date']))
Since air temperature data is available starting from Jan'1948, we will subset water level data for the same time period.
# Subset Data
erie_data = erie_data[erie_data['date']>='1948-01-01']
erie_data.head()
# Set date as index
erie_data.set_index('date', inplace=True)
We will merge water level, precipitation and air temperature data into a single dataframe.
erie_data = erie_data.join(air_temp)
erie_data.head()
erie_data = erie_data.join(prec_data)
erie_data.head()
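Since all three frames were subset to the same Jan 1948 - Dec 2019 monthly index, the joins should introduce no missing values; a quick check:
# Count missing values per column; all should be 0 if the indexes aligned
print(erie_data.isna().sum())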
Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly, etc.). Since observations occur at the start of each month, we'll use MS (month start). A full list of time series offset aliases can be found in the pandas documentation.
# Set Frequency
erie_data.index.freq = 'MS'
erie_data.index
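As a quick sanity check, pandas can also infer the frequency directly from the index; a minimal sketch, assuming the index is complete with no gaps:
# Should print 'MS' (month start) for a complete monthly index
print(pd.infer_freq(erie_data.index))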
VARMA (Vector Autoregressive Moving Average) models [1] [2] [3], represented as VARMA(p,q), evaluate multiple variables where each variable depends not only on its own past values but also on past values of the other variables. A $K$-dimensional VARMA model of order $(p,q)$ models each of the $K$ variables $y_k$ in the data.
For example, the system of equations for a 2-dimensional VARMA(1,1) model is:
$y_{1,t} = c_1 + \phi_{11,1}y_{1,t-1} + \phi_{12,1}y_{2,t-1} + \theta_{11,1}\varepsilon_{1,t-1} + \theta_{12,1}\varepsilon_{2,t-1} + \varepsilon_{1,t}$
$y_{2,t} = c_2 + \phi_{21,1}y_{1,t-1} + \phi_{22,1}y_{2,t-1} + \theta_{21,1}\varepsilon_{1,t-1} + \theta_{22,1}\varepsilon_{2,t-1} + \varepsilon_{2,t}$
where:
- coefficient $\phi_{ii,l}$ captures the influence of the $l$th lag of variable $y_i$ on itself,
- coefficient $\phi_{ij,l}$ captures the influence of the $l$th lag of variable $y_j$ on $y_i$,
- coefficient $\theta_{ii,l}$ captures the influence of the $l$th lag of error $\varepsilon_i$ on $y_i$,
- coefficient $\theta_{ij,l}$ captures the influence of the $l$th lag of error $\varepsilon_j$ on $y_i$,
- and $\varepsilon_{1,t}$ and $\varepsilon_{2,t}$ are white noise residuals.
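The same system can be written compactly in matrix form, which is how the model generalizes to $K$ dimensions:

$$\mathbf{y}_t = \mathbf{c} + \Phi_1 \mathbf{y}_{t-1} + \Theta_1 \boldsymbol{\varepsilon}_{t-1} + \boldsymbol{\varepsilon}_t, \quad \Phi_1 = \begin{bmatrix} \phi_{11,1} & \phi_{12,1} \\ \phi_{21,1} & \phi_{22,1} \end{bmatrix}, \quad \Theta_1 = \begin{bmatrix} \theta_{11,1} & \theta_{12,1} \\ \theta_{21,1} & \theta_{22,1} \end{bmatrix}$$

where $\mathbf{y}_t = (y_{1,t}, y_{2,t})^\top$, $\boldsymbol{\varepsilon}_t = (\varepsilon_{1,t}, \varepsilon_{2,t})^\top$ and $\mathbf{c} = (c_1, c_2)^\top$. A general VARMA($p$,$q$) model adds one $K \times K$ matrix $\Phi_l$ per AR lag and one $\Theta_l$ per MA lag.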
We will follow this 8-step process to build and validate the VARMA model, and to forecast for future periods:
Step 1 — Visualize the data.
Step 2 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can model the data.
Step 3 — Difference: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference, then check for stationarity.
Step 4 — Select the optimal p and q terms for model order.
Step 5 — Split the data into train and test sets.
Step 6 — Build the model: Build the model and obtain predictions on test data.
Step 7 — Invert transform the data.
Step 8 — Validate model: Compare the predicted values to the actuals in the test sample.
fig, axes = plt.subplots(3,1, figsize=(15,12))
erie_data['Water_level(m)'].plot(title='Mean Water Levels (in metres)', ax=axes[0])
erie_data['Temp(C)'].plot(title='Avg. Monthly Temperature (in Celsius)', ax=axes[1])
erie_data['Prec(mm)'].plot(title='Mean Precipitation (in mm)', ax=axes[2])
axes[0].set(xlabel='')
axes[1].set(xlabel='')
axes[2].set(xlabel='')
fig.tight_layout(pad=1.0)
plt.show()
The plots above show the mean water level, temperature and precipitation variation for Lake Erie. The water level data shows an increasing and then decreasing trend. The average monthly temperature shows no trend, but clear seasonality. The mean precipitation shows a slight trend with seasonality and cyclic variations. Let's test the data for stationarity.
To effectively use time series models, we need to understand if our data is stationary. A stationary series [2] has constant mean and variance over time. A time series that shows seasonality is not stationary.
A test for stationarity usually involves a hypothesis test, where the null hypothesis $H_0$ is that the series is nonstationary. The alternate hypothesis $H_1$ supports stationarity. To determine whether a series is stationary we can use the augmented Dickey-Fuller Test.
We will define a function that produces more informative output from the adfuller() test. We will use AIC for the autolag parameter because it penalizes higher-order models relative to simpler models.
# Define the ADF test helper function
def aug_dftest(series, title=''):
    print(f'Augmented Dickey-Fuller Test for: {title}')
    result = adfuller(series.dropna(), autolag='AIC')  # dropna() handles differenced data
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    output = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        output[f'critical value ({key})'] = val
    print(output.to_string())
    if output['p-value'] <= 0.05:
        print('Reject the null hypothesis; data is stationary')
    else:
        print('Fail to reject the null hypothesis; data is non-stationary')
# Run the test on each variable
for col in erie_data:
    aug_dftest(erie_data[col], title=col)
    print()
Both the temperature and precipitation variables are stationary; however, the water level variable is not. Let's difference the data to make it stationary.
Differencing [3] is a method of transforming a non-stationary time series into a stationary one. Often the data itself is not stationary, but its differences are. If there is an obvious upward or downward linear trend, a first difference may be needed; a quadratic trend might need a second-order difference. Often (though not always) a first, non-seasonal difference will "detrend" the data.
The first difference ${y}^\prime_{t}$ is the difference between the observation at the current time period ${y}_{t}$ and the previous time period ${y}_{t-1}$. If the first differences still fail to exhibit a constant mean and variance, we take the second difference, i.e., the difference of the first differences.
First Order Differencing:
$$y_{t}^{\prime} = y_{t} - y_{t-1}$$
Second Order Differencing:
$$y_{t}^{\prime\prime} = y_{t}^{\prime} - y_{t-1}^{\prime} = \left(y_{t}-y_{t-1}\right) - \left(y_{t-1}-y_{t-2}\right) = y_{t} - 2y_{t-1} + y_{t-2}$$
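In pandas, .diff() implements exactly these definitions; here is a minimal sketch verifying them on a small hypothetical series:
import numpy as np
# Hypothetical series to verify the differencing definitions above
s = pd.Series([10.0, 12.0, 15.0, 19.0])
first = s.diff()           # y'_t  = y_t - y_{t-1}
second = s.diff().diff()   # y''_t = y'_t - y'_{t-1} = y_t - 2*y_{t-1} + y_{t-2}
assert np.allclose(first.iloc[1:], (s - s.shift(1)).iloc[1:])
assert np.allclose(second.iloc[2:], (s - 2*s.shift(1) + s.shift(2)).iloc[2:])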
We'll take a first order difference of the entire DataFrame and re-run the augmented Dickey-Fuller tests.
# Difference the data
erie_transformed = erie_data.diff()
# Drop NA values
erie_transformed = erie_transformed.dropna()
# Run the test on each differenced variable
for col in erie_transformed:
    aug_dftest(erie_transformed[col], title=col)
    print()
Test results confirm that we achieved stationarity after the first difference (d = 1).
Our goal here is to find the optimal set of model orders (p,q) that yields the best performance for our model. We will run a grid search over different values of the AR(p) and MA(q) terms.
We will:
- create a list of candidate (p,q) orders,
- fit a VARMAX model for each order and record its AIC score,
- select the order with the lowest AIC.
The function will evaluate all models created during this stepwise operation and report the best VARMA model along with its AIC score. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. The AIC lets us compare how well models fit the data while taking model complexity into account, so models that fit well using fewer features receive a better (lower) AIC score than similar models that use more features.
# Create a list of candidate VARMA model orders (p, q)
p = q = range(0, 6)
pq = list(itertools.product(p, q))
print(pq)
# Function to iterate over VARMA orders
def stepwise_varma(data):
    model_res = {}
    for param in pq:
        try:
            model = VARMAX(data, order=param, trend='c')
            model_fit = model.fit(maxiter=1000, disp=False)
            print(f'VARMA{param} - AIC: {model_fit.aic}')
            model_res[param] = model_fit.aic
        except Exception:
            # Orders that fail to estimate (e.g. (0,0)) are skipped
            continue
    # Print the model with the lowest AIC
    low_key = min(model_res, key=model_res.get)
    print('\n')
    print('---------------------')
    print('MODEL WITH LOWEST AIC')
    print('---------------------')
    print(f'VARMA{low_key} - AIC: {model_res[low_key]}')
# Run the grid search on the differenced data
stepwise_varma(erie_transformed)
The function determined VARMA(5,1) to be the optimal model. The recommended order is the one with the lowest Akaike Information Criterion (AIC) score. Note that the recommended model may not be the one with the closest fit; the AIC score takes complexity into account and tries to identify the best forecasting model.
Now that we have identified VARMA(5,1) as the optimal model, we will split the data into training and test sets and fit the model on the training set.
It will be useful to define a number-of-observations variable for our test set. For this analysis, let's use the data for the last 60 months as the test set.
nobs=60
train, test = erie_transformed[0:-nobs], erie_transformed[-nobs:]
print(train.shape)
print(test.shape)
We will build and fit the VARMA(5, 1) model.
model = VARMAX(train, order=(5, 1), trend='c')
results = model.fit(maxiter=1000, disp=False)
results.summary()
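Since we imported VARMAXResults earlier, we can optionally persist the fitted results so a long fit does not have to be repeated; a minimal sketch (the file name varma_erie.pkl is arbitrary):
# Persist the fitted results to disk (pickle-based)
results.save('varma_erie.pkl')
# Reload them later without refitting
loaded = VARMAXResults.load('varma_erie.pkl')
print(loaded.aic)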
We will run model diagnostics to investigate any unusual behavior.
with plt.style.context('fivethirtyeight'):
    results.plot_diagnostics(figsize=(16,8))
    plt.show()
The properties of the residuals should reflect the properties of the underlying process. The model diagnostics suggest [5] that:
- the residuals do not show a pattern,
- apart from the lag-0 value of 1, all lags in the ACF plot are small and within the confidence band,
- the model residuals are approximately normally distributed.
- Top Left: The residual errors fluctuate around a mean of zero but show non-uniform variance.
- Top Right: The density plot suggests a normal distribution with mean zero.
- Bottom Left: In the Q-Q plot, most points fall along the line, with some deviation towards the right, implying a slightly skewed distribution.
- Bottom Right: The correlogram (ACF plot) shows that the residual errors are not autocorrelated. Any autocorrelation would imply a pattern in the residual errors that the model has not explained.
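As a complementary, quantitative check on residual autocorrelation, we can run a Ljung-Box test on each residual series; a minimal sketch (testing at lag 12, an arbitrary choice for monthly data):
from statsmodels.stats.diagnostic import acorr_ljungbox
# Large p-values suggest no remaining autocorrelation in the residuals
for col in results.resid.columns:
    print(col)
    print(acorr_ljungbox(results.resid[col].dropna(), lags=[12], return_df=True))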
We will use the VARMA(5,1) model to forecast for the next 60 months.
erie_forecast = results.forecast(60)
erie_forecast.head()
Remember that the forecasted values represent first-order differences. To compare them to the original data we have to roll back each difference. To roll back a first-order difference we take the most recent value on the training side of the original series, and add it to a cumulative sum of forecasted values.
# Roll back the first difference for Water Level: start from the most recent
# training value and add the cumulative sum of the forecasted differences
erie_forecast['Water_Forecast'] = erie_data['Water_level(m)'].iloc[-nobs-1] + erie_forecast['Water_level(m)'].cumsum()
# Do the same for Temperature
erie_forecast['Temp_Forecast'] = erie_data['Temp(C)'].iloc[-nobs-1] + erie_forecast['Temp(C)'].cumsum()
# ...and for Precipitation
erie_forecast['Prec_Forecast'] = erie_data['Prec(mm)'].iloc[-nobs-1] + erie_forecast['Prec(mm)'].cumsum()
erie_forecast.head()
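As a sanity check on this inversion logic, we can verify on the original series that adding the first value back to the cumulative sum of its first differences reconstructs it:
import numpy as np
# Un-differencing the original series should reproduce it
orig = erie_data['Water_level(m)']
reconstructed = orig.iloc[0] + orig.diff().cumsum()
assert np.allclose(reconstructed.iloc[1:], orig.iloc[1:])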
We will combine forecasts with original data in a dataframe.
# Combine the last nobs months of actuals with the rolled-back forecasts
test_df = pd.concat([erie_data.iloc[-nobs:], erie_forecast[['Water_Forecast','Temp_Forecast','Prec_Forecast']]], axis=1)
test_df.head()
We will plot the VARMA(5,1) predictions along with the test set.
with plt.style.context('fivethirtyeight'):
    fig, axes = plt.subplots(3, 1, figsize=(18,18))
    label = 'VARMA (5,1) Predictions'
    # Water Level
    title = 'VARMA (5,1) Predictions (Water Level)'
    test_df['Water_level(m)'].plot(label='Water Level Test Data', legend=True, ax=axes[0], title=title)
    test_df['Water_Forecast'].plot(label=label, legend=True, ax=axes[0])
    axes[0].set(xlabel="", ylabel="Mean Water Level (m)")
    # Temperature
    title = 'VARMA (5,1) Predictions (Temperature)'
    test_df['Temp(C)'].plot(label='Temperature Test Data', legend=True, ax=axes[1], title=title)
    test_df['Temp_Forecast'].plot(label=label, legend=True, ax=axes[1])
    axes[1].set(xlabel="", ylabel="Avg. Temperature (C)")
    # Precipitation
    title = 'VARMA (5,1) Predictions (Precipitation)'
    test_df['Prec(mm)'].plot(label='Precipitation Test Data', legend=True, ax=axes[2], title=title)
    test_df['Prec_Forecast'].plot(label=label, legend=True, ax=axes[2])
    axes[2].set(xlabel="", ylabel="Mean Precipitation (mm)")
    plt.show()
The plots above show the VARMA(5,1) model predictions:
- For Water Level, the model predictions follow the test data well. The model captures the seasonality but lags in capturing the exact trend.
- For Temperature, the model captures our test data fairly well. It appears to slightly underfit the data, but overall it is a pretty good model.
- For Precipitation, the model is not a good fit; it is not able to capture the variations in the test data.
We will calculate the RMSE for each variable and compare it to the variation in the actual test data.
RMSE1 = rmse(test_df['Water_level(m)'], test_df['Water_Forecast'])
actual1 = max(test_df['Water_level(m)']) - min(test_df['Water_level(m)'])
print(f'Water Level RMSE: {RMSE1:.3f}')
print(f'Water Level Test Data Variation: {actual1:.2f}')
RMSE2 = rmse(test_df['Temp(C)'], test_df['Temp_Forecast'])
actual2 = max(test_df['Temp(C)']) - min(test_df['Temp(C)'])
print(f'Temperature RMSE: {RMSE2:.3f}')
print(f'Temperature Test Data Variation: {actual2:.2f}')
RMSE3 = rmse(test_df['Prec(mm)'], test_df['Prec_Forecast'])
actual3 = max(test_df['Prec(mm)']) - min(test_df['Prec(mm)'])
print(f'Precipitation RMSE: {RMSE3:.3f}')
print(f'Precipitation Test Data Variation: {actual3:.2f}')
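The three cells above repeat the same computation; an equivalent compact loop over the (actual, forecast) column pairs:
# Equivalent loop over (actual, forecast) column pairs
pairs = [('Water_level(m)', 'Water_Forecast'),
         ('Temp(C)', 'Temp_Forecast'),
         ('Prec(mm)', 'Prec_Forecast')]
for actual, forecast in pairs:
    err = rmse(test_df[actual], test_df[forecast])
    span = test_df[actual].max() - test_df[actual].min()
    print(f'{actual} RMSE: {err:.3f} (test data variation: {span:.2f})')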
RMSE values give us an estimate of how much the model predictions deviate from the actual test data. We can see that the VARMA(5,1) model does somewhat worse for Water Level (RMSE: 0.354) compared to the ARIMA(2,1,2) model (RMSE: 0.294); however, it tells us that there is interdependence between Water Level, Temperature and Precipitation, at least for the timespan we investigated. Similarly, the RMSE for Temperature (3.413) is fairly low relative to the variation in the test data, so this model fits fairly well. The RMSE for Precipitation (34.251), however, is high compared to the original values, and the model does not fit that data well.
To summarize, in this notebook we utilized a multivariate time series modeling technique to:
- process water level, precipitation and air temperature data into a single monthly time series,
- test the variables for stationarity and difference the data,
- select, build and validate a VARMA(5,1) model,
- forecast all three variables for the last 60 months and evaluate the predictions with RMSE.
Here is a summary of the models created throughout the study, along with their RMSE (reported in metres for the water level models, in ${^\circ}C$ for the VARMA temperature row and in mm for the VARMA precipitation row).
Model | RMSE |
---|---|
Double Exponential Smoothing | 2.232 |
Triple Exponential Smoothing | 0.258 |
ARIMA(2,1,2) | 0.294 |
SARIMA(1,0,1)(2,1,0,12) | 0.321 |
SARIMA(2,1,1)(0,1,2,12) | 0.299 |
VARMA(5,1) for Water Level | 0.354 |
VARMA(5,1) for Temperature | 3.413 |
VARMA(5,1) for Precipitation | 34.251 |
Time series are one of the most common data types encountered in daily life, with data collected at constant time intervals. Weather, stock prices, electricity usage, daily burger sales and even blood pressure variations are some examples of such data.
In this study, we explored one such example: forecasting water levels in Lake Erie. Various reports [4] [5] have shown that in recent years, high water levels have wreaked havoc on communities bordering the Great Lakes, leading to widespread damage in coastal cities.
The results from this study can be utilized by state government officials, residents and small business owners to minimize the impacts of rising water levels on their daily lives and livelihoods, and to better prepare for the coming months.
[1] Serban, N. (2018). Unit 3: Multivariate Time Series Modeling [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis
[2] Penn State STAT 510: Applied Time Series Analysis, Lesson 11.2. https://online.stat.psu.edu/stat510/lesson/11/11.2
[3] Lütkepohl, H. (2005). Estimation of VARMA Models. In: New Introduction to Multiple Time Series Analysis. Springer, Berlin, Heidelberg.
[4] The Guardian (September 2019). https://www.theguardian.com/environment/2019/sep/03/great-lakes-region-flooding-climate-crisis
[5] EurekAlert! (July 2020). https://www.eurekalert.org/pub_releases/2020-07/uot-srm070720.php