Lake Erie Water Levels - A Time Series Analysis (Part 1/4)

  • 👟 Ready To Run!
  • 🚀 Data Exploration and Cleaning
  • 🤖 Machine Learning
  • ⏱️ Time Series Analysis

In the summer of 2019, high water levels wreaked havoc on communities bordering the Great Lakes, causing widespread damage in coastal cities, eroding shorelines and beaches, destroying state park campgrounds, and interrupting visitor experiences [1] [2].

In May 2019, Lake Erie hit its highest average monthly water level since 1918 [3]. Cleveland, Ohio, on Lake Erie's shore, saw more rainy days than in any year since 1953 [1].

Source: https://www.columbian.com/news/2019/may/14/a-rising-lake-erie-closes-streets-ferry-leaves-debris/

According to Richard B. Rood, a professor in the University of Michigan's department of climate and space sciences and engineering, the "Great Lakes basin, which holds 90% of the nation's freshwater, can expect similar shifts in the coming decades as world temperatures increase. Higher temperatures give the atmosphere more capacity to hold evaporated water, which is why storms are dumping more rain than 50 years ago." [1]

Lake Erie, carved out by glacial ice less than 4,000 years ago, is the fourth-largest lake (by surface area) of the five Great Lakes in North America. It is also the shallowest and smallest by volume, with an average depth of 62 ft (19 m) and a volume of 116 cubic miles [8]. For comparison, Lake Superior has an average depth of 483 ft (147 m) and a volume of 2,900 cubic miles. Since Lake Erie is the shallowest and smallest, it also has the shortest average water residence time (the mean time that water spends in the lake).

In this study, we will use mean water level data for Lake Erie, precipitation data for the Erie basin (land + lake), and mean air temperature data over Lake Erie to understand trends and forecast how water levels will vary in the future. We will explore several univariate and multivariate time series modeling techniques:

  1. Exponential Smoothing Models
  2. ARIMA (Autoregressive Integrated Moving Average) Models
  3. SARIMA (Seasonal Autoregressive Integrated Moving Average) Models
  4. VARMA (Vector Autoregressive Moving Average) Models

We will start by exploring mean water level data for Lake Erie from 1918 to 2019.

Part 1: Exponential Smoothing Models

In this notebook, we will:

  • Read Lake Erie water level data
  • Perform datetime operations to index it as a time series
  • Explore the data
  • Decompose the data to separate the trend and seasonality components
  • Build and fit different Exponential Smoothing models on the data
    • Simple Exponential Smoothing
    • Double Exponential Smoothing
    • Triple Exponential Smoothing
  • Perform prediction on test data and evaluate model performance
    • Split the data
    • Fit models
    • Generate predictions
    • Evaluate model performance
In [1]:
# General Imports
import pandas as pd

# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pylab import rcParams
rcParams['figure.figsize'] = (15,8)

# Time Series Imports
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

Read the Data

In [2]:
lakes_data = pd.read_csv('Lakes Data.txt')
lakes_data.head()
Out[2]:
month year Superior Michigan-Huron St. Clair Erie Ontario
0 Jan 1918 183.25 176.71 174.59 173.90 74.74
1 feb 1918 183.20 176.73 174.74 173.82 74.72
2 mar 1918 183.17 176.80 174.74 174.01 74.92
3 apr 1918 183.14 176.89 174.84 174.02 75.10
4 may 1918 183.22 176.99 175.00 173.98 75.09
In [3]:
# Add date column
lakes_data['date'] = pd.to_datetime(lakes_data.year.astype(str) + '-' + lakes_data.month.astype(str))
lakes_data.head()
Out[3]:
month year Superior Michigan-Huron St. Clair Erie Ontario date
0 Jan 1918 183.25 176.71 174.59 173.90 74.74 1918-01-01
1 feb 1918 183.20 176.73 174.74 173.82 74.72 1918-02-01
2 mar 1918 183.17 176.80 174.74 174.01 74.92 1918-03-01
3 apr 1918 183.14 176.89 174.84 174.02 75.10 1918-04-01
4 may 1918 183.22 176.99 175.00 173.98 75.09 1918-05-01

Subset Data for Erie

In [4]:
erie_data = lakes_data[['Erie','date']].copy()
erie_data.head()
Out[4]:
Erie date
0 173.90 1918-01-01
1 173.82 1918-02-01
2 174.01 1918-03-01
3 174.02 1918-04-01
4 173.98 1918-05-01
In [5]:
type(erie_data)
Out[5]:
pandas.core.frame.DataFrame

Setting a DatetimeIndex Frequency

Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it is daily, monthly, etc.). Since observations occur at the start of each month, we'll use MS. A full list of time series offset aliases can be found in the pandas documentation.

In [6]:
# Assign date as Index
erie_data.set_index('date', inplace=True)
erie_data.index
Out[6]:
DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',
               '1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
               '1918-09-01', '1918-10-01',
               ...
               '2019-03-01', '2019-04-01', '2019-05-01', '2019-06-01',
               '2019-07-01', '2019-08-01', '2019-09-01', '2019-10-01',
               '2019-11-01', '2019-12-01'],
              dtype='datetime64[ns]', name='date', length=1224, freq=None)
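
Note the freq=None above. As an optional check (a small sketch), pandas can usually infer the intended frequency from an evenly spaced index, and it should agree with the MS alias we set next:

# Optional: infer the frequency from the index spacing (expected: 'MS')
pd.infer_freq(erie_data.index)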
In [7]:
# Update frequency of data to make it Month-Start
erie_data.index.freq = 'MS'
erie_data.index
Out[7]:
DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',
               '1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
               '1918-09-01', '1918-10-01',
               ...
               '2019-03-01', '2019-04-01', '2019-05-01', '2019-06-01',
               '2019-07-01', '2019-08-01', '2019-09-01', '2019-10-01',
               '2019-11-01', '2019-12-01'],
              dtype='datetime64[ns]', name='date', length=1224, freq='MS')

Data Exploration

"A time series is anything that is observed sequentially over time"[5]. Time series is the data where same response is known for many time periods. Some examples of time series data are Temperature readings at an airport, Stock price every two seconds or Electricity consumption in a month etc.

Time series data can exhibit a variety of patterns [4] (see the synthetic sketch after this list). Data may have:

  1. Trends over time, when there is a long-term increase or decrease in the data, such as a steadily rising or falling stock price.
  2. Seasonality, when a time series is affected by the time of the year or the day of the week, such as monthly temperatures or annual precipitation. Seasonality is always of a fixed and known frequency.
  3. Cyclical variations, when the data exhibit rises and falls that are not of a fixed frequency. These fluctuations are often related to the business cycle, such as multi-year booms and recessions.
  4. Randomness, where the data show random fluctuations with no strong pattern, such as a stock price moving simply due to random variation.
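
To make these patterns concrete, here is a small synthetic sketch (illustrative only, not Lake Erie data) that adds a linear trend, a fixed 12-month seasonal cycle, and random noise:

import numpy as np

rng = np.random.default_rng(42)
idx = pd.date_range('2000-01-01', periods=120, freq='MS')
trend = np.linspace(0, 5, 120)                      # long-term increase
seasonal = np.sin(2 * np.pi * np.arange(120) / 12)  # fixed annual cycle
noise = rng.normal(0, 0.3, 120)                     # random fluctuations
pd.Series(trend + seasonal + noise, index=idx).plot(title='Synthetic series: trend + seasonality + noise');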

Let's plot the Erie water level data to understand patterns.

Plot the Data

In [8]:
# Plot the Data
ax = erie_data.plot(figsize=(18,6))
ax.set(ylabel='Mean Water Level (in metric units)', title='Lake Erie Water Levels over Time', xlabel='');

The plot above shows mean water level variation for Lake Erie from Jan 1918 to Dec 2019. The data show an increasing trend from the 1930s to the 1950s, followed by a decreasing trend until the late 1960s. In the early 2000s, water levels show little trend, with a sudden increasing streak starting around 2010. The water levels in 2019 are the highest in Lake Erie's recorded history.

Top 10 Water Levels

In [9]:
top10 = erie_data.sort_values(by='Erie',ascending=False).head(10)
In [10]:
from matplotlib import dates
In [11]:
ax = top10.plot(kind='bar', ylim=(174.8,175.2), figsize=(16,8))
ax.set(xlabel='', ylabel='Mean Water Level (in metric units)', title='Highest 10 Mean Water Levels - Lake Erie');

Lowest 10 Water Levels

In [12]:
low10 = erie_data.sort_values(by='Erie',ascending=False).tail(10)
In [13]:
ax = low10.plot(kind='bar', ylim=(173, 173.35), figsize=(16,8))
ax.set(ylabel='Mean Water Level (in metric units)',title='Lowest 10 Mean Water Levels - Lake Erie',xlabel='');

Data Decomposition

It is often helpful to split a time series into several components, each representing an underlying pattern, to improve understanding of the series. There are two forms of classical decomposition: additive and multiplicative. Both methods assume that the seasonal component is constant from year to year. An additive model is preferred when the trend appears linear; a multiplicative model works better when the data appear to grow exponentially.

We will use Seasonal and Trend decomposition using Loess (STL) [5]. STL decomposition has several advantages over other methods:

  1. It can handle nonlinear relationships.
  2. It can handle any type of seasonality, and the seasonal component is allowed to change over time.
  3. The method is robust to outliers.

Let's decompose the water level data using STL Decomposition.

In [14]:
# Enlarge the figure for the four-panel decomposition plot
# (rcParams and STL were already imported above)
rcParams['figure.figsize'] = (12,10)

stl = STL(erie_data['Erie'], seasonal=13, robust=True)
res = stl.fit()
fig = res.plot()

The three components, trend, seasonality, and residuals, are shown in the bottom three panels of the plot.

  • The trend panel shows the long-term rise and fall in the data over time.
  • The seasonal panel shows strong seasonality in the data, which also changes over time.
  • The residual component in the bottom panel is what is left over when the trend and seasonal components have been subtracted from the data.

These components can be added together to reconstruct the data shown in the top panel.
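
As a quick sanity check (a short sketch), the STL components should sum back exactly to the original series for an additive decomposition:

import numpy as np

# Additive decomposition: observed = trend + seasonal + residual
reconstructed = res.trend + res.seasonal + res.resid
print(np.allclose(reconstructed, erie_data['Erie']))  # expect True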

Exponential Smoothing Models

Exponential smoothing methods [7] produce forecasts that are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, more recent observations are weighted more heavily than older ones. We will look at the following Exponential Smoothing models:

  1. Simple Exponential Smoothing
  2. Double Exponential Smoothing (Holt's Method)
  3. Triple Exponential Smoothing (Holt-Winters' Method)

Simple Exponential Smoothing

In the Simple Exponential Smoothing [6] [7] method, the weight of each observation decreases exponentially as we move back in time, i.e., the smallest weights are associated with the oldest observations. If $\alpha$ is small, more weight is given to observations from the distant past; if $\alpha$ is large (close to 1), more weight is given to recent observations.
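
A tiny numeric sketch (with an illustrative alpha, not an optimized one) shows how quickly the weights $\alpha(1-\alpha)^{i}$ decay:

# Weights on Y_t, Y_{t-1}, Y_{t-2}, ... for alpha = 0.5
alpha = 0.5
weights = [alpha * (1 - alpha)**i for i in range(5)]
print(weights)  # [0.5, 0.25, 0.125, 0.0625, 0.03125]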

This method is suitable for forecasting data with no clear trend or seasonal pattern. We will fit a Simple Exponential Smoothing model to the data.

$$ \hat{Y}_{t}=\alpha\left(Y_{t}+\sum_{i=1}^{t-1}(1-\alpha)^{i} Y_{t-i}\right), \quad \text{where } 0 \leq \alpha \leq 1 $$
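
Equivalently, the forecast can be written in the recursive form actually used in computation:

$$ \hat{Y}_{t}=\alpha Y_{t}+(1-\alpha) \hat{Y}_{t-1} $$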
In [15]:
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

sesModel = SimpleExpSmoothing(erie_data)

Based on the statsmodels documentation for SimpleExpSmoothing, we can either let the model optimize the smoothing level or specify alpha ourselves. Here we derive alpha from a span using the common relation alpha = 2/(span + 1), with a span of 12 for monthly data (alpha ≈ 0.154).

In [16]:
# Fit Model
span = 12
alpha = 2/(span+1)

sesFitted = sesModel.fit(smoothing_level=alpha, optimized=False)
In [17]:
# Store fitted values (shifted by one step so each value aligns with the observation it smooths)
erie_data['SES12'] = sesFitted.fittedvalues.shift(-1)
In [18]:
# Plot Data
ax = erie_data.plot(figsize=(18,8))
ax.set_xlabel("")
ax.set_ylabel("Mean Water Level (metric units)")
ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie');

The plot above shows the Simple Exponential Smoothing model along with the original data. Let's zoom into the last two years (2018-2019) to see how the model fits.

In [19]:
# Zoom into last 2 years
with plt.style.context('fivethirtyeight'):
    ax = erie_data['2018-01-01':].plot()
    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
    ax.autoscale(axis='x', tight=True);

From the plot, we can see that the Simple Exponential Smoothing model does not fit the data well. With no trend or seasonal component, it simply lags the series, overshooting and undershooting at different points, and is not a good representation of this time series.


Double Exponential Smoothing (Holt's Method)

Double Exponential Smoothing [6] [7] extends the simple exponential smoothing method to allow forecasting of data with a trend. This method adds a second smoothing factor $\beta$ (beta) that addresses the trend in the data. Like the alpha factor, values of the beta factor fall between zero and one ($0<\beta\leq1$). The benefit is that the model can anticipate future increases or decreases, where the level-only model would just work from recent level estimates.

$$ \begin{array}{l} L_{t}=\alpha Y_{t}+(1-\alpha)\left(L_{t-1}+T_{t-1}\right) \\ T_{t}=\beta\left(L_{t}-L_{t-1}\right)+(1-\beta) T_{t-1} \\ \hat{Y}_{t}=L_{t-1}+T_{t-1}\\ \end{array}\\ $$

where $L_{t}$ is the level at time t and $\alpha$ is the weight (or smoothing constant) for the level; $T_{t}$ is the trend at time t and $\beta$ is the weight (or smoothing constant) for the trend.

We can also address different types of change (growth/decay) in the trend: if a time series displays a straight-line (sloped) trend, an additive adjustment is used; if it displays an exponential (curved) trend, a multiplicative adjustment is used. We will fit a Double Exponential Smoothing model with an additive trend to the data.
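
To make the recursion concrete, here is a minimal sketch of Holt's method with illustrative (non-optimized) smoothing constants; in practice, statsmodels estimates $\alpha$ and $\beta$ for us when we call .fit():

import numpy as np

def holt_linear(y, alpha=0.5, beta=0.5):
    # One-step-ahead fitted values from Holt's linear recursion (sketch).
    y = np.asarray(y, dtype=float)
    level, trend = y[0], y[1] - y[0]  # simple initialization
    preds = []
    for t in range(1, len(y)):
        preds.append(level + trend)                               # Y_hat_t = L_{t-1} + T_{t-1}
        prev_level = level
        level = alpha * y[t] + (1 - alpha) * (level + trend)      # L_t
        trend = beta * (level - prev_level) + (1 - beta) * trend  # T_t
    return preds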

In [20]:
# Create Model
from statsmodels.tsa.holtwinters import ExponentialSmoothing

desModel = ExponentialSmoothing(erie_data['Erie'], trend='add')
In [21]:
# Fit Model
desFitted = desModel.fit()
In [22]:
# Store fitted values
erie_data['DES12'] = desFitted.fittedvalues.shift(-1)
In [23]:
# Plot model
ax = erie_data[['Erie', 'DES12']].plot(figsize=(18,12))
ax.set_xlabel("")
ax.set_ylabel("Mean Water Level (metric units)")
ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
ax.autoscale(axis='x', tight=True);

The plot above shows the Double Exponential Smoothing model along with the original data. Let's zoom into the last two years (2018-2019) to see how the model fits.

In [24]:
with plt.style.context('fivethirtyeight'):
    ax = erie_data['2018-01-01':].plot(figsize=(18,10))
    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
    ax.autoscale(axis='x', tight=True);

The plot above shows the Double Exponential Smoothing (DES12) model plotted along with the original data and the Simple Exponential Smoothing model. We can see that Double Exponential Smoothing fits the data much better than the Simple model. The DES12 model captures the trend better but still overshoots and undershoots the data at different intervals.


Triple Exponential Smoothing

Triple Exponential Smoothing [6] [7], the method most closely associated with Holt-Winters, adds support for both trend and seasonality in the data. The method comprises three smoothing equations: one for the level, one for the trend, and one for the seasonal component, with corresponding smoothing parameters $\alpha$, $\beta$ and $\gamma$.

There are two variations of this method that differ in the seasonal component. The additive method is preferred when seasonal variations are roughly constant, while the multiplicative method is preferred when seasonal variations change in proportion to the level of the series. The equations below are shown for the additive method.

$$ \begin{array}{l} L_{t}=\alpha\left(Y_{t}-S_{t-p}\right)+(1-\alpha)\left(L_{t-1}+T_{t-1}\right) \\ T_{t}=\beta\left(L_{t}-L_{t-1}\right)+(1-\beta) T_{t-1} \\ S_{t}=\gamma\left(Y_{t}-L_{t}\right)+(1-\gamma) S_{t-p} \\ \hat{Y}_{t}=L_{t-1}+T_{t-1}+S_{t-p} \end{array} $$

where $L_{t}$ is the level at time t and $\alpha$ is the weight (or smoothing constant) for the level; $T_{t}$ is the trend at time t and $\beta$ is the weight (or smoothing constant) for the trend; $S_{t}$ is the seasonal component at time t, $\gamma$ is the weight (or smoothing constant) for the seasonal component, and p is the seasonal period.

We will fit a Triple Exponential Smoothing model to our data.

In [25]:
# Create Model
tesModel = ExponentialSmoothing(erie_data['Erie'], trend='add', seasonal='add', seasonal_periods=12)
In [26]:
# Fit Model
tesFitted = tesModel.fit()
In [27]:
# Store fitted values
erie_data['TES12'] = tesFitted.fittedvalues.shift(-1)
In [28]:
# Plot model
ax = erie_data[['Erie','TES12']].plot(figsize=(18,10))
ax.set_xlabel("")
ax.set_ylabel("Mean Water Level (metric units)")
ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
ax.autoscale(axis='x', tight=True);

The plot above shows the Triple Exponential Smoothing model along with the original data. Let's zoom into the last two years (2018-2019) to see how the model fits.

In [29]:
with plt.style.context('fivethirtyeight'):
    ax = erie_data['2018-01-01':].plot(figsize=(18,10))
    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
    ax.autoscale(axis='x', tight=True);

The plot above shows the Triple Exponential Smoothing (TES12) model plotted along with the original data and the Simple and Double Exponential Smoothing models. In this zoomed view, the Triple Exponential Smoothing model doesn't appear to fit the data as well as the Double Exponential Smoothing model.

More complex models are often assumed to perform better; we will test this assumption when we forecast the data. As we'll see, the ability to predict the fluctuating seasonal pattern greatly improves our forecast.

Forecasting

When forecasting time series data, the aim is to estimate how the sequence of observations will continue into the future. We will:

  1. Split the data into Train and Test sets
  2. Fit Double and Triple Exponential Smoothing model on Training data
  3. Forecast both models for the same time period as Test data
  4. Evaluate models against Test Data
In [30]:
# Check data
erie_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1224 entries, 1918-01-01 to 2019-12-01
Freq: MS
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Erie    1224 non-null   float64
 1   SES12   1223 non-null   float64
 2   DES12   1223 non-null   float64
 3   TES12   1223 non-null   float64
dtypes: float64(4)
memory usage: 87.8 KB

Split the Data

The data has 1224 observations. We will split it into train and test sets, keeping the last 48 observations (4 years of data) in the test set.

In [31]:
# Split data
train_data = erie_data.iloc[:1176]
test_data = erie_data.iloc[1176:]

Fit the Models

Fit Double and Triple Exponential Smoothing model on Training data.

In [32]:
# Fit models
des_fittedmodel = ExponentialSmoothing(train_data['Erie'], trend='add').fit()
tes_fittedmodel = ExponentialSmoothing(train_data['Erie'], trend='add', 
                                       seasonal='add', seasonal_periods=12).fit()

Create Predictions - Forecasting

Forecast both models for 48 time periods.

In [33]:
des_predictions = des_fittedmodel.forecast(48).rename('DES Forecast')
tes_predictions = tes_fittedmodel.forecast(48).rename('TES Forecast')

Evaluate Model Performance

We will plot the training data, test data and the predictions from both models to compare performance.

In [39]:
# Plot predictions
train_data['Erie'].plot(legend=True, label='Train', figsize=(18,10))
test_data['Erie'].plot(legend=True, label='Test')
des_predictions.plot(legend=True, label='Double Exponential Smoothing Predictions')
tes_predictions.plot(legend=True, label='Triple Exponential Smoothing Predictions');

The plot above shows the training data, test data, and predictions from the Double and Triple Exponential Smoothing models. Let's look at just the last 48 time periods to understand model performance more clearly.

In [40]:
with plt.style.context('fivethirtyeight'):
    test_data['Erie'].plot(legend=True, label='Test', title='Model Predictions', figsize=(18,10))
    des_predictions.plot(legend=True, label='Double Exponential Smoothing Predictions')
    tes_predictions.plot(legend=True, label='Triple Exponential Smoothing Predictions');

The plot above shows that the Triple Exponential Smoothing model tracks the test data much more closely than the Double Exponential Smoothing model.

Check Losses

Let's evaluate the Root Mean Squared Error (RMSE) from both models to see which model had a lower loss and performed better.
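
For reference, over $n$ test observations:

$$ \mathrm{RMSE}=\sqrt{\frac{1}{n} \sum_{t=1}^{n}\left(Y_{t}-\hat{Y}_{t}\right)^{2}} $$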

In [36]:
from statsmodels.tools.eval_measures import rmse
In [37]:
rmse_des = rmse(test_data['Erie'], des_predictions)
print(f'RMSE for Double Exponential Smoothing Model is: {rmse_des}')

rmse_tes = rmse(test_data['Erie'], tes_predictions)
print(f'RMSE for Triple Exponential Smoothing Model is: {rmse_tes}')
RMSE for Double Exponential Smoothing Model is: 2.2324657022542147
RMSE for Triple Exponential Smoothing Model is: 0.2483124840455114

The results show that the Triple Exponential Smoothing model has a much lower error and fits the data better. Still, neither the Double nor the Triple Exponential Smoothing model captures the trend of the series especially well, although the Triple Exponential Smoothing model captures the seasonality fairly well.

Export Data
In [38]:
# Write to csv
erie_data.to_csv('erie_data.csv')

Summary

To summarize, in this notebook we:

  1. Read Lake Erie water level data and indexed it as a time series.
  2. Explored and then decomposed data to separate the trend and seasonality components.
  3. Built and fit different Exponential Smoothing models on the data.
  4. Performed prediction on test data and evaluated model performance.

We have seen that neither the Double nor the Triple Exponential Smoothing model captures the trend of the time series especially well. In the next part, we will look at more sophisticated models, such as ARIMA models, that are better at capturing the trend and seasonality in the data.

References