- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series Analysis
In summer 2019, high water levels wreaked havoc on communities bordering the Great Lakes, leading to widespread damage in coastal cities, eroded shorelines and beaches, destroyed state park campgrounds, and interrupted visitor experiences [1] [2].
In May 2019, Lake Erie hit its highest average monthly water level since 1918 [3]. Cleveland, Ohio, on Lake Erie's shore, saw more rainy days than in any year since 1953 [1].
Source: https://www.columbian.com/news/2019/may/14/a-rising-lake-erie-closes-streets-ferry-leaves-debris/
According to Richard B. Rood, a professor in the University of Michigan's department of climate and space sciences and engineering, the "Great Lakes basin, which holds 90% of the nation's freshwater, can expect similar shifts in the coming decades as world temperatures increase. Higher temperatures give the atmosphere more capacity to hold evaporated water, which is why storms are dumping more rain than 50 years ago" [1].
Lake Erie, carved out by glacial ice less than 4,000 years ago, is the fourth-largest lake (by surface area) of the five Great Lakes in North America. It is also the shallowest and smallest by volume of the Great Lakes, with an average depth of 62 ft (19 m) and a volume of 116 cubic miles [8]. For comparison, Lake Superior has an average depth of 483 ft (147 m) and a volume of 2,900 cubic miles. Since Lake Erie is the shallowest and smallest, it also has the shortest average water residence time (the mean time that water spends in the lake).
In this study, we will use Mean Water Level data for Lake Erie, Precipitation data for the Erie basin (land + lake), and Mean Air Temperature data over Lake Erie to understand trends and forecast how water levels will vary in the future. We will explore various univariate and multivariate time series modeling techniques, such as:
We will start by exploring Mean Water Level data for Lake Erie from 1918 - 2019.
In this notebook, we will:
# General Imports
import pandas as pd
# Imports for Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pylab import rcParams
rcParams['figure.figsize'] = (15,8)
# Time Series Imports
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
lakes_data = pd.read_csv('Lakes Data.txt')
lakes_data.head()
# Add date column
lakes_data['date'] = pd.to_datetime(lakes_data.year.astype(str) + '-' + lakes_data.month.astype(str))
lakes_data.head()
erie_data = lakes_data[['Erie','date']].copy()
erie_data.head()
type(erie_data)
Note that our DatetimeIndex does not have a frequency. In order to build models, statsmodels needs to know the frequency of the data (whether it's daily, monthly etc.). Since observations occur at the start of each month, we'll use MS. A full list of time series offset aliases can be found here.
# Assign date as Index
erie_data.set_index('date', inplace=True)
erie_data.index
# Update frequency of data to make it Month-Start
erie_data.index.freq = 'MS'
erie_data.index
"A time series is anything that is observed sequentially over time"[5]. Time series is the data where same response is known for many time periods. Some examples of time series data are Temperature readings at an airport, Stock price every two seconds or Electricity consumption in a month etc.
Time series data can exhibit a variety of patterns [4]. Data may have:
- Trend: a long-term increase or decrease in the data over time, such as a steadily rising or falling stock price.
- Seasonality: when a time series is affected by the time of year or the day of the week, such as monthly temperatures or annual precipitation. Seasonality is always of a fixed and known frequency.
- Cyclical variations: rises and falls that are not of a fixed frequency. These fluctuations are often related to the business cycle, such as weekly sales cycles.
- Randomness: random fluctuations with no strong pattern, such as a stock price that varies simply due to random variation.

Let's plot the Erie water level data to understand these patterns.
# Plot the Data
ax = erie_data.plot(figsize=(18,6))
ax.set(ylabel='Mean Water Level (in metric units)', title='Lake Erie Water Levels over Time', xlabel='');
The plot above shows the mean water level variation for Lake Erie from January 1918 to December 2019. The data show an increasing trend from the 1930s to the 1950s, followed by a decreasing trend until the late 1960s. In the early 2000s, water levels show little trend, with a sudden increasing streak starting around 2010. The water levels in 2019 are the highest on record for Lake Erie.
top10 = erie_data.sort_values(by='Erie',ascending=False).head(10)
from matplotlib import dates
ax = top10.plot(kind='bar', ylim=(174.8,175.2), figsize=(16,8))
ax.set(xlabel='', ylabel='Mean Water Level (in metric units)', title='Highest 10 Mean Water Levels - Lake Erie');
low10 = erie_data.sort_values(by='Erie',ascending=False).tail(10)
ax = low10.plot(kind='bar', ylim=(173, 173.35), figsize=(16,8))
ax.set(ylabel='Mean Water Level (in metric units)',title='Lowest 10 Mean Water Levels - Lake Erie',xlabel='');
It is often helpful to split a time series into several components, each representing an underlying pattern, to improve our understanding of the series. There are two forms of classical decomposition: additive and multiplicative. Both methods assume that the seasonal component is constant from year to year. An additive model is preferred when the trend appears roughly linear; a multiplicative model works better when the data grow exponentially.
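For reference, a classical additive decomposition can be computed with statsmodels' seasonal_decompose; the lines below are a minimal sketch only (assuming the erie_data frame prepared above), and are not part of the original analysis.
# Sketch: classical additive decomposition for comparison (we use STL below)
from statsmodels.tsa.seasonal import seasonal_decompose
classical = seasonal_decompose(erie_data['Erie'], model='additive', period=12)
classical.plot();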
We will use Seasonal and Trend decomposition using Loess (STL) [5]. STL decomposition has many advantages over other methods:
Let's decompose the water level data using STL Decomposition.
from pylab import rcParams
rcParams['figure.figsize'] = (12,10)
from statsmodels.tsa.seasonal import STL
stl = STL(erie_data, seasonal=13, robust=True)
res = stl.fit()
fig = res.plot()
The three components Trend, Seasonality and Residuals are shown in the bottom three panels of the plot.
- The trend panel shows the long-term increases and decreases in the data over time.
- The seasonal panel shows strong seasonality in the data, which also changes over time.
- The residual component shown in the bottom panel is what is left over when the seasonal and trend-cycle components have been subtracted from the data.
These components can be added together to reconstruct the data shown in the top panel.
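As a quick sanity check, the three components should sum back to the observed series. The snippet below is a sketch, assuming the res object fitted above and that the components come back as pandas Series aligned with erie_data.
# Reconstruct the series from the STL components and compare with the original data
reconstructed = res.trend + res.seasonal + res.resid
print((reconstructed - erie_data['Erie']).abs().max())  # should be very close to zero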
Exponential smoothing methods [7] produce forecasts that are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, more recent observations are weighted more heavily than older observations. We will look at the following Exponential Smoothing models:
In the Simple Exponential Smoothing [6] [7] method, the weight of each observation decreases exponentially as we move back in time, i.e. the smallest weights are associated with the oldest observations. If $\alpha$ is small, more weight is given to observations from the distant past, and if $\alpha$ is large (close to 1), more weight is given to recent observations.
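In equation form, the smoothed value (which is also the one-step-ahead forecast) combines the most recent observation with the previous smoothed value:
$$\hat{y}_{t+1} = \alpha y_{t} + (1-\alpha)\hat{y}_{t}, \qquad 0 < \alpha \leq 1$$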
This method is suitable for forecasting data with no clear trend or seasonal pattern. We will fit a Simple Exponential Smoothing model to the data.
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
sesModel = SimpleExpSmoothing(erie_data)
Based on the statsmodels documentation for `SimpleExpSmoothing`, we need to specify `alpha`, which is based on a `span`. We will use a `span` of 12 for monthly data.
# Fit Model
span = 12
alpha = 2/(span+1)
sesFitted = sesModel.fit(smoothing_level=alpha, optimized=False)
# Store fitted values
erie_data['SES12'] = sesFitted.fittedvalues.shift(-1)
# Plot Data
ax = erie_data.plot(figsize=(18,8))
ax.set_xlabel("")
ax.set_ylabel("Mean Water Level (metric units)")
ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie');
The plot above shows the Simple Exponential Smoothing model along with the original data. Let's zoom into the last two years of data (2018-2019) to see how the model fits.
# Zoom into last 2 years
with plt.style.context('fivethirtyeight'):
    ax = erie_data['2018-01-01':].plot()
    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
    ax.autoscale(axis='x', tight=True);
From the plot, we can see that the Simple Exponential Smoothing model does not fit the data well. The model is unable to capture the seasonality or the trend, repeatedly over- and under-predicting the observations, and is not a good representation of this time series.
Double Exponential Smoothing [6] [7] extends the simple exponential smoothing method to allow forecasting of data with a trend. This method adds a second smoothing factor $\beta$ (beta) that addresses the trend in the data. Like the alpha factor, values for the beta factor fall between zero and one ($0<\beta\leq1$). The benefit is that the model can anticipate future increases or decreases, whereas the level-only model would simply project from the most recent level.
In the standard formulation (Holt's linear method), the level and trend are updated as
$$L_{t} = \alpha y_{t} + (1-\alpha)(L_{t-1} + T_{t-1})$$
$$T_{t} = \beta(L_{t} - L_{t-1}) + (1-\beta)T_{t-1}$$
and the h-step-ahead forecast is $\hat{y}_{t+h} = L_{t} + hT_{t}$, where $L_{t}$ is the level at time t, $\alpha$ is the weight (or smoothing constant) for the level, $T_{t}$ is the trend at time t, and $\beta$ is the weight (or smoothing constant) for the trend.
We will fit a Double Exponential Smoothing model to the data.
We can also address different types of change (growth/decay) in the trend. If a time series displays a straight-line sloped trend, an additive adjustment is used. If the time series displays an exponential (curved) trend, a multiplicative adjustment is used.
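In statsmodels, the type of trend adjustment is selected through the trend argument of ExponentialSmoothing ('add' or 'mul'). The lines below are a sketch only and are not used for this dataset; note that the damped-trend keyword is named damped_trend in recent statsmodels releases (older versions call it damped).
# Sketch: multiplicative (and damped) trend variant -- not used for the Lake Erie data
from statsmodels.tsa.holtwinters import ExponentialSmoothing
mulTrendModel = ExponentialSmoothing(erie_data['Erie'], trend='mul', damped_trend=True)
mulTrendFitted = mulTrendModel.fit()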
# Create Model
from statsmodels.tsa.holtwinters import ExponentialSmoothing
desModel = ExponentialSmoothing(erie_data['Erie'], trend='add')
# Fit Model
desFitted = desModel.fit()
# Store fitted values
erie_data['DES12'] = desFitted.fittedvalues.shift(-1)
# Plot model
ax = erie_data[['Erie', 'DES12']].plot(figsize=(18,12))
ax.set_xlabel("")
ax.set_ylabel("Mean Water Level (metric units)")
ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
ax.autoscale(axis='x', tight=True);
The plot above shows the Double Exponential Smoothing model along with the original data. Let's zoom into the last two years of data (2018-2019) to see how the model fits.
with plt.style.context('fivethirtyeight'):
    ax = erie_data['2018-01-01':].plot(figsize=(18,10))
    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
    ax.autoscale(axis='x', tight=True);
The plot above shows the Double Exponential Smoothing (`DES12`) model plotted along with the original data and the Simple Exponential Smoothing model. We can see that Double Exponential Smoothing fits the data much better than the Simple model. The `DES12` model captures the trend better, but it still seems to over- and under-predict the data at different intervals.
Triple Exponential Smoothing [6] [7], the method most closely associated with Holt-Winters, adds support for both trend and seasonality in the data. The method comprises three smoothing equations: one for the level, one for the trend, and one for the seasonal component, with corresponding smoothing parameters $\alpha$, $\beta$ and $\gamma$.
There are two variations of this method that differ in the seasonal component. The additive method is preferred when the seasonal variations are roughly constant, while the multiplicative method is preferred when the seasonal variations change in proportion to the level of the series. The equations below are shown for the additive method.
$$L_{t} = \alpha(y_{t} - S_{t-p}) + (1-\alpha)(L_{t-1} + T_{t-1})$$
$$T_{t} = \beta(L_{t} - L_{t-1}) + (1-\beta)T_{t-1}$$
$$S_{t} = \gamma(y_{t} - L_{t}) + (1-\gamma)S_{t-p}$$
where $L_{t}$ is the level at time t, $\alpha$ is the weight (or smoothing constant) for the level, $T_{t}$ is the trend at time t, $\beta$ is the weight (or smoothing constant) for the trend, $S_{t}$ is the seasonal component at time t, $\gamma$ is the weight (or smoothing constant) for the seasonal component, and p is the seasonal period.
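For completeness, a multiplicative seasonal component could be requested in the same way through statsmodels; the lines below are a sketch only, and we follow the additive form in the code that comes next.
# Sketch: Holt-Winters with a multiplicative seasonal component -- not used below
from statsmodels.tsa.holtwinters import ExponentialSmoothing
hwMulModel = ExponentialSmoothing(erie_data['Erie'], trend='add',
                                  seasonal='mul', seasonal_periods=12)
hwMulFitted = hwMulModel.fit()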
We will fit a Triple Exponential Smoothing model to our data.
# Create Model
tesModel = ExponentialSmoothing(erie_data['Erie'], trend='add', seasonal='add', seasonal_periods=12)
# Fit Model
tesFitted = tesModel.fit()
# Store fitted values
erie_data['TES12'] = tesFitted.fittedvalues.shift(-1)
# Plot model
ax = erie_data[['Erie','TES12']].plot(figsize=(18,10))
ax.set_xlabel("")
ax.set_ylabel("Mean Water Level (metric units)")
ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
ax.autoscale(axis='x', tight=True);
The plot above shows the Triple Exponential Smoothing model along with the original data. Let's zoom into the last two years of data (2018-2019) to see how the model fits.
with plt.style.context('fivethirtyeight'):
    ax = erie_data['2018-01-01':].plot(figsize=(18,10))
    ax.set_xlabel("")
    ax.set_ylabel("Mean Water Level (metric units)")
    ax.set_title('Monthly Mean Water Levels (metric units) - Lake Erie')
    ax.autoscale(axis='x', tight=True);
The plot above shows the Triple Exponential Smoothing (`TES12`) model plotted along with the original data and the Simple and Double Exponential Smoothing models. We can see that the Triple Exponential Smoothing model doesn't fit the data as well as the Double Exponential Smoothing model.
More complex models are often assumed to perform better; we will test this assumption when we forecast the data. We'll see that the ability to predict fluctuating seasonal patterns greatly improves our forecasts.
When forecasting time series data, the aim is to estimate how the sequence of observations will continue into the future. We will:
# Check data
erie_data.info()
The data have 1,224 observations. We will split the data into train and test sets, keeping the last 48 observations (4 years of data) as the test set.
# Split data
train_data = erie_data.iloc[:1176]
test_data = erie_data.iloc[1176:]
Fit Double and Triple Exponential Smoothing model on Training data.
# Fit models
des_fittedmodel = ExponentialSmoothing(train_data['Erie'], trend='add').fit()
tes_fittedmodel = ExponentialSmoothing(train_data['Erie'], trend='add',
seasonal='add', seasonal_periods=12).fit()
Forecast both models for 48 time periods.
des_predictions = des_fittedmodel.forecast(48).rename('DES Forecast')
tes_predictions = tes_fittedmodel.forecast(48).rename('TES Forecast')
We will plot the training data, test data and the predictions from both models to compare performance.
# Plot predictions
train_data['Erie'].plot(legend=True, label='Train', figsize=(18,10))
test_data['Erie'].plot(legend=True, label='Test')
des_predictions.plot(legend=True, label='Double Exponential Smoothing Predictions')
tes_predictions.plot(legend=True, label='Triple Exponential Smoothing Predictions');
The plot above shows the training data, test data, and predictions from the Double and Triple Exponential Smoothing models. Let's look at just the last 48 time periods to understand model performance more clearly.
with plt.style.context('fivethirtyeight'):
    test_data['Erie'].plot(legend=True, label='Test', title='Model Predictions', figsize=(18,10))
    des_predictions.plot(legend=True, label='Double Exponential Smoothing Predictions')
    tes_predictions.plot(legend=True, label='Triple Exponential Smoothing Predictions');
The plot above shows that the Triple Exponential Smoothing model seems to fit the test data better than the Double Exponential Smoothing model.
Let's evaluate the Root Mean Squared Error (RMSE) from both models to see which model had a lower loss and performed better.
from statsmodels.tools.eval_measures import rmse
rmse_des = rmse(test_data['Erie'], des_predictions)
print(f'RMSE for Double Exponential Smoothing Model is: {rmse_des}')
rmse_tes = rmse(test_data['Erie'], tes_predictions)
print(f'RMSE for Triple Exponential Smoothing Model is: {rmse_tes}')
The results show that the Triple Exponential Smoothing model has the lower error and fits the data better. Still, neither the Double nor the Triple Exponential Smoothing model captures the trend of the time series particularly well, although the seasonality is captured fairly well by the Triple Exponential Smoothing model.
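To put these RMSE values in context, one simple check is to compare them with the spread of the test set itself; the snippet below is a quick sketch using the test_data frame defined above.
# Compare the model errors against the variability of the test data
print(f"Test set standard deviation: {test_data['Erie'].std():.3f}")
print(f"Test set range: {test_data['Erie'].max() - test_data['Erie'].min():.3f}")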
# Write to csv
erie_data.to_csv('erie_data')
To summarize, in this notebook we:
We have seen that neither the Double nor the Triple Exponential Smoothing model captures the trend of the time series well. In the next part, we will look at more complex models, such as ARIMA models, that are better at capturing seasonality and trend in the data.
[1] https://www.theguardian.com/environment/2019/sep/03/great-lakes-region-flooding-climate-crisis
[2] https://www.eurekalert.org/pub_releases/2020-07/uot-srm070720.php
[3] https://www.cleveland.com/news/2019/06/lake-eries-record-high-water-level-how-does-it-affect-you.html
[4] https://online.stat.psu.edu/stat510/lesson/1/1.1
[5] Hyndman, R.J., & Athanasopoulos, G. (2014). Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp2/
[6] https://online.stat.psu.edu/stat501/node/1001
[7] Sokol, J. (2018). Module 7: Time Series Models [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6501-intro-analytics-modeling
[8] https://en.wikipedia.org/wiki/Lake_Erie