Part 5 - Working with Time Series Data

In the previous notebook, we learned how to be more productive with Pandas by using sophisticated multi-level indexing, aggregating and combining data. In this notebook, we will explore the capabilities for working with Time Series data. We will start with the default datetime object in Python and then jump to data structures for working with time series data in Pandas. Let's dive into the details.

Time series are one of the most common types of structured data that we encounter in daily life. Stock prices, weather data, energy usage, and even digital health metrics are all examples of data that can be collected at different time intervals. Pandas was developed in the context of financial modeling, so it contains an extensive set of tools for working with dates, times, and time-indexed data. Date and time data comes in various flavors, such as timestamps (specific points in time), periods (intervals of time), and time deltas (durations), each of which is covered in this notebook.

In this notebook, we will briefly introduce date and time data types in native python and then focus on how to work with date/time data in Pandas.

Date and Time in Python

Python's basic objects for working with time series data reside in the datetime module. Let's look at some examples.

Building datetime object

datetime objects can be used to quickly perform a host of useful functionalities. A date can be built in various ways and then properties of a datetime object can be used to get specific date and time details from it.

datetime.now() creates a datetime object with current date and time down to the microsecond.

A datetime object can also be created by specifying year, month, day, and other details.
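As a minimal sketch of both approaches, the snippet below builds one datetime from the clock and one from explicit components, then reads a few attributes back. The date used is an arbitrary sample value.

```python
from datetime import datetime

# Current local date and time, down to the microsecond.
now = datetime.now()

# Build a specific datetime from its components (sample values).
dt = datetime(year=2021, month=7, day=4, hour=14, minute=30, second=15)

# Individual components are exposed as attributes.
print(dt.year, dt.month, dt.day)      # 2021 7 4
print(dt.hour, dt.minute, dt.second)  # 14 30 15
print(dt.weekday())                   # 6 (Monday is 0, so 6 is Sunday)
```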

Converting between String and DateTime

The strftime and strptime methods convert between strings and datetime objects. strftime is also available on pandas Timestamp objects (discussed later in this section).

Using strftime and strptime

strftime can be used to convert a datetime object to a string according to a given format. The standard format codes for printing dates are listed in the strftime section of Python's datetime documentation.

Similarly, strptime can be used to parse a string into a datetime object.
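A short round trip illustrates the two methods. The format strings and the sample date are chosen only for illustration.

```python
from datetime import datetime

dt = datetime(2021, 7, 4, 14, 30)

# datetime -> string, using format codes such as %Y (year), %m (month), %d (day).
as_text = dt.strftime("%Y-%m-%d %H:%M")
print(as_text)  # 2021-07-04 14:30

# string -> datetime; the format must match the layout of the input string.
parsed = datetime.strptime("04/07/2021 14:30", "%d/%m/%Y %H:%M")
print(parsed == dt)  # True
```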

While datetime.strptime is a good way to parse a date when a format is known, it can be annoying to write a format each time.

Using parser.parse

The dateutil module provides the parser.parse function, which can parse dates from a variety of string formats without an explicit format string.
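The sketch below parses the same date written in three different ways. It assumes the third-party dateutil package is installed (it is a pandas dependency), and the input strings are sample values.

```python
from datetime import datetime

from dateutil import parser

# The same date written three different ways (sample strings).
d1 = parser.parse("2021-07-04")
d2 = parser.parse("July 4, 2021")
d3 = parser.parse("4th of July, 2021")

print(d1 == d2 == d3 == datetime(2021, 7, 4))  # True
```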

Date and Time in Pandas

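Pandas provides the following fundamental data structures for working with time series data: Timestamp and DatetimeIndex for points in time, Period and PeriodIndex for intervals of time, and Timedelta and TimedeltaIndex for durations.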

The Basics

Pandas provides a Timestamp object, which combines the ease of use of datetime and dateutil with the efficient storage of numpy.datetime64. The to_datetime function parses many different kinds of date representations and returns a Timestamp object.

Passing a single date to to_datetime returns a Timestamp.

strftime can be used to convert this object to a string according to a given format.
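For instance, with an arbitrary sample date and an illustrative format string:

```python
import pandas as pd

# A single date string is parsed into a pandas Timestamp.
ts = pd.to_datetime("2021-07-04")
print(type(ts).__name__)  # Timestamp

# Timestamp supports strftime just like datetime.
print(ts.strftime("%d/%m/%Y"))  # 04/07/2021
```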

DatetimeIndex

Passing a list of dates returns a DatetimeIndex by default, which can be used to index data in a Series or DataFrame.
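A minimal sketch, using made-up dates and values, of building such an index and using it to label a Series:

```python
import pandas as pd

# Several date strings (sample values) become a DatetimeIndex.
dates = pd.to_datetime(["2021-07-03", "2021-07-04", "2021-07-06"])
print(type(dates).__name__)  # DatetimeIndex
print(dates.freq)            # None

# The DatetimeIndex can label the rows of a Series.
s = pd.Series([10, 20, 30], index=dates)
print(s.loc[pd.Timestamp("2021-07-04")])  # 20
```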

DatetimeIndex objects created this way do not have a frequency (hourly, daily, monthly, etc.) by default, as they are just snapshots in time. As a result, frequency-based arithmetic, such as adding an integer to move each date forward by one period, cannot be performed on them directly.

Pandas also supports converting integer or float epoch times to Timestamp and DatetimeIndex. The default unit is nanoseconds, since that is how Timestamp objects are stored internally.
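The sketch below converts one sample epoch value under two different units, passed through the unit argument of to_datetime. The value 1609459200 is the number of seconds from 1970-01-01 to 2021-01-01 (UTC).

```python
import pandas as pd

epoch = 1609459200  # sample value: seconds from 1970-01-01 to 2021-01-01 (UTC)

# Interpreted as seconds.
ts_s = pd.to_datetime(epoch, unit="s")
print(ts_s)  # 2021-01-01 00:00:00

# The same integer interpreted as milliseconds lands in January 1970.
ts_ms = pd.to_datetime(epoch, unit="ms")

# A list of epoch values returns a DatetimeIndex.
idx = pd.to_datetime([epoch, epoch + 86400], unit="s")
```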

Notice that the date values change based on the unit specified.

PeriodIndex

A Timestamp represents a point in time, whereas a Period represents an interval in time. Time Periods can be used to check if a specific event occurs within a certain period, such as when monitoring the number of flights taking off or the average stock price during a period.

A DatetimeIndex object can be converted to a PeriodIndex using the to_period() function by specifying a frequency (such as D to indicate daily frequency).

Since a period represents a time interval, it has a start_time and an end_time.

Note that the start_time and end_time are DatetimeIndex objects because the start and end times are just a snapshot in time of the time period.
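As a small illustration with made-up dates, the conversion and the two boundary attributes look like this:

```python
import pandas as pd

dates = pd.to_datetime(["2021-07-03", "2021-07-04", "2021-07-06"])

# Convert each point in time into the day-long period that contains it.
periods = dates.to_period("D")
print(type(periods).__name__)  # PeriodIndex
print(periods.freqstr)         # D

# Start and end of each period, returned as DatetimeIndex objects.
starts = periods.start_time
ends = periods.end_time
print(starts[0])  # 2021-07-03 00:00:00
print(ends[0])    # the final instant of 2021-07-03
```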

To reiterate the concept, let's look at another example.
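One way to sketch such a test, using a sample period and timestamp, is to compare the timestamp against the period's bounds:

```python
import pandas as pd

# A day-long period and a moment inside it (sample values).
p = pd.Period("2021-07-04", freq="D")
ts = pd.Timestamp("2021-07-04 14:30")

# The timestamp lies inside the period if it falls between its bounds.
inside = p.start_time <= ts <= p.end_time
print(inside)  # True
```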

Since Period is an interval of time, the test returns True showing that Timestamp lies within the time interval.

Arithmetic Operations

Now that a frequency is associated with the object, various arithmetic operations can be performed.

Similarly, we can create time periods with monthly frequency and perform arithmetic operations.
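A minimal sketch of period arithmetic with daily and monthly frequencies follows. The ranges are sample values; adding an integer moves each period by that many units of its frequency.

```python
import pandas as pd

days = pd.period_range("2021-07-03", periods=3, freq="D")
next_days = days + 1       # each period moves forward by one day

months = pd.period_range("2021-01", periods=3, freq="M")
later_months = months + 2  # each period moves forward by two months

print(next_days[0])     # 2021-07-04
print(later_months[0])  # 2021-03
```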

TimedeltaIndex

Time deltas represent the temporal difference between two datetime objects and come in handy when you need to calculate the difference between two dates. A TimedeltaIndex can be created by subtracting a single date from a DatetimeIndex.
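For example, with a few made-up dates:

```python
import pandas as pd

dates = pd.to_datetime(["2021-07-03", "2021-07-04", "2021-07-06"])

# Subtracting a single date from every element yields a TimedeltaIndex.
deltas = dates - dates[0]
print(type(deltas).__name__)  # TimedeltaIndex
print(deltas.days.tolist())   # [0, 1, 3]
```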

Date Range and Frequency

Regular date sequences can be created using functions such as pd.date_range() for timestamps, pd.period_range() for periods, and pd.timedelta_range() for time deltas. Fixed frequencies, such as daily, monthly, or every 15 minutes, are often desirable, and for many applications they are sufficient. Pandas provides a full suite of standard time series frequency codes, listed under "Offset aliases" in the time series section of the pandas user guide.
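A brief sketch of pd.date_range() with a few common argument combinations (the dates are sample values):

```python
import pandas as pd

# Eight consecutive days, given a start date and a number of periods.
by_count = pd.date_range("2021-07-03", periods=8, freq="D")

# The same sequence, given a start and an end date (the default frequency is daily).
by_end = pd.date_range("2021-07-03", "2021-07-10")

# Six values spaced 15 minutes apart.
quarter_hours = pd.date_range("2021-07-03", periods=6, freq="15min")

print(by_count.equals(by_end))  # True
print(quarter_hours[-1])        # 2021-07-03 01:15:00
```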

Note that the output when using date_range() is a DatetimeIndex object where each date is a snapshot in time (Timestamp).
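The other two range functions follow the same pattern. The sketch below, with sample start values, builds eight monthly periods and six hourly time deltas.

```python
import pandas as pd

# Eight month-long periods.
months = pd.period_range("2021-07", periods=8, freq="M")
print(months[0], months[-1])  # 2021-07 2022-02

# Six durations increasing by one hour.
hours = pd.timedelta_range("0 days", periods=6, freq="h")
print(type(hours).__name__)   # TimedeltaIndex
```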

pd.period_range() generated eight periods with monthly frequency. Note that the output is a PeriodIndex object. As mentioned earlier, Period represents an interval in time, whereas Timestamp represents a point in time.

Combining Frequency Codes

Frequency codes can also be combined with numbers to specify other frequencies. For example, a frequency of 1 hour and 30 minutes can be created by combining the hour H and minute T codes.
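As a sketch of the 1 hour 30 minute case: recent pandas releases prefer the lowercase spellings h and min over H and T, so those are used here on the assumption that they are accepted by the installed version. The start date is a sample value.

```python
import pandas as pd

# Four timestamps spaced 1 hour 30 minutes apart.
idx = pd.date_range("2021-07-03", periods=4, freq="1h30min")
print(idx[1])   # 2021-07-03 01:30:00
print(idx[-1])  # 2021-07-03 04:30:00
```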

Similarly, a frequency of 1 day 5 hours and 30 mins can be created by combining the day D, hour H and minute T codes. As an example, we will create a timedelta_range.

Indexing and Selection

Pandas time series tools provide the ability to use dates and times as indices to organize data. This allows for the benefits of indexed data, such as automatic alignment, data slicing, and selection etc.
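The following sketch uses a made-up daily series (not the stock data loaded later) to show selection by a single date, by a date slice, and by a partial date string.

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=90, freq="D")
values = pd.Series(range(90), index=idx)  # made-up values

# Select a single day.
one_day = values.loc[pd.Timestamp("2021-01-15")]

# Slice between two dates; both endpoints are included.
window = values.loc["2021-01-10":"2021-01-12"]

# A partial string selects everything in that month.
feb = values.loc["2021-02"]

print(one_day)          # 14
print(window.tolist())  # [9, 10, 11]
print(len(feb))         # 28
```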

Pandas was developed with a financial context, so it includes some very specific tools for financial data. The pandas-datareader package (installable via conda install pandas-datareader) can import financial data from a number of available sources. Here, we will load stock price data for GE as an example.

Pandas stores timestamps using NumPy’s datetime64 data type at the nanosecond level. Scalar values from a DatetimeIndex are pandas Timestamp objects.

Resampling, Shifting, and Windowing

Resampling

The process of converting a time series from one frequency to another is called Resampling. When higher frequency data is aggregated to lower frequency, it is called downsampling, while converting lower frequency to higher frequency is called upsampling. For simplicity, we'll use just the closing price Close data.

Resampling can be done using the resample() method, or the much simpler asfreq() method.
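To make the difference concrete, here is a small sketch on made-up daily data, downsampled to a weekly frequency W (chosen only to keep the example short). resample() aggregates every value in a bin, while asfreq() simply picks the value at each new index point.

```python
import pandas as pd

# Two weeks of made-up daily values, Monday 2021-01-04 through Sunday 2021-01-17.
idx = pd.date_range("2021-01-04", periods=14, freq="D")
daily = pd.Series(range(14), index=idx, dtype="float64")

# resample() groups the values into weekly bins and aggregates each bin.
weekly_mean = daily.resample("W").mean()
print(weekly_mean.tolist())  # [3.0, 10.0]

# asfreq() selects the value found at each new index point (here, each Sunday).
weekly_pick = daily.asfreq("W")
print(weekly_pick.tolist())  # [6.0, 13.0]
```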

We will downsample the data using 'business year end' frequency BA and create a plot of the data returned after applying the two functions.

Downsample Plot

Plot the down-sampled data to compare the returned data of the two functions.

We can see that at each point, resample returns the average of the previous year, as shown by the dotted line, while asfreq reports the value at the end of the year, as shown by the dashed line.

Upsampling involves converting from a low frequency to a higher frequency where no aggregation is needed. resample() and asfreq() are largely equivalent in the case of upsampling. The default for both methods is to leave the up-sampled points empty (filled with NA values). The asfreq() method accepts arguments to specify how values are imputed.

We will subset the data and then upsample with daily D frequency.

As noted above, the up-sampled points are left empty (NA) by default. The forward fill (ffill) or backward fill (bfill) methods can be used to impute these missing values.
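A minimal sketch of the three behaviors, using two made-up observations three days apart:

```python
import pandas as pd

# Two observations three days apart (made-up values).
s = pd.Series([1.0, 2.0], index=pd.to_datetime(["2021-01-01", "2021-01-04"]))

up = s.asfreq("D")                        # gaps are left as NaN
up_ffill = s.asfreq("D", method="ffill")  # carry the last known value forward
up_bfill = s.asfreq("D", method="bfill")  # pull the next known value backward

print(up_ffill.tolist())  # [1.0, 1.0, 1.0, 2.0]
print(up_bfill.tolist())  # [1.0, 2.0, 2.0, 2.0]
```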

Upsample Plot

Plot the up-sampled data to compare the data returned from various fill methods.

The top plot shows upsampled data using a daily frequency with default settings where non-business days are NA values that do not appear on the plot. The bottom plot shows forward and backward fill strategies for filling the gaps.

Shifting

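A common use case of time series is shifting data in time, i.e., moving data backward and forward through time. Pandas includes shift() and tshift() methods for shifting data: shift() moves the data, while tshift() moves the index. Note that tshift() was deprecated in pandas 1.1 and removed in pandas 2.0; in newer versions, shift() with the freq argument gives the same result.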

In both cases, the shift is specified in multiples of the frequency. Let's look at some examples.

Both forward and backward shift() operations shift the data, leaving the index unmodified. Let's look at how the index is modified with tshift().

The index for the original data ranges from 2008-01-02 to 2008-01-15. Using tshift() to shift backward, we see that the index now ranges from 2007-12-31 to 2008-01-11. The shift is applied in units of the frequency of the datetime index.
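The contrast between moving the data and moving the index can be sketched on a tiny made-up series. Because tshift() is not available in pandas 2.0 and later, this sketch uses shift() with the freq argument as its stand-in.

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=4, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# shift(1): values move one step later, index unchanged, first value becomes NaN.
moved_values = s.shift(1)

# shift(1, freq="D"): values unchanged, every index label moves one day later.
moved_index = s.shift(1, freq="D")

print(moved_values.index.equals(s.index))  # True
print(moved_index.index[0])                # 2021-01-02 00:00:00
```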

Plot the Data

Let's look at another example of shifting data using shift() and tshift() to shift the ge data. We will plot the data to visualize the differences.

The top panel in the plot shows ge data with a red line showing a local date. The middle panel shows the shift(900) operation which shifts the data by 900 days, leaving NA values at early indices. This is represented by the fact that there is no line on the plot for first 900 days. The bottom panel shows the tshift(900) operation, which shifts the index by 900 days, changing the start and end date ranges as shown.

Rolling Window

Rolling statistics are another time series specific operation, in which data is evaluated over a sliding window. Rolling operations are useful for smoothing noisy data. The rolling() operator behaves similarly to resample and groupby operations, but instead of splitting the data into fixed groups, it groups the data over a sliding window.
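A small sketch of a rolling mean on made-up data, with a window of three observations picked for illustration:

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=5, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Mean over a sliding window of three observations.
# The first two results are NaN because the window is not yet full.
smooth = s.rolling(window=3).mean()
print(smooth.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```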

The plot shows GE stock price data. The dashed line represents the 250-day moving average of the stock price.

Time Zones

We live in a global world where many companies operate in different time zones. This makes it crucial to carefully analyze the data based on the correct time zone. Many users work with time series in Coordinated Universal Time (UTC), which is the current international standard. Time zones are expressed as offsets from UTC; for example, California is seven hours behind UTC during daylight saving time (DST) and eight hours behind the rest of the year.

Localization and Conversion

Time series objects in Pandas do not have an assigned time zone by default. Let's consider the GE stock price ge data as an example.

The index's tz field is None. We can assign a time zone using the tz_localize method.

Once a time series has been localized to a particular time zone, it can be easily converted to another time zone with tz_convert.
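Using a made-up index in place of the stock data, localization and conversion can be sketched as follows. The zone names are examples, and the sketch assumes time zone data is available to pandas.

```python
import pandas as pd

idx = pd.date_range("2021-03-01 09:30", periods=3, freq="D")
print(idx.tz)  # None, the index is time zone naive

# Declare that the naive times are UTC.
utc = idx.tz_localize("UTC")

# Express the same instants in another time zone.
eastern = utc.tz_convert("America/New_York")
print(eastern[0])  # 2021-03-01 04:30:00-05:00
```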

Epoch time can be read as timezone-naive timestamps and then localized to the appropriate timezone using the tz_localize method.
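A sketch of that workflow, assuming the epoch value counts seconds since 1970-01-01 UTC (so UTC is the zone to localize to) and using an example target zone:

```python
import pandas as pd

epoch = 1609459200  # sample value: 2021-01-01 00:00:00 UTC, in seconds

naive = pd.to_datetime(epoch, unit="s")  # time zone naive Timestamp
aware = naive.tz_localize("UTC")         # attach the UTC time zone
local = aware.tz_convert("America/Los_Angeles")

print(naive.tz)  # None
print(local)     # 2020-12-31 16:00:00-08:00
```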

Operating between Time Zones

If two time series with different time zones are combined, the result will be UTC.
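This can be checked with a short sketch in which the same made-up series is expressed in two example time zones and then added together:

```python
import pandas as pd

idx = pd.date_range("2021-03-01 09:30", periods=3, freq="D", tz="UTC")
s = pd.Series([1, 2, 3], index=idx)

# The same instants, expressed in two different time zones.
london = s.tz_convert("Europe/London")
tokyo = s.tz_convert("Asia/Tokyo")

# Values are aligned on the underlying instants; the result is indexed in UTC.
combined = london + tokyo
print(combined.index.tz)  # UTC
print(combined.tolist())  # [2, 4, 6]
```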

Common Time Zones

Time zone information in Python comes from a third-party library called pytz (installable using conda install pytz). Let's look at some examples.

To get a time zone object, pytz.timezone can be used.

Common Use Cases

Importing data is the first step in any data science project. Often, you’ll work with data that contains date and time elements. In this section, we will see how to:

  1. Read date columns from data.
  2. Split a column with date and time into separate columns.
  3. Combine different date and time columns to form a datetime column.

We will use sample earthquake data with date and time information to illustrate this example. The data is stored in '.csv' format as an item. We will download the '.csv' file in a folder, as shown below, and then import the data for analysis.

Note: the dataset used in this example has been curated for illustration purposes.

Import data with date/time

Data with dates can be easily imported as datetime by setting the parse_dates parameter. Let's import the data and check the data types.

The column with date and time information is imported as a datetime data type.
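As a self-contained sketch, the snippet below reads a tiny in-memory CSV rather than the downloaded earthquake file. The column names and values are hypothetical.

```python
from io import StringIO

import pandas as pd

# A small in-memory stand-in for a CSV file; names and values are hypothetical.
csv_text = StringIO(
    "datetime,magnitude\n"
    "2021-01-01 10:30:15,4.5\n"
    "2021-01-02 23:05:40,5.1\n"
)

# parse_dates lists the columns that should be converted to datetime.
quakes = pd.read_csv(csv_text, parse_dates=["datetime"])
print(quakes.dtypes)
```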

Split into multiple columns

To split a column with date and time information into separate columns, Series.dt can be used to access the values of the series such as year, month, day etc.
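A minimal sketch with a hypothetical two-row DataFrame shows the accessor in use:

```python
import pandas as pd

# Hypothetical data standing in for the imported earthquake records.
quakes = pd.DataFrame(
    {"datetime": pd.to_datetime(["2021-01-01 10:30:15", "2021-01-02 23:05:40"])}
)

# Each .dt attribute returns one component of the datetime column.
quakes["year"] = quakes["datetime"].dt.year
quakes["month"] = quakes["datetime"].dt.month
quakes["day"] = quakes["datetime"].dt.day
quakes["hour"] = quakes["datetime"].dt.hour
quakes["date"] = quakes["datetime"].dt.date  # plain Python date objects
```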

New columns have been created for various date and time information.

Notice that the date column is of object data type. It can be easily converted to a datetime data type using pd.to_datetime.

Combine columns with Date/Time information

Consider a scenario where the data did not have a datetime column but the year, month, day, hour, minute, second date and time elements were stored as individual columns as shown below.

In such a scenario, a datetime column can be easily created by using the pd.to_datetime function. The function combines the date and time information from the various columns and returns a Series of datetime64 data type.
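For illustration, with hypothetical component columns named year, month, day, hour, minute, and second:

```python
import pandas as pd

# Hypothetical data where each date/time element sits in its own column.
parts = pd.DataFrame(
    {
        "year": [2021, 2021],
        "month": [1, 1],
        "day": [1, 2],
        "hour": [10, 23],
        "minute": [30, 5],
        "second": [15, 40],
    }
)

# to_datetime assembles one datetime per row from the named component columns.
cols = ["year", "month", "day", "hour", "minute", "second"]
parts["datetime"] = pd.to_datetime(parts[cols])
print(parts["datetime"].iloc[0])  # 2021-01-01 10:30:15
```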

Conclusion

In this part of the guide series, you have seen in detail how to work with time series data. Here, we briefly introduced date and time data types in native Python and then focused on date/time data in Pandas. You have seen how date ranges can be created with different frequencies. We discussed various indexing and selection operations on time series data. Next, we introduced time series specific operations, such as resample(), shift(), tshift(), and rolling(). We also briefly discussed time zones and operating on data with different time zones.
