Part 4 - Productivity with Pandas

In the previous notebook, we introduced Pandas, which provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive.

In this notebook, we will build on our knowledge of Pandas to be more productive. Pandas provides sophisticated, multi-level indexing functionality, along with the ability to perform data aggregation operations, such as grouping, merging, and joining data. It also provides capabilities for working with Time Series data that involves navigating and manipulating various date ranges and time indices. Let's dive into the details.

Working with Categorical Data

In many practical Data Science activities, you may come across data that contain categorical variables. These variables are typically stored as text values in columns. For such data, you may want to find the unique elements, frequency of each category present, or transform the categorical data into suitable numeric values.

Pandas provides various approaches to handle categorical data. To get started, let's create a small dataset and look at some examples. The Seaborn library comes preloaded with some sample datasets; we will load the tips dataset from Seaborn for our analysis.
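A minimal sketch of the setup might look like the following, assuming seaborn is installed and its built-in tips dataset is used:

```python
import pandas as pd
import seaborn as sns

# Load the tips dataset; columns include total_bill, tip, sex, smoker, day, time, size
tips = sns.load_dataset('tips')
tips.head()
```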

Unique Values and Value Counts

Pandas provides methods such as unique(), nunique(), and value_counts() to extract information about the values in a column.

unique() and nunique()

unique() can be used to identify the unique elements of a column.

The result is an array which can be easily converted to a list by chaining the tolist() function.

Similarly, unique() can be applied on the index.

nunique() can be used to count the number of unique values in a column.
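Assuming the tips DataFrame from the earlier snippet, the calls might look like this (the 'day' column is just an illustrative choice):

```python
# Distinct categories in the 'day' column
tips['day'].unique()

# Chain tolist() to get a plain Python list
tips['day'].unique().tolist()

# unique() also works on an index
tips.index.unique()

# Number of distinct values
tips['day'].nunique()
```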

value_counts()

value_counts() is used to determine the frequency of the different values present in a column.

reset_index() can be chained to the value_counts() operation to easily get the results as a DataFrame.
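For example, continuing with the 'day' column:

```python
# Frequency of each category
tips['day'].value_counts()

# Chain reset_index() to get the counts back as a DataFrame
tips['day'].value_counts().reset_index()
```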

One Hot Encoding

Many machine learning algorithms do not support the presence of categorical values in data. Pandas provides various approaches to transform the categorical data into suitable numeric values to create dummy variables, and one such approach is called One Hot Encoding. The basic strategy is to convert each category value into a new column and assign a 0 or 1 (True/False) value to the column. Dummy variables can be created using get_dummies.

The resulting dataset contains four new columns (one for each day): day_Thur, day_Fri, day_Sat, day_Sun. You can pass as many category columns as you would like and choose how to label the new columns using the prefix parameter.
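A minimal sketch, again using the 'day' column as an illustrative choice:

```python
# One new column per category; prefix controls how the new columns are labelled
pd.get_dummies(tips, columns=['day'], prefix='day').head()
```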

Binning Continuous Variables

You may come across scenarios where you need to bin continuous data into discrete chunks to be used as a categorical variable. We can use the pd.cut() function to cut our data into discrete buckets.

The function results in five equal-width bins. We can also specify bin edges to create specific non-uniform bins.

The operation creates two non-uniform categories for total_bill.
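A sketch of both variants; the explicit bin edges below are only an assumption for illustration:

```python
# Five equal-width bins over total_bill
pd.cut(tips['total_bill'], bins=5)

# Explicit edges create two non-uniform bins
pd.cut(tips['total_bill'], bins=[0, 20, 60])
```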

Data Aggregation

Summarizing data by applying various aggregation functions such as sum(), mean(), median() etc. to each group or category within the data is a critical component of a data analysis workflow. Simple aggregations can give you a high level overview but are often not enough to get a deeper understanding of the data.

Pandas provides a flexible groupby() operation which allows for quick and efficient aggregation on subsets of data.

GroupBy

The name "group by" comes from a command in the SQL language. Hadley Wickham, author of popular packages in R programming language, described grouping operations by coining the term split-apply-combine.

The image below shows a mockup of a simple group aggregation.

The groupby() method can be used to apply the basic split-apply-combine operation on a DataFrame by specifying the desired key column name.

The method returns a DataFrameGroupBy object. No actual computation has been performed by the groupby() method yet. The idea is that this object has all the information needed to then apply some operation to each of the groups in the data. This "lazy evaluation" approach means that common aggregation functions can be implemented very efficiently using groupby(). For example, to compute the mean, we can call the mean() method on the GroupBy object.

The data has been aggregated according to the group key and is now indexed by the unique values in the sex column. By default, all of the numeric columns are aggregated.
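A minimal sketch of the split and apply steps on the tips data:

```python
# Split step only; no computation has happened yet
grouped = tips.groupby('sex')
grouped                          # DataFrameGroupBy object

# Apply and combine; numeric_only=True restricts the mean to numeric columns
grouped.mean(numeric_only=True)
```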

Using Multiple Keys

Multiple column names can be passed as group keys to group the data appropriately. Let's group the data by smoker and day columns.

The data is now indexed by the unique values in the smoker and day columns. Similarly, other aggregation operations such as sum(), median(), std() etc. can be applied to the groups within data.
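For example:

```python
# The result is indexed by the unique (smoker, day) combinations
tips.groupby(['smoker', 'day']).mean(numeric_only=True)
```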

Using aggregate()

The aggregate() method allows for even greater flexibility by taking a string, a function, or a list thereof and computing all the aggregates at once. The example below shows the minimum aggregation passed as a string, median passed as a function, and several aggregation operations passed as a list.

Aggregation functions can also be passed as a dictionary mapping column names to the operations to be applied to that column. The example below shows the min operation applied to the total_bill column and the max operation applied to the tip column.

A more complex operation could involve passing a list of operations to be applied to a specific column.
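A sketch of these variants, with the column choices as illustrative assumptions:

```python
import numpy as np

# A string, a function, and a list of operations
tips.groupby('sex')['total_bill'].aggregate('min')
tips.groupby('sex')['total_bill'].aggregate(np.median)
tips.groupby('sex')['total_bill'].aggregate(['min', 'median', 'max'])

# A dictionary maps column names to operations
tips.groupby('sex').aggregate({'total_bill': 'min', 'tip': 'max'})

# A list of operations applied to a single column
tips.groupby('sex').aggregate({'tip': ['min', 'median', 'max']})
```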

Selecting a Subset of Columns

For large datasets, it may be desirable to aggregate a specific column or only a subset of columns. As an example, we can group the data by smoker and compute mean for tip column as follows:

Similarly, we can group the data by the smoker and day columns and compute the median for the tip column.
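For example:

```python
# Aggregate a single column after grouping
tips.groupby('smoker')['tip'].mean()
tips.groupby(['smoker', 'day'])['tip'].median()
```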

Pivot Tables

Pivot Table is a popular operation that is commonly used on tabular data in spreadsheets. The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. Pivot Tables are essentially a multidimensional version of GroupBy. Pandas includes a pandas.pivot_table function and DataFrame also has a pivot_table method.

The Seaborn library comes preloaded with some sample datasets. We will load the titanic dataset from Seaborn for our analysis and look at some examples.

Let's say we want to look at the average survival rate by both sex and class. We can get the results using both GroupBy and pivot_table.

We can see that the pivot_table approach is much more readable than the GroupBy and produces the same result. The default aggregation operation is mean.
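A sketch of the two approaches, assuming seaborn is imported as in the earlier snippet:

```python
titanic = sns.load_dataset('titanic')

# GroupBy: group, select, aggregate, then reshape
titanic.groupby(['sex', 'class'])['survived'].mean().unstack()

# pivot_table: the same result, more readably (mean is the default aggfunc)
titanic.pivot_table('survived', index='sex', columns='class')
```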

Multi-level Pivot Table

Just as in GroupBy, the data can be grouped at multiple levels using pivot_table. Suppose we want to group survival by sex and age. Since age is a continuous variable, we can create bins for age using pd.cut function and then group the data.

The operation can be applied to columns in a similar fashion. Suppose we want to group survival by sex and age and look at the data by class and fare.

We can discretize the fare variable into equal-sized buckets based on sample quantiles using pd.qcut and then group the data.
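A sketch of both steps; the age bin edges and the number of fare quantiles below are illustrative assumptions:

```python
# Bin the continuous 'age' column and use it as a second row level
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', index=['sex', age], columns='class')

# Quantile-based fare bins add a second column level
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', index=['sex', age], columns=[fare, 'class'])
```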

Using aggfunc

The aggfunc keyword can be used to specify the aggregation functions applied to different columns.

To compute totals along each grouping, the margins keyword can be used.
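A sketch of both options on the titanic data:

```python
# Different aggregations per column via a dictionary
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived': 'sum', 'fare': 'mean'})

# margins=True adds an 'All' row and column with the overall aggregates
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
```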

describe() method

Convenience methods, such as describe(), can be used to compute several common aggregates for each column. It also comes in handy when you are trying to understand the overall properties of a dataset.

describe() on GroupBy

describe() can be used on a groupby() object to get common aggregates for a subset of data.
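For example, using the tips data (the grouping and column choices are illustrative):

```python
# Common aggregates for every numeric column
tips.describe()

# The same summary, computed per group
tips.groupby('smoker')['tip'].describe()
```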

Combining Data

Data science workflows often involve combining data from different sources to enhance the analysis. There are multiple ways in which data can be combined ranging from the straightforward concatenation of two different datasets, to more complicated database-style joins.

concat()

concat() can be used to stack data frames together along an axis.

By default, concat() works row-wise within the DataFrame (along axis=0).

The ignore_index flag can be used to discard the original index when it is not needed.

You can specify the axis along which the concatenation should take place. If the axis parameter is not specified, the concatenation works row-wise, generating NaN values for unmatched columns.

When axis='columns' is specified, the concatenation works along columns.
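A minimal sketch using two small, hypothetical DataFrames:

```python
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Row-wise by default (axis=0); the original index labels are kept
pd.concat([df1, df2])

# ignore_index=True builds a fresh RangeIndex instead
pd.concat([df1, df2], ignore_index=True)

# Column-wise concatenation
pd.concat([df1, df2], axis='columns')
```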

append()

append() works similarly to concat() but does not modify the original object; it creates a new object with the combined data. The method works row-wise within the DataFrame (along axis=0). It is not very efficient, as it involves the creation of a new index and data buffer. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so concat() is the preferred approach.
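A sketch reusing the hypothetical df1 and df2 from above:

```python
# Pre-2.0 pandas only:
# df1.append(df2)

# Equivalent, recommended approach on current versions
pd.concat([df1, df2])
```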

merge()

merge() joins rows in DataFrame based on one or more keys. It works as the entry point for all standard database join operations. Let's create sample data and look at some examples.

Notice that a column to join the data on was not specified. merge() uses the overlapping column names as keys for joining the data. It is good practice to explicitly specify the join column using the on keyword. The 'Marketing' and 'Operations' values and their associated data are missing from the result, as the operation returns only the common set.
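The sample data below is hypothetical, constructed only to be consistent with the join results described in this section:

```python
df_left = pd.DataFrame({
    'department': ['Accounting', 'Engineering', 'Engineering', 'Marketing'],
    'employee':   ['Amy', 'Bob', 'Cara', 'Dan']})
df_right = pd.DataFrame({
    'department': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'Operations'],
    'supervisor': ['Eve', 'Frank', 'Gina', 'Hank', 'Ivan']})

# merge() picks up the overlapping 'department' column automatically,
# but naming the key explicitly is clearer
pd.merge(df_left, df_right, on='department')
```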

If the column names are different in the DataFrame, then left_on and right_on keywords can be used.

The redundant column can be dropped as needed using the drop() method.
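A sketch of this case, with a hypothetical rename of the key column on the right:

```python
df_right2 = df_right.rename(columns={'department': 'dept'})
merged = pd.merge(df_left, df_right2, left_on='department', right_on='dept')

# Remove the redundant key column
merged.drop(columns='dept')
```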

Inner Join

By default, merge() performs an inner join. The result is the intersection, or the common set, of keys found in both DataFrames. The merge operations we just saw were all inner joins. Different join types, such as left, right, and outer, can be specified using the how parameter.

'Marketing' and 'Operations' values and associated data are missing from the result, as the operation returns only the common set.
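```python
# how='inner' is the default; only keys present in both DataFrames survive
pd.merge(df_left, df_right, on='department', how='inner')
```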

Left Join

This join returns all records from the left DataFrame and the matched records from the right DataFrame.

The result is a Cartesian product of the matching rows. Since there were two 'Engineering' rows in the left DataFrame and two in the right, there are four 'Engineering' rows in the result. Similarly, there was one 'Accounting' row in the left DataFrame and two in the right, resulting in two 'Accounting' rows. The row for 'Operations' is missing, as this join only keeps matched rows from the right DataFrame.
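```python
# All rows from the left DataFrame, matched rows from the right
pd.merge(df_left, df_right, on='department', how='left')
```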

Right Join

This join returns all records from the right DataFrame and the matched records from the left DataFrame.

The result is a Cartesian product of the matching rows. Since there were two 'Engineering' rows in the right DataFrame and two in the left, there are four 'Engineering' rows in the result. Similarly, there was one 'Accounting' row in the left DataFrame and two in the right, resulting in two 'Accounting' rows. The row for 'Marketing' is missing, as this join only keeps matched rows from the left DataFrame.
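```python
# All rows from the right DataFrame, matched rows from the left
pd.merge(df_left, df_right, on='department', how='right')
```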

Outer Join

The outer join takes the union of the keys, combining the effect of applying both left and right joins.

The result is a Cartesian product of the rows using all key combinations, filling in missing values with NaN.
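```python
# Union of the keys; unmatched entries are filled with NaN
pd.merge(df_left, df_right, on='department', how='outer')
```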

Using index to merge

The index can also be used as the key for merging by specifying the left_index and/or right_index flags.
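A sketch, moving the hypothetical key column into the index on both sides:

```python
left_indexed = df_left.set_index('department')
right_indexed = df_right.set_index('department')
pd.merge(left_indexed, right_indexed, left_index=True, right_index=True)
```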

join()

The join() method of a DataFrame can also be used for merging by index. The how keyword can be specified to select the type of join.

'Marketing' and 'Operations' values and associated data are missing from the result, as the operation returns only the common set.

The result shows all the data, filling in missing values with NaN.
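A sketch reusing the index-keyed DataFrames from the previous example:

```python
# join() merges on the index by default; how= selects the join type
left_indexed.join(right_indexed, how='inner')   # common keys only
left_indexed.join(right_indexed, how='outer')   # all keys, NaN where unmatched
```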

Hierarchical Indexing

Hierarchical indexing (also known as multi-indexing) allows you to have multiple (two or more) index levels within a single index on an axis. It provides a way for representing higher dimensional data in a lower dimensional form. Let's start with a simple example, creating a series with multi-index.

Multi-indexed Series

With this indexing, you can easily index or slice the series using the index. However, what if you wanted to select all the values for 2015? The tuple-based index is essentially a multi-index and Pandas MultiIndex type allows us to create multi-level indexes. This provides us with the flexibility to perform operations on indexes easily and efficiently.

We will create a multi-indexed index for pop_series using the MultiIndex type.

Notice that the new_index object contains multiple levels of indexing, the state names and the years. We can now reindex the pop_series to see hierarchical representation of the data.
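A sketch of the whole construction; the state names, years, and population values are hypothetical, chosen only for illustration:

```python
# Populations keyed by (state, year) tuples
index = [('California', 2010), ('California', 2015),
         ('New York', 2010), ('New York', 2015),
         ('Texas', 2010), ('Texas', 2015)]
populations = [37253956, 39144818, 19378102, 19745289, 25145561, 27469114]
pop_series = pd.Series(populations, index=index)

# Promote the tuples to a true MultiIndex and reindex
new_index = pd.MultiIndex.from_tuples(index, names=['state', 'year'])
pop_series = pop_series.reindex(new_index)
pop_series
```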

Subset Selection

Subset selection of multi-indexed data is similar to what we have seen in the previous part of this guide series. Let's take a quick look.

We can now easily access the data by the second index level and answer our question about selecting all the values for 2015.
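```python
# All values for one state (outer level)
pop_series['California']

# A single value
pop_series['California', 2015]

# All values for 2015, across states (partial indexing on the inner level)
pop_series[:, 2015]
```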

unstack() and stack()

unstack() method will convert a multi-indexed Series into a DataFrame, and naturally stack() would do the opposite.

The result is a DataFrame where the second-level index (years) is converted to columns, and the first-level index (states) remains as the index of the DataFrame.

The level parameter can be passed to unstack() to unstack by a specific index level. Specifying level=0 will unstack based on the outermost index level, i.e. by 'state'.

Specifying level=1 unstacks by the inner index, in this case 'year'.
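```python
pop_df = pop_series.unstack()   # years become columns, states stay as the index
pop_df.stack()                  # back to the multi-indexed Series

pop_series.unstack(level=0)     # unstack the outer level: states become columns
pop_series.unstack(level=1)     # unstack the inner level: years become columns
```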

Multi-indexed DataFrame

In a DataFrame, both rows and columns can have multiple levels of indices. Let's create some sample data and take a look.

sales_data is essentially four dimensional data with 'sales person', 'product', 'year' and 'quarter' as its dimensions.
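A minimal sketch of such data; the names and values are hypothetical, and placing (year, quarter) on the rows and (sales person, product) on the columns is an assumption for illustration:

```python
import numpy as np

row_index = pd.MultiIndex.from_product([[2020, 2021], ['Q1', 'Q2']],
                                        names=['year', 'quarter'])
col_index = pd.MultiIndex.from_product([['Alice', 'Bob'], ['Widgets', 'Gadgets']],
                                        names=['sales person', 'product'])
rng = np.random.default_rng(0)
sales_data = pd.DataFrame(rng.integers(50, 200, size=(4, 4)),
                          index=row_index, columns=col_index)
sales_data
```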

Subset Selection

Since columns in a DataFrame are individual Series, the syntax used for multi-indexed Series applies to the columns.

.loc and .iloc index operators can also be used.
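A sketch using the hypothetical sales_data from above:

```python
# Column selection works level by level, as with a multi-indexed Series
sales_data['Alice']                  # all of Alice's products
sales_data['Alice', 'Widgets']       # a single column

# .loc and .iloc work the same way on the rows (and columns)
sales_data.loc[2020]                               # both quarters of 2020
sales_data.loc[(2020, 'Q1'), ('Alice', 'Widgets')] # a single cell
sales_data.iloc[:2, :2]
```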

Sorting

By Index and Level

sort_index() can be used to sort the index or levels within your data. By default, the sorting operation is performed on the outermost index (level=0) and in ascending order.

Specifying level=1 sorts by the inner index.

Sorting can be applied on columns by specifying axis='columns'.
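```python
sales_data.sort_index()                 # by the outermost row level ('year')
sales_data.sort_index(level=1)          # by the inner row level ('quarter')
sales_data.sort_index(axis='columns')   # sort the column MultiIndex instead
```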

By Value

sort_values() can be used to sort the values in a DataFrame by one or more columns.

For multi-indexed data, the column label must be unique, so the value passed to the by parameter must be a tuple with elements corresponding to each level.

Multiple columns, or a combination of columns and index levels, can be specified by passing them as a list of tuples.
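A sketch with the hypothetical sales_data columns:

```python
# A column label is a tuple covering every column level
sales_data.sort_values(by=('Alice', 'Widgets'))

# Several columns can be passed as a list of tuples
sales_data.sort_values(by=[('Alice', 'Widgets'), ('Bob', 'Gadgets')])
```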

Data Aggregations

We have seen data aggregations in a previous section in this notebook. Various aggregation functions such as sum(), mean(), median() can be applied to multi-indexed data.

The level parameter controls the subset of the data to which the aggregation is applied. Let's look at some examples.
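In current pandas versions, level-wise aggregation is expressed through groupby(level=...) (the older df.sum(level=...) form has been removed); a sketch using the hypothetical sales_data:

```python
# Aggregate within a row level
sales_data.groupby(level='year').sum()       # totals per year, across quarters
sales_data.groupby(level='quarter').mean()   # averages per quarter, across years
```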

Conclusion

In this part of the guide series, we learned how to be more productive with Pandas. We started by working with categorical data, looking at unique values and value counts, one hot encoding, and binning continuous variables. We then covered data aggregation using groupby() and pivot_table(). Next, we discussed how data can be combined using the concat(), append(), merge(), and join() methods. Finally, in the Hierarchical Indexing section, you saw how data can be indexed at multiple levels, and we covered selection, sorting, and aggregation on multi-indexed Series and DataFrames.

In the next part of this guide series, we will explore the capabilities for working with Time Series data.
