Part 2 - Introduction to NumPy

Introduction to NumPy

NumPy, short for Numerical Python, is the core library for scientific computing in Python. It has been designed specifically for performing basic and advanced array operations. It primarily supports multi-dimensional arrays and vectors for complex arithmetic operations. Here are some things you will find in NumPy:

  1. ndarray, an efficient multidimensional array object providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.

  2. Mathematical functions for fast operations on entire arrays of data without having to write loops.

  3. Tools for reading/writing array data to disk and working with memory-mapped files.

  4. Linear algebra, random number generation, and Fourier transform capabilities.

NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python.

NumPy by itself does not provide modeling or scientific functionality, but an understanding of NumPy arrays and array-oriented computing will help you use tools, such as pandas, much more effectively.

Installation and Import

A typical installation of the Python API comes with NumPy. If it is not already present, you can use pip or conda to install it.

Once NumPy is installed, you can import it. By convention, it is imported under the alias np.

You can also check the version of NumPy that is installed through its __version__ attribute.
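
A minimal sketch of both steps follows. The alias np is a widespread convention rather than a requirement, and the version string printed will depend on your installation.

```python
import numpy as np  # "np" is the conventional alias

# The installed version is exposed as a plain string attribute.
print(np.__version__)
```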

Creating Arrays

One of the key features of NumPy is its N-dimensional array object ndarray. An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements in the array must be the same type. So, an array is like a grid of values, all of the same type. The values in an array are indexed by a tuple of nonnegative integers.

Creating N-dimensional Array

The easiest way to create an array is to use the array function. We can initialize numpy arrays from Python lists. Let's create one, two, and three dimensional arrays using lists.
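
As a sketch, the three cases might look like this. The variable names and values are illustrative and not taken from the original guide.

```python
import numpy as np

# Nested lists map directly to dimensions: one level of nesting per axis.
a1 = np.array([1, 2, 3, 4])                           # 1-D, shape (4,)
a2 = np.array([[1, 2, 3], [4, 5, 6]])                 # 2-D, shape (2, 3)
a3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # 3-D, shape (2, 2, 2)

print(a1.shape, a2.shape, a3.shape)
```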

Creating arrays using random number generator

We can also generate arrays using NumPy's random number generator. NumPy's np.random module contains the rand, randn and randint functions, which draw samples from a uniform distribution on [0, 1), a standard normal distribution, and a range of integers, respectively.

Let's create a 2-D array using rand.

Now, let's create a 3-D array using randn.

Let's create a 2-D array of random integers between 2 and 10 using randint.
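
The three calls described above could be sketched as follows. The shapes, the seed, and the variable names are assumptions made for illustration. Note that the upper bound passed to randint is exclusive, so the call below can return 2 through 9.

```python
import numpy as np

np.random.seed(0)  # optional: makes the draws repeatable

r_uniform = np.random.rand(3, 4)                 # 2-D, uniform samples in [0, 1)
r_normal = np.random.randn(2, 3, 4)              # 3-D, standard normal samples
r_int = np.random.randint(2, 10, size=(3, 4))    # 2-D, integers in [2, 10)

print(r_uniform.shape, r_normal.shape, r_int.shape)
```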

Attributes of a NumPy Array

Each array has attributes such as ndim, shape, size, and dtype.

Let's look at each of these attributes. We will use array r2 defined above to check these attributes. These attributes are extremely useful and come in handy during the data exploration phase of a project.

ndim: Number of dimensions

shape: Size of each dimension

size: Total number of elements in the array

dtype: Data type of the array
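
The four attributes can be inspected as sketched below. The array here is only a stand-in for r2: its shape of (3, 4) is an assumption, so your own values will differ.

```python
import numpy as np

# Stand-in for r2; the shape (3, 4) is assumed for illustration.
r2 = np.random.randint(2, 10, size=(3, 4))

print(r2.ndim)    # number of dimensions: 2
print(r2.shape)   # size of each dimension: (3, 4)
print(r2.size)    # total number of elements: 12
print(r2.dtype)   # an integer dtype; the exact width is platform dependent
```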

Elements of an array can be accessed in multiple ways. We can use [] to access individual elements of an array. We can also use slice notation, marked by the colon (:) character, to access subarrays. Indexing and slicing of NumPy arrays are very similar to indexing and slicing of Python lists.

Array Indexing - Accessing Single Elements

In a one-dimensional array, a value can be accessed by specifying the desired index. In a multi-dimensional array, a value can be accessed using comma-separated indices, one per dimension. We will use arrays defined above to look at some examples.

One-dimensional Array

To index from the end of the array, you can use negative indices.
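
A short sketch of positive and negative indexing on a one-dimensional array, using illustrative values:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])  # illustrative values

print(x[0])    # first element: 10
print(x[3])    # fourth element: 40
print(x[-1])   # last element: 50
print(x[-2])   # second to last element: 40
```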

Multi-dimensional Array

In a multi-dimensional array, the elements at each index are no longer scalars but rather sub-arrays. Elements of a multi-dimensional array can be accessed using a comma-separated list of indices. If you omit later indices, the returned object will be a sub-array. Let's take a look.

2-D Array
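
The sketch below uses a small illustrative array (not one from the original guide) to show a single index followed by a full pair of indices.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m[0])       # one index returns the first row: [1 2 3]
print(m[0, 2])    # row 0, column 2: 3
print(m[-1, 0])   # last row, first column: 4
```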

Accessing the $0^{th}$ index returned the sub-array at index 0. To access a specific element, we can pass comma-separated indices, one for each dimension.

3-D Array
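
The same idea in three dimensions, again with an illustrative array. Each additional index removes one dimension from the result.

```python
import numpy as np

t = np.arange(24).reshape(2, 3, 4)  # values 0..23 as 2 blocks of 3 rows x 4 columns

print(t[0].shape)    # one index returns a 2-D sub-array of shape (3, 4)
print(t[0, 1])       # two indices return a 1-D sub-array: [4 5 6 7]
print(t[1, 2, 3])    # three indices return a single value: 23
```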

Accessing the $0^{th}$ index returned the sub-array at index 0, which is itself a two-dimensional array. To access a specific element, we can pass comma-separated indices, one for each dimension.

Array Slicing

We can use slice notation, marked by the colon (:) character, to access sub-arrays of ndarrays. To access a slice of an array x, we can use the NumPy slicing syntax x[start:stop:step]. Let's look at accessing sub-arrays in one dimension and in multiple dimensions.

One-dimensional Array
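
A few common slices on an illustrative one-dimensional array. Any of start, stop, and step may be omitted, in which case they default to the beginning, the end, and 1.

```python
import numpy as np

x = np.arange(10)  # [0 1 2 3 4 5 6 7 8 9]

print(x[:5])     # first five elements: [0 1 2 3 4]
print(x[5:])     # elements from index 5 onward: [5 6 7 8 9]
print(x[4:7])    # indices 4, 5 and 6: [4 5 6]
print(x[::2])    # every other element: [0 2 4 6 8]
print(x[::-1])   # all elements, reversed
```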

Multi-dimensional Array

Slices in a multi-dimensional array can be accessed using a comma-separated list of indices. If you omit later indices, the returned object will be a sub-array. Let's take a look.

Indexing and Slicing can be combined to access single rows or columns on an array.
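
Multi-dimensional slicing, and the combination of an index with a slice, can be sketched as follows on an illustrative (3, 4) array.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(m[:2, :3])   # first two rows, first three columns
print(m[:2])       # omitting the column slice keeps every column
print(m[1, :])     # index + slice selects a single row: [4 5 6 7]
print(m[:, 0])     # slice + index selects a single column: [0 4 8]
```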

Reshaping Arrays

In many cases, arrays can be converted from one shape to another without copying any data. To do this, we can pass a tuple indicating the new shape to the reshape array instance method. By reshaping an array, we can add or remove dimensions or change the number of elements in each dimension. Let's take a look.

One-dimensional Array
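
A minimal sketch with an illustrative six-element array. For a contiguous array like this one, reshape returns a view, which np.shares_memory confirms.

```python
import numpy as np

a = np.arange(6)         # shape (6,)
b = a.reshape((2, 3))    # shape (2, 3)
c = a.reshape((3, 2))    # shape (3, 2)

print(b.shape, c.shape)
print(np.shares_memory(a, b))  # True here: no data was copied
```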

When reshaping, the size of the reshaped array must match the total number of elements in the original array. For example, an array of 5 elements cannot be reshaped to a (2,3) or (3,2) array. Attempting such a reshape raises an error, as shown.

The error shows that arr1, an array of size 5, cannot be reshaped into an array of shape (2,3), which would require 6 elements (2 x 3).
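
The failure can be reproduced with the sketch below. The array is a stand-in with five elements, and the error is caught so that its message can be printed rather than halting the program.

```python
import numpy as np

arr1 = np.arange(5)  # 5 elements; stand-in values

message = None
try:
    arr1.reshape((2, 3))   # would need 2 * 3 = 6 elements
except ValueError as err:
    message = str(err)

print(message)
```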

Multi-dimensional Array

Reshape the array to shape (4,3)
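
A sketch of this step, assuming a starting array of shape (2, 6). The original array is not shown here, so that starting shape is hypothetical; any array with 12 elements would work.

```python
import numpy as np

arr = np.arange(12).reshape(2, 6)   # assumed starting shape (2, 6)
reshaped = arr.reshape((4, 3))

print(reshaped.shape)   # (4, 3)
```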

Transposing an Array

Transposing is a special form of reshaping that swaps the axes. To transpose an array, simply use the T attribute of an array object.
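
A minimal sketch, using a stand-in array with the (6, 2) shape discussed here. The values are illustrative.

```python
import numpy as np

arr = np.arange(12).reshape(6, 2)   # stand-in with shape (6, 2)
transposed = arr.T

print(arr.shape)          # (6, 2)
print(transposed.shape)   # (2, 6)
```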

arr is of shape (6,2). Transposing this array swaps the axes to return a shape of (2,6).

Flattening an Array

The reverse operation, going from a higher-dimensional array to a one-dimensional array, is typically known as flattening.
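
A sketch with a stand-in (6, 2) array. Both flatten and ravel produce a one-dimensional result; flatten always returns a copy, while ravel returns a view when it can.

```python
import numpy as np

arr = np.arange(12).reshape(6, 2)   # stand-in with shape (6, 2)

flat = arr.flatten()   # always a copy
rav = arr.ravel()      # a view when possible

print(flat.shape)   # (12,)
print(rav.shape)    # (12,)
```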

arr of shape (6,2) is flattened to return a shape of (12,).

Array Computation

Vectorization

Computation on NumPy arrays can be very fast or very slow, and the key to making it fast is vectorization: the practice of replacing explicit loops with array expressions. In NumPy, this is accomplished by performing an operation on the array as a whole, which is then applied to each element. Vectorized operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents. Let's look at an example.

Imagine we have an array of values and would like to compute the sum of values. A straightforward approach using explicit loops might look like this:
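
One possible version of such a loop is sketched below. The function name loop_sum and the array size are hypothetical choices made for illustration.

```python
import numpy as np

def loop_sum(values):
    """Sum the elements one at a time with an explicit Python loop."""
    total = 0.0
    for v in values:
        total += v
    return total

values = np.random.rand(100_000)   # array size is an arbitrary choice
print(loop_sum(values))
```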

Python is unable to take advantage of the fact that the array’s contents are all of a single data type. On every iteration, it first examines the object's type and does a dynamic lookup of the correct function to use for that type, which slows down the computation massively.

Recall that NumPy’s ndarrays are homogeneous: an array can only contain data of a single type. NumPy takes advantage of this fact and delegates the task of performing mathematical operations on the array’s contents to optimized, compiled code. The result is a tremendous speedup over explicit loops in Python.

Let's use the vectorized function np.sum() and time how long it takes to run the computation.
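
One way to time the two approaches side by side is sketched below using the standard timeit module. The helper loop_sum, the array size, and the repeat count are assumptions, and the measured times will vary from machine to machine.

```python
import timeit
import numpy as np

values = np.random.rand(100_000)   # array size is an arbitrary choice

def loop_sum(values):
    """Explicit Python loop, used here only as a baseline."""
    total = 0.0
    for v in values:
        total += v
    return total

# Run each version 10 times and report the total elapsed time.
t_loop = timeit.timeit(lambda: loop_sum(values), number=10)
t_numpy = timeit.timeit(lambda: np.sum(values), number=10)

print(f"explicit loop: {t_loop:.4f} s")
print(f"np.sum:        {t_numpy:.4f} s")
```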

The computation is over 50 times faster when performed using NumPy’s vectorized function, although the exact ratio depends on the machine and the size of the array. So, when computational efficiency is important, one should avoid performing explicit for-loops in Python. NumPy provides a whole suite of vectorized functions called universal functions, or ufuncs, that perform element-wise operations on data in ndarrays.

Universal Functions

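A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. ufuncs exist in two flavors: unary ufuncs, which operate on a single input array, and binary ufuncs, which operate on two input arrays.

A complete list of NumPy universal functions can be found in the NumPy documentation. Let's look at some examples of ufuncs.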

Single Array - Examples

The examples illustrate universal functions being applied to a single array.
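
A few unary ufuncs applied to an illustrative array. The particular functions shown are a small sample chosen for this sketch.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])

roots = np.sqrt(x)        # element-wise square root: 1, 2, 3, 4
squares = np.square(x)    # element-wise square: 1, 16, 81, 256
negated = np.negative(x)  # element-wise negation
logs = np.log(x)          # element-wise natural logarithm

print(roots, squares, negated, logs, sep="\n")
```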

Multiple Arrays - Examples

The examples illustrate universal functions being applied to more than one array.
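
A few binary ufuncs applied to two illustrative arrays of the same shape. Again, the functions shown are only a sample.

```python
import numpy as np

a = np.array([1, 5, 3])
b = np.array([4, 2, 6])

print(np.add(a, b))        # element-wise sum: [5 7 9]
print(np.multiply(a, b))   # element-wise product: 4, 10, 18
print(np.maximum(a, b))    # element-wise maximum: [4 5 6]
print(np.greater(a, b))    # element-wise a > b: False, True, False
```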

Matrix Multiplication

The dot function is used to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices.

Inner product of vectors

The result shows the dot product of two one-dimensional arrays, which returns a scalar.

Matrix - vector product

The result shows the dot product of a two-dimensional array of shape (3,2) with a one-dimensional array of length 2, which returns a one-dimensional array of length 3.

Matrix - matrix product

The result shows the dot product of a two-dimensional array of shape (3,2) with another two-dimensional array of shape (2,3), which returns a two-dimensional array of shape (3,3).
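
The three cases described above can be sketched together. The arrays are illustrative and were chosen only to match the shapes mentioned in the text.

```python
import numpy as np

v = np.array([1, 2])
w = np.array([3, 4])
A = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
B = np.array([[1, 0, 2], [0, 1, 3]])     # shape (2, 3)

inner = np.dot(v, w)      # scalar: 1*3 + 2*4 = 11
mat_vec = np.dot(A, v)    # shape (3,): [5 11 17]
mat_mat = np.dot(A, B)    # shape (3, 3)

print(inner)
print(mat_vec)
print(mat_mat)
```

For two-dimensional arrays, the @ operator gives the same result as np.dot, so A @ B is an equivalent spelling of the last line.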

Broadcasting

Broadcasting is a powerful mechanism that describes how arithmetic works between arrays of different shapes. It is simply a set of rules for applying binary ufuncs (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes. Broadcasting provides another way of utilizing NumPy's vectorized operations on arrays.

You can read more about broadcasting in the NumPy documentation. Let's look at some examples.

Add scalar to an array:
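
A minimal sketch with an illustrative three-element array:

```python
import numpy as np

a = np.array([1, 2, 3])

print(a + 3)                      # [4 5 6]
print(a + np.array([3, 3, 3]))    # same result, written out explicitly
```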

We can think of this as an operation that stretches or duplicates the value 3 into the array [3, 3, 3], and adds the results. The advantage of NumPy's broadcasting is that this duplication does not actually take place, but it is a useful mental model as we think about broadcasting.

We can similarly extend this to arrays of higher dimension.

Add two arrays:

Multiply two arrays:

When multiplying arr1 of shape (1,3) with arr3 of shape (3,1), the broadcasting operation returns a (3,3) array.
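
The cases above can be sketched as follows. The names arr1 and arr3 follow the text, but their values, and the helper array M, are hypothetical.

```python
import numpy as np

M = np.ones((3, 3))                    # hypothetical 2-D array
arr1 = np.array([[1, 2, 3]])           # shape (1, 3)
arr3 = np.array([[10], [20], [30]])    # shape (3, 1)

# A (1, 3) array is stretched across the three rows of M.
print((M + arr1).shape)    # (3, 3)

# (1, 3) combined with (3, 1): both are stretched to (3, 3).
added = arr1 + arr3
multiplied = arr1 * arr3

print(added)
print(multiplied)
```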

Comparison Operators

With broadcasting, we saw that using arithmetic operators such as +, -, *, / and others on arrays leads to element-wise operations. NumPy also implements various comparison operators such as < (less than), > (greater than) and others as element-wise ufuncs. The result of these comparison operators is an array with a Boolean data type.
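
A short sketch of element-wise comparisons on an illustrative array. Each operator has a named ufunc equivalent, such as np.less for <.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

print(x < 3)            # True, True, False, False, False
print(x >= 3)           # False, False, True, True, True
print(x == 3)           # True only at index 2
print(np.less(x, 3))    # same as x < 3
print((x < 3).dtype)    # bool
```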

Boolean Arrays

A number of useful operations can be applied to the boolean arrays to get informative results.

Working with Boolean Arrays

Let's say we want to know if an array has any values less than 5 or how many values in the array are greater than 5. Once we have a boolean array, we can easily apply various NumPy operations to get the results.

Let's look at some examples. We will set a seed value to ensure that the same random arrays are generated every time.

np.all and np.any can be applied along particular axes.
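
These operations might be sketched as follows. The seed value, the array shape, and the value range are assumptions, so the printed results are specific to this setup.

```python
import numpy as np

np.random.seed(42)                             # arbitrary seed for repeatability
arr = np.random.randint(0, 10, size=(3, 4))    # integers 0..9
print(arr)

print(np.any(arr < 5))             # is any value less than 5?
print(np.all(arr < 10))            # are all values less than 10? (True by construction)
print(np.count_nonzero(arr > 5))   # how many values are greater than 5?
print(np.sum(arr > 5))             # same count: True is treated as 1

print(np.any(arr > 8, axis=0))     # one answer per column, shape (4,)
print(np.all(arr < 8, axis=1))     # one answer per row, shape (3,)
```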

Boolean Operators

Now, let's change the question and say we want to know about all the values less than eight and greater than two. This, and other such questions can be answered through Python's bitwise logic operators &, |, ^, and ~. Let's look at an example.

Count of values that are less than eight and greater than two.

Count of values that are greater than eight or equal to five.
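
Both counts can be sketched on a small illustrative array. The parentheses around each comparison are required, because & and | bind more tightly than < and >.

```python
import numpy as np

arr = np.array([[1, 5, 9],
                [3, 8, 7],
                [5, 2, 10],
                [4, 6, 0]])   # illustrative values

# Less than eight AND greater than two.
between = (arr < 8) & (arr > 2)
print(np.sum(between))        # 6

# Greater than eight OR equal to five.
either = (arr > 8) | (arr == 5)
print(np.sum(either))         # 4
```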

Boolean Masks

Boolean arrays can be used as masks to select specific subsets of the data. A mask selects the elements of an array that satisfy some condition, and the output is a one-dimensional NumPy array of the elements for which the condition is satisfied. Let's take a look.

We are now free to combine various comparison and boolean operators with masks to ask even more complex questions. Let's create two boolean masks from our arr array:
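
A minimal sketch of masking with an illustrative array:

```python
import numpy as np

arr = np.array([[1, 5, 9],
                [3, 8, 7]])   # illustrative values

mask = arr < 6       # boolean array with the same shape as arr
print(mask)
print(arr[mask])     # selected values in row-major order: [1 5 3]
```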

  1. Mask of values greater than 4
  2. Mask of values smaller than 6

Now let's try to answer a few questions starting with getting a sum of all values that are less than 6.

Mean of all values that are less than 6.

Minimum from the values that are greater than 4.

All values that are not greater than 4.

All values that are less than 6 and greater than 4.
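
The questions above can be answered as sketched below. The array is a stand-in for arr and the mask names are hypothetical, so the results shown apply only to these illustrative values.

```python
import numpy as np

arr = np.array([[1, 5, 9],
                [3, 8, 7],
                [5, 2, 10]])   # stand-in values

greater_than_4 = arr > 4
less_than_6 = arr < 6

print(arr[less_than_6].sum())     # 1 + 5 + 3 + 5 + 2 = 16
print(arr[less_than_6].mean())    # 16 / 5 = 3.2
print(arr[greater_than_4].min())  # smallest of 5, 9, 8, 7, 5, 10 is 5
print(arr[~greater_than_4])       # values not greater than 4: [1 3 2]
print(arr[less_than_6 & greater_than_4])   # values equal to 5: [5 5]
```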

Plotting Arrays

Creating visualizations is one of the most important tasks in data analysis. It is critical to visualize data as part of the exploratory process. matplotlib is widely used as the de facto plotting library for Python, and it works directly with NumPy arrays. Let's create some plots using arrays.

Simple Plots

Line Plot

Histogram

Scatter Plot

Bar Plot

Subplots

Multiple plots can be added next to each other using subplots().
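
As a sketch, the four simple plot types listed above can be placed on one 2 x 2 grid. The data, figure size, and titles are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
samples = np.random.randn(1000)

# subplots() returns the figure and a 2 x 2 array of axes.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("Line plot")

axes[0, 1].hist(samples, bins=30)
axes[0, 1].set_title("Histogram")

axes[1, 0].scatter(np.random.rand(50), np.random.rand(50))
axes[1, 0].set_title("Scatter plot")

axes[1, 1].bar(np.arange(5), np.array([3, 7, 1, 5, 4]))
axes[1, 1].set_title("Bar plot")

fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell; in a script, call plt.show() to display it.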

2-D Array as an Image

An image can be treated as a two-dimensional array of shape (m, n). Let's plot some ndarrays as images.

Create a random array of dimension (50, 50) and plot as an image.

Create an array of dimension (15,8) and plot as an image.

The image displayed is colorful because matplotlib is using the default colormap (a mapping from values in the array to colors). The default colormap in matplotlib is viridis, which maps low numbers to purple and high numbers to yellow. The relationship of numbers to colors can be seen using colorbar.

The colormap can be changed easily using the cmap argument.
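
The sketch below draws one random (50, 50) array twice: once with the default colormap and once with cmap="gray". The choice of "gray" and the side-by-side layout are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

img = np.random.rand(50, 50)   # random values in [0, 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

im1 = ax1.imshow(img)               # default colormap
fig.colorbar(im1, ax=ax1)           # shows how values map to colors
ax1.set_title("default colormap")

im2 = ax2.imshow(img, cmap="gray")  # colormap chosen with the cmap argument
fig.colorbar(im2, ax=ax2)
ax2.set_title("cmap='gray'")
```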

3-D Surface Plots

Various 3-D plots can be created using matplotlib. 3-D plots are enabled by importing the mplot3d toolkit. Once this submodule is imported, three-dimensional axes can be created by passing the keyword projection='3d' to any of the normal axes creation routines. Let's create a 3-D surface plot.
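
A minimal sketch of a surface plot follows. The surface function, the grid resolution, and the colormap are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # enables the '3d' projection on older matplotlib versions

# Build a regular grid and evaluate an arbitrary function on it.
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```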

Conclusion

In this part of the guide series we introduced NumPy, a foundational package for numerical computing in Python. We discussed how N-dimensional arrays ndarray can be created and then accessed in multiple ways using indexing and slicing. You have seen in detail how universal functions use the concept of Vectorization to perform element-wise operations on arrays. You were also introduced to the basics of plotting arrays.

In the next part of this guide series, you will be introduced to pandas.
