Introduction to Data Engineering using Python (5-part study)

Part-1 Introduction to Data Engineering
There are numerous libraries in the scientific Python ecosystem. In this part of the guide series, we briefly discuss some of the key libraries used for data processing, visualization, and modeling. We introduce the concept of the Spatially Enabled DataFrame (SEDF) and how it adds “spatial” abilities to the data. You will also see an end-to-end example of using an SEDF through the machine learning lifecycle, from reading data into an SEDF to exporting an SEDF.

Part-2 Introduction to NumPy
In this part of the guide series we introduce NumPy, a foundational package for numerical computing in Python. We discuss how N-dimensional arrays (ndarray) can be created and then accessed in multiple ways using indexing and slicing. You will see in detail how universal functions use vectorization to perform element-wise operations on arrays. We also introduce the basics of plotting arrays.
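
As a rough illustration of the array creation, indexing, slicing, and universal-function ideas summarized above, here is a minimal sketch. The array contents and variable names are made up for this example and are not taken from the guide itself.

```python
import numpy as np

# Create a 2-D ndarray (3 rows, 4 columns) holding the integers 0..11.
grid = np.arange(12).reshape(3, 4)

# Indexing and slicing: a single element, a whole row, and a sub-block.
element = grid[1, 2]      # row 1, column 2
first_row = grid[0]       # every column of row 0
block = grid[:2, 1:3]     # rows 0-1, columns 1-2

# Universal functions (ufuncs) operate element-wise without a Python loop.
squared = np.square(grid)
shifted = grid + 10       # arithmetic operators dispatch to ufuncs as well

print(element)
print(block)
print(squared)
```

Because the operations are vectorized, the same expressions work unchanged on arrays of any shape.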

Part-3 Introduction to Pandas
In this part of the guide series we introduce Pandas, a Python package that builds on NumPy and provides data structures and functions designed to make working with structured data fast, easy, and expressive.

You will see how a DataFrame can be created and how its data can be accessed using the .loc and .iloc indexers. We discuss in detail how to check the data types in a DataFrame and how to change them. We also discuss how to perform various operations on a DataFrame (e.g., arithmetic, reindexing, adding and dropping data) and how to work with missing data. We briefly introduce working with a Series object as well.
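
The operations listed above could look roughly like the following sketch. The sample data and column names (city, population, area_km2, density) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data; "population" is stored as text with one missing value.
df = pd.DataFrame(
    {
        "city": ["Austin", "Boston", "Denver"],
        "population": ["950000", "650000", None],
        "area_km2": [790.1, 232.1, 401.3],
    },
    index=["a", "b", "c"],
)

# Label-based access with .loc and position-based access with .iloc.
by_label = df.loc["b", "city"]
by_position = df.iloc[0, 2]

# Inspect the data types, then convert the text column to numbers.
print(df.dtypes)
df["population"] = pd.to_numeric(df["population"])

# Missing data: detect it, then fill it with a placeholder value.
missing = df["population"].isna()
filled = df["population"].fillna(0)

# Add a column with element-wise arithmetic, then drop it again.
df["density"] = df["population"] / df["area_km2"]
dropped = df.drop(columns="density")
```

A single column such as df["population"] is a Series, so the Series methods used here (isna, fillna) apply to any column.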

Part-4 Productivity with Pandas
In this part of the guide series we learn how to be more productive with Pandas. We start with data aggregation using groupby and pivot_table. Next, we discuss how data can be combined using the concat(), append(), merge(), and join() methods. You will see how data can be indexed at multiple levels in the hierarchical indexing section. Here, we discuss multi-indexed Series and DataFrame objects, including selection, sorting, and aggregation methods. We also explore working with categorical data.
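
A minimal sketch of the aggregation, combining, and hierarchical indexing operations named above follows. The sales and managers tables are invented for this example. Note that DataFrame.append() was removed in pandas 2.0, so the sketch uses concat() for appending rows.

```python
import pandas as pd

# Hypothetical sales data used only for illustration.
sales = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "revenue": [100, 150, 200, 250],
    }
)

# Aggregation: total revenue per region, and a region-by-quarter pivot table.
totals = sales.groupby("region")["revenue"].sum()
table = sales.pivot_table(
    index="region", columns="quarter", values="revenue", aggfunc="sum"
)

# Combining: a left join on a shared key column, and row-wise concatenation.
managers = pd.DataFrame({"region": ["East", "West"], "manager": ["Ana", "Raj"]})
merged = sales.merge(managers, on="region", how="left")
more_sales = pd.DataFrame({"region": ["East"], "quarter": ["Q3"], "revenue": [175]})
combined = pd.concat([sales, more_sales], ignore_index=True)

# Hierarchical indexing: a Series indexed by (region, quarter).
indexed = sales.set_index(["region", "quarter"])["revenue"]
west_q2 = indexed.loc[("West", "Q2")]

print(totals)
print(table)
```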

Part-5 Time Series Analysis with Pandas
In this part of the guide series, you will see in detail how to work with time series data. Here, we briefly introduce date and time data types in native Python and then focus on date/time data in Pandas. You will see how date ranges can be created with date_range and different frequencies. We discuss various indexing and selection operations on time series data. Next, we introduce time series specific operations, such as resample(), shift(), tshift(), and rolling(). We also briefly discuss time zones and operating on data with different time zones.
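
The time series operations named above can be sketched as follows, using a small made-up daily series. Since tshift() was removed in pandas 2.0, this sketch uses shift() with a freq argument to move the index instead; the time zone names are also assumptions picked for the example.

```python
import pandas as pd

# Six consecutive daily timestamps with illustrative values.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Selection by date strings; both endpoints are included.
jan_2_to_4 = ts["2024-01-02":"2024-01-04"]

# Resample into 2-day bins and sum the values in each bin.
two_day_sum = ts.resample("2D").sum()

# shift() moves the values; shift(freq=...) moves the index instead.
lagged = ts.shift(1)
moved_index = ts.shift(1, freq="D")

# Rolling mean over a 3-observation window.
rolling_mean = ts.rolling(window=3).mean()

# Time zones: attach UTC to the naive index, then convert to another zone.
utc = ts.tz_localize("UTC")
eastern = utc.tz_convert("America/New_York")

print(two_day_sum)
print(eastern.head())
```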

Mohit Aggarwal

Language: Python