EDA - Thyroid Classification

Introduction

This notebook performs the exploratory data analysis tasks needed for the thyroid classification project.

Problem Statement: Build a classification methodology to predict the type of thyroid condition a person has, based on the features below.

| Column | Description | Column | Description |
|---|---|---|---|
| age | Age of the person | TSH_measured | true or false |
| sex | Male or Female | TSH | thyroid stimulating hormone floating value |
| on_thyroxine | true or false | T3_measured | true or false |
| on_antithyroid_medication | true or false | T3 | triiodothyronine value |
| sick | true or false | TT4_measured | true or false |
| pregnant | true or false | TT4 | Thyroxine value |
| thyroid_surgery | true or false | T4U_measured | true or false |
| I131_treatment | true or false | T4U | numerical value |
| query_hypothyroid | true or false | FTI_measured | true or false |
| query_hyperthyroid | true or false | FTI | Free Thyroxine Index |
| lithium | true or false | TBG_measured | true or false |
| goitre | true or false | TBG | Thyroid Binding Globulin value |
| tumor | true or false | referral_source | different sources of referrals |
| hypopituitary | true or false | Class | different types of thyroid |
| psych | true or false | | |

Read the data

The data exported from the database has 3972 records and 30 columns.

Observations:

- The data does not show any missing values at first glance. On closer inspection, missing values are marked with `?`.
- Some columns with t/f values, such as `FTI_measured`, only indicate whether the next column (`FTI`) has a value.

Let's count the missing values and remove the columns that do not add any information.

Data Exploration

Missing Values

Identify missing

All records in the TBG column are missing. We will drop this column later.
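
A count like this could be obtained as sketched below. The small frame is a hypothetical stand-in for the exported data, and the values in it are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the exported data (the real frame has 30 columns).
df = pd.DataFrame({
    "age": ["41", "23", "?"],
    "TSH": ["1.3", "?", "0.98"],
    "TBG": ["?", "?", "?"],
})

# Count the '?' placeholders in each column.
missing_counts = (df == "?").sum()
print(missing_counts)

# Columns in which every record is missing.
all_missing = missing_counts[missing_counts == len(df)].index.tolist()
print(all_missing)
```

Comparing against `len(df)` is a simple way to spot columns, such as TBG here, that carry no values at all.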

Replace `?` with NaN
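
A minimal sketch of this step, assuming the numeric columns were read in as strings because of the `?` placeholder. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: numeric columns arrive as strings because of '?'.
df = pd.DataFrame({"age": ["41", "?"], "TSH": ["?", "0.98"]})

# Replace the placeholder with a real missing value.
df_clean = df.replace("?", np.nan)

# Convert the string columns to numbers so NaN is handled natively.
df_clean = df_clean.apply(pd.to_numeric)

print(df_clean.isna().sum())
```

After this step, `isna()` reports the missing values that were previously hidden behind the placeholder.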

Remove columns with duplicate information

Columns with `_measured` in the name contain t/f values that only indicate whether the column that follows has a value: `t` means the next column (for example, `FTI` after `FTI_measured`) has a value, and `f` means it is missing. Since these columns do not add any information, we will drop them.
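
One way to do this is sketched below, using a hypothetical four-column sample. The check that the flag agrees with the missingness of the next column is an illustrative addition, not a step taken from the notebook.

```python
import pandas as pd

# Hypothetical sample with two indicator/value column pairs.
df = pd.DataFrame({
    "TSH_measured": ["t", "f"],
    "TSH": [1.3, None],
    "FTI_measured": ["t", "t"],
    "FTI": [109.0, 120.0],
})

# Confirm that 'f' in the indicator lines up with a missing value.
flag_matches = bool(((df["TSH_measured"] == "f") == df["TSH"].isna()).all())
print(flag_matches)

# Drop every indicator column.
measured_cols = [c for c in df.columns if c.endswith("_measured")]
df_reduced = df.drop(columns=measured_cols)
print(df_reduced.columns.tolist())
```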

Observation:

- The majority of the columns are categorical with two categories.
- The referral source column has 5 categories and the Class column has 4.

Before we impute the missing data, let's map the binary categories and encode the multi-category columns to numerical data as needed.

Feature Engineering

In this section, we will map the features with binary categories and encode the features with multi-categories to numerical data.

Convert sex column to numerical

Convert columns with binary categories to numerical

There are many columns with two unique categories, t and f. Let's map these to numerical 0 and 1.
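
The two mapping steps could look like the sketch below. The `F`/`M` labels for the sex column and the 0/1 assignment are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; the 'F'/'M' labels are assumed.
df = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "on_thyroxine": ["f", "t", "f"],
    "sick": ["f", "f", "t"],
})

# Find the columns whose only values are 't' and 'f'.
binary_cols = [
    c for c in df.columns
    if set(df[c].dropna().unique()) <= {"t", "f"}
]

# Map f -> 0 and t -> 1 in each of those columns.
for col in binary_cols:
    df[col] = df[col].map({"f": 0, "t": 1})

# Map the sex column separately, since its labels differ.
df["sex"] = df["sex"].map({"F": 0, "M": 1})

print(df)
```

Values missing from the mapping dictionary become NaN, so missing entries stay missing until the imputation step.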

One-Hot encode referral source

The column has multiple categories. Let's create dummy variables for these categories.
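
A minimal sketch using `pd.get_dummies`. The category labels below are placeholders, not the labels in the actual data.

```python
import pandas as pd

# Placeholder category labels for illustration only.
df = pd.DataFrame({"referral_source": ["src_a", "src_b", "src_c", "src_b"]})

# Create one 0/1 column per category.
encoded = pd.get_dummies(
    df,
    columns=["referral_source"],
    prefix="referral_source",
    dtype=int,
)

print(encoded)
```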

Label encode Class

The column has multiple categories. Since Class represents various severities of thyroid condition, we will label encode the feature.
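
Label encoding could be done with scikit-learn's `LabelEncoder`, as sketched below. The class labels are placeholders.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder class labels for illustration only.
df = pd.DataFrame({"Class": ["class_b", "class_a", "class_b", "class_c"]})

encoder = LabelEncoder()
df["Class"] = encoder.fit_transform(df["Class"])

# classes_ lists the original labels in the order of their integer codes.
print(list(encoder.classes_))
print(df["Class"].tolist())
```

Note that `LabelEncoder` assigns codes in sorted label order, so the integers reflect severity only if the labels happen to sort that way. An explicit mapping dictionary is an alternative when the order matters.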

Impute Missing

The data has many missing values. We will use KNNImputer to impute them.
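
A minimal sketch of the imputation on a hypothetical sample. The choice of `n_neighbors=2` is an assumption for this tiny example, not the setting used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric sample with two missing entries.
df = pd.DataFrame({
    "age": [41.0, 23.0, 46.0, 70.0],
    "TSH": [1.3, np.nan, 0.98, 0.16],
    "TT4": [125.0, 102.0, np.nan, 175.0],
})

# Each missing entry is filled with the mean of that feature
# over the nearest rows (distance computed on observed features).
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(imputed)
```

Because the neighbor search is distance based, features on very different scales can dominate it, so scaling before imputation may be worth considering.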

Handle Outliers

Let's check the distribution of some numeric features.

Observation:

- The distributions of some features look skewed, which could be due to the presence of outliers.

Let's create box-plots to determine outliers.
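
As a numeric companion to the box plots, the sketch below counts the points that fall outside the usual 1.5 × IQR whiskers. The helper name `iqr_outlier_counts` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame):
    """Count values beyond 1.5 * IQR from the quartiles, per column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return outside.sum()

# Hypothetical sample with one implausible age.
df = pd.DataFrame({
    "age": [41, 23, 46, 70, 455],
    "TSH": [1.3, 1.5, 0.98, 1.0, 1.1],
})

counts = iqr_outlier_counts(df)
print(counts)
```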

Observations:

- Box plots show outliers in most features.
- An age above 400 is not possible, so we will remove that record.
- TSH normally ranges from 0.1 to 15, but there is no fixed upper limit. A quick Google search suggests that most labs can measure TSH up to 150 mIU/ml. We will remove records with TSH > 150.

Remove records with age > 100 and TSH > 150
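
The filter could be written as below, on a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample with one implausible age and one extreme TSH value.
df = pd.DataFrame({
    "age": [41, 455, 70, 35],
    "TSH": [1.3, 2.0, 478.0, 0.5],
})

# Keep only the rows within both limits.
mask = (df["age"] <= 100) & (df["TSH"] <= 150)
df_filtered = df[mask].reset_index(drop=True)

print(df_filtered)
```

Comparisons with NaN evaluate to False, so this filter would also drop rows with missing `age` or `TSH` if it were applied before imputation.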

Observations:

- Removing the outlier from `age` has significantly improved the distribution, which now looks normal.
- Removing the outliers from `TSH` has also improved the distribution a little, but it is still heavily skewed to the right.
- Similarly, other features also seem to be skewed to the right.

Let's apply some transformations to see whether the data can be brought closer to a normal distribution.

Data Transformation

In this section, we apply transformations to see whether the features can be brought closer to a normal distribution.

Apply log transformation

The data seems to be skewed to the left after applying the log transformation.
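
A sketch of how the effect of a log transformation on skewness can be checked. It uses `np.log1p` (the log of 1 + x) so that zeros are handled, which is an assumption. The notebook may have used a plain logarithm, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed TSH-like values.
tsh = pd.Series([0.1, 0.5, 1.0, 1.5, 2.0, 4.0, 10.0, 60.0, 145.0], name="TSH")

# Log-transform and compare sample skewness before and after.
log_tsh = np.log1p(tsh)

print(tsh.skew(), log_tsh.skew())
```

A negative skewness after the transformation would correspond to the left skew noted above. This toy sample only shows how to measure the change.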

Apply Box-Cox transformation

The Box-Cox transformation does a good job of bringing the features close to a normal distribution.
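
A minimal sketch with `scipy.stats.boxcox` on synthetic log-normal data, which stands in for a right-skewed feature. Box-Cox requires strictly positive inputs, so features containing zeros would need a shift first.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data (not the thyroid data).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.5, sigma=1.0, size=500)

# With no lambda given, boxcox fits it by maximum likelihood
# and returns both the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(values)

print(fitted_lambda)
print(stats.skew(values), stats.skew(transformed))
```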

Distribution of Class

In this section, we will check the distribution of our dependent variable Class. We will also use an oversampling technique to handle the class imbalance.

The data is highly imbalanced, with the majority of records belonging to Class = 1.

Let's oversample the data.

Oversample using RandomOverSampler()
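
The idea behind random oversampling can be sketched with pandas alone, as below. This is a stand-in written for illustration, not the `RandomOverSampler` class from imbalanced-learn that the notebook names, and the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced sample: Class = 1 dominates.
df = pd.DataFrame({
    "TSH": [1.3, 0.9, 2.1, 1.1, 45.0, 0.02],
    "Class": [1, 1, 1, 1, 0, 2],
})

# Every class is grown to the size of the largest class.
target_size = df["Class"].value_counts().max()

parts = []
for _, group in df.groupby("Class"):
    parts.append(group)
    shortfall = target_size - len(group)
    if shortfall > 0:
        # Draw extra rows from the minority class with replacement.
        parts.append(group.sample(n=shortfall, replace=True, random_state=0))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["Class"].value_counts())
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.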

Clustering

Here, we determine the best number of clusters into which the data can be split.
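
The notebook text does not say how the number of clusters is chosen. One common option, shown here as an assumption, is to fit KMeans for several values of k and keep the k with the highest silhouette score. The data below is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (not the thyroid data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Score each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

The elbow method on the within-cluster sum of squares is another widely used choice.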

Modeling

This section shows two of the four modeling techniques used to model the training data. Refer to the application for full details on the various models and the hyperparameter tuning performed.

XGB

SVM
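
A minimal sketch of fitting an SVM classifier with scikit-learn. The synthetic two-class data, the RBF kernel, and `C=1.0` are all assumptions for illustration. The tuned settings live in the application.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the prepared features.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, size=(60, 4)),
    rng.normal(loc=3.0, size=(60, 4)),
])
y = np.array([0] * 60 + [1] * 60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features first, since SVMs are sensitive to feature scale.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```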

-- End of Exploration --