How to treat missing values in data in Python

Before machine learning algorithms can be used, some data pre-processing is required. Missing value treatment is one of these steps.

How to find missing values in Python?

The isnull() function of a pandas data frame flags every missing value, and chaining sum() onto it counts the missing values in each column.

LoanData.isnull().sum()

# Creating a sample data frame

import pandas as pd
import numpy as np
ColumnNames=['CIBIL','AGE','GENDER','SALARY','APPROVE_LOAN']
DataValues=[ [480, 28, 'M', 610000, 'Yes'],
             [480, np.nan, 'M',140000, 'No'],
             [480, 29, 'M',420000, 'No'],
             [490, 30, 'M',420000, 'No'],
             [500, 27, 'M',420000, 'No'],
             [510, np.nan, 'F',190000, 'No'],
             [550, 24, 'M',330000, np.nan],
             [560, 34, 'M',160000, 'No'],
             [560, 25, 'F',300000, 'Yes'],
             [570, 34, 'M',450000, 'Yes'],
             [590, 30, 'F',140000, 'Yes'],
             [600, np.nan, 'F',600000, 'Yes'],
             [600, 22, 'M',400000, 'No'],
             [600, 25, 'F',490000, 'Yes'],
             [610, 32, 'F',120000, np.nan],
             [630, 29, 'F',360000, 'Yes'],
             [630, 30, 'F',480000, 'Yes'],
             [660, 29, 'F',460000, 'Yes'],
             [700, 32, 'M',470000, 'Yes'],
             [740, 28, 'M',400000, 'Yes']]
#Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
#########################################################
# Finding out missing values
LoanData.isnull().sum()
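
The count alone does not say how large the problem is relative to the data. The sketch below reports the share of missing values next to the count. It uses a small made-up frame (not the LoanData above), and the names df and summary are only illustrative.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the article's LoanData)
df = pd.DataFrame({
    "AGE": [28, np.nan, 29, np.nan],
    "GENDER": ["M", "M", "F", "F"],
    "APPROVE_LOAN": ["Yes", "No", np.nan, "Yes"],
})

# isnull() gives a True/False mask; sum() counts the True values per column
missing_count = df.isnull().sum()

# The mean of a True/False mask is the fraction of True values per column
missing_percent = df.isnull().mean() * 100

summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_percent": missing_percent,
})
print(summary)
```

A column with a high missing percentage may need a different treatment than a column with only a handful of gaps.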

How to treat missing values?

Once you have found the missing values in each column, you have the following options:

  • Deleting all the missing values
  • Replacing missing values with the median for continuous columns
  • Replacing missing values with the mode for categorical columns
  • Interpolating values

How to delete all missing values at once?

Use this option only when the number of rows containing missing values is much smaller than the total number of rows. Otherwise, too much data is thrown away.

The dropna() function of the pandas data frame removes all those rows that contain at least one missing value.

# Code to delete all the missing values at once
print('Before Deleting missing values:', LoanData.shape)
LoanDataCleaned=LoanData.dropna()
print('After Deleting missing values:', LoanDataCleaned.shape)
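
Before deleting rows, it can help to check how much data would be lost. The sketch below compares a plain dropna() with the subset parameter, which removes a row only when one of the listed columns is missing. The frame is a small made-up example, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the article's LoanData)
df = pd.DataFrame({
    "CIBIL": [480, 510, 550, 600],
    "AGE": [28, np.nan, 24, np.nan],
    "APPROVE_LOAN": ["Yes", "No", np.nan, "Yes"],
})

# Drop every row that has at least one missing value in any column
all_dropped = df.dropna()

# Drop a row only when AGE is missing; gaps in other columns are kept
age_dropped = df.dropna(subset=["AGE"])

# Fraction of rows lost by the plain dropna()
fraction_lost = 1 - len(all_dropped) / len(df)

print("Original rows:", len(df))
print("Rows after dropna():", len(all_dropped))
print("Rows after dropna(subset=['AGE']):", len(age_dropped))
print("Fraction lost by dropna():", fraction_lost)
```

In this toy frame a plain dropna() discards three of the four rows, which is the situation where deletion is a poor choice.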

Replacing missing values using median/mode

Missing value treatment is done separately for each column in the data. If the column is continuous, its missing values are replaced by the median of the same column. If the column is categorical, its missing values are replaced by the mode of the same column.

# Replacing with median value for a numeric variable
MedianAge=LoanData['AGE'].median()
LoanData['AGE']=LoanData['AGE'].fillna(value=MedianAge)
# Replacing with mode value for a categorical variable
ModeValue=LoanData['APPROVE_LOAN'].mode()[0]
LoanData['APPROVE_LOAN']=LoanData['APPROVE_LOAN'].fillna(value=ModeValue)
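
When there are many columns, the same rule can be applied in a loop. The sketch below is one possible way to do this. It assumes that every numeric column is continuous and every non-numeric column is categorical, which will not hold if categories are stored as numeric codes. The frame and the names df and filled are made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small illustrative frame (hypothetical values, not the article's LoanData)
df = pd.DataFrame({
    "AGE": [28, np.nan, 30, 24, np.nan],
    "APPROVE_LOAN": ["Yes", "No", np.nan, "Yes", "Yes"],
})

filled = df.copy()  # keep the original frame untouched
for col in filled.columns:
    if is_numeric_dtype(filled[col]):
        # Assumption: numeric column means continuous, so use the median
        fill_value = filled[col].median()
    else:
        # Assumption: non-numeric column means categorical, so use the mode
        # mode() can return several values on a tie; [0] takes the first
        fill_value = filled[col].mode()[0]
    filled[col] = filled[col].fillna(fill_value)

print(filled)
```

Working on a copy makes it easy to compare the filled frame with the original and to try a different treatment later.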

Replacing missing values using interpolation

Instead of replacing all missing values with a single value (median/mode), you can estimate each missing value from the data in the neighbouring rows. This is called interpolation.

# Replacing missing values by linear interpolation for a numeric variable
LoanData['AGE']=LoanData['AGE'].interpolate(method='linear')
# Replacing missing values by forward fill for a categorical variable
# (ffill() replaces interpolate(method='ffill'), which is deprecated in recent pandas versions)
LoanData['APPROVE_LOAN']=LoanData['APPROVE_LOAN'].ffill()
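
To see what these two fills actually produce, here is a minimal sketch on two made-up series (the names ages and status are only illustrative). It uses ffill() for the categorical case, since forward fill copies the last known value rather than computing a new one.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric series with gaps between known values
ages = pd.Series([22.0, np.nan, 26.0, np.nan, np.nan, 32.0])

# Linear interpolation spaces the missing values evenly between neighbours
linear = ages.interpolate(method="linear")
print(linear.tolist())

# Hypothetical categorical series with gaps
status = pd.Series(["Yes", np.nan, "No", np.nan])

# Forward fill copies the last known value into each gap
forward = status.ffill()
print(forward.tolist())
```

Both approaches depend on row order, so they make sense only when neighbouring rows are actually related, for example in data sorted by time. A missing value in the very first row has no earlier neighbour and stays missing after a forward fill.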

 
