How to treat outliers in data in Python

You need to measure the relationship between a categorical variable and a numeric variable in the following situations when doing regression or classification in machine learning.

Regression: The target variable is numeric and one of the predictors is categorical.
Classification: The target variable is categorical and one of the predictors is numeric.

In both cases, the strength of the correlation between the variables can be measured using the ANOVA test.

ANOVA stands for Analysis of Variance. The test measures whether the mean of the numeric variable differs significantly from one category of the categorical variable to another. You can also visualize this using a box plot.
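To make the idea concrete, here is a minimal sketch of the F-statistic that one-way ANOVA is built on: the variance between the group means divided by the variance within the groups. The three groups below are illustrative values chosen for this sketch, and the variable names are assumptions, not part of the example that follows later.

```python
import numpy as np
from scipy.stats import f_oneway

# Illustrative prices for three groups (hypothetical values)
groups = [
    np.array([2000, 2100, 1900, 2150]),
    np.array([2500, 2700, 2900, 2850]),
    np.array([1500, 1400, 1600, 1650]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: distance of each group mean from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# Cross-check against scipy's implementation
f_scipy, p_value = f_oneway(*groups)

print("F (manual):", f_manual)
print("F (scipy): ", f_scipy)
print("P-value:   ", p_value)
```

A large F value means the group means are far apart relative to the spread inside each group, which is what produces a small P-value.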

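Keep the following points in mind about the ANOVA hypothesis test:

Null hypothesis (H0): The mean of the numeric variable is the same for every category, which means the variables are not correlated with each other.
P-value: The probability of observing differences between the group means at least as large as those in the sample, assuming the null hypothesis is true.
Fail to reject the null hypothesis if P-value > 0.05. There is not enough evidence that the variables are correlated.
Reject the null hypothesis if P-value < 0.05. The variables are correlated.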
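The decision rule can be sketched as a small helper. The function name `interpret_p_value` and the `alpha` parameter are assumptions made for illustration; 0.05 is only the conventional threshold.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Return a short verdict for an ANOVA P-value at significance level alpha."""
    if p_value < alpha:
        return "Reject H0: the group means differ, so the variables are correlated"
    return "Fail to reject H0: no evidence that the variables are correlated"

print(interpret_p_value(4.355e-12))
print(interpret_p_value(0.37))
```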

In the example below, we measure whether there is any correlation between FuelType and CarPrice. Here FuelType is the categorical predictor and CarPrice is the numeric target variable.

# Generating sample data
import pandas as pd

ColumnNames=['FuelType','CarPrice']

DataValues= [[ 'Petrol', 2000],
             [ 'Petrol', 2100],
             [ 'Petrol', 1900],
             [ 'Petrol', 2150],
             [ 'Petrol', 2100],
             [ 'Petrol', 2200],
             [ 'Petrol', 1950],
             [ 'Diesel', 2500],
             [ 'Diesel', 2700],
             [ 'Diesel', 2900],
             [ 'Diesel', 2850],
             [ 'Diesel', 2600],
             [ 'Diesel', 2500],
             [ 'Diesel', 2700],
             [ 'CNG', 1500],
             [ 'CNG', 1400],
             [ 'CNG', 1600],
             [ 'CNG', 1650],
             [ 'CNG', 1600],
             [ 'CNG', 1500],
             [ 'CNG', 1500]
            ]

# Create the Data Frame
CarData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(CarData.head())

########################################################
# f_oneway() function takes the group data as input and
# returns F-statistic and P-value
from scipy.stats import f_oneway

# Running the one-way ANOVA test between CarPrice and FuelType
# Assumption (H0) is that FuelType and CarPrice are NOT correlated

# Collect the CarPrice values for each FuelType as a list
CategoryGroupLists=CarData.groupby('FuelType')['CarPrice'].apply(list)

# Performing the ANOVA test
# We fail to reject the assumption (H0) when P-Value > 0.05
AnovaResults = f_oneway(*CategoryGroupLists)
print('P-Value for Anova is: ', AnovaResults[1])

Sample Output : P-Value for Anova: 4.355e-12

As the P-value is almost zero, we reject H0, which means the variables are correlated with each other.
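If you have several categorical predictors to check, the same steps can be wrapped in a function. The following is a sketch under stated assumptions: the helper name `anova_p_value`, the `SampleData` frame, and the `Colour` column are all made up for illustration and are not part of the example above.

```python
import pandas as pd
from scipy.stats import f_oneway

def anova_p_value(data, categorical_col, numeric_col):
    """Run one-way ANOVA of numeric_col grouped by categorical_col and return the P-value."""
    group_lists = data.groupby(categorical_col)[numeric_col].apply(list)
    return f_oneway(*group_lists).pvalue

# Small illustrative frame; the Colour predictor is hypothetical
SampleData = pd.DataFrame({
    'FuelType': ['Petrol', 'Petrol', 'Petrol', 'Diesel', 'Diesel', 'Diesel', 'CNG', 'CNG', 'CNG'],
    'Colour':   ['Red', 'Blue', 'White', 'Red', 'Blue', 'White', 'Red', 'Blue', 'White'],
    'CarPrice': [2000, 2100, 1900, 2500, 2700, 2900, 1500, 1400, 1600],
})

# Test each categorical predictor against the numeric target
PValues = {}
for predictor in ['FuelType', 'Colour']:
    PValues[predictor] = anova_p_value(SampleData, predictor, 'CarPrice')
    print(predictor, 'P-value:', PValues[predictor])
```

In this made-up data the prices are arranged to vary by FuelType and not by Colour, so only FuelType should come out below the 0.05 threshold.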
