How to treat outliers in data in Python

Measuring the relationship between a categorical variable and a numeric variable comes up when you are doing regression or classification in machine learning.

  • Regression: The target variable is numeric and one of the predictors is categorical
  • Classification: The target variable is categorical and one of the predictors is numeric

In both of these cases, the strength of the correlation between the variables can be measured using the ANOVA test.

ANOVA stands for Analysis of Variance. The test measures whether there are significant differences between the means of the numeric variable across the values of the categorical variable. You can also visualize this with a box plot.
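To make the idea of comparing group means concrete, here is a minimal sketch of the F-statistic that a one-way ANOVA computes: the variance between group means divided by the variance within groups. The function name `one_way_f_statistic` and the toy data are assumptions made for illustration; in practice a library routine would normally be used instead.

```python
from statistics import mean

def one_way_f_statistic(groups):
    """Return the one-way ANOVA F-statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of categories
    n = len(all_values)    # total number of observations

    # Between-group sum of squares: distance of each group mean from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical toy data: three categories with clearly different means
toy_groups = [[10, 12, 11], [20, 22, 21], [30, 32, 31]]
f_stat = one_way_f_statistic(toy_groups)
print("F-statistic:", f_stat)
```

A large F-statistic means the group means are far apart relative to the spread inside each group, which is what leads to a small P-value.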

Keep the following points in mind about the ANOVA hypothesis test:

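  • Null hypothesis (H0): The variables are not correlated with each other, i.e. the mean of the numeric variable is the same for every category
  • P-value: The probability of seeing differences between the group means at least as large as those observed, assuming the null hypothesis is true
  • Fail to reject the null hypothesis if P-value > 0.05. This means there is no evidence that the variables are correlated
  • Reject the null hypothesis if P-value < 0.05. This means the variables are correlated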
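The decision rule above can be sketched as a small helper. The function name `interpret_anova_p_value`, the `alpha` parameter, and the two P-values passed in are hypothetical and chosen only for illustration; 0.05 is the threshold used in this article.

```python
def interpret_anova_p_value(p_value, alpha=0.05):
    """Apply the ANOVA decision rule to a P-value (illustrative helper)."""
    if p_value < alpha:
        return "reject H0: the group means differ, so the variables appear related"
    return "fail to reject H0: no evidence that the variables are related"

# Illustrative P-values, not taken from any real dataset
print(interpret_anova_p_value(0.001))
print(interpret_anova_p_value(0.30))
```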

In the example below, we measure whether there is any correlation between FuelType and CarPrice. Here FuelType is the categorical predictor and CarPrice is the numeric target variable.

# Generating sample data
import pandas as pd
from scipy.stats import f_oneway

ColumnNames = ['FuelType', 'CarPrice']
DataValues = [['Petrol', 2000],
              ['Petrol', 2100],
              ['Petrol', 1900],
              ['Petrol', 2150],
              ['Petrol', 2100],
              ['Petrol', 2200],
              ['Petrol', 1950],
              ['Diesel', 2500],
              ['Diesel', 2700],
              ['Diesel', 2900],
              ['Diesel', 2850],
              ['Diesel', 2600],
              ['Diesel', 2500],
              ['Diesel', 2700],
              ['CNG', 1500],
              ['CNG', 1400],
              ['CNG', 1600],
              ['CNG', 1650],
              ['CNG', 1600],
              ['CNG', 1500],
              ['CNG', 1500]]

# Create the DataFrame
CarData = pd.DataFrame(data=DataValues, columns=ColumnNames)
print(CarData.head())

# f_oneway() takes the data of each group as a separate argument and
# returns the F-statistic and the P-value.
# Running the one-way ANOVA test between CarPrice and FuelType
# Assumption (H0) is that FuelType and CarPrice are NOT correlated

# Collect the prices for each FuelType as a list
CategoryGroupLists = CarData.groupby('FuelType')['CarPrice'].apply(list)

# Performing the ANOVA test
# We fail to reject the assumption (H0) only when P-Value > 0.05
AnovaResults = f_oneway(*CategoryGroupLists)
print('P-Value for Anova is: ', AnovaResults[1])

Sample output: P-Value for Anova is: 4.355e-12

Because the P-value is almost zero, we reject H0. This means the mean car price differs across fuel types, so the variables are correlated with each other.
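To see why the P-value is so small, it can help to look at the per-group statistics next to the test result. The sketch below rebuilds the same sample data in a more compact form; the names `prices`, `car_data`, and `group_summary` are assumptions introduced here for illustration, not part of the example above.

```python
import pandas as pd
from scipy.stats import f_oneway

# Same sample prices as in the example, keyed by fuel type
prices = {
    "Petrol": [2000, 2100, 1900, 2150, 2100, 2200, 1950],
    "Diesel": [2500, 2700, 2900, 2850, 2600, 2500, 2700],
    "CNG": [1500, 1400, 1600, 1650, 1600, 1500, 1500],
}
car_data = pd.DataFrame(
    [(fuel, price) for fuel, values in prices.items() for price in values],
    columns=["FuelType", "CarPrice"],
)

# Count, mean and standard deviation of CarPrice for each FuelType
group_summary = car_data.groupby("FuelType")["CarPrice"].agg(["count", "mean", "std"])
print(group_summary.round(1))

# f_oneway returns the F-statistic and the P-value, which can be unpacked
f_stat, p_value = f_oneway(*prices.values())
print("F-statistic:", f_stat)
print("P-value:", p_value)
```

The group means sit hundreds of units apart while the spread inside each group is comparatively small, which is the pattern a box plot of CarPrice by FuelType would also show.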
