How to measure the relationship between a categorical and a continuous variable in Python

This scenario can happen when you are doing regression or classification in machine learning:

  • Regression: the target variable is numeric and one of the predictors is categorical
  • Classification: the target variable is categorical and one of the predictors is numeric

In both cases, the strength of the relationship between the two variables can be measured using the ANOVA test.

ANOVA stands for Analysis Of Variance. The test measures whether there are significant differences between the means of the numeric variable across the categories of the categorical variable. This is something you can also visualize using a box plot.

The following points should be remembered about the ANOVA hypothesis test:

  • Null hypothesis (H0): the variables are not correlated with each other
  • P-value: the probability of observing data at least this extreme if the null hypothesis is true
  • Accept the null hypothesis if P-value > 0.05, meaning the variables are NOT correlated
  • Reject the null hypothesis if P-value < 0.05, meaning the variables are correlated

In the example below, we measure whether there is any relationship between FuelType and CarPrice. Here FuelType is the categorical predictor and CarPrice is the numeric target variable.

# Generating sample data
import pandas as pd
ColumnNames=['FuelType','CarPrice']
DataValues= [['Petrol', 2000],
             ['Petrol', 2100],
             ['Petrol', 1900],
             ['Petrol', 2150],
             ['Petrol', 2100],
             ['Petrol', 2200],
             ['Petrol', 1950],
             ['Diesel', 2500],
             ['Diesel', 2700],
             ['Diesel', 2900],
             ['Diesel', 2850],
             ['Diesel', 2600],
             ['Diesel', 2500],
             ['Diesel', 2700],
             ['CNG',    1500],
             ['CNG',    1400],
             ['CNG',    1600],
             ['CNG',    1650],
             ['CNG',    1600],
             ['CNG',    1500],
             ['CNG',    1500]]
# Create the Data Frame
CarData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(CarData.head())
########################################################
# f_oneway() takes the group data as input and
# returns the F-statistic and P-value
from scipy.stats import f_oneway
# Running the one-way ANOVA test between CarPrice and FuelType
# Assumption (H0): FuelType and CarPrice are NOT correlated
# Get the CarPrice data for each FuelType as a list
CategoryGroupLists=CarData.groupby('FuelType')['CarPrice'].apply(list)
# Performing the ANOVA test
# We accept the assumption (H0) only when P-value > 0.05
AnovaResults = f_oneway(*CategoryGroupLists)
print('P-Value for ANOVA is:', AnovaResults[1])

Sample output: P-Value for ANOVA is: 4.355e-12

Since the P-value is almost zero, we reject H0, which means the variables are correlated with each other.

How to treat missing values in data in Python

Before machine learning algorithms can be used, some data pre-processing is required, and missing value treatment is one of those steps.

How to find missing values in Python?

The isnull() function of a pandas data frame helps to find missing values in each column. Chaining it with sum() gives the count of missing values per column.

LoanData.isnull().sum()

# Creating a sample data frame
import pandas as pd
import numpy as np
ColumnNames=['CIBIL','AGE','GENDER','SALARY','APPROVE_LOAN']
DataValues=[ [480, 28, 'M', 610000, 'Yes'],
             [480, np.nan, 'M', 140000, 'No'],
             [480, 29, 'M', 420000, 'No'],
             [490, 30, 'M', 420000, 'No'],
             [500, 27, 'M', 420000, 'No'],
             [510, np.nan, 'F', 190000, 'No'],
             [550, 24, 'M', 330000, np.nan],
             [560, 34, 'M', 160000, 'No'],
             [560, 25, 'F', 300000, 'Yes'],
             [570, 34, 'M', 450000, 'Yes'],
             [590, 30, 'F', 140000, 'Yes'],
             [600, np.nan, 'F', 600000, 'Yes'],
             [600, 22, 'M', 400000, 'No'],
             [600, 25, 'F', 490000, 'Yes'],
             [610, 32, 'F', 120000, np.nan],
             [630, 29, 'F', 360000, 'Yes'],
             [630, 30, 'F', 480000, 'Yes'],
             [660, 29, 'F', 460000, 'Yes'],
             [700, 32, 'M', 470000, 'Yes'],
             [740, 28, 'M', 400000, 'Yes']]
# Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
#########################################################
# Finding out missing values
LoanData.isnull().sum()

How to treat missing values?

Once you have found the missing values in each column, you have the following options:

  • Deleting all the missing values
  • Replacing missing values with the median for continuous columns
  • Replacing missing values with the mode for categorical columns
  • Interpolating values

How to delete all missing values at once?

This option is exercised only if the number of rows containing missing values is much less than the total number of rows.

The dropna() function of the pandas data frame removes all those rows that contain at least one missing value.

# Code to delete all the missing values at once
print('Before deleting missing values:', LoanData.shape)
LoanDataCleaned=LoanData.dropna()
print('After deleting missing values:', LoanDataCleaned.shape)

Replacing missing values using median/mode

Missing value treatment is done separately for each column in the data. If a column is continuous, its missing values are replaced by the median of that column. If a column is categorical, its missing values are replaced by the mode of that column.

# Replacing with the median value for a numeric variable
MedianAge=LoanData['AGE'].median()
LoanData['AGE']=LoanData['AGE'].fillna(value=MedianAge)
# Replacing with the mode value for a categorical variable
ModeValue=LoanData['APPROVE_LOAN'].mode()[0]
LoanData['APPROVE_LOAN']=LoanData['APPROVE_LOAN'].fillna(value=ModeValue)

Replacing missing values using interpolation

Instead of replacing all missing values with a single value (median/mode), you can also fill each missing value based on the values present in nearby rows. This is called interpolation.

# Replacing missing values by interpolation for a numeric variable
LoanData['AGE']=LoanData['AGE'].interpolate(method='linear')
# Numeric interpolation does not apply to a categorical variable;
# forward-fill the previous valid value instead
LoanData['APPROVE_LOAN']=LoanData['APPROVE_LOAN'].ffill()

 

K-fold cross validation

The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm or configuration on a dataset. Repeated k-fold cross-validation improves this estimate: it simply repeats the cross-validation procedure multiple times and reports the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true, unknown mean performance of the model on the dataset, and its uncertainty can be quantified using the standard error.

The algorithm of the k-Fold technique:

  1. Pick a number of folds – k. Usually, k is 5 or 10 but you can choose any number which is less than the dataset’s length.
  2. Split the dataset into k equal (if possible) parts (they are called folds)
  3. Choose k – 1 folds as the training set. The remaining fold will be the test set
  4. Train the model on the training set. On each iteration of cross-validation, you must train a new model independently of the model trained on the previous iteration
  5. Validate on the test set
  6. Save the result of the validation
  7. Repeat steps 3 – 6 k times, each time using a different fold as the test set. In the end, you should have validated the model on every fold.
  8. To get the final score, average the results that you saved in step 6.
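
Below is a minimal sketch of this procedure using scikit-learn's KFold and cross_val_score; the dataset here is synthetic and only for illustration.

# A minimal k-fold cross-validation sketch using scikit-learn (synthetic data)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data used only for illustration
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Step 1: pick the number of folds (k = 5 here)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Steps 2-7: cross_val_score splits the data into folds, trains a fresh model
# on each training set, and evaluates it on the held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring='r2')

# Step 8: average the per-fold results to get the final score
print('R2 for each fold:', scores)
print('Mean R2 across folds:', scores.mean())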

Regularization Technique

Regularization helps to solve the overfitting problem in machine learning. A model that is too simple will be a very poor generalization of the data.

At the same time, a complex model may not perform well on test data because of overfitting. We need to choose the right model, somewhere between the simple and the complex one. Regularization helps us choose the preferred model complexity, so that the model is better at predicting.

Regularization is nothing but adding a penalty term to the objective function and controlling the model complexity using that penalty term. It can be used with many machine learning algorithms.
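
As a hedged illustration, the sketch below compares ordinary linear regression with Ridge (L2 penalty) and Lasso (L1 penalty) from scikit-learn; the data is synthetic and the alpha value is arbitrary.

# Regularization sketch: Ridge (L2) and Lasso (L1) add a penalty term that
# shrinks the coefficients and controls model complexity (synthetic data)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, noise=15, random_state=0)

plain = LinearRegression().fit(X, y)   # no penalty term
ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty; alpha controls its strength
lasso = Lasso(alpha=1.0).fit(X, y)     # L1 penalty; can shrink coefficients to zero

print('Sum of |coefficients| (plain):', abs(plain.coef_).sum())
print('Sum of |coefficients| (ridge):', abs(ridge.coef_).sum())
print('Sum of |coefficients| (lasso):', abs(lasso.coef_).sum())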

What is a t-test ?

The t-test is a statistical hypothesis test that uses samples from two groups to determine whether there is a significant difference between the means of the two groups.

It compares the sample means and standard deviations while taking into account the sample size and the degree of variability in the data.

Steps:
1) State the hypotheses. These consist of a null hypothesis (H0) and an alternative hypothesis (Ha) that contradicts the null hypothesis. The null and alternative hypotheses are defined according to the type of test being performed.
2) Collect sample data.
3) Conduct the test.
4) Reject or fail to reject the null hypothesis H0.
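
A minimal sketch of these steps using scipy.stats.ttest_ind is shown below; the two groups contain made-up numbers used only to illustrate the procedure.

# Two-sample t-test sketch with SciPy (hypothetical sample values)
from scipy.stats import ttest_ind

# Step 1: H0 -> the two group means are equal; Ha -> the means differ
# Step 2: collect sample data (made-up values)
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2]
group_b = [12.9, 13.1, 12.8, 13.3, 13.0, 12.7, 13.2]

# Step 3: conduct the test
t_stat, p_value = ttest_ind(group_a, group_b)
print('t-statistic:', t_stat, 'P-value:', p_value)

# Step 4: reject or fail to reject H0 at the 0.05 level
if p_value < 0.05:
    print('Reject H0: the group means differ significantly')
else:
    print('Fail to reject H0: no significant difference detected')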

Interaction term in Linear Regression

In linear regression, interaction terms are additional terms added to the regression model to account for the possibility that the relationship between an independent variable (predictor) and the dependent variable (outcome) depends on the value of another independent variable.
In simpler terms, they represent the idea that the effect of one variable on the outcome is not constant but varies depending on the level of another variable.

Mathematically, an interaction term in a linear regression model takes the form of a product between two or more independent variables. For example, if you have two independent variables, X1 and X2, and you suspect that the effect of X1 on the outcome (Y) depends on the value of X2, you can introduce an interaction term like this:

Y = β0 + β1 * X1 + β2 * X2 + β3 * (X1 * X2) + ε

In this equation:

  • Y represents the dependent variable (the one you're trying to predict).
  • X1 and X2 are independent variables.
  • β0, β1, β2, and β3 are the regression coefficients that represent the relationship between the variables.
  • ε represents the error term.

The coefficient β3 measures the strength and direction of the interaction effect. If β3 is statistically significant and positive, it indicates that the effect of X1 on Y increases as X2 increases. If β3 is negative, it suggests that the effect of X1 on Y decreases as X2 increases. If β3 is not statistically significant, it implies that there is no interaction effect between X1 and X2.

Interaction terms are useful when you suspect that the relationship between variables is more complex than a simple additive relationship. They allow you to capture how the relationship between two variables changes in the presence of other variables, potentially leading to a better understanding of the underlying data and improved model accuracy. However, it's essential to be cautious when adding interaction terms: including too many can lead to overfitting (the higher the polynomial degree of the equation, the more likely the model is to overfit), and they should be based on theoretical or domain knowledge rather than testing multiple combinations blindly.
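
The hedged sketch below shows one way to fit such a model with statsmodels' formula API; the data frame and the column names X1, X2, and Y are synthetic placeholders, not data from this post.

# Fitting Y = b0 + b1*X1 + b2*X2 + b3*(X1*X2) + e with statsmodels (synthetic data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({'X1': rng.normal(size=200), 'X2': rng.normal(size=200)})
# Simulate an outcome where the effect of X1 depends on X2 (true beta3 = 0.5)
df['Y'] = 1 + 2*df['X1'] + 3*df['X2'] + 0.5*df['X1']*df['X2'] + rng.normal(size=200)

# In the formula, 'X1 * X2' expands to X1 + X2 + X1:X2 (the interaction term)
model = smf.ols('Y ~ X1 * X2', data=df).fit()
print(model.summary())   # check the coefficient and p-value of the X1:X2 term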

How to visualize the relationship between two continuous variables in Python

A scatter plot is the chart used when you want to visualize the relationship between two continuous variables in data. It is typically used in supervised ML (regression), where the target variable is a continuous variable. So, if you want to check which continuous predictor has a clear relationship with the target variable, you look at the scatter plots.

Consider the scenario below. Here the target variable is "Weight" and we are trying to predict it based on the number of hours a person works out at the gym and the number of calories they consume in a day.

If you plot a scatter chart between weight and calories, you can see an increasing trend. We can easily deduce from this graph that if calorie intake increases, then weight also increases. This is known as a positive correlation. We can see a "clear trend", hence there is a relationship between weight and calories. In other words, the predictor variable calories can be used to predict weight.

Similarly, you can see there is a clear decreasing trend between weight and hours: if the number of hours at the gym increases, the weight decreases. This is known as a negative correlation. Again, there is a "clear trend", hence there is a relationship between weight and hours. In other words, hours can be used to predict weight.
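
A small sketch of these two scatter plots using matplotlib is shown below; the Weight, Calories, and Hours values are made-up numbers used only to show the increasing and decreasing trends.

# Scatter plots of Weight vs Calories (positive trend) and Weight vs Hours
# (negative trend), using made-up illustrative data
import pandas as pd
import matplotlib.pyplot as plt

GymData = pd.DataFrame({
    'Hours':    [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
    'Calories': [3200, 3000, 2800, 2600, 2400, 2200, 2000, 1800],
    'Weight':   [90, 86, 83, 78, 75, 70, 66, 62]
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Positive correlation: weight increases as calorie intake increases
axes[0].scatter(GymData['Calories'], GymData['Weight'])
axes[0].set_xlabel('Calories per day')
axes[0].set_ylabel('Weight')

# Negative correlation: weight decreases as gym hours increase
axes[1].scatter(GymData['Hours'], GymData['Weight'])
axes[1].set_xlabel('Hours at the gym')
axes[1].set_ylabel('Weight')

plt.show()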

What is Hypothesis testing and P-Value?

Hypothesis means assumption.

Hypothesis testing checks whether our assumption is correct based on the given data.

Take the example of a tire factory. The radius of an ideal tire must be 16 inches; however, a deviation of up to 8% is accepted. In this scenario, we can apply hypothesis testing.

  1. Define the Null Hypothesis (H0): The radius of the tire= 16 Inch
  2. Define the alternate Hypothesis(Ha): The radius of the tire != 16 Inch
  3. Define the error tolerance limit: 8%
  4. Conduct the test
  5. Look at the P-value generated by the test: P-value= 0.79
  6. If P-value > 0.05, accept the null hypothesis, otherwise reject it. Here P-value = 0.79 > 0.05, so we accept H0: the tires produced are of good quality

The P-value is the probability of observing data at least as extreme as our sample, assuming H0 is true.

The higher the P-value, the more consistent the data is with our assumption (H0). The textbook threshold to reject a null hypothesis is 5%. So, if the P-value is less than 0.05, the data provides strong evidence against the null hypothesis and it is rejected. Otherwise, if the P-value is more than 0.05, the null hypothesis is accepted.
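
As a hedged illustration of the tire example, the sketch below runs a one-sample t-test with SciPy; the measured radii are made-up numbers, and the test checks the mean against 16 inches rather than the 8% tolerance directly.

# One-sample t-test: is the mean tire radius equal to 16 inches? (made-up data)
from scipy.stats import ttest_1samp

measured_radii = [15.9, 16.1, 16.0, 15.8, 16.2, 16.05, 15.95, 16.1]

# H0: mean radius = 16 inches; Ha: mean radius != 16 inches
t_stat, p_value = ttest_1samp(measured_radii, popmean=16)
print('P-value:', p_value)

if p_value > 0.05:
    print('Accept H0: the tires are consistent with the 16 inch specification')
else:
    print('Reject H0: the mean radius differs from 16 inches')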

Class 1 : 09/11/2023 – Comments/ Findings

The first assignment is about the Centers for Disease Control and Prevention (CDC) 2018 data. It has three datasets, i.e. Diabetes, Obesity, and Inactivity percentages for counties in the US.

The Diabetes, Obesity, and Inactivity datasets have 3142, 363, and 1370 records respectively.

I see that the datasets are connected by a common column, FIPS. I joined the three datasets using the FIPS column and got 354 records in common.

This gives a complete dataset that contains all three values (Diabetes, Obesity, and Inactivity) for each of these counties.

I did EDA using the steps below for all 3 columns:
Steps:
a) plotted a histogram for each column
b) calculated the mean, median, standard deviation, skewness, and kurtosis
c) created a probability plot

Observations:

1) Obesity: the histogram looks slightly skewed towards the left; this is confirmed by the kurtosis value of 4.1 and skewness of 0.6, and the probability plot also confirms it.
2) Inactivity: the histogram looks slightly skewed towards the right; this is confirmed by the kurtosis value of 2.4 and skewness of -0.34, and the probability plot also confirms it.
3) Diabetic: the histogram looks highly skewed towards the right; this is confirmed by the kurtosis value of 15.32 and skewness of -2.68, and the probability plot also confirms it.

I made a scatter plot of the "Inactive" and "Obesity" columns.

I split the dataset into 70% and 30% for training and testing respectively.
Then I created a linear regression model using the Diabetic column as the target variable and the Inactive column as the predictor variable, and got an R2 value of 0.20.
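
Since the notebook could not be attached, here is a rough sketch (not the original code) of the 70/30 split and the simple linear regression described above; the file name cdc_merged_2018.csv and the column names Inactive and Diabetic are placeholders for the merged data.

# Rough sketch of the 70/30 split and simple linear regression
# (placeholder file and column names, not the original notebook)
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

cdc_data = pd.read_csv('cdc_merged_2018.csv')   # placeholder for the merged data

X = cdc_data[['Inactive']]    # predictor: inactivity percentage
y = cdc_data['Diabetic']      # target: diabetes percentage

# 70% training data, 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)
print('R2 on the test data:', r2_score(y_test, model.predict(X_test)))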

In future classes, I will be exploring the Breusch-Pagan test and heteroscedasticity.

PS: I am not able to attach the .ipynb file of the code in this blog.