Bootstrapping in Python

Before you put an ML model into production, it must be tested for accuracy. This is why we split the available data into a training set and a testing set, typically 70% for training and the remaining 30% for testing the model.
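As a quick illustration, here is a minimal sketch of such a 70/30 split using scikit-learn's train_test_split(). The X and y arrays here are dummy data, used purely for demonstration:

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 10 rows with 2 predictor columns and a matching target (illustrative only)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.3 keeps 70% of the rows for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))   # 7 3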

This activity of splitting the data randomly is called sampling. When we measure the accuracy of a machine learning model, there is a chance that the sample was simply favorable: the accuracy may come out high because the split happened to place rows in the testing data that are very similar to rows in the training data, so the model appears to perform better than it really would.

To rule out this factor, we perform sampling multiple times by changing the seed value in the train_test_split() function. This is called Bootstrapping: simply put, splitting the data into training and testing randomly multiple times.
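For example, a small sketch (using a dummy array of row indices, purely illustrative) shows that changing random_state changes which rows land in the test set on each trial:

import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10)          # pretend these are the row indices of our dataset
for seed in range(3):
    # A different seed gives a different random train/test split of the same data
    train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=seed)
    print('seed', seed, '-> test rows:', test_rows)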

We train at least five times so that we can be sure the testing accuracy we get was not just by chance, and that it is similar across all the different samples.

The final accuracy is the average of the accuracies from all sampling iterations.

Bootstrapping Code

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree

ColumnNames=['Hours', 'Calories', 'Weight']
DataValues=[[  1.0,   2500,   95],
             [  2.0,   2000,   85],
             [  2.5,   1900,   83],
             [  3.0,   1850,   81],
             [  3.5,   1600,   80],
             [  4.0,   1500,   78],
             [  5.0,   1500,   77],
             [  5.5,   1600,   80],
             [  6.0,   1700,   75],
             [  6.5,   1500,   70]]
#Create the Data Frame
GymData=pd.DataFrame(data=DataValues,columns=ColumnNames)
GymData.head()
#Separate Target Variable and Predictor Variables
TargetVariable='Weight'
Predictors=['Hours', 'Calories']
X=GymData[Predictors].values
y=GymData[TargetVariable].values
#### Bootstrapping ####
########################################################
# Creating empty list to hold accuracy values
AccuracyValues=[]
n_times=5
## Performing bootstrapping
for i in range(n_times):
    # Split the data into training and testing set
    # Changing the seed value for each iteration so every trial uses a different random split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42+i)
    ########################################################
    ###### Single Decision Tree Regression in Python #######
    # Choose from different tunable hyperparameters
    # Note: in recent scikit-learn versions the 'mse' criterion was renamed to 'squared_error'
    RegModel = tree.DecisionTreeRegressor(max_depth=3, criterion='squared_error')
    #Creating the model on Training Data
    DTree=RegModel.fit(X_train,y_train)
    prediction=DTree.predict(X_test)
    # Measuring accuracy on Testing Data as 100 - MAPE (Mean Absolute Percentage Error)
    Accuracy = 100 - (np.mean(np.abs((y_test - prediction) / y_test)) * 100)
    
    # Storing accuracy values
    AccuracyValues.append(np.round(Accuracy))
    
################################################
# Result of all bootstrapping trials
print(AccuracyValues)
# Final accuracy
print('Final average accuracy', np.mean(AccuracyValues))
Output: 
[94.0, 95.0, 97.0, 98.0, 93.0]
Final average accuracy 95.4

 
