Artificial Neural Networks

Below is a diagram illustrating the architecture of the Deep Learning ANN used in this case study.

I’m using two hidden layers with five neurons each and one output layer with one neuron. Can you change these numbers? Yes, you can modify both the number of hidden layers and the number of neurons in each layer.

Lastly, select the combination that yields the highest level of accuracy. This is how the ANN model is tuned.

In the code snippet we use the Sequential module from the Keras library to construct a stack of ANN layers, one after another. Each layer is defined using the Dense module of Keras, where we specify aspects such as the number of neurons in the layer, the weight-initialization technique used in the network, and the activation function applied to each neuron in that layer.

Understanding the hyperparameters in the code snippets below

  • units=5: This means we are creating a layer with five neurons in it. Each of these five neurons receives all the input values; for example, the values of ‘Age’ are passed to all five neurons, and the same happens for every other column.
  • input_dim=7: This means the first layer expects seven predictors in the input data. Notice that the second Dense layer does not specify this value, because the Sequential model passes the output shape of each layer on to the next one.
  • kernel_initializer=’normal’: Before the neurons start their computation, some algorithm has to decide the initial value of each weight; this parameter specifies how. You can choose different initializers such as ‘normal’ or ‘glorot_uniform’.
  • activation=’relu’: This specifies the activation function for the calculations inside each neuron. You can choose values like ‘relu’, ‘tanh’, ‘sigmoid’, etc.
  • batch_size=20: This specifies how many rows are passed to the network in one go, after which the error (SSE) is calculated and the network adjusts its weights based on the errors.
    When all the rows have been passed in batches of 20 rows each, as specified by this parameter, we call that one epoch, i.e., one full pass over the data. This is also known as mini-batch gradient descent. A small batch_size makes the ANN look at the data slowly, say 2 or 4 rows at a time, which could lead to overfitting; a large value like 20 or 50 rows at a time makes the ANN look at the data faster, which could lead to underfitting. Hence a proper value must be chosen using hyperparameter tuning.
  • epochs=50: The same activity of adjusting weights continues 50 times, as specified by this parameter. In simple terms, the ANN looks at the full training data 50 times and adjusts its weights.
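
To tie these hyperparameters together, here is a minimal sketch of the layer stack described above, using the Keras Sequential and Dense APIs. The loss, optimizer, and the assumption of a regression-style target (a linear output neuron) are mine, not from the case study, and X_train/y_train are placeholders.

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# First hidden layer: 5 neurons, expecting 7 predictor columns as input
model.add(Dense(units=5, input_dim=7, kernel_initializer='normal', activation='relu'))
# Second hidden layer: 5 neurons (input size is inferred from the previous layer)
model.add(Dense(units=5, kernel_initializer='normal', activation='relu'))
# Output layer: 1 neuron (assumed regression target, hence no activation)
model.add(Dense(units=1, kernel_initializer='normal'))

# Assumed loss and optimizer; choose ones that match your target variable
model.compile(loss='mean_squared_error', optimizer='adam')

# X_train and y_train are placeholders for the prepared training data
# model.fit(X_train, y_train, batch_size=20, epochs=50, verbose=1)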


Random Forest Algorithm

A Random Forest is essentially many decision trees put together, each trained on a random sample of the data; this ensemble technique is known as bagging.

To create a Random Forest predictive model, the below steps are followed.

1) Take some random rows from the data, say 80% of all the rows selected randomly, so that every tree gets a different set of rows.
2) Take some random columns from the above data, say 50% of all the columns selected randomly, so that every tree gets a different set of columns.
3) Create a decision tree using this sampled data.
4) Repeat steps 1 to 3 n times, where n is the number of trees. n could be any number like 10, 50, or 500. (This is known as bagging.)
5) Combine the predictions from each of the trees to get a final answer.
In the case of Regression, the final answer is the average of predictions made by all trees.

In the case of Classification, the final answer is the mode(majority vote) of predictions made by all trees.

These steps ensure that every predictor gets a fair chance in the process, because we limit the columns used for each tree. The model is also less prone to bias toward any particular subset of the data, because each tree is trained on a different random set of rows.
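
As a rough illustration, here is how these steps could map onto scikit-learn's Random Forest (the library choice and the X_train/y_train data are my assumptions; note that scikit-learn samples columns per split rather than per tree, a minor variation on step 2).

from sklearn.ensemble import RandomForestClassifier

# n_estimators = number of trees (step 4)
# max_samples  = fraction of rows sampled for each tree (step 1, requires bootstrap=True)
# max_features = fraction of columns considered at each split (variation of step 2)
RF = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_samples=0.8, max_features=0.5, random_state=0)
# RF.fit(X_train, y_train)
# RF.predict(X_test)   # classification: majority vote across all the trees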


KNN algorithm

KNN stands for K Nearest Neighbors. As its name implies, this method classifies a new point by looking at the K points closest to it.

For K, the practical range is 2 to 10. If K=3, KNN will attempt to find the three closest points.
For every new point, KNN performs three straightforward steps (see the code sketch after this list).

  1. Find the K most similar (closest) points.
  2. Count how many of those K points belong to each class.
  3. Assign the new point to the class that appears most frequently among those K points.

For regression, the prediction is instead the mean of the target values of the K closest points.
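
Here is a minimal sketch of these steps using scikit-learn (the library choice is my assumption; the post does not name one), with K=3 and placeholder training data:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# K=3: classify a new point by majority vote among its 3 nearest training points
knn_clf = KNeighborsClassifier(n_neighbors=3)
# knn_clf.fit(X_train, y_train)
# knn_clf.predict(X_new)

# For regression, the prediction is the mean of the targets of the 3 nearest points
knn_reg = KNeighborsRegressor(n_neighbors=3)
# knn_reg.fit(X_train, y_train)
# knn_reg.predict(X_new)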

Distance between two points can be calculated using any one of the below methods.

  1. Euclidean Distance: The square root of the sum of squared differences between the coordinates of the two points.
  2. Manhattan Distance: The sum of absolute differences between the coordinates of the two points.
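
For example, with numpy (the coordinate values below are purely illustrative):

import numpy as np

p1 = np.array([1.0, 2.0, 3.0])
p2 = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((p1 - p2) ** 2))   # sqrt(9 + 16 + 0) = 5.0
manhattan = np.sum(np.abs(p1 - p2))           # 3 + 4 + 0 = 7.0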

How to find the best hyperparameters using GridSearchCV in Python

Hyperparameter tuning is one of the crucial phases of machine learning, because ML algorithms do not always give the best accuracy with their default settings. To reach the highest level of accuracy, you must tune their hyperparameters.

I’ll talk about GridSearchCV in this post. CV is short for Cross-Validation. GridSearchCV tests every possible combination of the parameter values you provide and selects the best one.

Take the example below. If you provide a list of values for three hyperparameters, GridSearchCV will try every combination; the lists below give 5 × 2 × 2 = 20 combinations. Adding one more hyperparameter to test multiplies the number of combinations again, so the time required grows very quickly. Be careful to select only the most important parameters to tune.

# Parameters to try
Parameter_Trials={'n_estimators':[100,200,300,500,1000],
                  'criterion':['gini','entropy'],
                  'max_depth': [2,3]}

The GridSearchCV function runs all of the possible parameter combinations in the example below; there are 20 combinations in this case.

GridSearchCV additionally conducts cross-validation for each combination. Using the ‘cv’ argument, you can define the number of cross-validation folds.

cv=5 means that the data will be split into five equal parts; in each round, four parts are used for training and the remaining one for testing. This is also known as K-fold cross-validation of the model, with K=5. The test fold changes each time, so the process is repeated five times, and the final accuracy is the average of those five runs.

For cross-validation, any number between 5 and 10 is suitable. Keep in mind that the larger the value, the longer the computation will take.
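
Here is a minimal sketch of running these trials with scikit-learn's GridSearchCV. The choice of RandomForestClassifier as the estimator and the X, y data are my assumptions for illustration; the parameter grid is the one defined above.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Parameter_Trials is the grid defined above (5 x 2 x 2 = 20 combinations)
Grid_Search = GridSearchCV(estimator=RandomForestClassifier(),
                           param_grid=Parameter_Trials,
                           scoring='accuracy', cv=5)
# Grid_Search.fit(X, y)              # X, y are placeholders for the training data
# print(Grid_Search.best_params_)    # the best combination found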

What are the AUC and ROC Curve?


Receiver Operating Characteristic:

The curve of True Positive Rate (TPR) on the Y-axis against False Positive Rate (FPR) on the X-axis is known as the ROC curve. The plot is generated by computing the (FPR, TPR) pair at many different classification thresholds applied to the model’s predicted probabilities.

Area Under the Curve (AUC)

AUC is the amount of area covered under the ROC curve, and it helps summarize the performance of the model. Perfect classification has an AUC of 1, and a good range in practice is roughly 0.6 to 0.9; the higher the AUC, the better. If the AUC is 0.5 or less, the model is not able to discriminate between the classes.
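
A minimal sketch with scikit-learn (the library choice is my assumption); the labels and predicted probabilities below are illustrative placeholders:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # illustrative true labels
y_prob = [0.2, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.65]  # illustrative predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)    # (FPR, TPR) pairs, one per threshold
print(roc_auc_score(y_true, y_prob))                # area under the ROC curve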

What is Correlation?

Correlations are mathematical relationships between variables. You can identify correlations on a scatter diagram by the distinct patterns they form. The correlation is said to be linear if the scatter diagram shows the points lying in an approximately straight line. Let’s take a look at a few common types of correlation between two variables:

Positive linear correlation (r=0 to 1)

Positive linear correlation is when low values on the x-axis correspond to low values on the y-axis, and higher values of x correspond to higher values of y. In other words, y tends to increase as x increases.

Negative linear correlation (r= -1 to 0)

Negative linear correlation is when low values on the x-axis correspond to high values on the y-axis, and higher values of x correspond to lower values of y. In other words, y tends to decrease as x increases.

No correlation (r=0)

If the values of x and y form a random pattern, then we say there’s no correlation.
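
As a quick illustration of these three cases, the correlation coefficient r can be computed with numpy (the data below is made up for illustration):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = np.array([2, 4, 5, 8, 10])   # y increases with x  -> r close to +1
y_neg = np.array([10, 8, 6, 3, 1])   # y decreases with x  -> r close to -1
y_rand = np.array([5, 1, 9, 2, 6])   # no clear pattern    -> r close to 0

print(np.corrcoef(x, y_pos)[0, 1])
print(np.corrcoef(x, y_neg)[0, 1])
print(np.corrcoef(x, y_rand)[0, 1])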

Feature Engineering

There are a few types of columns/variables that are simply useless for a machine learning algorithm because they do not hold any patterns with respect to the target variable.

For example, an ID column. It is useless because every row has a unique ID, and hence it does not have any patterns in it. It would be wrong to say that if the ID is 1002 then the sales will be 20,000 units. However, if you pass this variable as a predictor to the machine learning algorithm, it will try to use the ID values to predict sales and come up with some garbage logic!
The rule of Garbage in–> Garbage out applies!

A few more types of columns are also useless, such as row IDs, phone numbers, complete addresses, comments, descriptions, raw dates, etc. These are not useful because each row contains a unique value, and hence there will not be any patterns.

Some of these can be used indirectly by deriving certain parts from them. For example, dates cannot be used directly, but the derived month, quarter, or week can be used because these may hold some patterns with respect to the target variable. This process of creating new columns from existing columns is known as Feature Engineering.
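
For example, with pandas (the ‘OrderDate’ column name and the dates below are hypothetical):

import pandas as pd

df = pd.DataFrame({'OrderDate': pd.to_datetime(['2021-01-15', '2021-04-03', '2021-07-21'])})
df['Month'] = df['OrderDate'].dt.month              # 1-12
df['Quarter'] = df['OrderDate'].dt.quarter          # 1-4
df['Week'] = df['OrderDate'].dt.isocalendar().week  # ISO week number
print(df)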


The thumb rule is: remove those columns which do not exhibit any patterns with respect to the target variable. Do this only when you are absolutely sure; otherwise, give the column a chance.

What are the different types of machine learning?

Machine learning is a branch of Artificial Intelligence and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

There are three major types of machine learning:

  1. Supervised ML: Teach the algorithm by examples. If the input is X then the output should be y (Target Variable). Some popular algorithms for this are linear regression, logistic regression, decision trees, random forests, SVM, naive Bayes, XGboost, AdaBoost, etc.
  2. Unsupervised ML: No target variable is present. Basically, you DON’T have data which says that for input X the output is y. In this scenario, important patterns can be derived directly from the data, like grouping similar rows together (clustering), without any prior knowledge of the data. Some popular algorithms are K-Means, DBSCAN, PCA, ICA, and Apriori.
  3. Reinforcement ML: The machine learning algorithm learns from its mistakes and improves in the next iteration in order to achieve an objective. Some popular algorithms are Monte Carlo, Q-Learning, and SARSA.

Confidence Interval

A confidence interval is the range of values that is likely to contain the population mean, based on a chosen error threshold (the alpha value).

Let’s assume the population mean is 16, since we have assumed that all good tyres are produced with a radius of 16 inches.

If we take a sample of 50 tyres, then we will have values like 16.2, 16.3, 15.98, 15.96, 15.99, 16.23… and so on.

For the sake of understanding, let’s say the mean radius of those 50 tyres came out to be 16.15. This is called the sample mean.

Now, based on this sample, we can calculate a range: the minimum and maximum values between which the mean of the population can be expected to lie. The mean of the population is the mean radius of all the tyres.

So basically, we are trying to estimate what the mean of all the tyres looks like, based on the given sample. Instead of giving a single-value answer, we provide a range of values. This range is known as the Confidence Interval.

The confidence interval is affected by the alpha value. For every alpha value, we find the value of the statistic which gets multiplied with the standard error.

Confidence Interval = [ Mean(Sample) – N*(SE), Mean(Sample) + N*(SE) ]

  • SE=Standard Error=Standard Deviation of sample/sqrt(number of samples)
  • N= Value of the statistic: the Z-statistic if the population follows a Normal distribution, or the t-statistic if it follows a t-distribution, for the given alpha value (the probability of the error margin)

For example, let us choose an alpha value of 5%. Hence, we are 95% confident that the mean value of the population will fall within the confidence interval we find. Assuming a normal distribution, the value of N is 1.96 for alpha=5%. Similarly, the value of N is about 2.58 for alpha=1%, and so on. These “N” values come from the Z-values of the ideal bell-curve (normal) distribution.

Hence, to calculate a confidence interval for the population mean, we need a sample of values: we calculate its mean and standard deviation, and we find the N-value based on the alpha level.

For the sake of explanation, assume below values were found for a sample of 50 tyres.

  • Sample Mean of radius=16.15
  • Standard deviation of 50 radius values=0.64
  • n=50
  • N=1.96 for alpha=5%

For the above values, the confidence interval will be calculated as [ 16.15 – 1.96*(0.64/sqrt(50)) , 16.15 + 1.96*(0.64/sqrt(50)) ].

Which comes out as [15.97, 16.33].

Hence, based on the given sample of 50 tyres, we are 95% confident that the mean radius of all the tyres (the population) will be somewhere between 15.97 and 16.33.
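
The same calculation in Python, as a small sketch using the sample statistics quoted above:

import numpy as np

sample_mean = 16.15   # mean radius of the 50 sampled tyres
sample_std = 0.64     # standard deviation of the 50 radius values
n = 50                # sample size
N = 1.96              # Z-value for alpha = 5%

SE = sample_std / np.sqrt(n)                       # standard error
CI = (sample_mean - N * SE, sample_mean + N * SE)
print(np.round(CI, 2))                             # approximately [15.97 16.33]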

Bootstrapping in Python

Before you put the ML model into production, it must be tested for accuracy. This is why we split the available data into training and testing. Typically 70% for training and the remaining 30% for testing the model.

This activity of splitting the data randomly is called sampling. When we measure the accuracy of a machine learning model, there is a chance that the sample was simply a lucky one: the accuracy may come out high because the data was split in such a way that the testing data contains rows very similar to the training data, and hence the model appears to perform better than it really does!

To rule out this factor, we perform the sampling multiple times by changing the seed value in the train_test_split() function. This is called Bootstrapping: simply put, splitting the data into training and testing randomly, multiple times.

We train at least 5 times so that we can be sure the testing accuracy we get was not obtained just by chance and is similar across all the different samples.

The final accuracy is the average of the accuracies from all sampling iterations.

Bootstrapping Code

import numpy as np
import pandas as pd
ColumnNames=['Hours','Calories','Weight']
DataValues=[[  1.0,   2500,   95],
             [  2.0,   2000,   85],
             [  2.5,   1900,   83],
             [  3.0,   1850,   81],
             [  3.5,   1600,   80],
             [  4.0,   1500,   78],
             [  5.0,   1500,   77],
             [  5.5,   1600,   80],
             [  6.0,   1700,   75],
             [  6.5,   1500,   70]]
#Create the Data Frame
GymData=pd.DataFrame(data=DataValues,columns=ColumnNames)
GymData.head()
#Separate Target Variable and Predictor Variables
TargetVariable='Weight'
Predictors=['Hours','Calories']
X=GymData[Predictors].values
y=GymData[TargetVariable].values
#### Bootstrapping ####
########################################################
# Creating empty list to hold accuracy values
AccuracyValues=[]
n_times=5
## Performing bootstrapping
for i in range(n_times):
    #Split the data into training and testing set
    from sklearn.model_selection import train_test_split
    # Changing the seed value for each iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42+i)
    ########################################################
    ###### Single Decision Tree Regression in Python #######
    from sklearn import tree
    #choose from different tunable hyper parameters
    RegModel = tree.DecisionTreeRegressor(max_depth=3, criterion='squared_error')  # 'squared_error' replaces the older 'mse'
    #Creating the model on Training Data
    DTree=RegModel.fit(X_train,y_train)
    prediction=DTree.predict(X_test)
    #Measuring accuracy on Testing Data
    Accuracy = 100 - (np.mean(np.abs((y_test - prediction) / y_test)) * 100)  # 100 minus MAPE
    
    # Storing accuracy values
    AccuracyValues.append(np.round(Accuracy))
    
################################################
# Result of all bootstrapping trials
print(AccuracyValues)
# Final accuracy
print('Final average accuracy',np.mean(AccuracyValues))
Output: 
[94.0, 95.0, 97.0, 98.0, 93.0]
Final average accuracy 95.4