Time Series Model

Time series models are statistical models that examine and predict data points collected over time. These models are very helpful for understanding and forecasting trends, patterns, and behaviors in sequential data. The basic premise in time series analysis is that the observations are time dependent, which means that the order of the data points is important. Time series models help capture and interpret the data's temporal patterns, providing insights into past trends and future projections.

Time series models are typically categorized into two types: univariate models and multivariate models. Univariate time series models examine a single variable over time, whereas multivariate models examine the interdependencies of numerous variables. Autoregressive Integrated Moving Average (ARIMA) models, which capture autoregressive and moving average components, and Exponential Smoothing State Space Models (ETS), which handle trend and seasonality, are examples of common univariate models. Multivariate models, such as Vector Autoregression (VAR) and Structural Time Series Models, broaden the study to include several interacting variables, allowing for a more complete understanding of complex systems. The model chosen is determined by the nature of the data and the patterns seen, and the success of these models is dependent on their proper selection and fine-tuning.
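
As an illustration, the sketch below fits a univariate ARIMA model to a toy monthly series and forecasts the next few points. It assumes the statsmodels library is available, and the order (1, 1, 1) is an arbitrary choice for demonstration rather than a recommendation.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy monthly series: an upward trend plus noise
np.random.seed(0)
index = pd.date_range("2020-01-01", periods=36, freq="MS")
series = pd.Series(np.linspace(100, 170, 36) + np.random.normal(0, 3, 36), index=index)

# ARIMA(1, 1, 1): one autoregressive term, one difference, one moving-average term
fitted = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast the next 6 periods
print(fitted.forecast(steps=6))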

Computer vision

Computer vision gives machines the capacity to analyze and comprehend visual data from their environment. It entails creating methods and techniques that let computers interpret images or video data at a high level. The ultimate objective is to emulate human vision, enabling machines to identify patterns, objects, and scenes and to make informed decisions based on visual information.

Image recognition is a core problem in computer vision, where algorithms are trained to recognize and categorize objects in images. This entails using massive datasets for model training in order to identify characteristics and patterns connected to particular objects. Another important component is object detection, which aims not only to identify objects but also to locate and delineate them within an image. Applications of computer vision can be found in many different fields, such as surveillance systems, driverless cars, facial recognition, and medical image analysis.

Deep learning, and convolutional neural networks (CNNs) in particular, has greatly enhanced computer vision capabilities. The ability of CNNs to automatically learn hierarchical representations of visual features makes image recognition more precise and efficient. As it continues to develop, computer vision has the potential to transform a number of sectors, improve human-computer interaction, and help build intelligent systems that can perceive and interact with their environment.

GridSearchCV

In machine learning, Grid Search Cross-Validation (GridSearchCV) is a potent method for optimizing a model's hyperparameters. Hyperparameters are variables that have a big impact on a model's performance but are not learned during training. By exhaustively going over a preset set of hyperparameter values, GridSearchCV generates a grid with every possible combination. For every combination, cross-validation is carried out to evaluate the model's performance and determine the ideal set of hyperparameters.

Defining a grid of hyperparameter values to investigate is part of the procedure. For instance, a grid for parameters like kernel type and regularization parameter C might be defined in a support vector machine (SVM). GridSearchCV then uses cross-validation to methodically assess the model’s performance with each set of hyperparameters. In addition to reducing the chance of overfitting, cross-validation yields a more accurate prediction of the model’s ability to generalize to new data.
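
As a brief illustration of the SVM example just described, the sketch below runs GridSearchCV over the regularization parameter C and the kernel type for an SVC classifier. It assumes scikit-learn and its built-in iris dataset; the specific grid values are placeholders.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid of candidate hyperparameter values (placeholder choices)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 5-fold cross-validation is run for every combination in the grid
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best cross-validated score:", grid.best_score_)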

Although GridSearchCV is excellent at determining the optimal hyperparameter values, it can be computationally expensive, especially when dealing with large datasets or complex models. More sophisticated methods have been developed to overcome this, such as RandomizedSearchCV, which randomly samples a predetermined number of hyperparameter combinations. Even with its computational expense, GridSearchCV remains a popular method for improving model performance and attaining better results in a variety of machine learning applications.

Hyperparameter tuning

Hyperparameter tuning, the process of optimizing parameters that are not learned during the training phase but have a substantial impact on the model's performance, is an essential stage in training machine learning models. These variables, known as hyperparameters, affect the model's complexity and behavior. Learning rates, regularization strengths, and the number of hidden layers in a neural network are a few examples. Selecting the ideal set of hyperparameters can improve a model's accuracy and its ability to generalize to fresh data.

Two methods that are frequently used for hyperparameter tuning are grid search and random search. Grid search involves methodically testing a predetermined set of hyperparameter values to find the combination that performs best. In contrast, random search enables a more efficient exploration of the hyperparameter space by randomly selecting hyperparameter values from predetermined ranges (see the sketch below). Finding the hyperparameter settings that produce the best model performance, often measured by metrics like accuracy, precision, or F1 score, is the goal of both strategies.
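
For comparison, here is a minimal random-search sketch, assuming scikit-learn; the sampled range for C is illustrative only.

from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample the regularization strength C from a continuous range instead of a fixed grid
param_distributions = {"C": uniform(0.01, 10)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=20,          # number of random combinations to try
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)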

Automated tools and frameworks such as scikit-learn in Python have made hyperparameter tuning much more accessible to practitioners. Hyperparameter tuning is crucial, but it must be done carefully, since poor choices can result in underfitting, overfitting, or higher computing costs. Effective hyperparameter tuning remains essential for creating reliable, high-performing models as machine learning develops.

 

Support Vector Regression (SVR)

SVR is a machine learning technique that is generally utilized for continuous or numerical prediction applications. While standard regression models seek to minimize the differences between predicted and actual values, SVR takes a different approach, focusing on fitting a "tube" or a hyperplane around the data points. The goal is to fit as many data points as possible inside this tube while keeping deviations to a minimum. SVR is very beneficial when dealing with non-linear relationships in data since it uses kernel functions to translate the input features into a higher-dimensional space, allowing complex patterns to be discovered.

One of SVR’s primary advantages is its ability to successfully handle outliers. The algorithm emphasizes points within the tube while ignoring those outside of it. This makes SVR resistant to noise in the data and ensures that extreme values have less influence on the model. SVR has applications in a variety of fields, including finance, biology, and environmental research, where precise prediction of continuous variables is critical. Support Vector Regression is a key tool in the data scientist’s toolset for regression problems due to its flexibility, particularly in detecting subtle patterns and handling outliers.
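
A minimal sketch of SVR with an RBF kernel on synthetic data is shown below, assuming scikit-learn; the epsilon and C values are arbitrary illustrations of the tube width and the regularization strength.

import numpy as np
from sklearn.svm import SVR

# Synthetic non-linear data: y = sin(x) plus noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# epsilon controls the width of the tube; C penalizes points that fall outside it
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)

print(model.predict([[1.5], [3.0]]))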

Support Vector Machines (SVM)

SVM is a strong supervised machine learning method that is used for classification and regression tasks. SVM’s basic idea is to find a hyperplane in a high-dimensional space that best separates data points of distinct classes. The “support vectors” are the data points nearest to the decision boundary or hyperplane. The margin, which is the distance between the support vectors and the decision boundary, is maximized by SVM. A wider margin indicates better generalization to previously unseen data and increased resistance to noise in the training data.

SVM’s capacity to handle non-linear correlations in data using kernel functions is one of its primary strengths. Kernel functions convert the input features into a higher-dimensional space, allowing a hyperplane to be found that efficiently separates the data in this transformed space. As a result, SVM can capture complicated decision boundaries and achieve high accuracy in a wide range of scenarios. Furthermore, SVM is less prone to overfitting than other algorithms since the margin maximization encourages a more generalizable model.

While SVMs thrive in many applications, they may struggle with very large datasets or many classes. SVM training on a large dataset can be computationally expensive, and the method may suffer if the number of features exceeds the number of samples. Despite these limitations, SVM continues to be a popular choice in a variety of disciplines, including image classification, text categorization, and bioinformatics, due to its versatility and effectiveness in dealing with diverse and complicated datasets.

BERT

For the third project, we are taking a dataset of restaurant comments, analyzing the sentiment of those comments, and we are thinking of using a BERT model.

BERT (Bidirectional Encoder Representations from Transformers) is a powerful natural language processing (NLP) approach. Built on the Transformer architecture, it represents a breakthrough in language understanding.

A key strength of BERT is its bidirectional context awareness, which enables it to take into account both left and right context words at the same time. In contrast to earlier NLP models that processed text in a unidirectional manner, BERT pre-trains on vast volumes of textual data by predicting missing words in sentences. This allows it to acquire deeply contextualized representations of words. Through this pre-training procedure, BERT gains a thorough grasp of linguistic subtleties and semantics.

In a variety of NLP tasks, such as named entity recognition, sentiment analysis, and question answering, BERT has shown outstanding performance. Its capacity to capture context-rich embeddings makes it a foundational model in the field, and its pre-trained representations can be fine-tuned with comparatively little task-specific data for particular downstream tasks. BERT has had a significant influence on the field of NLP, inspiring the creation of many cutting-edge models that draw from its design and guiding ideas.
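
For the restaurant-review sentiment idea above, a minimal sketch using the Hugging Face transformers pipeline is shown below. It assumes the transformers library is installed and downloads a default pre-trained sentiment model, so the exact model and labels may differ in practice.

from transformers import pipeline

# Loads a default pre-trained sentiment-analysis model (a fine-tuned BERT-family model)
classifier = pipeline("sentiment-analysis")

reviews = [
    "The food was amazing and the staff were friendly.",
    "We waited an hour and the pasta was cold.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(review, "->", result["label"], round(result["score"], 3))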

Confusion Matrix

As the name implies, a confusion matrix is a numerical matrix that indicates where a model gets confused. The confusion matrix is a structured method of mapping the predictions to the original classes to which the data belong. In other words, it is a class-wise distribution of the predictive performance of a classification model. This means that confusion matrices are only applicable in supervised learning frameworks, that is, when the true output labels are known.

Confusion Matrix for Binary classification:

A dataset with just two unique categories of data is called a binary class dataset. To keep things simple, we might refer to these two groups as the “positive” and the “negative.”

Assume that the dataset we use to assess a machine learning model has a binary class imbalance, with 60 samples in the test set’s positive class and 40 samples in its negative class.

Now, in order to completely comprehend the confusion matrix for this binary classification problem, we must first become familiar with the following terms:

  • True Positive (TP) refers to a sample belonging to the positive class being classified correctly.
  • True Negative (TN) refers to a sample belonging to the negative class being classified correctly.
  • False Positive (FP) refers to a sample belonging to the negative class but being classified wrongly as belonging to the positive class.
  • False Negative (FN) refers to a sample belonging to the positive class but being classified wrongly as belonging to the negative class.
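
A minimal sketch of computing these four quantities with scikit-learn follows; the label vectors are made up for illustration.

from sklearn.metrics import confusion_matrix

# 1 = positive class, 0 = negative class (toy labels for illustration)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]

# With labels=[0, 1], ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)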

SVM – Support Vector Machines

For problems involving regression and classification, Support Vector Machines (SVMs) represent a stable and adaptable class of supervised machine learning algorithms. An SVM is especially useful in situations where there are more features than samples, since its main objective is to locate a hyperplane in a high-dimensional space that maximizes the margin between the classes. The data points that are closest to the decision boundary and affect its position are known as support vectors, and they are essential to its success. SVMs are useful in a wide range of applications, including bioinformatics, image classification, and handwriting recognition. They perform well in high-dimensional spaces and remain resilient to outliers. The algorithm's usefulness in non-linear classification and regression problems is further enhanced by its capacity to handle complex relationships thanks to the kernel approach. SVMs can be an effective tool in your machine learning toolbox if you're working with data where a distinct margin of separation between classes is essential.

LSTM

Long Short-Term Memory (LSTM) networks are comparable to very intelligent instruments.

When it comes to jobs that need things to happen in a specific order, such as words in a phrase or stock values over time, they excel. LSTMs, in contrast to earlier techniques, feature an interesting design that aids in their long-term memory of critical information.

Imagine it as if you had a smart gate and a dedicated memory cell that decide what information you should remember and forget at each stage. Because of this, LSTMs are excellent at comprehending language, identifying speech, and forecasting future trends in fields like finance.

Put another way, LSTMs are constructed using a set of principles that enable them to gradually discover patterns in data. LSTMs are the clever technology that allows a computer to be taught to recognize a friend's voice or anticipate the words that will be said next in a sentence. They let computers comprehend and process sequential data in incredibly intelligent ways; they are the superheroes of computer programming.

Convolutional Neural Network(CNN)

A convolutional neural network (CNN) is a particular type of neural network used in machine learning for image classification. Face recognition is a prime illustration.
Assume you take 50 images of a person's face for each of ten different people. You therefore have 500 images overall, featuring 10 distinct people. If you provide a CNN with this data, it will be able to learn the facial features of each of these ten people. Subsequently, using that prior knowledge, it will identify the face in a fresh photo of any one of these ten people.

CNN makes an effort to replicate how people view images with their eyes. If you stop to think about it, you will see that even when seeing a whole picture, your attention can only be focused on one point at a time. After that, we move our focus to the other points, which is how we memorize the key features of a picture or a face and remember it.

CNN is a series of operations that first extracts the most significant features from a given image and then turns the entire image into a single row of numbers, which the fully connected ANN classifier is capable of learning.
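
A minimal sketch of this feature-extraction-then-classification idea is shown below, assuming Keras (TensorFlow) is available; the image size, layer sizes, and the 10-class output are illustrative, matching the ten-person example above.

from tensorflow import keras
from tensorflow.keras import layers

# Convolution + pooling layers extract features; Flatten turns the image into a single
# row of numbers; the Dense layers act as the fully connected ANN classifier.
model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),          # 64x64 RGB face images (illustrative size)
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # one output per person
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()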

What is Artificial Neural Network(ANN) in layman term?

When we combine multiple neurons together, it creates a vector of neurons called a layer.

When we combine multiple layers together, where all the neurons are interconnected, this network of neurons is called an Artificial Neural Network, abbreviated as ANN.

There are three major types of layers in an ANN, listed below

  1. The Input layer (interface to accept data): The data input is handled by a single input layer, which transmits it to the network. Keep in mind that this layer is simply the passed data vector. It is only an interface that receives data for the hidden layers, which contain the real neurons; it does not consist of an actual neuron layer.
  2. The Hidden layer(s) (Actual Neurons): An ANN may include one or more hidden layers. These are the real neurons that perform computations on the input data. There is no conclusive rule for how many hidden layers should be employed. It is quite difficult to determine the ideal number of hidden layers and the number of neurons in each layer; in practice, this is done by comparing the final accuracy for different configurations.
  3. The Output layer (Actual Neurons to output the result): There is only one output layer, and the number of neurons in it depends on the target variable.

 

Recurrent Neural Networks (RNN) Explanation Layman term

The human brain is capable of long-term and short-term memory retention. We can rewind time to recall the sequence of events that occurred and predict what will occur next by using the sequence of events that came before. The goal of Recurrent Neural Networks (RNN) is to imitate this function.

Think about the following situation: when reading a book, you make sense of a chapter's events by referring to those of earlier chapters. You almost turn back time in your mind to consult the earlier sequence of events that clarifies the current ones. Events from two or three chapters prior are mixed in with those from the previous chapter. Each of these occurrences is imprinted in memory with a temporal sense, indicating when it occurred, whether recently or long ago.

Because they are still “fresh” in your memory, recent events are easier for you to recall than ones that happened a long time ago. Therefore, the events of the past have an impact on your comprehension of the present situation or aid in your ability to “predict” what will happen next.

Text Preprocessing using NLP techniques

Text must be represented as numerical columns when building a classification model from free-text input, such as user reviews and comments. This method is called text vectorization: in other words, representing text using a series of numerical columns.

There are two main methods for doing this.

  1. Count Vectorization: This is a text preprocessing method in natural language processing (NLP) that creates a matrix of term-frequency counts from a collection of text documents. Each document in the corpus is represented as a row and each unique word as a column of the matrix. The matrix's cell values show the frequency with which each word occurs in a given document. Count vectorization is an easy way to format text data for a variety of NLP tasks, including text classification and clustering.

 

2. TF-IDF Vectorization: Term Frequency-Inverse Document Frequency is another text-preparation method frequently employed in NLP. It is a more sophisticated approach that considers a term's significance across the whole corpus in addition to how frequently it occurs in a document. Every word in a document is given a weight based on its term frequency (how often it appears in the document) and its inverse document frequency (which measures how uncommon it is across all the documents). This produces a matrix whose values denote the relative relevance of each term within each document, with each document represented as a row and each term as a column. A short sketch of both vectorizers follows below.
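
A minimal sketch of both vectorizers, assuming scikit-learn; the three example sentences are made up.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the food was great",
    "the service was slow",
    "great food great service",
]

# Term-frequency counts: one row per document, one column per unique word
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())

# TF-IDF weights: terms frequent in a document but rare across the corpus get higher weight
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.toarray().round(2))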

What are the types of sampling?

  1. Simple Random Sampling Without Replacement (SRSWOR): This is the most common type of sampling. The idea is that you cannot choose the same value more than once, hence the term "Without Replacement".
  2. Simple Random Sampling With Replacement (SRSWR): This kind of sampling is employed when the total number of values (the population) is small. Repetition in the chosen values is permitted.

  3. Stratified Sampling:

    A stratum is a group.

    Stratified sampling ensures that a small number of randomly chosen values are taken from each category.

    Consider an example with three different kinds of numbers: a 10 series, a 100 series, and a 500 series.

    It is possible that one of the series will go entirely unnoticed if you choose five values purely at random; for example, the 10-series numbers might be entirely absent.

  4. Systematic Sampling:

    Systematic sampling involves choosing every i-th value, for instance every fifth or tenth number.

    A straightforward rule determines the index of the selected values.

  5. Biased Sampling:

    As the name suggests, this is when you purposefully select values based on your own judgment.

    This type of sampling is also known as purposeful sampling or convenience sampling. A short sketch of a few of these sampling types follows below.
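
A minimal sketch of simple random, systematic, and stratified sampling with pandas; the toy data and sample sizes are illustrative.

import pandas as pd

df = pd.DataFrame({
    "series": ["10s"] * 5 + ["100s"] * 5 + ["500s"] * 5,
    "value":  [10, 11, 12, 13, 14, 100, 110, 120, 130, 140, 500, 510, 520, 530, 540],
})

# Simple random sampling without replacement
print(df.sample(n=5, replace=False, random_state=1))

# Simple random sampling with replacement
print(df.sample(n=5, replace=True, random_state=1))

# Systematic sampling: every 3rd row
print(df.iloc[::3])

# Stratified sampling: a few random rows from each group (stratum)
print(df.groupby("series", group_keys=False).apply(lambda g: g.sample(n=2, random_state=1)))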

Sampling Theory

Think about the jar of bubble gum, which has different colored bubble gums.
There’s a good chance that the gums you “randomly” choose will contain gums of all different hues.
As a result, you might conclude that the sample that was chosen at random is representative of all the gums in the jar.
These randomly chosen gums are referred to as the sample in statistical terms, and the jar is referred to as the population.

Effect of Size on Sampling:
Example:
The bubble gum jar contained 200 gums with 6 different colors.
If you select only 10 gums there is a chance that few colors may NOT be present.
If you select 50 gums then there is a high chance of all colors being present.
If you select 100 gums then there is a very high chance of all colors being present.
If you select all 200 gums then it is certain that all colors will be present. This is the case where the sample is the same as the population. That means you simply selected all!

Artificial Neural Networks

The Deep Learning ANN architecture utilized in this case study is described below.

I'm utilizing two hidden layers with five neurons each and one output layer with a single neuron. Can you adjust these figures? Yes, you are able to modify both the number of hidden layers and the total number of neurons in each layer.

Lastly, select the combination that yields the highest level of accuracy. This is how the ANN model is tuned.

In the code snippet, we utilize the Sequential module from the Keras library to construct a sequence of ANN layers that are stacked one after another. Each layer is defined using the Dense module of Keras. Here we specify aspects such as the number of neurons in each layer, the weight-initialization technique used in the network, and which activation function should be applied to each neuron in that layer.

Understanding the hyperparameters in the code snippet (a reconstruction is sketched after the list below):

  • units=5: This means we are creating a layer with five neurons in it. Each of these five neurons will receive the input values; for example, the values of 'Age' will be passed to all five neurons, and similarly for all other columns.
  • input_dim=7: This means there are seven predictors in the input data, which is expected by the first layer. If you look at the second dense layer, we don't specify this value, because the Sequential model passes this information on to the next layers.
  • kernel_initializer='normal': When the neurons start their computation, some algorithm has to decide the initial value for each weight. This parameter specifies that. You can choose different values for it, like 'normal' or 'glorot_uniform'.
  • activation='relu': This specifies the activation function for the calculations inside each neuron. You can choose values like 'relu', 'tanh', 'sigmoid', etc.
  • batch_size=20: This specifies how many rows will be passed to the network in one go, after which the SSE calculation will begin and the neural network will start adjusting its weights based on the errors.
    When all the rows are passed in batches of 20 rows each, as specified by this parameter, we call that one epoch, or one full data cycle. This is also known as mini-batch gradient descent. A small batch_size makes the ANN look at the data slowly, like 2 or 4 rows at a time, which could lead to overfitting, compared to a large value like 20 or 50 rows at a time, which makes the ANN look at the data faster and could lead to underfitting. Hence, a proper value must be chosen using hyperparameter tuning.
  • Epochs=50: The same activity of adjusting weights continues 50 times, as specified by this parameter. In simple terms, the ANN looks at the full training data 50 times and adjusts its weights.
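
Since the original notebook snippet is not reproduced here, the sketch below is a reconstruction of the described configuration (two hidden layers of five neurons, seven predictors, one output neuron, batch_size=20, epochs=50), assuming Keras; X_train and y_train are placeholders for the prepared training data.

from tensorflow import keras
from tensorflow.keras.layers import Dense

# Reconstruction of the described architecture (not the original notebook code)
model = keras.Sequential([
    Dense(units=5, input_dim=7, kernel_initializer="normal", activation="relu"),
    Dense(units=5, kernel_initializer="normal", activation="relu"),
    Dense(units=1, kernel_initializer="normal"),   # single output neuron for regression
])
model.compile(loss="mean_squared_error", optimizer="adam")

# X_train and y_train are assumed to be prepared NumPy arrays with 7 predictor columns
# model.fit(X_train, y_train, batch_size=20, epochs=50, verbose=1)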

 

Random Forest Algorithm

Random forests are basically multiple decision trees put together. This is also known as bagging

To create a Random Forest predictive model, the below steps are followed.

  1. Take some random rows from the data, let's say 80% of all the rows randomly selected. Hence, a different set of rows is selected each time.
  2. Take some random columns from the above data, let's say 50% of all the columns randomly selected. Hence, a different set of columns is selected each time.
  3. Create a decision tree using the above data.
  4. Repeat steps 1 to 3 n times (the number of trees). n could be any number like 10 or 50 or 500, etc. (This is known as bagging.)
  5. Combine the predictions from each of the trees to get a final answer.

In the case of Regression, the final answer is the average of predictions made by all trees.

In the case of Classification, the final answer is the mode(majority vote) of predictions made by all trees.

These steps ensure that every possible predictor gets a fair chance in the process, because we limit the columns used for each tree. Also, there is very little bias in the model, because we select a different set of random rows for each tree.
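
A minimal sketch of a random forest classifier, assuming scikit-learn and its built-in iris data; n_estimators and max_features mirror the "number of trees" and "random subset of columns" ideas above.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators = number of trees; max_features limits the columns considered at each split;
# bootstrap=True resamples the rows for every tree
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

# The final answer is the majority vote across all trees
print("Test accuracy:", forest.score(X_test, y_test))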

 

KNN algorithm

KNN stands for K Nearest Neighbors. This method, as its name implies, attempts to classify a new case using the K nearest neighboring points.

For K, the practical range is 2 to 10. If K=3, KNN will attempt to find the three closest points.
For every new point, it consists of 3 straightforward steps:

  1. Find the K most similar (closest) points.
  2. Among those K points, count how many belong to each class.
  3. Assign the new point to the class that appears most frequently among these K points.

For regression, take the mean of the target values of the closest K points instead.

Distance between two points can be calculated using any one of the below methods.

  1. Euclidean Distance: The square root of the sum of the squared differences between the coordinates of the points.
  2. Manhattan Distance: The sum of absolute differences between the coordinates of points.
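
A minimal sketch of a K-nearest-neighbors classifier with K=3, assuming scikit-learn; Euclidean distance is the default metric.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_neighbors=3 -> look at the 3 closest points; the metric defaults to Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print("Predicted classes:", knn.predict(X_test[:5]))
print("Test accuracy:", knn.score(X_test, y_test))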

How to find the best hyperparameters using GridSearchCV in Python

Hyperparameter tuning is one of the crucial phases of machine learning, because ML algorithms may not always give the maximum accuracy out of the box. To reach the highest level of accuracy, you must tune their hyperparameters.

I'll talk about GridSearchCV in this post. CV is the abbreviation for cross-validation. GridSearchCV tests every possible combination of the parameter values you provide and selects the best one.

Take the example below. If you provide it a list of values for three hyperparameters, it will attempt every combination. The grid below corresponds to 5 x 2 x 2 = 20 hyperparameter combinations. Adding one more hyperparameter to test multiplies the number of combinations, so the time required grows very quickly. Selecting only the most crucial parameters to tune requires caution.

# Parameters to try
Parameter_Trials={'n_estimators':[100,200,300,500,1000],
                  'criterion':['gini','entropy'],
                  'max_depth': [2,3]}

The GridSearchCV function runs all of the possible parameter combinations in the example below; there are 20 combinations in this case.

GridSearchCV additionally conducts cross-validation for each combination. Using the cv argument, you may define the number of cross-validation folds.

cv=5 denotes that the data will be split into five equal parts; in each iteration, four parts are used for training and the remaining one for testing. This is also known as K-fold cross-validation of the model, with K=5. The test fold is changed each time, and this is done five times. The average of these five runs represents the ultimate accuracy.

For cross-validation, any number between 5 and 10 is suitable. Keep in mind that the larger the value, the longer the computation will take.
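
A minimal end-to-end sketch using the grid above with a random forest classifier, assuming scikit-learn and its built-in breast-cancer dataset as placeholder data.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

Parameter_Trials = {'n_estimators': [100, 200, 300, 500, 1000],
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [2, 3]}

# 20 combinations x 5 folds = 100 model fits
grid_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           Parameter_Trials, cv=5, n_jobs=-1)
grid_search.fit(X, y)

print('Best parameters:', grid_search.best_params_)
print('Best cross-validated accuracy:', grid_search.best_score_)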

What is AUC and ROC Curve?

 

Receiver Operating Characteristic:

The curve between the True Positive Rate (TPR) on the Y-axis and the False Positive Rate (FPR) on the X-axis is known as the ROC curve. The plot is generated by computing (TPR, FPR) pairs at many different classification thresholds for the model's predictions.

Area Under the Curve (AUC)

The AUC is the amount of area covered under the ROC curve. Perfect classification has a value of 1, and a good range for AUC is 0.6-0.9; it helps to understand the performance of the model. The higher the AUC, the better. If the value of AUC is less than 0.5, the predictive model is not able to discriminate between the classes.
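
A minimal sketch of computing the ROC curve and AUC for a simple classifier, assuming scikit-learn and its built-in breast-cancer dataset as placeholder data.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]      # predicted probability of the positive class

# (FPR, TPR) pairs at many classification thresholds, plus the area under that curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
print('AUC:', roc_auc_score(y_test, scores))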

What is Correlation?

Correlations are mathematical relationships between variables. You can identify correlations on a scatter diagram by the distinct patterns they form. The correlation is said to be linear if the scatter diagram shows the points lying in an approximately straight line. Let's take a look at a few common types of correlation between two variables:

Positive linear correlation (r=0 to 1)

Positive linear correlation is when low values on the x-axis correspond to low values on the y-axis, and higher values of x correspond to higher values of y. In other words, y tends to increase as x increases.

Negative linear correlation (r= -1 to 0)

Negative linear correlation is when low values on the x-axis correspond to high values on the y-axis, and higher values of x correspond to lower values of y. In other words, y tends to decrease as x increases.

No correlation (r=0)

If the values of x and y form a random pattern, then we say there's no correlation.
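
A minimal sketch of computing the correlation coefficient r for each case with NumPy; the arrays are toy examples.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y_pos = np.array([2, 4, 5, 7, 9, 11])     # increases with x  -> r close to +1
y_neg = np.array([11, 9, 8, 6, 4, 2])     # decreases with x  -> r close to -1
y_none = np.array([5, 1, 8, 3, 9, 2])     # random pattern    -> r close to 0

print('Positive:', np.corrcoef(x, y_pos)[0, 1])
print('Negative:', np.corrcoef(x, y_neg)[0, 1])
print('None:', np.corrcoef(x, y_none)[0, 1])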

Feature Engineering

There are a few types of columns/variables that are simply useless for a machine learning algorithm because they do not hold any patterns with respect to the target variable.

For example, an ID column. It is useless because every row has a unique ID, and hence it does not have any patterns in it. It is wrong to say that if the ID is 1002 then the sales will be 20,000 units. However, if you pass this variable as a predictor to the machine learning algorithm, it will try to use the ID values to predict sales and come up with some garbage logic!
The rule of Garbage in -> Garbage out applies!

A few more types of columns are also useless, like row IDs, phone numbers, complete addresses, comments, descriptions, dates, etc. All of these are not useful because each row contains a unique value, and hence there will not be any patterns.

Some of these can be indirectly used by deriving certain parts from them. For example, dates cannot be used directly, but the derived month, quarter, or week can be used because these may hold some patterns with respect to the target variable. This process of creating new columns from existing columns is known as Feature Engineering; a small sketch follows below.
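
A minimal sketch of deriving month, quarter, and week features from a date column with pandas; the column names and values are made-up examples.

import pandas as pd

df = pd.DataFrame({'SaleDate': pd.to_datetime(['2018-01-15', '2018-04-03', '2018-07-22']),
                   'Sales': [200, 340, 290]})

# Derive parts of the date that may hold patterns with respect to the target
df['Month'] = df['SaleDate'].dt.month
df['Quarter'] = df['SaleDate'].dt.quarter
df['Week'] = df['SaleDate'].dt.isocalendar().week

# Drop the raw date column, which has a unique value per row
df = df.drop(columns=['SaleDate'])
print(df)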

Remove those columns which do not exhibit any patterns with respect to the target variable

The rule of thumb is: remove those columns which do not exhibit any patterns with respect to the target variable. Do this only when you are absolutely sure; otherwise, give the column a chance.

What are the different types of machine learning?

Machine learning is a branch of Artificial Intelligence and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

There are three major types of machine learning:

  1. Supervised ML: Teach the algorithm by examples. If the input is X then the output should be y (Target Variable). Some popular algorithms for this are linear regression, logistic regression, decision trees, random forests, SVM, naive Bayes, XGboost, AdaBoost, etc.
  2. Unsupervised ML: No Target variable is present. Basically you DON’T have data which says the input is X and output is y. In this scenario, important patterns can be derived directly from data like grouping the similar type of rows (Clustering) without any prior knowledge of the given data. Some popular algorithms are K-Means, DBSCAN, PCA, ICA, Apriori.
  3. Reinforcement ML: When the Machine Learning algorithm learns from its mistakes and improves in the next iteration in order to achieve an objective. Some popular algorithms are Monte Carlo, Q-Learning, SARSA.

Confidence Interval

A confidence interval is the range of values that can contain the population mean, based on the chosen error threshold (alpha value).

Let's assume the population mean is 16, as we have assumed that all good tires are produced with a 16-inch radius.

If we take a sample of 50 tires, then we will have values like 16.2, 16.3, 15.98, 15.96, 15.99, 16.23, and so on.

For the sake of understanding let’s say the mean radius of those 50 tires came out to be 16.15. This is called Sample mean.

Now, based on this sample, we can calculate a range: the minimum and maximum values between which the mean of the population can be expected to lie. The mean of the population is the mean of the radii of all the tires.

So basically, we are trying to estimate what the mean of all the tires looks like based on the given sample. And instead of giving a single-value answer, we are providing a range of values. This range is known as the Confidence Interval.

The confidence interval is affected by the alpha value. For every alpha value, we find the value of the statistic which gets multiplied with the standard error.

Confidence Interval = [ Mean(Sample) - N*(SE), Mean(Sample) + N*(SE) ]

  • SE = Standard Error = Standard Deviation of the sample / sqrt(number of samples)
  • N = Value of the statistic: the Z-statistic if the population follows a Normal distribution, or the t-statistic if the population follows a t-distribution, for the given alpha value (probability of error margin)

For example, let us choose an alpha value of 5%. Hence, we are 95% confident that the mean value of the population will fall within the confidence interval we find. Assuming a normal distribution, the value of N is 1.96 for alpha=5%. Similarly, the value of N is approximately 2.58 for alpha=1%, and so on. These "N" values come from the Z-values of the standard normal (ideal bell curve) distribution.

Hence, to calculate a confidence interval for the population mean, we need a sample of values; we calculate its mean, we calculate its standard deviation, and we find the N-value based on the alpha level.

For the sake of explanation, assume below values were found for a sample of 50 tyres.

  • Sample Mean of radius=16.15
  • Standard deviation of 50 radius values=0.64
  • n=50
  • N=1.96 for alpha=5%

For the above values, the confidence interval will be calculated as [ 16.15 - 1.96*(0.64/sqrt(50)) , 16.15 + 1.96*(0.64/sqrt(50)) ].

Which comes out as [15.97, 16.33].

Hence, based on the given sample of 50 tires, we are 95% confident that the mean value of the radius of all the tires (the population) will be somewhere between 15.97 and 16.33.
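
A minimal sketch of this calculation in Python; the sample statistics are the ones quoted above.

import math

sample_mean = 16.15
sample_std = 0.64
n = 50
N = 1.96                      # Z-value for alpha = 5% (95% confidence)

standard_error = sample_std / math.sqrt(n)
lower = sample_mean - N * standard_error
upper = sample_mean + N * standard_error
print('95% confidence interval: [{:.2f}, {:.2f}]'.format(lower, upper))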

Bootstrapping in Python

Before you put the ML model into production, it must be tested for accuracy. This is why we split the available data into training and testing. Typically 70% for training and the remaining 30% for testing the model.

This activity of splitting the data randomly is called sampling. Now, when we measure the accuracy of a machine learning model, there is a chance that the sample is simply a lucky one: the accuracy may come out high because the data was split in a way where the testing data has rows very similar to the training data, and hence the model performs better!

To rule out this factor, we perform sampling multiple times by changing the seed value in the train_test_split() function. This is called bootstrapping: simply put, splitting the data into training and testing randomly multiple times.

We train at least 5 times so that we can be sure the testing accuracy we are getting was not just by chance and is similar for all the different samples.

The final accuracy is the average of the accuracies from all sampling iterations.

Bootstrapping Code

import numpy as np
import pandas as pd

ColumnNames=['Hours','Calories','Weight']
DataValues=[[  1.0,   2500,   95],
            [  2.0,   2000,   85],
            [  2.5,   1900,   83],
            [  3.0,   1850,   81],
            [  3.5,   1600,   80],
            [  4.0,   1500,   78],
            [  5.0,   1500,   77],
            [  5.5,   1600,   80],
            [  6.0,   1700,   75],
            [  6.5,   1500,   70]]

# Create the Data Frame
GymData=pd.DataFrame(data=DataValues,columns=ColumnNames)
GymData.head()

# Separate Target Variable and Predictor Variables
TargetVariable='Weight'
Predictors=['Hours','Calories']
X=GymData[Predictors].values
y=GymData[TargetVariable].values

#### Bootstrapping ####
########################################################
from sklearn.model_selection import train_test_split
from sklearn import tree

# Creating empty list to hold accuracy values
AccuracyValues=[]
n_times=5

## Performing bootstrapping
for i in range(n_times):
    # Split the data into training and testing set,
    # changing the seed value for each iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42+i)

    ###### Single Decision Tree Regression in Python #######
    # Choose from different tunable hyperparameters
    RegModel = tree.DecisionTreeRegressor(max_depth=3, criterion='squared_error')

    # Creating the model on Training Data
    DTree=RegModel.fit(X_train,y_train)
    prediction=DTree.predict(X_test)

    # Measuring accuracy on Testing Data as 100 - MAPE
    Accuracy = 100 - (np.mean(np.abs((y_test - prediction) / y_test)) * 100)

    # Storing accuracy values
    AccuracyValues.append(np.round(Accuracy))

################################################
# Result of all bootstrapping trials
print(AccuracyValues)
# Final accuracy
print('Final average accuracy', np.mean(AccuracyValues))
Output: 
[94.0, 95.0, 97.0, 98.0, 93.0]
Final average accuracy 95.4

 

How to measure the relationship between a categorical and a continuous variable in Python (ANOVA test)

This scenario can happen when you are doing regression or classification in machine learning.

  • Regression: The target variable is numeric and one of the predictors is categorical
  • Classification: The target variable is categorical and one of the predictors is numeric

In both these cases, the strength of the relationship between the variables can be measured using the ANOVA test.

ANOVA stands for Analysis of Variance. Basically, this test measures whether there are any significant differences between the means of the numeric variable across the categorical values. This is something that you can visualize using a box plot as well.

Below items must be remembered about ANOVA hypothesis test

  • Null hypothesis(H0): The variables are not correlated with each other
  • P-value: The probability of Null hypothesis being true
  • Accept Null hypothesis if P-value>0.05. Means variables are NOT correlated
  • Reject Null hypothesis if P-value<0.05. Means variables are correlated

In the below example, we are trying to measure if there is any correlation between FuelType and CarPrices. Here FuelType is a categorical predictor and CarPrices is the numeric target variable.

# Generating sample data
import pandas as pd
ColumnNames=['FuelType','CarPrice']
DataValues= [[  'Petrol',   2000],
             [  'Petrol',   2100],
             [  'Petrol',   1900],
             [  'Petrol',   2150],
             [  'Petrol',   2100],
             [  'Petrol',   2200],
             [  'Petrol',   1950],
             [  'Diesel',   2500],
             [  'Diesel',   2700],
             [  'Diesel',   2900],
             [  'Diesel',   2850],
             [  'Diesel',   2600],
             [  'Diesel',   2500],
             [  'Diesel',   2700],
             [  'CNG',   1500],
             [  'CNG',   1400],
             [  'CNG',   1600],
             [  'CNG',   1650],
             [  'CNG',   1600],
             [  'CNG',   1500],
             [  'CNG',   1500]]
# Create the Data Frame
CarData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(CarData.head())
########################################################
# f_oneway() function takes the group data as input and
# returns F-statistic and P-value
from scipy.stats import f_oneway
# Running the one-way ANOVA test between CarPrice and FuelType
# Assumption (H0) is that FuelType and CarPrices are NOT correlated
# Find the CarPrice data for each FuelType as a list
CategoryGroupLists=CarData.groupby('FuelType')['CarPrice'].apply(list)
# Performing the ANOVA test
# We accept the Assumption (H0) only when P-Value > 0.05
AnovaResults = f_oneway(*CategoryGroupLists)
print('P-Value for Anova is: ', AnovaResults[1])

Sample Output : P-Value for Anova: 4.355e-12

As the P-value is almost zero, we reject H0, which means the variables are correlated with each other.

How to treat missing values in data in Python

Before machine learning algorithms can be used, some data pre-processing is required; missing value treatment is one such step.

How to find missing values in Python?

The function isnull() of a pandas data frame helps to find missing values in each column.

LoanData.isnull().sum()

# Creating a sample data frame

import pandas as pd
import numpy as np
ColumnNames=['CIBIL','AGE','GENDER','SALARY','APPROVE_LOAN']
DataValues=[ [480, 28, 'M', 610000, 'Yes'],
             [480, np.nan, 'M',140000, 'No'],
             [480, 29, 'M',420000, 'No'],
             [490, 30, 'M',420000, 'No'],
             [500, 27, 'M',420000, 'No'],
             [510, np.nan, 'F',190000, 'No'],
             [550, 24, 'M',330000, np.nan],
             [560, 34, 'M',160000, 'No'],
             [560, 25, 'F',300000, 'Yes'],
             [570, 34, 'M',450000, 'Yes'],
             [590, 30, 'F',140000, 'Yes'],
             [600, np.nan, 'F',600000, 'Yes'],
             [600, 22, 'M',400000, 'No'],
             [600, 25, 'F',490000, 'Yes'],
             [610, 32, 'F',120000, np.nan],
             [630, 29, 'F',360000, 'Yes'],
             [630, 30, 'F',480000, 'Yes'],
             [660, 29, 'F',460000, 'Yes'],
             [700, 32, 'M',470000, 'Yes'],
             [740, 28, 'M',400000, 'Yes']]
# Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
#########################################################
# Finding out missing values
LoanData.isnull().sum()

How to treat missing values?

Once you have found the missing values in each column, you have the below options

  • Deleting all the missing values
  • Replacing missing values with the median for continuous columns
  • Replacing missing values with the mode for categorical columns
  • Interpolating values

How to delete all missing values at once?

This option is exercised only if the number of rows containing missing values is much less than the total number of rows.

The dropna() function of the pandas data frame removes all those rows that contain at least one missing value.

# Code to delete all the missing values at once
print('Before Deleting missing values:', LoanData.shape)
LoanDataCleaned=LoanData.dropna()
print('After Deleting missing values:', LoanDataCleaned.shape)

Replacing missing values using median/mode

Missing values treatment is done separately for each column in data. If the column is continuous, then its missing values will be replaced by the median of the same column. If the column is categorical, then the missing values will be replaced by the mode of the same column.

# Replacing with median value for a numeric variable
MedianAge=LoanData['AGE'].median()
LoanData['AGE']=LoanData['AGE'].fillna(value=MedianAge)
# Replacing with mode value for a categorical variable
ModeValue=LoanData['APPROVE_LOAN'].mode()[0]
LoanData['APPROVE_LOAN']=LoanData['APPROVE_LOAN'].fillna(value=ModeValue)

Replacing missing values using interpolation

Instead of replacing all missing values with a single value (median/mode), you can also choose to fill in the missing values based on the data present in nearby rows. This is called interpolation.

# Replacing missing values by linear interpolation for a numeric variable
LoanData['AGE']=LoanData['AGE'].interpolate(method='linear')
# For a categorical variable, numeric interpolation is not defined, so forward-fill
# each missing value with the nearest previous value instead
LoanData['APPROVE_LOAN']=LoanData['APPROVE_LOAN'].ffill()

 

K-fold cross validation

The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm or configuration on a dataset. Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true, unknown underlying mean performance of the model on the dataset, and its uncertainty can be quantified using the standard error.

The algorithm of the k-Fold technique:

  1. Pick a number of folds – k. Usually, k is 5 or 10 but you can choose any number which is less than the dataset’s length.
  2. Split the dataset into k equal (if possible) parts (they are called folds)
  3. Choose k – 1 folds as the training set. The remaining fold will be the test set
  4. Train the model on the training set. On each iteration of cross-validation, you must train a new model independently of the model trained on the previous iteration
  5. Validate on the test set
  6. Save the result of the validation
  7. Repeat steps 3 to 6 k times, each time using a different remaining fold as the test set. In the end, you should have validated the model on every fold that you have.
  8. To get the final score, average the results that you got in step 6.
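
A minimal sketch of k-fold cross-validation with scikit-learn, assuming its built-in iris dataset as placeholder data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds: each iteration trains on 4 folds and validates on the remaining one
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print('Scores per fold:', scores)
print('Mean score:', scores.mean())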

Regularization Technique

Regularization helps to solve the overfitting problem in machine learning. A model that is too simple will be a very poor generalization of the data.

At the same time, a complex model may not perform well on test data due to overfitting. We need to choose the right model between the simple and the complex one. Regularization helps to choose the preferred model complexity, so that the model is better at predicting.

Regularization is nothing but adding a penalty term to the objective function and controlling the model complexity using that penalty term. It can be used with many machine learning algorithms.
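
A minimal sketch comparing plain linear regression with L2 (Ridge) and L1 (Lasso) regularized versions, assuming scikit-learn; alpha is the penalty strength and the values shown are arbitrary.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# alpha controls the size of the penalty term added to the objective function
for name, model in [('Plain', LinearRegression()),
                    ('Ridge', Ridge(alpha=1.0)),
                    ('Lasso', Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, 'mean CV R^2:', round(scores.mean(), 3))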

What is a t-test ?

The t-test is a statistical hypothesis test that takes samples from two groups to determine whether there is a significant difference between the means of the two groups.

It compares the sample means and standard deviations while considering the sample sizes and the degree of variability of the data.

Steps:
1) State the hypotheses. A hypothesis is classified as a null hypothesis (H0) and an alternative hypothesis (Ha) that contradicts the null hypothesis. The null and alternative hypotheses are defined according to the type of test being performed.
2) Collect sample data.
3) Conduct the test.
4) Reject or fail to reject your null hypothesis H0.
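
A minimal sketch of a two-sample t-test with SciPy; the two groups are made-up measurements.

from scipy import stats

group_a = [16.1, 16.2, 15.9, 16.0, 16.3, 16.1]
group_b = [15.6, 15.7, 15.8, 15.5, 15.9, 15.7]

# Two-sample (independent) t-test comparing the group means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print('t-statistic:', round(t_stat, 3), 'p-value:', round(p_value, 5))

# If the p-value is below 0.05, reject H0: the group means differ significantly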

Interaction term in Linear Regression

In linear regression, interaction terms are additional terms added to the regression model to account for the possibility that the relationship between an independent variable (predictor) and the dependent variable (outcome) depends on the value of another independent variable.
In simpler terms, they represent the idea that the effect of one variable on the outcome is not constant but varies depending on the level of another variable.

Mathematically, an interaction term in a linear regression model takes the form of a product between two or more independent variables. For example, if you have two independent variables, X1 and X2, and you suspect that the effect of X1 on the outcome (Y) depends on the value of X2, you can introduce an interaction term like this:

Y = β0 + β1 * X1 + β2 * X2 + β3 * (X1 * X2) + ε

In this equation:

Y represents the dependent variable (the one you’re trying to predict).
X1 and X2 are independent variables.
β0, β1, β2, and β3 are the regression coefficients that represent the relationship between the variables.
ε represents the error term.
The coefficient β3 measures the strength and direction of the interaction effect. If β3 is statistically significant and positive, it indicates that the effect of X1 on Y increases as X2 increases. If β3 is negative, it suggests that the effect of X1 on Y decreases as X2 increases. If β3 is not statistically significant, it implies that there is no interaction effect between X1 and X2.

Interaction terms are useful when you suspect that the relationship between variables is more complex than a simple additive relationship. They allow you to capture how the relationship between two variables changes in the presence of other variables, potentially leading to a better understanding of the underlying data and improved model accuracy. However, it's essential to be cautious when adding interaction terms: including too many can lead to overfitting (the higher the polynomial degree of the equation, the more the model tends to overfit), and they should be based on theoretical or domain knowledge rather than testing multiple combinations blindly.
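
A minimal sketch of fitting a model with an interaction term using the statsmodels formula API; the data are synthetic, and the formula Y ~ X1 * X2 expands to X1 + X2 + X1:X2.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({'X1': rng.normal(size=n), 'X2': rng.normal(size=n)})
# Synthetic outcome with a true interaction effect (coefficient 1.5 on X1*X2)
df['Y'] = 2 + 1.0 * df['X1'] + 0.5 * df['X2'] + 1.5 * df['X1'] * df['X2'] + rng.normal(size=n)

# 'X1 * X2' expands to X1 + X2 + X1:X2 (the interaction term)
model = smf.ols('Y ~ X1 * X2', data=df).fit()
print(model.params)               # beta0, beta1, beta2, beta3
print(model.pvalues['X1:X2'])     # significance of the interaction term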

How to visualize the relationship between two continuous variables in Python

A scatter plot is the chart used when you want to visualize the relationship between two continuous variables in data. It is typically used in supervised ML (regression), where the target variable is a continuous variable. So if you want to check which continuous predictor has a clear relationship with the target variable, you look at the scatter plots.

Consider the below scenario: the target variable is "Weight" and we are trying to predict it based on the number of hours a person works out at the gym and the number of calories they consume in a day.

If you plot the scatter chart between weight and calories, you can see an increasing trend. We can easily deduce from this graph that if the calorie intake increases, then the weight also increases. This is known as a positive correlation. We can see a "clear trend"; hence, there is a relationship between weight and calories. In other words, the predictor variable calories can be used to predict weight.

Similarly, you can see there is a clear decreasing trend between Weight and Hours: if the number of hours at the gym increases, the weight decreases. This is known as a negative correlation. Again, there is a "clear trend"; hence, there is a relationship between weight and hours. In other words, hours can be used to predict weight.
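
A minimal sketch of the two scatter plots using matplotlib and the GymData-style columns described above; the data values are the same toy numbers used in the bootstrapping example earlier in this post.

import matplotlib.pyplot as plt
import pandas as pd

GymData = pd.DataFrame({'Hours':    [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0, 6.5],
                        'Calories': [2500, 2000, 1900, 1850, 1600, 1500, 1500, 1600, 1700, 1500],
                        'Weight':   [95, 85, 83, 81, 80, 78, 77, 80, 75, 70]})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(GymData['Calories'], GymData['Weight'])   # increasing trend (positive correlation)
axes[0].set_xlabel('Calories')
axes[0].set_ylabel('Weight')
axes[1].scatter(GymData['Hours'], GymData['Weight'])      # decreasing trend (negative correlation)
axes[1].set_xlabel('Hours')
axes[1].set_ylabel('Weight')
plt.show()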

What is Hypothesis testing and P-Value?

Hypothesis means assumption.

Testing whether our assumption is correct based on the given data is hypothesis testing.

Take the example of a tire factory. The radius of an ideal tire must be 16 inches. However, even if there is a deviation of up to 8%, the tire is accepted. Hence, in this scenario, we can apply hypothesis testing.

  1. Define the Null Hypothesis (H0): The radius of the tire= 16 Inch
  2. Define the alternate Hypothesis(Ha): The radius of the tire != 16 Inch
  3. Define the error tolerance limit: 8%
  4. Conduct the test
  5. Look at the P-value generated by the test: P-value= 0.79
  6. If P-Value > 0.05 then accept the Null Hypothesis otherwise reject it. : Accept the Hypothesis, Hence, The tire produced is of good quality

P-Value is the probability of H0 being True.

The higher the P-value, the better the chances of our assumption(H0) to be true. The Textbook threshold to reject a Null Hypothesis is 5%. So, if P-Value is less than 0.05, this means there is less than 5% chance of Null Hypothesis being true, hence it is rejected. Otherwise, if P-Value is more than 0.05, then the Null Hypothesis is accepted.

Class 1 : 09/11/2023 – Comments/ Findings

The first assignment talks about the Centers for Disease Control and Prevention (CDC) 2018 data. It has three datasets, i.e., Diabetes, Obesity, and Inactivity percentages based on the counties in the US.

The Diabetes, Obesity, and Inactivity datasets have 3142, 363, and 1370 records, respectively.

I see that the datasets are connected by a common column, FIPS. I joined the three datasets using the FIPS column and got 354 records in common.

This gives a dataset which is complete and contains all 3 values (Diabetes, Obesity, and Inactivity) for each of those counties.

I did EDA using the below steps for all the 3 columns:
a) plotted a histogram for each column.
b) calculated the Mean, Median, StdDev, Skewness and Kurtosis.
c) created a probplot.

Observations:

1) Obesity: The histogram looks slightly skewed towards the left; this is confirmed by the kurtosis value of 4.1 and skewness of 0.6, and the prob plot also confirms the same.
2) Inactivity: The histogram looks slightly skewed towards the right; this is confirmed by the kurtosis value of 2.4 and skewness of -0.34, and the prob plot also confirms the same.
3) Diabetic: The histogram looks highly skewed towards the right; this is confirmed by the kurtosis value of 15.32 and skewness of -2.68, and the prob plot also confirms the same.

I did a scatter plot of the "Inactive" and "Obesity" columns.

I split the dataset into 70% and 30% for Train and Test respectively.
Now I created a linear regression model using the Diabetic column as the target variable and the Inactive column as the predictor variable, and got an R2 value of 0.20. A rough sketch of these steps is shown below.
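
Since the notebook could not be attached, here is a rough, hypothetical sketch of the workflow described above, assuming pandas and scikit-learn; the file names and exact column names are placeholders, not the actual ones used.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder file and column names - adjust to the actual CDC 2018 files
diabetes = pd.read_csv('diabetes_2018.csv')        # columns: FIPS, Diabetic
obesity = pd.read_csv('obesity_2018.csv')          # columns: FIPS, Obesity
inactivity = pd.read_csv('inactivity_2018.csv')    # columns: FIPS, Inactive

# Join the three datasets on the common FIPS column (inner join keeps common counties)
merged = diabetes.merge(obesity, on='FIPS').merge(inactivity, on='FIPS')

# 70/30 train-test split, then a simple linear regression: Inactive -> Diabetic
X = merged[['Inactive']].values
y = merged['Diabetic'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print('R2 on test data:', r2_score(y_test, model.predict(X_test)))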

In future classes, I will be exploring the Breusch-Pagan test and heteroscedasticity.

PS: I am not able to attach the .ipynb file of the code in this blog.