Class 1: 09/11/2023 – Comments/Findings

The first assignment uses the Centers for Disease Control and Prevention (CDC) 2018 data. It has three datasets, Diabetes, Obesity and Inactivity, each giving percentages by county in the US.

The Diabetes, Obesity and Inactivity datasets have 3142, 363 and 1370 records, respectively.

The datasets share a common FIPS column. I joined the three datasets on FIPS and got 354 records in common.
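
The join described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the three small frames are made-up stand-ins for the CDC sheets, and the column names DIABETIC, OBESE and INACTIVE are assumptions (only FIPS comes from the data description above).

```python
import pandas as pd

# Made-up stand-ins for the three CDC sheets. Only the FIPS column name
# is taken from the post; the other column names are assumptions.
diabetes = pd.DataFrame(
    {"FIPS": [1001, 1003, 1005, 1007], "DIABETIC": [9.1, 8.7, 11.2, 10.4]}
)
obesity = pd.DataFrame(
    {"FIPS": [1001, 1005, 1009], "OBESE": [33.0, 36.5, 31.8]}
)
inactivity = pd.DataFrame(
    {"FIPS": [1001, 1003, 1005], "INACTIVE": [24.3, 21.0, 27.9]}
)

# An inner join keeps only the counties present in all three frames.
merged = (
    diabetes
    .merge(obesity, on="FIPS", how="inner")
    .merge(inactivity, on="FIPS", how="inner")
)

print(len(merged))            # number of counties common to all three
print(merged.isna().sum())    # an inner join should leave no missing values
```

With an inner join, the result can never have more rows than the smallest input, which fits the 354 common records being just under the 363 Obesity records.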

This joined data is complete: every one of the 354 counties has all 3 values (Diabetes, Obesity and Inactivity), so I used it for the rest of the analysis.

I did EDA on all 3 columns using the steps below:
Steps:
a) Plotted a histogram of the column.
b) Calculated Mean, Median, StdDev, Skewness and Kurtosis.
c) Created a probability plot (probplot).
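
The steps above can be sketched as below. This is a hedged example on synthetic data, not the notebook's code: the `values` array and the `summarize` helper are hypothetical, and the drawing itself is left out (passing `plot=plt` to `probplot` with matplotlib imported would draw the plot).

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one percentage column (assumption: the real
# column would come from the joined county data).
rng = np.random.default_rng(0)
values = rng.gamma(shape=4.0, scale=2.0, size=354)

def summarize(x):
    """Return the summary statistics listed in step (b)."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": float(np.mean(x)),
        "median": float(np.median(x)),
        "std": float(np.std(x, ddof=1)),   # sample standard deviation
        "skewness": float(stats.skew(x)),  # > 0: longer right tail
        # fisher=False gives Pearson kurtosis, where a normal scores 3;
        # the default (fisher=True) subtracts 3.
        "kurtosis": float(stats.kurtosis(x, fisher=False)),
    }

summary = summarize(values)

# Step (a): bin counts behind a histogram, without drawing it.
counts, bin_edges = np.histogram(values, bins=20)

# Step (c): probplot returns the theoretical and ordered sample quantiles
# plus a least-squares line; r near 1 means the data look close to normal.
(osm, osr), (slope, intercept, r) = stats.probplot(values, dist="norm")

print(summary)
print(round(r, 3))
```

When comparing kurtosis values, it is worth checking which convention was used, since the two differ by exactly 3.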

Observations:

1) Obesity: The histogram leans slightly to the left, with the longer tail on the right. The positive skewness of 0.6 confirms this, the kurtosis is 4.1, and the prob plot shows the same.
2) Inactivity: The histogram leans slightly to the right, with the longer tail on the left. The negative skewness of -0.34 confirms this, the kurtosis is 2.4, and the prob plot shows the same.
3) Diabetic: The histogram leans strongly to the right, with a long tail on the left. The negative skewness of -2.68 confirms this, the kurtosis is 15.32 (a very heavy tail), and the prob plot shows the same.

I made a scatter plot of the “Inactive” and “Obesity” columns.

I split the dataset 70/30 into Train and Test sets respectively.
I then fitted a linear regression model with the Diabetic column as the target variable and the Inactive column as the predictor variable, and got an R2 value of 0.20.
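
A rough sketch of the split and the fit, using plain NumPy rather than whichever library the notebook used. Everything here is illustrative: the data are synthetic, the variable names are assumptions, and the resulting R2 will not match the 0.20 reported above.

```python
import numpy as np

# Synthetic stand-ins for the Inactive (predictor) and Diabetic (target)
# columns; the real values come from the joined county data.
rng = np.random.default_rng(42)
n = 354
inactive = rng.normal(15.0, 2.0, size=n)
diabetic = 2.0 + 0.3 * inactive + rng.normal(0.0, 1.0, size=n)

# 70/30 split: shuffle the row indices, then cut at 70%.
idx = rng.permutation(n)
n_train = int(0.7 * n)
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Simple linear regression (one predictor) by least squares.
# np.polyfit returns coefficients from highest degree down.
slope, intercept = np.polyfit(inactive[train_idx], diabetic[train_idx], deg=1)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

pred_test = slope * inactive[test_idx] + intercept
r2_test = r_squared(diabetic[test_idx], pred_test)

print(len(train_idx), len(test_idx))
print(round(float(slope), 3), round(float(r2_test), 3))
```

Scoring on the held-out 30% rather than on the training rows gives a fairer picture of how much of the variation in the target the single predictor explains.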

In future classes I will be exploring the Breusch-Pagan test and heteroscedasticity.

PS: I am not able to attach the .ipynb file with the code to this blog.
