
Bahria University,

Karachi Campus

LAB EXPERIMENT NO. 2

LIST OF TASKS
TASK NO OBJECTIVE

1 Data Profiling:

1. Perform a basic data profiling to understand the structure of the dataset,
including the number of rows, columns, and data types.
2. Identify the target variable and the predictor variables.
3. Compute summary statistics (mean, median, standard deviation, etc.) for
each numerical variable.
4. Identify missing values and their distribution across variables.
2 Feature Correlation: Calculate pairwise correlation coefficients between all numerical
variables. Create a correlation matrix and visualize it using a heatmap. Identify highly
correlated variables and discuss their potential impact on model performance.

3 Outlier Detection: Identify potential outliers in the dataset using appropriate techniques, such as
box plots and scatter plots. Visualize the distribution of each numerical variable to identify any
extreme values. Discuss the potential impact of outliers on the analysis and modeling process.

4 Target Variable Analysis: Analyze the distribution of the target variable (diabetes or
non-diabetes). Visualize the target variable distribution using a histogram or a bar chart.
Identify any potential imbalance in the target variable and discuss its impact on model
performance.

Kashan Riaz 02-131212-075 Data mining Journal


Date: ___________

Task No. 1: Data profiling
Solution:
• Using the diabetes dataset (diabetes.csv)
• No. of rows = 768
• No. of columns = 9
• No. of data types = 1 (int)
• No. of predictor variables = 8
• No. of target variables = 1
• No. of missing values = 0, so there is no missing-value distribution across variables to report.
• Summary statistics for each column:
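The profiling figures above can be reproduced with plain pandas. The following is a minimal sketch, not the exact procedure used in this journal: the helper name profile is hypothetical, the column capitalisation (Glucose, Age, Outcome) is assumed, and the toy values are made up so that the snippet runs without the CSV file.

```python
import pandas as pd


def profile(df: pd.DataFrame, target: str) -> dict:
    """Collect the basic profiling facts for one DataFrame (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        # every column except the target is treated as a predictor
        "predictors": [c for c in df.columns if c != target],
        "missing_per_column": df.isna().sum().to_dict(),
        # describe() reports count, mean, std, min, quartiles (50% = median), max
        "summary": numeric.describe().T,
    }


# Toy stand-in with made-up values; with the real file, load it with
# df = pd.read_csv("diabetes.csv") and pass df instead.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "Age": [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

report = profile(toy, target="Outcome")
print(report["n_rows"], report["n_columns"])
print(report["dtypes"])
print(report["missing_per_column"])
print(report["summary"][["mean", "50%", "std"]])
```

Running the same helper on the full dataset should give the row, column, data type and missing-value figures directly, which makes them easy to double-check.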


Task No.2: Feature correlation


Solution:
Using autoviz and dtale

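# Jupyter magic: render the plots inline in the notebook
%matplotlib inline
import pandas as pd
from autoviz import AutoViz_Class

df = pd.read_csv('diabetes.csv')  # not used by AutoViz below, which reads the file itself
AV = AutoViz_Class()
# depVar='' means no target column is passed to AutoViz
dfte = AV.AutoViz(filename='diabetes.csv', sep=',', depVar='', dfte=None, header=0,
                  verbose=1, lowess=False, chart_format='svg', max_rows_analyzed=150000,
                  max_cols_analyzed=30, save_plot_dir=None)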

Pairwise correlation and heatmap:

Look for highly correlated variables in the heatmap. Variables with correlation coefficients close to 1 or -1 are highly
correlated.
Taking an absolute correlation of at least 0.4 (that is, r ≤ -0.4 or r ≥ 0.4) as the threshold for high correlation, we get the following:

• Every variable has a correlation of 1 with itself (e.g. pregnancy against pregnancy); these diagonal entries carry no information.
• Pregnancy and age
• Glucose and outcome
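The threshold check above can also be done in code rather than by reading the heatmap. This is a sketch under assumptions: the helper name high_correlation_pairs is hypothetical, Pearson correlation (the pandas default) is assumed, and the toy columns contain made-up values so the snippet runs without the CSV file.

```python
from itertools import combinations

import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.4) -> pd.DataFrame:
    """List variable pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    rows = []
    # combinations() visits each unordered pair once, so self-correlations
    # (always 1) and mirrored duplicates never appear in the result
    for a, b in combinations(corr.columns, 2):
        r = corr.loc[a, b]
        if abs(r) >= threshold:
            rows.append({"var_a": a, "var_b": b, "r": r})
    result = pd.DataFrame(rows, columns=["var_a", "var_b", "r"])
    return result.sort_values("r", key=lambda s: s.abs(), ascending=False)


# Toy stand-in with made-up values; replace with pd.read_csv("diabetes.csv").
toy = pd.DataFrame({
    "Pregnancies": [1, 2, 3, 4, 5, 6],
    "Age": [22, 25, 27, 33, 36, 41],
    "Noise": [1, -1, -1, 1, 1, -1],
})

pairs = high_correlation_pairs(toy, threshold=0.4)
print(pairs)
```

The full matrix returned by corr() is the same object a heatmap would be drawn from, so the list and the plot stay consistent.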

CORRELATION MATRIX USING DTALE:

Task No.3: Outlier Detection
Solution:
Using autoviz and dtale

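# Jupyter magic: render the plots inline in the notebook
%matplotlib inline
import pandas as pd
from autoviz import AutoViz_Class

df = pd.read_csv('diabetes.csv')  # not used by AutoViz below, which reads the file itself
AV = AutoViz_Class()
# verbose=1 prints the data-quality report used for the outlier counts
dfte = AV.AutoViz(filename='diabetes.csv', sep=',', depVar='', dfte=None, header=0,
                  verbose=1, lowess=False, chart_format='svg', max_rows_analyzed=150000,
                  max_cols_analyzed=30, save_plot_dir=None)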
Number of outlier values in every column:

The number of outliers in each column can be read from this table; it is reported in the DQ Issue column.
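As a cross-check on the AutoViz table, outliers per column can be counted by hand. The sketch below uses the 1.5 × IQR rule (the rule behind box-plot whiskers), which is an assumption here and may not match the criterion AutoViz applies. The helper name iqr_outlier_counts is hypothetical and the toy values are made up.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # comparisons align the per-column fences with the DataFrame's columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy stand-in with made-up values; replace with pd.read_csv("diabetes.csv").
toy = pd.DataFrame({
    "Glucose": [85, 90, 88, 92, 95, 89, 400],  # 400 is a planted extreme value
    "Age": [21, 22, 23, 24, 25, 26, 27],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

Because the rule differs between tools, the counts may not agree exactly with the DQ Issue column; large disagreements are worth inspecting column by column.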

Outliers can also be spotted visually, for example in scatter plots:

Points that lie far from the main cluster are potential outliers:

USING DTALE: a few scatter plots that show outliers

Impact of Outliers on Dataset:
Outliers can have a significant impact on the analysis and modeling process, particularly in correlation analysis and
predictive modeling. Here are some potential impacts:

1. Influence on Correlation Coefficients

Outliers can distort correlation coefficients, particularly Pearson correlation coefficients, which are sensitive to extreme
values.

2. Model Performance

In predictive modeling, outliers can disproportionately influence the model's parameters and predictions.

3. Misleading Visualizations

Outliers can distort visualizations such as scatter plots and correlation matrix heatmaps, making it challenging to identify
true patterns or relationships between variables.

4. Model Robustness

Outliers can reduce the robustness of statistical models, making them less reliable in the presence of unusual or
unexpected data points.
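The first point, the sensitivity of Pearson correlation to extreme values, can be illustrated with a small made-up example. The data below are synthetic and not taken from diabetes.csv; the rank-based (Spearman-style) coefficient is included only as a comparison and is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Ten synthetic points lying exactly on the line y = 2x
x = pd.Series(range(1, 11), dtype=float)
y = 2 * x
pearson_clean = x.corr(y)  # perfect linear relationship

# Append a single extreme point (11, -50)
x_out = pd.concat([x, pd.Series([11.0])], ignore_index=True)
y_out = pd.concat([y, pd.Series([-50.0])], ignore_index=True)

pearson_out = x_out.corr(y_out)                   # roughly -0.22: the sign flips
spearman_out = x_out.rank().corr(y_out.rank())    # 0.5: ranks are affected less

print(f"Pearson without outlier: {pearson_clean:.3f}")
print(f"Pearson with outlier:    {pearson_out:.3f}")
print(f"Rank-based with outlier: {spearman_out:.3f}")
```

One point out of eleven is enough to turn a perfect positive Pearson correlation into a weakly negative one, which is why a heatmap built on data with outliers should be read with care.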

Task No.4: Target Variable Analysis
Solution:
The target variable in diabetes.csv is the outcome variable: it is the value that the eight predictor
variables are used to predict.

Using dtale:

Analyzing Distribution:

Outcome: This variable encodes the final result. 1 (Yes) means the person has diabetes and 0 (No) means the person does not.

More people in the dataset do not have diabetes, so the dataset has a non-diabetes majority.

Histogram:

Potential imbalance and its impact on the model:

Yes, there is an imbalance in the data. The most frequent value, 0, occurs 65.1% of the time, while the other value, 1,
occurs only 34.9% of the time. This indicates a class imbalance: class 0 is nearly twice as prevalent as
class 1.
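The class shares can be computed with value_counts. In this sketch the class counts of 500 and 268 are an assumption inferred from the 768 rows and the 65.1% / 34.9% split reported above; with the real file, the Outcome column of the loaded DataFrame would be used instead (column name assumed).

```python
import pandas as pd

# Assumed counts: 500 zeros and 268 ones, inferred from 768 rows and the
# reported percentages. With the real file use:
# outcome = pd.read_csv("diabetes.csv")["Outcome"]
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()
shares = outcome.value_counts(normalize=True).mul(100).round(1)
imbalance_ratio = counts.max() / counts.min()  # majority count / minority count

print(counts)
print(shares)
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

A bar chart of counts shows the same information as the histogram; the ratio gives a single number to quote when discussing the imbalance.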

Here are some potential impacts of class imbalance:

Biased Model: The model may be biased towards the majority class (0 in this case) because it sees more examples of
that class during training. As a result, it may have lower accuracy in predicting the minority class (1).

Poor Generalization: The model may generalize poorly to new data, especially for the minority class. Since it has seen
fewer examples of the minority class during training, it may not learn to distinguish it effectively.

Misinterpretation of Importance: The importance of features may be skewed towards features that are more prevalent
in the majority class, leading to a misinterpretation of the true drivers of the outcome.
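The bias towards the majority class can be made concrete with a baseline that always predicts 0. This is an illustrative sketch, not a model trained in this journal; the class counts of 500 and 268 are an assumption inferred from the 768 rows and the reported 65.1% / 34.9% split.

```python
import pandas as pd

# Assumed labels: 500 non-diabetic (0) and 268 diabetic (1) cases
y_true = pd.Series([0] * 500 + [1] * 268)

# Trivial baseline: always predict the majority class
y_pred = pd.Series(0, index=y_true.index)

accuracy = (y_true == y_pred).mean()

# Recall for class 1: share of actual diabetic cases that were predicted as 1
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall_diabetes = true_positives / (y_true == 1).sum()

print(f"Accuracy of always predicting 0: {accuracy:.3f}")
print(f"Recall for class 1:              {recall_diabetes:.3f}")
```

The baseline reaches about 65% accuracy while detecting no diabetic case at all, so accuracy alone is a misleading score on this dataset; per-class recall or a confusion matrix gives a fairer picture.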
