Data Mining Journal 2 Kashan

Bahria University,
Karachi Campus
LAB EXPERIMENT NO. 2
LIST OF TASKS
TASK NO    OBJECTIVE
1    Data Profiling:
     1. Perform basic data profiling to understand the structure of the dataset, including the number of rows, columns, and data types.
     2. Identify the target variable and the predictor variables.
     3. Compute summary statistics (mean, median, standard deviation, etc.) for each numerical variable.
     4. Identify missing values and their distribution across variables.
2    Feature Correlation: Calculate pairwise correlation coefficients between all numerical variables. Create a correlation matrix and visualize it using a heatmap. Identify highly correlated variables and discuss their potential impact on model performance.
3    Outlier Detection: Identify potential outliers in the dataset using appropriate techniques, such as box plots and scatter plots. Visualize the distribution of each numerical variable to identify any extreme values. Discuss the potential impact of outliers on the analysis and modeling process.
4    Target Variable Analysis: Analyze the distribution of the target variable (diabetes or non-diabetes). Visualize the target variable distribution using a histogram or a bar chart. Identify any potential imbalance in the target variable and discuss its impact on model performance.
Kashan Riaz 02-131212-075 Data mining Journal

Date: ___________

Task No. 1: Data profiling
Solution:
Using the diabetes dataset (diabetes.csv):

• Number of rows = 768
• Number of columns = 9
• Number of data types = 1 (int)
• Number of predictor variables = 8
• Number of target variables = 1
• Number of missing values = 0, so there is no missing-value distribution across variables to report.
• Summary statistics for each column:
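The profiling figures listed above can be gathered with a few pandas calls. The following is a minimal sketch, not the method used in this journal: the helper name `profile` and the small stand-in frame (with made-up values) are assumptions for illustration, and the real file would be loaded with `pd.read_csv('diabetes.csv')` instead.

```python
import pandas as pd


def profile(df: pd.DataFrame, target: str) -> dict:
    """Collect basic profiling facts for one DataFrame (hypothetical helper)."""
    predictors = [col for col in df.columns if col != target]
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # How many columns use each data type, e.g. {"int64": 3}
        "dtypes": df.dtypes.astype(str).value_counts().to_dict(),
        "predictors": predictors,
        # Missing values per column; all zeros means no gaps
        "missing_per_column": df.isnull().sum().to_dict(),
        # count, mean, std, min, quartiles (the 50% row is the median), max
        "summary": df.describe(),
    }


# Stand-in data with made-up values; replace with pd.read_csv("diabetes.csv").
df = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89],
        "Age": [50, 31, 32, 21],
        "Outcome": [1, 0, 1, 0],
    }
)

report = profile(df, target="Outcome")
print(report["rows"], report["columns"], report["dtypes"])
print(report["missing_per_column"])
print(report["summary"])
```

Running the same helper on the full dataset is a quick way to confirm the row, column, data type, and missing-value counts reported above.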


Task No. 2: Feature correlation

Solution:
Using AutoViz and D-Tale:
import autoviz
import pandas as pd
# Load the dataset into a DataFrame
df = pd.read_csv('diabetes.csv')
AV = autoviz.AutoViz_Class()
# Jupyter magic: render the plots inside the notebook
%matplotlib inline
# Let AutoViz read the CSV and generate its charts (no target variable set)
dfte = AV.AutoViz(filename='diabetes.csv', sep=',', depVar='', dfte=None, header=0,
                  verbose=1, lowess=False, chart_format='svg', max_rows_analyzed=150000,
                  max_cols_analyzed=30, save_plot_dir=None)

Pairwise correlation and heatmap:

Look for highly correlated variables in the heatmap: a correlation coefficient close to 1 or -1 indicates a strong linear relationship.
Treating |r| >= 0.4 as high correlation (thresholds of -0.4 and 0.4), the following pairs stand out:
• Every variable has a correlation of exactly 1 with itself (e.g. pregnancy against pregnancy), so the diagonal is ignored.
• Pregnancy and age
• Glucose and outcome
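The threshold check described above can also be done programmatically instead of reading the heatmap by eye. This is a sketch under stated assumptions: `high_corr_pairs` is a hypothetical helper, the columns `x1`, `x2`, `x3` hold made-up values, and Pearson correlation (the pandas default) is assumed to be the coefficient in use.

```python
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.4) -> list:
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        # Start at i + 1: skips the diagonal of 1s and mirrored duplicates
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


# Made-up stand-in data; with the real file use pd.read_csv("diabetes.csv").
df = pd.DataFrame(
    {
        "x1": [1, 2, 3, 4, 5],
        "x2": [2, 4, 6, 8, 11],  # rises almost linearly with x1
        "x3": [5, 1, 4, 2, 3],   # only weakly related to x1 and x2
    }
)

pairs = high_corr_pairs(df, threshold=0.4)
print(pairs)
```

Only the upper triangle of the matrix is scanned, which is why each variable's correlation of 1 with itself never appears in the result.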
CORRELATION MATRIX USING DTALE:

Task No. 3: Outlier Detection
Solution:
Using AutoViz and D-Tale:
import autoviz
import pandas as pd
# Load the dataset into a DataFrame
df = pd.read_csv('diabetes.csv')
AV = autoviz.AutoViz_Class()
# Jupyter magic: render the plots inside the notebook
%matplotlib inline
# Let AutoViz read the CSV and generate its charts and data-quality report
dfte = AV.AutoViz(filename='diabetes.csv', sep=',', depVar='', dfte=None, header=0,
                  verbose=1, lowess=False, chart_format='svg', max_rows_analyzed=150000,
                  max_cols_analyzed=30, save_plot_dir=None)
Number of outlier values in every column:
The number of outliers in each column can be read from this table; it is reported in the DQ Issue column.
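The per-column outlier counts can be cross-checked directly in pandas. The sketch below assumes the 1.5 x IQR rule that box plots use for their whiskers; the rule AutoViz applies internally may differ, so the counts need not match its table exactly. The helper `iqr_outlier_counts` and the columns `v` and `w` are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare every column against its own bounds
    mask = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return mask.sum()


# Made-up stand-in data; with the real file use pd.read_csv("diabetes.csv").
df = pd.DataFrame(
    {
        "v": [10, 12, 11, 13, 12, 11, 100],  # 100 sits far above the rest
        "w": [1, 2, 3, 4, 5, 6, 7],          # evenly spread, no extremes
    }
)

counts = iqr_outlier_counts(df)
print(counts)
```

The multiplier `k` controls how strict the rule is; 1.5 is the conventional box-plot value, and a larger value flags only more extreme points.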

Outliers can also be identified visually with scatter plots.
Points that lie far from the main cluster are potential outliers:
Using D-Tale, a few scatter plots that show outliers:

Impact of Outliers on Dataset:
Outliers can have a significant impact on the analysis and modeling process, particularly in correlation analysis and
predictive modeling. Here are some potential impacts:
1. Influence on Correlation Coefficients
Outliers can distort correlation coefficients, particularly Pearson correlation coefficients, which are sensitive to extreme
values.
2. Model Performance
In predictive modeling, outliers can disproportionately influence the model's parameters and predictions.
3. Misleading Visualizations
Outliers can distort visualizations such as scatter plots and correlation matrix heatmaps, making it challenging to identify
true patterns or relationships between variables.
4. Model Robustness
Outliers can reduce the robustness of statistical models, making them less reliable in the presence of unusual or
unexpected data points.
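The first point, the sensitivity of the Pearson coefficient to extreme values, can be illustrated with a small worked example. The numbers below are made up for illustration and are not taken from the diabetes dataset: ten points lie exactly on a line, then a single extreme point is appended.

```python
import numpy as np

# Ten points on the line y = 2x: a perfect positive linear relationship
x = np.arange(1, 11, dtype=float)
y = 2 * x
r_clean = np.corrcoef(x, y)[0, 1]

# Append one extreme point far below the line
x_out = np.append(x, 11.0)
y_out = np.append(y, -40.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"Pearson r without the outlier: {r_clean:.3f}")
print(f"Pearson r with one outlier:    {r_outlier:.3f}")
```

A single point out of eleven is enough to turn a perfect positive correlation into a weak negative one, which is why outliers should be inspected before correlation results are interpreted.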

Task No. 4: Target Variable Analysis
Solution:
The target variable in diabetes.csv is Outcome: it is the column that the eight predictor variables are used to predict.
Using D-Tale:

Analyzing Distribution:
Outcome encodes the final result: 1 means the person has diabetes and 0 means the person does not.
More people in the dataset do not have diabetes, so the non-diabetes class is the majority.
Histogram:

Potential imbalance and its impact on the model:
Yes, there is an imbalance in the data. The most frequent value, 0, occurs 65.1% of the time, while the other value, 1,
occurs only 34.9% of the time. This is a class imbalance: the majority class (0) is nearly twice as prevalent as the
minority class (1).
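The class shares can be computed with `value_counts`. In this sketch the counts of 500 zeros and 268 ones are an assumption: they were chosen because they add up to the 768 rows of the dataset and reproduce the percentages reported above. With the real file the series would be `pd.read_csv('diabetes.csv')['Outcome']`.

```python
import pandas as pd

# Assumed class counts that match 768 rows and the reported 65.1% / 34.9% split
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()
shares = outcome.value_counts(normalize=True).mul(100).round(1)
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

The ratio of majority to minority examples is a compact way to state how strong the imbalance is.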
Here are some potential impacts of class imbalance:
Biased Model: The model may be biased towards the majority class (0 in this case) because it sees more examples of
that class during training. As a result, it may have lower accuracy in predicting the minority class (1).
Poor Generalization: The model may generalize poorly to new data, especially for the minority class. Since it has seen
fewer examples of the minority class during training, it may not learn to distinguish it effectively.
Misinterpretation of Importance: The importance of features may be skewed towards features that are more prevalent
in the majority class, leading to a misinterpretation of the true drivers of the outcome.
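The bias towards the majority class can be made concrete with a baseline that always predicts 0. The labels below are hypothetical, generated only to follow the reported 65.1% / 34.9% split; no model from this journal is involved.

```python
import numpy as np

# Hypothetical labels at the reported class split (651 zeros, 349 ones)
y_true = np.array([0] * 651 + [1] * 349)

# A "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall for class 1: share of actual diabetes cases that were predicted as 1
recall_pos = (y_pred[y_true == 1] == 1).mean()

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall for class 1: {recall_pos:.3f}")
```

The baseline reaches about 65% accuracy while detecting none of the diabetes cases, so accuracy alone can hide poor performance on the minority class; recall for class 1 is the more informative check here.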
