Data Mining Journal 2 Kashan
Karachi Campus
LIST OF TASKS
TASK NO OBJECTIVE
1 Data Profiling:
3 Outlier Detection: Identify potential outliers in the dataset using appropriate techniques, such as
box plots and scatter plots. Visualize the distribution of each numerical variable to identify any
extreme values. Discuss the potential impact of outliers on the analysis and modeling process.
4 Target Variable Analysis: Analyze the distribution of the target variable (diabetes or
non-diabetes). Visualize the target variable distribution using a histogram or a bar chart.
Identify any potential imbalance in the target variable and discuss its impact on model
performance.
No. of rows = 768
No. of datatypes = 2 (int and float; BMI and DiabetesPedigreeFunction hold decimal values)
No. of columns = 9
No. of predictor variables = 8
No. of target variables = 1
No. of missing values = 0, so there is no distribution of missing values across columns to report.
Summary statistics for each column
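The profiling figures above (rows, columns, datatypes, missing values) can be recomputed rather than read off a report. The following is a minimal sketch using only the Python standard library; the helper names `infer_kind` and `profile` and the three-row sample are hypothetical and exist only for illustration, they are not part of the journal's notebook.

```python
import csv
import io

def infer_kind(value):
    """Classify one non-empty CSV cell as 'int', 'float' or 'str'."""
    for kind, caster in (("int", int), ("float", float)):
        try:
            caster(value)
            return kind
        except ValueError:
            continue
    return "str"

def profile(rows):
    """Profile a list of dict rows, e.g. list(csv.DictReader(f))."""
    columns = list(rows[0].keys())
    missing = {c: 0 for c in columns}
    kinds = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            value = (row[c] or "").strip()
            if value == "":
                missing[c] += 1          # an empty cell counts as missing
            else:
                kinds[c].add(infer_kind(value))
    dtypes = {}
    for c in columns:
        if not kinds[c]:
            dtypes[c] = "unknown"        # column is entirely empty
        elif kinds[c] == {"int"}:
            dtypes[c] = "int"
        elif kinds[c] <= {"int", "float"}:
            dtypes[c] = "float"          # any decimal makes the column float
        else:
            dtypes[c] = "str"
    return {"rows": len(rows), "columns": len(columns),
            "missing": missing, "dtypes": dtypes}

# Hypothetical three-row sample; one BMI cell is left blank on purpose.
sample = "Glucose,BMI,Outcome\n148,33.6,1\n85,26.6,0\n183,,1\n"
report = profile(list(csv.DictReader(io.StringIO(sample))))
print(report)
```

On the real file the same function could be fed `list(csv.DictReader(open('diabetes.csv', newline='')))`. In pandas, `df.shape`, `df.dtypes`, `df.isna().sum()` and `df.describe()` report the equivalent figures.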
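import autoviz
import pandas as pd

# Render plots inside the notebook (Jupyter magic)
%matplotlib inline

# Load the dataset so it can be inspected alongside the AutoViz report
df = pd.read_csv('diabetes.csv')

# Run AutoViz on the CSV; depVar='' means no target column is specified
AV = autoviz.AutoViz_Class()
dfte = AV.AutoViz(filename='diabetes.csv', sep=',', depVar='', dfte=None, header=0,
                  verbose=1, lowess=False, chart_format='svg', max_rows_analyzed=150000,
                  max_cols_analyzed=30, save_plot_dir=None)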
Any variable correlated against itself has a correlation of 1 (e.g. pregnancy against pregnancy).
Pregnancy and age
Glucose and outcome
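The diagonal of 1s in the correlation matrix follows directly from the definition of the coefficient. The short derivation below assumes the matrix reports Pearson coefficients, which is an assumption about the plot rather than something stated with it.

```latex
r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
\qquad\Longrightarrow\qquad
r_{XX} = \frac{\operatorname{cov}(X, X)}{\sigma_X \, \sigma_X}
       = \frac{\sigma_X^2}{\sigma_X^2} = 1
```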
import autoviz
import pandas as pd
df = pd.read_csv('diabetes.csv')
AV = autoviz.AutoViz_Class()
%matplotlib inline
dfte = AV.AutoViz(filename='diabetes.csv', sep=',', depVar='', dfte=None, header=0,
verbose=1, lowess=False, chart_format='svg', max_rows_analyzed=150000,
max_cols_analyzed=30, save_plot_dir=None)
Number of outlier values in every column:
The number of outliers in each column can be read from this table: it is reported in the DQ Issue column.
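As a cross-check on the reported counts, outliers can also be counted by hand. The sketch below applies the 1.5 × IQR rule that box plots use for their whiskers; whether the report's table uses exactly this rule is an assumption. The function name `iqr_outliers` and the sample readings are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings with one extreme value appended.
sample = [70, 72, 74, 76, 78, 80, 82, 84, 200]
outliers = iqr_outliers(sample)
print(outliers)       # -> [200]
print(len(outliers))  # number of outliers in this column -> 1
```

Running the function once per numerical column gives a per-column outlier count to compare against the table.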
Outliers can distort correlation coefficients, particularly Pearson correlation coefficients, which are sensitive to extreme
values.
2. Model Performance
In predictive modeling, outliers can disproportionately influence the model's parameters and predictions.
3. Misleading Visualizations
Outliers can distort visualizations such as scatter plots and correlation matrix heatmaps, making it challenging to identify
true patterns or relationships between variables.
4. Model Robustness
Outliers can reduce the robustness of statistical models, making them less reliable in the presence of unusual or
unexpected data points.
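The sensitivity of the Pearson coefficient to extreme values, noted in the list above, can be seen on a toy example. This is a minimal sketch with made-up numbers, not values from the diabetes dataset; `pearson` is a hand-written helper for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Six points on a perfect straight line (y = 2x).
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]
clean_r = pearson(x, y)

# The same data plus a single extreme point.
outlier_r = pearson(x + [7], y + [-30])

print(round(clean_r, 3), round(outlier_r, 3))
```

One added point is enough to turn a perfect positive correlation into a negative one, which is why outliers should be inspected before a correlation matrix is interpreted.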
Using dtale:
Outcome: the final result, where 1 is Yes (the person has diabetes) and 0 is No (the person does not have diabetes).
More people do not have diabetes, so the dataset has a non-diabetes majority.
Histogram:
Yes, there is an imbalance in the data. The most frequent value, 0, occurs 65.1% of the time, while the other value, 1,
occurs only 34.9% of the time. This indicates a class imbalance, where one class (0 in this case) is nearly twice as
prevalent as the other class (1).
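The percentages above can be reproduced from the class counts. In the sketch below the counts of 500 and 268 are back-computed from the 768 rows and the reported 65.1% / 34.9% split, so they are an assumption rather than figures quoted in this journal.

```python
from collections import Counter

# Assumed class counts: 500 non-diabetes (0) and 268 diabetes (1) out of 768 rows.
labels = [0] * 500 + [1] * 268

counts = Counter(labels)
total = len(labels)

# Share of each class as a percentage, rounded to one decimal place.
shares = {cls: round(100 * n / total, 1) for cls, n in counts.items()}

# How many majority-class rows there are for each minority-class row.
imbalance_ratio = counts[0] / counts[1]

print(shares)                     # -> {0: 65.1, 1: 34.9}
print(round(imbalance_ratio, 2))  # -> 1.87
```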
Biased Model: The model may be biased towards the majority class (0 in this case) because it sees more examples of
that class during training. As a result, it may have lower accuracy in predicting the minority class (1).
Poor Generalization: The model may generalize poorly to new data, especially for the minority class. Since it has seen
fewer examples of the minority class during training, it may not learn to distinguish it effectively.
Misinterpretation of Importance: The importance of features may be skewed towards features that are more prevalent
in the majority class, leading to a misinterpretation of the true drivers of the outcome.
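The bias described above can be made concrete with a baseline that always predicts the majority class. The sketch below again assumes counts of 500 and 268 (derived from 768 rows and the reported percentages). It also computes "balanced" class weights with the common heuristic n_samples / (n_classes × class_count), which is one possible mitigation and not something the journal itself applies.

```python
from collections import Counter

# Assumed class counts, back-computed from the reported percentages.
y_true = [0] * 500 + [1] * 268

# A trivial "model" that always predicts the majority class (0).
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for class 1: share of actual diabetes cases that were predicted as 1.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall_minority = true_pos / actual_pos

# Balanced class weights: rarer classes receive proportionally larger weights.
counts = Counter(y_true)
weights = {cls: len(y_true) / (len(counts) * n) for cls, n in counts.items()}

print(round(accuracy, 3))   # -> 0.651
print(recall_minority)      # -> 0.0
print({cls: round(w, 3) for cls, w in weights.items()})
```

The baseline reaches about 65% accuracy while detecting no diabetes cases at all, so accuracy alone is a misleading metric here; recall or precision per class gives a clearer picture.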