
DATASET 1

1. What is the size of the dataset, and what types of variables are included?

The dataset has 5000 rows and 27 columns. It includes both continuous and
categorical data.

2. What are the distributions of the variables, and are they normally distributed?

No, the variables were not normally distributed. I converted the categorical data into discrete
data and then checked both the discrete and continuous data against the normal distribution.

3. What are the most frequent values or categories in the dataset, and how do they relate
to the target variable?

Delay, housing, race, and sex are the columns containing the most frequent values, and they are
related to the target variable “Diagnosis” because all of them are categorical data.

4. What are the important variables that influence the target variable?

5. Are there any correlations or patterns between the independent variables?

Yes, “Focus” and “Concentration”, and “unusual thought and apathy”, are the variables that are
highly correlated with each other.
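
As a rough illustration of how such pairs could be found, the sketch below computes the Pearson correlation for every pair of numeric columns and keeps the pairs above a cutoff. The column names "Focus", "Concentration", and "Delay" come from the answers in this document, but the values, the 0.8 threshold, and the helper names `pearson` and `correlated_pairs` are assumptions made for the example, not the method actually used.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlated_pairs(data, threshold=0.8):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical toy values; the real dataset is not reproduced here.
columns = {
    "Focus": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Concentration": [1.1, 2.1, 2.9, 4.2, 5.0],
    "Delay": [3.0, 1.0, 4.0, 1.0, 5.0],
}

pairs = correlated_pairs(columns)
for a, b, r in pairs:
    print(f"{a} / {b}: r = {r:.2f}")
```

The threshold is a judgment call; a lower cutoff would report more pairs.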

6. Is the dataset balanced, or is there an imbalance in the target variable distribution?

The dataset is balanced with respect to the target variable “Diagnosis” for this test set.
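
One simple way to check balance is to compare the share of each class in the target column. The sketch below assumes the “Diagnosis” labels are available as a plain list; the label values "A" and "B" and the helper name `class_proportions` are hypothetical.

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of rows belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels standing in for the "Diagnosis" column.
diagnosis = ["A", "B", "A", "B", "A", "B"]

proportions = class_proportions(diagnosis)
# Ratio of the largest class share to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = max(proportions.values()) / min(proportions.values())
print(proportions, imbalance_ratio)
```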

7. Are there any missing values, and if so, what is the best way to impute them?

Yes, there are missing values in this dataset. I used the mean and most-frequent-value methods to
impute the missing values.
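
A minimal sketch of the two imputation strategies, assuming missing entries are represented as `None`, that the mean is used for numeric columns, and that the most frequent value is used for categorical ones. The "age" values and the function names are hypothetical; "housing" is one of the columns named earlier, but its values here are made up.

```python
from statistics import mean, mode

def impute_mean(values):
    """Replace None with the mean of the observed numeric values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_most_frequent(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns with gaps.
age = [20, None, 40]
housing = ["rent", "own", None, "rent"]

age_filled = impute_mean(age)
housing_filled = impute_most_frequent(housing)
print(age_filled, housing_filled)
```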

8. Are there any outliers, and how should they be treated?

Yes, there are outliers in this dataset, which I identified using the IQR method.
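
The IQR method can be sketched as follows. This assumes the conventional 1.5 × IQR fences and the default quartile definition of `statistics.quantiles`; the sample values and the function name `iqr_outliers` are hypothetical, and the sketch only flags outliers without deciding how to treat them.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical column with one extreme value.
sample = [10, 12, 11, 13, 12, 11, 100]
outliers = iqr_outliers(sample)
print(outliers)
```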

9. What is the appropriate method for feature scaling or normalization?

I used the min-max method to normalize the data.
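
Min-max normalization rescales each value to the range [0, 1] using (x - min) / (max - min). A minimal sketch, with a hypothetical function name and sample values, and with a constant column mapped to 0.0 as an assumed convention:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0 (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 15, 20])
print(scaled)
```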

10. What is the best way to handle categorical variables in the model?

The categorical data was converted using one-hot encoding.
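
One-hot encoding replaces a categorical column with one 0/1 indicator column per category. The sketch below is a plain-Python illustration; the function name `one_hot`, the `prefix_category` naming scheme, and the sample "housing" values are assumptions made for the example.

```python
def one_hot(values, prefix):
    """Return one dict per row with a 0/1 indicator for each category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]

# Hypothetical values for the "housing" column.
housing = ["rent", "own", "rent"]
encoded = one_hot(housing, prefix="housing")
for row in encoded:
    print(row)
```

Each row has exactly one indicator set to 1, so the indicators for a row always sum to 1.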
