The Goal of This Part Is To Use Descriptive Statistics and Visualization To Better Understand Your Data


The goal of this part is to use descriptive statistics and visualization to better understand your data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset; the DataFrame is called df in the rest of these notes.
df = pd.read_csv(path, index_col=0, encoding="ISO-8859-1")
df.head()
df.tail()

Exploratory Data Analysis (EDA)


Shape analysis:

Target variable: SARS-Cov-2 exam result

 df.shape

#######################
 print(df.dtypes)
 print(df.dtypes.value_counts())
 df.dtypes.value_counts().plot.pie()

#################################

 df.isna()  # boolean mask: True where a value is missing, False otherwise
 plt.figure(figsize=(20,10))
sns.heatmap(df.isna(), cbar=False)
 df.isna().sum()
 (df.isna().sum()/df.shape[0]).sort_values(ascending=True)
 missing_rate = df.isna().sum()/df.shape[0]
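As a small illustration of the missing-rate computation above, here is a sketch on a hypothetical toy DataFrame (the column names `a`, `b`, `c` are made up and not part of the COVID dataset). It also shows that `isna().mean()` should give the same rates as dividing the counts by `shape[0]`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],        # 1 of 4 values missing
    "b": [np.nan, np.nan, np.nan, 1.0],  # 3 of 4 values missing
    "c": ["x", "y", "z", "w"],           # nothing missing
})

# Fraction of missing values per column, smallest first
toy_missing_rate = (toy.isna().sum() / toy.shape[0]).sort_values(ascending=True)
print(toy_missing_rate)

# Equivalent shortcut: the mean of a boolean mask is the share of True values
toy_missing_rate_alt = toy.isna().mean()
```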

Content analysis:
 df.isna().sum()/df.shape[0]<0.9
 df.columns[df.isna().sum()/df.shape[0]<0.9]
 df[df.columns[df.isna().sum()/df.shape[0]<0.9]]
 df=df[df.columns[df.isna().sum()/df.shape[0]<0.9]]
 df.head()
 plt.figure(figsize=(20,10))
sns.heatmap(df.isna(), cbar=False)
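The column filter above can be tried on a hypothetical toy DataFrame (all column names below are invented for the example). This sketch uses `.loc` with a boolean mask, which should select the same columns as indexing `df.columns` first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 20 rows
toy = pd.DataFrame({
    "mostly_empty": [np.nan] * 19 + [1.0],    # 95% missing
    "half_empty": [np.nan] * 10 + [1.0] * 10, # 50% missing
    "full": list(range(20)),                  # 0% missing
})

# Keep only the columns with less than 90% missing values
keep_mask = toy.isna().mean() < 0.9
reduced = toy.loc[:, keep_mask]
print(list(reduced.columns))
```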

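 df = df.drop('Patient ID', axis=1)
 df['SARS-Cov-2 exam result'].value_counts(normalize=True)
 for col in df.select_dtypes('float'):
    print(col)
 for col in df.select_dtypes('float'):
    plt.figure()
    sns.distplot(df[col])  # distplot is deprecated since seaborn 0.11; sns.histplot(df[col], kde=True, stat='density') is the closest replacement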

 sns.distplot(df['Patient age quantile'])


 df['Patient age quantile'].value_counts()
 df['SARS-Cov-2 exam result'].unique()
 for col in df.select_dtypes('object'):
    print(col, df[col].unique())
 for col in df.select_dtypes('object'):
    plt.figure()
    df[col].value_counts().plot.pie()

 Relationship between the variables and the target:

 df['SARS-Cov-2 exam result']

 df['SARS-Cov-2 exam result'] == 'positive'

 df[df['SARS-Cov-2 exam result'] == 'positive']

 positive_df = df[df['SARS-Cov-2 exam result'] == 'positive']
 negative_df = df[df['SARS-Cov-2 exam result'] == 'negative']
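The boolean-mask steps above can be followed on a hypothetical four-row DataFrame (the values are invented; only the two column names come from these notes). The sketch also shows the class shares returned by `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the notes
toy = pd.DataFrame({
    "SARS-Cov-2 exam result": ["negative", "positive", "negative", "negative"],
    "Patient age quantile": [3, 17, 9, 12],
})

# Step 1: boolean Series, True on the rows where the test is positive
is_positive = toy["SARS-Cov-2 exam result"] == "positive"

# Step 2: use the mask to keep only the matching rows
positive_toy = toy[is_positive]
negative_toy = toy[toy["SARS-Cov-2 exam result"] == "negative"]

# Share of each class in the target column
shares = toy["SARS-Cov-2 exam result"].value_counts(normalize=True)
print(shares)
```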

 missing_rate = df.isna().sum()/df.shape[0]  # recompute on the reduced df so the mask matches df.columns
 blood_columns = df.columns[(missing_rate < 0.9) & (missing_rate > 0.88)]
 viral_columns = df.columns[(missing_rate < 0.88) & (missing_rate > 0.75)]
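Grouping columns by missing-rate band, as done above, can be sketched on hypothetical toy data (the column names `blood_like`, `viral_like` and `dense` are invented). The point is that the mask has to be computed on the same DataFrame whose columns are being indexed, otherwise the lengths do not match.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 100 rows
toy = pd.DataFrame({
    "blood_like": [1.0] * 11 + [np.nan] * 89,  # 89% missing
    "viral_like": ["x"] * 20 + [np.nan] * 80,  # 80% missing
    "dense": list(range(100)),                 # 0% missing
})

# Missing rate computed on the DataFrame that is indexed below
rate = toy.isna().mean()

# Same bands as in the notes
blood_like_cols = toy.columns[((rate < 0.9) & (rate > 0.88)).to_numpy()]
viral_like_cols = toy.columns[((rate < 0.88) & (rate > 0.75)).to_numpy()]
print(list(blood_like_cols), list(viral_like_cols))
```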

 for col in blood_columns:
    plt.figure()
    sns.distplot(positive_df[col], label='positive')
    sns.distplot(negative_df[col], label='negative')
    plt.legend()

 sns.countplot(x='Patient age quantile', hue='SARS-Cov-2 exam result', data=df)
 pd.crosstab(df['SARS-Cov-2 exam result'], df['Influenza A'])

 for col in viral_columns:
    plt.figure()
    sns.heatmap(pd.crosstab(df['SARS-Cov-2 exam result'], df[col]), annot=True, fmt='d')
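To make the cross-tabulation step concrete, here is a minimal sketch on hypothetical data (the five rows and their values are invented; only the two column names come from these notes). `pd.crosstab` counts each combination of categories, and `normalize="index"` turns the counts into row-wise shares.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "SARS-Cov-2 exam result": ["negative", "negative", "positive", "positive", "negative"],
    "Influenza A": ["not_detected", "detected", "not_detected", "not_detected", "not_detected"],
})

# Count of each (target, Influenza A) combination
table = pd.crosstab(toy["SARS-Cov-2 exam result"], toy["Influenza A"])
print(table)

# Same table as shares within each target class (each row sums to 1)
row_shares = pd.crosstab(
    toy["SARS-Cov-2 exam result"], toy["Influenza A"], normalize="index"
)
```

An integer table like `table` is what the `fmt='d'` heatmap annotation expects; the normalized version holds floats and would need a float format instead.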
