Assignment - Data Preprocessing (All)


Lesson 2: Data Preprocessing Assignments

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

Assignment 1
Problem Statement
Suppose you are a public school administrator. Some schools in your state of Tennessee are performing below average academically. Your superintendent, under pressure from frustrated parents and voters, approached you with the task of understanding why these schools are underperforming. To improve school performance, you need to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers.

Objective:
Perform exploratory data analysis, which includes determining the type of the data and running a correlation analysis over it. You need to convert the data into useful information:

Read the data in pandas data frame

Describe the data to find more details

Find the correlation between 'reduced_lunch' and 'school_rating'

In [4]:
df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/middle_tn_schools.csv")

In [5]:
df.head()

Out[5]: name school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic

0 Allendale Elementary School 5.0 851.0 10.0 90.2 95.8 15.7 Public 89.4 85.2 54.0 2.9 85.5 1.6 5.6

1 Anderson Elementary 2.0 412.0 71.0 32.8 37.3 12.8 Public 43.0 38.3 32.0 3.9 86.7 1.0 4.9

2 Avoca Elementary 4.0 482.0 43.0 78.4 83.6 16.6 Public 75.7 73.0 29.0 1.0 91.5 1.2 4.4

3 Bailey Middle 0.0 394.0 91.0 1.6 1.0 13.1 Public Magnet 2.1 4.4 30.0 80.7 11.7 2.3 4.3

4 Barfield Elementary 4.0 948.0 26.0 85.3 89.2 14.8 Public 81.3 79.6 64.0 11.8 71.2 7.1 6.0

In [11]:
print('Shape of data :', df.shape)

Shape of data : (347, 15)

In [12]:
print('Column Names are:',df.columns)

Column Names are: Index(['name', 'school_rating', 'size', 'reduced_lunch', 'state_percentile_16',

'state_percentile_15', 'stu_teach_ratio', 'school_type', 'avg_score_15',

'avg_score_16', 'full_time_teachers', 'percent_black', 'percent_white',

'percent_asian', 'percent_hispanic'],

dtype='object')

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 347 entries, 0 to 346

Data columns (total 15 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 name 347 non-null object

1 school_rating 347 non-null float64

2 size 347 non-null float64

3 reduced_lunch 347 non-null float64

4 state_percentile_16 347 non-null float64

5 state_percentile_15 341 non-null float64

6 stu_teach_ratio 347 non-null float64

7 school_type 347 non-null object

8 avg_score_15 341 non-null float64

9 avg_score_16 347 non-null float64

10 full_time_teachers 347 non-null float64

11 percent_black 347 non-null float64

12 percent_white 347 non-null float64

13 percent_asian 347 non-null float64

14 percent_hispanic 347 non-null float64

dtypes: float64(13), object(2)

memory usage: 40.8+ KB

In [14]:
df.isnull().sum()

Out[14]: name 0

school_rating 0

size 0

reduced_lunch 0

state_percentile_16 0

state_percentile_15 6

stu_teach_ratio 0

school_type 0

avg_score_15 6

avg_score_16 0

full_time_teachers 0

percent_black 0

percent_white 0

percent_asian 0

percent_hispanic 0

dtype: int64
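The six missing values each in state_percentile_15 and avg_score_15 are detected above but never imputed in this notebook (the info() outputs further down still show 341 non-null). A sketch of one common treatment, median imputation, not executed here:

# Sketch (not run in this notebook): fill each column's missing values with that column's median.
for col in ['state_percentile_15', 'avg_score_15']:
    df[col] = df[col].fillna(df[col].median())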

In [15]:
df.school_type.unique()

Out[15]: array(['Public', 'Public Magnet', 'Public Charter', 'Public Virtual'],

dtype=object)

In [16]:
s_type={'Public':1, 'Public Magnet':2, 'Public Charter':3, 'Public Virtual':4}

In [18]:
df['school_type']=df['school_type'].map(s_type)

In [20]:
df['school_type']=df['school_type'].astype('float')
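An equivalent route that avoids writing the mapping by hand is pandas' categorical codes. A sketch that could stand in for the map/astype pair above; note that cat.codes numbers the categories alphabetically from 0, so the exact values differ from s_type:

# Sketch: derive numeric codes for school_type automatically instead of mapping by hand.
df['school_type'] = df['school_type'].astype('category').cat.codes.astype('float')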

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 347 entries, 0 to 346

Data columns (total 15 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 name 347 non-null object

1 school_rating 347 non-null float64

2 size 347 non-null float64

3 reduced_lunch 347 non-null float64

4 state_percentile_16 347 non-null float64

5 state_percentile_15 341 non-null float64

6 stu_teach_ratio 347 non-null float64

7 school_type 347 non-null float64

8 avg_score_15 341 non-null float64

9 avg_score_16 347 non-null float64

10 full_time_teachers 347 non-null float64

11 percent_black 347 non-null float64

12 percent_white 347 non-null float64

13 percent_asian 347 non-null float64

14 percent_hispanic 347 non-null float64

dtypes: float64(14), object(1)

memory usage: 40.8+ KB

In [23]:
df.describe().round(decimals=2)

Out[23]: school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic

count 347.00 347.00 347.00 347.00 341.00 347.00 347.00 341.0 347.00 347.00 347.00 347.00 347.00 347.00

mean 2.97 699.47 50.28 58.80 58.25 15.46 1.19 57.0 57.05 44.94 21.20 61.67 2.64 11.16

std 1.69 400.60 25.48 32.54 32.70 5.73 0.47 26.7 27.97 22.05 23.56 27.27 3.11 12.03

min 0.00 53.00 2.00 0.20 0.60 4.70 1.00 1.5 0.10 2.00 0.00 1.10 0.00 0.00

25% 2.00 420.50 30.00 30.95 27.10 13.70 1.00 37.6 37.00 30.00 3.60 40.60 0.75 3.80

50% 3.00 595.00 51.00 66.40 65.80 15.00 1.00 61.8 60.70 40.00 13.50 68.70 1.60 6.40

75% 4.00 851.00 71.50 88.00 88.60 16.70 1.00 79.6 80.25 54.00 28.35 85.95 3.10 13.80

max 5.00 2314.00 98.00 99.80 99.80 111.00 4.00 99.0 98.90 140.00 97.40 99.70 21.10 65.20

In [24]:
cormat = df.corr()
cormat.round(decimals=2).style.background_gradient()

Out[24]: school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic

school_rating 1.000000 0.180000 -0.820000 0.990000 0.940000 0.200000 -0.110000 0.940000 0.980000 0.120000 -0.590000 0.640000 0.160000 -0.380000

size 0.180000 1.000000 -0.280000 0.170000 0.160000 0.140000 -0.140000 0.160000 0.140000 0.970000 -0.150000 0.100000 0.190000 -0.020000

reduced_lunch -0.820000 -0.280000 1.000000 -0.820000 -0.830000 -0.200000 0.180000 -0.840000 -0.820000 -0.210000 0.560000 -0.670000 -0.230000 0.490000

state_percentile_16 0.990000 0.170000 -0.820000 1.000000 0.950000 0.190000 -0.090000 0.950000 0.990000 0.120000 -0.570000 0.630000 0.150000 -0.380000

state_percentile_15 0.940000 0.160000 -0.830000 0.950000 1.000000 0.140000 -0.100000 0.990000 0.950000 0.110000 -0.560000 0.610000 0.180000 -0.370000

stu_teach_ratio 0.200000 0.140000 -0.200000 0.190000 0.140000 1.000000 0.290000 0.150000 0.180000 0.020000 -0.120000 0.130000 0.090000 -0.090000

school_type -0.110000 -0.140000 0.180000 -0.090000 -0.100000 0.290000 1.000000 -0.120000 -0.100000 -0.170000 0.490000 -0.430000 -0.040000 0.070000

avg_score_15 0.940000 0.160000 -0.840000 0.950000 0.990000 0.150000 -0.120000 1.000000 0.950000 0.110000 -0.600000 0.640000 0.190000 -0.370000

avg_score_16 0.980000 0.140000 -0.820000 0.990000 0.950000 0.180000 -0.100000 0.950000 1.000000 0.090000 -0.590000 0.640000 0.170000 -0.370000

full_time_teachers 0.120000 0.970000 -0.210000 0.120000 0.110000 0.020000 -0.170000 0.110000 0.090000 1.000000 -0.110000 0.060000 0.150000 0.030000

percent_black -0.590000 -0.150000 0.560000 -0.570000 -0.560000 -0.120000 0.490000 -0.600000 -0.590000 -0.110000 1.000000 -0.870000 -0.110000 0.090000

percent_white 0.640000 0.100000 -0.670000 0.630000 0.610000 0.130000 -0.430000 0.640000 0.640000 0.060000 -0.870000 1.000000 -0.090000 -0.540000

percent_asian 0.160000 0.190000 -0.230000 0.150000 0.180000 0.090000 -0.040000 0.190000 0.170000 0.150000 -0.110000 -0.090000 1.000000 0.190000

percent_hispanic -0.380000 -0.020000 0.490000 -0.380000 -0.370000 -0.090000 0.070000 -0.370000 -0.370000 0.030000 0.090000 -0.540000 0.190000 1.000000

In [25]:
fig, axe = plt.subplots(figsize=(12,8))
sns.heatmap(cormat, annot=True, cmap='YlGnBu', square=True)
plt.show()
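To read the strongest relationships off the matrix programmatically rather than by eye, a small sketch using the cormat computed above:

# Sketch: rank feature pairs by absolute correlation (upper triangle only, to skip duplicates).
# Note: Series.sort_values(key=...) requires pandas 1.1 or newer.
pairs = (
    cormat.where(np.triu(np.ones(cormat.shape, dtype=bool), k=1))
          .stack()
          .sort_values(key=abs, ascending=False)
)
print(pairs.head(10))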

In [27]:
y = df["school_rating"]

x = df["reduced_lunch"]

correlation = y.corr(x)

print('Correlation between school ratin & reduced lunch :',correlation)

Correlation between school ratin & reduced lunch : -0.8157567373058027

Assignment 2
Problem Statement:
Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars within a dataset. In response to feedback given by their customers, they are coming up with a new model. As a result, they have to explore the current dataset to derive further insights out of it.

Objective:
Import the dataset and explore it for dimensionality, data types, and the average value of horsepower across all the cars. Also, identify a few of the most strongly correlated features, which would help with the modification.

In [28]:
mtcar = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/mtcars.csv")

In [29]:
mtcar.head()

Out[29]: model mpg cyl disp hp drat wt qsec vs am gear carb

0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

In [32]:
print("Dimensions of mtcar dataset are(Row:Column):",mtcar.shape)

Dimensions of mtcar dataset are(Row:Column): (32, 12)

In [33]:
print('Data types : \n', mtcar.dtypes)

Data types :

model object

mpg float64

cyl int64

disp float64

hp int64

drat float64

wt float64

qsec float64

vs int64

am int64

gear int64

carb int64

dtype: object

In [36]:
mtcar.hp.mean()

Out[36]: 146.6875

In [39]:
mtcar.groupby(['model']).hp.mean()  # each model appears only once, so this simply lists hp per model

Out[39]: model
AMC Javelin 150

Cadillac Fleetwood 205

Camaro Z28 245

Chrysler Imperial 230

Datsun 710 93

Dodge Challenger 150

Duster 360 245

Ferrari Dino 175

Fiat 128 66

Fiat X1-9 66

Ford Pantera L 264

Honda Civic 52

Hornet 4 Drive 110

Hornet Sportabout 175

Lincoln Continental 215

Lotus Europa 113

Maserati Bora 335

Mazda RX4 110

Mazda RX4 Wag 110

Merc 230 95

Merc 240D 62

Merc 280 123

Merc 280C 123

Merc 450SE 180

Merc 450SL 180

Merc 450SLC 180

Pontiac Firebird 175

Porsche 914-2 91

Toyota Corolla 65

Toyota Corona 97

Valiant 105

Volvo 142E 109

Name: hp, dtype: int64

In [40]:
cormat = mtcar.corr()
cormat.round(decimals=2).style.background_gradient()

Out[40]: mpg cyl disp hp drat wt qsec vs am gear carb

mpg 1.000000 -0.850000 -0.850000 -0.780000 0.680000 -0.870000 0.420000 0.660000 0.600000 0.480000 -0.550000

cyl -0.850000 1.000000 0.900000 0.830000 -0.700000 0.780000 -0.590000 -0.810000 -0.520000 -0.490000 0.530000

disp -0.850000 0.900000 1.000000 0.790000 -0.710000 0.890000 -0.430000 -0.710000 -0.590000 -0.560000 0.390000

hp -0.780000 0.830000 0.790000 1.000000 -0.450000 0.660000 -0.710000 -0.720000 -0.240000 -0.130000 0.750000

drat 0.680000 -0.700000 -0.710000 -0.450000 1.000000 -0.710000 0.090000 0.440000 0.710000 0.700000 -0.090000

wt -0.870000 0.780000 0.890000 0.660000 -0.710000 1.000000 -0.170000 -0.550000 -0.690000 -0.580000 0.430000

qsec 0.420000 -0.590000 -0.430000 -0.710000 0.090000 -0.170000 1.000000 0.740000 -0.230000 -0.210000 -0.660000

vs 0.660000 -0.810000 -0.710000 -0.720000 0.440000 -0.550000 0.740000 1.000000 0.170000 0.210000 -0.570000

am 0.600000 -0.520000 -0.590000 -0.240000 0.710000 -0.690000 -0.230000 0.170000 1.000000 0.790000 0.060000

gear 0.480000 -0.490000 -0.560000 -0.130000 0.700000 -0.580000 -0.210000 0.210000 0.790000 1.000000 0.270000

carb -0.550000 0.530000 0.390000 0.750000 -0.090000 0.430000 -0.660000 -0.570000 0.060000 0.270000 1.000000

In [41]:
fig, axe = plt.subplots(figsize=(12,8))
sns.heatmap(cormat, annot=True, cmap='YlGnBu', square=True)
plt.show()

Assignment 3
Problem Statement:
Mtcars, the automobile company in the United States, has planned to rework on optimizing the horsepower of their cars, as most of the customer feedback centred around horsepower. However, while developing an ML model with respect to horsepower, the efficiency of the model was compromised. Irregularities in the data might be one of the causes.

Objective:
Check for missing values and outliers within the horsepower column and remove them.

In [42]:
mtcar['hp'].isnull().sum()

Out[42]: 0

In [48]:
sns.boxplot(mtcar['hp']);

In [46]:
hp_mask = mtcar['hp'] < 300  # boolean mask: True where hp is below 300

In [47]:
hp_filter = mtcar[hp_mask]

In [49]:
sns.boxplot(hp_filter['hp']);
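The 300 hp cutoff above is read off the boxplot by eye. A sketch of the same filtering driven by the IQR rule instead, so the threshold follows the data:

# Sketch: keep only hp values inside the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
q1, q3 = mtcar['hp'].quantile([0.25, 0.75])
iqr = q3 - q1
hp_no_outliers = mtcar[mtcar['hp'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(len(mtcar) - len(hp_no_outliers), "row(s) removed")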

Assignment 4
Problem Statement:
Load the diabetes dataset (load_diabetes) from sklearn and check for any missing values or outliers in the data. If any irregularities are found, treat them accordingly.

Objective:
Perform missing value and outlier data treatment.

In [50]:
from sklearn.datasets import load_diabetes

In [52]:
data = load_diabetes()

In [53]:
print(data.DESCR)

.. _diabetes_dataset:

Diabetes dataset

----------------

Ten baseline variables, age, sex, body mass index, average blood

pressure, and six blood serum measurements were obtained for each of n =

442 diabetes patients, as well as the response of interest, a

quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:

- age age in years

- sex

- bmi body mass index

- bp average blood pressure

- s1 tc, T-Cells (a type of white blood cells)

- s2 ldl, low-density lipoproteins

- s3 hdl, high-density lipoproteins

- s4 tch, thyroid stimulating hormone

- s5 ltg, lamotrigine

- s6 glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.

(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [56]:
diabetes = pd.DataFrame(data.data, columns=data.feature_names)
diabetes.head()

Out[56]: age sex bmi bp s1 s2 s3 s4 s5 s6

0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646

1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204

2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930

3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362

4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641

In [58]:
diabetes.isna().sum()

Out[58]: age 0

sex 0

bmi 0

bp 0

s1 0

s2 0

s3 0

s4 0

s5 0

s6 0

dtype: int64

In [61]:
diabetes.describe().round(decimals=3)

Out[61]: age sex bmi bp s1 s2 s3 s4 s5 s6

count 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000

mean -0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 -0.000

std 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048

min -0.107 -0.045 -0.090 -0.112 -0.127 -0.116 -0.102 -0.076 -0.126 -0.138

25% -0.037 -0.045 -0.034 -0.037 -0.034 -0.030 -0.035 -0.039 -0.033 -0.033

50% 0.005 -0.045 -0.007 -0.006 -0.004 -0.004 -0.007 -0.003 -0.002 -0.001

75% 0.038 0.051 0.031 0.036 0.028 0.030 0.029 0.034 0.032 0.028

max 0.111 0.051 0.171 0.132 0.154 0.199 0.181 0.185 0.134 0.136

In [69]:
diabetes.boxplot(figsize=(12,6),fontsize=14)

Out[69]: <AxesSubplot:>

In [83]:
cols = diabetes.columns
cols

Out[83]: Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')

In [84]:
df=diabetes.copy()

In [85]:
def iqr_capping(df, cols):
    # Cap each column at the Tukey fences: values beyond
    # Q1 - 1.5*IQR or Q3 + 1.5*IQR are clipped to the fence.
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        UC = q3 + (1.5 * iqr)   # upper cap
        LC = q1 - (1.5 * iqr)   # lower cap
        df[col] = np.where(df[col] > UC, UC, np.where(df[col] < LC, LC, df[col]))

In [86]:
iqr_capping(df,cols)

In [87]:
df.boxplot(figsize=(12,6),fontsize=14)

Out[87]: <AxesSubplot:>
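As a quick sanity check (a sketch), comparing the capped copy df against the untouched diabetes frame counts how many cells were clipped in each column:

# Cells that differ from the original were outside the IQR fences and got capped.
print((df != diabetes).sum())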

Assignment 5
Problem Statement:
As a macroeconomic analyst at the Organization for Economic Cooperation and Development (OECD), your job is to collect relevant data for analysis. It looks like you have three countries in the north_america data frame and one country in the south_america data frame. As these are in two separate plots, it is hard to compare the average labor hours between North America and South America. If all the countries were in the same data frame, it would be much easier to make this comparison.

Objective:
Demonstrate concatenation.

In [3]:
df1 = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/north_america_2000_2010.csv")
df1.head()

Out[3]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

0 Canada 1779.0 1771.0 1754.0 1740.0 1760.0 1747 1745.0 1741.0 1735 1701.0 1703.0

1 Mexico 2311.2 2285.2 2271.2 2276.5 2270.6 2281 2280.6 2261.4 2258 2250.2 2242.4

2 USA 1836.0 1814.0 1810.0 1800.0 1802.0 1799 1800.0 1798.0 1792 1767.0 1778.0

In [4]:
df2 = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/south_america_2000_2010.csv")
df2.head()

Out[4]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

0 Chile 2263 2242 2250 2235 2232 2157 2165 2128 2095 2074 2069.6

In [14]:
df_concatenated = pd.concat([df1, df2], keys=['North_America', 'South_America'])
df_concatenated.head()

Out[14]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

North_America 0 Canada 1779.0 1771.0 1754.0 1740.0 1760.0 1747 1745.0 1741.0 1735 1701.0 1703.0

1 Mexico 2311.2 2285.2 2271.2 2276.5 2270.6 2281 2280.6 2261.4 2258 2250.2 2242.4

2 USA 1836.0 1814.0 1810.0 1800.0 1802.0 1799 1800.0 1798.0 1792 1767.0 1778.0

South_America 0 Chile 2263.0 2242.0 2250.0 2235.0 2232.0 2157 2165.0 2128.0 2095 2074.0 2069.6

In [28]:
df_concatenated.groupby(level=0).mean()

Out[28]: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

North_America 1975.4 1956.733333 1945.066667 1938.833333 1944.2 1942.333333 1941.866667 1933.466667 1928.333333 1906.066667 1907.8

South_America 2263.0 2242.000000 2250.000000 2235.000000 2232.0 2157.000000 2165.000000 2128.000000 2095.000000 2074.000000 2069.6
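With both regions in one frame, the comparison the problem statement asks about becomes a single plot. A sketch:

# Sketch: average labor hours per year, one line per region.
df_concatenated.groupby(level=0).mean().T.plot(figsize=(10, 5), marker='o')
plt.ylabel('Average annual hours worked')
plt.show()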

Assignment 6
Problem Statement:
SFO Public Department, referred to as SFO, has captured all the salary data of its employees from the years 2011-2014. Now, in 2018, the organization is facing a financial crisis. As a first step, HR wants to rationalize employee cost to save payroll budget. You have to do data manipulation and answer the questions below:
1. How much has the total salary cost increased from 2011 to 2014?
2. Who was the top-earning employee across all the years?

Objective:
Perform data manipulation and visualization techniques.

In [12]:
Salary_df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/Salaries.csv")
Salary_df.head()

Out[12]: Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status

0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN

1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 538909.28 2011 NaN San Francisco NaN

2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 335279.91 2011 NaN San Francisco NaN

3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 332343.61 2011 NaN San Francisco NaN

4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 326373.19 2011 NaN San Francisco NaN

In [5]:
# Rescale TotalPayBenefits to units of 100,000.
# (The head() below shows values near 0.000057 rather than ~5.7, which suggests this
# cell was executed more than once; the execution counts also show the raw CSV was
# re-read at In [12] afterwards, so the sums later on are back on the raw scale.)
Salary_df['TotalPayBenefits'] = Salary_df.TotalPayBenefits / 100000
Salary_df.head()

Out[5]: Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status

0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 0.000057 2011 NaN San Francisco NaN

1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 0.000054 2011 NaN San Francisco NaN

2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 0.000034 2011 NaN San Francisco NaN

3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 0.000033 2011 NaN San Francisco NaN

4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 0.000033 2011 NaN San Francisco NaN

In [13]:
Salary_df['EmployeeName'].unique()

Out[13]: array(['NATHANIEL FORD', 'GARY JIMENEZ', 'ALBERT PARDINI', ...,

'Mark W Mcclure', 'Charlene D Mccully', 'Joe Lopez'], dtype=object)

In [14]:
Salary_df['EmployeeName'] = Salary_df['EmployeeName'].str.lower()  # normalize case so the same person groups together

In [15]:
Salary_df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 148648 entries, 0 to 148647

Data columns (total 13 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Id 148648 non-null int64

1 EmployeeName 148648 non-null object

2 JobTitle 148648 non-null object

3 BasePay 148043 non-null float64

4 OvertimePay 148648 non-null float64

5 OtherPay 148648 non-null float64

6 Benefits 112490 non-null float64

7 TotalPay 148648 non-null float64

8 TotalPayBenefits 148648 non-null float64

9 Year 148648 non-null int64

10 Notes 0 non-null float64

11 Agency 148648 non-null object

12 Status 38119 non-null object

dtypes: float64(7), int64(2), object(4)

memory usage: 14.7+ MB

1. How much has the total salary cost increased from 2011 to 2014?
In [84]:
Salary_Group = Salary_df.groupby('Year')['TotalPayBenefits'].sum()
Salary_Group.head()

Out[84]: Year

2011 2.594113e+09

2012 3.696790e+09

2013 3.814772e+09

2014 3.821866e+09

Name: TotalPayBenefits, dtype: float64

In [85]:
Salary_2011 = Salary_Group.loc[2011]
Salary_2014 = Salary_Group.loc[2014]  # select by year label to avoid off-by-one indexing (iloc[2] would be 2013)
Increase_Salary = (Salary_2014 - Salary_2011) / Salary_2011 * 100
print(round(Increase_Salary, 2))

47.33

2. Who was the top-earning employee across all the years?
In [24]:
TopEarn_pivot = Salary_df.pivot_table(index='EmployeeName', columns='Year', values='TotalPayBenefits', aggfunc='mean', margins=True)
TopEarn_pivot.head()

Out[24]: Year 2011 2012 2013 2014 All

EmployeeName

a bernard fatooh 20039.91 23514.85 29379.24 30153.03 25771.7575

a elizabeth marchasin 26282.86 NaN NaN NaN 26282.8600

a jamil niazi 87496.21 NaN NaN NaN 87496.2100

a k finizio NaN NaN NaN 26113.37 26113.3700

a. james robertson ii NaN NaN 22601.80 NaN 22601.8000

In [30]:
TopEarn_pivot = TopEarn_pivot.sort_values(['All'], ascending=False)
TopEarn_pivot.head()

Out[30]: Year 2011 2012 2013 2014 All

EmployeeName

nathaniel ford 567595.43 NaN NaN NaN 567595.430

gary jimenez 538909.28 NaN NaN NaN 538909.280

william j coaker jr. NaN NaN NaN 436224.36 436224.360

amy p hart NaN NaN 383746.78 479652.21 431699.495

gregory p suhr NaN NaN 425815.28 418019.22 421917.250

In [44]:
print("Top Earning employee is",TopEarn_pivot.index[0])

Top Earning employee is nathaniel ford
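The pivot's 'All' column ranks employees by their mean pay across the years they appear in. If "top earning" is instead read as the single highest annual TotalPayBenefits, a one-line sketch gives the same answer here:

# Sketch: employee on the row with the largest TotalPayBenefits.
print(Salary_df.loc[Salary_df['TotalPayBenefits'].idxmax(), 'EmployeeName'])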

Assignment 7 - Kaggle - Covid_south_america


Perform the EDA
Read the data

Size, Shape, Datatypes

Statistical description

Missing Values Detection and Treatment

Outlier Detection and Treatment

Correlation heatmap

In [45]:
covid_df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/covid_south_america_weekly_trend.csv")
covid_df.head()

Out[45]: Country/Other Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population

0 Argentina 70514 94792 -26 1537 863 1198 -28 19 45881008

1 Bolivia 3537 6862 -48 296 47 111 -58 4 11936049

2 Brazil 576463 741844 -22 2681 5051 5814 -13 23 215056109

3 Chile 194646 234595 -17 10040 874 732 19 45 19387329

4 Colombia 17132 29098 -41 331 602 1043 -42 12 51778918

In [46]:
print('Shape of data :',covid_df.shape)

Shape of data : (13, 10)

In [50]:
print('Size of data :',covid_df.size)

Size of data : 130

In [53]:
print('Data types : \n', covid_df.dtypes)

Data types :

Country/Other object

Cases in the last 7 days int64

Cases in the preceding 7 days int64

Weekly Case % Change int64

Cases in the last 7 days/1M pop int64

Deaths in the last 7 days int64

Deaths in the preceding 7 days int64

Weekly Death % Change int64

Deaths in the last 7 days/1M pop int64

Population int64

dtype: object

In [52]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 13 entries, 0 to 12

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Country/Other 13 non-null object

1 Cases in the last 7 days 13 non-null int64

2 Cases in the preceding 7 days 13 non-null int64

3 Weekly Case % Change 13 non-null int64

4 Cases in the last 7 days/1M pop 13 non-null int64

5 Deaths in the last 7 days 13 non-null int64

6 Deaths in the preceding 7 days 13 non-null int64

7 Weekly Death % Change 13 non-null int64

8 Deaths in the last 7 days/1M pop 13 non-null int64

9 Population 13 non-null int64

dtypes: int64(9), object(1)

memory usage: 1.1+ KB

In [47]:
print('Column Names are:',covid_df.columns)

Column Names are: Index(['Country/Other', 'Cases in the last 7 days',

'Cases in the preceding 7 days', 'Weekly Case % Change',

'Cases in the last 7 days/1M pop', 'Deaths in the last 7 days',

'Deaths in the preceding 7 days', 'Weekly Death % Change',

'Deaths in the last 7 days/1M pop', 'Population'],

dtype='object')

In [54]:
covid_df.isna().sum()

Out[54]: Country/Other 0

Cases in the last 7 days 0

Cases in the preceding 7 days 0

Weekly Case % Change 0

Cases in the last 7 days/1M pop 0

Deaths in the last 7 days 0

Deaths in the preceding 7 days 0

Weekly Death % Change 0

Deaths in the last 7 days/1M pop 0

Population 0

dtype: int64

In [56]:
covid_df.describe()

Out[56]: Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population

count 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 1.300000e+01

mean 71469.384615 95625.076923 -39.846154 2005.076923 685.769231 824.846154 -27.538462 17.307692 3.358683e+07

std 160814.226454 204458.574605 13.637242 3107.834789 1370.556526 1569.117631 28.814393 12.565725 5.716988e+07

min 198.000000 290.000000 -61.000000 133.000000 4.000000 3.000000 -58.000000 1.000000 3.114600e+05

25% 3537.000000 5834.000000 -48.000000 331.000000 29.000000 64.000000 -50.000000 10.000000 3.493624e+06

50% 8999.000000 21531.000000 -41.000000 636.000000 96.000000 121.000000 -42.000000 13.000000 1.808557e+07

75% 25974.000000 55235.000000 -32.000000 1537.000000 863.000000 1043.000000 -13.000000 23.000000 3.373056e+07

max 576463.000000 741844.000000 -17.000000 10040.000000 5051.000000 5814.000000 33.000000 45.000000 2.150561e+08
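The objective also calls for outlier detection and treatment, which the cells below stop short of. A sketch reusing the iqr_capping helper defined in Assignment 4 on the numeric columns:

# Sketch: cap outliers in a copy of the data using the IQR fences.
num_cols = covid_df.select_dtypes(include='number').columns
covid_capped = covid_df.copy()
iqr_capping(covid_capped, num_cols)
covid_capped.boxplot(figsize=(12, 6), rot=90)
plt.show()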

In [57]:
cormat = covid_df.corr()
cormat.round(decimals=2).style.background_gradient()

Out[57]: Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population

Cases in the last 7 days 1.000000 1.000000 0.580000 0.330000 0.960000 0.950000 0.310000 0.370000 0.920000

Cases in the preceding 7 days 1.000000 1.000000 0.550000 0.310000 0.970000 0.960000 0.310000 0.370000 0.930000

Weekly Case % Change 0.580000 0.550000 1.000000 0.600000 0.420000 0.410000 0.490000 0.360000 0.400000

Cases in the last 7 days/1M pop 0.330000 0.310000 0.600000 1.000000 0.150000 0.100000 0.510000 0.760000 0.030000

Deaths in the last 7 days 0.960000 0.970000 0.420000 0.150000 1.000000 1.000000 0.260000 0.330000 0.970000

Deaths in the preceding 7 days 0.950000 0.960000 0.410000 0.100000 1.000000 1.000000 0.210000 0.280000 0.990000

Weekly Death % Change 0.310000 0.310000 0.490000 0.510000 0.260000 0.210000 1.000000 0.640000 0.120000

Deaths in the last 7 days/1M pop 0.370000 0.370000 0.360000 0.760000 0.330000 0.280000 0.640000 1.000000 0.150000

Population 0.920000 0.930000 0.400000 0.030000 0.970000 0.990000 0.120000 0.150000 1.000000
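The objectives list a correlation heatmap, which the notebook ends before drawing. The same pattern used in Assignments 1 and 2 applies to the cormat just computed:

fig, axe = plt.subplots(figsize=(12, 8))
sns.heatmap(cormat, annot=True, cmap='YlGnBu', square=True)
plt.show()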

In [ ]:
