Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

8/30/2021 CMSU Survey Data Analysis

The Student News Service at Clear Mountain State


University (CMSU) has decided to gather data about
the undergraduate students that attend CMSU. CMSU
creates and distributes a survey of 14 questions and
receives responses from 62 undergraduates (stored in
the Survey data set).
Importing the Libraries
In [1]:
import pandas as pd

import numpy as np

import copy

import matplotlib.pyplot as plt

import seaborn as sns

import pylab

import math

%matplotlib inline

import os

import warnings

warnings.filterwarnings('ignore')

Loading the Data


In [4]:
survey = pd.read_csv('Survey-1.csv')

survey.head()

Out[4]: Grad Social


ID Gender Age Class Major GPA Employment Salary Satis
Intention Networking

0 1 Female 20 Junior Other Yes 2.9 Full-Time 50.0 1

1 2 Male 23 Senior Management Yes 3.6 Part-Time 25.0 1

2 3 Male 21 Junior Other Yes 2.5 Part-Time 45.0 2

3 4 Male 21 Junior CIS Yes 2.5 Full-Time 40.0 4

4 5 Male 23 Senior Other Undecided 2.8 Unemployed 40.0 2

Summary of the dataset


In [5]:
survey.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 62 entries, 0 to 61

Data columns (total 14 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 ID 62 non-null int64

1 Gender 62 non-null object

2 Age 62 non-null int64

3 Class 62 non-null object

4 Major 62 non-null object

5 Grad Intention 62 non-null object

6 GPA 62 non-null float64

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 1/13


8/30/2021 CMSU Survey Data Analysis

7 Employment 62 non-null object

8 Salary 62 non-null float64

9 Social Networking 62 non-null int64

10 Satisfaction 62 non-null int64

11 Spending 62 non-null int64

12 Computer 62 non-null object

13 Text Messages 62 non-null int64

dtypes: float64(2), int64(6), object(6)

memory usage: 6.9+ KB

In [6]:
survey.describe()

Out[6]: Social Tex


ID Age GPA Salary Satisfaction Spending
Networking Message

count 62.000000 62.000000 62.000000 62.000000 62.000000 62.000000 62.000000 62.00000

mean 31.500000 21.129032 3.129032 48.548387 1.516129 3.741935 482.016129 246.20967

std 18.041619 1.431311 0.377388 12.080912 0.844305 1.213793 221.953805 214.46595

min 1.000000 18.000000 2.300000 25.000000 0.000000 1.000000 100.000000 0.00000

25% 16.250000 20.000000 2.900000 40.000000 1.000000 3.000000 312.500000 100.00000

50% 31.500000 21.000000 3.150000 50.000000 1.000000 4.000000 500.000000 200.00000

75% 46.750000 22.000000 3.400000 55.000000 2.000000 4.000000 600.000000 300.00000

max 62.000000 26.000000 3.900000 80.000000 4.000000 6.000000 1400.000000 900.00000

Checking the missing value


In [7]:
survey.isnull().sum()

Out[7]: ID 0

Gender 0

Age 0

Class 0

Major 0

Grad Intention 0

GPA 0

Employment 0

Salary 0

Social Networking 0

Satisfaction 0

Spending 0

Computer 0

Text Messages 0

dtype: int64

2.1. For this data, construct the following contingency tables


(Keep Gender as row variable)
2.1.1. Gender and Major

2.1.2. Gender and Grad Intention

2.1.3. Gender and Employment

2.1.4. Gender and Computer

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 2/13


8/30/2021 CMSU Survey Data Analysis

In [25]: print("2.1.1 - Below is the output of Contingency Tables For Gender and Major")

survey_crosstab = pd.crosstab(survey['Gender'],survey['Major'],margins = False)

print(survey_crosstab)

2.1.1 - Below is the output of Contingency Tables For Gender and Major

Major Accounting CIS Economics/Finance International Business \

Gender

Female 3 3 7 4

Male 4 1 4 2

Major Management Other Retailing/Marketing Undecided

Gender

Female 4 3 9 0

Male 6 4 5 3

In [26]:
print("2.1.2 - Below is the output of Contingency Tables For Gender and Grad Intenti
survey_crosstab = pd.crosstab(survey['Gender'],survey['Grad Intention'],margins = Fa
print(survey_crosstab)

2.1.2 - Below is the output of Contingency Tables For Gender and Grad Intention

Grad Intention No Undecided Yes


Gender
Female 9 13 11
Male 3 9 17

In [27]:
print("2.1.3 - Below is the output of Contingency Tables For Gender and Employment")
survey_crosstab = pd.crosstab(survey['Gender'],survey['Employment'],margins = False)
print(survey_crosstab)

2.1.3 - Below is the output of Contingency Tables For Gender and Employment

Employment Full-Time Part-Time Unemployed

Gender

Female 3 24 6

Male 7 19 3

In [28]:
print("2.1.4 - Below is the output of Contingency Tables For Gender and Computer")

survey_crosstab = pd.crosstab(survey['Gender'],survey['Computer'],margins = False)

print(survey_crosstab)

2.1.4 - Below is the output of Contingency Tables For Gender and Computer

Computer Desktop Laptop Tablet

Gender

Female 2 29 2

Male 3 26 0

2.2. Assume that the sample is representative of the population


of CMSU. Based on the data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?

2.2.2. What is the probability that a randomly selected CMSU student will be female?

In [50]:
survey['Gender'].value_counts()

Out[50]: Female 33

Male 29

Name: Gender, dtype: int64

In [217…
Female =33

Male =29

Total =Female+Male

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 3/13


8/30/2021 CMSU Survey Data Analysis

print("2.2.1 - Probability that a randomly selected CMSU student will be male", Male
print("2.2.1 - Probability that a randomly selected CMSU student will be Female", Fe

2.2.1 - Probability that a randomly selected CMSU student will be male 46.7741935483
87096 = 46.77%

2.2.1 - Probability that a randomly selected CMSU student will be Female 53.22580645
16129 = 53.22%

2.3. Assume that the sample is representative of the population


of CMSU. Based on the data, answer the following question:
2.3.1. Find the conditional probability of different majors among the male students in CMSU.

2.3.2 Find the conditional probability of different majors among the female students of CMSU.

In [73]:
##Using contingency tables of Gender and Majors we got the total numbers of males an

pd.crosstab(index=survey['Gender'], columns=survey['Major'])

Out[73]: International
Major Accounting CIS Economics/Finance Management Other Retailing/Marketin
Business

Gender

Female 3 3 7 4 4 3

Male 4 1 4 2 6 4

In [79]:
survey['Gender'].value_counts()

Out[79]: Female 33

Male 29

Name: Gender, dtype: int64

In [132…
print(" Solution to 2.3.1 - Below is the conditional probability of different majors
Male=29 #Total number of male

print("Probability of Males opting for Accounting",4/Male*100) #Total number major t


print("Probability of Males opting for CIS",1/Male*100)

print("Probability of Males opting for Economics/Finance",4/Male*100)

print("Probability of Males opting for International Business",2/Male*100)

print("Probability of Males opting for Management",6/Male*100)

print("Probability of Males opting for Other",4/Male*100)

print("Probability of Males opting for Retailing/Marketing",5/Male*100)

print("Probability of Males opting for Undecided",3/Male*100)

Solution to 2.3.1 - Below is the conditional probability of different majors among


the male students in CMSU.

Probability of Males opting for Accounting 13.793103448275861

Probability of Males opting for CIS 3.4482758620689653

Probability of Males opting for Economics/Finance 13.793103448275861

Probability of Males opting for International Business 6.896551724137931

Probability of Males opting for Management 20.689655172413794

Probability of Males opting for Other 13.793103448275861

Probability of Males opting for Retailing/Marketing 17.24137931034483

Probability of Males opting for Undecided 10.344827586206897

Observation : From this output we can easily say that most of the males students prefer
Management as Majors and CIS is the least preferred one

In [226…
print("Solution to 2.3.2 - Below is the conditional probability of different majors

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 4/13


8/30/2021 CMSU Survey Data Analysis

Female=33 #Total number of Female

print("Probability of Females opting for Accounting",3/Female*100) #Total number maj


print("Probability of Females opting for CIS",3/Female*100)

print("Probability of Females opting for Economics/Finance",7/Female*100)

print("Probability of Females opting for International Business",4/Female*100)

print("Probability of Females opting for Management",4/Female*100)

print("Probability of Females opting for Other",3/Female*100)

print("Probability of Females opting for Retailing/Marketing",9/Female*100)

print("Probability of Females opting for Undecided",0/Female*100)

Solution to 2.3.2 - Below is the conditional probability of different majors among t


he Female students in CMSU.

Probability of Females opting for Accounting 9.090909090909092

Probability of Females opting for CIS 9.090909090909092

Probability of Females opting for Economics/Finance 21.21212121212121

Probability of Females opting for International Business 12.121212121212121

Probability of Females opting for Management 12.121212121212121

Probability of Females opting for Other 9.090909090909092

Probability of Females opting for Retailing/Marketing 27.27272727272727

Probability of Females opting for Undecided 0.0

Obervation: From this output we can easily say that most of the females students prefer
Retailing/Marketing as Majors

2.4. Assume that the sample is a representative of the population


of CMSU. Based on the data, answer the following question:
2.4.1. Find the probability That a randomly chosen student is a male and intends to graduate.

2.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.

In [119…
## 2.4.1. Find the probability That a randomly chosen student is a male and intends
## Using the Contingency table to Gender and Grad Intention We will get the total nu

print("Below is the output of Contingency Tables For Gender and Grad Intention")

pd.crosstab(index=survey['Gender'], columns=survey['Grad Intention'])

Below is the output of Contingency Tables For Gender and Grad Intention

Out[119… Grad Intention No Undecided Yes

Gender

Female 9 13 11

Male 3 9 17

In [134…
# Solution = Intended Male = Decided male/Totalmale

Total_Male = 3+9+17 #sum of No, Undecided & Yes.

Decided_Male = 17

print("Solution to 2.4.1 - The probability That a randomly chosen student is a male

Solution to 2.4.1 - The probability That a randomly chosen student is a male and int
ends to graduate is 58.620689655172406

Observation : Using contingency tables of Gender and Grad Intention we got the total
numbers of males and number of males intends to be graduate and post calculation we find
out that - Probability of Males and intends to be Graduate. is 58.62%

In [118…
## 2.4.2 Find the probability that a randomly selected student is a female and does

## Using the Contingency table to Gender and Computer We will get the total number o

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 5/13


8/30/2021 CMSU Survey Data Analysis

print("Below is the output of Contingency Tables For Gender and Computer")

pd.crosstab(index=survey['Gender'], columns=survey['Computer'])

Below is the output of Contingency Tables For Gender and Computer

Out[118… Computer Desktop Laptop Tablet

Gender

Female 2 29 2

Male 3 26 0

In [136…
## Solution = Female without laptop = Laptop/Totalfemale*100

Total_Female=2+29+2 ## Sum of desktop, laptop, tablet female users.


No_Laptop_Female=2+2 ## Sum of Desktop, Tablet

print("Solution for 2.4.2 - The probability that a randomly selected student is a fe

Solution for 2.4.2 - The probability that a randomly selected student is a female an
d does NOT have a laptop is 12.121212121212121

Observation : Using contingency tables of Gender and Computer we got the total numbers of females and
number of females does not have a laptop And post calculation we find out that - Probability of randomly
selected student is a Female and does NOT have a laptop. is 12.1%

2.5. Assume that the sample is representative of the population


of CMSU. Based on the data, answer the following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time
employment?

2.5.2. Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management.

In [120…
##2.5.1. Find the probability that a randomly chosen student is a male or has full-t

## Using the Contingency table to Gender and Employment We will get the total number

pd.crosstab(index=survey['Gender'], columns=survey['Employment'])

Out[120… Employment Full-Time Part-Time Unemployed

Gender

Female 3 24 6

Male 7 19 3

In [137…
#solution: Total_male_student+Total_full_time_student - male_full_time_student/total

Total_male_student = 29

Total_fulltime_student = 10 ## Full time Female & Male

Male_fulltime_student = 7

Total_student = 62 ## Sum of all type of employment and gender

print("Solution of 2.5.1 - The probability that a randomly chosen student is a male

Solution of 2.5.1 - The probability that a randomly chosen student is a male or has
full-time employment is 27.70967741935484

Using contingency tables of Gender and Employment we got the total


numbers of males and number of males who are full time employed And post
localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 6/13
8/30/2021
p y
CMSU Survey Data Analysis
p
calculation we find out that - Probability of randomly chosen student is either
Male or has full time employment. is 28%
In [129…
### 2.5.2. Find the conditional probability that given a female student is randomly

## Using the contingency table to Gender and Major We will get the total number of f

pd.crosstab(index=survey['Gender'], columns=survey['Major'])

Out[129… International
Major Accounting CIS Economics/Finance Management Other Retailing/Marketin
Business

Gender

Female 3 3 7 4 4 3

Male 4 1 4 2 6 4

In [140…
## Solution: (Female_International_Business+ Female_Management) / (Total_Female_stud

Female_International_Business = 4

Female_Management = 4

Total_Female_student = 3+3+7+4+4+3+9 ## Female is 33 which is sume of all Majors to


print("Solution of 2.5.2 - The conditional probability that given a female student i

Solution of 2.5.2 - The conditional probability that given a female student is rando
mly chosen, she is majoring in international business or management 24.2424242424242
42

Observation : Using contingency tables of Gender and Major we got the total numbers of females and number
of females majoring in international business or management and post calculation we find out that - Probability
that given a female student is randomly chosen, she is majoring in international business or management is
24.24%

2.6. Construct a contingency table of Gender and Intent to


Graduate at 2 levels (Yes/No). The Undecided students are not
considered now and the table is a 2x2 table. Do you think the
graduate intention and being female are independent events?
In [144…
pd.crosstab(index=survey['Gender'], columns=survey['Grad Intention'])

Out[144… Grad Intention No Undecided Yes

Gender

Female 9 13 11

Male 3 9 17

In [ ]:
#Solution : 2.6 (Solved over Excel - Screenshot/cell copied to the business case)

#The Probablity that a radomly selected student'being female'

#The probablity that a radomly selected student the graduate intention and being fem

#P(Grad Intention Yes) = 28/40*100 = 70

#P(Grad Intention Yes|Female) = 11/20*100 =55

### The Probablity are not equal. This suggests that the two events are independent.

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 7/13


8/30/2021 CMSU Survey Data Analysis

2.7. Note that there are four numerical (continuous) variables in


the data set, GPA, Salary, Spending, and Text Messages.
Answer the following questions based on the data

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?

2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the
conditional probability that a randomly selected female earns 50 or more.

In [153…
#2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is

#Using contingency tables of Gender and GPA we got the total numbers of students and

pd.crosstab(index=survey['Gender'], columns=survey['GPA'])

Out[153… GPA 2.3 2.4 2.5 2.6 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9

Gender

Female 1 1 2 0 1 3 5 2 4 3 2 4 1 2 1 1

Male 0 0 4 2 2 1 2 5 2 2 5 2 2 0 0 0

In [174…
# Solution - Total number of student are P(T) = 62

#Total number of the students having GPA less than 3 are P(G<3) =17

#(Student_less_than_3)/(Total_Student)*100

Male=29 #sum of Total male in the data set - Gendar

Female=33 #sun of total female in the data set - Gendar

Total_Student=Male+Female #sum of Total males & Females

Male_less_then_3_GPA= 9 #Sum of males less then 3 GPA

Female_less_3_GPA=8 #Sum of females less then 3 GPA

Total_Student_Less_then_3GPA = Male_less_then_3_GPA+Female_less_3_GPA

print("Solution of 2.7.1 - The Probablity that GPA of the student is less than 3

Solution of 2.7.1 - The Probablity that GPA of the student is less than 3 is 27.419
354838709676

Observation - Using contingency tables of Gender and GPA we got the total numbers of
students and number of students GPA less than 3 and post calculation we find out that -
Probability that student is chosen randomly and that his/her GPA is less than 3 is 27.41%

In [163…
#2.7.2. Find the conditional probability that a randomly selected male earns 50 or m
# Using contingency tables of Gender and GPA we got the total numbers of students an

pd.crosstab(index=survey['Gender'], columns=survey['Salary'])

Out[163… Salary 25.0 30.0 35.0 37.0 37.5 40.0 42.0 45.0 47.0 47.5 50.0 52.0 54.0 55.0 60.0 65

Gender

Female 0 5 1 0 1 5 1 1 0 1 5 0 0 5 5

Male 1 0 1 1 0 7 0 4 1 0 4 1 1 3 3

In [177…
localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 8/13
8/30/2021 CMSU Survey Data Analysis

Number_male_salary_more_than_equal_to_50 = 14

Total_number_of_Male_students = 29

print("Solution of 2.7.2, Probablity that randomly selected male earns more than or

#Solution: Number of Male Having salary more than equal to 50 = 14

#Total number of Male students = 29

#Probability that randomly selected male earns more than or equal to 50 is 14/29*100

Solution of 2.7.2, Probablity that randomly selected male earns more than or equal t
o 50 is 48.275862068965516

In [178…
Number_Female_salary_more_than_equal_to_50 = 18

Total_number_of_Female_students = 33

print("Solution of 2.7.2, Probablity that randomly selected female earns more than

#Solution : Number of Female having salary more than equal to 50 = 18

#Total number of Female students = 33

#Probability that randomly selected female earns more than or equal to 50 is 18/33*1

Solution of 2.7.2, Probablity that randomly selected female earns more than or equa
l to 50 is 54.54545454545454

Observation: Using contingency tables of Gender and Salary we got the total numbers of
Male and Female and number of male and female earning 50 or more, and post calculation
we find out that - Probability that randomly selected male earns 50 or more is 48.27%. and
Probability that randomly selected female earns 50 or more is 54.54%

2.8. Note that there are four numerical (continuous) variables in


the data set, GPA, Salary, Spending, and Text Messages. For each
of them comment whether they follow a normal distribution.
Write a note summarizing your conclusions.
In [220…
##Solution:Using the distplot to know the normal distribution of these four numerica

plt.figure(figsize=(10,5))

sns.distplot(survey['GPA'])

plt.show()

plt.figure(figsize=(10,5))

sns.distplot(survey['Salary'])

plt.show()

plt.figure(figsize=(10,5))

sns.distplot(survey['Spending'])

plt.show()

plt.figure(figsize=(10,5))

sns.distplot(survey['Text Messages'])

plt.show()

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 9/13


8/30/2021 CMSU Survey Data Analysis

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 10/13


8/30/2021 CMSU Survey Data Analysis

I have also used the probability plot to understand if the visible data set is normal. (Second
level check)

In [225…
stats.probplot(survey['GPA'],plot=plt)

plt.title("GPA", fontsize=20, y=1.012)

plt.figure()

stats.probplot(survey['Salary'],plot=plt)

plt.title("Salary", fontsize=20, y=1.012)

plt.figure()

stats.probplot(survey['Spending'],plot=plt)

plt.title("Spending", fontsize=20, y=1.012)

plt.figure()

stats.probplot(survey['Text Messages'],plot=plt)

plt.title("Text Messages", fontsize=20, y=1.012)

plt.figure()

# For GPA, it is visible that the data is normal

# For Salary, it is visible that the data is normal

# For Spending, it is visible that the data is not normal

# For Text Messages, it is visible that the data is not normal

Out[225… <Figure size 432x288 with 0 Axes>

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 11/13


8/30/2021 CMSU Survey Data Analysis

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 12/13


8/30/2021 CMSU Survey Data Analysis

<Figure size 432x288 with 0 Axes>


And to confirm whether these four data sets are following normal distribution or not, have done the Shapiro–
Wilk test and the output from Python we got –

In [205…
from scipy import stats

import scipy as scipy

stats.shapiro(survey['GPA'])

Out[205… ShapiroResult(statistic=0.9685361981391907, pvalue=0.11204058676958084)

In [206…
stats.shapiro(survey['Salary'])

Out[206… ShapiroResult(statistic=0.9565856456756592, pvalue=0.028000956401228905)

In [207…
stats.shapiro(survey['Spending'])

Out[207… ShapiroResult(statistic=0.8777452111244202, pvalue=1.6854661225806922e-05)

In [208…
stats.shapiro(survey['Text Messages'])

Out[208… ShapiroResult(statistic=0.8594191074371338, pvalue=4.324040673964191e-06)

Observation : By these details we confirm that out of the given four data sets ‘GPA’ and
‘Salary’ are following normal distribution whereas other two ‘Spending’ and ‘Text Messages’
are not following the normal distribution

localhost:8888/nbconvert/html/CMSU Survey Data Analysis.ipynb?download=false 13/13

You might also like