00 - Project - Your First Data Science Project - Jupyter Notebook

Project - Guide school to activities to improve G3 grades
Goal of Project
Explore the dataset from lesson further
Follow the Data Science process to understand it better
It will be your task to identify possible activities to improve G3 grades
NOTE: We have very limited skills, hence, we must limit our ambitions in our analysis
Step 1: Acquire
Explore problem
Identify data
Import data
Step 1.a: Import libraries

Execute the cell below (SHIFT + ENTER)
In [8]: import pandas as pd
Step 1.b: Read the data

Use pd.read_csv() to read the file files/student-mat.csv
NOTE: Remember to assign the result to a variable (e.g., data)
In [9]: data = pd.read_csv('files/student-mat.csv')

Step 1.c: Inspect the data
Call .head() on the data to check that the first rows look as expected
In [10]: data.head()
Out[10]:
   school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  famrel  freetim
0      GP   F   18       U     GT3       A     4     4  at_home   teacher  ...       4
1      GP   F   17       U     GT3       T     1     1  at_home     other  ...       5
2      GP   F   15       U     LE3       T     1     1  at_home     other  ...       4
3      GP   F   15       U     GT3       T     4     2   health  services  ...       3
4      GP   F   16       U     GT3       T     3     3    other     other  ...       4
5 rows × 33 columns
Step 1.d: Check length of data

Call len(...) on the data
Result: There should be 395 rows of data
In [11]: len(data)
Out[11]: 395
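The length check above can also be written as an explicit test on both rows and columns. The sketch below is a minimal illustration, assuming a small made-up DataFrame called toy in place of the real file; the helper check_shape is a hypothetical name, not part of pandas.

```python
import pandas as pd

# Made-up stand-in for the real data, with a few of the same column names.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "G3": [6, 10, 15],
})

def check_shape(df, expected_rows, expected_cols):
    """Return True when the frame has exactly the expected dimensions."""
    n_rows, n_cols = df.shape  # .shape is a (rows, columns) tuple
    return n_rows == expected_rows and n_cols == expected_cols

shape_ok = check_shape(toy, expected_rows=3, expected_cols=3)
print(toy.shape, shape_ok)
```

For the project data, the equivalent call would be check_shape(data, 395, 33), using the row and column counts reported above.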
Step 2: Prepare
Explore data
Visualize ideas
Cleaning data
Notice
We will not cover visualization in this lecture
We already know that the data is clean, but we will validate it here anyway
Step 2.a: Check the data types

This step tells you whether any numeric column is stored as a non-numeric type (e.g., as text).
Get the data types with .dtypes
In [12]: data.dtypes
Out[12]: school object
sex object
age int64
address object
famsize object
Pstatus object
Medu int64
Fedu int64
Mjob object
Fjob object
reason object
guardian object
traveltime int64
studytime int64
failures int64
schoolsup object
famsup object
paid object
activities object
nursery object
higher object
internet object
romantic object
famrel int64
freetime int64
goout int64
Dalc int64
Walc int64
health int64
absences int64
G1 int64
G2 int64
G3 int64
dtype: object
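One way to act on the dtype listing is to split the columns by type and flag text columns that actually hold numbers. This is a minimal sketch on made-up data; the frame toy and the helper looks_numeric are hypothetical names used only for illustration, and the absences column is deliberately stored as text here (in the real data it is int64).

```python
import pandas as pd

# Made-up frame: "absences" is stored as text on purpose to show the problem.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "absences": ["6", "4", "10"],
    "G3": [6, 10, 15],
})

# Split the column names by data type.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

def looks_numeric(series):
    """Return True if every value in the series can be parsed as a number."""
    try:
        pd.to_numeric(series)  # raises if any value cannot be parsed
        return True
    except (ValueError, TypeError):
        return False

# Text columns that would convert cleanly deserve a closer look.
suspicious = [col for col in text_cols if looks_numeric(toy[col])]
print(numeric_cols, text_cols, suspicious)
```

In the project data every column that should be numeric already shows as int64, so a check like this would be expected to return an empty list.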
Step 2.b: Check for null (missing) values

Data is often missing entries, and there can be many reasons for this
We need to deal with missing values (this is covered later in the course)
Use .isnull().any() to see which columns contain missing values
In [14]: data.isnull().any()
Out[14]: school False
sex False
age False
address False
famsize False
Pstatus False
Medu False
Fedu False
Mjob False
Fjob False
reason False
guardian False
traveltime False
studytime False
failures False
schoolsup False
famsup False
paid False
activities False
nursery False
higher False
internet False
romantic False
famrel False
freetime False
goout False
Dalc False
Walc False
health False
absences False
G1 False
G2 False
G3 False
dtype: bool
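Where .isnull().any() only answers yes or no per column, .isnull().sum() counts how many entries are missing. The sketch below uses a made-up frame with gaps inserted on purpose, since the project data has none.

```python
import pandas as pd

# Made-up frame with deliberately missing entries.
toy = pd.DataFrame({
    "age": [18, 17, None, 15],
    "Mjob": ["teacher", None, "other", None],
    "G3": [6, 10, 15, 11],
})

# Number of missing entries per column.
missing_counts = toy.isnull().sum()

# Keep only the columns that actually have missing entries.
missing_only = missing_counts[missing_counts > 0]

total_missing = int(missing_counts.sum())
print(missing_only)
print(total_missing)
```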
Step 3: Analyze
Feature selection
Model selection
Analyze data
Description
We want to find 3 features to use in our report
The 3 features should be selected based on
Actionable insights
Convey credibility in report
What is realistic within possibilities (including a budget)
Note
This step is where you can explore
You know how to use the following:
corr() to see correlations
groupby() with mean(), count(), or std()
This should be used for step 4: Report
Step 3.a: Investigate correlation

Correlation is an easy measure for finding insights that are actionable.
Use corr() and only show G3, as that is the column we are interested in.
Notice: G1 and G2 are highly correlated with G3, but they are not intended to be used as features
In [17]: data.corr()['G3']
Out[17]: age -0.161579
Medu 0.217147
Fedu 0.152457
traveltime -0.117142
studytime 0.097820
failures -0.360415
famrel 0.051363
freetime 0.011307
goout -0.132791
Dalc -0.054660
Walc -0.051939
health -0.061335
absences 0.034247
G1 0.801468
G2 0.904868
G3 1.000000
Name: G3, dtype: float64
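The correlation column is easier to read when G1, G2 and G3 itself are dropped and the rest is ordered by strength. The sketch below shows one way to do that on a made-up frame called toy. It selects the numeric columns first, because recent pandas versions may raise an error when corr() is called on a frame that still contains text columns.

```python
import pandas as pd

# Made-up data with one text column and a few numeric ones.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS", "GP", "MS"],
    "failures": [3, 2, 1, 0, 0, 0],
    "Medu": [1, 1, 2, 3, 4, 4],
    "G1": [5, 7, 10, 12, 14, 15],
    "G2": [6, 8, 10, 13, 15, 16],
    "G3": [6, 8, 11, 13, 15, 16],
})

# Keep numeric columns only, so corr() has nothing it cannot handle.
numeric = toy.select_dtypes(include="number")

# Correlation of every feature with G3, without the grade columns themselves.
corr_g3 = numeric.corr()["G3"].drop(["G1", "G2", "G3"])

# Order by absolute value, so strong negative correlations rank high too.
order = corr_g3.abs().sort_values(ascending=False).index
ranked = corr_g3.reindex(order)
print(ranked)
```

Sorting by absolute value matters here: in the real output above, failures has the strongest correlation with G3 after G1 and G2, even though its sign is negative.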
Step 3.b: Get the Feature names

This step can help you understand the features better.
All the features are available with .columns applied on the DataFrame
In [18]: data.columns
Out[18]: Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
dtype='object')
Step 3.c: Investigate features
Repeat this step (possibly for all features)
Select a feature
Calculate the groupby(...) mean() on G3
HINT: This was done in the lesson
Calculate the groupby(...) count() on G3
Calculate the groupby(...) std() on G3
In [16]: data.groupby('freetime')['G3'].mean()
Out[16]: freetime
1 9.842105
2 11.562500
3 9.783439
4 10.426087
5 11.300000
In [17]: data.groupby('freetime')['G3'].count()
Out[17]: freetime
1 19
2 64
3 157
4 115
5 40
Name: G3, dtype: int64
In [18]: data.groupby('freetime')['G3'].std()
Out[18]: freetime
1 4.752346
2 4.219663
3 4.794920
4 4.330757
5 4.619912
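The three groupby calls above can be combined into a single table with .agg(), which makes it easier to read the mean next to the group size. This is a minimal sketch on made-up numbers, not the project data.

```python
import pandas as pd

# Made-up data: six students with a freetime score and a final grade.
toy = pd.DataFrame({
    "freetime": [1, 1, 2, 2, 2, 3],
    "G3": [10, 12, 8, 10, 12, 15],
})

# One row per freetime value, one column per statistic.
summary = toy.groupby("freetime")["G3"].agg(["mean", "count", "std"])
print(summary)
```

Reading the columns side by side helps with credibility: a mean based on a small group (for example, the 19 students with freetime 1 in the real data) is less reliable than one based on a large group, and a group with a single member has no standard deviation at all.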
Step 3.d: Select 3 features

Decide on 3 features to use in the report
The decision should be based on
Actionable insights
Convey credibility in report
What is realistic within possibilities (including a budget)
In [ ]:
In [ ]:
Step 4: Report
Present findings
Visualize results
Credibility counts
Description
With the 3 features from step 3, create a presentation
As we have not learned visualization yet, keep it simple
Remember that credibility counts
Notice
At this stage it is not supposed to be perfect.
Present the findings here in the Notebook
In [ ]:
In [ ]:
Step 5: Actions
Use insights
Measure impact
Main goal
Description
What actions should the schools take?
How can they evaluate the impact?
Remember, this is the main goal.
In [ ]:
In [ ]:
