
Project - Guide school to activities to improve G3 grades

Goal of Project
Explore the dataset from the lesson further
Follow the Data Science process to understand it better
It will be your task to identify possible activities to improve G3 grades
NOTE: Our skills are still limited, so we must limit the ambitions of our analysis

Step 1: Acquire
Explore problem
Identify data
Import data

Step 1.a: Import libraries


Execute the cell below (SHIFT + ENTER)

In [8]: import pandas as pd

Step 1.b: Read the data


Use pd.read_csv() to read the file files/student-mat.csv
NOTE: Remember to assign the result to a variable (e.g., data )

In [9]: data = pd.read_csv('files/student-mat.csv')


Step 1.c: Inspect the data
Call .head() on the data to check that everything looks as expected

In [10]: data.head()

Out[10]:    school  sex  age  address  famsize  Pstatus  Medu  Fedu  Mjob     Fjob      ...  famrel  freetim
         0  GP      F    18   U        GT3      A        4     4     at_home  teacher   ...  4
         1  GP      F    17   U        GT3      T        1     1     at_home  other     ...  5
         2  GP      F    15   U        LE3      T        1     1     at_home  other     ...  4
         3  GP      F    15   U        GT3      T        4     2     health   services  ...  3
         4  GP      F    16   U        GT3      T        3     3     other    other     ...  4

5 rows × 33 columns

Step 1.d: Check length of data


Call len(...) on the data
Result: There should be 395 rows of data

In [11]: len(data)

Out[11]: 395
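
The row and column checks above can also be done in one go with .shape. The following is a minimal sketch, not part of the original notebook: the helper name check_shape and the tiny sample frame are hypothetical, and for the real data the call would be check_shape(data, 395, 33), using the 395 rows and 33 columns reported above.

```python
import pandas as pd


def check_shape(df, expected_rows, expected_cols):
    """Return True if the DataFrame has the expected number of rows and columns."""
    rows, cols = df.shape  # .shape is a (rows, columns) tuple
    return rows == expected_rows and cols == expected_cols


# Tiny made-up stand-in for the real DataFrame (illustrative values only).
sample = pd.DataFrame({"age": [18, 17], "G3": [6, 10]})

shape_ok = check_shape(sample, 2, 2)
print(shape_ok)
```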

Step 2: Prepare
Explore data
Visualize ideas
Cleaning data

Notice
We will not cover visualization in this lecture
We also know that the data is clean, but we will validate it here anyway

Step 2.a: Check the data types


This step tells you whether any numeric column is stored as a non-numeric type.
Get the data types with .dtypes

In [12]: data.dtypes

Out[12]: school object


sex object
age int64
address object
famsize object
Pstatus object
Medu int64
Fedu int64
Mjob object
Fjob object
reason object
guardian object
traveltime int64
studytime int64
failures int64
schoolsup object
famsup object
paid object
activities object
nursery object
higher object
internet object
romantic object
famrel int64
freetime int64
goout int64
Dalc int64
Walc int64
health int64
absences int64
G1 int64
G2 int64
G3 int64
dtype: object
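
To get the numeric and non-numeric columns as two lists instead of reading them off the output, select_dtypes() can help. This is a sketch under assumptions: the sample frame below is a made-up stand-in, and on the real data you would call the same methods on data.

```python
import pandas as pd

# Tiny made-up stand-in for the real DataFrame (illustrative values only).
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "G3": [6, 10, 15],
})

# Columns stored as numbers (these are the ones corr() can work with).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (text columns such as 'school').
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```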

Step 2.b: Check for null (missing) values


Data is often missing entries, and there can be many reasons for this
We need to deal with that (covered later in the course)
Use .isnull().any()

In [14]: data.isnull().any()

Out[14]: school False


sex False
age False
address False
famsize False
Pstatus False
Medu False
Fedu False
Mjob False
Fjob False
reason False
guardian False
traveltime False
studytime False
failures False
schoolsup False
famsup False
paid False
activities False
nursery False
higher False
internet False
romantic False
famrel False
freetime False
goout False
Dalc False
Walc False
health False
absences False
G1 False
G2 False
G3 False
dtype: bool
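
.isnull().any() only says whether a column has missing values. If you also want to know how many, .isnull().sum() counts them per column. The sketch below uses a made-up frame with deliberately missing entries, since the real data has none; it is an illustration, not a step from the lesson.

```python
import pandas as pd

# Made-up frame with deliberately missing entries (the real data has none).
sample = pd.DataFrame({
    "studytime": [2, None, 1, 3],
    "G3": [10, 12, None, None],
})

# True/False per column: does the column contain any missing value?
has_missing = sample.isnull().any()

# Number of missing values per column, and in total.
missing_counts = sample.isnull().sum()
total_missing = int(missing_counts.sum())

print(has_missing)
print(missing_counts)
print(total_missing)
```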

Step 3: Analyze
Feature selection
Model selection
Analyze data

Description
We want to find 3 features to use in our report
The 3 features should be selected based on
Actionable insights
Convey credibility in report
What is realistic within possibilities (including a budget)
Note
This step is where you can explore
You know how to use the following:
corr() to see correlations
groupby() with mean(), count(), or std()
This should be used for step 4: Report

Step 3.a: Investigate correlation


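Correlation is an easy measure to find insights that are actionable.
Use corr() and only show G3, as that is the column we are interested in.
Pass numeric_only=True so that only numeric columns are used; recent pandas versions raise an error on text columns otherwise.
Notice: G1 and G2 are highly correlated with G3, but they are not intended to be used

In [17]: data.corr(numeric_only=True)['G3']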

Out[17]: age -0.161579


Medu 0.217147
Fedu 0.152457
traveltime -0.117142
studytime 0.097820
failures -0.360415
famrel 0.051363
freetime 0.011307
goout -0.132791
Dalc -0.054660
Walc -0.051939
health -0.061335
absences 0.034247
G1 0.801468
G2 0.904868
G3 1.000000
Name: G3, dtype: float64
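
Since the sign only tells the direction, it can be easier to compare features by the absolute size of their correlation, leaving out G1, G2 and G3 itself. The sketch below does this for a few values copied from the output above; the helper name rank_correlations is hypothetical. On the real data you could pass the full Series returned by corr().

```python
import pandas as pd


def rank_correlations(corr_with_target, exclude=("G1", "G2", "G3")):
    """Sort correlations by absolute size, skipping the excluded columns."""
    kept = corr_with_target.drop(labels=list(exclude), errors="ignore")
    order = kept.abs().sort_values(ascending=False).index
    return kept.loc[order]  # keep the original signs, strongest first


# A few values copied from the corr() output above.
g3_corr = pd.Series({
    "age": -0.161579,
    "Medu": 0.217147,
    "studytime": 0.097820,
    "failures": -0.360415,
    "G1": 0.801468,
    "G2": 0.904868,
    "G3": 1.000000,
})

ranked = rank_correlations(g3_corr)
print(ranked)
```

Keep in mind that a correlation only describes how two columns move together in this dataset; it does not show that changing one will change the other.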

Step 3.b: Get the Feature names


This step can help you understand features better.
All the features are available via .columns on the DataFrame

In [18]: data.columns

Out[18]: Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',


'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
dtype='object')

Step 3.c: Investigate features
Repeat this step (possibly for all features)

Select a feature
Calculate the groupby(...) mean() on G3
HINT: This was done in the lesson
Calculate the groupby(...) count() on G3
Calculate the groupby(...) std() on G3

In [16]: data.groupby('freetime')['G3'].mean()

Out[16]: freetime
1 9.842105
2 11.562500
3 9.783439
4 10.426087
5 11.300000
Name: G3, dtype: float64

In [17]: data.groupby('freetime')['G3'].count()

Out[17]: freetime
1 19
2 64
3 157
4 115
5 40
Name: G3, dtype: int64

In [18]: data.groupby('freetime')['G3'].std()

Out[18]: freetime
1 4.752346
2 4.219663
3 4.794920
4 4.330757
5 4.619912
Name: G3, dtype: float64
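
The three groupby calls above can be combined into a single table with .agg(), which makes it easier to repeat the step for other features. This is a minimal sketch: the helper name summarize_g3 is hypothetical and the sample frame is made up, so on the real data you would call summarize_g3(data, 'freetime').

```python
import pandas as pd


def summarize_g3(df, feature):
    """Mean, count and standard deviation of G3 for each value of a feature."""
    return df.groupby(feature)["G3"].agg(["mean", "count", "std"])


# Tiny made-up stand-in for the real DataFrame (illustrative values only).
sample = pd.DataFrame({
    "freetime": [1, 1, 2, 2, 2],
    "G3": [8, 12, 10, 14, 12],
})

summary = summarize_g3(sample, "freetime")
print(summary)
```

Having count next to mean is useful here: a group mean based on very few students should carry less weight in the report.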

Step 3.d: Select 3 features


Decide on 3 features to use in the report
The decision should be based on
Actionable insights
Convey credibility in report
What is realistic within possibilities (including a budget)
In [ ]: ​

In [ ]: ​
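
One rough way to judge how much to trust a group mean is its standard error, std / sqrt(count): small groups with a large spread give less reliable means. This is an assumption of this sketch and not a method from the lesson. The numbers below are copied from the freetime outputs in Step 3.c.

```python
import pandas as pd

# Values copied from the freetime groupby outputs in Step 3.c.
freetime_summary = pd.DataFrame(
    {
        "mean": [9.842105, 11.562500, 9.783439, 10.426087, 11.300000],
        "count": [19, 64, 157, 115, 40],
        "std": [4.752346, 4.219663, 4.794920, 4.330757, 4.619912],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="freetime"),
)

# Standard error of the mean: spread divided by the square root of the group size.
freetime_summary["sem"] = freetime_summary["std"] / freetime_summary["count"] ** 0.5

print(freetime_summary)
```

In this table the smallest group (freetime 1, 19 students) has the largest standard error, so its mean is the least certain of the five.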

Step 4: Report
Present findings
Visualize results
Credibility counts

Description
With the 3 features from step 3, create a presentation
As we have not learned visualization yet, keep it simple
Remember that credibility counts

Notice
At this stage it is not supposed to be perfect.
Present the findings here in the Notebook

In [ ]: ​

In [ ]: ​
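
Without visualization, a finding can still be presented as one plain sentence per feature. The sketch below shows one possible shape for such a line, using the freetime means copied from Step 3.c; the wording and variable names are only suggestions.

```python
import pandas as pd

# Mean G3 per freetime value, copied from the output in Step 3.c.
g3_mean_by_freetime = pd.Series(
    [9.842105, 11.562500, 9.783439, 10.426087, 11.300000],
    index=pd.Index([1, 2, 3, 4, 5], name="freetime"),
    name="G3",
)

best = g3_mean_by_freetime.idxmax()    # group with the highest mean G3
worst = g3_mean_by_freetime.idxmin()   # group with the lowest mean G3
gap = g3_mean_by_freetime.max() - g3_mean_by_freetime.min()

report_line = (
    f"Highest mean G3: freetime={best}; "
    f"lowest mean G3: freetime={worst}; gap: {gap:.2f}"
)
print(report_line)
```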

Step 5: Actions
Use insights
Measure impact
Main goal

Description
What actions should the schools take?
How can they evaluate the impact?
Remember, this is the main goal.

In [ ]: ​
In [ ]: ​
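
A simple way to evaluate impact is to compare the mean G3 before and after an action. The sketch below only illustrates the calculation: the helper name mean_g3_change is hypothetical, and the grades are made-up placeholders, not results from the dataset.

```python
import pandas as pd


def mean_g3_change(before, after):
    """Change in mean G3 from the 'before' grades to the 'after' grades."""
    return after.mean() - before.mean()


# Made-up placeholder grades, NOT results from the dataset.
g3_before = pd.Series([8, 10, 12, 10])
g3_after = pd.Series([10, 11, 13, 12])

change = mean_g3_change(g3_before, g3_after)
print(change)
```

A raw difference like this does not prove that the action caused the change; other things may have changed as well, so a comparison group of students who did not take part would make the result more credible.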
