
EE331 Introduction to Machine Learning

Spring 2019 Project Proposal


Predicting alcohol consumption based on student information
Hung Vu 20150936
Thanh Nguyen 20150846

(a) Clear description of the task with discussion on (1) how it might be interesting/important and
also (2) the uniqueness of the task

These days, parents and teachers are concerned about how much alcohol students consume
and how this habit affects their academic performance. In this project we will develop a model
trained and tested on data collected from surveys of high-school students enrolled in Mathematics
and Portuguese language courses, and evaluate its performance and predictive power. The main
task is to predict a student's level of alcohol consumption given his/her background information,
behaviors and activities.

Once the model fits the training and test data well, we can use the available background
information to predict how much a student drinks on weekdays and on weekends. Such a model
could help school administrators and teachers intervene early and inform parents about the
harmful habit of excessive beer and wine consumption.
The dataset is obtained from Kaggle at the following link:
https://www.kaggle.com/uciml/student-alcohol-consumption

(b) Detailed description of the data which you will be providing e.g. format, size, number of data,
label.

The data is provided in two files, student-mat.csv and student-por.csv, collected from 395
students in the Mathematics course and 649 students in the Portuguese language course;
382 students responded to both surveys. Overall there are 33 features, ranging from each
student's background to their daily-life activities. The features are numeric, binary, or
nominal, and are listed below.

1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g.
administrative or police), 'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative
or police), 'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course'
preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to
1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:

1. G1 - first period grade (numeric: from 0 to 20)
2. G2 - second period grade (numeric: from 0 to 20)
3. G3 - final grade (numeric: from 0 to 20, output target)

Based on the criteria and purpose of the project, the dataset is pre-processed as follows:
- Only the features most essential to the task are kept (the 11 numeric input features and
the 2 target variables identified below); all other features are excluded.
- The data of both courses are combined. The 382 students who belong to both courses
have identical data point values, so after merging we obtain a data set of 662 students
with 11 quantitative input features (age, Medu, Fedu, studytime, failures, famrel, freetime,
goout, health, absences and G3) and 2 target variables, "Dalc" and "Walc", the weekday
and weekend alcohol consumption we seek to predict. A minimal preprocessing sketch in
Python is given below.
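
As a rough illustration of this preprocessing plan, the following Python sketch loads the two
CSV files from the Kaggle link above and builds the combined feature matrix. The set of
columns used to identify the students who answered both surveys is an assumption on our
part and may need adjustment.

```python
import pandas as pd

# 11 quantitative input features and the 2 targets kept after preprocessing.
FEATURES = ["age", "Medu", "Fedu", "studytime", "failures", "famrel",
            "freetime", "goout", "health", "absences", "G3"]
TARGETS = ["Dalc", "Walc"]

# The two survey files from the Kaggle dataset.
mat = pd.read_csv("student-mat.csv")
por = pd.read_csv("student-por.csv")

# Stack both surveys, then drop the students who answered both.
# The identifying columns below are an assumption; they may need adjustment.
id_cols = ["school", "sex", "age", "address", "famsize", "Pstatus",
           "Medu", "Fedu", "Mjob", "Fjob", "reason", "guardian"]
combined = pd.concat([mat, por], ignore_index=True).drop_duplicates(subset=id_cols)

X = combined[FEATURES]   # roughly (662, 11) if 382 duplicates are removed
y = combined[TARGETS]    # weekday (Dalc) and weekend (Walc) consumption
print(X.shape, y.shape)
```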
(c) Any prior knowledge required to understand task/data.

In order to make sensible predictions from the given information, we start from the following
prior assumptions:
- Age: older students tend to drink more.
- Medu and Fedu: parents with higher education may be better at keeping their children
from consuming alcohol.
- Studytime: students who spend more time studying may have less time for drinking.
- Failures and final grade: poor performance at school is likely associated with higher
alcohol consumption.
- Freetime and goout: more free time and more going out give students more opportunities
to drink.
- Famrel: a better family relationship may have fewer negative effects on a student, so
he/she is less likely to consume alcohol.
- Health: better health also indicates less alcohol consumption.
(d) Provide measure for performance. You may have multiple measures.

- k-fold Cross-Validation: This technique allows us to evaluate performance using only the
training data, by shuffling the data set randomly and splitting it into k partitions of equal
size.
For each partition Q, we train the model on the remaining k-1 partitions and compute the
accuracy on partition Q. We then average the k scores to obtain the evaluation result.
The value of k also needs to be chosen carefully to avoid high variance (the evaluation
changes a lot depending on the data used to fit the model) or high bias (overestimating
the skill of the model). The usual choice is k = 5 or 10, but other values can be used as
well.
We use k-fold cross-validation because we have a limited amount of data (~700 students),
even though it can be computationally expensive. A scikit-learn sketch covering both
measures in this section is given after the list.
- Confusion Matrix: an intuitive and easy-to-read summary of the accuracy of a
classification model. It reports the number of correct and incorrect predictions as counts,
broken down by class.
For example, with two classes (positive and negative) the confusion matrix is:

                      Predicted positive    Predicted negative
   Actual positive           TP                     FN
   Actual negative           FP                     TN

Here TP (true positive): the observation is positive and predicted positive;
FN (false negative): the observation is positive but predicted negative;
TN (true negative): the observation is negative and predicted negative;
FP (false positive): the observation is negative but predicted positive.
Based on these statistics we can calculate accuracy, recall, precision and the F-measure, and we
should prefer the model with the highest accuracy, recall and precision. If one model has lower
precision but higher recall than another (or vice versa), we take the F-score into consideration.
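
The following is a minimal scikit-learn sketch of both measures, assuming the feature matrix X
and targets y built in the preprocessing sketch of section (b); the k-NN model and the choice of
5 folds are placeholders rather than final choices.

```python
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Placeholder model; any classifier from section (e) can be plugged in here.
model = KNeighborsClassifier(n_neighbors=5)

# k-fold cross-validation: shuffle, split into 5 equal partitions, train on
# k-1 of them, score on the held-out one, and average the k accuracy scores.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y["Dalc"], cv=cv, scoring="accuracy")
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

# Confusion matrix from out-of-fold predictions, broken down by the five
# consumption levels, plus per-class precision, recall and F-measure.
pred = cross_val_predict(model, X, y["Dalc"], cv=cv)
print(confusion_matrix(y["Dalc"], pred, labels=[1, 2, 3, 4, 5]))
print(classification_report(y["Dalc"], pred, labels=[1, 2, 3, 4, 5]))
```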

(e) Any algorithm that you are providing for reference.

- K-nearest Neighbor: For classification problems like this one, KNN works as follows: an
object is classified by a majority vote of its neighbors and is assigned to the most common
class among its k nearest neighbors. The advantages of this algorithm are its simplicity
and the absence of distributional assumptions, while still performing well. On the other
hand, it requires a large amount of memory and is sensitive to irrelevant features and to
the scale of the data. Short sketches of both algorithms are given after this list.
- Decision Tree: Place the best attribute of the dataset at the root of the tree, then split the
training set into subsets according to the values of that attribute. Repeat the process until
every branch ends in a leaf node. There are two popular attribute selection measures:
information gain and the Gini index.
In this project we will use the information-theoretic approach. First we calculate the
entropy of the target,

   H(T) = - sum_c p_c log2 p_c,

where p_c is the fraction of examples in class c. Then the residual information after
splitting on an attribute A is

   H(T | A) = sum_v (|T_v| / |T|) H(T_v),

where T_v is the subset of examples with value v for A. Notice that the resulting
information gain, Gain(A) = H(T) - H(T | A), favors attributes with many values, so we
correct for this with the information gain ratio,

   GainRatio(A) = Gain(A) / SplitInfo(A),   SplitInfo(A) = - sum_v (|T_v| / |T|) log2(|T_v| / |T|).

From this information we build the decision tree by placing attributes according to their
information gain ratio.
The performance of a tree can be improved by pruning, which removes branches built
from features with low information gain. This reduces the complexity of the tree and thus
increases its predictive power by reducing overfitting.
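
A minimal scikit-learn sketch of the k-NN approach, assuming the X and y built in the
preprocessing sketch; since k-NN is sensitive to the scale of the data, the features are
standardized first, and n_neighbors = 5 is only a starting point.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Standardize the features (k-NN is sensitive to scale), then classify each
# student by a majority vote of the 5 nearest neighbors.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(knn, X, y["Dalc"], cv=5, scoring="accuracy").mean())
```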
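
A sketch of the information-theoretic decision tree approach. The hand-written functions below
only illustrate the entropy and gain-ratio formulas above; for the actual model we would most
likely rely on scikit-learn's DecisionTreeClassifier grown with the entropy criterion and pruned,
e.g. by limiting max_depth.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def entropy(labels):
    """H(T) = -sum_c p_c log2 p_c over the class distribution of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(values, labels):
    """C4.5-style gain ratio of splitting labels on the attribute values."""
    total = len(labels)
    residual, split_info = 0.0, 0.0
    for v in np.unique(values):
        mask = values == v
        w = mask.sum() / total
        residual += w * entropy(labels[mask])
        split_info -= w * np.log2(w)
    gain = entropy(labels) - residual
    return gain / split_info if split_info > 0 else 0.0

# Rank the candidate features by gain ratio with respect to weekday drinking.
# (Assumes the X and y data frames from the preprocessing sketch.)
for col in X.columns:
    print(col, round(gain_ratio(X[col].to_numpy(), y["Dalc"].to_numpy()), 3))

# For the actual model: a pruned tree grown with the entropy criterion.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
```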

(f) Introduce group members and their responsibilities.

Thanh Nguyen: data analysis, k-nearest neighbor algorithm.

Hung Vu: introduction, measures of performance, and decision tree algorithm.
