Data Mining Project

PREDICTING SUCCESSFUL STUDENT PERFORMANCE
Luis Pineda
Student Success is Not Genetic
❏ Social circumstances
❏ Lack of money for external help
❏ Problems at home
❏ Little/no help from family members
❏ Few family role models
❏ External responsibilities
- The aim of this project is:
  - To find the most significant attribute(s) related to a student’s academic performance in math.
  - To dispel any prior notions of what makes a student successful.
  - To identify harmful traits and practices in the context of academia.
- Importance:
  - With a solid understanding of this phenomenon, we could then turn our focus to troubled students and better provide the help and guidance they need.
  - Predict troubled students before they fail. → Save resources, provide better education.
USING DATA MINING TO PREDICT SECONDARY SCHOOL
STUDENT PERFORMANCE: Paulo Cortez and Alice Silva
➢ Aim was to classify student performance in order to identify important attributes in relation
to pass/fail grade, letter grade (A - F)
➢ Used Naive Bayes, Neural Networks, Decision
Trees, Random Forests, and Support Vector
Machines algorithms
➢ Unlike our process, they included the 1st and
2nd semester grades in their classifier model in
order to predict the final grades of students
○ We chose to ignore these in order to let
other, less obviously significant,
attributes show their effect
○ As a result, our accuracies were much
lower than those seen here.
Stress, debt and undergraduate medical student performance:
Sarah Ross, Jennifer Cleland, Mary Macleod

● Aimed to examine the relationships between student debt, mental health and academic
performance.

● Students' perceptions of their own levels of debt, rather than the level of debt per se, relate to
performance. Students who worry about money have higher debts and perform less well than
their peers in degree examinations.

● Students from lower socioeconomic backgrounds and postgraduate students had higher
debts. There was no direct correlation between debt, class ranking or General Health
Questionnaire (GHQ) score; however, a subgroup of 125 students (37.7%), who said that
worrying about money affected their studies, did have higher debt and were ranked lower in
their classes.
● http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2929.2006.02448.x/full
Stressful life events and health-related quality of life
in college students: Damush, Teresa M, Hays, Ron D, DiMatteo, M Robin
➢ Data was collected from 350 West Coast University Students:
○ 49.1% freshmen, 35.4% sophomores, 12.3% juniors, and 3.1% seniors
○ 50.0% were Caucasian, 36.8% Asian, 9.7% Hispanic, 1.4% African American, and 3.1% reported Other ethnicity.
○ 57.2% female
➢ Goal was to find a correlation between stressful life events and health-related quality of life:
○ zero-order product moment correlations to evaluate associations between stressful life events
experienced in the recent past and HRQOL measure (Seen in Table 2).
➢ Distressful events were found to have the largest impact
○ Illness, sexuality-related events, and deviance events
➢ Some measures were strongly intercorrelated. For example:
○ Respondents who reported greater anxiety, bodily pain, or depression also reported less sense of
belonging, less positive affect, and poorer social functioning.
Gender, ethnicity, and social cognitive factors predicting the
academic achievement of students in engineering. Hackett, Gail; Betz, Nancy E.;
Casas, J. Manuel; Rocha-Singh, Indra A.

● Examines the relationships of measures of occupational and academic self-efficacy; vocational
interests; outcome expectations; academic ability; and perceived stress, support, and coping to the
academic achievement of women and men enrolled in university-level engineering/science
programs
● 197 students from diverse racial/ethnic backgrounds responded to scales measuring the variables of
interest; high school and college academic data were obtained from university records.
● Self-efficacy for academic milestones, in combination with other academic and support
variables, was found to be the strongest predictor of college academic achievement.
● Outcome expectations, vocational interests, and low levels of stress were in turn the
strongest predictors of academic self-efficacy
● http://psycnet.apa.org/journals/cou/39/4/527/
PREDICTING STUDENT PERFORMANCE: AN APPLICATION OF DATA MINING
METHODS WITH AN EDUCATIONAL WEB-BASED SYSTEM: Behrouz Minaei-Bidgoli,
Deborah A. Kashy, Gerd Kortemeyer, William F. Punch

➢ “Early” data mining application aimed at providing more effective online courses for university students
➢ Used classification algorithms:
○ Quadratic Bayesian classifier
○ k-nearest neighbor (k-NN)
○ Parzen-window
○ Multilayer perceptron (MLP)
○ Decision Trees
➢ 3 Different Classifying criteria:
○ Binary pass/fail
○ High, Middle, and Low
○ GPA (9 cases)
➢ Goal was to predict student grades based on web
features
Does Mandatory Attendance Improve Student Performance? Daniel R. Marburger
● Classes were split into two groups: a policy group, in which students were required to attend class,
and a no-policy group, in which attendance was not required.
● Both groups were given three exams over a semester to see how students in the no-policy group
would do on the exams if they had not been present during the preceding classes.
● The no-policy group was found to be more likely to answer a question incorrectly: by 9% on the
first exam, 12.8% on the second exam, and 14% on the third exam.
Data Description
● Collection of student & demographic data of 395 Portuguese students for Math class.
○ Acquired from school reports and questionnaires.
○ Clean data, no missing values. No data cleaning required.
- 208 of the participants were female (52.7%)
- 187 of the participants were male (47.3%)
● Predictor value of final grade: highly correlated with 1st and 2nd period grades.
○ G1 & G2 ignored for prediction
● Data is mostly:
○ Binary: “this” or “that” / “yes” or “no”
○ Numeric: on a scale from 1-10, 2-5, etc.
● 649 Instances
● Multivariate

33 Attributes:
- School, Sex, Age, Address, Family Size, Parent’s Cohabitation Status, Mother’s Education,
Father’s Education, Mother’s Job, Father’s Job, Reason for Choosing School, Student’s Guardian,
Travel Time from School, Study Time Per Week, Number of past class failures, Extra School Support,
Family Education Support, Extra Paid Courses, Extra Curricular Activities, Attended Nursery School,
Wants Higher Education, Access to Internet, In a Romantic Relationship, Quality of Family
Relationships, Free Time After School, Goes Out with Friends, Alcohol Consumption During Workday,
Weekend Alcohol Consumption, Health Status, Absences

● Some correlation was found between passing or failing a class and: Mother/Father’s
education level, Past failures, and Amount of time spent going out, though not very strong.
○ Failures had a correlation of -.338, the highest of all attributes
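The correlation between past failures and the pass/fail outcome mentioned in the data description can be computed as a Pearson (product-moment) correlation between the attribute and the 0/1 label. The following is a minimal sketch only: the helper name `pearson` and the toy rows are assumptions for illustration and are not the real dataset, so the value it produces is not the -.338 reported on the slide.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy rows (NOT the real data): past failures and final grade.
failures = [0, 0, 1, 3, 0, 2]
final_grade = [15, 12, 9, 4, 11, 8]

# Pass/fail label: 1 if the final grade is 10 or above, else 0.
passed = [1 if g >= 10 else 0 for g in final_grade]

r = pearson(failures, passed)  # negative: more past failures, fewer passes
```

A negative value means that students with more past failures pass less often, which is the direction reported for the failures attribute.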
Data Description Continued
- Data is not balanced
○ Distribution of student performance mostly normal, however a large percentage
of students received a failing grade
- Minimum Grade: 0
- Maximum Grade: 20
- Mean Grade: 10.42
- Standard Deviation: 4.581

Transformation
- Because our target value was in a range of 0-20, it made classifying an issue
○ The Portuguese grading scale defines passing as >= 10, so we transformed our data
○ 0 = failing
○ 1 = passing
○ We also had to balance the data
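The transformation on this slide (0-20 grade to a 0/1 label, then balancing) can be sketched as follows. The slides do not say how the balancing was done, so random undersampling of the larger class is an assumption here, as are the names `to_pass_fail` and `undersample` and the toy grades.

```python
import random

PASS_MARK = 10  # 0-20 scale: a grade of 10 or above counts as passing

def to_pass_fail(grade):
    """Map a 0-20 grade to 1 (passing) or 0 (failing)."""
    return 1 if grade >= PASS_MARK else 0

def undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class has equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy grades (NOT the real data): 3 failing, 7 passing.
grades = [0, 5, 8, 10, 11, 12, 14, 15, 18, 20]
labels = [to_pass_fail(g) for g in grades]
balanced = undersample(grades, labels)  # 3 failing + 3 passing rows
```

Undersampling discards data from the majority class, which is one possible reason a balanced run can score lower than an unbalanced one on the same attributes.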
Methodology: Decision Trees On General Data

- 17 Nodes, 9 Terminal Nodes, 5 Depth.
- Max Depth: 5, Min Cases in Parent: 16, Min Cases in Child: 8
- Most significant attributes: failures, absences, free time, family support
- Balanced data: 130 cases for passing and failing

(P = Min Cases in Parent, C = Min Cases in Child)

Unbalanced
Depth  P   C   Training  Testing
5      16  8   75.5%     73.9%
10     30  15  74.8%     66.7%
12     20  10  75.4%     71.8%

Balanced
Depth  P   C   Training  Testing
5      16  8   75.6%     63.1%
10     30  15  70.4%     52%
12     20  10  72.1%     61.1%
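The three tree settings varied above (maximum depth, minimum cases in a parent node, minimum cases in a child node) act as stopping rules while the tree is grown. The sketch below shows one common reading of those rules; the function name `may_split` and the exact semantics (root at depth 0, limits checked before a split is accepted) are assumptions, since the slides do not name the tool that was used.

```python
def may_split(depth, n_parent, n_left, n_right,
              max_depth=5, min_parent=16, min_child=8):
    """Return True if a node may be split under the three growth limits.

    depth    -- depth of the node being considered (root = 0)
    n_parent -- number of cases in the node
    n_left, n_right -- number of cases each child would receive
    """
    if depth >= max_depth:
        return False  # splitting would exceed the maximum depth
    if n_parent < min_parent:
        return False  # too few cases in the parent to split
    if min(n_left, n_right) < min_child:
        return False  # one child would be too small
    return True

# Defaults mirror the first table row (Depth 5, P 16, C 8).
ok = may_split(depth=2, n_parent=40, n_left=30, n_right=10)
too_deep = may_split(depth=5, n_parent=40, n_left=20, n_right=20)
small_parent = may_split(depth=2, n_parent=15, n_left=8, n_right=7)
small_child = may_split(depth=2, n_parent=20, n_left=15, n_right=5)
```

Looser limits let the tree grow deeper and fit the training rows more closely, so comparing training and testing accuracy across settings, as the tables do, is a reasonable way to watch for overfitting.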
Results:
➢ Most significant attributes:
○ Failures
○ Absences
○ Free time
○ Family support
➢ Effective accuracy in models varied, though never quite consistently above 70%
○ Expected. Source study used 1st and 2nd Semester grades, which obviously
resulted in higher accuracies
➢ Expected. Failures and absences were the highest contributors to a pass or fail grade.
➢ Surprisingly, on the balanced data, whether a student considered their mother or father
their guardian also had some impact, with mothers having a higher passing score
➢ On Unbalanced data, access to internet also had some impact, though
it was more or less 50/50
Student Life Analysis

Decision Trees on Student Life (Balanced):
❏ Failures, Absences, Going Out, Internet Access, Weekday Alcohol Consumption,
Free Time, Desires Higher Education, Travel Time, Weekend Alcohol Consumption,
Extracurricular activities, Nursery School, Romantic Relationships, Health, and Study time
❏ 11 Total Nodes, 6 Terminal Nodes, Depth of 4
❏ Minimum Number of Cases for Parent: 15
❏ Minimum Number of Cases for Child: 7
❏ Maximum Depth of 12

K-Nearest Neighbor Analysis
No Feature Selection vs Feature Selection

No Feature Selection         Feature Selection
K  Testing  Training         K  Testing  Training
1  45%      47.4%            1  59.9%    35.9%
3  45.9%    61.4%            3  64%      59.4%
6  57.5%    45.6%            6  69.5%    50%
9  54.2%    55.6%            9  64.7%    63.3%
Student Life Cont.

- Standout attributes were Study time, travel time, and failures
Unbalanced Decision Tree:
- 15 Nodes, 8 Terminal Nodes, Depth of 4.

K-Nearest Neighbor Analysis

K  Testing  Training         K  Testing  Training
1  60.3%    56.9%            1  71.8%    72.9%
3  63.7%    60.2%            3  74.6%    71.6%
6  66.3%    65.1%            6  75.5%    70.9%
9  63.0%    68.8%            9  78.1%    65.9%
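The K-Nearest Neighbor runs summarized above can be sketched in a few lines. This is an illustration only, not the tool used for the slides (which is not named): the function names `knn_predict` and `accuracy`, the use of Euclidean distance, and the toy rows are all assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tied vote (possible with even k) the label seen first,
    # i.e. the one belonging to the nearer point, is returned.
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

# Hypothetical toy rows: (study time, travel time, past failures) -> pass (1) / fail (0).
train_x = [(1, 1, 0), (2, 1, 0), (3, 1, 0), (1, 3, 2), (1, 2, 3), (2, 4, 1)]
train_y = [1, 1, 1, 0, 0, 0]
test_x = [(2, 1, 0), (1, 3, 3)]
test_y = [1, 0]

acc = accuracy(train_x, train_y, test_x, test_y, k=3)
```

Running `accuracy` once per value of K (1, 3, 6, 9) on the training and testing partitions would give a table of the same shape as the ones above.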



Student Life Results


➢ Decision Trees Most Important Attributes:
○ Balanced: Failures, Free time, Go out, and Weekend Alcohol
Consumption
○ Unbalanced: Failures, Free time, Go out, Weekday Alcohol
Consumption
➢ K-Nearest Neighbor:
○ Unbalanced:
■ Feature Selection yielded best results
■ 3 & 6 Neighbors optimal
○ Balanced:
■ Feature Selection also yielded best results
■ 3 & 6 Neighbors results were similar, though much
lower in accuracy
➢ Much harder to classify failing students
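One way to make "harder to classify failing students" concrete is to report accuracy separately for each actual class instead of a single overall figure. The sketch below is illustrative: the function name `per_class_recall` and the toy labels are assumptions and do not come from the project's results.

```python
def per_class_recall(actual, predicted):
    """Fraction of each actual class that the model labeled correctly."""
    totals, hits = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        if a == p:
            hits[a] = hits.get(a, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}

# Hypothetical toy labels: 1 = passing, 0 = failing.
actual    = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1]

recall = per_class_recall(actual, predicted)
overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```

In this toy case the overall accuracy is 60%, yet only one of the four failing students is identified, which is the kind of gap a single accuracy number hides.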
Demographics:
[Figures: Balanced Decision Tree, Unbalanced Decision Tree]

● Balanced: 17 total Nodes, 9 Terminal Nodes, Depth of 5
● Unbalanced: 19 total Nodes, 10 Terminal Nodes, Depth of 5.
● Attributes: Sex, Age, Address, Famsize, ParentStatus
Demographics Cont.

Feature Selection            No Feature Selection
K  Training  Test            K  Training  Test
1  70.9%     29.1%           1  70.1%     29.9%
3  72.9%     27.1%           3  65.8%     34.2%
6  71.1%     28.9%           6  71.1%     28.9%

Parental Influences
[Figures: Unbalanced Decision Tree, Balanced Decision Tree]

● Balanced: Total Nodes 7, Number of Terminal Nodes 4, Depth of 3.
Max Depth, Min Cases in Parent, Min Cases in Child: 10, 16, 8
● Unbalanced: Num Of Nodes 11, Terminal Nodes 6, Depth of 4.
Max Depth, Min Cases in Parent, Min Cases in Child: 10, 18, 9
● Attributes: FJob, MJob, Famrel, Medu, Fedu
Parental Influences Cont.

No Feature Selection         Feature Selection
K  Training  Test            K  Training  Test
1  67.8%     32.2%           1  73.9%     26.1%
3  71.2%     28.8%           3  72.7%     27.3%
6  69.6%     30.4%           6  68.9%     31.1%
Demo & Parental Influences Results
● Balanced Decision Tree for Demographics was more successful.
● K-Nearest Neighbor with K = 3 and no feature selection yielded the highest results for Demographics.
● Balanced Decision Tree for Parental Influence was also more successful.
● K-Nearest Neighbor with K = 1 and feature selection yielded the highest results for Parental
Influence.
● These two attribute groups had very little relevance when testing for a pass or fail
on the final grade (G3).
Results Interpretation:
➢ It is harder to classify failing students
➢ Students who have failed one or more classes should be given attention, as
they are more likely to fail a course
➢ Students who have multiple absences are also more likely to fail a course
➢ While going out and enjoying one’s free time could seem dangerous for
struggling students, it is good to encourage some respite from stressful
schoolwork.
➢ Students whose mothers and fathers were both educated and employed
passed at higher rates.
