
Running head: BUSINESS ANALYTICS

BUSINESS ANALYTICS

Student’s Name:

Student’s ID:

Date of Submission:

To: Management of ABC Universal Education

From: Consultant of ABC Universal Education

Date:

Title: Drivers of Passing Course Grades

Executive Summary

In this assignment, machine learning techniques are employed, in a consulting role, to estimate which learners are most likely to succeed, that is, to achieve a final score of at least 10 out of 20, on the basis of their characteristics. The dataset contains 33 variables, including the target variable G3, which records each learner's final score, and covers 585 learners. Because the preliminary examination scores G1 and G2 are strongly correlated with G3, they are not used as predictors (Fosso Wamba et al., 2019). A generalised linear model (GLM) is used to forecast each student's pass status and to identify the characteristics that most influence passing.

Data Extraction, Transformation and Feature Selection

After downloading the dataset, the string variables are converted to factors, with the reference level of each set to its most frequent value. The target variable, labelled G3.Pass.Flag, is derived from G3 and takes the value P for learners who passed (scored >= 10) and F for those who failed (< 10). The variables G1 and G2 are then removed, along with absences, because many learners have an absence value of 0 regardless of score, so absences are presumed to be unrelated to pass status. Records with an invalid final grade G3 (below 0 or above 20) are removed next, reducing the sample by 17 records. Records with a value of 0 for Medu or Fedu are also removed. A Boolean variable, failures.flag, is created by assigning 1 to learners who failed one or more times in the past and 0 to those with no failures. The cleaned dataset is split into training and test sets in a 75:25 ratio: 75% of the records, chosen at random, are used to train the GLM, and the remaining 25% are used to assess its performance.
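As a rough sketch of the stratified 75:25 split described above, the following uses base R on a small toy data frame (not the actual dataset; in the appendix the split is done with caret::createDataPartition, which stratifies on the target by default):

```r
# Toy data frame with the same pass/fail target structure as the report.
set.seed(1234)
ds <- data.frame(G3.Pass.Flag = factor(rep(c("P", "F"), times = c(300, 100))))

# Sample 75% of the rows within each class so both sets keep the class mix.
idx <- unlist(lapply(split(seq_len(nrow(ds)), ds$G3.Pass.Flag),
                     function(rows) sample(rows, size = floor(0.75 * length(rows)))))

train <- ds[idx, , drop = FALSE]   # 75% of each class
test  <- ds[-idx, , drop = FALSE]  # remaining 25%
```

Splitting within each class, rather than over the whole data frame, keeps the pass rate of the training and test sets close to that of the full sample.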

Model Building and Documentation

Because the dataset contains many features, not all of them can be included, as doing so would hurt the model's accuracy. Relevant features are therefore identified with a random forest model, and the initial GLM is built on those features (Schmitt, 2023). According to the random forest variable importance plot, famsup, failures.flag, combine.education, Medu, goout, combine.alc, Fedu, freetime and health are among the ten most significant variables affecting a learner's pass grade. Since the GLM models the probability of passing, the binomial family with a logit link function is used.
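To make the link function concrete: a binomial GLM with a logit link models the log-odds of passing, log(p / (1 - p)), as a linear function eta of the features, and the inverse link recovers the probability. A minimal sketch:

```r
# Inverse logit: maps a linear predictor eta back to a probability.
logit_inv <- function(eta) 1 / (1 + exp(-eta))

logit_inv(0)      # eta = 0 corresponds to a 50% pass probability
logit_inv(1.386)  # eta ~ log(4) corresponds to roughly an 80% probability
```

This is exactly what predict(..., type = "response") computes for a fitted binomial GLM in the appendix code.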

Model: Generalized Linear Models (GLMs)

The initial GLM achieved approximately 72.86% accuracy on the test set, while the remaining metrics, such as specificity and sensitivity, were not especially high.

Results for initial model in test set:



The number of features in the final GLM is greatly reduced from the initial model by backward elimination, using AIC as the selection criterion. The features retained, along with the AIC score of the final backward-elimination cycle, are listed below:



The GLM is then refit using these five features (goout, failures.flag, Medu, famsup and health), and its performance on the test set is recorded.
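As an illustrative sketch of AIC-based backward elimination, stats::step (which behaves analogously to the MASS::stepAIC call used in the appendix for direction = "backward") is applied here to the built-in mtcars data rather than the student dataset:

```r
# Fit a full model, then let backward elimination drop terms while AIC improves.
fit  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
back <- step(fit, direction = "backward", trace = 0)

# The retained model never has a worse AIC than the starting model.
AIC(back) <= AIC(fit)
```

At each cycle the procedure removes the term whose deletion lowers AIC the most, stopping when no deletion improves it.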

Test-set evaluation of the reduced-feature GLM:

The accuracy is approximately 71.43%, and the sensitivity can be considered moderate; the specificity, however, is high, at 0.84. The confusion matrix shows that passing learners are predicted more accurately than failing learners, with actual failures receiving more incorrect predictions.
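These metrics can be recovered directly from a confusion matrix. The counts below are hypothetical, chosen only to illustrate the calculation, and note that caret's confusionMatrix treats the first factor level (here "F") as the positive class:

```r
# Hypothetical 2x2 confusion matrix: rows = predicted, columns = actual.
cm <- matrix(c(10, 5, 4, 31), nrow = 2,
             dimnames = list(Predicted = c("F", "P"), Actual = c("F", "P")))

accuracy    <- sum(diag(cm)) / sum(cm)        # all correct / all cases
sensitivity <- cm["F", "F"] / sum(cm[, "F"])  # share of actual fails caught
specificity <- cm["P", "P"] / sum(cm[, "P"])  # share of actual passes caught
```

With these illustrative counts, passes are classified well while a larger share of actual failures is misclassified, the same pattern the report describes.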

Findings and Recommendation

The fitted GLM indicates that the variables with the greatest impact on the proportion of students who pass are goout (how often the student goes out), prior failures, the mother's education level, and family educational support (Jha, Ghergulescu & Moldovan, 2019).

It is also observed that learners who go out less often, have no record of previous failures, have mothers with a higher level of education, and are supported by their families have a significantly higher pass rate. If students concentrate on these basic factors, their results may therefore be satisfactory. Students who go out or travel frequently tend to neglect their education, focusing instead on tours, friends and other entertainment; as a result, they do not absorb enough teaching at school, which heavily affects their education and grades. By contrast, students who travel less, or attend only educational tours, have the potential to improve their school grades. Educational tours have an immense impact not only on a child's passing grade but also on building social skills among children. Some of the benefits of educational tours are as follows:

Improves knowledge and comprehension

Educational travels allow learners to acquire knowledge about a particular location or subject in a more involved and engaging setting. For example, visiting a historical landmark or an archaeological site may help learners gain an understanding of a location's history and culture.

Encourages collaboration and social skills

Educational travels motivate learners to take part in collaborative activities and to cooperate, which helps them build teamwork and interpersonal skills. This is especially important for younger students, since it boosts their self-esteem and confidence.

Encourages self-directed learning



On educational travels, learners must be more independent and take control of their learning. This helps pupils develop problem-solving abilities and become more self-directed learners.

Accordingly, parents' education plays a significant role in children's success, which in turn raises their passing grades and helps them reach a better position in future. Parental involvement is critical in education and learning. Even when children receive an excellent education, the encouragement of their families is important to their future success. Parents who motivate their children at home help sustain that motivation, and children will enjoy studying if their family members are encouraging and supportive.

It would be ideal if both mother and father could help their children with schoolwork and projects; this also assists educators. Parental participation increases learners' academic success and fosters a calm, productive classroom environment in which teachers can easily monitor children and help them grow. In short, when parental education, tutors' support and educational tours together raise students' level of education, passing scores generally rise: such students take part in more activities and learn new things, which improves their skills and abilities, so their academic performance and passing grades increase relative to those of students who lack proper guidance from parents or who go out regularly.

Although the model offers adequate accuracy on the test set with fewer features, it cannot be concluded that the feature selection is fully optimal, nor that the model is the best possible one: a different reduced feature set might produce better results, and more advanced models, such as neural networks, might forecast learners' pass status more effectively. Furthermore, the dataset contains only about 500 records, which is quite small, so collecting data on additional learners would help improve the model's accuracy and reliability. Overall, learners' basic characteristics strongly influence whether they achieve a passing score, and it can be concluded that parents' education level and going out less often have a positive impact on students' passing grades and overall performance.



Bibliography

Fosso Wamba, S., Akter, S., Trinchera, L., & De Bourmont, M. (2019). Turning information quality into firm performance in the big data economy. Management Decision, 57(8), 1756-1783. https://ro.uow.edu.au/cgi/viewcontent.cgi?article=1573&context=gsbpapers

Jha, N. I., Ghergulescu, I., & Moldovan, A. N. (2019, May). OULAD MOOC Dropout and Result Prediction using Ensemble, Deep Learning and Regression Techniques. In CSEDU (2) (pp. 154-164). https://pdfs.semanticscholar.org/4d82/e06071af59bf3d74068c89f49024faa24848.pdf

Schmitt, M. (2023). Automated machine learning: AI-driven decision making in business analytics. Intelligent Systems with Applications, 18, 200188. https://www.sciencedirect.com/science/article/pii/S2667305323000133

Appendix:

R code documentation:

# title: "Student Success Rate"

## Loading libraries

library(ggplot2)

library(caret)

library(e1071)

library(MASS)

library(randomForest)

#Read in the dataset and creating a pass/fail factor variable.

Full.DS <- read.csv("success_rate.csv")

# Note the number of rows.

nrow(Full.DS) # 585 students

#Take a quick look at G3.

table(Full.DS$G3)

# There are clearly some issues here, they can be handled in the data cleaning stage.

# Create a new variable that assigns pass "P" to those with G3 >= 10.

Full.DS$G3.Pass.Flag <- as.factor(ifelse(Full.DS$G3 >= 10, "P", "F"))

# Print the level distribution of each categorical variable (columns 4 to 29).
for (i in c(4:29)) {
  cat('Level distribution of variable:', names(Full.DS)[i], '\n')
  print(table(as.factor(Full.DS[, i])))
}

Full.DS$school = relevel(as.factor(Full.DS$school), ref = "GP")

Full.DS$sex = relevel(as.factor(Full.DS$sex), ref = "F")

Full.DS$address = relevel(as.factor(Full.DS$address), ref = "U")

Full.DS$famsize = relevel(as.factor(Full.DS$famsize), ref = "GT3")

Full.DS$Pstatus = relevel(as.factor(Full.DS$Pstatus), ref = "T")

Full.DS$Mjob = relevel(as.factor(Full.DS$Mjob), ref = "other")

Full.DS$Fjob = relevel(as.factor(Full.DS$Fjob), ref = "other")

Full.DS$reason = relevel(as.factor(Full.DS$reason), ref = "course")

Full.DS$guardian = relevel(as.factor(Full.DS$guardian), ref = "mother")

Full.DS$schoolsup = relevel(as.factor(Full.DS$schoolsup), ref = "no")

Full.DS$famsup = relevel(as.factor(Full.DS$famsup), ref = "yes")

Full.DS$paid = relevel(as.factor(Full.DS$paid), ref = "no")



Full.DS$activities = relevel(as.factor(Full.DS$activities), ref = "yes")

Full.DS$nursery = relevel(as.factor(Full.DS$nursery), ref = "yes")

Full.DS$higher = relevel(as.factor(Full.DS$higher), ref = "yes")

Full.DS$internet = relevel(as.factor(Full.DS$internet), ref = "yes")

Full.DS$romantic = relevel(as.factor(Full.DS$romantic), ref = "no")

# Remove G1, G2, and absences.

Full.DS$G1 <- NULL

Full.DS$G2 <- NULL

Full.DS$absences <- NULL

## Data exploration and cleaning

##To get a sense of the data, here is a summary.

summary(Full.DS)

str(Full.DS)

# Because grades should be between 0 and 20, remove all records with values outside that range.
Full.DS <- Full.DS[Full.DS$G3 >= 0,]
Full.DS <- Full.DS[Full.DS$G3 <= 20,]



table(Full.DS$G3)

# See number of rows now

nrow(Full.DS)

#17 rows removed. To close out the look at G3, here is a bar chart

ggplot(data=Full.DS, mapping = aes(x=G3)) + geom_bar()

# To see the relationship between the one continuous variable (age) and passing, make a boxplot.

boxplot(age ~ G3.Pass.Flag,

data = Full.DS,

xlab = "Pass",

ylab = "Age")

# It looks like age makes a difference, and there are a few abnormally high ages.
# For categorical variables (which for this purpose could include those on 1-5 type scales),
# make bar charts. The for loop covers variables 1:2 and 4:29.
for (i in c(1:2, 4:29)) {
  plt <- ggplot(data = Full.DS, mapping = aes(x = Full.DS[, i], fill = G3.Pass.Flag)) +
    geom_bar(position = "fill") + labs(x = colnames(Full.DS)[i])
  print(plt)
}

# There doesn't seem to be a lot of predictive power in most cases. Three look odd:
# Fedu and Medu show a high pass probability when education is 0, and
# Dalc (weekday alcohol) shows more passing at the highest level (5). Here is a quick look at them.

table(Full.DS$Medu)

table(Full.DS$Fedu)

table(Full.DS$Dalc)

##Remove the zero values for Medu and Fedu. I will retain the 10 cases where Dalc = 5.

## Variable exploration

# Remove records with questionable variable values.

# Remove the five records with parents education = 0.

Full.DS <- Full.DS[Full.DS$Medu > 0,]

Full.DS <- Full.DS[Full.DS$Fedu > 0,]

## Calculate correlations for numerical variables

# GET NUMERIC VARIABLES FOR CORRELATION MATRIX



numeric.vars <- names(Full.DS)[sapply(Full.DS, class) %in% c("integer", "numeric")] # numeric variable names
num.Full.DS <- Full.DS[, numeric.vars] # keep only the numeric variables

# CREATE CORRELATION MATRIX

cor.Full.DS <- data.frame(round(cor(num.Full.DS), 2))

cor.Full.DS

## Feature creation

#Create Four new features that may have predictive power.

Full.DS$combine.alc <- Full.DS$Dalc * Full.DS$Walc

Full.DS$combine.education <- Full.DS$Medu * Full.DS$Fedu

Full.DS$both.college <- ifelse(Full.DS$combine.education == 16, 1, 0)

Full.DS$failures.flag <- ifelse(Full.DS$failures > 0, 1, 0)

summary(Full.DS)

## Prepare dataset for modeling

# Stratified sampling via caret::createDataPartition handles the imbalanced classes.
# The train-test split uses a 75:25 ratio.



set.seed(1234)

partition <- createDataPartition(Full.DS$G3.Pass.Flag, list = FALSE, p = .75)

Train.DS <- Full.DS[partition, ]

Test.DS <- Full.DS[-partition, ]

# Pass Rates in train set:

table(Train.DS$G3.Pass.Flag) / nrow(Train.DS)

# Pass rates in test set:

table(Test.DS$G3.Pass.Flag) / nrow(Test.DS)

##Turns out we did get xxx% passing in each set.

### Sample model - Random forest classification

#Run a random forest on the training set and then applied to the test set.

set.seed(894)

excluded_variables <- c("G1","G2","G3") # List excluded variables xxx

control <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

tune_grid <- expand.grid(mtry = c(15:25))



rf <- train(as.factor(G3.Pass.Flag) ~ .,

data = Train.DS[, !(names(Train.DS) %in% excluded_variables)], method = "rf",

ntree = 50, importance = TRUE, trControl = control, tuneGrid = tune_grid)

plot(varImp(rf), top = 10, main = "Variable Importance of Classification Random Forest")

## Build Model 1 - GLM

# Since we are modeling a probability (of passing), the binomial family with a logit link function is used.
# Initially the GLM is fit with the important variables from the random forest model.
formula <- as.formula(G3.Pass.Flag ~ famsup + failures.flag + combine.education + Medu +
                        goout + combine.alc + Fedu + freetime + health)

GLM <- glm(formula, data = Train.DS, family = binomial(link = "logit"))

summary(GLM)

print("Training confusion matrix")

predicted <- predict(GLM, type = "response") # This outputs the probability of passing

predicted

cutoff <- 0.5 # set cutoff value

predicted.final <- as.factor(ifelse(predicted > cutoff, "P", "F"))

confusionMatrix(predicted.final, factor(Train.DS$G3.Pass.Flag))

print("Testing confusion matrix")

predicted <- predict(GLM, newdata = Test.DS, type = "response") # This outputs the probability of passing

predicted.final <- as.factor(ifelse(predicted > cutoff, "P", "F"))

confusionMatrix(predicted.final, factor(Test.DS$G3.Pass.Flag))

#Using stepAIC from the MASS package to decide the features to remove

stepAIC(GLM, direction = "backward")

Train.DS$Mjob = as.factor(Train.DS$Mjob)

#We need to look at Mjob. First determine which level has the most observations.

summary(Train.DS$Mjob)

#Relevel Mjob to make services the base.

levels(Train.DS$Mjob)

Train.DS$Mjob <- relevel(Train.DS$Mjob, ref = "services")

levels(Train.DS$Mjob)

Test.DS$Mjob = as.factor(Test.DS$Mjob)

Test.DS$Mjob <- relevel(Test.DS$Mjob, ref = "services")

# Rerun the GLM with the smaller set of variables.
formula <- as.formula(G3.Pass.Flag ~ goout + failures.flag + Medu + famsup + health)

GLM <- glm(formula, data = Train.DS, family = binomial(link = "logit"))

summary(GLM)

cutoff <- 0.5 # set cutoff value

print("Training confusion matrix")

predicted <- predict(GLM, type = "response") # This outputs the probability of passing

predicted.final <- as.factor(ifelse(predicted > cutoff, "P", "F"))

confusionMatrix(predicted.final, factor(Train.DS$G3.Pass.Flag))

print("Testing Confusion Matrix")



predicted <- predict(GLM, newdata = Test.DS, type = "response") # This outputs the probability of passing

predicted.final <- as.factor(ifelse(predicted > cutoff, "P", "F"))

confusionMatrix(predicted.final, factor(Test.DS$G3.Pass.Flag))

Plots:

Random forest variable importance plot:

Categorical variables distribution by bar plots colored by target variable pass status :

Boxplot of age separated by target variable pass status:

Bar plot of final grade score G3:
