
Running head: BUSINESS ANALYTICS

BUSINESS ANALYTICS

Student’s Name:

Student’s ID:

Date of Submission:

To: Management of ABC Universal Education

From: Consultant of ABC Universal Education

Date:

Title: Drivers of Passing Course Grades

Executive Summary

In this assignment, machine learning techniques are employed, in a consulting role, to estimate which learners are most likely to succeed, that is, to achieve a final score of at least 10 out of 20, on the basis of their characteristics. The dataset contains 33 variables, including the target variable G3, which records each learner's final score, and covers 585 learners. Because the preliminary examination scores G1 and G2 are strongly correlated with G3, they are not used as predictors (Fosso Wamba et al., 2019). A generalised linear model (GLM) is used to forecast each student's pass status and to identify the characteristics that most influence passing.

Data Extraction, Transformation and Feature Selection

After downloading the dataset, the string variables are converted to factors, with the reference level of each set to its most frequent value. The target variable, labelled G3.Pass.Flag, is derived from G3 and takes the value P for learners who passed (scored >= 10) and F for those who failed (< 10). The variables G1 and G2 are then removed, along with absences, because many learners have an absence value of 0 regardless of score, so absences are presumed to be unrelated to pass status. Records with an invalid final grade G3 (below 0 or above 20) are removed next, reducing the sample by 17 records. Records with a value of 0 for Medu or Fedu are also removed. A Boolean variable, failures.flag, is created by assigning 1 to learners who failed one or more times in the past and 0 to those with no failures. The cleaned dataset is split into training and test sets in a 75:25 ratio: 75% of the records, chosen at random, are used to train the GLM, and the remaining 25% are used to assess its performance.
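As a rough sketch of the stratified 75:25 split described above, the following uses base R on a small toy data frame (not the actual dataset; in the appendix the split is done with caret::createDataPartition, which stratifies on the target by default):

```r
# Toy data frame with the same pass/fail target structure as the report.
set.seed(1234)
ds <- data.frame(G3.Pass.Flag = factor(rep(c("P", "F"), times = c(300, 100))))

# Sample 75% of the rows within each class so both sets keep the class mix.
idx <- unlist(lapply(split(seq_len(nrow(ds)), ds$G3.Pass.Flag),
                     function(rows) sample(rows, size = floor(0.75 * length(rows)))))

train <- ds[idx, , drop = FALSE]   # 75% of each class
test  <- ds[-idx, , drop = FALSE]  # remaining 25%
```

Splitting within each class, rather than over the whole data frame, keeps the pass rate of the training and test sets close to that of the full sample.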

Model Building and Documentation

Because the dataset contains many features, not all of them can be included, as doing so would hurt the model's accuracy. Relevant features are therefore identified with a random forest model, and the initial GLM is built on those features (Schmitt, 2023). According to the random forest variable importance plot, famsup, failures.flag, combine.education, Medu, goout, combine.alc, Fedu, freetime and health are among the ten most significant variables affecting a learner's pass grade. Since the GLM models the probability of passing, the binomial family with a logit link function is used.
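To make the link function concrete: a binomial GLM with a logit link models the log-odds of passing, log(p / (1 - p)), as a linear function eta of the features, and the inverse link recovers the probability. A minimal sketch:

```r
# Inverse logit: maps a linear predictor eta back to a probability.
logit_inv <- function(eta) 1 / (1 + exp(-eta))

logit_inv(0)      # eta = 0 corresponds to a 50% pass probability
logit_inv(1.386)  # eta ~ log(4) corresponds to roughly an 80% probability
```

This is exactly what predict(..., type = "response") computes for a fitted binomial GLM in the appendix code.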

Model: Generalized Linear Models (GLMs)

The initial GLM achieved approximately 72.86% accuracy on the test set, while the remaining metrics, such as specificity and sensitivity, were not especially high.

Results for initial model in test set:



The number of features in the final GLM is greatly reduced from the initial model by backward elimination, using AIC as the selection criterion. The features retained, along with the AIC score of the final backward-elimination cycle, are listed below:



The GLM is then refit using these five features (goout, failures.flag, Medu, famsup and health), and its performance on the test set is recorded.
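As an illustrative sketch of AIC-based backward elimination, stats::step (which behaves analogously to the MASS::stepAIC call used in the appendix for direction = "backward") is applied here to the built-in mtcars data rather than the student dataset:

```r
# Fit a full model, then let backward elimination drop terms while AIC improves.
fit  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
back <- step(fit, direction = "backward", trace = 0)

# The retained model never has a worse AIC than the starting model.
AIC(back) <= AIC(fit)
```

At each cycle the procedure removes the term whose deletion lowers AIC the most, stopping when no deletion improves it.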

Test-set evaluation of the reduced-feature GLM:

The accuracy is approximately 71.43%, and the sensitivity can be considered moderate; the specificity, however, is high, at 0.84. The confusion matrix shows that passing learners are predicted more accurately than failing learners, with actual failures receiving more incorrect predictions.
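These metrics can be recovered directly from a confusion matrix. The counts below are hypothetical, chosen only to illustrate the calculation, and note that caret's confusionMatrix treats the first factor level (here "F") as the positive class:

```r
# Hypothetical 2x2 confusion matrix: rows = predicted, columns = actual.
cm <- matrix(c(10, 5, 4, 31), nrow = 2,
             dimnames = list(Predicted = c("F", "P"), Actual = c("F", "P")))

accuracy    <- sum(diag(cm)) / sum(cm)        # all correct / all cases
sensitivity <- cm["F", "F"] / sum(cm[, "F"])  # share of actual fails caught
specificity <- cm["P", "P"] / sum(cm[, "P"])  # share of actual passes caught
```

With these illustrative counts, passes are classified well while a larger share of actual failures is misclassified, the same pattern the report describes.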

Findings and Recommendation

The fitted GLM indicates that the variables with the greatest impact on the proportion of students who pass are goout (how often the student goes out), prior failures, the mother's education level, and family educational support (Jha, Ghergulescu & Moldovan, 2019).

It is also observed that learners who go out less often, have no record of previous failures, have mothers with a higher level of education, and are supported by their families have a significantly higher pass rate. If students concentrate on these basic factors, their results may therefore be satisfactory. Students who go out or travel frequently tend to neglect their education, focusing instead on tours, friends and other entertainment; as a result, they do not absorb enough teaching at school, which heavily affects their education and grades. By contrast, students who travel less, or attend only educational tours, have the potential to improve their school grades. Educational tours have an immense impact not only on a child's passing grade but also on building social skills among children. Some of the benefits of educational tours are as follows:

Improves knowledge and comprehension

Educational travels allow learners to acquire knowledge about a particular location or subject in a more involved and engaging setting. For example, visiting a historical landmark or an archaeological site may help learners gain an understanding of a location's history and culture.

Encourages collaboration and social skills

Educational travels motivate learners to take part in collaborative activities and to cooperate, which helps them build teamwork and interpersonal skills. This is especially important for younger students, since it boosts their self-esteem and confidence.

Encourages self-directed learning



On educational travels, learners must be more independent and take control of their learning. This helps pupils develop problem-solving abilities and become more self-directed learners.

Accordingly, parents' education plays a significant role in children's success, which in turn raises their passing grades and helps them reach a better position in future. Parental involvement is critical in education and learning. Even when children receive an excellent education, the encouragement of their families is important to their future success. Parents who motivate their children at home help sustain that motivation, and children will enjoy studying if their family members are encouraging and supportive.

It would be ideal if both mother and father could help their children with schoolwork and projects; this also assists educators. Parental participation increases learners' academic success and fosters a calm, productive classroom environment in which teachers can easily monitor children and help them grow. In short, when parental education, tutors' support and educational tours together raise students' level of education, passing scores generally rise: such students take part in more activities and learn new things, which improves their skills and abilities, so their academic performance and passing grades increase relative to those of students who lack proper guidance from parents or who go out regularly.

Although the model offers adequate accuracy on the test set with fewer features, it cannot be concluded that the feature selection is fully optimal, nor that the model is the best possible one: a different reduced feature set might produce better results, and more advanced models, such as neural networks, might forecast learners' pass status more effectively. Furthermore, the dataset contains only about 500 records, which is quite small, so collecting data on additional learners would help improve the model's accuracy and reliability. Overall, learners' basic characteristics strongly influence whether they achieve a passing score, and it can be concluded that parents' education level and going out less often have a positive impact on students' passing grades and overall performance.



Bibliography

Fosso Wamba, S., Akter, S., Trinchera, L., & De Bourmont, M. (2019). Turning information quality into firm performance in the big data economy. Management Decision, 57(8), 1756-1783. https://ro.uow.edu.au/cgi/viewcontent.cgi?article=1573&context=gsbpapers

Jha, N. I., Ghergulescu, I., & Moldovan, A. N. (2019, May). OULAD MOOC Dropout and Result Prediction using Ensemble, Deep Learning and Regression Techniques. In CSEDU (2) (pp. 154-164). https://pdfs.semanticscholar.org/4d82/e06071af59bf3d74068c89f49024faa24848.pdf

Schmitt, M. (2023). Automated machine learning: AI-driven decision making in business analytics. Intelligent Systems with Applications, 18, 200188. https://www.sciencedirect.com/science/article/pii/S2667305323000133

Appendix:

R code documentation:

# title: "Student Success Rate"

## Loading libraries

library(ggplot2)

library(caret)

library(e1071)

library(MASS)

library(randomForest)

#Read in the dataset and creating a pass/fail factor variable.

Full.DS <- read.csv("success_rate.csv")

# Note the number of rows.

nrow(Full.DS) # 585 students

#Take a quick look at G3.

table(Full.DS$G3)

# There are clearly some issues here, they can be handled in the data cleaning stage.

# Create a new variable that assigns pass "P" to those with G3 >= 10.

Full.DS$G3.Pass.Flag <- as.factor(ifelse(Full.DS$G3 >= 10, "P", "F"))

# Print the level distribution of each categorical variable (columns 4 to 29).
for (i in c(4:29)) {
  cat('Level distribution of variable:', names(Full.DS)[i], '\n')
  print(table(as.factor(Full.DS[, i])))
}

Full.DS$school = relevel(as.factor(Full.DS$school), ref = "GP")

Full.DS$sex = relevel(as.factor(Full.DS$sex), ref = "F")

Full.DS$address = relevel(as.factor(Full.DS$address), ref = "U")

Full.DS$famsize = relevel(as.factor(Full.DS$famsize), ref = "GT3")

Full.DS$Pstatus = relevel(as.factor(Full.DS$Pstatus), ref = "T")

Full.DS$Mjob = relevel(as.factor(Full.DS$Mjob), ref = "other")

Full.DS$Fjob = relevel(as.factor(Full.DS$Fjob), ref = "other")

Full.DS$reason = relevel(as.factor(Full.DS$reason), ref = "course")

Full.DS$guardian = relevel(as.factor(Full.DS$guardian), ref = "mother")

Full.DS$schoolsup = relevel(as.factor(Full.DS$schoolsup), ref = "no")

Full.DS$famsup = relevel(as.factor(Full.DS$famsup), ref = "yes")

Full.DS$paid = relevel(as.factor(Full.DS$paid), ref = "no")



Full.DS$activities = relevel(as.factor(Full.DS$activities), ref = "yes")

Full.DS$nursery = relevel(as.factor(Full.DS$nursery), ref = "yes")

Full.DS$higher = relevel(as.factor(Full.DS$higher), ref = "yes")

Full.DS$internet = relevel(as.factor(Full.DS$internet), ref = "yes")

Full.DS$romantic = relevel(as.factor(Full.DS$romantic), ref = "no")

# Remove G1, G2, and absences.

Full.DS$G1 <- NULL

Full.DS$G2 <- NULL

Full.DS$absences <- NULL

## Data exploration and cleaning

##To get a sense of the data, here is a summary.

summary(Full.DS)

str(Full.DS)

# Because grades should be between 0 and 20, remove all records with values outside that range.
Full.DS <- Full.DS[Full.DS$G3 >= 0,]
Full.DS <- Full.DS[Full.DS$G3 <= 20,]



table(Full.DS$G3)

# See number of rows now

nrow(Full.DS)

#17 rows removed. To close out the look at G3, here is a bar chart

ggplot(data=Full.DS, mapping = aes(x=G3)) + geom_bar()

# To see the relationship between the one continuous variable (age) and passing, make a boxplot.

boxplot(age ~ G3.Pass.Flag,

data = Full.DS,

xlab = "Pass",

ylab = "Age")

# It looks like age makes a difference, and there are a few abnormally high ages.
# For categorical variables (which for this purpose could include those on 1-5 type scales),
# make bar charts. The for loop covers variables 1:2 and 4:29.
for (i in c(1:2, 4:29)) {
  plt <- ggplot(data = Full.DS, mapping = aes(x = Full.DS[, i], fill = G3.Pass.Flag)) +
    geom_bar(position = "fill") + labs(x = colnames(Full.DS)[i])
  print(plt)
}

# There doesn't seem to be a lot of predictive power in most cases. Three look odd:
# Fedu and Medu show a high pass probability when education is 0, and
# Dalc (weekday alcohol) shows more passing at the highest level (5). Here is a quick look at them.

table(Full.DS$Medu)

table(Full.DS$Fedu)

table(Full.DS$Dalc)

##Remove the zero values for Medu and Fedu. I will retain the 10 cases where Dalc = 5.

## Variable exploration

# Remove records with questionable variable values.

# Remove the five records with parents education = 0.

Full.DS <- Full.DS[Full.DS$Medu > 0,]

Full.DS <- Full.DS[Full.DS$Fedu > 0,]

## Calculate correlations for numerical variables

# GET NUMERIC VARIABLES FOR CORRELATION MATRIX



numeric.vars <- names(Full.DS)[sapply(Full.DS, class) %in% c("integer", "numeric")] # numeric variable names
num.Full.DS <- Full.DS[, numeric.vars] # keep only the numeric variables

# CREATE CORRELATION MATRIX

cor.Full.DS <- data.frame(round(cor(num.Full.DS), 2))

cor.Full.DS

## Feature creation

#Create Four new features that may have predictive power.

Full.DS$combine.alc <- Full.DS$Dalc * Full.DS$Walc

Full.DS$combine.education <- Full.DS$Medu * Full.DS$Fedu

Full.DS$both.college <- ifelse(Full.DS$combine.education == 16, 1, 0)

Full.DS$failures.flag <- ifelse(Full.DS$failures > 0, 1, 0)

summary(Full.DS)

## Prepare dataset for modeling

# Stratified sampling via caret::createDataPartition handles the imbalanced classes.
# The train-test split uses a 75:25 ratio.



set.seed(1234)

partition <- createDataPartition(Full.DS$G3.Pass.Flag, list = FALSE, p = .75)

Train.DS <- Full.DS[partition, ]

Test.DS <- Full.DS[-partition, ]

# Pass Rates in train set:

table(Train.DS$G3.Pass.Flag) / nrow(Train.DS)

# Pass rates in test set:

table(Test.DS$G3.Pass.Flag) / nrow(Test.DS)

##Turns out we did get xxx% passing in each set.

### Sample model - Random forest classification

#Run a random forest on the training set and then applied to the test set.

set.seed(894)

excluded_variables <- c("G1","G2","G3") # List excluded variables xxx

control <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

tune_grid <- expand.grid(mtry = c(15:25))



rf <- train(as.factor(G3.Pass.Flag) ~ .,

data = Train.DS[, !(names(Train.DS) %in% excluded_variables)], method = "rf",

ntree = 50, importance = TRUE, trControl = control, tuneGrid = tune_grid)

plot(varImp(rf), top = 10, main = "Variable Importance of Classification Random Forest")

## Build Model 1 - GLM

# Since we are modeling a probability (of passing), the binomial family with a logit link function is used.
# Initially the GLM is fit with the important variables from the random forest model.
formula <- as.formula(G3.Pass.Flag ~ famsup + failures.flag + combine.education + Medu +
                        goout + combine.alc + Fedu + freetime + health)

GLM <- glm(formula, data = Train.DS, family = binomial(link = "logit"))

summary(GLM)

print("Training confusion matrix")

predicted <- predict(GLM, type = "response") # This outputs the probability of passing

predicted

cutoff <- 0.5 # set cutoff value

predicted.final <- as.factor(ifelse(predicted > cutoff, "P", "F"))

confusionMatrix(predicted.final, factor(Train.DS$G3.Pass.Flag))

print("Testing confusion matrix")

predicted <- predict(GLM, newdata = Test.DS, type = "response") # This outputs the probability of passing

predicted.final <- as.factor(ifelse(predicted > cutoff, "P", "F"))

confusionMatrix(predicted.final, factor(Test.DS$G3.Pass.Flag))

#Using stepAIC from the MASS package to decide the features to remove

stepAIC(GLM, direction = "backward")

Train.DS$Mjob = as.factor(Train.DS$Mjob)

#We need to look at Mjob. First determine which level has the most observations.

summary(Train.DS$Mjob)

#Relevel Mjob to make services the base.

levels(Train.DS$Mjob)

Train.DS$Mjob <- relevel(Train.DS$Mjob, ref = "services")

levels(Train.DS$Mjob)

Test.DS$Mjob = as.factor(Test.DS$Mjob)

Test.DS$Mjob <- relevel(Test.DS$Mjob, ref = "services")

# Rerun the GLM with the smaller set of variables.
formula <- as.formula(G3.Pass.Flag ~ goout + failures.flag + Medu + famsup + health)

GLM <- glm(formula, data = Train.DS, family = binomial(link = "logit"))

summary(GLM)

cutoff <- 0.5 # set cutoff value

print("Training confusion matrix")

predicted <- predict(GLM, type = "response") # This outputs the probability of passing

predicted.final <- as.factor(ifelse(predicted > cutoff, "P", "F"))

confusionMatrix(predicted.final, factor(Train.DS$G3.Pass.Flag))

print("Testing Confusion Matrix")



predicted <- predict(GLM, newdata = Test.DS, type = "response") # This outputs the probability of passing

predicted.final <- as.factor(ifelse(predicted > cutoff, "P", "F"))

confusionMatrix(predicted.final, factor(Test.DS$G3.Pass.Flag))

Plots:

Random forest variable importance plot:

Categorical variables distribution by bar plots colored by target variable pass status :

Boxplot of age separated by target variable pass status:

Bar plot of final grade score G3:
