BUSINESS ANALYTICS
Student’s Name:
Student’s ID:
Date of Submission:
Date:
Executive Summary
employed as consultants in order to estimate which learners are most likely to achieve
characteristics. The full data set consists of 33 variables, including the target variable G3,
which records each learner's final score, for a sample of 585 learners. The final score is
based on the final examination. The preliminary examination scores, G1 and G2, are strongly
related to G3 and are therefore not used as features (Fosso Wamba et al., 2019). A generalised
linear model (GLM) has been used to forecast each student's pass status, and in order to
After the complete dataset is downloaded, the string values are converted to factor
variables, with the most frequent value of each variable used as the reference level. The
target variable, labelled "G3.Pass.Flag", is derived from G3 and takes the value P for
learners who passed (G3 >= 10) and F for learners who failed (G3 < 10). The variables G1 and
G2 are then removed, as is the variable absences: a large number of learners have a value of
0 regardless of score, so absence is presumed to be unrelated to pass status. Records with an
invalid final grade G3 (a score below 0 or greater than 20) are then removed, reducing the
sample size by 17. Records in which Medu or Fedu equals 0 are also removed. A Boolean failure
flag is created by assigning 1 to learners who failed one or more times in the past and 0 to
learners who had no failures at all. The updated dataset is divided into training and test
sets in a 75:25 ratio: a random 75% of records is used to train the GLM, and the remaining
25% is used to assess its performance.
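The preparation steps described above can be sketched in base R as follows. This is only an illustration on synthetic data, not the code used in the report: the data frame `Demo.DS`, the column `failures.flag`, and the generated values are assumptions made for the example, and the removal of G1, G2 and absences is omitted because those columns are not simulated.

```r
# Synthetic stand-in for the student data (hypothetical values, not the real dataset)
set.seed(1234)
n <- 200
Demo.DS <- data.frame(
  G3       = sample(c(-1, 0:20, 25), n, replace = TRUE),  # includes invalid grades
  Medu     = sample(0:4, n, replace = TRUE),
  Fedu     = sample(0:4, n, replace = TRUE),
  failures = sample(0:3, n, replace = TRUE, prob = c(0.7, 0.15, 0.1, 0.05))
)

# Keep only valid grades (0 to 20) and drop records where Medu or Fedu is 0
Demo.DS <- subset(Demo.DS, G3 >= 0 & G3 <= 20 & Medu > 0 & Fedu > 0)

# Target: "P" when G3 >= 10, otherwise "F"
Demo.DS$G3.Pass.Flag <- factor(ifelse(Demo.DS$G3 >= 10, "P", "F"),
                               levels = c("F", "P"))

# Boolean failure flag: 1 if the learner has any past failure, 0 otherwise
Demo.DS$failures.flag <- as.integer(Demo.DS$failures > 0)

# 75:25 random split into training and test sets
train.idx  <- sample(seq_len(nrow(Demo.DS)), size = floor(0.75 * nrow(Demo.DS)))
Train.Demo <- Demo.DS[train.idx, ]
Test.Demo  <- Demo.DS[-train.idx, ]
```

Comparing `table(Train.Demo$G3.Pass.Flag) / nrow(Train.Demo)` with the same table for `Test.Demo` is a quick way to check that the split kept the pass rate roughly balanced.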
Because the dataset contains many features, not all of them can be included without
harming the model's accuracy. Relevant features are therefore identified with a random
forest model, and the basic GLM is then built on those features (Schmitt, 2023).
According to the graph of the top random forest importance scores, famsup, as
health are the ten most significant variables which impact a learner's pass grade. Since the
probability of passing is modelled within the GLM, a binomial family with a logit link
function is used.
The initial GLM achieved approximately 72.86% accuracy on the test set, while the
remaining metrics, such as specificity and sensitivity, are not especially high.
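As a rough illustration of how a binomial GLM with a logit link is fitted and then scored on a held-out set, the sketch below uses simulated data. The data frame `Sim.DS`, the effect sizes, and the 0.5 probability cutoff are assumptions made for the example; the report does not state which cutoff was used.

```r
# Simulated data: pass probability falls with goout and with past failures
# (assumed effect sizes, chosen only for illustration)
set.seed(42)
n <- 300
Sim.DS <- data.frame(
  goout         = sample(1:5, n, replace = TRUE),
  failures.flag = rbinom(n, 1, 0.2)
)
p.pass <- plogis(2 - 0.4 * Sim.DS$goout - 1.2 * Sim.DS$failures.flag)
Sim.DS$pass <- rbinom(n, 1, p.pass)

# 75:25 train/test split
train.idx <- sample(seq_len(n), size = floor(0.75 * n))
Train.Sim <- Sim.DS[train.idx, ]
Test.Sim  <- Sim.DS[-train.idx, ]

# Binomial family with a logit link models the probability of passing
fit <- glm(pass ~ goout + failures.flag,
           data = Train.Sim, family = binomial(link = "logit"))

# Predicted probabilities on the test set, classified at an assumed 0.5 cutoff
prob <- predict(fit, newdata = Test.Sim, type = "response")
pred <- as.integer(prob >= 0.5)

# Confusion counts, treating "pass" (1) as the positive class
actual <- Test.Sim$pass
tp <- sum(pred == 1 & actual == 1)
tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)

accuracy    <- (tp + tn) / length(actual)
sensitivity <- tp / (tp + fn)  # share of actual passes predicted as pass
specificity <- tn / (tn + fp)  # share of actual fails predicted as fail
```

Reporting sensitivity and specificity alongside accuracy matters here because the classes are unbalanced: a model can reach a reasonable accuracy while still misclassifying most of the learners who fail.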
The number of features in the final GLM is greatly reduced from the initial GLM by
backward elimination, using AIC as the selection criterion.
The characteristics that were used including the AIC score for the final reverse
The GLM is then rerun with these five features (health, goout, the failures flag, Medu and
famsup), and then, the assessment
Test-set evaluation results for the GLM with fewer features:
high, and accounts for 0.84. According to the confusion matrix, learners who passed are
predicted more accurately than learners who failed, and learners who actually failed receive
more wrong predictions.
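The backward elimination by AIC mentioned above can be sketched with `stepAIC` from the MASS package, which the appendix also loads. The example runs on simulated data: `Sel.DS`, `noise1` and `noise2` are hypothetical names, and the result is not the feature set selected in the report.

```r
library(MASS)

# Simulated data: only goout truly affects the outcome; health and the two
# noise columns are included so that backward elimination has terms to drop
set.seed(7)
n <- 300
Sel.DS <- data.frame(
  goout  = sample(1:5, n, replace = TRUE),
  health = sample(1:5, n, replace = TRUE),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
Sel.DS$pass <- rbinom(n, 1, plogis(1.5 - 0.5 * Sel.DS$goout))

# Full model with every candidate feature
full.fit <- glm(pass ~ goout + health + noise1 + noise2,
                data = Sel.DS, family = binomial(link = "logit"))

# Backward elimination: repeatedly drop the term whose removal lowers AIC most,
# stopping when no removal improves AIC
reduced.fit <- stepAIC(full.fit, direction = "backward", trace = 0)

formula(reduced.fit)  # terms kept after elimination
AIC(full.fit)         # AIC of the starting model
AIC(reduced.fit)      # AIC of the reduced model (never higher than the start)
```

Because each step only removes a term when doing so lowers AIC, the reduced model trades a small loss of fit for fewer parameters, which is the motivation for using it before the final test-set evaluation.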
The GLM fit shows that the variables with the greatest impact on the share of students
who pass are goout (how often the learner goes out), previous failures, the mother's
education level, and the quality of the family environment (Jha, Ghergulescu & Moldovan,
2019).
It is also noticed that learners who go out less often, have no record of previous
failures, have a mother with a high level of education, and are supported by their family
have a significantly higher pass rate. Students who concentrate on these basic factors may
therefore achieve satisfactory results. When students go out regularly, they start to neglect
their studies and focus on trips, friends and other entertainment. As a result, they do not
receive adequate teaching from the school, which harms their education and grades. Students
who go out less, or who take part only in educational tours, have the potential to improve
their grades. Educational tours have a considerable impact not only on a child’s passing
grade but also on building social skills among children. Some of the benefits of educational
tours are as follows:
or subject in a more involved and engaging setting. For example, visiting a historical
cooperate, which helps them build teamwork and interpersonal skills. This is especially
important for younger students, since it boosts their self-esteem and confidence.
Learners must be more independent and take control of their learning on educational
tours. This helps pupils develop problem-solving abilities and become more self-directed
learners.
children, which in turn increases children's passing grades and helps them to
and learning. Even when children receive an excellent education, the encouragement of their
families is important to their future success. Parents who motivate their children at home
can help sustain that motivation. Children will like studying if their family
It would be excellent if both father and mother could help their children with their
schoolwork and projects. This will also assist educators. Parental participation
classroom. As a result, teachers can more easily monitor children and help them grow. When
parental education, tutors’ support and educational tours all play an essential role in a
student's education, passing scores generally increase: learners take part in different
activities and learn new things, which improves their skills and abilities. Their academic
performance improves as a result, and their passing grades rise relative to those of learners
who lack proper guidance from parents or who go out regularly.
Even though the model offers adequate accuracy on the test set using fewer features, it
cannot be concluded that the feature selection is optimal, nor that the model is the best one
available: a different reduced feature set may produce better results, and more advanced
models, such as neural networks, may prove more effective at forecasting learners' pass
status. Furthermore, the dataset contains approximately 500 records, which is quite small, so
adding data for additional learners could improve the model's accuracy and reliability.
Because the dataset contains a large number of attributes, not all of them can be included
without affecting the accuracy of the model. Significant features are therefore determined
using a random forest model, and the basic GLM is built using those features (Schmitt, 2023).
On the basis of the graph of the top variables in random forest importance, as well as
famsup, and failures. The 10 most influential factors influencing a learner's passing grade
and their overall achievement. Their basic characteristics strongly influence whether they
achieve a passing score, and in this regard it can be concluded that parents' education level
and going out less have a positive
Bibliography
Fosso Wamba, S., Akter, S., Trinchera, L., & De Bourmont, M. (2019). Turning information
quality into firm performance in the big data economy. Management Decision, 57(8),
1756-1783. https://ro.uow.edu.au/cgi/viewcontent.cgi?article=1573&context=gsbpapers
Jha, N. I., Ghergulescu, I., & Moldovan, A. N. (2019, May). OULAD MOOC Dropout and
https://pdfs.semanticscholar.org/4d82/e06071af59bf3d74068c89f49024faa24848.pdf
https://www.sciencedirect.com/science/article/pii/S2667305323000133
Appendix:
R code documentation:
## Loading libraries
library(ggplot2)
library(caret)
library(e1071)
library(MASS)
library(randomForest)
table(Full.DS$G3)
# There are clearly some issues here, they can be handled in the data cleaning stage.
# Create a new variable that assigns pass "P" to those with G3 >= 10.
for (i in c(4:29)) {
  print(table(as.factor(Full.DS[, i])))
}
summary(Full.DS)
str(Full.DS)
# Because grades should be between 0 and 20, I removed all records with values outside that
# range.
table(Full.DS$G3)
nrow(Full.DS)
#17 rows removed. To close out the look at G3, here is a bar chart
# To see the relationship between the one continuous variable (age) and passing, I made a
# boxplot.
boxplot(age ~ G3.Pass.Flag,
data = Full.DS,
xlab = "Pass",
ylab = "Age")
# It looks like age makes a difference and there are a few abnormally high ages. For
# categorical variables (which for this purpose could include those on 1-5 type scales) I made
# bar charts. The for loop covers variables 1:2 and 4:29.
for (i in c(1:2,4:29))
print(plt)
# There doesn't seem to be a lot of predictive power in most cases. Three look odd.
# Fedu and Medu show a high pass probability when education is 0 and
# Dalc (weekday alcohol) shows more passing at the highest level (5). Here is a quick look at
# them.
table(Full.DS$Medu)
table(Full.DS$Fedu)
table(Full.DS$Dalc)
##Remove the zero values for Medu and Fedu. I will retain the 10 cases where Dalc = 5.
## Variable exploration
cor.Full.DS
## Feature creation
summary(Full.DS)
set.seed(1234)
table(Train.DS$G3.Pass.Flag) / nrow(Train.DS)
table(Test.DS$G3.Pass.Flag) / nrow(Test.DS)
# Run a random forest on the training set, then apply it to the test set.
set.seed(894)
rf <- train(as.factor(G3.Pass.Flag) ~ .,
# As we are modeling a probability (of passing), the binomial family with a logit link
# function is used.
# Initially the GLM is fitted with the important variables from the random forest model.
summary(GLM)
predicted <- predict(GLM, type = "response") # This outputs the probability of passing
predicted
confusionMatrix(predicted.final, factor(Train.DS$G3.Pass.Flag))
predicted <- predict(GLM, newdata = Test.DS, type = "response") # This outputs the probability of passing
confusionMatrix(predicted.final, factor(Test.DS$G3.Pass.Flag))
#Using stepAIC from the MASS package to decide the features to remove
Train.DS$Mjob = as.factor(Train.DS$Mjob)
#We need to look at Mjob. First determine which level has the most observations.
summary(Train.DS$Mjob)
levels(Train.DS$Mjob)
levels(Train.DS$Mjob)
Test.DS$Mjob = as.factor(Test.DS$Mjob)
+ famsup + health)
summary(GLM)
predicted <- predict(GLM, type = "response") # This outputs the probability of passing
confusionMatrix(predicted.final, factor(Train.DS$G3.Pass.Flag))
predicted <- predict(GLM, newdata = Test.DS, type = "response") # This outputs the probability of passing
confusionMatrix(predicted.final, factor(Test.DS$G3.Pass.Flag))
Plots:
Distribution of categorical variables: bar plots colored by pass status (the target variable)