
Estimating Admission Chances

Project work done as a requirement of Internal Assessment of the course Software
for Data Extraction & Analysis, BMS Sem III

Submitted by:
Kumail Ali Khan ( Roll No. 21112 )
Kodonshel Hongsha ( Roll No. 21104 )
Bachelor of Management Studies (BMS)
Shaheed Sukhdev College of Business Studies
29 November, 2022
Acknowledgement
We extend our sincere gratitude to our mentor, Prof. Rishi Sahay, for his
valuable counsel and help in completing our project. He guided us at every
stage, and it was his encouragement that made it possible for us to finish our
assignment successfully. We also want to thank all the other support staff
members who provided the equipment that was crucial and necessary; without it,
we would not have been able to complete this project effectively.
Additionally, we would like to thank the University of Delhi for approving our
research in our area of competence. Finally, we thank our parents and friends
for their encouragement and support while we worked on this project.
Declaration
We, the undersigned Kumail Ali Khan and Kodonshel Hongsha, hereby declare
that the work contained in this project work, titled "Estimating Admission
Chances", constitutes our own contribution to the research work conducted under
the supervision of Prof. Rishi Sahay and is the result of our own research work. It
has not previously been submitted to this or any other University for any
Degree or Diploma.
Every time a reference to another author's earlier work has been made, it has been
identified unambiguously and listed in the bibliography.
We also state here that all information in this document was acquired and
presented in compliance with ethical standards and academic regulations.
Kumail Ali Khan
Kodonshel Hongsha
Abstract
This paper proposes a practical and data-driven preference estimation method from
reported lists in a Deferred Acceptance mechanism when there are incentives to
report these lists strategically. Data on centralized college admissions show many
pieces of evidence that students construct their lists strategically according to their
admission chances and previous years' admission outcomes. We develop a
preference estimation method to evaluate reported lists within the set of colleges
that are considered accessible to each student. This method allows us to create
personal choice sets and to estimate student preferences by making valid utility
comparisons that are supported by data and theory. We show the robustness of our
estimation method compared to the existing estimation methods. A counterfactual
admissions analysis based on our preference estimates suggests that students from
low SES households are better off under a student sorting rule only based on high
school GPAs.
1.2 Software Analysis: R
R is a statistical language developed by statisticians, which makes it
excellent for statistical computation; it is the most widely used language for
creating statistical tools.
R is both a statistical software application and a programming language for
mathematics. Unlike Stata, SAS, SPSS, MATLAB, and other statistical tools, it
is entirely open source. After graduating, students can keep using it simply
by installing it on their personal PCs. Unlike Excel, R allows you to
construct scripts that make your analysis reproducible. The integrated
development environment (IDE) for R is called RStudio.
Although RStudio offers a more convenient, graphical environment for working
with R, you still need to install R in order to use it. Along with the
console, windows for scripts, files, packages, and plots make it simpler to
keep track of what you are doing. Numerous add-on packages increase R's power,
and RStudio is very effective for creating reports and presentations; it can
even produce LaTeX PDFs.
Introduction:
R offers more advanced modelling and machine learning methods in addition to
the capabilities of statistical software like Stata. However, Stata will
probably still be used for standard regression analysis because it was
designed for that and is very simple to use in those situations.
R is also likely to outperform a MATLAB-like mathematical language for data
analysis.
Machine learning is where R really shines for economists. Machine learning
refers to Big Data prediction methods such as decision trees, LASSO, and
related techniques.
Purpose :
The purpose of this data analysis project is to analyze, predict, and observe the
importance of these variables on the chance of admittance into a Masters graduate
program.
The dataset contains several variables which are considered important when
applying for Masters programs. The variables included are:

• GRE Score (out of 340)
• TOEFL Score (out of 120)
• University Rating (out of 5)
• Statement of Purpose Strength (out of 5)
• Letter of Recommendation Strength (out of 5)
• Undergraduate GPA (out of 10)
• Research Experience (either 0 or 1)
• Chance of Admit (ranging from 0 to 1)

Literature Review:
Understanding the college selection process has consequences for recruiting in
higher education. The variables affecting students' decision to attend college have
been extensively studied. Recruitment agents can divide prospective students into
groups by using student characteristics as a reference.
This enables organisations to target populations with characteristics comparable to
those of the students who are most likely to enrol in their institutions.
Administrators can create a suitable marketing mix to draw in the right students by
taking into account important institutional qualities.
The second section of this article discussed the use of marketing techniques in
student recruitment by higher education institutions. Market research built on
qualitative data and quantitative dataset analysis can help us better understand how
people make college selection decisions. Standardized tests like the CIRP
(Cooperative Institutional Research Program Freshman Survey), ACT Profile (all
high school students who complete ACT tests), ASQ Plus (Admitted Student
Questionnaire), NELS (National Education Longitudinal Study), and SDQ
(College Board's Student Descriptive Questionnaire) are examples of commonly
used datasets, according to Hoyt and Brown (1999).
The scope of this essay was restricted to a study of the key determinants of college
choice and the most popular recruiting marketing techniques. To anticipate the
characteristics influencing students' decisions to choose their schools, it is
important to construct a more extensive set of criteria for each specific institution.
Focus groups and student interviews can provide schools with vital information
about how students evaluate their institutions. Individual institutions may benefit
from doing internal surveys in addition to adopting one of the standardised
instruments to better understand their target market. Hossler (1999) suggested that
colleges use multivariate statistical methods and modern student information
systems to monitor and assess the effectiveness of recruitment initiatives.

OBJECTIVE OF THE RESEARCH:


The objective of this study is to determine the chances of getting admission
into college on the basis of different variables.

• To identify the chance of admission of a student.


• To determine the relationship between different variables considered.
• To study the variables that affect the chance of getting admission.
• To perform a strength analysis to determine the most influential variable.

ANALYSIS AND INTERPRETATION:


Statistical analysis
○ Test of statistics
A statistical test assesses the data's support for a hypothesis using
statistical methods. The baseline hypothesis is termed the null hypothesis,
also referred to as H0. Under H0, the data are produced by random processes;
in other words, the experimental manipulations under the researcher's control
have no impact on the data. H0 is typically a statement of equality (for
instance, that a correlation coefficient equals zero, or that two averages or
variances are equal).
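As an illustrative sketch of testing H0 in R, on simulated data rather than the admissions dataset (the sample drawn here is hypothetical):

```r
# Sketch: testing the null hypothesis that a population mean equals zero
set.seed(42)
x <- rnorm(100, mean = 0.5, sd = 1)  # simulated sample whose true mean is 0.5
result <- t.test(x, mu = 0)          # H0: the population mean equals 0
result$p.value                       # a small p-value is evidence against H0
```

Because the sample was generated with a true mean of 0.5, the test rejects H0 here.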
○ Correlation Analysis
A statistical technique used in research to determine the association between two
variables and gauge the strength of their linear relationship is correlation analysis.
The magnitude of change in one variable as a result of the change in the other is
determined using correlation analysis, to put it simply. A high correlation
indicates a strong association between the two variables, while a low
correlation indicates a weak association.
○ Regression Analysis
Regression analysis makes it possible to determine statistically which of
various factors actually has an effect. It answers the questions: Which
factors matter most? Which can we disregard? How do the factors relate to one
another? And, most importantly, how confident can we be in each of them? In
regression analysis these factors are known as variables. The fundamental
element you are attempting to understand or predict is your dependent
variable.
The chance of admission is the dependent variable in our situation. Then there are
our independent variables, which are (GRE Score, TOEFL, SOP, LOR, etc.) that
we believe have an effect on the dependent variable.

Variables taken:
· GRE Score - GRE scores can support institutions’ efforts to identify
which applicants are academically prepared for graduate-level study
and provide a common, objective measure to help programs fairly
compare applicants from different backgrounds.
· TOEFL Score - The test measures one’s proficiency in four sections –
reading, writing, speaking and listening. Besides aiming for a good
score, preparing for the TOEFL helps individuals genuinely improve
their English communication skills – both written and verbal.
· University Rating - University and business school rankings represent
a tried and tested outlet for prospective decision-making on their next
study destination.
· SOP – The SOP remains the most important element of the admission
process. It allows the admission committee to peek into your
background, almost a narrative to your entire application. In essence,
a well-drafted statement of purpose can downplay certain
weaknesses.
· LOR – It is an important part of your application that can influence
the admission decision of the admission committee as it allows them
to review your capabilities and candidature for the course applied
from the recommender’s point of view.
· CGPA - A good CGPA, combined with creative projects that showcase
your skills, makes for a very strong and impressive resume. Taking
part in cultural, academic, and non-academic activities and in
student organizations also teaches important life lessons that
strengthen a candidature.
· Research - The main purposes of research are to inform action, gather
evidence for theories, and contribute to developing knowledge in a
field of study.
· Chance of admit - Admission is always a game of chance – you are
applying alongside other students, from everywhere in the world, who
may or may not be just as qualified as you are. For this reason,
students really want to know whether they have a fighting chance of
enrolling in the programme of their dreams.

· Sample of Data that was taken into consideration


Distribution of Variables
Data Refining

Train-Test Split:


Train test split is a model validation procedure that allows you to simulate how a model
would perform on new/unseen data. Here is how the procedure works:
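A minimal sketch of such a split in base R, using a synthetic stand-in data frame (since only a sample of the real dataset is shown in this report):

```r
# Sketch: an 80/20 train-test split with base R
set.seed(123)
# Hypothetical stand-in with the same shape as the admissions data
adm_data <- data.frame(GRE.Score       = round(rnorm(500, 316, 11)),
                       Chance.of.Admit = runif(500))
train_idx <- sample(seq_len(nrow(adm_data)), size = 0.8 * nrow(adm_data))
train <- adm_data[train_idx, ]   # 80% of rows, used to fit the model
test  <- adm_data[-train_idx, ]  # held-out 20%, used to evaluate it
nrow(train); nrow(test)
```

The model is fitted only on `train`, and its performance is judged on `test`, which the model never sees during fitting.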

Removing Null values:


The data we used was unrefined, so we needed to check for and remove null
values using the na.omit command:

adm_data = na.omit(adm_data)

1. Checking if the data has outliers:

To check whether the data has outliers, we chose the graphical method and
plotted a boxplot. The command for this was boxplot(adm_data).
The next step would be to remove any outliers found in the variables.
Clearly, no outliers are visible in the box-and-whisker plot we plotted.
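As a sketch, this check can be done with base R's boxplot(), here on synthetic stand-in data; boxplot.stats() additionally lists any points beyond the whiskers numerically:

```r
# Sketch: box-whisker plots of each numeric column to eyeball outliers
set.seed(1)
adm_data <- data.frame(GRE.Score = round(rnorm(100, 316, 11)),
                       CGPA      = rnorm(100, 8.6, 0.6))
boxplot(adm_data, main = "Checking for outliers")
# Points beyond the whiskers, if any, can also be listed numerically:
outliers <- boxplot.stats(adm_data$GRE.Score)$out
outliers
```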

2. Removing outliers from the data:

As we do not have any outliers, there is no need to remove any.


Correlation Matrix: Checking Multicollinearity
Multicollinearity occurs when independent variables in a regression model are correlated.
This correlation is a problem because independent variables should be independent. If the
degree of correlation between variables is high enough, it can cause problems when you fit
the model and interpret the results.

# Method 1
Using the cor() function, we determine the correlation between the different variables.

Here we can see that TOEFL.Score is highly correlated with GRE.Score, and CGPA is highly
correlated with both. So we will keep only one variable out of GRE.Score, TOEFL.Score and CGPA.

Now, removing TOEFL.Score and CGPA.
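A sketch of this step on synthetic data built so that the same correlation pattern appears (the variable names follow the dataset described above):

```r
# Sketch: correlation matrix, then dropping the highly correlated predictors
set.seed(7)
gre   <- rnorm(200, 316, 11)
toefl <- 0.3 * gre + rnorm(200, 12, 2)    # constructed to track GRE.Score
cgpa  <- 0.05 * gre + rnorm(200, 0, 0.3)  # constructed to track GRE.Score
df <- data.frame(GRE.Score = gre, TOEFL.Score = toefl, CGPA = cgpa)
round(cor(df), 2)  # inspect the pairwise correlations
df_new <- subset(df, select = -c(TOEFL.Score, CGPA))  # keep GRE.Score only
names(df_new)
```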

New Dataset:

# Method 2 – Graphical Representation


Graphical representation using the ggcorr() function from the GGally package,
an extension of ggplot2.

EXPLANATION OF REGRESSION STATISTICS:

Multiple R: The correlation coefficient, abbreviated as r, measures how
strongly two variables are linearly related. You can use it to analyse the
connections between variables such as profitability and advertising
expenditure, which aids in understanding how your decisions will affect your
organisation.
R2: This is also known as R-squared or r2. It indicates how much of the
variance may be attributed to the model. A value of 80% means that the fitted
line accounts for 80% of the variation of the values around the mean; in other
words, the regression model explains 80% of the variability.
Adjusted R square: The adjusted R square takes into account the number of
terms in a model. You should use it instead of R square if you have several
x-variables.
Standard Error of the Regression: The regression standard error is an estimate
of the error of a regression coefficient; it is not the same as the ordinary
statistical standard error of the data. The standard error of a regression
coefficient measures that coefficient's precision and is inversely related to
the size of the sample from which the coefficient was obtained. In general,
lower standard errors mean the estimates describe what is actually happening
more accurately. If the standard error is high, some other factor may be
making that particular component of your model unreliable or inaccurate,
necessitating further data.
Observations: the number of observations in the sample.

ANOVA TABLE EXPLAINED:

Regression MS equals Regression SS divided by the regression degrees of
freedom.

Residual MS, the mean squared error, equals Residual SS divided by the
residual degrees of freedom.
The overall F test is used to test the null hypothesis.
Significance F is the p-value associated with the overall F test.

REGRESSION COEFFICIENTS TABLE:

Coefficient: the least-squares estimate of the coefficient.

Standard Error: the least-squares estimate of the coefficient's standard
error.
t statistic: the t statistic for testing the null hypothesis against the
alternative hypothesis.
P-value: the p-value for the hypothesis test.

The most helpful piece of information in this section is the linear
regression equation, which is given as
y = mx + b, that is, y = slope * x + intercept.

Developing Multiple Regression Analysis:


Train Data
Sample of trained data of the chosen dataset for regression model.

The general mathematical equation for a linear regression is


y = ax + b

Following is the description of the parameters used −


● y is the response variable.
● x is the predictor variable.
● a and b are constants which are called the coefficients.

We have taken “Chance.of.Admit” as the dependent variable y.

And independent variables are :-


1. GRE.Score
2. University.Rating
3. SOP
4. LOR
5. Research

Linear Model : lm() Function


This function creates the relationship model between the predictor and the response variable.
R-square is the percentage of the response variable variation that is explained by a linear
model.
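A sketch of fitting this model with lm(), on a synthetic stand-in for the training data (the formula uses the variable names listed above; the simulated response coefficients are arbitrary, chosen only so the fit has something to find):

```r
# Sketch: multiple regression with lm() on stand-in training data
set.seed(99)
n <- 400
train <- data.frame(GRE.Score         = rnorm(n, 316, 11),
                    University.Rating = sample(1:5, n, replace = TRUE),
                    SOP               = sample(1:5, n, replace = TRUE),
                    LOR               = sample(1:5, n, replace = TRUE),
                    Research          = rbinom(n, 1, 0.5))
# Simulated response with noise, so the regression has signal to recover
train$Chance.of.Admit <- with(train, 0.005 * GRE.Score + 0.02 * LOR +
                                     0.05 * Research + rnorm(n, 0, 0.05))
model <- lm(Chance.of.Admit ~ GRE.Score + University.Rating + SOP +
              LOR + Research, data = train)
summary(model)$r.squared  # share of training variance explained
```

summary(model) also reports the coefficients, their standard errors, t statistics and p-values discussed earlier.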

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%: 0% indicates that the model explains
none of the variability of the response data around its mean, while 100%
indicates that the model explains all of it. In this regression model, the
R-square value on the training data is 0.7383.

This means the regression model explains 73.83% of the variability on the
basis of the variables taken. The remaining 26% or so of the variability may
depend on other relevant factors such as CGPA, TOEFL score, personal interview
marks, profile, work experience and many more such factors.

Let's check the coefficients of independent variables.

A coefficient expresses the strength of the relationship between the input value and the output
value. You can read a coefficient as "for every increase of the input value by one unit, the
output value will change by [whatever number the coefficient is] units". In linear regression,
you can think of the coefficient as the slope of the line.

We can conclude that all these variables have a significant impact on
“Chance.of.Admit” and that the model is a good fit for the data. Among the
independent variables, LOR and Research carry especially high weight.

Plot Analysis:
1. Residuals vs Fitted (Linearity):

We don't want to see a pattern here; it should just look like a cloud, and the red line in the
middle should be more or less flat. The fact that the residuals are bigger on the sides and
smaller in the middle shows that our model is much more accurate for middle-range values
than for extreme values (either very high or very low).

2. Normal Q-Q Plot (Normality):


We want all of our points to be on that dotted line, and we want the line to run diagonally
across the middle of the plot. Since a significant number of our residuals fall on the
line, we can conclude that our data is approximately normal.

3. Scale Location Plot (Constant Variance):

This plot enables you to determine whether the data points are spread equally along your
predictors. The red line in the centre should be flat, and the points should be distributed
randomly. This plot indicates that our data may be skewed, which is what the other plots
have been telling us as well.
4. Residuals vs Leverage Plot (Independence):
This plot helps us find points that stand out. If you can see a red line with a dot between
most of the points and one or two that stand apart, those points have a big effect on your
analysis. We don't have to worry about any strong outliers here.
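The four plots discussed above all come from calling plot() on a fitted lm object; a minimal sketch on simulated data:

```r
# Sketch: the standard lm diagnostic plots in a 2x2 grid
set.seed(5)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2))  # Residuals vs Fitted, Normal Q-Q,
plot(fit)             # Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))  # restore the single-plot layout
```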

Actual Value vs Predicted Value:

As we have split the data into “Train” and “Test” sets and made a regression model on the
“Train” dataset, we can now use this model to predict values. We can verify this by
visualizing the predicted values against the actual values.

First let's predict the values.
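A sketch of this step with predict(), on self-contained stand-in data (in the project, the fitted model and the “Test” split would be used instead):

```r
# Sketch: predicting on the held-out test rows and pairing with actuals
set.seed(11)
d <- data.frame(x = rnorm(200))
d$y <- 0.5 + 0.3 * d$x + rnorm(200, 0, 0.1)
train <- d[1:150, ]
test  <- d[151:200, ]
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)  # one prediction per test row
head(data.frame(actual = test$y, predicted = pred))
```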


Plotting Line Chart:
For better understanding, we are going to plot the actual values of “price and then on top of
that we will plot the predicted values of “price” . To make the plot look less cluttered and
easy on our eyes, we've only taken the first 150 values.
Here in this plot we can see that the plot of predicted values almost overlaps the plot of actual
values. This states that our model can predict the price of diamond fairly accurately, given the
dependent variables.

VIF Test:
VIF measures the strength of the correlation between the independent variables in regression
analysis. This correlation is known as multicollinearity, which can cause problems for
regression models.
For that we’ll use the vif() function from the car library to calculate the VIF for each
predictor variable in the model:
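vif() from the car package reports, for each predictor, 1 / (1 - R²) from regressing that predictor on all the others. A base-R sketch of the same computation, on synthetic data with deliberate collinearity:

```r
# Sketch: VIF computed by hand; vif() from car gives the same numbers
set.seed(3)
x1 <- rnorm(300)
x2 <- rnorm(300)
x3 <- 0.9 * x1 + rnorm(300, 0, 0.2)  # x3 deliberately collinear with x1
r2_x3  <- summary(lm(x3 ~ x1 + x2))$r.squared  # how well the others explain x3
vif_x3 <- 1 / (1 - r2_x3)
vif_x3  # well above the concerning threshold of 5 here
```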

These results look positive: VIF values above 5 are the ones that are concerning. In
our model there is no multicollinearity, as we already took care of it when examining
the correlation matrix.
Summary:
In this study, we use the admissions dataset to forecast a student's likelihood of being
admitted to a university based on a variety of academic performance measures. We first
processed the data and discovered two variables to be strongly correlated, which we then
eliminated. The model achieved a respectable R-square value of around 73%, which gives a
reasonable estimate of the expected admission chances. Finally, we examined several
diagnostic plots to confirm the linearity and normality of the data we collected. To make
the project easier to grasp, several graphical representations were created.

Conclusion:
We found that data on the chances of admission to a university is readily available,
which offers us the opportunity to create a more accurate regression model. We can infer
that a variety of additional characteristics, including a student's CGPA, TOEFL score,
and work experience, may affect their likelihood of being admitted to a university.
Because these criteria were not taken into consideration, the model cannot predict the
precise chance of admission to a college or university, but it can provide a close
estimate, which can be helpful to someone looking to enrol in a new college.

References/ Bibliography:
• https://www.kaggle.com/code/aninda/admission-prediction-in-r/notebook
• https://rpubs.com/safirawp/graduate_admissions_analysis
