Detecting Multicollinearity in Regression Analysis

American Journal of Applied Mathematics and Statistics, 2020, Vol. 8, No. 2, 39-42
Available online at http://pubs.sciepub.com/ajams/8/2/1
Published by Science and Education Publishing
DOI: 10.12691/ajams-8-2-1
variables on the outcome variables that exist within a population. Hence, regression analysis was used for the variables of interest.

2.1. Techniques for Detecting Multicollinearity

The primary techniques for detecting multicollinearity are i) correlation coefficients, ii) the variance inflation factor, and iii) the eigenvalue method.

2.1.1. Pairwise Scatterplot and Correlation Coefficients

The scatterplot is a graphical method that shows the relationship between pairs of independent variables. It is important to look for any scatterplots that seem to indicate a linear relationship between pairs of independent variables.

The correlation coefficient is calculated using the formula

    r = [n∑XY − (∑X)(∑Y)] / √([n∑X² − (∑X)²][n∑Y² − (∑Y)²])

where r is the correlation coefficient, n is the number of observations, X is the first variable in the context, and Y is the second variable in the context. A high correlation coefficient between a pair of variables indicates possible collinearity. In general, if the absolute value of the Pearson correlation coefficient is close to 0.8, collinearity is likely to exist [1,10].

2.1.2. Variance Inflation Factor (VIF)

The variance inflation factor measures how much the variance of an estimated regression coefficient is inflated when the independent variables are correlated. VIF is calculated as

    VIF = 1 / (1 − R²) = 1 / Tolerance

where R² is obtained by regressing the given independent variable on the remaining independent variables, and the tolerance is simply the inverse of the VIF. The lower the tolerance, the more likely multicollinearity exists among the variables. A value of VIF = 1 indicates that the independent variables are not correlated with each other. If 1 < VIF < 5, the variables are moderately correlated with each other. Values of VIF between 5 and 10 are problematic, as they indicate highly correlated variables: with VIF in this range there will be multicollinearity among the predictors in the regression model, and VIF > 10 indicates that the regression coefficients are poorly estimated in the presence of multicollinearity [9].

2.1.3. Eigenvalue Method

… alternative. Moreover, the condition index or condition number of the correlation matrix can be used to measure the overall multicollinearity of the variables. The condition index (CI) is a function of the eigenvalues. It is given by

    CI = √(λ(p−1) / λ(1))

where (p − 1) is the number of predictors, and λ(p−1) and λ(1) are the maximum and minimum eigenvalues, respectively. The value of CI is always greater than one, so a higher value of CI indicates multicollinearity. Generally, CI < 15 usually means weak multicollinearity, 15 < CI < 30 is evidence of moderate multicollinearity, and CI > 30 is an indication of strong multicollinearity [1,11].

3. Results

The respondents were smartphone users between 18 and 27 years of age living in a city. Of the total respondents, 45.8% were male and the remaining 54.2% were female. The users were given a questionnaire to examine the predictors: product quality (PQ), brand experience (BE), product feature (PF), product attractiveness (PA), and product price (PP), which could potentially influence the level of customer satisfaction (CS) for smartphones.

3.1. Detecting Multicollinearity

Multicollinearity among the variables is examined using different methods.

3.1.1. Pairwise Scatterplot

A scatterplot is used to observe the relationship between the variables. It uses dots to represent values of two different variables; the location of each dot on the horizontal and vertical axes denotes the values of an individual data point. Figure 1 shows a scatterplot of the data with the estimated simple linear regression line overlaid. It is useful for finding outliers and observing the patterns between dimensions. The figure shows positive correlation between the pairwise variables but does not give the exact extent of the correlation.
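The three diagnostics described in Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the survey data: the variable names and the degree of collinearity (the third column is built partly from the first) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors (n x p); x3 is partly a copy of x1, so some
# collinearity is present by construction (illustrative only).
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.8 * x1 + 0.2 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# i) Pairwise Pearson correlation coefficients
R = np.corrcoef(X, rowvar=False)

# ii) Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2
# comes from regressing predictor j on the rest; equivalently, VIF_j is
# the j-th diagonal element of the inverse of the correlation matrix.
vif = np.diag(np.linalg.inv(R))

# iii) Condition index: sqrt(lambda_max / lambda_min) over the
# eigenvalues of the correlation matrix.
eigvals = np.linalg.eigvalsh(R)
ci = np.sqrt(eigvals.max() / eigvals.min())

print("correlations:\n", np.round(R, 3))
print("VIF:", np.round(vif, 3))
print("condition index:", round(ci, 3))
```

With this construction the pairwise correlation between the first and third columns is high, the corresponding VIFs are inflated, and the condition index rises accordingly, mirroring how the three diagnostics flag the same problem.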
Table 1. Pearson's Correlation Coefficients

Variables   CS      PP      PF      PA      PQ      BE
CS          1       0.368   0.353   0.412   0.399   0.431
PP          0.368   1       0.419   0.325   0.329   0.333
PF          0.353   0.419   1       0.359   0.337   0.301
PA          0.412   0.325   0.359   1       0.329   0.342
PQ          0.399   0.329   0.337   0.329   1       0.357
BE          0.431   0.333   0.301   0.342   0.357   1

Correlation is significant at the 0.01 level (2-tailed).

The correlation of overall customer satisfaction with the variable brand experience is moderate (r = 0.431, p < 0.01), followed by product attractiveness (r = 0.412, p < 0.01), product quality (r = 0.399, p < 0.01), product price (r = 0.368, p < 0.01), and product feature (r = 0.353, p < 0.01). Since the absolute values of the Pearson correlation coefficients are all well below 0.8, collinearity is very unlikely to exist.

3.1.3. Variance Inflation Factor (VIF)

Regression analysis is well suited to the customer satisfaction survey because it helps to identify the factors that influence customer satisfaction for smartphones. In Table 2, the multiple correlation coefficient between the independent variables and the outcome variable, customer satisfaction, is R = 0.570, which shows a positive correlation between the variables. R² = 0.325 shows that 32.5% of the movement in the dependent variable can be explained by the independent variables; the remaining 67.5% is unexplained. The adjusted R² = 0.316 gives an idea of how well the model generalizes. The difference between R² and adjusted R² is 0.325 − 0.316 = 0.009, meaning that if the model had been derived from the population rather than from a sample, it would account for approximately 0.9% less variance in the outcome. The F-value of the ANOVA table (Table 3) measures the statistical significance of the model. Here, the F-value is statistically significant at p < 0.001, so the outputs of the analysis are not due to chance alone.

A regression coefficient shows how much the dependent variable, customer satisfaction, is expected to increase when the predictor under consideration increases by one unit and all other independent variables are held constant. In the regression analysis using the stepwise method, the variable product feature is observed not to be significant.

The VIF values satisfy 1 < VIF < 5, which indicates that the variables are only moderately correlated with each other. These small VIF values show that there is no problem of collinearity.
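The adjusted R² and F-value discussed above can be reproduced directly from the figures reported in Table 2 and the ANOVA table (Table 3):

```python
# Figures from Table 2 and the ANOVA table: n = 310 observations
# (df_total = 309) and p = 4 predictors.
n, p = 310, 4
r2 = 0.325

# Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F = MS_regression / MS_residual, from the ANOVA sums of squares
f_value = (69.697 / 4) / (144.796 / 305)

print(round(adj_r2, 3))   # 0.316
print(round(f_value, 1))  # 36.7
```

Both values match the reported 0.316 and 36.702 (the latter to rounding), confirming the internal consistency of the tables.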
Table 2. Model Summary b

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .570a   0.325      0.316               0.689                        1.899

a Predictors: (Constant), BE, PP, PA, PQ
b Dependent Variable: CS

Table 3. ANOVA a

Model          Sum of Squares   df    Mean Square   F        Sig.
1  Regression  69.697           4     17.424        36.702   .000b
   Residual    144.796          305   0.475
   Total       214.493          309

a Dependent Variable: CS
b Predictors: (Constant), BE, PP, PA, PQ

Table 4. Coefficients a

Model         B       Std. Error   Beta    t       Sig.    Lower Bound   Upper Bound   Tolerance   VIF
1 (Constant)  0.338   0.241                1.401   0.162   -0.137        0.812
  PP          0.165   0.056        0.155   2.962   0.003    0.055        0.275         0.807       1.239
  PA          0.234   0.056        0.218   4.151   0.000    0.123        0.345         0.803       1.246
  PQ          0.226   0.062        0.192   3.635   0.000    0.104        0.348         0.793       1.261
  BE          0.229   0.052        0.236   4.437   0.000    0.128        0.331         0.785       1.274

B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, the Lower and Upper Bound columns give the 95.0% confidence interval for B, and Tolerance and VIF are the collinearity statistics.
a Dependent Variable: CS
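As noted in Section 2.1.2, the tolerance is the reciprocal of the VIF, so the collinearity statistics in Table 4 can be cross-checked against each other. The check below uses the reported figures, which agree to within rounding of the published values:

```python
# (Tolerance, VIF) pairs from Table 4; VIF should equal 1 / Tolerance
# up to the rounding of the reported three-decimal figures.
collinearity = {
    "PP": (0.807, 1.239),
    "PA": (0.803, 1.246),
    "PQ": (0.793, 1.261),
    "BE": (0.785, 1.274),
}

for name, (tol, vif) in collinearity.items():
    print(name, round(1 / tol, 3), vif)
```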
3.1.4. Eigenvalue Method

Table 5 illustrates the eigenvalues, condition indices, and variance proportions. A dimension stands for a linear combination of the variables, and the eigenvalue stands for the variance of that linear combination. The condition index is a function of the eigenvalues. In Table 5, for the variable brand experience, the highest variance proportion, 96%, is associated with dimension 2, which has an eigenvalue of 0.039 and a condition index of 11.225. The variable product attractiveness has its highest variance proportion, 75%, associated with dimension 3, with an eigenvalue of 0.034 and a condition index of 11.89. Similarly, the highest variance proportions of product quality and product price are associated with dimensions 4 and 3, with eigenvalues of 0.032 and 0.034, respectively. A condition index greater than 15 denotes a probable problem of multicollinearity. The highest condition index is 15.39, for dimension 5, but the variance proportions of the variables are not associated with this value. This shows there is no evidence of collinearity among the variables.

In Table 4, each confidence interval gives the maximum and minimum plausible values of the true coefficient. The most likely true value of the coefficient for product attractiveness is 0.234, and there is 95% confidence that the true value lies between 0.123 and 0.345. The most likely true value of the coefficient for brand experience is 0.229, with 95% confidence that the true value lies between 0.128 and 0.331, and so on. The confidence intervals of more than two regression coefficients overlap, which shows there is a chance that the true values of the coefficients of the various variables are of similar importance. The coefficient value (B = 0.234) suggests that product attractiveness is a stronger driver than the other three variables: brand experience, product quality, and product price. The variable product price is the least important driver of customer satisfaction. However, it is very difficult to identify a single most significant variable influencing the outcome, because of the overlap of the confidence intervals.

4. Conclusion

The relationship between customer satisfaction and the major factors product quality, brand experience, product feature, product attractiveness, and product price is significant with p < 0.001. Multicollinearity among the variables was examined using three techniques: correlation coefficients, the variance inflation factor, and the eigenvalue method. No evidence of multicollinearity among the variables was observed. The variable product attractiveness is the most significant variable influencing the outcome, customer satisfaction. Additionally, more extensive research including other significant variables could be conducted to identify the drivers of customer satisfaction. To deal with the presence of multicollinearity, advanced regression procedures such as principal components regression, weighted regression, and ridge regression can be used.

References

[1] Young, D.S., Handbook of Regression Methods, CRC Press, Boca Raton, FL, 2017, 109-136.
[2] Frank, E.H. Jr., Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis, Springer, New York, 2001, 121-142.
[3] Hosmer, D.W., Lemeshow, S., and Sturdivant, R.X., Applied Logistic Regression, John Wiley & Sons, New Jersey, 2013.
[4] Pedhajur, E.J., Multiple Regression in Behavioral Research: Explanation and Prediction (3rd edition), Thomson Learning, Wadsworth, USA, 1997.
[5] Keith, T.Z., Multiple Regression and Beyond: An Introduction to Multiple Regression and Structural Equation Modeling (2nd edition), Taylor and Francis, New York, 2015.
[6] Aiken, L.S. and West, S.G., Multiple Regression: Testing and Interpreting Interactions, Sage, Newbury Park, 1991.
[7] Gunst, R.F. and Webster, J.T., "Regression analysis and problems of multicollinearity," Communications in Statistics, 4(3), 277-292, 1975.
[8] Vatcheva, K.P., Lee, M., McCormick, J.B., and Rahbar, M.H., "Multicollinearity in regression analysis conducted in epidemiologic studies," Epidemiology (Sunnyvale, Calif.), 6(2), 227, 2016.
[9] Belsley, D.A., Conditioning Diagnostics: Collinearity and Weak Data in Regression, John Wiley & Sons, New York, 1991.
[10] Barton, B. and Peat, J., Medical Statistics: A Guide to SPSS, Data Analysis and Critical Appraisal (2nd edition), Wiley, UK, 2014.
[11] Knoke, D., Bohrnstedt, G.W., and Mee, A.P., Statistics for Social Data Analysis (4th edition), F.E. Peacock Publishers, Illinois, 2002.
© The Author(s) 2020. This article is an open access article distributed under the terms and conditions of the Creative Commons
Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).