Multiple Linear and Non-Linear Regression in Minitab: Lawrence Jerome

MSOR Connections Vol 9 No 3 August – October 2009

Multiple linear and non-linear regression in Minitab

Lawrence Jerome
Senior Instructor
Park University
lawrence7000@msn.com

Park University’s online Advanced Statistics course, EC315, is the second statistics
course in the undergraduate program and is required of all Park Economics students
as well as MBA students. Also known as Quantitative Research Methods, EC315 focuses
on hypothesis testing and on multiple linear and non-linear regression, with multiple
regression making up about half of the course, plus the course project. Students are
required to find a suitable project and data sets for multiple regression (usually on
the internet), perform the analyses and interpretations, and submit a final paper with
their results. Students may use either Excel’s Data Analysis ToolPak or Minitab, which
comes with the textbook as the Minitab 14 Student edition. Minitab’s regression tool
provides a full set of analysis outputs that allow the researcher to evaluate regression
equations and determine which independent variables are the best predictors of the
dependent variable.
A crucial part of the online course is the full set of tutorials showing students how
to perform all the different types of statistical analyses in both Excel and Minitab.
In addition, instructors can use the eCollege Live Chat Pro feature to give live
demonstrations of how to use the statistical software by sharing their desktop and
lecturing via microphone. Although students may be located literally around the
world, they still can participate in live computer lab demonstrations and see directly
how to use statistical software to solve problems and perform statistical analyses.

Linear regression
All multiple linear regression equations have the general form shown in Eqn. 1.
(Eqn. 1) $y = b + m_1 x_1 + m_2 x_2 + \dots + m_n x_n$
In Eqn. 1, $y$ is the dependent variable and the $x_i$ are the independent
variables. The constant, $b$, is the y-intercept (the value of $y$ when all $x_i = 0$),
and the $m_i$ are the slopes (coefficients) of the corresponding independent variables
$x_i$. In the two-dimensional case, $y = mx + b$, the single independent variable $x$
is the sole contributor to predictable changes in the dependent variable. In multiple
regression, by contrast, different independent variables contribute unevenly to changes
in the dependent variable. Hence, in any multiple regression, it is important to evaluate
the contribution of each independent variable; it may even be necessary to drop some
independent variables and/or add new ones to the regression. Minitab allows the user to
make such evaluations and determinations.
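For readers who want to see what a regression tool does with Eqn. 1, the following is a minimal sketch in plain Python, not Minitab's own procedure. It estimates b and the m_i by ordinary least squares through the normal equations. The function names and the small two-predictor data set are made up for illustration; the data are generated without noise from y = 2 + 3 x_1 - 0.5 x_2 so that the fit can be checked by eye.

```python
def solve_linear_system(a, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's lists are left untouched.
    aug = [list(row) + [value] for row, value in zip(a, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_multiple_linear(xs, y):
    """Least-squares fit of y = b + m_1 x_1 + ... + m_n x_n.

    xs is a list of observations, each a list of n independent-variable values.
    Returns (b, [m_1, ..., m_n]).
    """
    # A leading 1 in every row lets the intercept b be estimated with the slopes.
    design = [[1.0] + list(row) for row in xs]
    p = len(design[0])
    # Normal equations: (D^T D) beta = D^T y.
    dtd = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    dty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(dtd, dty)
    return beta[0], beta[1:]


# Hypothetical, noise-free data generated from y = 2 + 3*x1 - 0.5*x2.
xs_demo = [[1.0, 4.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0], [5.0, 3.0]]
y_demo = [2.0 + 3.0 * x1 - 0.5 * x2 for x1, x2 in xs_demo]

b, m = fit_multiple_linear(xs_demo, y_demo)
print(f"b = {b:.3f}, m1 = {m[0]:.3f}, m2 = {m[1]:.3f}")
```

Because the demonstration data contain no noise, the fit recovers the generating intercept and slopes up to rounding. With real data the estimates come with standard errors and P-values, which is the part of the output that Minitab reports and that this sketch leaves out.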
As an example, one Park student conducted a recent study of public high school
graduation rates as a function of school funding, average family income, average
teacher salary, and student-to-teacher ratio. The study uses cross-sectional data for
30 of the largest school districts in the United States, reported in 2005. The data for
this research came from the National Center for Education Statistics, the Bureau of
Labor Statistics and the U.S. Census Bureau [1, 2, 3, 4]. The linear regression model is:
(Eqn. 2) $\text{Grad} = b + m_1 \times \text{Funding} + m_2 \times \text{Income} + m_3 \times \text{Salary} + m_4 \times \text{Ratio}$
Fig 1 shows the raw data setup in a Minitab Worksheet. Note that data must start at
the top of worksheets with headings in the first row. The first column is C1-T, the T
indicating that it’s text data and won’t be used in the multiple regression. Thus, we
have one dependent variable, Grad, and four independent variables, Funding through
Salary.

Fig 1 – Raw data in Minitab for high school graduation rate multiple regression

In Minitab, the student chose GRADS, % as the Response variable and the other four
variables as the Predictor variables. Among the Minitab Regression Options, most
students select Standardized Histograms and Normal Plots, and for Display of Results
students select the “in addition sequential sum of squares and unusual observations…”
option as shown in Fig 2.

Fig 2 – Controlling the output in Minitab

Fig 3 shows the output for this regression analysis. Note that Minitab gives the
regression equation, as well as the coefficients, T statistic for each regression
coefficient, and the corresponding P-value by which the significance level of each
coefficient can be evaluated. The student noted that, while the F statistic level of
significance was at the 1% level (indicating the overall model is significant), the
Adjusted R-square value of 30.6% indicates the model only accounts for about 30% of
the response variable variation.

Regression Analysis: GRADS, % versus FUNDING (In thousands), RATIO, ...

The regression equation is
GRADS, % = 1.86 - 0.000000 FUNDING (In thousands) - 0.0163 RATIO
           + 0.000051 INCOME (In dollars) + 0.000019 SALARY (In dollars)

Predictor                      Coef     SE Coef      T      P
Constant                      1.861       1.442   1.29  0.209
FUNDING (In thousands)  -0.00000029  0.00000016  -1.81  0.083
RATIO                      -0.01628     0.05144  -0.32  0.754
INCOME (In dollars)      0.00005111  0.00001749   2.92  0.007
SALARY (In dollars)      0.00001887  0.00002239   0.84  0.407

S = 0.888587   R-Sq = 40.1%   R-Sq(adj) = 30.6%
PRESS = 28.2372   R-Sq(pred) = 14.36%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       4  13.2317  3.3079  4.19  0.010
Residual Error  25  19.7397  0.7896
Total           29  32.9714

Source                  DF  Seq SS
FUNDING (In thousands)   1  3.6626
RATIO                    1  0.0005
INCOME (In dollars)      1  9.0076
SALARY (In dollars)      1  0.5610

Unusual Observations
     FUNDING (In
Obs   thousands)  GRADS, %    Fit  SE Fit  Residual  St Resid
  5       489856     8.260  6.295   0.480     1.965    2.63R
 13       367330     6.560  4.472   0.304     2.088    2.50R
 18      6111619     3.700  3.480   0.717     0.220    0.42 X

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Fig 3 – School graduation study Minitab linear regression output
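The summary statistics in the Fig 3 output can be recomputed from the sums of squares in its Analysis of Variance table, which may help students see where R-Sq, R-Sq(adj), F and S come from. The sketch below uses only numbers printed in Fig 3 (30 districts, 4 predictors). The function names are hypothetical, and the formulas are the standard textbook ones rather than anything specific to Minitab.

```python
def regression_summary(ss_regression, ss_error, n_obs, n_predictors):
    """Recompute summary statistics from an ANOVA table's sums of squares."""
    ss_total = ss_regression + ss_error
    df_error = n_obs - n_predictors - 1
    df_total = n_obs - 1
    ms_regression = ss_regression / n_predictors
    ms_error = ss_error / df_error
    return {
        "r_sq": ss_regression / ss_total,
        # Adjusted R-square penalises the model for each extra predictor.
        "r_sq_adj": 1.0 - ms_error / (ss_total / df_total),
        "f": ms_regression / ms_error,
        # S is the standard error of the regression.
        "s": ms_error ** 0.5,
    }


def t_statistic(coefficient, standard_error):
    """T statistic for one coefficient: the estimate divided by its standard error."""
    return coefficient / standard_error


# Values copied from the Fig 3 Analysis of Variance and Predictor tables.
fig3 = regression_summary(13.2317, 19.7397, n_obs=30, n_predictors=4)
t_income = t_statistic(0.00005111, 0.00001749)
t_ratio = t_statistic(-0.01628, 0.05144)

print(f"R-Sq = {fig3['r_sq']:.1%}, R-Sq(adj) = {fig3['r_sq_adj']:.1%}")
print(f"F = {fig3['f']:.2f}, S = {fig3['s']:.4f}")
print(f"T(INCOME) = {t_income:.2f}, T(RATIO) = {t_ratio:.2f}")
```

The recomputed values agree with the figures Minitab reports to the printed precision. The P-values are not reproduced here because they require the F and t distribution functions, which are outside the Python standard library.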

Surprisingly, in this student study of factors contributing to high school graduation
rates, Family Income proved the best predictor (with a P-value = 0.007) and
Student/Teacher Ratio the poorest predictor (with a P-value = 0.754). Thus, the student
recommended dropping the RATIO and SALARY variables from the regression and either
rerunning the regression or looking for other predictor variables.
The student was able to note that the Minitab Residual Plots for the two best predictor
variables were not as randomly distributed as desired for a good regression equation. In
particular, the Funding residuals are bunched to the low side. Figure 4 shows these
Minitab Residual Plots for this study.

Fig 4 – Minitab residual plots for the two best predictor variables

This Park student study of factors affecting high school graduation rates was able to
use Minitab effectively to develop and evaluate multiple regression equations, and use
the regression output and plots in a final report of the study. As the student noted,
this regression model shows some promise, but is far from the perfect prediction model
for high school graduation rates, as is often the case in real multiple regression
studies.

Non-linear regression
When a multiple linear regression shows poor overall level of significance, students
are encouraged to try non-linear regression by taking the natural logarithm of all
variables and then running the multiple linear regression on the logged variables.
Another type of non-linear regression situation occurs in Economics in the form of
learning curve and economy of scale cost analyses. This important class of multiple
non-linear regression cost equations can be solved using Minitab. Cost equations based
on learning curves and economies of scale are typically power-law equations of the form
shown in Eqn. 3.
(Eqn. 3) $\text{cost} = T_1 \times x_1^{a_1} \times x_2^{a_2} \times \dots \times x_n^{a_n}$
In Eqn. 3, $T_1$ is the first unit cost with all $x_i$ variables set to one, the $x_i$
are cost parameters (size, numbers, technical parameters), and the $a_i$ are exponents
less than 1, so that the overall cost curve declines as learning and economies of scale
go up [5, 6]. Taking the natural logarithm of both sides of Eqn. 3 transforms the
non-linear equation into a linear equation:
(Eqn. 4) $\ln(\text{cost}) = \ln(T_1) + a_1 \ln(x_1) + a_2 \ln(x_2) + \dots + a_n \ln(x_n)$
Note that Eqn. 4 is in the same form as Eqn. 1, with the dependent variable on the left
and a constant ($\ln(T_1)$) plus coefficients times independent variables on the right.
Thus, to perform multiple non-linear regression on cost learning curve data, take the
natural log of the data and perform the multiple linear regression on the logged data,
then transform the resulting equation into the form of Eqn. 3 by taking the natural
exponential of both sides.
Many of Park University’s students are military students, interested in the economics
of military equipment, maintenance and operations. One student performed a non-linear
regression of the cost of missile power supplies, studying the learning curves and
economies of scale for this particular piece of equipment.
Fig 5 shows the set of cost data for a missile power supply, in the process of being
transformed to the form of Eqn. 4 using the Minitab calculator. The LN variables are
selected for the “Store result in variable” textbox; the natural log function, LOGE,
goes in the “Expression” textbox with the original variable as the argument. Minitab
automatically fills in the natural log columns with the calculated values. This
procedure is repeated for each of the variables, independent and dependent.
Once the natural log variables were calculated, the student performed the usual
multiple linear regression on the logged variables. The results are shown in Fig 6.
Note that both the overall F value and the coefficient T statistics are very large, with
P-values reported as 0.000, indicating that this regression equation is extremely
significant. All that remains is to raise e to the power of both sides of the logged
regression equation, giving the final non-linear cost equation (the constant is
$e^{2.4374} \approx 11.44$, and the exponents are the coefficients from Fig 6):
(Eqn. 5) $\$K = 11.4429 \, (\text{Power})^{0.391} \, (\#\text{Outputs})^{0.351} \, (\text{Qty})^{0.633}$
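The back-transformation from the logged regression to the cost equation of Eqn. 5 is simple arithmetic, sketched below in Python using the coefficients printed in the Fig 6 regression output. The function names and the trial input values are hypothetical and are not taken from the student's data set. The sketch only checks that exponentiating the fitted log-linear equation (Eqn. 4) gives the same prediction as the power-law form (Eqn. 3).

```python
import math


def power_law_from_log_fit(intercept, log_coefficients):
    """Turn ln(cost) = intercept + sum(a_i * ln(x_i)) into (T1, exponents).

    The constant T1 is e raised to the fitted intercept; the exponents a_i are
    the fitted coefficients of the logged variables, unchanged.
    """
    return math.exp(intercept), dict(log_coefficients)


def predict_cost(t1, exponents, inputs):
    """Evaluate cost = T1 * product(x_i ** a_i) for one set of inputs."""
    cost = t1
    for name, exponent in exponents.items():
        cost *= inputs[name] ** exponent
    return cost


# Intercept and coefficients as printed in the Fig 6 Predictor table.
INTERCEPT = 2.4374
LOG_COEFFICIENTS = {"Power": 0.39073, "Outputs": 0.35113, "Quantity": 0.63283}

t1, exponents = power_law_from_log_fit(INTERCEPT, LOG_COEFFICIENTS)

# Hypothetical inputs, chosen only to exercise the formula.
trial = {"Power": 10.0, "Outputs": 2.0, "Quantity": 100.0}
cost_power_law = predict_cost(t1, exponents, trial)

# The same prediction made in log space (Eqn. 4), then exponentiated.
log_cost = INTERCEPT + sum(a * math.log(trial[k]) for k, a in LOG_COEFFICIENTS.items())

print(f"T1 = {t1:.4f}")
print(f"power-law prediction = {cost_power_law:.4f}")
print(f"exp(log-space prediction) = {math.exp(log_cost):.4f}")
```

The computed constant comes out at about 11.44, which matches the leading constant of Eqn. 5, and the two predictions agree because the two forms are algebraically identical. Setting every input to 1 makes the prediction equal T1, which is the sense in which T1 is the cost with all variables set to one.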


Fig 5 – Transforming non-linear data using the Minitab calculator

Regression Analysis: LN(Cost) versus LN(Power), LN(Outputs), ...

The regression equation is
LN(Cost) = 2.44 + 0.391 LN(Power) + 0.351 LN(Outputs) + 0.633 LN(Quantity)

Predictor        Coef  SE Coef      T      P
Constant       2.4374   0.2017  12.08  0.000
LN(Power)     0.39073  0.04600   8.49  0.000
LN(Outputs)   0.35113  0.04948   7.10  0.000
LN(Quantity)  0.63283  0.07459   8.48  0.000

S = 0.0306750   R-Sq = 98.6%   R-Sq(adj) = 97.9%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       3  0.40235  0.13412  142.53  0.000
Residual Error   6  0.00565  0.00094
Total            9  0.40800

Fig 6 – Results of multiple linear regression of logged variables

Conclusions
Park University’s online EC315 Quantitative Research Methods course gives students
powerful analytical tools to tackle some of the toughest economic problems that can be
solved mathematically: multiple linear and non-linear regression. Both Minitab and
Excel’s Data Analysis ToolPak provide all the tools necessary to perform very efficient
and well analysed multiple linear regressions. The Minitab output and residual graphs
are very clear and easy to read, and are easy to copy and paste into final reports to
produce a polished presentation. The Minitab regression tool provides ample options and
user choices to match the output to the user’s needs. Even non-linear regressions such
as cost learning curves and economy of scale curves can be set up and derived in Minitab.
In the final week of class, students are asked to discuss how they will or might be
using their course learning in future endeavours, and invariably students will mention
that learning how to perform statistical analyses in Minitab and Excel will prove
valuable skills in future coursework and future employment. In short, learning the
statistical software is at least as important to the students as learning the
statistics themselves!

References
1. Bureau of Labor Statistics. (2006). Occupational Employment and Wages. 25-2031
Secondary School Teachers, Except Special and Vocational Education. Retrieved
November 2, 2007. Available via: http://www.bls.gov/oes/current/oes252031.htm#msa
[Accessed 10 June 2009].
2. Education Resource Information Center. (2007). ED452177 - An Examination of Teacher
Salary and Student Performance. Retrieved November 20, 2007. Available via:
http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED452177&ERICExtSearch_SearchType_0=no&accno=ED452177
[Accessed 10 June 2009].
3. U.S. Census Bureau. (2007). Retrieved November 8, 2007. Available via:
http://www.census.gov/acs/www/Products/Ranking/2003/R14T160.htm
[Accessed 10 June 2009].
4. U.S. Department of Education. (2007). Retrieved November 8, 2007. Available via:
http://www.ed.gov/about/overview/fed/10facts/index.html [Accessed 10 June 2009].
5. Wright, T.P. (1936). Factors Affecting the Cost of Airplanes, Journal of Aeronautical
Sciences, Vol 3 (No. 4): pp 122–128.
6. Hax, Arnoldo C.; Majluf, Nicolas S. (October 1982). “Competitive cost dynamics: the
experience curve”. Interfaces 12: pp 50–61.
