• The statistical distribution of data often has unequal variation (e.g. due to the complex
interaction between the risk factors affecting a population/sub-population, which cannot all be
measured and accounted for in statistical analyses).
• Unequal variation implies that more than one slope (rate of change) is needed to describe the
relationship between the outcome and the risk factors/predictors/independent variables.
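
To see why unequal variation leads to more than one slope, consider a simple location-scale model. This is only an illustrative assumption: the symbols $\gamma$, $u_i$ and $Q_u$ are introduced here and are not part of the original notes.

```latex
y_i = \beta_0 + \beta_1 x_i + (1 + \gamma x_i)\, u_i, \qquad 1 + \gamma x_i > 0,
\quad u_i \text{ independent of } x_i \text{ with quantile function } Q_u(q)

Q_y(q \mid x_i) = \bigl(\beta_0 + Q_u(q)\bigr) + \bigl(\beta_1 + \gamma\, Q_u(q)\bigr)\, x_i
```

Under this assumption the slope of the $q$-th conditional quantile is $\beta_1 + \gamma\, Q_u(q)$. It is the same for every $q$ only when $\gamma = 0$, i.e. when the variation is equal across $x$.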

Linear regression

“Standard linear regression (or least squares regression (LSR)) techniques summarize the average
relationship between a set of independent variables and the response (outcome/dependent)
variable based on the conditional mean function E(y|x)”. In other words, the relationship is
estimated at the mean of the response variable distribution as a function of the independent
variables. It is therefore also known as regression around the mean or ordinary least squares
regression (OLSR).

For $n$ observations, the outcome variable $y_i$ is expressed as a linear function of the
independent variables $x_{ik}$:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, 2, \dots, n$$

For one independent variable, $x_{i1}$:

$$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i, \quad i = 1, 2, \dots, n$$
• Note that OLSR estimates only one rate of change (slope $\beta$) for each independent variable.

The coefficients are estimated by the least squares method, i.e. by minimising the sum of squared
errors: $\min \sum_i \varepsilon_i^2$

Figure 1. Least squares regression.
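
As a minimal sketch of the least squares idea for one independent variable, the closed-form slope and intercept can be computed directly. The data and the function names (`ols_fit`, `sum_squared_errors`) are made up for illustration and do not come from the original notes.

```python
# Least squares for one predictor, using the closed-form solution.
# The data below are illustrative only.

def ols_fit(x, y):
    """Return (intercept, slope) minimising the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(x, y)
best = sum_squared_errors(x, y, b0, b1)
# Shifting the fitted line in any direction can only increase the criterion.
shifted = sum_squared_errors(x, y, b0 + 0.1, b1)
print(b0, b1, best, shifted)
```

Whatever the spread of the data around the line, this fit returns a single slope.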

Linear quantile regression (LQR)

“Quantile regression quantifies the association of explanatory variables with a conditional quantile
of a dependent variable without assuming any specific conditional distribution. It hence models the
quantiles, instead of the mean as done in standard regression. In cases where either the
requirements for mean regression, such as homoscedasticity, are violated or interest lies in the
outer regions of the conditional distribution, quantile regression can explain dependencies more
accurately than classical methods (Waldmann 2018, Statistical Modelling).”

“Since OLS assumes normal distribution of the data, the results provide information only on the
mean of the dependent variable to the independent variable” (Jang et al, SciRep, 2018).

Quantile regression (QR), which was introduced by Koenker and Bassett (1978), is an extension of
OLS that allows the conditional distribution of the outcome ($y_i$) given $x_{ik}$ to be studied at
different locations of Y. It gives an overall view of the relationship between Y and X, not only
around the mean. The locations of Y are its quantiles. QR makes no assumptions about the
distribution of the dependent variable, unlike OLS (where it is assumed to be normal, with no outliers).

“Quantile regression fits specified percentiles of the response variable (outcome), such as the 90th
[percentile …] provides data on the relationship with outliers of predictor/independent variables”
(Jang et al, SciRep, 2018).

… examine “OUTCOME NAME”, allowing for a conditional distribution. Quantile regression is a
regression model in which a specified conditional quantile of the outcome (NAME) is expressed as a
linear or nonlinear function of the covariates in the model.

The expression of a linear relationship is similar to that for linear regression:

$$y_i = \beta_0^{(p)} + \beta_1^{(p)} x_{i1} + \varepsilon_i^{(p)}, \quad i = 1, 2, \dots, n$$

but the regression coefficients are now calculated for a specific percentile $p$ of the outcome
variable $y_i$.
• Quantile regression estimates multiple rates of change (slopes $\beta_1^{(p)}$), from the lowest to
the highest percentiles of the outcome variable. It provides a more complete picture of the
relationship between the outcome and risk factors/predictors/independent variables.

Instead of using the least squares method, QR minimises the sum of the absolute values of the
errors $|\varepsilon_i|$, with weights that depend on the quantile $q$ being fitted (e.g. $q = 0.9$ for
the 90th percentile) and on whether each data point lies above or below the QR line (Figure 2):

$(1 - q)\,|\varepsilon_i|$ for data points below the QR line

$q\,|\varepsilon_i|$ for data points above the QR line.

Figure 2. Linear quantile regression. Note that QR uses all $y_i$, but gives them weights depending
on the quantile and the position of the point relative to the QR line.
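
The weighting scheme can be sketched in a few lines. In the simplest case, with no predictors, the constant that minimises the weighted absolute errors is the corresponding sample quantile. The function names (`check_loss`, `total_loss`, `best_constant`) and the data are hypothetical and used only for illustration.

```python
# Weighted absolute error used by quantile regression, and a check that
# minimising it over a constant recovers the sample quantile.

def check_loss(residual, q):
    """q*|e| for points above the fit (e >= 0), (1 - q)*|e| for points below."""
    return q * residual if residual >= 0 else (q - 1) * residual

def total_loss(y, fitted, q):
    """Sum of weighted absolute errors around a constant fitted value."""
    return sum(check_loss(yi - fitted, q) for yi in y)

def best_constant(y, q):
    """Search the observed values for the one with the smallest total loss."""
    return min(sorted(y), key=lambda c: total_loss(y, c, q))

# Illustrative data with one large outlier.
y = [1, 2, 3, 4, 5, 6, 7, 8, 100]

lower_fit = best_constant(y, 0.25)
median_fit = best_constant(y, 0.50)
upper_fit = best_constant(y, 0.75)
mean_y = sum(y) / len(y)
print(lower_fit, median_fit, upper_fit, mean_y)
```

In this example the outlier pulls the mean well above most of the data, while the fitted quartiles and median stay within the bulk of the observations.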

Quantile regression makes no assumptions about the distribution of the residuals (in standard LSR
they have to be normally distributed).
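
The following sketch fits a quantile regression line for one predictor on synthetic data whose spread grows with x. It is not how statistical software estimates QR (that is normally done by linear programming). Instead it relies on the property that, for a straight-line fit, at least one minimiser of the weighted absolute loss passes through two data points, so a brute-force search over pairs is enough for a small example. The helper `quantile_line` and the data are assumptions made for illustration.

```python
from itertools import combinations

def check_loss(residual, q):
    """q*|e| for points above the line, (1 - q)*|e| for points below."""
    return q * residual if residual >= 0 else (q - 1) * residual

def quantile_line(points, q):
    """Return (intercept, slope) of the best line through two data points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical pair does not define a regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(check_loss(y - (intercept + slope * x), q) for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Synthetic data: three outcomes per x value, fanning out as x increases.
points = [(x, m * x) for x in range(1, 9) for m in (0.5, 1.0, 1.5)]

fits = {q: quantile_line(points, q) for q in (0.1, 0.5, 0.9)}
for q, (b0, b1) in fits.items():
    print(q, b0, b1)
```

On these data the fitted slope increases from the lower to the upper quantile, which is the pattern of multiple rates of change described above. A single OLS slope would not show it.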


• in business:
- the amount of money spent in a store or on a website is likely to be skewed. You, as the store’s
owner, may be more interested in finding out what predicts spending in the top quantiles rather
than the mean.
• in health:
- predicting low birth weight is important because babies born at low weight are much
more likely to have health complications than babies of more typical weight
- modelling risk factors for birth weight in twin gestations (Michael et al, 2017)
- association of physical activity and sedentary time with health-related QoL among
lung cancer survivors
- factors affecting the length of stay in hospital before death (e.g. the length of stay in men
hospitalised for organ failure is higher than in those hospitalised for cancer, and the difference
is more pronounced for longer stays; women with influenza stay less than those with cancer,
and the difference increases for longer stays)
- factors affecting medical expenditure (having more chronic problems increases the
expenditure in those with low and medium medical expenditures, but not in those
with the highest)
- smoking and obesity (does smoking affect obesity evenly in Chinese
women? No: weight decreases with smoking at lower rates in obese women than in
those with lower weight)
- child nutritional status in Egypt (the association of HAZ (height-for-age Z-score) with
socio-economic factors depends on the position on the HAZ scale); however, in this study there
are discrepancies between the regression coefficients presented in Table 2 and those
plotted in Figure 2 (see for example access to clean water).
- predictors of fatigue in people with HBV (hepatitis B virus): cortisol and cytokines (IL-6
and TNF-α) are predictors of fatigue depending on the level of total fatigue and
subdomain/level of fatigue (e.g. IL-6 is associated (decreasing) with the cognitive/mood
subdomain in the top quantiles (>0.7), but not in the others; TNF-α is
associated (increasing) with fatigue, but the association is strongest in those with the
lowest fatigue levels (0.1 quantile)).
- how political regimes affect health conditions such as infant and child mortality rates
and life expectancy.
- association between physical activity and paediatric obesity (Mitchell et al., 2017);
strongest association for the higher quantiles of BMI z-score, i.e. moderate to
vigorous physical activity (MVPA) can shift the BMI score towards lower levels.
• in economics:
- wage inequality by education (does education improve wages? This depends on the country
and on where you are on the earnings scale; a good example is Portugal, where a 12.5% average
increase in salary per extra year of school hides an increase of only 6.5% at the 10% quantile,
among those who earn less) (Martins and Pereira, 2004); this is a good paper for journal club
- change in BMI distribution over time and associated socioeconomic gradients (Gebremariam
et al 2018); an interesting one, good for journal club
- determinants of house price valuation (characteristics of the properties are not priced the
same across a given distribution of house prices: e.g. in Utah, USA (1999-2000), the size of the
house (sq feet) had a higher impact on the price of the most expensive houses; the same was true
for the area of the land, having full baths, hardwood floors and tile floors; whilst having a garage
influences the price in the same way across the price range)
- determinants of Paris apartment prices (Amedee-Manesme et al, 2017)
- variables influencing farmland prices in Germany at the top end of the price range; farmland
prices in Germany doubled between 2006 and 2016; the EU is working on a policy to cap prices to
make land affordable to farmers; determinants of price found by regression around the
mean might not apply at the top end of the price range
• in education:
- effect of cooperative learning for exams (i.e. learning in groups) on grades (high-achieving
students benefit from cooperative learning, whilst low-achieving students are
negatively affected if they learn as part of a learning group, compared with those who learn
on their own)
• miscellaneous:
- extent to which drinkers respond to price changes by varying the ‘quality’ of the alcohol that
they consume (normal and heavy drinkers differ, with heavy drinkers re-orienting
towards poorer-quality alcohol)
• in environment and ecology:
- analysing radon accumulation in homes
- density of trout as a function of stream width:depth ratio (density decreases much more quickly
in higher-density populations, and there is an increase in lower-density populations)
Log-linear function: $\ln y = \beta_0 + \beta_1 X + \varepsilon$
• in psychology:
- Does art make you happy? A quantile regression approach
- predicting counterproductive work behavior with narrow personality traits (one of the best
examples for quantile regression)
Modelling the log of total medical expenditure (ltotexp) for Medicare (elderly) patients

(Christopher Baum, 2013)

use, clear

* Explanatory variables include an indicator for supplementary private insurance (suppins), a health
status variable (totchr) and three demographic measures: age, female, and white.

Drop missing data

drop if mi(ltotexp)


su ltotexp suppins totchr age female white, sep(0)

Variable      Obs       Mean   Std. Dev.        Min        Max
ltotexp     2,955   8.059866    1.367592   1.098612   11.74094
suppins     2,955   .5915398    .4916322          0          1
totchr      2,955   1.808799    1.294613          0          7
age         2,955   74.24535    6.375975         65         90
female      2,955   .5840948    .4929608          0          1
white       2,955   .9736041    .1603368          0          1

Plot the quantile function (the inverse of the cumulative distribution function) of “ltotexp”, the
log of total medical expenditure:
qplot ltotexp, recast(line) ylab(,angle(0)) xlab(0(0.1)1) xline(0.5) xline(0.1) xline(0.9)

Figure 3. Quantile plot (inverse CDF) of ltotexp.

The 10th, 50th and 90th percentiles are approximately 6, 8 and 10, so the distribution of the data is
roughly symmetric around the median.

Perform median regression and do not display the iteration log:

qreg ltotexp suppins totchr age female white, nolog

Median regression                                    Number of obs =   2,955
  Raw sum of deviations  1555.48 (about 8.111928)
  Min sum of deviations 1398.492                     Pseudo R2     =  0.1009

ltotexp        Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
suppins     .2769771    .0535936    5.17    0.000    .1718924    .3820617
totchr      .3942664    .0202472   19.47    0.000    .3545663    .4339664
age         .0148666    .0041479    3.58    0.000    .0067335    .0229996
female     -.0880967    .0532006   -1.66    0.098   -.1924109    .0162175
white       .4987457    .1630984    3.06    0.002    .1789474     .818544
_cons       5.648891     .341166   16.56    0.000    4.979943    6.317838

All variables except female are statistically significant at the 5% level.
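
Two of the reported quantities can be checked by hand from the output above. The sketch below assumes that the pseudo R2 is one minus the ratio of the minimised sum of deviations to the raw sum of deviations, and that each t statistic is the coefficient divided by its standard error. The variable names are illustrative.

```python
# Values copied from the median regression output above.
raw_sum = 1555.48     # raw sum of deviations (about the unconditional median)
min_sum = 1398.492    # minimised sum of deviations (about the fitted median)

# Assumed definition: share of the raw deviations removed by the model.
pseudo_r2 = 1 - min_sum / raw_sum

coef_suppins = 0.2769771
se_suppins = 0.0535936
t_suppins = coef_suppins / se_suppins

print(round(pseudo_r2, 4), round(t_suppins, 2))
```

Both values agree with the reported Pseudo R2 (0.1009) and the t statistic for suppins (5.17).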

Compare ordinary least squares regression with QR.

eststo clear

*linear regression and store

eststo, ti("OLS"): qui reg ltotexp suppins totchr age female white, robust

*Quantile regression and store

foreach q in 0.10 0.25 0.50 0.75 0.90 {

eststo, ti("Q(`q')"): qui qreg ltotexp suppins totchr age female white, q(`q') nolog

}

*Save results in a .tex file which tabulates the coefficients of OLS vs. QR

esttab using 82303ht.tex, replace nonum nodep mti drop(_cons) ti("Models of log total medical expenditure via OLS and QR")


*process this file in the “Texmaker” editor to convert the table to pdf

* the 82303ht.tex file needs to be altered in the “Texmaker” program (and saved as 82303ht_2.tex) as follows
…content from 82303ht.tex file
*to convert this table to pdf in “Texmaker”, choose “ViewPDF” and press “Quick Build”. This will save
the table as a pdf, which can be converted to Word in Adobe Acrobat Pro DC


Table 1: Models of log total medical expenditure via OLS and QR

          OLS        Q(0.10)    Q(0.25)    Q(0.50)    Q(0.75)    Q(0.90)
suppins   0.257***   0.396***   0.386***   0.277***   0.149*     -0.0143
          (5.44)     (4.90)     (6.64)     (5.17)     (2.44)     (-0.16)
totchr    0.445***   0.539***   0.459***   0.394***   0.374***   0.358***
          (25.56)    (17.67)    (20.93)    (19.47)    (16.18)    (10.36)
age       0.0127***  0.0193**   0.0155***  0.0149***  0.0183***  0.00592
          (3.52)     (3.08)     (3.45)     (3.58)     (3.86)     (0.84)
female    -0.0765    -0.0127    -0.0161    -0.0881    -0.122*    -0.158
          (-1.65)    (-0.16)    (-0.28)    (-1.66)    (-2.01)    (-1.74)
white     0.318*     0.0734     0.338      0.499**    0.193      0.305
          (2.34)     (0.30)     (1.91)     (3.06)     (1.04)     (1.10)
N         2955       2955       2955       2955       2955       2955
t statistics in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001

“From Table 1 we see that the effect of supplementary insurance differs considerably, having
a strong effect on expenditures at lower quantiles. The median estimate is similar to the OLS
point estimate. For the health status variable, the effects are much stronger at lower
quantiles, with the OLS effect quite far from the median estimate.”
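
Because the outcome is the log of expenditure, a coefficient b can be read as a proportional change of exp(b) - 1 in the corresponding quantile of expenditure for a one-unit change in the covariate. This reading is an interpretation added here rather than part of the original analysis, and for the OLS column it is only an approximation. The sketch applies it to the suppins row of Table 1. The function name `percent_change` is illustrative.

```python
import math

# suppins coefficients copied from Table 1 (outcome: log total expenditure).
suppins = {
    "OLS": 0.257,
    "Q(0.10)": 0.396,
    "Q(0.25)": 0.386,
    "Q(0.50)": 0.277,
    "Q(0.75)": 0.149,
    "Q(0.90)": -0.0143,
}

def percent_change(coef):
    """Percentage change in expenditure implied by a log-scale coefficient."""
    return 100.0 * (math.exp(coef) - 1.0)

effects = {model: percent_change(b) for model, b in suppins.items()}
for model, pct in effects.items():
    print(f"{model:8s} {pct:6.1f}%")
```

On this reading, supplementary insurance is associated with expenditure that is almost 50% higher at the 10th percentile, but with essentially no difference at the 90th percentile.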

*test for the equality of the coefficient estimates across quantiles
qui sqreg ltotexp suppins totchr age female white, nolog q(0.1 0.25 0.5 0.75 0.9)
test [q25=q50=q75]: suppins

( 1) [q25]suppins - [q50]suppins = 0
( 2) [q25]suppins - [q75]suppins = 0

F( 2, 2949) = 6.38
Prob > F = 0.0017

test [q25=q50=q75]: totchr

( 1) [q25]totchr - [q50]totchr = 0
( 2) [q25]totchr - [q75]totchr = 0

F( 2, 2949) = 7.19
Prob > F = 0.0008

The tests reject the equality of the regression coefficients across the three quartiles in each case,
i.e. for both “suppins” and “totchr”.

*Using Azevedo’s routine grqreg, available from SSC, we can view how each covariate’s effects vary
across quantiles, and contrast them with the (fixed) OLS estimates:

qreg ltotexp suppins totchr age female white, q(.50) nolog

Median regression                                    Number of obs =   2,955
  Raw sum of deviations  1555.48 (about 8.111928)
  Min sum of deviations 1398.492                     Pseudo R2     =  0.1009

ltotexp        Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
suppins     .2769771    .0535936    5.17    0.000    .1718924    .3820617
totchr      .3942664    .0202472   19.47    0.000    .3545663    .4339664
age         .0148666    .0041479    3.58    0.000    .0067335    .0229996
female     -.0880967    .0532006   -1.66    0.098   -.1924109    .0162175
white       .4987457    .1630984    3.06    0.002    .1789474     .818544
_cons       5.648891     .341166   16.56    0.000    4.979943    6.317838

grqreg, cons ci ols olsci reps(100)

“Figure 4 illustrates how the effects of private insurance and health status (number of chronic
problems) vary over quantiles, and how the magnitude of the effects (i.e. regression coefficients) at
various quantiles differ considerably from the OLS coefficient, even in terms of the confidence
intervals around each coefficient.”

Figure 4. Quantile regression coefficients (& 90% CIs) compared with OLS coefficients (& 90% CI) for
each quantile.

Koenker, R., and Bassett, G. W. (1978). “Regression Quantiles.” Econometrica 46:33–50.
Christopher Baum, 2013

