regression from various statistical views, and Garthwaite [19] attempted a statistical interpretation of PLS algorithms. Naik and Tsai [30] demonstrated that PLS regression provides a consistent estimator of the central subspace [7, 8] when the distribution of the response given the predictors follows a single-index model and n → ∞ with p fixed. Delaigle and Hall [16] extended it to functional data. Cook, Helland and Su [12] established a population connection between PLS regression and envelopes [14] in the context of multivariate linear regression, provided the first firm PLS model and showed that envelope estimation leads to root-n consistent estimators whose performance dominates that of PLS in traditional fixed-p contexts.

PLS regression also has a substantial following outside of the Chemometrics and Statistics communities. Boulesteix and Strimmer [3] studied the advantages of PLS regression for the analysis of high-dimensional genomic data, and Nguyen and Rocke [31, 32] proposed it for microarray-based classification. Worsley [36] considered PLS regression for the analysis of data from PET and fMRI studies. Application of PLS for the analysis of spatiotemporal data was proposed by Lobaugh et al. [27], and Schwartz et al. [33] used PLS in image analysis. Because of these and many other applications, it seems clear that PLS regression is widely used across the applied sciences. All subsequent references to PLS in this article should be understood to mean PLS regression.

In view of the apparent success that PLS has had in Chemometrics and elsewhere, we might anticipate that it has reasonable statistical properties in high-dimensional regression. However, the algorithmic nature of PLS evidently made it difficult to study using traditional statistical measures, with the consequence that PLS was long regarded as a technique that is useful, but whose core statistical properties are elusive. Chun and Keleş [6] provided a piece of the puzzle by showing that, within a certain modeling framework, the PLS estimator of the coefficient vector in linear regression is inconsistent unless p/n → 0. They then used this as motivation for their development of a sparse version of PLS. The Chun–Keleş result poses a dilemma. On the one hand, decades of experience support PLS as a useful method; on the other, its inconsistency when p/n → c > 0 casts doubt on its usefulness in high-dimensional regression, which is one of the contexts in which PLS undeniably stands out by virtue of its widespread application. There are several possible explanations for this conflict, including (a) consistency does not always signal the value of a method in practice, (b) the Chemometrics literature is largely wrong about the value of PLS and (c) the modeling construct used by Chun and Keleş does not reflect the range of applications in which PLS is employed.

Cook and Forzani [9] studied single-component PLS regressions and found that in some reasonable settings PLS predictions can converge at the root-n rate as n, p → ∞, regardless of the alignment between n and p, a result that stands in contrast to the finding of Chun and Keleş [6]. Single-component regressions do occur in practice, but our impression is that multiple-component regressions are the rule. Recent studies that used multiple PLS components include studies of seasonal
AOS imspdf v.2018/02/08 Prn:2018/02/28; 14:33 F:aos1681.tex; (G) p. 3
PLS PREDICTION 3
streamflow forecasting [1], Italian craft beer [2], the metabolomics of meat exudate [5], the prediction of biogas yield [24], quantification in bioprocesses [25] and the Japanese honeysuckle [26].

In this article, we follow the general setup of Cook and Forzani [9] and use traditional (n, p)-asymptotic arguments to provide insights into PLS predictions in multiple-component regressions. We also give bounds on the rates of convergence for PLS predictions as n, p → ∞, and in doing so we conjecture about the value of PLS in various regression scenarios. Section 2 contains a review of PLS regression, along with comments on its connection to envelopes and sufficient dimension reduction. The specific objective of our study is described in Section 3. In Section 4, we introduce and provide intuition for various quantities that influence the (n, p)-asymptotic behavior of PLS predictions. Our main results are given as two theorems in Section 5. There we also describe connections with the results of Cook and Forzani [9] for single-component regressions and offer a different view of the Chun–Keleş result [6]. Supporting simulations and an illustrative data analysis are given in Section 6. We focus solely on predictive consistency until Section 7.1, where we address estimative consistency. Proofs and other supporting material are given in an online supplement to this article [10].

Our results show that there is a range of regression scenarios where PLS predictions have the usual root-n convergence rate, even when n ≪ p, and an even wider range where the rate is slower but may still produce practically useful results, the Chun–Keleş result notwithstanding.
2. PLS review. There are several different PLS algorithms for the multivariate (multi-response) linear regression of r responses on p predictors. These algorithms may not be presented as model-based, but instead are often regarded as methods for prediction. It is known that they give the same result for univariate responses but give distinct sample results for multivariate responses. We restrict attention to univariate regression so that the methodology is clear. See Section 7.2 for further discussion related to this choice.

The context for our study is the typical linear regression model with univariate response y and random predictor vector X ∈ R^p,

(2.1)    y = μ + β^T (X − E(X)) + ε,

where the regression coefficients β ∈ R^p are unknown, and the error ε has mean 0, variance τ² and is independent of X. We assume that (y, X) follows a nonsingular multivariate normal distribution and that the data (y_i, X_i), i = 1, . . . , n, arise as independent copies of (y, X). We use the normality assumption to facilitate asymptotic calculations and to connect with the results of Chun and Keleş [6]; nevertheless, simulations and experience in practice indicate that it is not essential for the methodology itself. Further discussion of this assumption is given in Section 7.4. To avoid trivial cases, we assume throughout that β ≠ 0.
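For concreteness, data from model (2.1) with E(X) = 0 can be generated as follows. This is a minimal Python sketch; the function name and the particular Σ, β and sizes used to exercise it are ours, chosen only for illustration:

```python
import numpy as np

def simulate(n, p, beta, Sigma, mu=0.0, tau=1.0, rng=None):
    """Draw n independent copies of (y, X) from model (2.1), taking E(X) = 0."""
    rng = rng if rng is not None else np.random.default_rng()
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # X ~ N_p(0, Sigma)
    eps = tau * rng.standard_normal(n)                       # error: mean 0, sd tau
    y = mu + X @ beta + eps
    return y, X
```

A quick sanity check on any such implementation: under the model, var(y) = β^T Σ β + τ².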
w_{k+1} = Q_{S_k} σ / (σ^T Q_{S_k} σ)^{1/2},
W_{k+1} = (w_0, . . . , w_k, w_{k+1}).

At termination, span(W_d) is a dimension reduction subspace H. Since d is assumed to be known and effectively fixed, SIMPLS depends on only two population quantities—σ and Σ—that must be estimated. The sample version of SIMPLS is constructed by replacing σ and Σ by their sample counterparts and terminating after d steps, even if Σ̂ is singular. In particular, SIMPLS does not make use of Σ̂^{−1} and so does not require Σ̂ to be nonsingular, but it does require d ≤ min(p, n − 1). If d = p, then span(W_p) = R^p and PLS reduces to the ordinary least squares estimator. Let G = (σ, Σσ, . . . , Σ^{d−1}σ) and Ĝ = (σ̂, Σ̂σ̂, . . . , Σ̂^{d−1}σ̂) denote the population and sample Krylov matrices. Helland [21] showed that span(G) = span(W_d), giving a closed-form expression for a basis of the population PLS subspace, and that the sample version of the SIMPLS algorithm gives span(Ĝ).

PLS can be seen as an envelope method as follows [12]. A subspace R ⊆ R^p is a reducing subspace of Σ if R decomposes Σ = P_R Σ P_R + Q_R Σ Q_R, and then we say that R reduces Σ. The intersection of all reducing subspaces of Σ that contain a specified subspace S ⊆ R^p is called the Σ-envelope of S and denoted E_Σ(S). Let P_k denote the projection onto the kth eigenspace of Σ, k = 1, . . . , q ≤ p. Then the Σ-envelope of S can be constructed by projecting onto the eigenspaces of Σ [14]: E_Σ(S) = ∑_{k=1}^{q} P_k S. Cook et al. [12] showed that the population SIMPLS algorithm produces E_Σ(B), the Σ-envelope of B := span(β), so H = span(W_d) = span(G) = E_Σ(B).
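Helland's Krylov characterization gives a direct numerical recipe for the population PLS coefficient vector, β = G(G^T Σ G)^{−1} G^T σ. The following Python sketch (our code, not from the paper) builds G and evaluates this expression:

```python
import numpy as np

def krylov_basis(Sigma, sigma, d):
    """Columns sigma, Sigma sigma, ..., Sigma^{d-1} sigma of the Krylov matrix G."""
    cols = [sigma]
    for _ in range(d - 1):
        cols.append(Sigma @ cols[-1])
    return np.column_stack(cols)

def pls_population_beta(Sigma, sigma, d):
    """Population PLS coefficients: beta = G (G^T Sigma G)^{-1} G^T sigma."""
    G = krylov_basis(Sigma, sigma, d)
    M = G.T @ Sigma @ G
    return G @ np.linalg.solve(M, G.T @ sigma)
```

With d = p this reduces to the ordinary least squares solution Σ^{−1}σ, matching the remark above. Krylov columns can be badly conditioned in practice, so this is a conceptual check rather than a numerical recipe.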
From this point, we use H ∈ R^{p×d} to denote any semi-orthogonal basis matrix for E_Σ(B) and let (H, H_0) ∈ R^{p×p} denote an orthogonal matrix. The connection with envelopes led Cook et al. [12] to the following envelope model for PLS:

(2.3)    y = μ + β_{y|H^T X}^T H^T (X − E(X)) + ε,
         Σ = H Σ_H H^T + H_0 Σ_{H_0} H_0^T,

where

Σ_H = var(H^T X) = H^T Σ H ∈ R^{d×d},
Σ_{H_0} = var(H_0^T X) = H_0^T Σ H_0 ∈ R^{(p−d)×(p−d)},

and β_{y|H^T X} can be interpreted as the coordinates of β relative to the basis H. In terms of the parameters in model (2.1), this model makes use of the basis H of E_Σ(B) to achieve a parsimonious re-parameterization of β and Σ: Σ is as given in the model and

(2.4)    β = P_{H(Σ)} β = H β_{y|H^T X} = H (H^T Σ H)^{−1} H^T σ = G (G^T Σ G)^{−1} G^T σ,

where the last step follows because, as noted previously, E_Σ(B) = span(H) = span(G). This re-parameterization has no impact on the predictors or the error
where the weights w_i = σ^T P_{h_i} σ / σ^T P_H σ, P_{h_i} denotes the projection onto span(h_i) and ∑_{i=1}^{d} w_i = 1. Consequently, if the w_i are bounded away from 0 and if many predictors are correlated with y so that ‖σ‖² → ∞, then the eigenvalues of Σ_H must diverge to ensure that β^T Σ β remains bounded. We could in this case take η(p) = ‖σ‖².

Suppose that the first k eigenvalues λ_i, i = 1, . . . , k, diverge with p, that λ_i ≍ λ_j, i, j = 1, . . . , k, and that the remaining d − k eigenvalues are of lower order, λ_j = o(λ_i), i = 1, . . . , k, j = k + 1, . . . , d. Then if ‖σ‖² ≍ λ_i, i = 1, . . . , k, we must have w_i → 0 for i = k + 1, . . . , d for β^T Σ β to remain bounded.

It is also possible that the eigenvalues λ_i are bounded. This happens in sparse regressions when only d predictors are relevant. For instance, if H = (I_d, 0)^T, then Σ_H is the dth-order leading principal submatrix of Σ, and thus it is fixed with bounded eigenvalues. Bounded eigenvalues are possible also when many predictors are related weakly to the response, so that ‖σ‖ is bounded. If the eigenvalues λ_i are bounded, then η ≍ 1.

From the discussion so far, we see that κ, being the trace of a (p − d) × (p − d) positive definite matrix, would normally be at least of order p, but might have a larger order. The measure η, being the trace of a d × d matrix, will in practice have order at most p and can achieve that order in abundant regressions where ‖σ‖² ≍ p. We can contrive cases where p = o(η), but they seem impractical. For these reasons, we limit our consideration to regressions in which η = O(κ).
The measures κ and η are frequently joined naturally in our asymptotic expansions into the combined measure

(4.5)    φ(n, p) = κ(p) / (n η(p)).

As will be seen later, a good scenario for prediction occurs when φ(n, p) → 0 as n, p → ∞. This implies a synergy between the signal η and the sample size n, with the product nη being required to dominate the variation of the noise in X as measured by κ. This is similar to the signal rate found by Cook, Forzani and Rothman [11] in their study of abundant high-dimensional linear regression. We typically drop the arguments (n, p) when referring to φ(n, p).
4.3. Coefficients β_{y|H^T X}. The coefficients for the regression of y on the reduced predictors H^T X can be represented as β_{y|H^T X} = Σ_H^{−1} σ_H, where σ_H = H^T σ ∈ R^{d×1}. Population predictions based on the reduced predictor involve the product β_{y|H^T X}^T H^T X. If var(H^T X) = Σ_H diverges along certain directions, then the corresponding parts of β_{y|H^T X} must converge to 0 to balance the increases in H^T X, or otherwise the form β_{y|H^T X}^T H^T X will not make sense asymptotically. This essential behavior can be seen also from

var(β_{y|H^T X}^T H^T X) = β_{y|H^T X}^T Σ_H β_{y|H^T X} = σ_H^T Σ_H^{−1} σ_H = β^T Σ β.
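The chain of equalities above is easy to verify numerically in a toy sparse setting, with H = (I_d, 0)^T and a block-diagonal Σ of our own construction (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 6, 2

# Block-diagonal Sigma so that H = (I_d, 0)^T spans a reducing subspace of Sigma
A = rng.standard_normal((d, d))
A = A @ A.T + d * np.eye(d)
B = rng.standard_normal((p - d, p - d))
B = B @ B.T + (p - d) * np.eye(p - d)
Sigma = np.block([[A, np.zeros((d, p - d))], [np.zeros((p - d, d)), B]])

H = np.vstack([np.eye(d), np.zeros((p - d, d))])   # semi-orthogonal basis matrix
sigma = H @ rng.standard_normal(d)                 # cov(X, y) lies in span(H)

beta = np.linalg.solve(Sigma, sigma)               # full p-dimensional coefficients
Sigma_H = H.T @ Sigma @ H
sigma_H = H.T @ sigma
beta_red = np.linalg.solve(Sigma_H, sigma_H)       # beta_{y|H^T X} = Sigma_H^{-1} sigma_H

lhs = beta_red @ Sigma_H @ beta_red                # var(beta_{y|H^T X}^T H^T X)
mid = sigma_H @ np.linalg.solve(Sigma_H, sigma_H)  # sigma_H^T Sigma_H^{-1} sigma_H
rhs = beta @ Sigma @ beta                          # beta^T Sigma beta
```

All three quantities agree to machine precision, illustrating that the variance of the reduced-predictor fit equals β^T Σ β.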
We see as a consequence of this structure that both σ̄ and ‖c_p‖ must be bounded and that w_1 → 1 and w_2 → 0. Additionally, with η = p and σ̄_∞ = lim_{p→∞} σ̄,

a_H = lim_{p→∞} (√p σ̄/√η, ‖c_p‖/√η)^T = (σ̄_∞, 0)^T,
V = lim_{p→∞} diag((1 − ψ + pψ), (1 − ψ))/p = diag(ψ, 0),
G_∞ = [ σ̄_∞  ψσ̄_∞ ; 0  0 ].

In this case, V and G_∞ both have rank 1, and so by Proposition 1 tr(C^{−1}) is unbounded as p → ∞.

To find an order for tr(C^{−1}), we have

Σσ = σ̄ b(p, ψ) 1_p + (1 − ψ) c_p,
G = (σ̄ 1_p + c_p, σ̄ b(p, ψ) 1_p + (1 − ψ) c_p),
G^T Σ G = [ σ̄²pb(p, ψ) + (1 − ψ)‖c_p‖²    σ̄²pb²(p, ψ) + (1 − ψ)²‖c_p‖² ;
            σ̄²pb²(p, ψ) + (1 − ψ)²‖c_p‖²   σ̄²pb³(p, ψ) + (1 − ψ)³‖c_p‖² ],

where b(p, ψ) = 1 − ψ + pψ. From this, it can be verified that tr(C^{−1}) ≍ p², so ρ = p². The behavior of tr(C^{−1}) in this example is due to the different orders of magnitude of the eigenvalues of Σ_H, λ_1 ≍ p and λ_2 ≍ 1. As will be seen later in Theorems 1 and 2, a consequence of this structure is that we would need sample size n ≫ p⁴ to keep the direction in span(1_p) from swamping the direction in span^⊥(1_p).
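The entries of G^T Σ G above can be checked numerically. In this sketch we take the compound-symmetry structure to be Σ = (1 − ψ)I_p + ψ 1_p 1_p^T — our reading of (4.7), consistent with Σ1_p = b(p, ψ)1_p — and compare against the displayed closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
p, psi, sbar = 40, 0.8, 1.3

ones = np.ones(p)
c = rng.standard_normal(p)
c -= c.mean()                      # contrast vector: 1_p^T c_p = 0
sigma = sbar * ones + c            # sigma = sbar 1_p + c_p

# Compound symmetry (our reading): Sigma = (1 - psi) I_p + psi 1_p 1_p^T
Sigma = (1 - psi) * np.eye(p) + psi * np.outer(ones, ones)
b = 1 - psi + p * psi              # b(p, psi): eigenvalue of Sigma along 1_p

G = np.column_stack([sigma, Sigma @ sigma])   # Krylov matrix with d = 2
M = G.T @ Sigma @ G

c2 = c @ c                         # ||c_p||^2
expected = np.array([
    [sbar**2 * p * b + (1 - psi) * c2,        sbar**2 * p * b**2 + (1 - psi)**2 * c2],
    [sbar**2 * p * b**2 + (1 - psi)**2 * c2,  sbar**2 * p * b**3 + (1 - psi)**3 * c2],
])
```

The computed M and the closed-form `expected` agree, entry by entry, for any p and ψ.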
4.7. Universal conditions. Before discussing asymptotic results in the next section, we summarize the conditions that we assume throughout this article. We require that:

C1. Model (2.1) holds, where (y, X) follows a nonsingular multivariate normal distribution and the data (y_i, X_i), i = 1, . . . , n, arise as independent copies of (y, X). To avoid the trivial case, we assume that the coefficient vector β ≠ 0, which implies that the dimension of the envelope d ≥ 1. We also assume that the error standard deviation τ is bounded away from 0 as p → ∞.
C2. φ → 0 and ρ/√n → 0 as n, p → ∞, where φ and ρ are defined at (4.5) and (4.6).
C3. η = O(κ) as p → ∞, where η ≥ 1, and η and κ are defined at (4.2) and (4.3).
C4. The dimension d of the envelope is known and constant for all finite p.
C5. Σ > 0 for all finite p. This restriction allows Σ̂ to be singular, which is a scenario PLS was designed to handle. We do not require as a universal condition that the eigenvalues of Σ be bounded as p → ∞.

Additional conditions will be needed for various results.
FIG. 1. Simulation results from the isotropic model (5.2): listing from the top at log₂(p) = 6, the lines correspond to η equal to a constant, p^{1/2}, p^{3/4} and p.
for each Θ we can take the corresponding η = p^a, 0 ≤ a ≤ 1. It follows from the discussion of Section 5.2 and from the above calculations that ρ ≍ 1. Since κ ≍ p, the asymptotic behavior of the simulation is governed by Theorem 2.I, giving D_N = O_p(√φ) with φ = p/(nη). A data set of size n = p/2 was obtained by using n independently generated observations on (y, ν^T) and ω in model (5.2) to obtain n independent observations on X. Then n additional observations on X were generated, and D_N² was computed for each and averaged. The vertical axis D̂² of Figure 1 is the average over 100 replications of this whole process. Reading from top to bottom at log₂(p) = 6, the lines in Figure 1 correspond to η equal to a constant, p^{1/2}, p^{3/4} and p. Since n = p/2, we have φ = 2/η. Thus, in reference to Figure 1, our theoretical results predict convergence of the curves for η equal to p^{1/2}, p^{3/4} and p, but no convergence for η equal to a constant. The curves shown in Figure 1 seem to support this prediction, with the best results being achieved for η = p, followed by η = p^{3/4} and η = p^{1/2}.

Figure 2 was constructed like Figure 1, except diag(Θ^T Θ) = (p^a, p^a), 0 < a ≤ 1, and diag(Θ^T Θ) = (c, c), where c is constant. This seemingly small change has the potential to have a big impact on the results because now the eigenvalues of V are no longer distinct and R_1 = 1, with the consequence that ρ may slow the rate of convergence, as indicated in Theorem 2. Indeed, the results in Figure 2 seem uniformly worse than those in Figure 1. While it seems clear that the curve for η = p is convergent, it is not clear whether the curves for η = p^{1/2} or η = p^{3/4} are.

The influence of U on the results of this example is controlled largely by the correlations c_{yν} = cov(y, ν)/var^{1/2}(y) between y and the elements of ν. The condition var(ν) = I₂ was imposed without loss of generality since we can always achieve it by rescaling. In Figures 1 and 2, c_{yν} = (0.4, 0.4). If we had set the correlations to be larger, say c_{yν} = (0.8, 0.8), D̂² would have decreased faster as a
FIG. 2. Simulation results from the isotropic model (5.2): listing from the top at log₂(p) = 12, the lines correspond to η equal to a constant, p^{1/2}, p^{3/4} and p.
function of p. If we had set the correlations to be weaker, say c_{yν} = (0.2, 0.2), D̂² would have decreased more slowly. In either case, however, the general conclusions from Figures 1 and 2 would still be discernible. We selected correlations of 0.4 because we felt that they represent modest correlations that illustrate the theory nicely without giving an optimistic impression, as might happen if we had used large correlations.

6.1.2. Compound symmetry (4.7). For this simulation, we used model (2.1) with the compound symmetry structure (4.7) for Σ, constructed with σ = 1_p + c_p, σ = 1_p + 0.5c_p and σ = 1_p, and in each case ψ = 0.8. With σ, ψ and p set, we generated a single observation on X ∼ N_p(0, Σ) and then generated the corresponding y according to model (2.1) with error standard deviation τ = 1. This process was repeated n = p/2 times to get β̂. Then n additional observations on X were generated, and D_N² was computed for each and averaged. The vertical axis D̂² of Figure 3 is the average over 100 replications of this whole process.

In this simulation, we have κ ≍ p, η ≍ κ and tr(Σ_{H_0}) ≍ κ. It follows that Theorem 2.II is applicable for σ = 1_p + c_p and σ = 1_p + 0.5c_p, giving, from the discussion in Section 4.6, D_N = O_p(p²/√n). Since we used n = p/2, we do not expect convergence, which seems consistent with the results shown in Figure 3. Theorem 2.III applies for σ = 1_p since then d = 1. In that case, D_N = O_p(p^{−1/2}), which again seems consistent with the results of Figure 3.
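As a companion to this description, here is a minimal Python sketch of the σ = 1_p case with n = p/2, reading the compound-symmetry structure (4.7) as Σ = (1 − ψ)I_p + ψ 1_p 1_p^T and using the d = 1 plug-in Krylov (sample PLS) estimator; the specific p is illustrative, and the code is ours rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200
n = p // 2            # n = p/2, as in the simulation design above
psi, tau = 0.8, 1.0

# Our reading of compound symmetry (4.7): Sigma 1_p = b(p, psi) 1_p
Sigma = (1 - psi) * np.eye(p) + psi * np.ones((p, p))
sigma = np.ones(p)                      # the sigma = 1_p case, so d = 1
beta = np.linalg.solve(Sigma, sigma)    # population coefficients with cov(X, y) = sigma

def draw(m):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
    y = X @ beta + tau * rng.standard_normal(m)   # model (2.1) with mu = 0, E(X) = 0
    return X, y

Xtr, ytr = draw(n)
Xc = Xtr - Xtr.mean(0)
sig_hat = Xc.T @ (ytr - ytr.mean()) / n           # sample cov(X, y)
Sig_hat = Xc.T @ Xc / n                           # sample cov(X)

# d = 1 plug-in Krylov (sample PLS): beta_hat = sigma_hat (s^T s)/(s^T Sigma_hat s)
beta_hat = sig_hat * (sig_hat @ sig_hat) / (sig_hat @ Sig_hat @ sig_hat)

Xte, yte = draw(n)                                # fresh observations for prediction
mse_pls = np.mean((yte - ytr.mean() - (Xte - Xtr.mean(0)) @ beta_hat) ** 2)
mse_null = np.mean((yte - ytr.mean()) ** 2)       # predict with the training mean alone
```

With d = 1 and σ = 1_p, the text's rate D_N = O_p(p^{−1/2}) suggests the PLS predictor should improve on the null predictor ȳ even though n < p, and that is what this sketch exhibits.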
6.2. Tetracycline data. Goicoechea and Olivieri [20] used PLS to develop a predictor of tetracycline concentration in human blood. The 50 training samples were constructed by spiking blank sera with various amounts of tetracycline in the range 0–4 µg mL⁻¹. A validation set of 57 samples was constructed in the same way. For each sample, the values of the predictors were determined by measuring
FIG. 3. Simulation results using compound symmetry (4.7). Reading from top to bottom, the lines correspond to σ = 1_p + c_p, σ = 1_p + 0.5c_p and σ = 1_p.
fluorescence intensity at p = 101 equally spaced points in the range 450–550 nm. The authors determined using leave-one-out cross-validation that the best predictions of the training data were obtained with d = 4 linear combinations of the original 101 predictors.

We use these data to illustrate the behavior of PLS predictions in Chemometrics as the number of predictors increases. We used PLS with d = 4 to predict the validation data based on p equally spaced spectra, with p ranging between 10 and 101. The root mean squared error (MSE) is shown in Figure 4 for five values of p. PLS fits were obtained using the pls package in R. We see a relatively steep drop in MSE for small p, say less than 30, and a slow but steady decrease
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
F IG . 4. Tetracycline data: Validation MSE from 10, 20, 33, 50 and 101 equally spaced spectra.
43 43
in MSE thereafter. Since we are dealing with actual prediction, the root-MSE will not converge to 0 with increasing p, as it seems to do in some of the simulations.

7. Discussion. In this section, we give results on the convergence of β̂ and describe our rationale for some of the restrictions that we imposed.

7.1. Convergence of β̂. The focus of this article has been on the rate of convergence of predictions as measured by D_N. In this section, we consider for completeness the rate of convergence of β̂ in the Σ inner product. Let

V_{n,p} = var^{1/2}(D_N | β̂) = ((β̂ − β)^T Σ (β̂ − β))^{1/2}.

Then, as shown in Appendix Section S8, V_{n,p} and D_N have the same order as n, p → ∞. To be clear, we state this in the following theorem.

THEOREM 3. As n, p → ∞,

I. Under the conditions of Theorem 1,
V_{n,p} = O_p(ρ/√n) + O_p(ρ^{1/2} n^{−1/2} (κ/η)^d).

II. Under the conditions of Theorem 2,
V_{n,p} = O_p(ρ/√n) + O_p(√(ρφ)).

It follows from this theorem that the special cases of Theorems 1 and 2 and the subsequent discussions apply to V_{n,p} as well. In particular, estimative convergence as measured in the Σ inner product will be at or near the root-n rate under the same conditions as predictive convergence.
7.2. Multivariate Y. Recall that we confined our study to regressions with a univariate response. An extension to multivariate Y ∈ R^r seems elusive because there are numerous PLS algorithms for multivariate Y and they can all produce different results. The two most common algorithms, NIPALS and SIMPLS, are known to produce different results when r > 1 but give the same results when r = 1 [12, 15, 34]. The multivariate version of the Krylov construction Ĝ provides another PLS algorithm. Some prefer to standardize the elements of Y to have sample variance equal to 1, while others do not standardize. Some PLS algorithms reduce Y and X simultaneously, while others reduce X alone. These various algorithms can produce different results when r > 1 but produce the same or equivalent results when r = 1. It seems to us that any extension to allow for a multivariate response would first need to address the multiplicity of methods, which is outside the scope of this report.
7.3. Choice of the dimension, d. We assumed throughout this article that the dimension d of the envelope is effectively fixed and known, as did Chun and Keleş [6]. In practice, d will not normally be known, so a data-dependent estimate d_{n,p} will often be used in its stead. If d_{n,p} > d, the (nonasymptotic) results of a PLS analysis will still be based on a true model, albeit one with more variation than necessary. If d_{n,p} < d, then PLS will incur some bias in estimation. The bias can be sizable if d_{n,p} is substantially less than d, an event that we judge to be unlikely because the far values of d_{n,p} should be ruled out by standard PLS methodology. Extensions of the asymptotic results of this article that allow for using d_{n,p} instead of d will depend on the rate at which d_{n,p} converges to d. If that rate is sufficiently fast, then the results of this article will still hold. Otherwise, the rates presented here will be optimistic. We chose to assume d known so that the results might reflect the core behavior of PLS while keeping an important link with the work of Chun and Keleş [6]. This view avoided the task of studying selection methods, which is outside the scope of this article but still an important next step. Eck and Cook [17] proposed an estimator of β as a weighted average of the envelope estimators over the possible dimensions of the envelope, the weights being functions of the Bayes information criterion for each envelope model. This weighted estimator avoids the need to estimate the dimension and might be adaptable for asymptotic studies of PLS.

Another desirable extension is to allow d → ∞ as p → ∞. In such a case, we expect PLS to still yield consistent results provided d grows at a rate that is sufficiently slow relative to p.

7.4. Importance of normality. As mentioned previously, simulations and our experience suggest that normality is not an essential assumption in practice, particularly if a holdout sample is used to assess performance of the final predictive model. Theoretically, we expect that our asymptotic results are indicative for sub-normal variables, but may not be so for sur-normals, depending on the tail behavior. We relied extensively on the behavior of higher-order moments of normals. Extending these results to classes of distributions would require bounds that would likely be quite loose for normals. Assuming normality allowed us to get relatively sharp bounds, which we feel is useful for a first look at PLS asymptotics. The same normality was used also by Naik and Tsai [30] in their asymptotic study of the fixed-p case and by Chun and Keleş [6] for the case in which p and n both diverge.

7.5. Impact of the results. Our asymptotic results are intended to provide a qualitative understanding of various plausible PLS scenarios. For instance, if it is thought that nearly all predictors contribute information about the response, so η ≍ p, then we may have D_N = O_p(n^{−1/2}) without regard to the relationship between n and p. At the other extreme, if the regression is viewed as likely sparse, so η ≍ 1, then we may have D_N = O_p((p/n)^{1/2}) and we now need n to be large relative
[15] De Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 18 251–263. DOI:10.1016/0169-7439(93)85002-X.
[16] Delaigle, A. and Hall, P. (2012). Methodology and theory for partial least squares applied to functional data. Ann. Statist. 40 322–352. MR3014309
[17] Eck, D. J. and Cook, R. D. (2017). Weighted envelope estimation to handle variability in model selection. Biometrika 104 743–749. MR3694595
[18] Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics 35 102–246. DOI:10.1080/00401706.1993.10485033.
[19] Garthwaite, P. H. (1994). An interpretation of partial least squares. J. Amer. Statist. Assoc. 89 122–127. MR1266290
[20] Goicoechea, H. C. and Olivieri, A. C. (1999). Enhanced synchronous spectrofluorometric determination of tetracycline in blood serum by chemometric analysis. Comparison of partial least-squares and hybrid linear analysis calibrations. Anal. Chem. 71 4361–4368.
[21] Helland, I. S. (1990). Partial least squares regression and statistical models. Scand. J. Stat. 17 97–114. MR1085924
[22] Helland, I. S. (1992). Maximum likelihood regression on relevant components. J. Roy. Statist. Soc. Ser. B 54 637–647. MR1160488
[23] Helland, I. S. (2001). Some theoretical aspects of partial least squares regression. Chemom. Intell. Lab. Syst. 58 97–107.
[24] Kandel, T. A., Gislum, R., Jørgensen, U. and Lærke, P. E. (2013). Prediction of biogas yield and its kinetics in reed canary grass using near infrared reflectance spectroscopy and chemometrics. Bioresour. Technol. 146 282–287.
[25] Koch, C., Posch, A. E., Goicoechea, H. C., Herwig, C. and Lendla, B. (2013). Multi-analyte quantification in bioprocesses by Fourier-transform-infrared spectroscopy by partial least squares regression and multivariate curve resolution. Anal. Chim. Acta 807 103–110.
[26] Li, W., Cheng, Z., Wang, Y. and Qu, H. (2013). Quality control of Lonicerae Japonicae Flos using near infrared spectroscopy and chemometrics. J. Pharm. Biomed. Anal. 72 33–39.
[27] Lobaugh, N. J., West, R. and McIntosh, A. R. (2001). Spatiotemporal analysis of experimental differences in event-related potential data with partial least squares. Psychophysiology 38 517–530.
[28] Martens, H. and Næs, T. (1992). Multivariate Calibration. Wiley, Chichester. MR1029523
[29] Næs, T. and Helland, I. S. (1993). Relevant components in regression. Scand. J. Stat. 20 239–250. MR1241390
[30] Naik, P. and Tsai, C.-L. (2000). Partial least squares estimator for single-index models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 62 763–771. MR1796290
[31] Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18 39–50.
[32] Nguyen, D. V. and Rocke, D. M. (2004). On partial least squares dimension reduction for microarray-based classification: A simulation study. Comput. Statist. Data Anal. 46 407–425. MR2067030
[33] Schwartz, R. W., Kembhavi, A., Harwood, D. and Davis, L. S. (2009). Human detection using partial least squares analysis. In 2009 IEEE 12th International Conference on Computer Vision 24–31.
[34] ter Braak, C. J. F. and de Jong, S. (1998). The objective function of partial least squares regression. J. Chemom. 12 41–54.
[35] Wold, S., Martens, H. and Wold, H. (1983). The multivariate calibration problem in chemistry solved by the PLS method. In Proceedings of the Conference on Matrix Pencils (A. Ruhe and B. Kågström, eds.). Lecture Notes in Mathematics 973 286–293. Springer, Heidelberg.
[36] Worsley, K. J. (1997). An overview and some new developments in the statistical analysis of PET and fMRI data. Hum. Brain Mapp. 5 254–258.
School of Statistics
University of Minnesota
313 Ford Hall
224 Church St. SE
Minneapolis, Minnesota 55455
USA
E-mail: dennis@stat.umn.edu

Facultad de Ingeniería Química, UNL
Santiago del Estero 2819
Santa Fe
Argentina
E-mail: liliana.forzani@gmail.com