The document discusses identifying potentially influential subsets in multivariate regression. It describes two strategies - fuzzy clustering and bootstrapping - that can be used to identify influential subsets more efficiently than examining all possible groupings. The fuzzy clustering strategy involves determining cluster membership using nearest neighbor algorithms, fuzzy K-means clustering, and sensitivity analysis including fuzz plots. Bootstrapping can also help identify influential subsets without exhaustive computations. The strategies are demonstrated on a published dataset to show how they compare to existing multivariate diagnostics approaches.
Identifying Potentially Influential Subsets in Multivariate Regression
Bill Seaver, Ph.D.
Arika Blankenship
Department of Statistics, The University of Tennessee, Knoxville

K.P. Triantis, Ph.D.
Department of Industrial and Systems Engineering, Virginia Tech

I. Introduction

One of the challenges faced when performing a regression analysis is identifying influential subsets of observations. When using multiple regression, there are several statistical tools one can use to determine whether or not single-case observations are influential and what causes that influence. These include Cook's index, the Welsch-Kuh statistic, the covariance ratio, internal and external studentized residuals, and the influence on the βs, the regression coefficients. Generally, in order to identify subsets of influential observations in multiple regression, one might examine all possible groupings of pairs, triplets, quadruplets, etc., and then look at the influence statistics for all the different groupings (a short computation at the end of this section shows how quickly these counts grow). In a multivariate regression scenario, Barrett and Ling (1992) did this combinatorial examination with the Rohwer data set (Hossain and Naik, 1989) to identify influential pairs of observations.

There are two possible strategies for multivariate regression that may avoid the combinatorial problems. First, there is the fuzzy clustering strategy, alluded to by Seaver and Triantis (1992) and more fully developed recently (Seaver, Triantis, and Reeves, 1997). By using fuzz plots, a degree-of-membership matrix, and fuzzy indices to identify possible influential observations, only certain observations need to be evaluated in pairs, triplets, quadruplets, etc. Thus, it is not as computationally intensive as Barrett and Ling's approach. The second option, which uses bootstrapping, will not necessarily be computationally shorter, but it will be simpler and less dependent upon finding the appropriate optimal subset combinations.

The purpose of this research is to see whether the fuzzy clustering or bootstrapping strategies will identify potentially influential subsets in multivariate regression, since they have already been demonstrated to work for the multiple regression case (Seaver, Triantis, and Reeves, 1997). The intent of this research is not to say what the cause or criterion of the influence is. Traditionally, there has not been a straightforward way to identify influential observations in multivariate regression. Barrett and Ling (1992) extended some of the univariate ideas to produce multivariate measures, as did Hossain and Naik (1989). These included a multivariate version of Cook's index, a Welsch-Kuh type statistic, and a multivariate covariance ratio, to name a few, all of which are very labor intensive. In addition, after evaluating individual observations with these statistics, one must still look at all possible pairs, triplets, quadruplets, etc. The fuzzy clustering strategy and/or the bootstrapping method used in earlier work should reduce the labor time and should provide a much simpler way to identify influential subsets in multivariate regression.

To demonstrate the methods, the previously published Rohwer data is analyzed (Barrett and Ling, 1992; Hossain and Naik, 1989). The Rohwer data presents some difficulties with collinearity as well as with identifying influential observations and influential subsets. It will serve to illustrate the differences between the fuzzy clustering strategy, bootstrapping, and the multivariate regression diagnostics approach.
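To make the scale of the combinatorial burden concrete, the following quick computation (a sketch in Python; n = 32 is the Rohwer sample size used later in this paper) counts the candidate subsets an exhaustive search would face:

```python
from math import comb

# Number of candidate subsets of each size that an exhaustive
# influence analysis must evaluate for n observations.
n = 32  # size of the Rohwer data set
for k in range(1, 9):
    print(f"subsets of size {k}: {comb(n, k):,}")

# Sizes 1 through 8 alone total more than 15 million subsets,
# which is why a screening strategy is attractive.
print("total, sizes 1-8:", sum(comb(n, k) for k in range(1, 9)))
```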
The next section of this paper discusses the basis for this research, the third section details the methodology used and the findings, and section four provides conclusions and possible extensions of this research.

II. Background

Barrett and Ling (1992) used multivariate measures of influence, partitioned into leverage and residual components, to construct influence plots for identifying influential subsets. An influence plot would be constructed for each combination of observations: single cases, pairs, triplets, and so forth. The influence plots for single cases (Figure 1) and for pairs (Figure 2) are shown below for the partitioning of the multivariate version of Cook's distance into leverage and residual components.

[Figure 1: Influence plots of individual cases (Barrett and Ling, 1992)]

Both of these plots have the log of the leverage component on the Y-axis and the log of the studentized residual component on the X-axis, as the two parts of Cook's distance. The dashed lines represent contours of influence; each successive dashed line represents twice the influence of the line below it. Figure 1 represents individual cases, while Figure 2 illustrates influential pairs. In Figure 1, the single observations 5, 25, 27, and 29 are more influential than the other observations in this data set. In Figure 2, the influential pairs appear to be (5,10) and (5,22). The drawback to the influential-pairs plot is that all possible pairs are plotted, as evidenced by the multitude of dots filling Figure 2. Another drawback is the difficulty in reading the plots. For instance, in Figure 1 for single-case observations, Barrett and Ling state that case {10} and case {20} have nearly identical influence but are quite different in terms of leverage and residual contribution; yet in Figure 1, it appears that observation 10 has twice the influence of observation 20. While the interpretation of some points or subsets in the influence plots can be difficult, the benefit of these plots is the visualization of the influence.

[Figure 2: All possible pairs influence plot (Barrett and Ling, 1992)]

Barrett and Ling also produced formulas for several different multivariate influence measures, such as DFFITS, Covratio, the Andrews-Pregibon statistic, and Cook's distance. Although they did produce some time-saving formulas, one must still know ahead of time which measures one wants to use. Once the multivariate influence measure is chosen, it is partitioned into leverage and residual components. Therefore, there can be an extensive amount of matrix manipulation and combinatorial computation even for small data sets and for the choice of several influence measures, which can be difficult and too time consuming.

The fuzzy clustering strategy for identifying influential subsets consists of three stages (Seaver, Triantis, and Reeves, 1997; Seaver and Triantis, 1992). First, the modal number of clusters is discovered using the matrix H* = Z(Z'Z)^-1 Z', where Z = (X|Y), X is (n x p), and Y is (n x k), as the similarity measure for the Kth nearest-neighbor clustering algorithm (H* is assumed to be a similarity matrix that holds leverage, residual, and other influence information). It is possible that some of these clusters may be unique influential subsets. The details of that procedure are discussed by Seaver, Triantis, and Reeves (submitted).
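The paper does not give code for this step; the following is a minimal sketch of forming H* from the definition above, with randomly generated stand-ins for the (32 x 5) X and (32 x 3) Y matrices of the Rohwer data:

```python
import numpy as np

def h_star(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Similarity matrix H* = Z (Z'Z)^-1 Z' with Z = (X | Y),
    where X is (n x p) and Y is (n x k). H* acts like a hat
    matrix over the joint predictor/response space."""
    Z = np.hstack([X, Y])
    # Solve (Z'Z) B = Z' instead of inverting explicitly; this is
    # numerically safer when the columns of Z are collinear.
    B = np.linalg.solve(Z.T @ Z, Z.T)
    return Z @ B

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))   # stand-in for the 5 independent variables
Y = rng.normal(size=(32, 3))   # stand-in for the 3 dependent variables
H = h_star(X, Y)
print(H.shape, np.allclose(H, H.T))  # (32, 32) True: H* is symmetric
```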
In the second stage, the fuzzy K-means clustering method uses the solution from the Kth nearest-neighbor algorithm as a starting point for assigning a degree of membership to each observation for every cluster. Third, a sensitivity analysis is performed. Seaver, Triantis, and Reeves (1997) used bootstrapping to confirm their fuzzy results and assess the uniqueness of subsets and individual observations in multiple regression. As noted by Seaver et al. (1997): "Observations that are influential would be expected to belong to a cluster or clusters with a high degree of belonging, or they would have a very fuzzy degree of belonging across several clusters. In the latter case, if an observation or observations have a fuzzy degree of belonging across several clusters and if the number of clusters is increased slightly, it is expected that a fuzzy observation or subset would form their own cluster." Thus, the uniqueness of a subset can be evaluated by changing the value of two parameters: the fuzzifier (m) and/or the number of subsets. The evaluation of subset uniqueness is referred to as the sensitivity analysis. This step includes evaluating fuzz plots (a visualization of the degrees of belonging across subsets), analyzing the degree-of-membership matrix, and inspecting the normalized fuzzy indices.

The fuzz plot is constructed with the degree of belonging on the Y-axis and the case number on the X-axis, so the XY space illustrates the cluster belonging of each observation. To identify influential subsets or observations in a fuzz plot, there are three possible patterns. Using the Rohwer data, these three patterns are given in Figures 3, 4, and 5. Figure 3 shows a typical straight-line pattern for the hard solution with six clusters; most degrees of belonging are close to 1, with no subsets standing out.

[Figure 3: Fuzz plot with a fuzzifier of 1.1]
Comparing the almost straight line of Figure 3 to the waterfall pattern of Figure 4, one begins to see possible influential observations, such as 17, 29, 9, and 5.

[Figure 4: Fuzz plot with a fuzzifier of 1.3]
However, too much of a waterfall effect, as seen in Figure 5, does not allow an analyst to easily identify influential observations. In the fuzz plot of Figure 5, observations 5, 29, 25, and 17 stand out since they approach the average degree of membership line. If all of the observations fall to the average degree of membership, no new knowledge about the observations is gained (the average degree of membership is the reciprocal of the number of clusters). Another flag for influential subsets under the fuzzy strategy is a cluster that remains stable throughout the sensitivity analysis; such clusters need to be investigated as candidates for influential subsets.

[Figure 5: Fuzz plot with a fuzzifier of 1.4]
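The exact fuzzy K-means variant used by Seaver, Triantis, and Reeves is not reproduced here; the sketch below uses the standard fuzzy c-means membership update (with fuzzifier m) and draws a fuzz plot from the resulting membership matrix. The names `data`, `centers`, and `u` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def fuzzy_memberships(data, centers, m=1.3):
    """Standard fuzzy c-means degree-of-membership update:
    u[i, j] = 1 / sum_k (d_ij / d_ik)^(2/(m-1)),
    so each row of u sums to 1. The fuzzifier m > 1 controls
    how hard (m near 1) or fuzzy (m large) the solution is."""
    # Euclidean distances from each observation to each cluster center.
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against a zero distance
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

def fuzz_plot(u):
    """Degree of belonging (Y) versus case number (X), one symbol
    per cluster, with the average-membership line 1/k drawn in."""
    n, k = u.shape
    for j in range(k):
        plt.scatter(range(1, n + 1), u[:, j], marker=f"${j + 1}$")
    plt.axhline(1.0 / k, linestyle="--")  # average degree of membership
    plt.xlabel("case number")
    plt.ylabel("degree of belonging")
    plt.show()
```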
In evaluating the degree-of-membership matrix, there are a couple of things to look for. Fuzzy observations whose degrees of belonging are splintered across clusters, or that attain almost equal membership in several clusters during the sensitivity analysis, may form influential subsets. Influential observations may also switch clusters as the fuzzifier changes; such a switch indicates that the particular observation's characteristics are not the same as the other observations' characteristics. For example, observation 5 in the Rohwer data switches clusters. Thus, observation 5 has different characteristics than all the other observations in the data set.

Sometimes, normalized fuzzy indices are useful in determining whether or not the chosen fuzzifier is appropriate for identifying the observations. If either of the indices is close to one or close to zero, then the fuzzifier is not the correct one. One should concentrate on the cluster solution where both Dunn's normalized partition coefficient (FPU) and the normalized average squared error coefficient (DPU) have complementary low values (Seaver et al., 1997). This guideline will help in identifying potentially influential observations or subsets of observations.

The bootstrapping process is a separate approach in this research, although it is used as a verification step in the multiple regression case (Seaver, Triantis, and Reeves, 1997). Using the H* matrix discussed earlier, a case-wise bootstrap of the H* elements is performed to obtain a set of possible influential observations. In this setting, one must decide the number of repetitions to use and how large the sample size needs to be. Repetitions may range anywhere from 1,000 to 10,000, while sample sizes generally range from 5 to 10. Using standardized skewness and kurtosis measures on the means from these samples, one can identify influential observations and possible subsets in a two-dimensional plot. An overview of these standardized skewness and kurtosis measures is given in the Help Menu of the statistical package Number Cruncher Statistical System (NCSS, 1997).

III. Research Methodology and Findings

To compare influential subset identification strategies, a data set with known influential observations and subsets was examined. The Rohwer data set contains three dependent variables and five independent variables: y1 = Peabody Picture Vocabulary Test; y2 = Student Achievement Test; y3 = Raven Progressive Matrices Test; x1 = named; x2 = still; x3 = named still; x4 = named action; and x5 = sentence still. The subjects were 32 randomly selected white, upper-class, residential school children. Previously, both Hossain and Naik and Barrett and Ling identified observation 5 as influential. Hossain and Naik concluded that observation 5 is jointly influential with [independent] variable 1; if [independent] variable 1 is removed from the analysis, observation 5 is not influential (Hossain and Naik, 1989). Barrett and Ling concluded that observation 5 has both the largest influence and the largest leverage. They also decided that the pairs {5, 14} and {5, 25} are both influential but are canceling in their influence (Barrett and Ling, 1992).

The three-stage fuzzy clustering process identified six as the modal number of clusters.
Since this gives only a rough indication of what is needed, a sensitivity analysis was performed to solidify the results. This stage requires one to vary the fuzzifier, the number of clusters, and possibly the number of neighbors considered in the nearest-neighbor method. The fuzzifier was varied from 1.1 to 2.0, the number of clusters from 6 to 8, and the number of neighbors from 2 to 3. The analysis indicates that six is an ideal number of clusters. The fuzz plots for fuzzifier values of 1.1, 1.2, 1.25, 1.3, 1.4, and 1.5 can be found in Appendix A.

In evaluating the fuzz plots, it is possible to see that a fuzzifier of 1.1 produces a hard solution, while a fuzzifier of 1.4 results in 25 percent (8 of 32) of the observations clustering around the average degree of membership. Therefore, two conclusions could be reached. First, the subset of influential points could be as large as 8. Second, it is unlikely that the fuzzifier needs to go beyond 1.4. This latter conclusion is supported by the normalized fuzzy indices given in Table 1 for six clusters.

Fuzzifier     FPU         DPU
1.1        .9871736    .5989974
1.15       .9452504    .9582983
1.2        .8396765    .9237529
1.25       .7413848    .5920081
1.3        .6236388    .6679089
1.4        .4372243    .5653917
1.5        .2981738    .5714289
1.6        .1203189    .6879582
2.0        .0003395    .9799557

Table 1: Normalized Indices

Normalized fuzzy indices measure how hard a fuzzy clustering solution is. Given that u_ij is the degree of belonging of the ith observation to the jth group and k is the number of clusters, Dunn's normalized partition coefficient is

    FPU = (F - 1/k) / (1 - 1/k),  where F = (1/n) sum_i sum_j u_ij^2,

ranging in value from 0 to 1 (Kaufman and Rousseeuw, 1990). This value is based on the square of the degree-of-belonging matrix; thus, it tends to read fuzzier than the other index, DPU. In contrast, the DPU index, the normalized average squared error of a fuzzy solution with respect to the closest hard solution, compares the fuzzy result to a standard (Kaufman and Rousseeuw, 1990). It evaluates how far one has moved from the starting point and tends to be a much more reliable value. In evaluating the results, the FPU steadily decreases from about 1, the hard solution, to zero, the fuzzy solution, as the fuzzifier increases. The normalized fuzzy indices, FPU and DPU, in Table 1 seem to suggest that the ideal fuzzifier may be 1.25 or 1.3.

For the 1.3 fuzzifier and six clusters, observations 5, 17, 9, and 29 fuzz out according to the matrix of belonging, quickly followed by observations 12, 10, 22, and 16. It is interesting to note that observations 2, 14, 25, 29, and 9 are clustered with the dominant influential observation, 5. When one extends the clusters to 7 and 8 for the same fuzzifier value of 1.3, the same data points are still influential candidates. The bottom line is that there could be influential subsets of size 8 to consider, and that is not an easy task. The lack of clarity in subset identification may be due to the collinearity among the independent variables.
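For concreteness, here is one way the two indices could be computed from a membership matrix u; the normalizations follow our reading of Kaufman and Rousseeuw (1990) and should be treated as an assumption rather than the paper's exact formulas.

```python
import numpy as np

def normalized_indices(u):
    """FPU and DPU for an (n x k) membership matrix u.

    FPU: Dunn's partition coefficient F = (1/n) sum u_ij^2,
    rescaled so that 1 = hard solution and 0 = completely fuzzy.
    DPU: average squared error between u and the closest hard
    solution, rescaled to [0, 1] (assumed normalization)."""
    n, k = u.shape
    f = (u ** 2).sum() / n
    fpu = (f - 1.0 / k) / (1.0 - 1.0 / k)
    # Closest hard solution: each case wholly in its best cluster.
    hard = np.zeros_like(u)
    hard[np.arange(n), u.argmax(axis=1)] = 1.0
    d = ((u - hard) ** 2).sum() / n
    dpu = d / (1.0 - 1.0 / k)
    return fpu, dpu
```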
The bootstrapping approach, which takes each row vector h_i* out of the H* matrix, identified four influential observations. Each row of the H* matrix was sampled 1000 times with samples of size 5. The means of the samples were computed, scored, and placed into boxplots; anything deviating strongly from a normal distribution is a potential influential observation. The boxplots in Figure 6 are for observations 1, 5, 10, 16, 27, and 31, where observations 1 and 31 are included to represent non-influential observations.

[Figure 6: Boxplots of select observations (rows 1, 5, 10, 16, 27, and 31)]

The critical bootstrapping guideline in classifying observations as influential or not is the use of standardized skewness and kurtosis measures. Based on these measures, three results could be indicative of influence: extreme skewness and extreme kurtosis together, extreme skewness only, or extreme kurtosis only. However, the scatterplot of the standardized skewness and kurtosis measures should reveal the potential influential subset patterns in the data. One decides about the presence of influence from the scatterplot, but not the cause of the influence. Tables 2 through 5 show the Z-scores for the skewness and kurtosis measures. These Z-scores are based on D'Agostino's normality tests as performed in NCSS. The A/R columns in these tables indicate whether the null hypothesis of normality was accepted (A) or rejected (R) by that measure.

Row #   Skewness   A/R   Kurtosis   A/R
1        3.5923     R     0.3276     A
2        2.5090     R     0.7483     A
3        0.8676     A     2.0038     R
4        0.4058     A    -1.0960     A
5       12.9440     R     7.4058     R
6        3.6621     R    -0.2928     A
7       -0.4866     A     0.5601     A
8        2.4031     R     1.8530     A
9        0.9889     A    -0.8626     A
10       8.2951     R     4.5951     R
11       2.5787     R    -1.5404     A
12       0.1658     A    -0.4565     A
13       1.0196     A     1.0692     A
14       2.1956     R    -0.1877     A
15       3.8750     R     0.1031     A
16       9.6255     R     2.8368     R
17       4.4339     R     1.3482     A
18       1.0635     A    -0.2535     A
19       5.8685     R     0.6944     A
20       0.9472     A    -2.2039     R
21       3.1403     R    -0.8729     A
22      -1.4270     A     2.2091     R
23       3.7199     R    -0.8683     A
24       1.1983     A     0.1924     A
25       5.4232     R     1.4257     A
26       2.5539     R    -0.8784     A
27       7.0213     R     2.7499     R
28       0.0656     A    -0.6047     A
29       6.3322     R     1.8205     A
30       1.2567     A    -4.2048     A
31       1.5001     A    -0.0027     A
32       0.2727     A    -2.0944     R

Table 2: Skewness and Kurtosis Measures (1000 Samples of Size 5)

These standardized measures were also plotted to identify possible subsets. Figure 7 shows the standardized skewness and kurtosis measures from Table 2. For instance, a subset of size 1 would be observation 5. A subset of size 2 could be cases {10, 16}, {5, 10}, or {5, 16}, while a subset of size 3 might consist of observations 5, 10, and 16. Any larger subsets might include observations 27 and/or 29, and maybe 25, 19, or 17. Thus, subsets no larger than size eight could be investigated further for the cause of the influence.

[Figure 7: Skewness vs. kurtosis plot of Z-values (1000 samples of size 5)]

From this plot in Figure 7 and the results of the hypothesis tests, further bootstrapping was performed to clarify the potential patterns. Samples of size 5 were taken 5000 times, and samples of size 10 were taken both 1000 and 5000 times.
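A minimal sketch of this case-wise bootstrap, with SciPy's skewtest and kurtosistest supplying D'Agostino-type Z-statistics in place of the NCSS measures (H is the H* matrix defined in Section II; the repetition and sample-size values follow the text):

```python
import numpy as np
from scipy import stats

def bootstrap_flags(H, reps=1000, size=5, seed=0):
    """For each row h_i* of H*, resample its elements with
    replacement, keep the mean of each bootstrap sample, and score
    the resulting distribution of means with Z-statistics for
    skewness and kurtosis. Rows whose means deviate strongly from
    normality flag potentially influential observations."""
    rng = np.random.default_rng(seed)
    results = []
    for i, row in enumerate(H, start=1):
        means = rng.choice(row, size=(reps, size), replace=True).mean(axis=1)
        z_skew = stats.skewtest(means).statistic
        z_kurt = stats.kurtosistest(means).statistic
        results.append((i, z_skew, z_kurt))
    return results  # plot z_skew against z_kurt to look for subsets

# e.g., flags = bootstrap_flags(H, reps=5000, size=5)
```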
Row #   Skewness   A/R   Kurtosis   A/R
1        5.6870     R    -1.8543     A
2        4.8967     R    -0.7868     A
3        1.5184     A     4.6493     R
4       -1.9190     A    -3.5383     R
5       27.3181     R    15.2891     R
6        5.9316     R    -1.2864     A
7       -0.2876     A    -2.0493     R
8        6.2722     R     0.6717     A
9        3.0186     R     0.0386     A
10      16.7802     R     8.6645     R
11       6.1490     R    -3.5721     R
12       5.0384     R     0.8109     A
13       1.2143     A     0.6561     A
14       9.7394     R     4.7240     R
15      14.3182     R     5.6717     R
16      23.2649     R    11.5382     R
17       9.8139     R     4.0338     R
18       6.9041     R     1.9395     A
19      12.8177     R     1.8540     A
20       5.4967     R    -1.6092     A
21       8.3171     R     1.0903     A
22       0.6489     A     3.2664     R
23       5.6547     R    -2.6139     R
24       1.8050     A     1.1992     A
25      11.2776     R     5.1479     R
26       4.5591     R    -1.1977     A
27      18.9888     R     8.8051     R
28       3.2569     R    -1.6537     A
29      14.6532     R     6.7067     R
30       4.1593     R    -2.6362     R
31       5.0978     R     2.3014     R
32       2.3563     R    -1.6769     A

Table 3: Skewness and Kurtosis Measures (5000 Samples of Size 5)

Increasing the number of repetitions produced more observations that rejected both hypotheses. However, the plot of standardized skewness and kurtosis measures in Figure 8 reveals a very similar pattern of influential candidates to that given in Figure 7 for 1000 repetitions, except for observation 4. Observations 5, 16, 27, 10, 29, 15, and 25 could be used to form potential influential subsets of size 1 to 7, in light of the order shown in Figure 8.

[Figure 8: Skewness vs. kurtosis plot of Z-values (5000 samples of size 5)]

Both bootstrap results (1000 and 5000 repetitions) reveal that observation 5 is highly influential (although the cause of influence is unknown). In addition, observations 16, 27, 10, and 29 seem to be influential as well. Observations 15 and 25 do not seem to differ drastically in skewness and kurtosis from the data mass in Figure 8, but subsets with these observations should be considered.

Row #   Skewness   A/R   Kurtosis   A/R
1        0.272004   A     0.576398   A
2        0.6364     A    -1.6234     A
3        0.5871     A     1.7254     A
4       -1.3578     A    -0.1160     A
5        8.9176     R     4.3235     R
6        2.3565     R    -1.7805     A
7       -0.7298     A    -0.9443     A
8        1.3969     A     2.1924     R
9        2.1156     R     0.0503     A
10       4.8511     R    -0.2770     A
11       3.9442     R    -0.3561     A
12       1.6278     A    -0.5072     A
13       0.9256     A     0.7652     A
14       2.6060     R    -1.5672     A
15       4.9337     R     1.2585     A
16       8.4859     R     4.1296     R
17       3.6385     R    -0.5904     A
18       0.5822     A    -1.6423     A
19       4.5861     R     0.4925     A
20       1.6242     A    -0.9651     A
21       2.1366     R    -1.9387     A
22      -1.9825     A     1.4526     A
23       0.7325     A    -2.0482     R
24       0.5388     A     2.0328     R
25       5.0293     R     1.6868     A
26       2.0485     R    -0.2862     A
27       5.4577     R     2.2836     R
28       0.8776     A    -1.6666     A
29       6.0279     R     3.3629     R
30       1.0938     A    -1.0115     A
31       1.5759     A    -0.4592     A
32       0.6294     A    -0.9335     A

Table 4: Skewness and Kurtosis Measures (1000 Samples of Size 10)

[Figure 9: Skewness vs. kurtosis plot of Z-values (1000 samples of size 10)]

It is possible that if there is an influential subset of size greater than five, one might need to look at bootstrap samples of n = 10, which is done in Figures 9 and 10 for 1000 and 5000 repetitions, respectively. From Figure 9, one is able to see that observations 5 and 16 could form a subset of size 2. In addition, observation 29 could be a subset of size 1, while 15, 25, and 27 could be a subset of size 3. However, at least the same seven observations (5, 16, 29, 27, 25, 15, and 19) stand out again and could easily be configured into ordered influential subsets of size 2 to 7, as the sketch below illustrates.
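A sketch of that enumeration; the ordering of the flagged cases is an assumption taken from the text's most-to-least-extreme reading of Figures 8 through 10:

```python
from itertools import combinations

# Flagged observations, most extreme first (observation 5 dominates).
flagged = [5, 16, 29, 27, 25, 15, 19]

# Ordered (nested) candidate subsets: only seven subsets to
# investigate further for the cause of the influence.
nested = [flagged[:k] for k in range(1, len(flagged) + 1)]
for s in nested:
    print(s)

# Even all non-nested combinations of the flagged cases remain tiny
# compared with an exhaustive search: 2^7 - 1 = 127 subsets in total.
all_subsets = [c for k in range(1, 8) for c in combinations(flagged, k)]
print(len(all_subsets))  # 127
```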
Row #   Skewness   A/R   Kurtosis   A/R
1        4.9053     R     0.0951     A
2        5.2498     R     0.5949     A
3       -0.5994     A     2.5711     R
4        0.1191     A    -0.4914     A
5       21.6334     R    11.8558     R
6        6.1677     R     1.2282     A
7       -1.7925     A    -1.3210     A
8        3.9275     R    -0.9826     A
9        2.5841     R    -2.6125     R
10      12.4649     R     4.5684     R
11       6.2206     R    -1.0011     A
12       5.0394     R     3.3093     R
13       0.4083     A    -1.0387     A
14       7.6242     R     1.8701     A
15       9.6576     R     1.1786     A
16      17.0018     R     6.0276     R
17       7.1260     R     0.4743     A
18       3.0454     R     0.4115     A
19       8.7743     R     3.5423     R
20       3.3341     R    -0.1122     A
21       6.3813     R    -0.1391     A
22      -0.5259     A     2.1144     R
23       4.2903     R    -1.9261     A
24      -0.8020     A     2.3953     R
25      10.8794     R     4.1006     R
26       3.2874     R     1.2710     A
27      13.1162     R     2.3096     R
28       1.6321     A     0.3177     A
29      11.4741     R     5.0043     R
30       2.4050     R    -0.0426     A
31       4.3292     R     3.7389     R
32       0.7024     A    -1.5531     A

Table 5: Skewness and Kurtosis Measures (5000 Samples of Size 10)

[Figure 10: Skewness vs. kurtosis plot of Z-values (5000 samples of size 10)]

Figure 10 clearly illustrates that observation 5 is different from the rest of the data in influence, but the same seven observations as before are flagged. Working backwards from the most extreme skewness and kurtosis (observation 5) to where there is separation from the mass of standardized measures in Figure 10, one can easily construct a minimal number of potential influential subsets to evaluate for cause of influence.

IV. Conclusions and Recommendations

One goal of this research was to find a method that easily identifies influential subsets for multivariate regression. The following table compares the three methods discussed (the multivariate influence plots of Barrett and Ling, the fuzzy method, and bootstrapping) and helps illustrate their pros and cons:

                 Barrett and Ling        Fuzzy Method               Bootstrapping
Implementation   Not simple              Not simple                 Simple
Computational    Matrix manipulation     Requires solid math and    Can use resampling
                                         matrix background          programs
Visual           Influence plot not      Fuzz plot                  Skewness/kurtosis plot;
                 easily interpreted                                 clear interpretation
Combinatorial    Maximum                 Minimal                    Minimal
Practical        Difficult to explain    Could be difficult         Easiest to explain
Programming      Difficult               Difficult                  Moderate

Bootstrapping is the easiest of the methods because it only requires bootstrapping each row of H*, it minimizes the combinations of subsets to consider, and it avoids the dominance problem that a single observation can have in influence analysis. Influential observations have similar distributional qualities, as evidenced by their skewness and/or kurtosis. In addition, our results seem to suggest that 5000 repetitions for n = 5 and n = 10 work best when flagging potential influential subsets with bootstrapping, but more work is needed. The best results are achieved when the bootstrapping strategy and the fuzzy methods are combined, but practical applications in the real multivariate world would probably not allow time for both strategies to be implemented.

While this research was able to demonstrate useful methods for identifying potentially influential subsets, some of the details still need work. The usefulness of the fuzzy indices, and what exactly constitutes an appropriate value, needs to be defined.
Criteria for influential subsets in multivariate regression are being programmed in order to pinpoint the cause of the influence. Also, objective criteria for identifying influential observations still need to be formalized for the bootstrap. Although the skewness and kurtosis measures are useful guidelines, a more objective definition is needed. Finally, one extension of this research is to explore these methods on a data set that contains efficiency and quality measurements, where Y contains several output measurements and the X matrix is composed of several inputs. Many companies are beginning to focus on these types of multivariate regression models, so there is a great need for techniques that quickly identify influential subsets, both because of the model implications and because of the process knowledge gained from some of these influential subsets. This research is an important step in that direction.

Bibliography

Barrett, Bruce E., and Ling, Robert F. (1992), "General Classes of Influence Measures for Multivariate Regression," Journal of the American Statistical Association, Vol. 87, No. 417, March, 184-191.

Hintze, Jerry (1997), Number Cruncher Statistical System (NCSS), Help Menu.

Hossain, A., and Naik, D.N. (1989), "Detection of Influential Observations in Multivariate Regression," Journal of Applied Statistics, Vol. 16, No. 1, 25-37.

Kaufman, Leonard, and Rousseeuw, Peter (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons, pp. 164-198.

Reeves, Chip (1996), "Using the Bootstrap to Detect Influential Subsets in Regression," Independent Project.

Seaver, Bill, and Triantis, Konstantinos (1992), "A Fuzzy Clustering Approach Used in Evaluating Technical Efficiency Measures in Manufacturing," The Journal of Productivity Analysis, Vol. 3, 337-363.

Seaver, Bill, Triantis, Konstantinos, and Reeves, Chip, "The Identification of Influential Subsets in Regression Using a Fuzzy Clustering Strategy," under revision for Technometrics.

Appendix A

[Fuzz plot with a fuzzifier of 1.1]
[Fuzz plot with a fuzzifier of 1.2]

[Fuzz plot with a fuzzifier of 1.25]

[Fuzz plot with a fuzzifier of 1.3]

[Fuzz plot with a fuzzifier of 1.4]

[Fuzz plot with a fuzzifier of 1.5]