The document discusses identifying potentially influential subsets in multivariate regression. It describes two strategies - fuzzy clustering and bootstrapping - that can be used to identify influential subsets more efficiently than examining all possible groupings. The fuzzy clustering strategy involves determining cluster membership using nearest neighbor algorithms, fuzzy K-means clustering, and sensitivity analysis including fuzz plots. Bootstrapping can also help identify influential subsets without exhaustive computations. The strategies are demonstrated on a published dataset to show how they compare to existing multivariate diagnostics approaches.
Identifying Potentially Influential Subsets in Multivariate Regression
Bill Seaver, Ph.D.
Arika Blankenship
Department of Statistics, The University of Tennessee, Knoxville

K.P. Triantis, Ph.D.
Department of Industrial and Systems Engineering, Virginia Tech

I. Introduction

One of the challenges faced when performing a regression analysis is identifying influential subsets of observations. When using multiple regression, there are several statistical tools one can use to determine whether or not single-case observations are influential and what causes that influence. These include Cook's index, the Welsch-Kuh statistic, the covariance ratio, internal and external studentized residuals, and the influence on the βs, the regression coefficients. Generally, in order to identify subsets of influential observations in multiple regression, one might examine all possible groupings of pairs, triplets, quadruplets, etc., and then look at the influence statistics for all the different groupings (a short computation at the end of this section shows how quickly these counts grow). In a multivariate regression scenario, Barrett and Ling (1992) did this combinatorial examination with the Rohwer data set (Hossain and Naik, 1989) to identify influential pairs of observations.

There are two possible strategies for multivariate regression that may avoid the combinatorial problems. First, there is the fuzzy clustering strategy, alluded to by Seaver and Triantis (1992) and more fully developed recently (Seaver, Triantis, and Reeves, 1997). By using fuzz plots, a degree-of-membership matrix, and fuzzy indices to identify possible influential observations, only certain observations need to be evaluated in pairs, triplets, quadruplets, etc. Thus, it is not as computationally intensive as Barrett and Ling's approach. The second option, which uses bootstrapping, will not necessarily be computationally shorter, but it will be simpler and less dependent upon finding the appropriate optimal subset combinations.

The purpose of this research is to see whether the fuzzy clustering or bootstrapping strategies will identify potentially influential subsets in multivariate regression, since they have already been demonstrated to work for the multiple regression case (Seaver, Triantis, and Reeves, 1997). The intent of this research is not to say what the cause or criterion of the influence is. Traditionally, there has not been a straightforward way to identify influential observations in multivariate regression. Barrett and Ling (1992) extended some of the univariate ideas to produce multivariate measures, as did Hossain and Naik (1989). These included a multivariate version of Cook's index, a Welsch-Kuh type statistic, and a multivariate covariance ratio, to name a few, all of which are very labor intensive. In addition, after evaluating individual observations with these statistics, one must still look at all possible pairs, triplets, quadruplets, etc. The fuzzy clustering strategy and/or the bootstrapping method used in earlier work should reduce the labor time and should provide a much simpler way to identify influential subsets in multivariate regression.

To demonstrate the methods, the previously published Rohwer data is analyzed (Barrett and Ling, 1992; Hossain and Naik, 1989). The Rohwer data presents some difficulties with collinearity as well as with identifying influential observations and influential subsets. It will serve to illustrate the differences between the fuzzy clustering strategy, bootstrapping, and the multivariate regression diagnostics approach.
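To make the scale of the combinatorial burden concrete, the following quick computation (a sketch in Python; n = 32 is the Rohwer sample size used later in this paper) counts the candidate subsets an exhaustive search would face:

```python
from math import comb

# Number of candidate subsets of each size that an exhaustive
# influence analysis must evaluate for n observations.
n = 32  # size of the Rohwer data set
for k in range(1, 9):
    print(f"subsets of size {k}: {comb(n, k):,}")

# Sizes 1 through 8 alone total more than 15 million subsets,
# which is why a screening strategy is attractive.
print("total, sizes 1-8:", sum(comb(n, k) for k in range(1, 9)))
```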
The next section of this paper discusses the basis for this research, the third section details the methodology used and the findings, and section four provides conclusions and possible extensions of this research.

II. Background

Barrett and Ling (1992) used multivariate measures of influence, partitioned into leverage and residual components, to construct influence plots for identifying influential subsets. An influence plot would be constructed for each combination of observations: single cases, pairs, triplets, and so forth. The influence plots for single cases (Figure 1) and for pairs (Figure 2) are shown below for the partitioning of the multivariate version of Cook's distance into leverage and residual components.

[Figure 1: Influence plots of individual cases (Barrett and Ling, 1992)]

Both of these plots have the log of the leverage component on the Y-axis and the log of the studentized residual component on the X-axis, as the two parts of Cook's distance. The dashed lines represent contours of influence; each successive dashed line represents twice the influence of the line below it. Figure 1 represents individual cases, while Figure 2 illustrates influential pairs. In Figure 1, the single observations 5, 25, 27, and 29 are more influential than the other observations in this data set. In Figure 2, the influential pairs appear to be (5,10) and (5,22). The drawback to the influential-pairs plot is that all possible pairs are plotted, as evidenced by the multitude of dots filling Figure 2. Another drawback is the difficulty in reading the plots. For instance, in Figure 1 for single-case observations, Barrett and Ling state that case {10} and case {20} have nearly identical influence but are quite different in terms of leverage and residual contribution; yet in Figure 1, it appears that observation 10 has twice the influence of observation 20. While the interpretation of some points or subsets in the influence plots can be difficult, the benefit of these plots is the visualization of the influence.

[Figure 2: All possible pairs influence plot (Barrett and Ling, 1992)]

Barrett and Ling also produced formulas for several different multivariate influence measures, such as DFFITS, Covratio, the Andrews-Pregibon statistic, and Cook's distance. Although they did produce some time-saving formulas, one must still know ahead of time which measures one wants to use. Once the multivariate influence measure is chosen, it is partitioned into leverage and residual components. Therefore, there can be an extensive amount of matrix manipulation and combinatorial computation even for small data sets and for the choice of several influence measures, which can be difficult and too time consuming.

The fuzzy clustering strategy for identifying influential subsets consists of three stages (Seaver, Triantis, and Reeves, 1997; Seaver and Triantis, 1992). First, the modal number of clusters is discovered using the matrix H* = Z(Z'Z)^-1 Z', where Z = (X|Y), X is (n x p), and Y is (n x k), as the similarity measure for the Kth nearest-neighbor clustering algorithm (H* is assumed to be a similarity matrix that holds leverage, residual, and other influence information). It is possible that some of these clusters may be unique influential subsets. The details of that procedure are discussed by Seaver, Triantis, and Reeves (submitted).
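The paper does not give code for this step; the following is a minimal sketch of forming H* from the definition above, with randomly generated stand-ins for the (32 x 5) X and (32 x 3) Y matrices of the Rohwer data:

```python
import numpy as np

def h_star(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Similarity matrix H* = Z (Z'Z)^-1 Z' with Z = (X | Y),
    where X is (n x p) and Y is (n x k). H* acts like a hat
    matrix over the joint predictor/response space."""
    Z = np.hstack([X, Y])
    # Solve (Z'Z) B = Z' instead of inverting explicitly; this is
    # numerically safer when the columns of Z are collinear.
    B = np.linalg.solve(Z.T @ Z, Z.T)
    return Z @ B

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))   # stand-in for the 5 independent variables
Y = rng.normal(size=(32, 3))   # stand-in for the 3 dependent variables
H = h_star(X, Y)
print(H.shape, np.allclose(H, H.T))  # (32, 32) True: H* is symmetric
```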
In the second stage, the fuzzy K-means clustering method uses the solution from the Kth nearest-neighbor algorithm as a starting point for assigning a degree of membership to each observation for every cluster. Third, a sensitivity analysis is performed. Seaver, Triantis, and Reeves (1997) used bootstrapping to confirm their fuzzy results and assess the uniqueness of subsets and individual observations in multiple regression. As noted by Seaver et al. (1997): "Observations that are influential would be expected to belong to a cluster or clusters with a high degree of belonging, or they would have a very fuzzy degree of belonging across several clusters. In the latter case, if an observation or observations have a fuzzy degree of belonging across several clusters and if the number of clusters is increased slightly, it is expected that a fuzzy observation or subset would form their own cluster." Thus, the uniqueness of a subset can be evaluated by changing the value of two parameters: the fuzzifier (m) and/or the number of subsets. The evaluation of subset uniqueness is referred to as the sensitivity analysis. This step includes evaluating fuzz plots (a visualization of the degrees of belonging across subsets), analyzing the degree-of-membership matrix, and inspecting the normalized fuzzy indices.

The fuzz plot is constructed with the degree of belonging on the Y-axis and the case number on the X-axis, so the XY space illustrates the cluster belonging of each observation. To identify influential subsets or observations in a fuzz plot, there are three possible patterns. Using the Rohwer data, these three patterns are given in Figures 3, 4, and 5. Figure 3 shows a typical straight-line pattern for the hard solution with six clusters; most degrees of belonging are close to 1, with no subsets standing out.

[Figure 3: Fuzz plot with a fuzzifier of 1.1]
Comparing the almost straight line of Figure 3 to the waterfall pattern of Figure 4, one begins to see possible influential observations, such as 17, 29, 9, and 5.

[Figure 4: Fuzz plot with a fuzzifier of 1.3]
However, too much of a waterfall effect, as seen in Figure 5, does not allow an analyst to easily identify influential observations. In the fuzz plot of Figure 5, observations 5, 29, 25, and 17 stand out since they approach the average degree of membership line. If all of the observations fall to the average degree of membership, no new knowledge about the observations is gained (the average degree of membership is the reciprocal of the number of clusters). Another flag for influential subsets under the fuzzy strategy is a cluster that remains stable throughout the sensitivity analysis; such clusters need to be investigated as candidates for influential subsets.

[Figure 5: Fuzz plot with a fuzzifier of 1.4]
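The exact fuzzy K-means variant used by Seaver, Triantis, and Reeves is not reproduced here; the sketch below uses the standard fuzzy c-means membership update (with fuzzifier m) and draws a fuzz plot from the resulting membership matrix. The names `data`, `centers`, and `u` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def fuzzy_memberships(data, centers, m=1.3):
    """Standard fuzzy c-means degree-of-membership update:
    u[i, j] = 1 / sum_k (d_ij / d_ik)^(2/(m-1)),
    so each row of u sums to 1. The fuzzifier m > 1 controls
    how hard (m near 1) or fuzzy (m large) the solution is."""
    # Euclidean distances from each observation to each cluster center.
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against a zero distance
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

def fuzz_plot(u):
    """Degree of belonging (Y) versus case number (X), one symbol
    per cluster, with the average-membership line 1/k drawn in."""
    n, k = u.shape
    for j in range(k):
        plt.scatter(range(1, n + 1), u[:, j], marker=f"${j + 1}$")
    plt.axhline(1.0 / k, linestyle="--")  # average degree of membership
    plt.xlabel("case number")
    plt.ylabel("degree of belonging")
    plt.show()
```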
In evaluating the degree-of-membership matrix, there are a couple of things to look for. Fuzzy observations whose degrees of belonging are splintered across clusters, or that attain almost equal membership in several clusters during the sensitivity analysis, may form influential subsets. Influential observations may also switch clusters as the fuzzifier changes; such a switch indicates that the particular observation's characteristics are not the same as the other observations' characteristics. For example, observation 5 in the Rohwer data switches clusters. Thus, observation 5 has different characteristics than all the other observations in the data set.

Sometimes, normalized fuzzy indices are useful in determining whether or not the chosen fuzzifier is appropriate for identifying the observations. If either of the indices is close to one or close to zero, then the fuzzifier is not the correct one. One should concentrate on the cluster solution where both Dunn's normalized partition coefficient (FPU) and the normalized average squared error coefficient (DPU) have complementary low values (Seaver et al., 1997). This guideline will help in identifying potentially influential observations or subsets of observations.

The bootstrapping process is a separate approach in this research, although it is used as a verification step in the multiple regression case (Seaver, Triantis, and Reeves, 1997). Using the H* matrix discussed earlier, a case-wise bootstrap of the H* elements is performed to obtain a set of possible influential observations. In this setting, one must decide the number of repetitions to use and how large the sample size needs to be. Repetitions may range anywhere from 1,000 to 10,000, while sample sizes generally range from 5 to 10. Using standardized skewness and kurtosis measures on the means from these samples, one can identify influential observations and possible subsets in a two-dimensional plot. An overview of these standardized skewness and kurtosis measures is given in the Help Menu of the statistical package Number Cruncher Statistical System (NCSS, 1997).

III. Research Methodology and Findings

To compare influential subset identification strategies, a data set with known influential observations and subsets was examined. The Rohwer data set contains three dependent variables and five independent variables: y1 = Peabody Picture Vocabulary Test; y2 = Student Achievement Test; y3 = Raven Progressive Matrices Test; x1 = named; x2 = still; x3 = named still; x4 = named action; and x5 = sentence still. The subjects were 32 randomly selected white, upper-class, residential school children. Previously, both Hossain and Naik and Barrett and Ling identified observation 5 as influential. Hossain and Naik concluded that observation 5 is jointly influential with [independent] variable 1; if [independent] variable 1 is removed from the analysis, observation 5 is not influential (Hossain and Naik, 1989). Barrett and Ling concluded that observation 5 has both the largest influence and the largest leverage. They also decided that the pairs {5, 14} and {5, 25} are both influential but are canceling in their influence (Barrett and Ling, 1992).

The three-stage fuzzy clustering process identified six as the modal number of clusters.
Since this gives only a rough indication of what is needed, a sensitivity analysis was performed to solidify the results. This stage requires one to vary the fuzzifier, the number of clusters, and possibly the number of neighbors considered in the nearest-neighbor method. The fuzzifier was varied from 1.1 to 2.0, the number of clusters from 6 to 8, and the number of neighbors from 2 to 3. The analysis indicates that six is an ideal number of clusters. The fuzz plots for fuzzifier values of 1.1, 1.2, 1.25, 1.3, 1.4, and 1.5 can be found in Appendix A.

In evaluating the fuzz plots, it is possible to see that a fuzzifier of 1.1 produces a hard solution, while a fuzzifier of 1.4 results in 25 percent (8 of 32) of the observations clustering around the average degree of membership. Therefore, two conclusions could be reached. First, the subset of influential points could be as large as 8. Second, it is unlikely that the fuzzifier needs to go beyond 1.4. This latter conclusion is supported by the normalized fuzzy indices given in Table 1 for six clusters.

Fuzzifier     FPU         DPU
1.1        .9871736    .5989974
1.15       .9452504    .9582983
1.2        .8396765    .9237529
1.25       .7413848    .5920081
1.3        .6236388    .6679089
1.4        .4372243    .5653917
1.5        .2981738    .5714289
1.6        .1203189    .6879582
2.0        .0003395    .9799557

Table 1: Normalized Indices

Normalized fuzzy indices measure how hard a fuzzy clustering solution is. Given that u_ij is the degree of belonging of the ith observation to the jth group and k is the number of clusters, Dunn's normalized partition coefficient is

    FPU = (F - 1/k) / (1 - 1/k),  where F = (1/n) sum_i sum_j u_ij^2,

ranging in value from 0 to 1 (Kaufman and Rousseeuw, 1990). This value is based on the square of the degree-of-belonging matrix; thus, it tends to read fuzzier than the other index, DPU. In contrast, the DPU index, the normalized average squared error of a fuzzy solution with respect to the closest hard solution, compares the fuzzy result to a standard (Kaufman and Rousseeuw, 1990). It evaluates how far one has moved from the starting point and tends to be a much more reliable value. In evaluating the results, the FPU steadily decreases from about 1, the hard solution, to zero, the fuzzy solution, as the fuzzifier increases. The normalized fuzzy indices, FPU and DPU, in Table 1 seem to suggest that the ideal fuzzifier may be 1.25 or 1.3.

For the 1.3 fuzzifier and six clusters, observations 5, 17, 9, and 29 fuzz out according to the matrix of belonging, quickly followed by observations 12, 10, 22, and 16. It is interesting to note that observations 2, 14, 25, 29, and 9 are clustered with the dominant influential observation, 5. When one extends the clusters to 7 and 8 for the same fuzzifier value of 1.3, the same data points are still influential candidates. The bottom line is that there could be influential subsets of size 8 to consider, and that is not an easy task. The lack of clarity in subset identification may be due to the collinearity among the independent variables.
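For concreteness, here is one way the two indices could be computed from a membership matrix u; the normalizations follow our reading of Kaufman and Rousseeuw (1990) and should be treated as an assumption rather than the paper's exact formulas.

```python
import numpy as np

def normalized_indices(u):
    """FPU and DPU for an (n x k) membership matrix u.

    FPU: Dunn's partition coefficient F = (1/n) sum u_ij^2,
    rescaled so that 1 = hard solution and 0 = completely fuzzy.
    DPU: average squared error between u and the closest hard
    solution, rescaled to [0, 1] (assumed normalization)."""
    n, k = u.shape
    f = (u ** 2).sum() / n
    fpu = (f - 1.0 / k) / (1.0 - 1.0 / k)
    # Closest hard solution: each case wholly in its best cluster.
    hard = np.zeros_like(u)
    hard[np.arange(n), u.argmax(axis=1)] = 1.0
    d = ((u - hard) ** 2).sum() / n
    dpu = d / (1.0 - 1.0 / k)
    return fpu, dpu
```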
The bootstrapping approach, which takes each row vector h_i* out of the H* matrix, identified four influential observations. Each row of the H* matrix was sampled 1000 times with samples of size 5. The means of the samples were computed, scored, and placed into boxplots; anything deviating strongly from a normal distribution is a potential influential observation. The boxplots in Figure 6 are for observations 1, 5, 10, 16, 27, and 31, where observations 1 and 31 are included to represent non-influential observations.

[Figure 6: Boxplots of select observations (rows 1, 5, 10, 16, 27, and 31)]

The critical bootstrapping guideline in classifying observations as influential or not is the use of standardized skewness and kurtosis measures. Based on these measures, three results could be indicative of influence: extreme skewness and extreme kurtosis together, extreme skewness only, or extreme kurtosis only. However, the scatterplot of the standardized skewness and kurtosis measures should reveal the potential influential subset patterns in the data. One decides about the presence of influence from the scatterplot, but not the cause of the influence. Tables 2 through 5 show the Z-scores for the skewness and kurtosis measures. These Z-scores are based on D'Agostino's normality tests as performed in NCSS. The A/R columns in these tables indicate whether the null hypothesis of normality was accepted (A) or rejected (R) by that measure.

Row #   Skewness   A/R   Kurtosis   A/R
1        3.5923     R     0.3276     A
2        2.5090     R     0.7483     A
3        0.8676     A     2.0038     R
4        0.4058     A    -1.0960     A
5       12.9440     R     7.4058     R
6        3.6621     R    -0.2928     A
7       -0.4866     A     0.5601     A
8        2.4031     R     1.8530     A
9        0.9889     A    -0.8626     A
10       8.2951     R     4.5951     R
11       2.5787     R    -1.5404     A
12       0.1658     A    -0.4565     A
13       1.0196     A     1.0692     A
14       2.1956     R    -0.1877     A
15       3.8750     R     0.1031     A
16       9.6255     R     2.8368     R
17       4.4339     R     1.3482     A
18       1.0635     A    -0.2535     A
19       5.8685     R     0.6944     A
20       0.9472     A    -2.2039     R
21       3.1403     R    -0.8729     A
22      -1.4270     A     2.2091     R
23       3.7199     R    -0.8683     A
24       1.1983     A     0.1924     A
25       5.4232     R     1.4257     A
26       2.5539     R    -0.8784     A
27       7.0213     R     2.7499     R
28       0.0656     A    -0.6047     A
29       6.3322     R     1.8205     A
30       1.2567     A    -4.2048     A
31       1.5001     A    -0.0027     A
32       0.2727     A    -2.0944     R

Table 2: Skewness and Kurtosis Measures (1000 Samples of Size 5)

These standardized measures were also plotted to identify possible subsets. Figure 7 shows the standardized skewness and kurtosis measures from Table 2. For instance, a subset of size 1 would be observation 5. A subset of size 2 could be cases {10, 16}, {5, 10}, or {5, 16}, while a subset of size 3 might consist of observations 5, 10, and 16. Any larger subsets might include observations 27 and/or 29, and maybe 25, 19, or 17. Thus, subsets no larger than size eight could be investigated further for the cause of the influence.

[Figure 7: Skewness vs. kurtosis plot of Z-values (1000 samples of size 5)]

From this plot in Figure 7 and the results of the hypothesis tests, further bootstrapping was performed to clarify the potential patterns. Samples of size 5 were taken 5000 times, and samples of size 10 were taken both 1000 and 5000 times.
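A minimal sketch of this case-wise bootstrap, with SciPy's skewtest and kurtosistest supplying D'Agostino-type Z-statistics in place of the NCSS measures (H is the H* matrix defined in Section II; the repetition and sample-size values follow the text):

```python
import numpy as np
from scipy import stats

def bootstrap_flags(H, reps=1000, size=5, seed=0):
    """For each row h_i* of H*, resample its elements with
    replacement, keep the mean of each bootstrap sample, and score
    the resulting distribution of means with Z-statistics for
    skewness and kurtosis. Rows whose means deviate strongly from
    normality flag potentially influential observations."""
    rng = np.random.default_rng(seed)
    results = []
    for i, row in enumerate(H, start=1):
        means = rng.choice(row, size=(reps, size), replace=True).mean(axis=1)
        z_skew = stats.skewtest(means).statistic
        z_kurt = stats.kurtosistest(means).statistic
        results.append((i, z_skew, z_kurt))
    return results  # plot z_skew against z_kurt to look for subsets

# e.g., flags = bootstrap_flags(H, reps=5000, size=5)
```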
Row #   Skewness   A/R   Kurtosis   A/R
1        5.6870     R    -1.8543     A
2        4.8967     R    -0.7868     A
3        1.5184     A     4.6493     R
4       -1.9190     A    -3.5383     R
5       27.3181     R    15.2891     R
6        5.9316     R    -1.2864     A
7       -0.2876     A    -2.0493     R
8        6.2722     R     0.6717     A
9        3.0186     R     0.0386     A
10      16.7802     R     8.6645     R
11       6.1490     R    -3.5721     R
12       5.0384     R     0.8109     A
13       1.2143     A     0.6561     A
14       9.7394     R     4.7240     R
15      14.3182     R     5.6717     R
16      23.2649     R    11.5382     R
17       9.8139     R     4.0338     R
18       6.9041     R     1.9395     A
19      12.8177     R     1.8540     A
20       5.4967     R    -1.6092     A
21       8.3171     R     1.0903     A
22       0.6489     A     3.2664     R
23       5.6547     R    -2.6139     R
24       1.8050     A     1.1992     A
25      11.2776     R     5.1479     R
26       4.5591     R    -1.1977     A
27      18.9888     R     8.8051     R
28       3.2569     R    -1.6537     A
29      14.6532     R     6.7067     R
30       4.1593     R    -2.6362     R
31       5.0978     R     2.3014     R
32       2.3563     R    -1.6769     A

Table 3: Skewness and Kurtosis Measures (5000 Samples of Size 5)

Increasing the number of repetitions produced more observations that rejected both hypotheses. However, the plot of standardized skewness and kurtosis measures in Figure 8 reveals a very similar pattern of influential candidates to that given in Figure 7 for 1000 repetitions, except for observation 4. Observations 5, 16, 27, 10, 29, 15, and 25 could be used to form potential influential subsets of size 1 to 7, in light of the order shown in Figure 8.

[Figure 8: Skewness vs. kurtosis plot of Z-values (5000 samples of size 5)]

Both bootstrap results (1000 and 5000 repetitions) reveal that observation 5 is highly influential (although the cause of influence is unknown). In addition, observations 16, 27, 10, and 29 seem to be influential as well. Observations 15 and 25 do not seem to differ drastically in skewness and kurtosis from the data mass in Figure 8, but subsets with these observations should be considered.

Row #   Skewness   A/R   Kurtosis   A/R
1        0.272004   A     0.576398   A
2        0.6364     A    -1.6234     A
3        0.5871     A     1.7254     A
4       -1.3578     A    -0.1160     A
5        8.9176     R     4.3235     R
6        2.3565     R    -1.7805     A
7       -0.7298     A    -0.9443     A
8        1.3969     A     2.1924     R
9        2.1156     R     0.0503     A
10       4.8511     R    -0.2770     A
11       3.9442     R    -0.3561     A
12       1.6278     A    -0.5072     A
13       0.9256     A     0.7652     A
14       2.6060     R    -1.5672     A
15       4.9337     R     1.2585     A
16       8.4859     R     4.1296     R
17       3.6385     R    -0.5904     A
18       0.5822     A    -1.6423     A
19       4.5861     R     0.4925     A
20       1.6242     A    -0.9651     A
21       2.1366     R    -1.9387     A
22      -1.9825     A     1.4526     A
23       0.7325     A    -2.0482     R
24       0.5388     A     2.0328     R
25       5.0293     R     1.6868     A
26       2.0485     R    -0.2862     A
27       5.4577     R     2.2836     R
28       0.8776     A    -1.6666     A
29       6.0279     R     3.3629     R
30       1.0938     A    -1.0115     A
31       1.5759     A    -0.4592     A
32       0.6294     A    -0.9335     A

Table 4: Skewness and Kurtosis Measures (1000 Samples of Size 10)

[Figure 9: Skewness vs. kurtosis plot of Z-values (1000 samples of size 10)]

It is possible that if there is an influential subset of size greater than five, one might need to look at bootstrap samples of n = 10, which is done in Figures 9 and 10 for 1000 and 5000 repetitions, respectively. From Figure 9, one is able to see that observations 5 and 16 could form a subset of size 2. In addition, observation 29 could be a subset of size 1, while 15, 25, and 27 could be a subset of size 3. However, at least the same seven observations (5, 16, 29, 27, 25, 15, and 19) stand out again and could easily be configured into ordered influential subsets of size 2 to 7, as the sketch below illustrates.
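A sketch of that enumeration; the ordering of the flagged cases is an assumption taken from the text's most-to-least-extreme reading of Figures 8 through 10:

```python
from itertools import combinations

# Flagged observations, most extreme first (observation 5 dominates).
flagged = [5, 16, 29, 27, 25, 15, 19]

# Ordered (nested) candidate subsets: only seven subsets to
# investigate further for the cause of the influence.
nested = [flagged[:k] for k in range(1, len(flagged) + 1)]
for s in nested:
    print(s)

# Even all non-nested combinations of the flagged cases remain tiny
# compared with an exhaustive search: 2^7 - 1 = 127 subsets in total.
all_subsets = [c for k in range(1, 8) for c in combinations(flagged, k)]
print(len(all_subsets))  # 127
```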
Row #   Skewness   A/R   Kurtosis   A/R
1        4.9053     R     0.0951     A
2        5.2498     R     0.5949     A
3       -0.5994     A     2.5711     R
4        0.1191     A    -0.4914     A
5       21.6334     R    11.8558     R
6        6.1677     R     1.2282     A
7       -1.7925     A    -1.3210     A
8        3.9275     R    -0.9826     A
9        2.5841     R    -2.6125     R
10      12.4649     R     4.5684     R
11       6.2206     R    -1.0011     A
12       5.0394     R     3.3093     R
13       0.4083     A    -1.0387     A
14       7.6242     R     1.8701     A
15       9.6576     R     1.1786     A
16      17.0018     R     6.0276     R
17       7.1260     R     0.4743     A
18       3.0454     R     0.4115     A
19       8.7743     R     3.5423     R
20       3.3341     R    -0.1122     A
21       6.3813     R    -0.1391     A
22      -0.5259     A     2.1144     R
23       4.2903     R    -1.9261     A
24      -0.8020     A     2.3953     R
25      10.8794     R     4.1006     R
26       3.2874     R     1.2710     A
27      13.1162     R     2.3096     R
28       1.6321     A     0.3177     A
29      11.4741     R     5.0043     R
30       2.4050     R    -0.0426     A
31       4.3292     R     3.7389     R
32       0.7024     A    -1.5531     A

Table 5: Skewness and Kurtosis Measures (5000 Samples of Size 10)

[Figure 10: Skewness vs. kurtosis plot of Z-values (5000 samples of size 10)]

Figure 10 clearly illustrates that observation 5 is different from the rest of the data in influence, but the same seven observations as before are flagged. Working backwards from the most extreme skewness and kurtosis (observation 5) to where there is separation from the mass of standardized measures in Figure 10, one can easily construct a minimal number of potential influential subsets to evaluate for cause of influence.

IV. Conclusions and Recommendations

One goal of this research was to find a method that easily identifies influential subsets for multivariate regression. The following table compares the three methods discussed (the multivariate influence plots of Barrett and Ling, the fuzzy method, and bootstrapping) and helps illustrate their pros and cons:

                 Barrett and Ling        Fuzzy Method               Bootstrapping
Implementation   Not simple              Not simple                 Simple
Computational    Matrix manipulation     Requires solid math and    Can use resampling
                                         matrix background          programs
Visual           Influence plot not      Fuzz plot                  Skewness/kurtosis plot;
                 easily interpreted                                 clear interpretation
Combinatorial    Maximum                 Minimal                    Minimal
Practical        Difficult to explain    Could be difficult         Easiest to explain
Programming      Difficult               Difficult                  Moderate

Bootstrapping is the easiest of the methods because it only requires bootstrapping each row of H*, it minimizes the combinations of subsets to consider, and it avoids the dominance problem that a single observation can have in influence analysis. Influential observations have similar distributional qualities, as evidenced by their skewness and/or kurtosis. In addition, our results seem to suggest that 5000 repetitions for n = 5 and n = 10 work best when flagging potential influential subsets with bootstrapping, but more work is needed. The best results are achieved when the bootstrapping strategy and the fuzzy methods are combined, but practical applications in the real multivariate world would probably not allow time for both strategies to be implemented.

While this research was able to demonstrate useful methods for identifying potentially influential subsets, some of the details still need work. The usefulness of the fuzzy indices, and what exactly constitutes an appropriate value, needs to be defined.
Criteria for influential subsets in multivariate regression are being programmed in order to pinpoint the cause of the influence. Also, objective criteria for identifying influential observations still need to be formalized for the bootstrap. Although the skewness and kurtosis measures are useful guidelines, a more objective definition is needed. Finally, one extension of this research is to explore these methods on a data set that contains efficiency and quality measurements, where Y contains several output measurements and the X matrix is composed of several inputs. Many companies are beginning to focus on these types of multivariate regression models, so there is a great need for techniques that quickly identify influential subsets, both because of the model implications and because of the process knowledge gained from some of these influential subsets. This research is an important step in that direction.

Bibliography

Barrett, Bruce E., and Ling, Robert F. (1992), "General Classes of Influence Measures for Multivariate Regression," Journal of the American Statistical Association, Vol. 87, No. 417, March, 184-191.

Hintze, Jerry (1997), Number Cruncher Statistical System (NCSS), Help Menu.

Hossain, A., and Naik, D.N. (1989), "Detection of Influential Observations in Multivariate Regression," Journal of Applied Statistics, Vol. 16, No. 1, 25-37.

Kaufman, Leonard, and Rousseeuw, Peter (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons, pp. 164-198.

Reeves, Chip (1996), "Using the Bootstrap to Detect Influential Subsets in Regression," Independent Project.

Seaver, Bill, and Triantis, Konstantinos (1992), "A Fuzzy Clustering Approach Used in Evaluating Technical Efficiency Measures in Manufacturing," The Journal of Productivity Analysis, Vol. 3, 337-363.

Seaver, Bill, Triantis, Konstantinos, and Reeves, Chip, "The Identification of Influential Subsets in Regression Using a Fuzzy Clustering Strategy," under revision for Technometrics.

Appendix A

[Fuzz plot with a fuzzifier of 1.1]
[Fuzz plot with a fuzzifier of 1.2]

[Fuzz plot with a fuzzifier of 1.25]

[Fuzz plot with a fuzzifier of 1.3]

[Fuzz plot with a fuzzifier of 1.4]

[Fuzz plot with a fuzzifier of 1.5]