Kyplot Research PDF
Koichi Yoshioka
1 Introduction
2 Structure of KyPlot
Figure 1. Screen shot of KyPlot showing a spread window containing the ethanol dataset and a
scatter plot of the data in a page of a figure window.
The other type of window is "figure", containing one or more pages for drawing
graphs and schemes. Figure pages are object-oriented. The classes of drawing
objects supported are line, rectangle, ellipse, B-spline, text, picture and graph. KyPlot
has a versatile drawing functionality to create technical drawings: one can move and
resize objects with a "select and drag" operation with a mouse; one can cut, copy and
paste them and apply operations such as grouping and layering. KyPlot supports
many types of 2- and 3-dimensional graphs. The versatility in creating many types of
graphs and in customizing graph components is one of the most significant features
of KyPlot.
On-screen presentation using a personal computer has become increasingly popular.
In version 3 of KyPlot, multiple pages can be created in a figure window
and the user can perform a screen show of these pages in a full screen mode within
the application.
Figure 3 shows the results of applying local polynomial least squares regression to
the ethanol dataset (Simonoff 1996; Loader 1999). The procedures to perform graph
fitting are as follows. The first step of graph fitting is to create a graph of the data: the
user selects the range of a dataset in a spreadsheet and opens the [Create Graph]
menu, which pops up a dialog box to specify how the data is formatted and then
another dialog box to select the graph type and other graph options. After finishing
these operations, a graph is created in a figure page (Figure 1). The user then opens
the [Graph] - [Fit] menu, which pops up a dialog box for fitting, and selects
"Local Polynomial Regression" from the "Method" combo box (Figure 2).
As shown in Figure 2, settings for fitting procedures can be made by selecting radio
button options, marking check boxes and entering values in text boxes so that no
programming or macro is necessary: the user can perform fitting with mouse clicks
and a minimal amount of typing.
[Figure 3 graphic. Panels A and B. Horizontal axis: Equivalence Ratio]
Figure 3. Univariate (A) and bivariate (B) local quadratic fits to the ethanol dataset. For the
bivariate fit, a spherically symmetric Gaussian kernel was used; the data points under the fitting
surface are darkened. The bandwidths of the local regressors were determined by AICc.
In the Fit dialog box, the user can specify parameters of fitting, such as the degree of
local polynomial (2 in this example), type of kernel function (Gaussian in this
example), and the value of the bandwidth. Then the user clicks the "Apply Fit"
button to start calculation and the fitting curve is drawn on the graph (Figure 3). To
determine the value of the bandwidth, selection criteria such as GCV (generalized
cross validation), AIC (Akaike's information criterion) and AICc (corrected AIC) are
available. The user can perform a grid search for the criteria: in the "Select
Bandwidth" frame in the dialog box, selecting the "Grid Search" option and
clicking the "Start Calculation" button give a range of bandwidths and the
values of these criteria in the spreadsheet of the dialog box (Figure 2). When the
"Optimize" option is selected, clicking the Start Calculation button activates an
iterative routine to optimize one of the criteria. In Figure 3 the bandwidths of both the
univariate and bivariate local quadratic estimates were determined by AICc (Hurvich
et al., 1998). These procedures can be repeated until a satisfactory fit is obtained. The
Fit dialog box thus serves as a "toolbox" for interactive graph fitting.
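The grid search described above can be sketched outside KyPlot as follows. This is a minimal illustration, not KyPlot's implementation: it assumes a univariate local polynomial fit with an unnormalized Gaussian kernel, and it uses the AICc form of Hurvich et al. (1998), log(RSS/n) + 1 + 2(tr(H) + 1)/(n - tr(H) - 2), where H is the smoother (hat) matrix. The function names `solve`, `local_poly`, `aicc` and `grid_search` are hypothetical.

```python
import math

def solve(a, b):
    """Solve the small linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def local_poly(x, y, x0, h, degree=2):
    """Local polynomial fit at x0 with Gaussian kernel weights exp(-u^2 / (2 h^2)).

    Returns (fit, lev). When x0 is one of the data points, lev is the
    corresponding diagonal element of the hat matrix.
    """
    p = degree + 1
    a = [[0.0] * p for _ in range(p)]  # X'WX
    b = [0.0] * p                      # X'Wy
    for xi, yi in zip(x, y):
        u = xi - x0
        w = math.exp(-0.5 * (u / h) ** 2)
        powers = [u ** k for k in range(2 * p - 1)]
        for j in range(p):
            b[j] += w * powers[j] * yi
            for k in range(p):
                a[j][k] += w * powers[j + k]
    # z = (X'WX)^{-1} e1; by symmetry the intercept estimate is z . (X'Wy)
    z = solve(a, [1.0] + [0.0] * (p - 1))
    fit = sum(zj * bj for zj, bj in zip(z, b))
    return fit, z[0]

def aicc(x, y, h, degree=2):
    """Corrected AIC for bandwidth h (assumed form, see the lead-in)."""
    n = len(x)
    rss = trace = 0.0
    for xi, yi in zip(x, y):
        fit, lev = local_poly(x, y, xi, h, degree)
        rss += (yi - fit) ** 2
        trace += lev
    return math.log(rss / n) + 1.0 + 2.0 * (trace + 1.0) / (n - trace - 2.0)

def grid_search(x, y, bandwidths, degree=2):
    """Return the bandwidth in the grid with the smallest AICc."""
    return min(bandwidths, key=lambda h: aicc(x, y, h, degree))
```

Calling `grid_search(x, y, [0.05, 0.1, 0.2, 0.4])` evaluates the criterion at each bandwidth, much as the "Grid Search" option tabulates criteria over a range of bandwidths. Very small bandwidths can make the local system nearly singular, so the grid should stay above the spacing of the data.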
KyPlot provides flexible options for fitting: the dataset used for fitting is the one
given in the spreadsheet in the Fit dialog box so that the user may add, exclude or
change values of the data for fitting without changing the data for graphing; weights
can be assigned to individual data points; and by clicking the "New Fit" button,
multiple fits can be applied to the same graph.
KyPlot also supports penalized least squares regression: smoothing splines for
univariate regression and thin plate splines for bivariate regression. Figure 4 shows a
thin plate spline interpolant (A) and smoother (B) to the ore-bearing layer dataset
(Green and Silverman, 1994). The smoothing parameter of the splines can be
determined in a similar manner to the bandwidth of local polynomial estimators.
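The idea of penalized least squares can be illustrated with a discrete analogue of the smoothing spline. The sketch below is an assumption-laden simplification, not the spline algorithm used by KyPlot: for equally spaced data it minimizes the residual sum of squares plus `lam` times the sum of squared second differences of the fitted values (a Whittaker-type smoother), which leads to the linear system (I + lam D'D) f = y. The names `solve` and `penalized_fit` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def penalized_fit(y, lam):
    """Minimize sum (y_i - f_i)^2 + lam * sum (f_{i-1} - 2 f_i + f_{i+1})^2."""
    n = len(y)
    # Start from the identity and add lam * D'D, where each row of D
    # holds the second-difference stencil (1, -2, 1).
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    stencil = (1.0, -2.0, 1.0)
    for r in range(n - 2):
        for i, ci in enumerate(stencil):
            for j, cj in enumerate(stencil):
                a[r + i][r + j] += lam * ci * cj
    return solve(a, list(y))
```

With `lam = 0` the fit interpolates the data, and as `lam` grows the fit approaches a straight line, mirroring the interpolant and smoother contrasted in Figure 4.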
Figure 4. Contour plots of thin plate interpolant (A) and smoother (B) to the ore-bearing layer
dataset (positions of 38 data sites and the "true width" of the ore-bearing layer measured at
each site) (Green and Silverman, 1994).
Wavelet transforms are a device for representing functions in a way that is localized
in both time and frequency domains. It has been demonstrated that wavelet-based
smoothing methods can adapt readily to spatially inhomogeneous curves (Donoho et
al., 1995; Bruce and Gao, 1996). KyPlot provides wavelet regression methods by
thresholding (shrinkage). In Figure 5 wavelet thresholding is applied to the NMR
spectrum dataset (Bruce and Gao, 1996). The threshold level can be determined by
the methods of Donoho et al. (1995), Nason (1996) or Hurvich and Tsai (1998).
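A minimal sketch of wavelet hard thresholding follows. It uses the Haar wavelet rather than the symmlet used for Figure 5, purely to keep the transform short, and it assumes a signal whose length is a power of two. The helper `universal_threshold` implements one widely used rule, sigma * sqrt(2 log n); whether this matches the exact rule KyPlot applies is an assumption. All function names are hypothetical.

```python
import math

def haar_forward(signal):
    """Orthonormal Haar transform; len(signal) must be a power of two.

    Returns (approx, details), with detail levels ordered finest first.
    """
    approx = list(signal)
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        details.append([(approx[i] - approx[i + 1]) / root2 for i in pairs])
        approx = [(approx[i] + approx[i + 1]) / root2 for i in pairs]
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest level upward."""
    approx = list(approx)
    root2 = math.sqrt(2.0)
    for level in reversed(details):
        rebuilt = []
        for s, d in zip(approx, level):
            rebuilt.append((s + d) / root2)
            rebuilt.append((s - d) / root2)
        approx = rebuilt
    return approx

def hard_threshold(details, t):
    """Keep detail coefficients whose magnitude exceeds t; zero the rest."""
    return [[c if abs(c) > t else 0.0 for c in level] for level in details]

def universal_threshold(n, sigma):
    """Universal threshold sigma * sqrt(2 log n) (assumed rule, see lead-in)."""
    return sigma * math.sqrt(2.0 * math.log(n))

def denoise(signal, t):
    """Transform, hard-threshold the detail coefficients, and reconstruct."""
    approx, details = haar_forward(signal)
    return haar_inverse(approx, hard_threshold(details, t))
```

Hard thresholding keeps large coefficients unchanged, which is why sharp features such as spectral peaks survive while small-amplitude noise is removed.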
Local or penalized least squares methods are generalized to likelihood methods when
response variables have a non-Gaussian distribution (Green and Silverman, 1994;
Loader, 1999). KyPlot supports local and penalized likelihood regressions with a
variety of families (binomial, Poisson, gamma, negative binomial, Pearson and
Huber) and link functions. In Figure 6, bivariate nonparametric logistic regression is
applied to a binary dataset. A contour plot of local logistic regression estimate is
drawn for the MBA grade dataset (Simonoff, 1996). The data are the score on the
Graduate Management Admission Test (GMAT) and first-year grade point average
for 61 second-year MBA students at New York University's Stern School of
Business in 1995, with data values for men marked by x and those for women
marked by O. The bold contour is the 0.5 level, which can be used as the
classification boundary for men and women.
Figure 5. Wavelet regression by thresholding. Left panel: original NMR spectrum signal; right
panel: result of wavelet hard thresholding. A symmlet of order 6 was used and the thresholding
level was determined by the cross-validation criterion of Nason (1996).
[Figure 6 graphic. Legend: x Male, • Female. Horizontal axis: GMAT Score]
Figure 6. Contour plot of local logistic regression estimate for the MBA grade dataset. Local
quadratic logistic regression was applied to the data and the result was plotted as a contour plot.
The bold contour is the 0.5 level. A Gaussian kernel was used and the bandwidth was determined
by AICc.
When the errors have a long tailed distribution, least squares estimates can be
excessively sensitive to extreme observations or outliers. Robust regression methods
attempt to remedy this by identifying and down-weighting influential observations. A
popular robust nonparametric procedure is LOWESS, which reduces the influence of
[Figure 7 graphic. Panel A legend: λ = 0.05, λ = 1, λ = 20. Panel B legend: Cauchy, Huber. Vertical axis: y]
Figure 7. A. Cubic smoothing spline fits with three different values of the smoothing
parameter (λ) to an artificial dataset. B. Penalized likelihood fits with Cauchy and Huber
family distributions to the same dataset. The values of λ were determined by AICc.
KyPlot also offers nonparametric density estimation methods. In the left panel of
Figure 8, the density of the Old Faithful geyser dataset, which contains durations of
107 eruptions, was estimated by the local likelihood method of Loader (1996) and the
penalized likelihood method of Silverman (1982). The smoothing parameters of both
estimates were determined by AIC (O'Sullivan, 1988; Loader, 1999). When data are
given as binned data with a sufficiently small bin width, the density can be estimated
by applying nonparametric Poisson regression (Simonoff, 1996; Loader, 1999). A
histogram with the bin width of 0.03 was constructed from the geyser data and the
local likelihood and penalized likelihood Poisson regressions were applied with the
smoothing parameters determined by AICc. As shown in the right panel of Figure 8,
the fit using local or penalized Poisson regression is comparable to that of local or
penalized likelihood density estimation method. Density of bivariate data can be
similarly estimated (Figure 9).
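The binned-data route to density estimation can be sketched as follows. This is a simplified illustration, not KyPlot's method: it uses a local constant Poisson likelihood instead of the local quadratic fit used for Figure 8. With a log link and kernel weights w_i, maximizing the local log-likelihood sum of w_i (c_i b - exp(b)) gives exp(b) = sum(w_i c_i) / sum(w_i), so the local rate is a kernel-weighted mean of the bin counts. Dividing the rate by n times the bin width converts it to a density. The function names are hypothetical.

```python
import math

def bin_counts(data, lo, width, nbins):
    """Count observations in nbins equal-width bins starting at lo."""
    counts = [0] * nbins
    for v in data:
        k = math.floor((v - lo) / width)
        if 0 <= k < nbins:
            counts[k] += 1
    return counts

def local_constant_poisson(centers, counts, x0, h):
    """Local constant Poisson likelihood estimate of the mean count at x0.

    With Gaussian kernel weights, the maximizer is the weighted mean count.
    """
    num = den = 0.0
    for c, k in zip(centers, counts):
        w = math.exp(-0.5 * ((c - x0) / h) ** 2)
        num += w * k
        den += w
    return num / den

def density_from_bins(data, lo, width, nbins, h):
    """Return (bin centers, density estimates) from smoothed bin counts."""
    counts = bin_counts(data, lo, width, nbins)
    centers = [lo + (k + 0.5) * width for k in range(nbins)]
    n = len(data)
    dens = [local_constant_poisson(centers, counts, c, h) / (n * width)
            for c in centers]
    return centers, dens
```

For example, `density_from_bins(durations, 0.0, 0.03, 200, 0.2)` would use the bin width of 0.03 mentioned above; the remaining arguments (range start, number of bins, bandwidth) are illustrative values, not settings taken from the paper.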
Figure 8. Density estimation for the Old Faithful geyser dataset. Left panel: local likelihood
density estimate (solid line; local quadratic with Gaussian kernel) and penalized likelihood
estimate (dashed line; with 3rd order derivative penalty). The grey jagged line is a kernel
estimate using a uniform kernel. Right panel: density estimates for binned data using local
likelihood (solid line; local quadratic with Gaussian kernel) or penalized likelihood (dashed
line; with 3rd order derivative penalty) Poisson regression. The smoothing parameters of
density estimates were determined by AIC and those of Poisson regression fits were
determined by AICc.
[Figure 9 graphic. Axis label: Duration (min)]
Figure 9. Local likelihood (local quadratic with Gaussian kernel) density estimate for the
bivariate Old Faithful geyser dataset. The bandwidth was determined by AIC.
The hazard rate function for survival time data can be estimated nonparametrically
with a slight modification of the density estimation methods (O'Sullivan, 1988;
Loader, 1999). The upper panel in Figure 10 shows the local likelihood and
penalized likelihood estimates for the hazard rate of the Stanford heart transplant data.
These estimates suggest an initial high hazard followed by a sharp drop after the
operation. In the lower panel, the estimates of survival function are plotted, showing
that both the likelihood estimates are very close to the Kaplan-Meier estimate.
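For reference, the Kaplan-Meier estimate used as the benchmark here is the product-limit estimator: at each distinct death time the survival probability is multiplied by one minus the number of deaths divided by the number still at risk. The sketch below is a plain implementation under that definition; the function name `kaplan_meier` and the 0/1 coding of `events` are illustrative choices.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : observed times (death or censoring)
    events : 1 if the death was observed, 0 if the observation was censored
    Returns a list of (t, S(t)) pairs at the distinct death times.
    """
    death_times = sorted(set(t for t, e in zip(times, events) if e))
    surv = 1.0
    curve = []
    for t in death_times:
        # Subjects with an observed time >= t are still at risk at t.
        at_risk = sum(1 for s in times if s >= t)
        deaths = sum(1 for s, e in zip(times, events) if s == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve
```

Censored observations never lower the curve directly; they only reduce the number at risk at later death times.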
[Figure 10 graphic. Panel A legend: local likelihood estimate, penalized likelihood estimate, • Dead, o Censored. Panel B legend: Kaplan-Meier estimate, lower and upper 95%. Horizontal axis: Time (days)]
Figure 10. A. Local likelihood and penalized likelihood hazard rate estimates of heart
transplant data. B. Kaplan-Meier estimate and its 95% confidence bands with the local and
penalized likelihood estimates.
4 Discussion
References
Bruce, A. and Gao, H.-Y. (1996). Applied Wavelet Analysis with S-Plus. New York:
Springer-Verlag.
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatter
plots. Journal of the American Statistical Association, 74, 829-836.
Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. and Picard, D. (1995). Wavelet
shrinkage: asymptopia? (with discussion). Journal of the Royal Statistical Society,
Series B, 57, 362-366.
Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized
Linear Models. London: Chapman & Hall.
Hurvich, C.M., Simonoff, J.S. and Tsai, C.-L. (1998). Smoothing parameter selection
in nonparametric regression using an improved Akaike information criterion.
Journal of the Royal Statistical Society, Series B, 60, 271-293.
Hurvich, C.M. and Tsai, C.-L. (1998). A cross-validatory AIC for hard thresholding
in spatially adaptive function estimation. Biometrika, 85, 701-710.
Loader, C.R. (1996). Local likelihood density estimation. Annals of Statistics, 24,
1602-1618.
Loader, C.R. (1999). Local Regression and Likelihood. New York: Springer.
Nason, G.P. (1996). Wavelet shrinkage using cross-validation. Journal of the Royal
Statistical Society, Series B, 58, 463-479.
O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-
hazard estimators. SIAM Journal on Scientific and Statistical Computation, 9,
363-379.
Silverman, B.W. (1982). On the estimation of a probability density function by the
maximum penalized likelihood method. Annals of Statistics, 10, 795-810.
Simonoff, J.S. (1996). Smoothing Methods in Statistics. New York: Springer.
Yoshioka, K. (2002). KyPlot - a user-oriented tool for statistical data analysis and
visualization. Computational Statistics, to appear.