
KyPlot as a Tool for Graphical Data Analysis

Koichi Yoshioka

Department of Biochemistry and Biophysics, Graduate School of Allied Health Sciences, Tokyo Medical and Dental University, 1-5-45 Yushima, Bunkyo-ku, Tokyo 113-8519, Japan

Abstract. KyPlot is a software package intended to provide an integrated environment for data analysis and visualization. It offers a broad range of statistical procedures on a spreadsheet interface and versatile graphing tools. KyPlot can run a screen show to present the created graphs and schemes. It also supports an interactive system of graph fitting, including nonlinear regression, interpolation and a variety of smoothing methods for nonparametric regression and density estimation.

Keywords. Data visualization, graph fitting, nonparametric regression, density estimation, smoothing methods

1 Introduction

Graphical visualization is an important aspect of data analysis. Graphs allow us to grasp overall patterns of data or find detailed structures of data. Graphs also allow us to view mathematical models fitted to data and check the validity of the models. Iterative exploration is critical for graphical visualization. Graphing needs to be iterative because graphs can help us discover previously unnoticed aspects of the data, and once we notice them, we can often formulate new questions about the data.
Graphing is also important for data communication, when we want to present information to others. Interactive graphing is necessary for producing high-quality graphs for presentation and publication. Because it is difficult to determine all the specifications of a desired graph beforehand, graphs need to be fully customizable after they are created.
There is thus an increasing need for a software system for data analysis and visualization that is easy to use, can create customizable graphs and supports interactive graph fitting. Although the statistical software systems currently available usually include graphing tools, these tools may not be adequate in terms of versatility and interactivity.
KyPlot is a software package intended to provide an integrated environment for data analysis and visualization. It offers a broad range of statistical procedures on a spreadsheet interface and versatile tools for creating graphs to visualize the results. The overall functionality of version 2 of KyPlot was described elsewhere (Yoshioka, 2002). In this paper, the general design of the new version (version 3) of the software is presented in section 2 and its graph fitting functionality, particularly smoothing, is described in section 3.


2 Structure of KyPlot

KyPlot is a stand-alone Windows application with menus, toolbars and tabbed dialog boxes (Figure 1). Within the main window, it contains two types of child windows. One is the "spread" window, containing one or more spreadsheets where one can enter, process and analyze data. The functionality of the spreadsheets is similar to that of other spreadsheet applications, supporting cell formulae and formatting. The syntax of cell formulae is compatible with Microsoft Excel, and data can be moved about and cross-referenced similarly. As most computer users are familiar with spreadsheets, they will easily learn how to use the spreadsheet of KyPlot. Furthermore, KyPlot provides a variety of functions for data processing and statistical procedures, including parametric and nonparametric tests and procedures for regression analysis, multivariate analysis, survival analysis and time series analysis (Yoshioka, 2002).

Figure 1. Screen shot of KyPlot showing a spread window containing the ethanol dataset and a
scatter plot of the data in a page of a figure window.

The other type of window is "figure", containing one or more pages for drawing graphs and schemes. Figure pages are object-oriented. The classes of drawing objects supported are line, rectangle, ellipse, B-spline, text, picture and graph. KyPlot has versatile drawing functionality for creating technical drawings: one can move and resize objects with a "select and drag" mouse operation; one can cut, copy and paste them and apply operations such as grouping and layering. KyPlot supports many types of 2- and 3-dimensional graphs. The versatility in creating many types of
graphs and in customizing graph components is one of the most significant features
of KyPlot.
On-screen presentation using a personal computer has become increasingly popular. In version 3 of KyPlot, multiple pages can be created in a figure window and the user can perform a screen show of these pages in full-screen mode within the application.

3 Graph fitting functionality of KyPlot


A variety of graph fitting methods are available in KyPlot: nonlinear regression for user-given models based on least squares or maximum likelihood methods, polynomial regression, B-spline fits, piecewise polynomial interpolation and surface interpolation. KyPlot also offers smoothing methods for nonparametric regression and density estimation. KyPlot supports three major classes of smoothing methods: smoothing splines or penalized likelihood methods, kernel-based or local likelihood methods, and basis function methods such as wavelets and Fourier series.
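To make the least-squares side of this concrete, the following minimal Python sketch fits a hypothetical user-given model (an exponential decay, chosen only for illustration) to synthetic data with SciPy; it is not KyPlot's implementation, which is driven entirely through dialog boxes.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical user-given model: exponential decay with offset.
def model(x, a, k, c):
    return a * np.exp(-k * x) + c

# Synthetic example data (not from the paper).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = model(x, 2.5, 0.8, 1.0) + rng.normal(scale=0.1, size=x.size)

# Least-squares estimates of the parameters and their covariance matrix.
popt, pcov = curve_fit(model, x, y, p0=[1.0, 1.0, 0.0])
print("estimates:", popt)
print("std errors:", np.sqrt(np.diag(pcov)))
```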

Figure 2. Fit dialog box for local polynomial regression.

Figure 3 shows the results of applying local polynomial least squares regression to
the ethanol dataset (Simonoff, 1996; Loader, 1999). The procedure for graph fitting is as follows. The first step is to create a graph of the data: the user selects the range of a dataset in a spreadsheet and opens the [Create Graph] menu, which pops up a dialog box to specify how the data are formatted and then another dialog box to select the graph type and other graph options. After these operations, a graph is created in a figure page (Figure 1). The user then opens the [Graph] - [Fit] menu, which pops up a dialog box for fitting, and selects "Local Polynomial Regression" from the "Method" combo box (Figure 2).
As shown in Figure 2, the settings for fitting procedures can be made by selecting radio button options, marking check boxes and entering values in text boxes, so that no programming or macros are necessary: the user can perform fitting by clicking with the mouse and a minimal amount of typing.

[Figure 3 plot area; panels A and B; x-axis: Equivalence Ratio]

Figure 3. Univariate (A) and bivariate (B) local quadratic fits to the ethanol dataset. For the bivariate fit, a spherically symmetric Gaussian kernel was used; the data points under the fitting surface are darkened. The bandwidths of the local regressors were determined by AICc.

In the Fit dialog box, the user can specify the parameters of the fit, such as the degree of the local polynomial (2 in this example), the type of kernel function (Gaussian in this example), and the value of the bandwidth. Then the user clicks the "Apply Fit" button to start the calculation, and the fitted curve is drawn on the graph (Figure 3). To determine the value of the bandwidth, selection criteria such as GCV (generalized cross-validation), AIC (Akaike's information criterion) and AICc (corrected AIC) are available. The user can perform a grid search for the criteria: in the "Select Bandwidth" frame of the dialog box, selecting the "Grid Search" option and clicking the "Start Calculation" button gives a range of bandwidths and the values of these criteria in the spreadsheet of the dialog box (Figure 2). When the "Optimize" option is selected, clicking the Start Calculation button activates an iterative routine that optimizes one of the criteria. In Figure 3 the bandwidths of both the univariate and bivariate local quadratic estimates were determined by AICc (Hurvich et al., 1998). These procedures can be repeated until a satisfactory fit is obtained. The
Fit dialog box thus serves as a "toolbox" for interactive graph fitting.
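As a rough illustration of what such an interactive bandwidth search computes, the sketch below implements a local quadratic fit with a Gaussian kernel and a grid search over the AICc of Hurvich et al. (1998) in plain numpy; the synthetic data, bandwidth grid and helper names are invented for illustration, and this is not KyPlot's own code.

```python
import numpy as np

def local_poly_fit(x, y, x0, h, degree=2):
    """Local polynomial fit at a single point x0 with a Gaussian kernel.
    Returns the fitted value and the equivalent-kernel weight vector."""
    X = np.vander(x - x0, degree + 1, increasing=True)  # columns (x-x0)^0 .. (x-x0)^p
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)               # Gaussian kernel weights
    XtW = X.T * w
    rows = np.linalg.solve(XtW @ X, XtW)                 # (X'WX)^{-1} X'W
    return rows[0] @ y, rows[0]                          # fit = e1' beta; weights = first row

def aicc(x, y, h, degree=2):
    """Corrected AIC of Hurvich et al. (1998) for a given bandwidth."""
    n = len(x)
    fits = np.empty(n)
    tr_hat = 0.0
    for i, xi in enumerate(x):
        fits[i], li = local_poly_fit(x, y, xi, h, degree)
        tr_hat += li[i]                                   # diagonal element of the hat matrix
    sigma2 = np.mean((y - fits) ** 2)
    return np.log(sigma2) + 1.0 + 2.0 * (tr_hat + 1.0) / (n - tr_hat - 2.0)

# Grid search over candidate bandwidths, in the spirit of the "Grid Search" option.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
bandwidths = np.linspace(0.05, 0.5, 10)
scores = [aicc(x, y, h) for h in bandwidths]
print("AICc-selected bandwidth:", bandwidths[int(np.argmin(scores))])
```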
KyPlot provides flexible options for fitting: the dataset used for fitting is the one
given in the spreadsheet in the Fit dialog box so that the user may add, exclude or
change values of the data for fitting without changing the data for graphing; weights
can be assigned to individual data points; and by clicking the "New Fit" button,
multiple fits can be applied to the same graph.
KyPlot also supports penalized least squares regression: smoothing splines for univariate regression and thin plate splines for bivariate regression. Figure 4 shows a thin plate spline interpolant (A) and smoother (B) for the ore-bearing layer dataset (Green and Silverman, 1994). The smoothing parameter of the splines can be determined in a similar manner to the bandwidth of local polynomial estimators.
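A minimal sketch of the univariate case, assuming SciPy 1.10 or later for make_smoothing_spline and using synthetic data; the thin plate spline routines for the bivariate case are not reproduced here.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # requires SciPy >= 1.10

# Synthetic example data (not from the paper).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Penalized least-squares cubic smoothing spline.
# With lam=None the smoothing parameter is chosen by generalized cross-validation;
# a fixed lam plays the role of the smoothing parameter discussed in the text.
spl_gcv = make_smoothing_spline(x, y)            # GCV-selected smoothing
spl_fix = make_smoothing_spline(x, y, lam=1.0)   # user-chosen smoothing parameter

grid = np.linspace(0, 10, 200)
fitted_gcv, fitted_fix = spl_gcv(grid), spl_fix(grid)
```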

[Figure 4 plot area; panels: A, thin plate spline interpolant; B, thin plate spline smoother]
Figure 4. Contour plots of the thin plate spline interpolant (A) and smoother (B) for the ore-bearing layer dataset (positions of 38 data sites and the "true width" of the ore-bearing layer measured at each site) (Green and Silverman, 1994).

Wavelet transforms are a device for representing functions in a way that is localized in both the time and frequency domains. It has been demonstrated that wavelet-based smoothing methods can adapt readily to spatially inhomogeneous curves (Donoho et al., 1995; Bruce and Gao, 1996). KyPlot provides wavelet regression methods by thresholding (shrinkage). In Figure 5, wavelet thresholding is applied to the NMR spectrum dataset (Bruce and Gao, 1996). The threshold level can be determined by the methods of Donoho et al. (1995), Nason (1996) or Hurvich and Tsai (1998).
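The sketch below shows the generic wavelet-thresholding recipe with PyWavelets on a synthetic signal. For brevity it uses the universal threshold of Donoho and Johnstone rather than the cross-validation or AICc-based selectors mentioned above, and 'sym6' is assumed to correspond to the Symmlet of order 6 used in Figure 5.

```python
import numpy as np
import pywt  # PyWavelets

# Synthetic noisy signal standing in for the NMR spectrum (not the paper's data).
rng = np.random.default_rng(3)
n = 1024
t = np.linspace(0, 1, n)
signal = np.where((t > 0.3) & (t < 0.35), 5.0, 0.0) + np.sin(8 * np.pi * t)
noisy = signal + rng.normal(scale=0.5, size=n)

# Discrete wavelet transform with a Symmlet-type wavelet ('sym6' in PyWavelets).
coeffs = pywt.wavedec(noisy, 'sym6')

# Hard thresholding of the detail coefficients with the universal threshold;
# the noise level is estimated from the median absolute finest-scale coefficient.
sigma = np.median(np.abs(coeffs[-1])) / 0.6745
thr = sigma * np.sqrt(2 * np.log(n))
coeffs[1:] = [pywt.threshold(c, thr, mode='hard') for c in coeffs[1:]]

denoised = pywt.waverec(coeffs, 'sym6')[:n]
```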
Local or penalized least squares methods generalize to likelihood methods when the response variables have a non-Gaussian distribution (Green and Silverman, 1994; Loader, 1999). KyPlot supports local and penalized likelihood regression with a variety of families (binomial, Poisson, gamma, negative binomial, Pearson and Huber) and link functions. In Figure 6, bivariate nonparametric logistic regression is applied to a binary dataset: a contour plot of the local logistic regression estimate is drawn for the MBA grade dataset (Simonoff, 1996). The data are the score on the Graduate Management Admission Test (GMAT) and the first-year grade point average for 61 second-year MBA students at New York University's Stern School of Business in 1995, with data values for men marked by x and those for women
marked by O. The bold contour is the 0.5 level, which can be used as the classification boundary between men and women.
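As an illustration of the local likelihood idea for binary responses, the following sketch fits a kernel-weighted (local linear) logistic regression at a grid point with scikit-learn; the synthetic data and the bandwidth are hypothetical, and KyPlot's local quadratic logistic fit with AICc bandwidth selection is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data with two predictors (a stand-in for the GMAT/GPA example).
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
p_true = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))
y = rng.binomial(1, p_true)

def local_logistic(X, y, x0, h):
    """Local (kernel-weighted) linear logistic fit at the point x0.
    A Gaussian kernel downweights observations far from x0."""
    d2 = np.sum((X - x0) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / h ** 2)
    clf = LogisticRegression(C=1e6, solver="lbfgs")   # essentially unpenalized
    clf.fit(X - x0, y, sample_weight=w)
    return clf.predict_proba(np.zeros((1, 2)))[0, 1]  # estimated probability at x0

# Probability surface on a small grid; the 0.5 contour is the classification boundary.
grid = [np.array([a, b]) for a in np.linspace(-2, 2, 5) for b in np.linspace(-2, 2, 5)]
probs = [local_logistic(X, y, g, h=0.8) for g in grid]
```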

[Figure 5 plot area; panels: Original NMR Signal; Hard Thresholding]

Figure 5. Wavelet regression by thresholding. Left panel: original NMR spectrum signal; right panel: result of wavelet hard thresholding. A Symmlet of order 6 was used and the threshold level was determined by the cross-validation criterion of Nason (1996).

[Figure 6 plot area; x-axis: GMAT Score; y-axis: Grade point average; legend: x Male, • Female]

Figure 6. Contour plot of local logistic regression estimate for the MBA grade dataset. Local
quadratic logistic regression was applied to the data and the result was plotted as a contour plot.
The bold contour is the 0.5 level. Gaussian kernel was used and the bandwidth was determined
by AICc.

When the errors have a long-tailed distribution, least squares estimates can be excessively sensitive to extreme observations or outliers. Robust regression methods attempt to remedy this by identifying and down-weighting influential observations. A popular robust nonparametric procedure is LOWESS, which reduces the influence of
outliers by an iteratively reweighted least squares scheme (Cleveland, 1979). Another approach is to apply the local likelihood or penalized likelihood method with a long-tailed error distribution (Loader, 1999). For this purpose, KyPlot supports the Pearson family of distributions, which includes the Cauchy and t-distributions, and the Huber function family. The dataset shown in Figure 7 was generated by adding noise with a long-tailed distribution to a sine curve. As shown in Figure 7A, the smoothing spline fits with three different values of the smoothing parameter deviate considerably from the sine curve. By contrast, the penalized likelihood estimates with the Cauchy and Huber families successfully identify the sine curve (Figure 7B).
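For comparison, a LOWESS fit in the sense of Cleveland (1979) can be sketched with statsmodels as follows; the sine-plus-t-noise data are generated here only to mimic the flavor of Figure 7 and are not the paper's dataset.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Sine curve plus long-tailed (Student t) noise, similar in spirit to Figure 7.
rng = np.random.default_rng(5)
x = np.linspace(0, 100, 200)
y = 5 * np.sin(2 * np.pi * x / 50) + 2 * rng.standard_t(df=1.5, size=x.size)

# LOWESS: locally weighted regression made robust by iteratively
# reweighting observations with large residuals (Cleveland, 1979).
fitted = lowess(y, x, frac=0.3, it=3, return_sorted=True)
x_fit, y_fit = fitted[:, 0], fitted[:, 1]
```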

[Figure 7 plot area; panel A legend: λ = 0.05, λ = 1, λ = 20; panel B legend: Cauchy, Huber]

Figure 7. A. Cubic smoothing spline fits with three different values of the smoothing parameter (λ) to an artificial dataset. B. Penalized likelihood fits with Cauchy and Huber family distributions to the same dataset. The values of λ were determined by AICc.

KyPlot also offers nonparametric density estimation methods. In the left panel of Figure 8, the density of the Old Faithful geyser dataset, which contains the durations of 107 eruptions, was estimated by the local likelihood method of Loader (1996) and the penalized likelihood method of Silverman (1982). The smoothing parameters of both estimates were determined by AIC (O'Sullivan, 1988; Loader, 1999). When data are given as binned data with a sufficiently small bin width, the density can be estimated by applying nonparametric Poisson regression (Simonoff, 1996; Loader, 1999). A histogram with a bin width of 0.03 was constructed from the geyser data, and the local likelihood and penalized likelihood Poisson regressions were applied with the smoothing parameters determined by AICc. As shown in the right panel of Figure 8, the fit using local or penalized Poisson regression is comparable to that of the local or penalized likelihood density estimation methods. The density of bivariate data can be similarly estimated (Figure 9).
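The binned Poisson route can be sketched very simply: bin the sample, then smooth the counts with a kernel-weighted Poisson fit. The code below uses a local-constant fit for brevity (the paper's examples use local quadratic and penalized likelihood fits with AICc-selected parameters), and the bimodal sample is synthetic rather than the geyser data.

```python
import numpy as np

# Synthetic bimodal sample standing in for the geyser durations (not the paper's data).
rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(2.0, 0.3, 70), rng.normal(4.3, 0.4, 110)])

# Bin the data with a small bin width; counts in narrow bins are roughly Poisson.
width = 0.05
edges = np.arange(data.min() - 0.5, data.max() + 0.5, width)
counts, _ = np.histogram(data, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])

def local_poisson_density(t0, h):
    """Local-constant Poisson regression on the bin counts at t0.
    The kernel-weighted MLE of a constant Poisson mean is the weighted
    average of the counts; dividing by n * bin width converts it to a density."""
    w = np.exp(-0.5 * ((centers - t0) / h) ** 2)
    lam = np.sum(w * counts) / np.sum(w)
    return lam / (len(data) * width)

grid = np.linspace(data.min(), data.max(), 100)
density = [local_poisson_density(t, h=0.25) for t in grid]
```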

[Figure 8 plot area; panels: Density Estimation, Poisson Regression; legend: local likelihood (solid), penalized likelihood (dashed); x-axis: Eruption Duration; y-axis: Density]

Figure 8. Density estimation for the Old Faithful geyser dataset. Left panel: local likelihood density estimate (solid line; local quadratic with Gaussian kernel) and penalized likelihood estimate (dashed line; with 3rd order derivative penalty). The grey jagged line is a kernel estimate using a uniform kernel. Right panel: density estimates for binned data using local likelihood (solid line; local quadratic with Gaussian kernel) or penalized likelihood (dashed line; with 3rd order derivative penalty) Poisson regression. The smoothing parameters of the density estimates were determined by AIC and those of the Poisson regression fits were determined by AICc.

[Figure 9 plot area; axis: Duration (min)]

Figure 9. Local likelihood (local quadratic with Gaussian kernel) density estimate for the bivariate Old Faithful geyser dataset. The bandwidth was determined by AIC.

The hazard rate function for survival time data can be estimated nonparametrically with a slight modification of the density estimation methods (O'Sullivan, 1988; Loader, 1999). The upper panel in Figure 10 shows the local likelihood and penalized likelihood estimates for the hazard rate of the Stanford heart transplant data. These estimates suggest an initially high hazard followed by a sharp drop after the operation. In the lower panel, the estimates of the survival function are plotted, showing that both likelihood estimates are very close to the Kaplan-Meier estimate.
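For reference, the Kaplan-Meier product-limit estimate against which the likelihood estimates are compared can be computed with a few lines of numpy; the survival times below are invented for illustration and are not the Stanford data.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate.
    time  : observed times (event or censoring)
    event : 1 if the event (death) was observed, 0 if censored."""
    order = np.argsort(time)
    time, event = np.asarray(time)[order], np.asarray(event)[order]
    uniq = np.unique(time[event == 1])           # distinct event times
    surv, s = [], 1.0
    for t in uniq:
        at_risk = np.sum(time >= t)              # subjects still under observation at t
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk              # product-limit update
        surv.append(s)
    return uniq, np.array(surv)

# Hypothetical survival times in days (not the Stanford data).
t = np.array([50, 120, 300, 300, 450, 800, 1200, 2000, 2500, 3000])
e = np.array([1,  1,   0,   1,   1,   0,   1,    0,    1,    0])
times, survival = kaplan_meier(t, e)
```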

[Figure 10 plot area; panel A legend: Local likelihood estimate, Penalized likelihood estimate, • Dead, o Censored; panel B legend: Kaplan-Meier estimate, Lower & Upper 95%; x-axis: Time (days)]

Figure 10. A. Local likelihood and penalized likelihood hazard rate estimates of heart
transplant data. B. Kaplan-Meier estimate and its 95% confidence bands with the local and
penalized likelihood estimates.

4 Discussion

Development of the software started from the needs of my own work. I am a neurophysiologist studying how the brain works, and I have been analyzing experimental data by applying various analytical methods. I needed software tools to perform such analyses and to create graphs and schemes. Because existing computer programs made for these purposes did not satisfy me, I began writing programs myself, gradually added many kinds of routines and integrated them into a software package, KyPlot. There are still many plans for the KyPlot project. I plan to add more statistical procedures and more graphical tools. For smoothing, procedures to be added include methods for multivariate regression such as additive models and adaptive regression splines, semiparametric models and neural networks.

References
Bruce, A. and Gao, H.-Y. (1996). Applied Wavelet Analysis with S-Plus. New York: Springer-Verlag.
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829-836.
Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. and Picard, D. (1995). Wavelet shrinkage: asymptopia? (with discussion). Journal of the Royal Statistical Society, Series B, 57, 362-366.
Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman & Hall.
Hurvich, C.M., Simonoff, J.S. and Tsai, C.-L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society, Series B, 60, 271-293.
Hurvich, C.M. and Tsai, C.-L. (1998). A cross-validatory AIC for hard thresholding in spatially adaptive function estimation. Biometrika, 85, 701-710.
Loader, C.R. (1996). Local likelihood density estimation. Annals of Statistics, 24, 1602-1618.
Loader, C.R. (1999). Local Regression and Likelihood. New York: Springer.
Nason, G.P. (1996). Wavelet shrinkage using cross-validation. Journal of the Royal Statistical Society, Series B, 58, 463-479.
O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators. SIAM Journal on Scientific and Statistical Computing, 9, 363-379.
Silverman, B.W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method. Annals of Statistics, 10, 795-810.
Simonoff, J.S. (1996). Smoothing Methods in Statistics. New York: Springer.
Yoshioka, K. (2002). KyPlot - a user-oriented tool for statistical data analysis and visualization. Computational Statistics, to appear.
