Professional Documents
Culture Documents
Statistics Toolbox: Perform Statistical Modeling and Analysis
Statistics Toolbox: Perform Statistical Modeling and Analysis
Statistics Toolbox™ provides algorithms and tools for organizing, analyzing, and modeling data. You can use
regression or classification for predictive modeling, generate random numbers for Monte Carlo simulations, use
statistical plots for exploratory data analysis, and perform hypothesis tests.
For analyzing multidimensional data, Statistics Toolbox includes algorithms that let you identify key variables that
impact your model with sequential feature selection, transform your data with principal component analysis,
apply regularization and shrinkage, or use partial least squares regression.
Statistics Toolbox includes specialized data types for organizing and accessing heterogeneous data. Dataset arrays
store numeric data, text, and metadata in a single data container. Built-in methods enable you to merge datasets
using a common key (join), calculate summary statistics on grouped data, and convert between tall and wide data
representations. Categorical arrays provide a memory-efficient data container for storing information drawn from
a finite, discrete set of categories.
Key Features
▪ Statistical arrays for storing heterogeneous and categorical data
▪ Regression techniques, including linear, nonlinear, robust, and ridge, and nonlinear mixed-effects models
▪ Classification algorithms, including boosted and bagged decision trees, k-Nearest Neighbor, and linear
discriminant analysis
▪ Hypothesis testing
Statistics Toolbox provides two specialized arrays for storing and managing statistical data: dataset arrays and
categorical arrays.
Dataset Arrays
Dataset arrays enable convenient organization and analysis of heterogeneous statistical data and metadata. Dataset
arrays provide columns to represent measured variables and rows to represent observations. With dataset arrays,
you can:
1
▪ Label rows and columns of data and reference that data using recognizable names.
Dataset array displayed in the MATLAB Variable Editor. This dataset array includes a mixture of cell strings and
numeric information, with selected columns available in the Plot Selector Tool.
Statistics Toolbox provides specialized functions to operate on dataset arrays. With these specialized functions,
you can:
▪ Export data into standard file formats, including Microsoft® Excel® and comma-separated value (CSV).
Categorical Arrays
Categorical arrays enable you to organize and process nominal and ordinal data that uses values from a finite set
of discrete levels or categories. With categorical arrays, you can:
▪ Decrease memory footprint by replacing repetitive text strings with categorical labels.
▪ Store nominal data using descriptive labels, such as red, green, and blue for an unordered set of colors.
▪ Store ordinal data using descriptive labels, such as cold, warm, and hot for an ordered set of temperature
measurements.
▪ Manipulate categorical data using familiar array operations and indexing methods.
2
Exploratory Data Analysis
Statistics Toolbox provides multiple ways to explore data: statistical plotting with interactive graphics, algorithms
for cluster analysis, and descriptive statistics for large datasets.
Statistics Toolbox includes graphs and charts to explore your data visually. The toolbox augments MATLAB® plot
types with probability plots, box plots, histograms, scatter histograms, 3D histograms, control charts, and
quantile-quantile plots. The toolbox also includes specialized plots for multivariate analysis, including
dendograms, biplots, parallel coordinate charts, and Andrews plots.
3
Scatter histogram using a combination of scatter plots and histograms to describe the relationship between variables.
Plot comparing the empirical CDF for a sample from an extreme value distribution with a plot of the CDF for the
sampling distribution.
Cluster Analysis
Statistics Toolbox offers multiple algorithms to analyze data using hierarchical clustering, k-means clustering, and
Gaussian mixtures.
4
Two-component Gaussian mixture model fit to a mixture of bivariate Gaussians.
5
Dendrogram plot showing a model with four clusters.
Descriptive Statistics
Descriptive statistics enable you to understand and describe potentially large sets of data quickly. Statistics
Toolbox includes functions for calculating:
▪ Measures of central tendency (measures of location), including average, median, and various means
▪ Measures of dispersion (measures of spread), including range, variance, standard deviation, and mean or
median absolute deviation
These functions help you summarize values in a data sample using a few highly relevant numbers.
In some cases, estimating summary statistics using parametric methods is not possible. To deal with these cases,
Statistics Toolbox provides resampling techniques, including:
▪ Jackknife function for estimating sample statistics using subsets of the data
6
Regression, Classification, and ANOVA
Regression
With regression, you can model a continuous response variable as a function of one or more predictors. Statistics
Toolbox offers a wide variety of regression algorithms, including:
▪ Linear regression
▪ Nonlinear regression
▪ Robust regression
▪ R2 and adjusted R2
With the toolbox, you can calculate confidence intervals for both regression coefficients and predicted values.
Statistics Toolbox supports more advanced techniques to improve predictive accuracy when the dataset includes
large numbers of correlated variables. The toolbox supports:
▪ Subset selection techniques, including sequential features selection and stepwise regression
Statistics Toolbox also supports nonparametric regression techniques for generating an accurate fit without
specifying a model that describes the relationship between the predictor and the response. Nonparametric
regression techniques include decision trees as well as boosted and bagged regression trees.
Additionally, Statistics Toolbox supports nonlinear mixed-effect (NLME) models in which some parameters of a
nonlinear function vary across individuals or groups.
7
Nonlinear mixed-effects model of drug absorption and elimination showing intrasubject concentration-versus-time
profiles. The nlmefit function in Statistics Toolbox generates a population model using fixed and random effects.
Classification
Classification algorithms enable you to model a categorical response variable as a function of one or more
predictors. Statistics Toolbox offers a wide variety of parametric and nonparametric classification algorithms, such
as:
▪ Boosted and bagged classification trees, including AdaBoost, LogitBoost, GentleBoost, and RobustBoost
You can evaluate goodness of fit for the resulting classification models using techniques such as:
▪ Cross-validated loss
▪ Confusion matrices
ANOVA
Analysis of variance (ANOVA) enables you to assign sample variance to different sources and determine whether
the variation arises within or among different population groups. Statistics Toolbox includes these ANOVA
algorithms and related techniques:
8
▪ One-way ANOVA
Multivariate Statistics
Multivariate statistics provide algorithms and functions to analyze multiple variables. Typical applications include:
▪ Transforming correlated data into a set of uncorrelated components using rotation and centering (principal
component analysis)
▪ Exploring relationships between variables using visualization techniques, such as scatter plot matrices and
classical multidimensional scaling
Feature Transformation
Feature transformation techniques enable dimensionality reduction when transformed features can be more easily
ordered than original features. Statistics Toolbox offers three classes of feature transformation algorithms:
▪ Nonnegative matrix factorization when model terms must represent nonnegative quantities
Multivariate Visualization
Statistics Toolbox provides graphs and charts to explore multivariate data visually, including:
▪ Dendograms
▪ Biplots
▪ Andrews plots
▪ Glyph plots
9
Group scatter plot matrix showing how model year impacts different variables.
Biplot showing the first three loadings from a principal component analysis.
10
Andrews plot showing how country of original impacts the variables.
Cluster Analysis
▪ K-means clustering, which assigns data points to the cluster with the closest mean.
▪ Gaussian mixtures, which are formed by combining multivariate normal density components. Clusters are
assigned by selecting the component that maximizes posterior probability.
11
Output from applying a clustering algorithm to the same example.
Probability Distributions
Statistics Toolbox provides functions and an app to work with parametric and nonparametric probability
distributions. With these tools, you can:
▪ Compute key functions such as probability density functions and cumulative distribution functions.
The Distribution Fitting Tool in the toolbox enables you to fit data using predefined univariate probability
distributions, a nonparametric (kernel-smoothing) estimator, or a custom distribution that you define. This tool
12
supports both complete data and censored (reliability) data. You can exclude data, save and load sessions, and
generate MATLAB code.
Visual plot of distribution data (left) and summary statistics (right). Using the Distribution Fitting Tool, you can estimate
a normal distribution with mean and variance values (16.9 and 8.7, respectively, in this example).
You can estimate distribution parameters at the command line or construct probability distributions that
correspond to the governing parameters.
Additionally, you can create multivariate probability distributions, including Gaussian mixtures and multivariate
normal, multivariate t, and Wishart distributions. You can use copulas to create multivariate distributions by
joining arbitrary marginal distributions using correlation structures.
With the toolbox, you can specify custom distributions and fit these distributions using maximum likelihood
estimation.
Statistics Toolbox provides statistical plots to evaluate how well a dataset matches a specific distribution. The
toolbox includes probability plots for a variety of standard distributions, including normal, exponential, extreme
value, lognormal, Rayleigh, and Weibull. You can generate probability plots from complete datasets and censored
13
datasets. Additionally, you can use quantile-quantile plots to evaluate how well a given distribution matches a
standard normal distribution.
Statistics Toolbox also provides hypothesis tests to determine whether a dataset is consistent with different
probability distributions. Specific tests include:
▪ Lilliefors tests
▪ Ansari-Bradley tests
▪ Jarque-Bera tests
Statistics Toolbox provides functions for generating pseudo-random and quasi-random number streams from
probability distributions. You can generate random numbers from either a fitted or constructed probability
distribution by applying the random method.
14
MATLAB code for constructing a Poisson distribution with a specific mean and generating a vector of random numbers
that match the distribution.
▪ Generating random samples from multivariate distributions, such as t, normal, copulas, and Wishart
You can also generate quasi-random number streams. Quasi-random number streams produce highly uniform
samples from the unit hypercube. Quasi-random number streams can often accelerate Monte Carlo simulations
because fewer samples are required to achieve complete coverage.
Hypothesis Testing
Random variation can make it difficult to determine whether samples taken under different conditions are
different. Hypothesis testing is an effective tool for analyzing whether sample-to-sample differences are significant
and require further evaluation or are consistent with random and expected data variation.
Statistics Toolbox supports widely used parametric and nonparametric hypothesis testing procedures, including:
▪ Nonparametric tests for one sample, paired samples, and two independent samples
15
▪ Comparison of distributions (two-sample Kolmogorov-Smirnov)
Design of Experiments
Functions for design of experiments (DOE) enable you to create and test practical plans to gather data for
statistical modeling. These plans show how to manipulate data inputs in tandem to generate information about
their effect on data outputs. Supported design types include:
▪ Full factorial
▪ Fractional factorial
▪ D-optimal
▪ Latin hypercube
You can use Statistics Toolbox to define, analyze, and visualize a customized DOE. For example, you can estimate
input effects and input interactions using ANOVA, linear regression, and response surface modeling, then
visualize results through main effect plots, interaction plots, and multi-vari charts.
Fitting a decision tree to data. The fitting capabilities in Statistics Toolbox enable you to visualize a decision tree by
drawing a diagram of the decision rule and group assignments.
16
Model of a chemical reaction for an experiment using the design-of-experiments (DOE) and surface-fitting capabilities
of Statistics Toolbox.
Statistics Toolbox provides a set of functions that support Statistical Process Control (SPC). These functions
enable you to monitor and improve products or processes by evaluating process variability. With SPC functions,
you can:
▪ Apply Western Electric and Nelson control rules to control chart data.
Control charts showing process data and violations of Western Electric control rules. Statistics Toolbox provides a
variety of control charts and control rules for monitoring and evaluating products or processes.
17
Resources
Product Details, Examples, and System Requirements Online User Community
www.mathworks.com/products/statistics www.mathworks.com/matlabcentral
Trial Software Training Services
www.mathworks.com/trialrequest www.mathworks.com/training
Sales Third-Party Products and Services
www.mathworks.com/contactsales www.mathworks.com/connections
Technical Support Worldwide Contacts
www.mathworks.com/support www.mathworks.com/contact
© 2013 The MathWorks, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks
for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders. 18