
Statistics Toolbox

Perform statistical modeling and analysis

Statistics Toolbox™ provides algorithms and tools for organizing, analyzing, and modeling data. You can use
regression or classification for predictive modeling, generate random numbers for Monte Carlo simulations, use
statistical plots for exploratory data analysis, and perform hypothesis tests.

For analyzing multidimensional data, Statistics Toolbox includes algorithms that let you identify key variables that
impact your model with sequential feature selection, transform your data with principal component analysis,
apply regularization and shrinkage, or use partial least squares regression.

Statistics Toolbox includes specialized data types for organizing and accessing heterogeneous data. Dataset arrays
store numeric data, text, and metadata in a single data container. Built-in methods enable you to merge datasets
using a common key (join), calculate summary statistics on grouped data, and convert between tall and wide data
representations. Categorical arrays provide a memory-efficient data container for storing information drawn from
a finite, discrete set of categories.

Statistics Toolbox is included in MATLAB and Simulink Student Version.

Key Features
▪ Statistical arrays for storing heterogeneous and categorical data

▪ Regression techniques, including linear, nonlinear, robust, and ridge, and nonlinear mixed-effects models

▪ Classification algorithms, including boosted and bagged decision trees, k-Nearest Neighbor, and linear
discriminant analysis

▪ Analysis of variance (ANOVA)

▪ Probability distributions, including copulas and Gaussian mixtures

▪ Random number generation

▪ Hypothesis testing

▪ Design of experiments and statistical process control

Data Organization and Management

Statistics Toolbox provides two specialized arrays for storing and managing statistical data: dataset arrays and
categorical arrays.

An Introduction to Dataset Arrays 5:31


Improve productivity with statistical arrays.

Dataset Arrays

Dataset arrays enable convenient organization and analysis of heterogeneous statistical data and metadata. Dataset
arrays provide columns to represent measured variables and rows to represent observations. With dataset arrays,
you can:

▪ Store different types of data in a single container.

▪ Label rows and columns of data and reference that data using recognizable names.

▪ Display and edit data in an intuitive tabular format.

▪ Use metadata to define units, describe data, and store information.

Dataset array displayed in the MATLAB Variable Editor. This dataset array includes a mixture of cell strings and
numeric information, with selected columns available in the Plot Selector Tool.

Statistics Toolbox provides specialized functions to operate on dataset arrays (a brief sketch follows the list below). With these specialized functions, you can:

▪ Merge datasets by combining fields using common keys.

▪ Export data into standard file formats, including Microsoft® Excel® and comma-separated value (CSV).

▪ Calculate summary statistics on grouped data.

▪ Convert data between tall and wide representations.
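
As a minimal sketch of these operations (the variable names and sample values are illustrative, not taken from the toolbox documentation), you might merge two dataset arrays on a common key, summarize by group, and reshape to a wide layout:

    % Two small dataset arrays sharing the key variable ID
    ds1 = dataset({(1:4)', 'ID'}, {[38; 43; 38; 40], 'Age'});
    ds2 = dataset({(1:4)', 'ID'}, {{'NY'; 'LA'; 'NY'; 'SF'}, 'City'});

    % Merge on the common key, then summarize Age by City
    ds = join(ds1, ds2, 'Keys', 'ID');
    ageByCity = grpstats(ds, 'City', {'mean', 'std'}, 'DataVars', 'Age');

    % Convert to a wide representation: one Age column per city
    wide = unstack(ds, 'Age', 'City');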

An Introduction to Joins 5:01


Combine fields from two dataset arrays using a variable that is present in both.

Categorical Arrays

Categorical arrays enable you to organize and process nominal and ordinal data that uses values from a finite set of discrete levels or categories (a brief sketch follows the list below). With categorical arrays, you can:

▪ Decrease memory footprint by replacing repetitive text strings with categorical labels.

▪ Store nominal data using descriptive labels, such as red, green, and blue for an unordered set of colors.

▪ Store ordinal data using descriptive labels, such as cold, warm, and hot for an ordered set of temperature
measurements.

▪ Manipulate categorical data using familiar array operations and indexing methods.

▪ Create logical indexes based on categorical data.

▪ Group observations by category.
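
A minimal sketch of these operations using the era-appropriate nominal and ordinal array types (the category names and values are illustrative):

    % Nominal (unordered) categories replace repetitive text strings
    colors = nominal({'red'; 'green'; 'blue'; 'red'; 'green'});
    isGreen = (colors == 'green');          % logical index by category
    counts  = levelcounts(colors);          % observations per category

    % Ordinal (ordered) categories support comparisons
    temps = ordinal([1 3 2 2 1], {'cold', 'warm', 'hot'});
    hotOrWarmer = (temps >= 'warm');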

Exploratory Data Analysis

Statistics Toolbox provides multiple ways to explore data: statistical plotting with interactive graphics, algorithms
for cluster analysis, and descriptive statistics for large datasets.

Statistical Plotting and Interactive Graphics

Statistics Toolbox includes graphs and charts to explore your data visually. The toolbox augments MATLAB® plot
types with probability plots, box plots, histograms, scatter histograms, 3D histograms, control charts, and
quantile-quantile plots. The toolbox also includes specialized plots for multivariate analysis, including
dendrograms, biplots, parallel coordinate charts, and Andrews plots.

Group scatter plot matrix showing interactions between five variables.

Compact box plot with whiskers providing a five-number summary of a dataset.

Scatter histogram using a combination of scatter plots and histograms to describe the relationship between variables.

Plot comparing the empirical CDF for a sample from an extreme value distribution with a plot of the CDF for the
sampling distribution.

Cluster Analysis

Statistics Toolbox offers multiple algorithms to analyze data using hierarchical clustering, k-means clustering, and
Gaussian mixtures.
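
A rough sketch on a synthetic matrix X with observations in rows (the function names reflect the R2013-era API, for example gmdistribution.fit rather than the later fitgmdist):

    rng default
    X = [randn(100, 2); randn(100, 2) + 3];       % two synthetic groups

    % k-means clustering
    idx = kmeans(X, 2);

    % Hierarchical clustering
    Z = linkage(pdist(X), 'average');
    hidx = cluster(Z, 'maxclust', 2);
    dendrogram(Z);

    % Gaussian mixture model
    gm = gmdistribution.fit(X, 2);
    gidx = cluster(gm, X);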

Two-component Gaussian mixture model fit to a mixture of bivariate Gaussians.

Output from applying a clustering algorithm to the same example.

Dendrogram plot showing a model with four clusters.

Cluster Analysis (Example)


Use k-means and hierarchical clustering to discover natural groupings in data.

Descriptive Statistics

Descriptive statistics enable you to understand and describe potentially large sets of data quickly. Statistics
Toolbox includes functions for calculating:

▪ Measures of central tendency (measures of location), including average, median, and various means

▪ Measures of dispersion (measures of spread), including range, variance, standard deviation, and mean or
median absolute deviation

▪ Linear and rank correlation (partial and full)

▪ Results based on data with missing values

▪ Percentile and quartile estimates

▪ Density estimates using a kernel-smoothing function

These functions help you summarize values in a data sample using a few highly relevant numbers.

In some cases, estimating summary statistics using parametric methods is not possible. To deal with these cases,
Statistics Toolbox provides resampling techniques (sketched briefly after the list), including:

▪ Generalized bootstrap function for estimating sample statistics using resampling

▪ Jackknife function for estimating sample statistics using subsets of the data

▪ bootci function for estimating confidence intervals
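
A brief sketch of the three resampling functions on an illustrative sample x:

    x = exprnd(3, 100, 1);                    % illustrative sample

    bootMeans = bootstrp(1000, @mean, x);     % bootstrap distribution of the mean
    jackMeans = jackknife(@mean, x);          % leave-one-out (jackknife) estimates
    ci = bootci(1000, @mean, x);              % bootstrap confidence interval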

Regression, Classification, and ANOVA

Regression

With regression, you can model a continuous response variable as a function of one or more predictors. Statistics
Toolbox offers a wide variety of regression algorithms, including:

▪ Linear regression

▪ Nonlinear regression

▪ Robust regression

▪ Logistic regression and other generalized linear models

Fitting with MATLAB: Statistics, Optimization, and Curve Fitting (Webinar)


Apply regression algorithms with MATLAB.

You can evaluate goodness of fit using a variety of metrics, including:

▪ R² and adjusted R²

▪ Cross-validated mean squared error

▪ Akaike information criterion (AIC) and Bayesian information criterion (BIC)

With the toolbox, you can calculate confidence intervals for both regression coefficients and predicted values.
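
For example, one way to fit a linear regression and obtain those intervals at the command line (the data are illustrative; LinearModel.fit reflects the R2012a+ interface):

    X = rand(50, 2);
    y = 3 + 2*X(:,1) - X(:,2) + 0.1*randn(50, 1);

    lm = LinearModel.fit(X, y);               % linear regression model
    ci = coefCI(lm);                          % confidence intervals for coefficients
    [ypred, yci] = predict(lm, [0.5 0.5]);    % prediction with confidence bounds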

Statistics Toolbox supports more advanced techniques to improve predictive accuracy when the dataset includes
large numbers of correlated variables (a brief sketch follows the list). The toolbox supports:

▪ Subset selection techniques, including sequential feature selection and stepwise regression

▪ Regularization methods, including ridge regression, lasso, and elastic net
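
A minimal sketch of a cross-validated lasso fit and a ridge fit (the synthetic predictors and response are illustrative):

    X = randn(100, 20);
    y = X(:, 1:3) * [2; -1; 0.5] + 0.1*randn(100, 1);

    [B, FitInfo] = lasso(X, y, 'CV', 10);        % lasso with 10-fold cross-validation
    lassoPlot(B, FitInfo, 'PlotType', 'CV');     % cross-validated MSE versus lambda
    bRidge = ridge(y, X, 0.1, 0);                % ridge coefficients on the original scale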

Computational Statistics: Feature Selection, Regularization, and Shrinkage with MATLAB (Webinar)

Learn how to generate accurate fits in the presence of correlated data.

Statistics Toolbox also supports nonparametric regression techniques for generating an accurate fit without
specifying a model that describes the relationship between the predictor and the response. Nonparametric
regression techniques include decision trees as well as boosted and bagged regression trees.
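
A rough sketch using the R2013-era fitting interfaces (RegressionTree.fit and fitensemble; later releases renamed these fitrtree and fitrensemble):

    X = rand(200, 3);
    y = sin(4*X(:,1)) + X(:,2).^2 + 0.1*randn(200, 1);

    tree = RegressionTree.fit(X, y);                                      % single regression tree
    bag  = fitensemble(X, y, 'Bag', 100, 'Tree', 'type', 'regression');   % bagged regression trees
    yhat = predict(bag, X(1:5, :));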

Nonparametric Fitting 4:07


Develop a predictive model without specifying a function that describes the
relationship between variables.

Additionally, Statistics Toolbox supports nonlinear mixed-effects (NLME) models in which some parameters of a
nonlinear function vary across individuals or groups.
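
A minimal nlmefit sketch for an exponential decay model; the synthetic data, model function, and starting values below are illustrative assumptions:

    subject = kron((1:5)', ones(10, 1));               % grouping variable (5 subjects)
    t = repmat((1:10)', 5, 1);                         % time points
    conc = 2*exp(-0.3*t) + 0.05*randn(50, 1);          % synthetic concentrations

    % Fixed effects phi(1), phi(2); exp(phi(2)) keeps the rate positive
    model = @(phi, t) phi(1) .* exp(-exp(phi(2)) .* t);
    beta0 = [1 log(0.3)];                              % assumed starting values
    [beta, PSI, stats] = nlmefit(t, conc, subject, [], model, beta0);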

Nonlinear mixed-effects model of drug absorption and elimination showing intrasubject concentration-versus-time
profiles. The nlmefit function in Statistics Toolbox generates a population model using fixed and random effects.

Classification

Classification algorithms enable you to model a categorical response variable as a function of one or more
predictors. Statistics Toolbox offers a wide variety of parametric and nonparametric classification algorithms, such
as:

▪ Boosted and bagged classification trees, including AdaBoost, LogitBoost, GentleBoost, and RobustBoost

▪ Naïve Bayes classification

▪ k-Nearest Neighbor (kNN) classification

▪ Linear discriminant analysis

An Introduction to Classification 9:00


Develop predictive models for classifying data.

You can evaluate goodness of fit for the resulting classification models using techniques such as the following (a brief example appears after the list):

▪ Cross-validated loss

▪ Confusion matrices

▪ Performance curves/receiver operating characteristic (ROC) curves
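
A brief sketch that fits a classification tree to Fisher's iris data and evaluates it with these techniques (ClassificationTree.fit reflects the R2013-era interface):

    load fisheriris
    tree = ClassificationTree.fit(meas, species);

    cvtree = crossval(tree);                          % 10-fold cross-validation
    cvLoss = kfoldLoss(cvtree);                       % cross-validated loss

    [pred, score] = predict(tree, meas);
    cm = confusionmat(species, pred);                 % confusion matrix

    % ROC curve for one class using the posterior scores
    [fp, tp] = perfcurve(species, score(:, 2), 'versicolor');
    plot(fp, tp)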

ANOVA

Analysis of variance (ANOVA) enables you to assign sample variance to different sources and determine whether
the variation arises within or among different population groups. Statistics Toolbox includes these ANOVA
algorithms and related techniques (a one-way example follows the list):

▪ One-way ANOVA

▪ Two-way ANOVA for balanced data

▪ Multiway ANOVA for balanced and unbalanced data

▪ Multivariate ANOVA (MANOVA)

▪ Nonparametric one-way and two-way ANOVA (Kruskal-Wallis and Friedman)

▪ Analysis of covariance (ANOCOVA)

▪ Multiple comparison of group means, slopes, and intercepts
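
A minimal one-way ANOVA sketch with a follow-up multiple comparison (the group data are synthetic):

    y = [randn(10,1); randn(10,1) + 1; randn(10,1) + 2];
    group = [repmat({'A'}, 10, 1); repmat({'B'}, 10, 1); repmat({'C'}, 10, 1)];

    [p, tbl, stats] = anova1(y, group);     % one-way ANOVA table and box plot
    c = multcompare(stats);                 % pairwise comparison of group means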

Multivariate Statistics

Multivariate statistics provide algorithms and functions to analyze multiple variables. Typical applications include:

▪ Transforming correlated data into a set of uncorrelated components using rotation and centering (principal
component analysis)

▪ Exploring relationships between variables using visualization techniques, such as scatter plot matrices and
classical multidimensional scaling

▪ Segmenting data with cluster analysis

Fitting an Orthogonal Regression Using Principal Component Analysis (Example)


Implement Deming regression (total least squares).

Feature Transformation

Feature transformation techniques enable dimensionality reduction when transformed features can be more easily
ordered than the original features. Statistics Toolbox offers three classes of feature transformation algorithms (a brief sketch follows the list):

▪ Principal component analysis for summarizing data in fewer dimensions

▪ Nonnegative matrix factorization when model terms must represent nonnegative quantities

▪ Factor analysis for building explanatory models of data correlation
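
A brief sketch of the three approaches on a numeric matrix X (pca reflects the R2012b+ function name; earlier releases used princomp; the data are synthetic):

    X = randn(100, 6) * randn(6, 6);               % correlated synthetic data

    [coeff, score, latent] = pca(X);               % principal components and variances
    [W, H] = nnmf(abs(X), 2);                      % rank-2 nonnegative factorization (inputs must be nonnegative)
    [lambda, psi] = factoran(X, 2);                % two-factor explanatory model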

Partial Least Squares Regression and Principal Component Regression (Example)


Model a response variable in the presence of highly correlated predictors.

Multivariate Visualization

Statistics Toolbox provides graphs and charts to explore multivariate data visually (a brief example follows the list), including:

▪ Scatter plot matrices

▪ Dendrograms

▪ Biplots

▪ Parallel coordinate charts

▪ Andrews plots

▪ Glyph plots
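
For example, a few of these plots on Fisher's iris data (an illustrative choice of dataset):

    load fisheriris
    gplotmatrix(meas, [], species);                % grouped scatter plot matrix
    parallelcoords(meas, 'Group', species);        % parallel coordinate chart
    andrewsplot(meas, 'Group', species);           % Andrews plot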

Group scatter plot matrix showing how model year impacts different variables.

Biplot showing the first three loadings from a principal component analysis.

Andrews plot showing how country of origin impacts the variables.

Cluster Analysis

Statistics Toolbox offers multiple algorithms for cluster analysis, including:

▪ Hierarchical clustering, which builds an agglomerative hierarchy of clusters, typically represented as a tree.

▪ K-means clustering, which assigns data points to the cluster with the closest mean.

▪ Gaussian mixtures, which are formed by combining multivariate normal density components. Clusters are
assigned by selecting the component that maximizes posterior probability.

Two-component Gaussian mixture model fit to a mixture of bivariate Gaussians.

Output from applying a clustering algorithm to the same example.

Dendrogram plot showing a model with four clusters.

Probability Distributions

Statistics Toolbox provides functions and an app to work with parametric and nonparametric probability
distributions. With these tools, you can:

▪ Fit distributions to data.

▪ Use statistical plots to evaluate goodness of fit.

▪ Compute key functions such as probability density functions and cumulative distribution functions.

▪ Generate random and quasi-random number streams from probability distributions.

Fitting Distributions to Data

The Distribution Fitting Tool in the toolbox enables you to fit data using predefined univariate probability
distributions, a nonparametric (kernel-smoothing) estimator, or a custom distribution that you define. This tool
supports both complete data and censored (reliability) data. You can exclude data, save and load sessions, and
generate MATLAB code.

Visual plot of distribution data (left) and summary statistics (right). Using the Distribution Fitting Tool, you can estimate
a normal distribution with mean and variance values (16.9 and 8.7, respectively, in this example).

You can estimate distribution parameters at the command line or construct probability distributions that
correspond to the governing parameters.
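
For example, a minimal command-line fit (the sample data are illustrative):

    x = normrnd(17, 3, 1000, 1);                   % illustrative sample
    pd = fitdist(x, 'Normal');                     % fitted distribution object
    ci = paramci(pd);                              % confidence intervals for mu and sigma
    p = cdf(pd, 20);                               % P(X <= 20) under the fitted model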

Additionally, you can create multivariate probability distributions, including Gaussian mixtures and multivariate
normal, multivariate t, and Wishart distributions. You can use copulas to create multivariate distributions by
joining arbitrary marginal distributions using correlation structures.
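
A minimal copula sketch that joins a gamma marginal and a t marginal through a Gaussian copula (the parameter values are illustrative):

    Rho = [1 0.7; 0.7 1];                          % copula correlation matrix
    u = copularnd('Gaussian', Rho, 1000);          % correlated uniforms on (0,1)

    x1 = gaminv(u(:,1), 2, 1);                     % gamma(2,1) marginal
    x2 = tinv(u(:,2), 5);                          % t(5) marginal
    scatterhist(x1, x2)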

See the complete list of supported distributions.

Simulating Dependent Random Numbers Using Copulas (Example)


Create distributions that model correlated multivariate data.

With the toolbox, you can specify custom distributions and fit these distributions using maximum likelihood
estimation.
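
A minimal sketch of that interface, fitting a hand-written exponential density via mle (the density and data are illustrative):

    x = exprnd(2, 500, 1);                                       % illustrative data
    custpdf = @(x, lambda) lambda .* exp(-lambda .* x);          % custom density
    lambdaHat = mle(x, 'pdf', custpdf, 'start', 1, 'lowerbound', 0);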

Fitting Custom Univariate Distributions (Example)


Perform maximum likelihood estimation on truncated, weighted, or bimodal data.

Evaluating Goodness of Fit

Statistics Toolbox provides statistical plots to evaluate how well a dataset matches a specific distribution. The
toolbox includes probability plots for a variety of standard distributions, including normal, exponential, extreme
value, lognormal, Rayleigh, and Weibull. You can generate probability plots from complete datasets and censored
datasets. Additionally, you can use quantile-quantile plots to evaluate how well a given distribution matches a
standard normal distribution.

Statistics Toolbox also provides hypothesis tests to determine whether a dataset is consistent with different
probability distributions (a brief example follows the list). Specific tests include:

▪ Chi-square goodness-of-fit tests

▪ One-sided and two-sided Kolmogorov-Smirnov tests

▪ Lilliefors tests

▪ Ansari-Bradley tests

▪ Jarque-Bera tests
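
For example, on a synthetic sample x:

    x = normrnd(0, 1, 200, 1);

    probplot('normal', x);                 % normal probability plot
    [h, p] = chi2gof(x);                   % chi-square goodness-of-fit test
    [h2, p2] = lillietest(x);              % Lilliefors test for normality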

Analyzing Probability Distributions

Statistics Toolbox provides functions for analyzing probability distributions, including:

▪ Probability density functions

▪ Cumulative distribution functions

▪ Inverse cumulative distribution functions

▪ Negative log-likelihood functions

Generating Random Numbers

Statistics Toolbox provides functions for generating pseudo-random and quasi-random number streams from
probability distributions. You can generate random numbers from either a fitted or constructed probability
distribution by applying the random method.

MATLAB code for constructing a Poisson distribution with a specific mean and generating a vector of random numbers
that match the distribution.
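
A minimal version of that pattern (makedist is available from R2013a; poissrnd is an equivalent direct route):

    pd = makedist('Poisson', 'lambda', 5);     % construct the distribution object
    r = random(pd, 1000, 1);                   % 1000 Poisson(5) random numbers
    % equivalently: r = poissrnd(5, 1000, 1);
    mean(r)                                    % should be close to 5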

Statistics Toolbox also provides functions for:

▪ Generating random samples from multivariate distributions, such as t, normal, copulas, and Wishart

▪ Sampling from finite populations

▪ Performing Latin hypercube sampling

▪ Generating samples from Pearson and Johnson systems of distributions

You can also generate quasi-random number streams, which produce highly uniform samples from the unit
hypercube. Quasi-random streams can often accelerate Monte Carlo simulations because fewer samples are needed
to cover the sample space evenly.
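
For example, a scrambled Halton set over the 3-dimensional unit hypercube:

    p = haltonset(3, 'Skip', 1000);            % 3-D Halton sequence, skipping early points
    p = scramble(p, 'RR2');                    % apply reverse-radix scrambling
    X = net(p, 500);                           % first 500 quasi-random points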

Hypothesis Testing

Random variation can make it difficult to determine whether samples taken under different conditions are
different. Hypothesis testing is an effective tool for analyzing whether sample-to-sample differences are significant
and require further evaluation or are consistent with random and expected data variation.

Statistics Toolbox supports widely used parametric and nonparametric hypothesis testing procedures (a brief example follows the list), including:

▪ One-sample and two-sample t-tests

▪ Nonparametric tests for one sample, paired samples, and two independent samples

▪ Distribution tests (Chi-square, Jarque-Bera, Lilliefors, and Kolmogorov-Smirnov)

▪ Comparison of distributions (two-sample Kolmogorov-Smirnov)

▪ Tests for autocorrelation and randomness

▪ Linear hypothesis tests on regression coefficients
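
A minimal two-sample test sketch on synthetic samples:

    x1 = normrnd(0, 1, 50, 1);
    x2 = normrnd(0.5, 1, 50, 1);

    [h, p, ci] = ttest2(x1, x2);           % h = 1 rejects equal means at the 5% level
    [pRank, hRank] = ranksum(x1, x2);      % nonparametric alternative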

Selecting a Sample Size (Example)


Calculate the sample size necessary for a hypothesis test.

Design of Experiments and Statistical Process Control

Design of Experiments

Functions for design of experiments (DOE) enable you to create and test practical plans to gather data for
statistical modeling. These plans show how to manipulate data inputs in tandem to generate information about
their effect on data outputs. Supported design types include (a few generators are sketched after the list):

▪ Full factorial

▪ Fractional factorial

▪ Response surface (central composite and Box-Behnken)

▪ D-optimal

▪ Latin hypercube
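
For example, a few design generators (the factor counts are arbitrary):

    dFF  = fullfact([2 3 2]);              % full factorial with mixed levels
    dCC  = ccdesign(3);                    % central composite design, 3 factors
    dBB  = bbdesign(3);                    % Box-Behnken design, 3 factors
    dLHS = lhsdesign(20, 4);               % 20-run Latin hypercube, 4 factors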

You can use Statistics Toolbox to define, analyze, and visualize a customized DOE. For example, you can estimate
input effects and input interactions using ANOVA, linear regression, and response surface modeling, then
visualize results through main effect plots, interaction plots, and multi-vari charts.
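
For example, after collecting responses for a generated design, you might visualize the effects like this (the design, coefficients, and responses are illustrative):

    D = fullfact([2 2 3]);                                   % example design matrix
    y = D * [1; -0.5; 0.3] + 0.2*randn(size(D, 1), 1);       % synthetic responses

    maineffectsplot(y, D);                                   % main effect of each factor
    interactionplot(y, D(:, 1:2));                           % two-factor interactions
    multivarichart(y, D);                                    % multi-vari chart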

Fitting a decision tree to data. The fitting capabilities in Statistics Toolbox enable you to visualize a decision tree by
drawing a diagram of the decision rule and group assignments.

Model of a chemical reaction for an experiment using the design-of-experiments (DOE) and surface-fitting capabilities
of Statistics Toolbox.

Statistical Process Control

Statistics Toolbox provides a set of functions that support Statistical Process Control (SPC). These functions
enable you to monitor and improve products or processes by evaluating process variability. With SPC functions,
you can (a brief example follows the list):

▪ Perform gage repeatability and reproducibility studies.

▪ Estimate process capability.

▪ Create control charts.

▪ Apply Western Electric and Nelson control rules to control chart data.
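
For example, on synthetic subgrouped measurements (rows are subgroups; the spec limits are illustrative):

    data = normrnd(10, 0.5, 25, 4);                   % 25 subgroups of 4 measurements

    controlchart(data, 'charttype', {'xbar', 'r'});   % X-bar and R charts
    S = capability(data(:), [9 11]);                  % Cp, Cpk for spec limits [9, 11]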

Control charts showing process data and violations of Western Electric control rules. Statistics Toolbox provides a
variety of control charts and control rules for monitoring and evaluating products or processes.

Resources
Product Details, Examples, and System Requirements: www.mathworks.com/products/statistics
Online User Community: www.mathworks.com/matlabcentral
Trial Software: www.mathworks.com/trialrequest
Training Services: www.mathworks.com/training
Sales: www.mathworks.com/contactsales
Third-Party Products and Services: www.mathworks.com/connections
Technical Support: www.mathworks.com/support
Worldwide Contacts: www.mathworks.com/contact

© 2013 The MathWorks, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks
for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.
