
STAT 582

MATLAB Primer for Statistical Analysis


Kan Yao
4/28/2010
MATLAB Primer Kan Yao

INTRODUCTION
MATLAB stands for "Matrix Laboratory". It is an interactive, matrix-based system and fourth-generation
programming language, sold as a commercial software package by The MathWorks, Inc. The interface combines
pull-down menus with a command line. The user interface is more polished than that of SAS and arguably more
convenient: many tasks can be done without typing a single command.

Although MATLAB was originally developed for numerical computing, many add-on toolboxes extend it to
specific areas such as statistics, finance, and signal processing. This primer introduces how to use MATLAB
for basic statistical analysis; most of the functionality comes from the Statistics Toolbox. The primer is
organized as follows:

1. Data
   1.1 Data Import
   1.2 Data Manipulation
2. Regression
   2.1 Linear Regression Modeling
   2.2 Nonlinear Regression Modeling
3. Analysis
   3.1 ANOVA / ANCOVA
   3.2 Sample Size Calculation
   3.3 PCA / Factor Analysis
   3.4 Chi-square Testing
   3.5 Smoothing
4. Graphics
   4.1 Histograms
   4.2 QQ Plots

This primer is based on the 64-bit version of MATLAB 7.10 (R2010a). Most of the methods shown also work in
the 32-bit variant and in earlier versions of MATLAB. Most datasets used in the examples ship with the
toolbox, so feel free to try the code yourself.

1. DATA
1.1 DATA IMPORT

The easiest way to get data into MATLAB is the Import Wizard. Launch it either by choosing
"File" - "Import Data..." from the menu bar or by typing "uiimport" at the command line. Supported data types
include delimited text (.txt), MATLAB data files (.mat), Excel workbooks (.xls, .xlsx), and even binary formats
such as .avi or .jpg files. However, the wizard does not directly support SAS or SPSS data files.

After you choose the file to import, MATLAB reads the data and shows you its structure. You can pick the
columns you want to import and discard the rest. Note that MATLAB stores data of the same type in one
matrix, so you will end up with separate matrices for numerical data, labels, and descriptive data.
A screenshot of the Import Wizard is shown below.
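For scripted imports that skip the wizard, MATLAB also provides command-line readers such as importdata, xlsread, dlmread, and load. The sketch below is only an illustration: the file name primer_demo.csv and the matrix M are made up so that the example is self-contained.

```matlab
% Write a small comma-delimited file so the example does not depend on your data.
fname = fullfile(tempdir, 'primer_demo.csv');   % hypothetical file name
M = [1 2 3; 4 5 6];                             % made-up numeric data
dlmwrite(fname, M);                             % default delimiter is a comma

% Read the numeric matrix back from the delimited file.
A = dlmread(fname, ',');

delete(fname);                                  % clean up the temporary file
```

For files that mix text headers with numbers, importdata(fname) is usually the more convenient starting point.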


1.2 DATA MANIPULATION

Because MATLAB is a matrix-based system, manipulating data is relatively easy. The examples below
illustrate several data manipulation methods.

Example 1.1: Log transformation

Suppose our data are stored in the matrix "data". The second column is the dependent variable, and we want to
apply a log transformation to it. The following commands do the trick:

>> x2 = data(:,2);
>> x2log = log(x2);

x2log then contains the log-transformed responses (MATLAB's log is the natural logarithm). Note that the
original data are not changed: we created a new vector x2 that duplicates the second column of the original
data, then created another vector x2log to hold the transformed values.

Example 1.2: Deleting rows / columns

Suppose we now want to delete the 5th row and the 3rd column. One way is to delete them in the Variable
Editor, which opens when you double-click the variable in the Workspace window. The procedure is
straightforward, and edits can be undone, so you will not lose useful data by accident. A screenshot of the
Variable Editor is shown below.
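The same deletion can also be done from the command line by assigning the empty matrix [] to a row or column. This is a minimal sketch: magic(6) is only a stand-in for the data matrix, and the name trimmed is illustrative.

```matlab
data = magic(6);          % made-up 6-by-6 stand-in for the data matrix
trimmed = data;           % work on a copy so the original stays intact
trimmed(5, :) = [];       % delete the 5th row
trimmed(:, 3) = [];       % delete the 3rd column
size(trimmed)             % the copy is now 5-by-5
```

Unlike the Variable Editor, the command line has no undo, which is why the sketch works on a copy.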


Example 1.3: Matrix transposition

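Transposing the data matrix is easy in MATLAB, which uses the usual prime symbol (') for transposition. For
complex data the prime gives the conjugate transpose, and .' gives the plain transpose; for real data the two
are identical. The following command does the job: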

>> dataT = data';

Example 1.4: Filling in missing values

Missing data is a serious problem in matrix computation, so it has to be dealt with before any statistical
analysis. MATLAB can either delete the rows (observations) with missing values or interpolate the missing
values. The following code fills the missing values using linear interpolation.

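% Save as fixgaps.m (a function cannot be defined at the >> prompt).
function y = fixgaps(x)
% FIXGAPS Linearly interpolates interior gaps (NaNs) in a time series.
y = x;
bd = isnan(x);                              % logical mask: true where a value is missing
gd = find(~bd);                             % indices of the observed values
bd([1:(min(gd)-1) (max(gd)+1):end]) = 0;    % leave leading and trailing NaNs untouched
y(bd) = interp1(gd, x(gd), find(bd));       % interpolate the interior gaps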
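To make the two options concrete, here is a small sketch on made-up values. The first part drops every row that contains a NaN; the second part applies the same interpolation steps as the code above to a short vector. The variable names data, complete, and x are illustrative only.

```matlab
% Option 1: delete observations (rows) that contain any missing value.
data = [1 2; NaN 4; 5 6; 7 NaN];              % made-up sample data
complete = data(~any(isnan(data), 2), :);     % keep only the fully observed rows

% Option 2: linearly interpolate interior gaps in a series.
x  = [NaN 1 NaN 3 NaN NaN 6 NaN];             % made-up series with gaps
bd = isnan(x);                                % mask of missing entries
gd = find(~bd);                               % indices of observed entries
bd([1:(min(gd)-1) (max(gd)+1):end]) = false;  % do not extrapolate at the ends
y  = x;
y(bd) = interp1(gd, x(gd), find(bd));         % fill the interior gaps only
```

The leading and trailing NaNs remain in y because linear interpolation has no data on one side of them.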

2. REGRESSION
Both linear and nonlinear regression analysis can be performed in MATLAB by using the Statistics toolbox. The
following two examples illustrate these two types of regression analysis.

2.1 LINEAR REGRESSION MODELING

Example 2.1: Multiple regression analysis

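Suppose the matrix X contains 5 explanatory variables (5 columns) and the vector Y contains the dependent
variable (1 column). The function REGSTATS then performs the multiple regression analysis; its default model
adds the intercept term, so X should not include a column of ones. MATLAB is case-sensitive, so the variable
names in the command must match those in the workspace:

>> regstats(Y, X);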

When the command is run without output arguments, a window pops up in which you can choose the statistics
to calculate for the multiple regression analysis.
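REGSTATS can also be used without the pop-up window by requesting an output structure and naming the statistics you want. The sketch below assumes the "hald" data set that ships with the Statistics Toolbox (variables heat and ingredients); the choice of statistics is only an example.

```matlab
load hald                                   % toolbox data: heat (13x1), ingredients (13x4)

% Fit a linear model with intercept and request selected statistics only.
stats = regstats(heat, ingredients, 'linear', {'beta', 'rsquare', 'tstat'});

stats.beta          % intercept followed by the four slope estimates
stats.rsquare       % coefficient of determination
stats.tstat.pval    % p-values of the t tests on each coefficient
```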

2.2 NONLINEAR REGRESSION MODELING

Example 2.2: Nonlinear regression analysis

The function NLINTOOL opens a GUI for interactive exploration of multidimensional nonlinear functions.
Suppose the data in "reaction.mat" are partial pressures of three chemical reactants and the corresponding reaction
rates. The function hougen implements the nonlinear Hougen-Watson model for reaction rates. The following fits
the model to the data:

>> load reaction;
>> nlintool(reactants,rate,@hougen,beta,0.01,xn,yn);

You will see three plots. The response variable in all plots is the reaction rate, plotted in green. The red lines
show 99% confidence intervals on the predicted responses (the fifth argument, 0.01, is alpha). The first plot
shows hydrogen as the predictor, the second shows n-pentane, and the third shows isopentane.
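If you prefer numbers to an interactive window, the same model can be fitted with NLINFIT. This is a sketch under the assumption that reaction.mat provides reactants, rate, and the starting values beta, as used above; the output names betahat and ci are illustrative.

```matlab
load reaction                                  % reactants, rate, beta (starting values)

% Fit the Hougen-Watson model by nonlinear least squares.
[betahat, resid, J] = nlinfit(reactants, rate, @hougen, beta);

% 95% confidence intervals for the fitted parameters.
ci = nlparci(betahat, resid, 'jacobian', J);
```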


3. ANALYSIS
3.1 ANOVA / ANCOVA

The functions ANOVA1, ANOVA2, and ANOVAN perform one-way, two-way, and n-way ANOVA, respectively.
ANOVAN also supports random effects and nested factors.

Example 3.1: ANOVA with random effects

Suppose you are studying a few factories, but you want information about what would happen if you built the
same car models in a different factory, either one that you already have or one that you might construct. To
get this information, fit an ANOVA model that includes an interaction term and specify that the FACTORY
factor is random.

>> [pvals,tbl,stats] = anovan(mileage,{factory carmod},'model',2,'random',1, ...
       'varnames',{'Factory' 'Car Model'});
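The command above expects mileage, factory, and carmod to be vectors of equal length. The sketch below shows one way to build them from the 6-by-3 matrix in the toolbox file mileage.mat. The layout is an assumption on my part (columns index factories, and rows 1-3 and 4-6 are the two car models); adjust the two grouping arrays if your data are arranged differently.

```matlab
load mileage                                   % 6-by-3 matrix named mileage
[nrows, ncols] = size(mileage);

% Assumed layout: one column per factory, two blocks of three rows per car model.
factory = repmat(1:ncols, nrows, 1);           % factory index for every cell
carmod  = [ones(3, ncols); 2 * ones(3, ncols)];% car model index for every cell

% ANOVAN needs column vectors rather than matrices.
mileage = mileage(:);
factory = factory(:);
carmod  = carmod(:);

[pvals, tbl, stats] = anovan(mileage, {factory carmod}, 'model', 2, ...
    'random', 1, 'varnames', {'Factory' 'Car Model'}, 'display', 'off');
```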

The ANOVAN function also has arguments for two other types of model terms. First, the NESTED argument
specifies a matrix that indicates which factors are nested within other factors. A nested factor is one that
takes different values within each level of the factor it is nested in. Second, the CONTINUOUS argument
specifies which factors are to be treated as continuous variables; the remaining factors are categorical.
ANOVAN can fit models with multiple continuous and categorical predictors; the simplest such model, which
combines one predictor of each type, is known as an ANCOVA model.

Example 3.2: ANCOVA

The Statistics Toolbox data set "carsmall.mat" contains information on cars from the years 1970, 1976, and 1982.
This example studies the relationship between the weight of a car and its mileage, and whether this relationship
has changed over the years. The following command calls the AOCTOOL function to fit a separate line to the
column vectors WEIGHT and MPG for each of the three model-year groups defined in MODEL_YEAR. The initial
fit models the Y variable, MPG, as a linear function of the X variable, WEIGHT.

>> [h,atab,ctab,stats] = aoctool(Weight,MPG,Model_Year);
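A non-interactive way to ask the same question is ANOVAN with the CONTINUOUS argument described earlier. This is a sketch rather than an exact replica of what AOCTOOL computes: it treats Weight as continuous and Model_Year as categorical, and the interaction term tests whether the slope differs between model years.

```matlab
load carsmall                                  % provides MPG, Weight, Model_Year

% Factor 1 (Weight) is continuous; factor 2 (Model_Year) stays categorical.
[p, tbl] = anovan(MPG, {Weight, Model_Year}, 'continuous', 1, ...
    'model', 'interaction', ...
    'varnames', {'Weight', 'Model_Year'}, 'display', 'off');

% p(1): Weight, p(2): Model_Year, p(3): Weight-by-Model_Year interaction
p
```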


3.2 SAMPLE SIZE CALCULATION

To perform sample size calculation in MATLAB, we use the function SAMPSIZEPWR, which is also included in
the Statistics toolbox. The following example shows how to implement this function.

Example 3.3: Sample size calculation

Compute the sample size n required to distinguish p = 0.2 (the null value) from p = 0.26 (the alternative) with
a binomial test. The required power is 0.6. For the binomial test the returned value is an approximation.

>> napprox = sampsizepwr('p',0.2,0.26,0.6)
napprox =
   244
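Because the binomial result is approximate, it can be worth checking whether a smaller n already reaches the required power. The sketch below evaluates the power over a range of candidate sample sizes by passing [] for the power argument; the range 1:250 and the names nn, pwr, and nexact are illustrative choices.

```matlab
nn  = 1:250;                                   % candidate sample sizes
pwr = sampsizepwr('p', 0.2, 0.26, [], nn);     % power of the test at each n

% Smallest candidate n whose power reaches the target of 0.6.
nexact = min(nn(pwr >= 0.6))

plot(nn, pwr, 'b-', [nn(1) nn(end)], [0.6 0.6], 'k:')
xlabel('Sample size n'); ylabel('Power')
```

The power of an exact binomial test is not monotone in n, which is why scanning a range is more reliable than trusting a single approximate value.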

3.3 PCA / FACTOR ANALYSIS

PCA and factor analysis can also be carried out with the Statistics Toolbox. The functions used are
PRINCOMP and FACTORAN, respectively.

Example 3.4: PCA

Compute principal components for the INGREDIENTS variable in the "Hald" dataset, and the variance accounted
for by each component.

>> load hald;
>> [pc,score,latent,tsquare] = princomp(ingredients);
>> pc,latent
pc =
    0.0678   -0.6460    0.5673   -0.5062
    0.6785   -0.0200   -0.5440   -0.4933
   -0.0290    0.7553    0.4036   -0.5156
   -0.7309   -0.1085   -0.4684   -0.4844

latent =
  517.7969
   67.4964
   12.4054
    0.2372
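The entries of latent are the variances of the principal components, so the share of total variance explained by each component is latent divided by its sum. The sketch below simply reuses the four values printed above; the names explained and cumulative are illustrative.

```matlab
latent = [517.7969; 67.4964; 12.4054; 0.2372];   % component variances from the output above

explained  = 100 * latent / sum(latent);         % percent of variance per component
cumulative = cumsum(explained);                  % running total across components

[explained cumulative]
```

In newer MATLAB releases the function pca supersedes princomp and can return the percentages directly as an additional output.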

Example 3.5: Factor analysis

Load the "carbig" data, and fit the default model with two factors.

>> load carbig
>> X = [Acceleration Displacement Horsepower MPG Weight];
>> X = X(all(~isnan(X),2),:);
>> [Lambda,Psi,T,stats,F] = factoran(X,2,'scores','regression');
>> inv(T'*T) % Estimated correlation matrix of F, == eye(2)
>> Lambda*Lambda'+diag(Psi) % Estimated correlation matrix
>> Lambda*inv(T) % Unrotate the loadings
>> F*T' % Unrotate the factor scores
>> biplot(Lambda,'LineWidth',2,'MarkerSize',20) % Create biplot of two factors

3.4 CHI-SQUARE TESTING

The Chi-square goodness-of-fit test can be conducted in MATLAB using the function CHI2GOF.

Example 3.6: Chi-square testing

Test against the standard normal:

>> x = randn(100,1);
>> [h,p] = chi2gof(x,'cdf',@normcdf) % h = 1 means the null hypothesis is rejected at the default alpha = 0.05 level
h =
     0
p =
    0.9443

Because x is random, the p-value you obtain will differ from run to run.
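When the mean and standard deviation are unknown, they can be estimated from the sample and passed along with the distribution function in a cell array; the 'nparams' option tells CHI2GOF how many parameters were estimated so that the degrees of freedom are adjusted. The data below are simulated purely for illustration.

```matlab
x = 3 + 2 * randn(100, 1);                     % simulated sample, not real data

% Test against a normal distribution with parameters estimated from x.
[h, p] = chi2gof(x, 'cdf', {@normcdf, mean(x), std(x)}, 'nparams', 2);

% Same test at a stricter significance level.
[h01, p01] = chi2gof(x, 'cdf', {@normcdf, mean(x), std(x)}, ...
    'nparams', 2, 'alpha', 0.01);
```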


3.5 SMOOTHING

To smooth out the response data, we can use the SMOOTH function from the Curve Fitting toolbox.

Example 3.7: Smoothing

Suppose you want to smooth traffic count data with a moving average filter to see the average traffic flow over a
5-hour window (span is 5).

>> load count.dat
>> y = count(:,1);
>> yy = smooth(y); % default method is a moving average with span 5

Plot the original data and the smoothed data.

>> t = 1:length(y);
>> plot(t,y,'r-.',t,yy,'b-')
>> legend('Original Data','Smoothed Data Using ''moving''','Location','NorthWest')
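If the Curve Fitting Toolbox is not available, a 5-point moving average can be sketched with core MATLAB by convolving the series with equal weights. This is a substitute technique, not what SMOOTH does internally: interior points agree with a span-5 moving average, but the two points at each end are computed with zero padding here, whereas SMOOTH shrinks the window there. The series y is made up.

```matlab
y = [2 4 6 8 10 12 14]';          % made-up series
w = ones(5, 1) / 5;               % equal weights for a span of 5

% 'same' returns the central part of the convolution, the same length as y.
yy = conv(y, w, 'same');

[y yy]                            % compare original and smoothed values
```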

4. GRAPHICS
MATLAB offers a variety of data plotting functions plus a set of GUI tools to create and modify graphic displays.
The GUI tools afford most of the control over graphic properties and options that typed commands provide. The
generated plots are of high quality and can be saved in the format of your choice (.jpg, .png, .pdf, .eps, etc.).
This section provides examples of two commonly used statistical plots: the histogram and the QQ plot.
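Saving a plot can also be scripted with the PRINT function. The sketch below is illustrative: the file name primer_hist.png, the resolution, and the random data are all arbitrary choices, and the figure is kept invisible so the code can run in a batch session.

```matlab
fig = figure('Visible', 'off');                  % create a figure without showing it
hist(randn(1000, 1), 20);                        % any plot will do; this one is made up

outFile = fullfile(tempdir, 'primer_hist.png');  % hypothetical output file
print(fig, '-dpng', '-r150', outFile);           % PNG at 150 dpi

close(fig);
```

Swapping '-dpng' for '-dpdf', '-depsc', or '-djpeg' selects one of the other formats mentioned above.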


4.1 HISTOGRAM PLOT

It is very easy to plot a histogram in MATLAB. To graph selected variables, just use the Plot Selector in the
Workspace Browser. If you prefer using the command line, a simple example shows how to do it:

Example 4.1: Histogram plot

Generate a bell-curve histogram from Gaussian data.

>> x = -4:0.1:4;
>> y = randn(10000,1);
>> hist(y,x)
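To compare the histogram with the theoretical bell curve, the bin counts can be rescaled to a density and the standard normal density drawn on top. This is a sketch with illustrative variable names; the density formula is written out so that no toolbox function is needed.

```matlab
x = -4:0.1:4;                                  % bin centers, width 0.1
y = randn(10000, 1);                           % simulated Gaussian data

counts  = hist(y, x);                          % counts per bin
density = counts / (sum(counts) * 0.1);        % rescale so the bars integrate to 1

bar(x, density, 1); hold on
plot(x, exp(-x.^2 / 2) / sqrt(2 * pi), 'r-', 'LineWidth', 2)   % standard normal pdf
hold off
legend('Scaled histogram', 'Standard normal density')
```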

4.2 QQ PLOT

The QQ plot is generated by the function QQPLOT in MATLAB. The command QQPLOT(X) displays a
quantile-quantile plot of the sample quantiles of X versus theoretical quantiles from a normal distribution,
while QQPLOT(X,Y) plots the quantiles of two samples against each other.

Example 4.2: QQ plot

The following example shows a quantile-quantile plot of two samples from Poisson distributions.

>> x = poissrnd(10,50,1);
>> y = poissrnd(5,100,1);
>> qqplot(x,y);
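For completeness, the one-sample form described at the start of this section looks like this. The data are simulated for illustration; points that fall close to the reference line suggest that the sample is consistent with a normal distribution.

```matlab
x = randn(100, 1);      % simulated sample, not real data
h = qqplot(x);          % sample quantiles versus normal quantiles; h holds the line handles
```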
