
DATA HANDLING AND PRESENTATION
MULTIVARIATE ANALYSIS

Dr. Anne van Dam


Multivariate data analysis: lecture outline

• What is it?
  - Why/when use it
  - Classification of techniques

• Structured approach to multivariate analysis
  - Problem definition, objectives
  - Analysis plan
  - Assumptions
  - Model estimation
  - Interpretation
  - Validation

• Examples
  - Multiple regression analysis
  - Factor analysis (Dr. Peter Kelderman)

Multivariate data analysis – what is it?

• Univariate analysis: distribution of a single variable

• Bivariate analysis: relationship between two variables
  - cross-classification
  - correlation
  - one-way analysis of variance (ANOVA)
  - simple regression

• Multivariate analysis: simultaneous analysis of more than two variables


Univariate versus bivariate analysis: examples

Univariate analysis: dissolved oxygen in a fishpond
[Figure: histogram; x-axis DO concentration (mg/l), y-axis no. of observations; mean = 4.78, S.D. = 0.95]

Bivariate analysis: phosphorus and chlorophyll a in lakes
[Figure]

Multivariate data analysis: example
Predicting the phosphorus concentrations of lakes
What is a variate?

Building block of multivariate analysis is the “variate”

Definition:
a variate is a linear combination of variables with
empirically determined weights

Variate value = w1X1 + w2X2 + w3X3 + ... + wnXn

Xn : observed variable
wn : weight determined by the multivariate technique

The value of the variate represents the combination of the entire set of variables
Variate: example from multiple regression
log(SPM) = 1.148 log(TP) + 0.137 pH + 0.286 log(DR) - 1.985
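
The weighted sum in the variate definition can be sketched in a few lines of Python. This is only an illustration: the weights and constant come from the regression equation above, while the function names, the assumption that "log" means log10, and the input values for the example lake are hypothetical.

```python
import math

def variate_value(weights, values, constant=0.0):
    """Return w1*X1 + w2*X2 + ... + wn*Xn plus an optional constant."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * x for w, x in zip(weights, values)) + constant

# Weights and constant copied from the regression equation on the slide.
WEIGHTS = [1.148, 0.137, 0.286]
CONSTANT = -1.985

def predict_log_spm(tp, ph, dr):
    """Variate for log(SPM); assumes log means log10 (not stated in the source)."""
    observed = [math.log10(tp), ph, math.log10(dr)]
    return variate_value(WEIGHTS, observed, CONSTANT)

# Hypothetical lake (illustrative numbers only): TP = 10, pH = 7, DR = 1.
example = predict_log_spm(10.0, 7.0, 1.0)
print(round(example, 3))  # 1.148*1 + 0.137*7 + 0.286*0 - 1.985 = 0.122
```

Here the variate is the single number that summarises all three observed variables, each weighted by its estimated coefficient.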
Metric and non-metric data
Non-metric or qualitative:
attributes, characteristics, or categorical properties
types, classes, absence/presence
Example: male/female
Nominal (or categorical) scale: class symbols have no quantitative meaning
(e.g., female = 1, male = 2)
Ordinal scale: variables can be ranked according to scale (e.g., level of
agreement in survey: don’t agree = 1, don’t know = 2, agree = 3)

Metric or quantitative:
differing in amount or degree
Example: temperature
Interval scale: arbitrary zero point (e.g., temperature in Celsius versus
Fahrenheit)
Ratio scale: absolute zero point (e.g., weight or length)
Classification of multivariate techniques

Dependence techniques:
variable or set of variables is identified as the dependent
variable to be predicted or explained by independent
variables
Example: multiple regression analysis

Interdependence techniques:
set of variables is analysed simultaneously without
defining dependence relationships
Example: factor analysis
Multivariate dependence methods
(one dependent variable)

• Analysis of variance
  Y1 (metric) = X1 + X2 + X3 + ... + Xn (nonmetric)

• Multiple discriminant analysis
  Y1 (nonmetric) = X1 + X2 + X3 + ... + Xn (metric)

• Multiple regression analysis
  Y1 (metric) = X1 + X2 + X3 + ... + Xn (metric, nonmetric)
Multivariate dependence methods
(multiple dependent variables)

• Canonical correlation
  Y1 + Y2 + Y3 + ... + Yn (metric, nonmetric) = X1 + X2 + X3 + ... + Xn (metric, nonmetric)

• Multivariate analysis of variance
  Y1 + Y2 + Y3 + ... + Yn (metric) = X1 + X2 + X3 + ... + Xn (nonmetric)
Type of relationship? Dependence (versus interdependence)

How many variables are predicted?

• Multiple relationships of dependent and independent variables:
  Structural equation modelling

• Several dependent variables in a single relationship
  Measurement scale of dependent variable?
  - metric. Measurement scale of predictor variable?
      - metric: Canonical correlation analysis
      - nonmetric: Multivariate analysis of variance
  - nonmetric: Canonical correlation w/ dummy variables

• One dependent variable in a single relationship
  Measurement scale of dependent variable?
  - metric: Multiple regression; Conjoint analysis
  - nonmetric: Multiple discriminant analysis; Linear probability models

Source: Hair et al. 1998. Multivariate data analysis, 5th ed.
Type of relationship? Interdependence (versus dependence)

Structure of relationships among:

• Variables: Factor analysis

• Cases / respondents: Cluster analysis

• Objects. How are the attributes measured?
  - metric: Multidimensional scaling
  - nonmetric: Correspondence analysis

Source: Hair et al. 1998. Multivariate data analysis, 5th ed.


Classification of multivariate techniques

Exploratory methods:
Main objective is to identify interrelationships and
structures among variables. Reduction of large number
of variables to a few key components
Examples: principal component and factor analysis; cluster analysis;
multidimensional scaling

Confirmatory methods:
Main objective is to test hypothesized relationships
between variables. Researcher has a-priori
understanding of relationships
Examples: correlation analysis; multiple regression; canonical correlation;
analysis of variance; discriminant analysis; conjoint analysis; structural
equation modelling
Structured approach to multivariate analysis
Stage 1: Define problem, objectives, and choose technique

Stage 2: Develop analysis plan and evaluate assumptions of multivariate technique

Stage 3: Estimate the model

Stage 4: Interpret the variates

Stage 5: Validate the model


1. Define problem, objectives, and technique

• Develop a conceptual model
  - relationships between variables
  - structure, similarities
  - cause and effect

• Define the objective(s) of the model
  - prediction or exploration
  - understanding of the system

• Choose technique in relation to objectives and data types
  - dependence/interdependence
  - metric/nonmetric

2. Develop analysis plan and evaluate assumptions underlying technique

• Analysis plan:
  - minimum/desired sample size
  - allowable/required variable types
  - special variable formulation

• Assumptions:
  - normal distribution
  - linearity
  - independence of error terms
  - equality of variance (homoscedasticity)

• Data manipulation / transformation
  - e.g., logarithmic or arc sine transformation
  - dummy variables
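
The data manipulations listed above can be sketched as small Python helpers. This is a minimal sketch under assumptions: the function names are hypothetical, the logarithm is taken as base 10, and the arc sine transformation is written in its common square-root form for proportions, which the slide does not specify.

```python
import math

def log10_transform(x):
    """Logarithmic transformation (base 10 assumed); requires x > 0."""
    if x <= 0:
        raise ValueError("log transformation needs a positive value")
    return math.log10(x)

def arcsine_sqrt_transform(p):
    """Arc sine square-root transformation for a proportion between 0 and 1.

    Assumption: the square-root variant is meant; the result is in radians.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(p))

def dummy(applied):
    """Code a yes/no attribute (e.g. herbicide applied) as 1 or 0."""
    return 1 if applied else 0

# Illustrative inputs: a stocking density of 2000 per ha, 25% recovery, herbicide used.
print(round(log10_transform(2000), 2))            # 3.3
print(round(arcsine_sqrt_transform(0.25), 4))     # asin(0.5) = pi/6
print(dummy(True), dummy(False))                  # 1 0
```

A dummy variable lets a nonmetric attribute enter a technique that expects metric predictors, which is why it appears under data manipulation.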
3. Estimate the model
• Model estimation

• Assessment of model fit
  - R²
  - if necessary, improve fit (e.g., rotation in factor analysis)
  - iteration

• Evaluation of the model
  - statistical significance (model, parameters)
  - outliers?
  - “robustness”
4. Interpret the variates

• Examination of estimated coefficients (weights) for each variable in the variate
  - e.g., multiple regression:
    log(SPM) = 1.148 log(TP) + 0.137 pH + 0.286 log(DR) - 1.985

• Interpretation of multiple variates as underlying “dimensions”
  - e.g., principal components

• If necessary: reformulation of model


5. Validate the model

• How general is the model?
  - does it apply to the total population?
  - can the model be used for prediction?

• Method: check model fit with independent dataset
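
Checking model fit against an independent dataset might look like the following sketch. The fitted model, its coefficients, and the hold-out observations are all hypothetical; only the idea of scoring predictions on data not used for estimation comes from the slide.

```python
def r_squared(observed, predicted):
    """Coefficient of determination of predictions against observed values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def fitted_model(x):
    """Hypothetical model estimated earlier on a separate dataset."""
    return 2.0 * x + 1.0

# Hypothetical independent (hold-out) observations, not used for estimation.
holdout_x = [1.0, 2.0, 3.0, 4.0]
holdout_y = [3.2, 4.8, 7.1, 9.0]

predictions = [fitted_model(x) for x in holdout_x]
validation_r2 = r_squared(holdout_y, predictions)
print(round(validation_r2, 3))
```

A validation R² far below the R² obtained during estimation would suggest that the model does not generalise well beyond the data it was fitted to.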


Structured approach to multivariate analysis

Define problem, objectives, and choose technique

Analysis plan and evaluate assumptions, data manipulation

Estimate the model


iteration!!
Interpret the variates

Validate the model


Examples of multivariate analysis
Principal components and factor analysis
Multiple regression analysis
Multiple discriminant analysis
Multivariate analysis of variance and covariance
Conjoint analysis
Canonical correlation
Cluster analysis
Multidimensional scaling
Correspondence analysis
Linear probability models
Structural equation modelling
Data mining and warehousing
Neural networks
Resampling
Multiple regression analysis of rice-fish data*
• Background
  - 50 experiments on rice-fish culture in unpublished reports
  - problem: low power of single pond experiments (many type II errors)
  - multivariate analysis allows a new look at the data

• Objectives
  - explain variation in fish and rice production from input and climate data
  - exploratory analysis, “meta-analysis”

• Choice of technique: multiple linear regression
  - relate one metric variable (rice or fish yield) to multiple metric explanatory variables

*Source: van Dam, A.A. (1990) Multiple regression analysis of accumulated data from aquaculture
experiments: a rice-fish culture example. Aquaculture and Fisheries Management 21, 1-15.
Multiple regression analysis of rice-fish data in the Philippines
System description

Nile tilapia
(Oreochromis niloticus L.)
Multiple regression analysis of rice-fish data
Analysis plan / methodology / assumptions

• Database management

• First analysis: plots, correlation matrix

• Transformations/re-expression
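
A first-pass correlation matrix, as mentioned in the analysis plan, could be computed along these lines. The sketch uses Pearson correlation in plain Python; the variable names and data values are hypothetical and are not taken from the rice-fish dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values, for illustration only.
sizes = [5.0, 10.0, 20.0, 30.0, 40.0]
data = {
    "stocking_size": sizes,
    "log_stocking_size": [math.log10(v) for v in sizes],
    "recovery": [30.0, 45.0, 60.0, 70.0, 85.0],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(v, 3) for v in row.values()])
```

Comparing the correlation of a raw variable and its log-transformed version with the response is one simple way to decide whether a transformation is worth keeping.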
Multiple regression analysis of rice-fish data
Database
management
Data plot; trends
[Figure: recovery (%) and growth rate (g/day) plotted against fish stocking size (g)]
Data plot; trends
[Figure: net fish yield (kg/ha) plotted against recovery (%), with fitted line y = 1.8516x - 50.993, R² = 0.4605]
Multiple regression analysis of rice-fish data
Results: summary of data

Descriptive statistics of rice-fish dataset. No. of cases (N) = 198

Name                                     Mean        SD      Min.      Max.

Dependent variables
Gross fish yield (kg ha-1)             122.86     77.66       2.5     390.0
Net fish yield (kg ha-1)                51.35     68.62    -105.0     300.0
Fish recovery percentage (%)            55.62     25.49         1       100
Fish growth rate (g d-1)                0.347     0.123      0.14      0.84
Rice yield (kg ha-1)                  4337.46   1689.08       600      8250

Independent variables
Plot size (m2)                         201.52     27.55       100       400
Period (d)                              78.97     17.73        50       114
Log period (d)                           1.89    0.0995      1.70      2.06
Stocking density (no. ha-1)           5878.79   1883.80      2000     10000
Log stocking density (no. ha-1)          3.75     0.128      3.30      4.00
Stocking size (g)                       12.80      9.41         1        44
Log stocking size (g)                   0.958     0.404         0      1.64
Basal N application (kg ha-1)           63.11     13.18      40.2      79.5
Log basal N applic. (kg ha-1)            1.79    0.0961      1.60      1.90
Top dress N applic. (kg ha-1)           23.18     31.40         0      89.3
Basal P application (kg ha-1)           30.84     13.66      10.5      55.2
Log basal P applic. (kg ha-1)            1.44     0.205      1.02      1.74
Top dress P applic. (kg ha-1)            3.23      8.02         0        28
Herbicide application (dummy)           0.136     0.344         0         1
Basal insect. applic. (dummy)           0.707     0.456         0         1
No. of insecticide sprayings             1.19     0.879         0         3
Feed (dummy)                            0.222     0.417         0         1
Avg. max. air temperature (°C)          32.29      1.30      30.2      34.7
Avg. min. air temperature (°C)          22.84     0.690      21.2      24.0
Avg. daily wind speed (knots)            4.97      1.04       3.2       7.0
Avg. daily evaporation (mm)              5.50      1.39       3.5       7.4
Avg. daily rainfall (mm)                 3.68      2.81       0.1       8.8
Avg. daily sunshine (min)              476.85     85.51       350       611
Multiple regression analysis of rice-fish data*
Y = a + b1X1 + b2X2 + ... + bkXk + ε

with

Y : dependent variable
X1..k : independent or explanatory variables
b1..k : partial regression coefficients (slopes)
a : constant (intercept)
ε : residual

Assess the estimated model with R² and the F-value with its significance level (α)

Evaluate sign and significance of coefficients (t-test)
Standardize b’s: beta weights
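
Estimating a, b1..bk and R² for the model above can be sketched with ordinary least squares in plain Python. This is an illustration under assumptions: the function names are hypothetical, the normal equations are solved by Gauss-Jordan elimination rather than by a statistics package, and the data are synthetic (generated around Y = 1 + 2 X1 + 3 X2 with small added deviations).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(xs, y):
    """Return [a, b1, ..., bk] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in xs]   # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict(coef, row):
    """Evaluate a + b1*X1 + ... + bk*Xk for one case."""
    return coef[0] + sum(b * x for b, x in zip(coef[1:], row))

def r_squared(xs, y, coef):
    """Share of the variation in Y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(coef, row)) ** 2 for row, yi in zip(xs, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic data for illustration only.
xs = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [9.1, 7.9, 19.2, 17.8, 29.1, 27.9]

coefficients = fit_ols(xs, y)
r2 = r_squared(xs, y, coefficients)
print([round(c, 3) for c in coefficients], round(r2, 5))
```

In practice a statistics package would also report standard errors and t-tests for each coefficient; the sketch only covers the point estimates and R².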
Multiple regression analysis of rice-fish data
Results: estimation of 5 models
                                  R²       F-value    Sign. (α)

Gross fish yield (kg ha-1)      0.6571     52.013     <0.001
Net fish yield (kg ha-1)        0.5536     33.659     <0.001
Fish recovery (%)               0.4469     21.935     <0.001
Fish growth rate (g d-1)        0.3549     21.129     <0.001
Rice yield (kg ha-1)            0.7088     66.080     <0.001

Conclusion:
significant models; 35-71% of variation in Y’s explained
Multiple regression analysis of rice-fish data
Multiple regression models for gross fish yield (kg ha-1). All partial regression coefficients (b's) were significant at the
0.1% level, except when marked * (5%) or ** (1%). Also given are the standard errors (SE) of the b's and the
standardized b's or beta weights (rankings in brackets). Number of cases = 198

                                             Model 1                            Model 2
Independent variables                  b        SE       beta             b        SE       beta
Period (d)                            1.57     0.225    0.359(4)
Log period (d)                                                         230.26    42.23     0.295(5)
Stocking density (no. ha-1)          0.012     0.002    0.279(6)
Log stocking density (no. ha-1)                                        136.31    27.31     0.225(6)
Stocking size (g)                     3.78     0.432    0.458(1)         3.82    0.423     0.463(2)
Basal N application (kg ha-1)         1.74     0.283    0.296(5)
Log basal N application (kg ha-1)                                      276.07    36.36     0.342(4)
Basal P application (kg ha-1)        -2.05     0.318   -0.361(3)
Log basal P application (kg ha-1)                                     -166.37    21.16    -0.439(3)
No. of insecticide sprayings       -10.03*     4.15    -0.114(7)
Maximum air temperature (°C)         26.97     3.21     0.452(2)        28.67     3.17     0.481(1)

Constant (a)                      -1022.83                           -2051.25

Coeff. of determination (R²)        0.6571                             0.6676
F-value                             52.013                             63.946
Probability                         <0.001                             <0.001
Durbin-Watson statistic             1.5120                             1.5738
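
As a worked check, the F-values in the table can be approximately recovered from R², the number of cases and the number of predictors, using the standard relation F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below assumes seven predictors in Model 1 and six in Model 2, counted from the table rows; the function name is hypothetical.

```python
def f_value(r2, n_cases, n_predictors):
    """Overall F statistic of a regression computed from its R²."""
    df_model = n_predictors
    df_residual = n_cases - n_predictors - 1
    return (r2 / df_model) / ((1.0 - r2) / df_residual)

# R² values and number of cases as printed in the table.
f_model1 = f_value(0.6571, 198, 7)   # table reports 52.013
f_model2 = f_value(0.6676, 198, 6)   # table reports 63.946

print(round(f_model1, 2), round(f_model2, 2))
```

The small difference for Model 2 is what one would expect from R² being rounded to four decimals in the table.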
Multiple regression analysis of rice-fish data
Partial regression coefficients: b’s

Y = a + b1X1 + b2X2 + ... + bkXk + ε

Model: gross fish yield

                                     b        s.e.     beta (rank)

Period (d)                          1.57     0.225     0.359 (4)
Stocking density (ha-1)            0.012     0.002     0.279 (6)
Stocking size (g)                   3.78     0.432     0.458 (1)
Basal N application (kg ha-1)       1.74     0.283     0.296 (5)
Basal P application (kg ha-1)      -2.05     0.318    -0.361 (3)
No. of insecticide sprayings      -10.03     4.15     -0.114 (7)
Maximum air temperature (°C)       26.97     3.21      0.452 (2)

βk = bk • (sdXk / sdY), where sdXk and sdY are the standard deviations of Xk and Y
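
The beta-weight formula can be checked against the reported numbers. The sketch below takes the b values from the gross fish yield model and the standard deviations from the descriptive statistics of the dataset (SD of gross fish yield = 77.66); the function and variable names are hypothetical.

```python
def beta_weight(b, sd_x, sd_y):
    """Standardized regression coefficient: b scaled by sd(X) / sd(Y)."""
    return b * sd_x / sd_y

SD_GROSS_FISH_YIELD = 77.66  # SD of the dependent variable (kg ha-1)

# (b, SD of X, beta reported on the slide)
predictors = {
    "Period (d)":                   (1.57, 17.73, 0.359),
    "Stocking size (g)":            (3.78, 9.41, 0.458),
    "Basal N application":          (1.74, 13.18, 0.296),
    "Basal P application":          (-2.05, 13.66, -0.361),
    "No. of insecticide sprayings": (-10.03, 0.879, -0.114),
    "Maximum air temperature":      (26.97, 1.30, 0.452),
}

computed = {
    name: beta_weight(b, sd_x, SD_GROSS_FISH_YIELD)
    for name, (b, sd_x, _reported) in predictors.items()
}
for name, value in computed.items():
    print(f"{name}: computed {value:.3f}, reported {predictors[name][2]:.3f}")
```

The computed values agree with the reported beta weights to within rounding. Stocking density is left out because its b is printed with only two significant digits, which makes the check too coarse.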
Multiple regression analysis of rice-fish data

Beta weights: standardized regression coefficients

Allow direct comparison between effects
Y = a + b1X1 + b2X2 + ... + bkXk + ε
βk = bk • (sdXk / sdY)

[Figure 1: bar charts of beta weights (axis from -0.3 to 0.3) for the variables PER, SD, SS, N, P, H, B, I and T, in five panels: Gross yield, Net yield, Recovery, Growth rate, Rice yield]

Figure 1. Beta-weights of variables in all models. Pesticides (dotted bars)
were of minor importance for yield and recovery, had a strong negative effect
on fish growth rate and positive effects on rice yields. Phosphorus fertilization
(striped bars) showed a negative effect on all fish variables (PER=length of
the culture period, SD=stocking density, SS=stocking size, N=basal nitrogen
application, P=basal phosphorus application, H=herbicide application,
B=basal insecticide application, I=number of insecticide sprayings,
T=maximum air temperature).
Factor analysis
• Analyze interrelationships among large no. of variables
• Explain in terms of underlying relationships (= factors = variates)
• Data reduction (reduce large no. of variables to 2-4 factors)

Example: water quality in fishponds (Bangladesh)

12 ponds, 4 treatments (fish stocked)


9 WQ parameters: temp, transp, pH, DO, alk, PO4, NH4, NO2,
NO3, ChlA

Samples on 20 dates (1 May – 14 November), every 10 days

Dataset: 20 * 12 * 9 = 2160 numbers


Factor analysis
WQ Factors          WQF1        WQF2               WQF3             WQF4

Secchi              0.65        0.20               0.34            -0.15
alkalinity          0.70       -0.08              -0.19            -0.02
pH                  0.61        0.04              -0.44             0.36
oxygen              0.11       -0.47              -0.28             0.36
ammonia            -0.57        0.49              -0.06             0.04
nitrite             0.15        0.01               0.83             0.23
nitrate             0.37        0.52               0.20             0.36
phosphate          -0.21        0.58              -0.23             0.51
chlorophyll        -0.30       -0.52               0.27             0.61

Variance
explained (%)       21          16                 14               12

Interpretation      liming      photosynthesis /   partial          P-limiting
                    effect      decomposition      nitrification    algae
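
The "variance explained" row can be approximately reproduced from the loadings. The sketch assumes standardized variables and uncorrelated factors, so that the share of variance explained by a factor is the sum of its squared loadings divided by the number of variables. These assumptions are not stated on the slide, and the function names are hypothetical.

```python
# Loadings copied from the table (columns WQF1 to WQF4).
LOADINGS = {
    "Secchi":      [0.65, 0.20, 0.34, -0.15],
    "alkalinity":  [0.70, -0.08, -0.19, -0.02],
    "pH":          [0.61, 0.04, -0.44, 0.36],
    "oxygen":      [0.11, -0.47, -0.28, 0.36],
    "ammonia":     [-0.57, 0.49, -0.06, 0.04],
    "nitrite":     [0.15, 0.01, 0.83, 0.23],
    "nitrate":     [0.37, 0.52, 0.20, 0.36],
    "phosphate":   [-0.21, 0.58, -0.23, 0.51],
    "chlorophyll": [-0.30, -0.52, 0.27, 0.61],
}

def variance_explained(loadings):
    """Percentage of total variance per factor: sum of squared loadings / no. of variables."""
    rows = list(loadings.values())
    n_vars = len(rows)
    n_factors = len(rows[0])
    return [100.0 * sum(row[j] ** 2 for row in rows) / n_vars for j in range(n_factors)]

def communality(row):
    """Share of one variable's variance reproduced by all retained factors."""
    return sum(v * v for v in row)

percentages = variance_explained(LOADINGS)
print([round(p, 1) for p in percentages])          # [21.3, 15.4, 14.2, 12.2]
print(round(communality(LOADINGS["nitrite"]), 4))  # 0.7644
```

Three of the four values round to the reported percentages (21, 14, 12). The second factor comes out at about 15.4% against a reported 16%, possibly because the printed loadings are rounded to two decimals.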
Factor analysis
For each pond and date, the value of the factor can be calculated, e.g.:

Measured values on 1-May:

temp      31.3
secchi    31
alk       154
pH        8.03
DO        9.8
NHx       0.8
NO2       0.08
NO3       2.0
ChlA      177
PO4       1.5

Resulting factor values on 1-May:

WQ1       WQ2       WQ3       WQ4
0.81     -1.00      0.23      1.08

Factors can be plotted, e.g., in relation to time (see pdf)


Model estimation: software

Some popular statistics software packages:

SPSS http://www.spss.com
SAS http://www.sas.com
SYSTAT http://www.systat.com
Statistica http://www.statsoftinc.com
Minitab http://www.minitab.com
LISREL http://www.ssicentral.com/lisrel/mainlis.htm

Etcetera !!!!!

Analyse-it www.analyse-it.com software add-in for Excel


Some further reading
Doucet, P. and P.B. Sloep (1992) Mathematical modeling in the life sciences. Ellis Horwood, New York.
490 p.
Hair, J.F., R.E. Anderson, R.L. Tatham and W.C. Black (1998) Multivariate data analysis, 5th Edition.
Prentice Hall International, Inc., New Jersey.
Kelly, L.A., A. Bergheim and M.M. Hennesy (1994) Predicting output of ammonium from fish farms.
Water Research 28, 1403-1405.
Lindstrom, M., L. Hakanson, O. Abrahamsson and H. Johansson (1999) An empirical model for
prediction of lake water suspended particulate matter. Ecological Modelling 121, 185-198.
Milstein, A., M.A. Wahab and M.M. Rahman (2002) Environmental effects of common carp Cyprinus
carpio (L.) and mrigal Cirrhinus mrigala (Hamilton) as bottom feeders in major Indian carp
polycultures. Aquaculture Research 33, 1103-1117.
Prein, M., G. Hulata and D. Pauly (1993) Multivariate methods in aquaculture research: case studies
of tilapias in experimental and commercial systems. ICLARM Stud. Rev. 20, 221 p.
Van Dam, A.A. (1990) Multiple regression analysis of accumulated data from aquaculture experiments: a
rice-fish culture example. Aquaculture and Fisheries Management 21, 1-15.
