BRM CS

CH1: OVERVIEW OF MULTIVARIATE METHODS

Multivariate analysis: all statistical methods that simultaneously analyze multiple measurements on each individual or object under investigation. Why use it? • Methodological benefits • Measurement • Explanation and prediction • Hypothesis testing • Improved decision making.

THREE CONVERGING TRENDS:
1 The Rise of Big Data – the unique elements of Big Data fall into five areas: • Volume (the sheer magnitude of information being collected is what initiated the term) • Variety • Velocity • Veracity • Variability and Value. Impacts on: • Organizational decisions and academic research – improved decision-making capabilities as well as the explosion of data available to characterize situations on dimensions never before available. • Analytics and the analyst – expansion of the domains of study embracing analytics, as well as methodological challenges due to seemingly "unlimited" data. • Problems – emerging on technological, ethical and sociological fronts.
2 Statistical Versus Data Mining Models – two approaches to data analysis: • Statistical/data models – analyses where a specific model is proposed (e.g., dependent and independent variables to be analyzed by the general linear model), the model is then estimated, and a statistical inference is made as to its generalizability to the population through statistical tests. • Data mining/algorithmic models – models based on algorithms (e.g., neural networks, decision trees, support vector machines) that are widely used in many Big Data applications; their emphasis is on predictive accuracy rather than statistical inference and explanation.
3 Causal Inference – the movement beyond statistical inference to the stronger statement of "cause and effect" in non-experimental situations. While causal statements have primarily been conceived as the domain of randomized controlled experiments, recent developments have provided researchers with: • theoretical frameworks for understanding the requirements for causal inference in non-experimental settings, and • techniques applicable to data not gathered in an experimental setting that still allow some causal inferences to be drawn.
Multivariate analysis: statistical techniques that simultaneously analyze multiple measurements on individuals or objects under investigation.
1 The variate – a linear combination of variables with empirically determined weights.
2 Measurement scales. Nonmetric: • Nominal – the size of the number is not related to the amount of the characteristic being measured. • Ordinal – larger numbers indicate more (or less) of the characteristic measured, but not how much more (or less). Metric: • Interval – contains ordinal properties and, in addition, there are equal differences between scale points. • Ratio – contains interval-scale properties and, in addition, there is a natural zero point.
The impact of the choice of measurement scale: A) The researcher must identify the measurement scale of each variable used, so that nonmetric data are not incorrectly used as metric data, and vice versa. B) The measurement scale is also critical in determining which multivariate techniques are the most applicable to the data, with considerations made for both independent and dependent variables.
3 Measurement error – distorts observed relationships and makes multivariate techniques less powerful; researchers use summated scales to reduce it. Assessing measurement error: • Validity – the degree to which a measure accurately represents what it is supposed to. • Reliability – the degree to which the observed variable measures the "true" value and is thus error free. Employing multivariate measurement: rather than relying only on improving individual variables, the researcher may choose to develop multivariate measurements, known as summated scales, to reduce measurement error.
MANAGING THE MULTIVARIATE MODEL:
1 Managing the variate. A) Variable specification – preparing the variables for analysis. • Use original variables – retains the most detailed attributes, but may suffer from multicollinearity. • Dimensional reduction – develop some form of composites of the original variables: a) user controlled – exploratory factor analysis (principal components or common factor analysis); b) software controlled – principal components regression (PCR). B) Variable selection – identifying the variables included in the analysis. • User controlled – the researcher explicitly defines the variables in the analysis (confirmatory or combinatorial). • Software controlled: a) sequential – a subset of variables is included based on an algorithm (e.g., stepwise); b) constrained – regression weights are constrained to force some to zero or very low values, effectively eliminating those variables from the analysis (e.g., ridge or LASSO).
2 Managing the dependence model. A) Single-equation versus multiple-equation models: • single equation – the most widely used methods in the past; • multiple equation – represents interrelated relationships. B) GLM versus GLZ/GLIM: • GLM – the classical OLS-based system of techniques; • GLZ/GLIM – developed for non-normal response variables.
3 Statistical significance and power. Type I error (α) – the probability of rejecting the null hypothesis when it is true. Type II error (β) – the probability of failing to reject the null hypothesis when it is false. Power (1 – β) – the probability of rejecting the null hypothesis when it is false. Power is determined by three factors: • Effect size – the actual magnitude of the effect of interest (e.g., the difference between means or the correlation between variables). • Alpha (α) – as α is set at smaller levels, power decreases; typically α = .05. • Sample size – as sample size increases, power increases; with very large sample sizes (1,000+), even very small effects can be statistically significant, raising the issue of practical versus statistical significance.
Rules of thumb – statistical power analysis: • Researchers should always design the study to achieve a power level of .80 at the desired significance level. • More stringent significance levels (e.g., .01 instead of .05) require larger samples to achieve the desired power level. • Conversely, power can be increased by choosing a less stringent alpha level (e.g., .10 instead of .05). • Smaller effect sizes always require larger sample sizes to achieve the desired power. • Any increase in power is most likely achieved by increased sample size.
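A minimal Python sketch of the power relationships above, assuming the statsmodels package; the effect size and the alpha/power values are illustrative, not taken from the text:

# Solve for the sample size per group needed to detect a medium effect
# (Cohen's d = 0.5) at alpha = .05 with power = .80, then show how a
# stricter alpha raises the required sample size, as the rules of thumb state.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_05 = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
n_01 = analysis.solve_power(effect_size=0.5, alpha=0.01, power=0.80)
print(f"n per group at alpha=.05: {n_05:.0f}")   # roughly 64
print(f"n per group at alpha=.01: {n_01:.0f}")   # noticeably larger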
CLASSIFICATION OF MULTIVARIATE TECHNIQUES:
1 Dependence techniques – a variable or set of variables is identified as the dependent variable to be predicted or explained by other variables known as independent variables. • Multiple regression • Multiple discriminant analysis • Logit/logistic regression • Multivariate analysis of variance (MANOVA) and covariance • Conjoint analysis • Canonical correlation • Structural equation modeling (SEM) • Partial least squares (PLS) modeling.
2 Interdependence techniques – involve the simultaneous analysis of all variables in the set, without distinction between dependent and independent variables. • Principal components and common factor analysis • Cluster analysis • Multidimensional scaling (perceptual mapping) • Correspondence analysis.
Selecting a multivariate technique. Dependence relationship: • How many variables are being predicted? • What is the measurement scale of the dependent variable? • What is the measurement scale of the predictor variable(s)? Interdependence relationship: • Are you examining relationships between variables, respondents, or objects?
TYPES OF MULTIVARIATE TECHNIQUES:
Dependence techniques – 1 Multiple regression: a single metric dependent variable is predicted by several metric independent variables. 2 MANOVA: several metric dependent variables are predicted by a set of nonmetric (categorical) independent variables. 3 Discriminant analysis: a single nonmetric (categorical) dependent variable is predicted by several metric independent variables. 4 Logistic regression: a single nonmetric dependent variable is predicted by several metric independent variables. 5 Canonical correlation: several metric dependent variables are predicted by several metric independent variables. 6 Conjoint analysis: a quasi-experimental design based on attributes and levels of attributes, which develops combinations of attributes/levels that are then evaluated by respondents. 7 Structural equation modeling (SEM): estimates multiple, interrelated dependence relationships based on two components – a) the measurement model and b) the structural model; two basic methodologies are CB-SEM (covariance-based SEM) and PLS-SEM (partial least squares, variance-based).
Interdependence techniques – 1 Exploratory factor analysis (EFA): analyzes the structure of the interrelationships among a large number of variables to determine a set of common underlying dimensions (factors). 2 Cluster analysis: groups objects (respondents, products, firms, variables, etc.) so that each object is similar to the other objects in its cluster and different from objects in all the other clusters. 3 Multidimensional scaling (MDS): identifies "unrecognized" dimensions that affect purchase behavior based on customer judgments of similarities or preferences, and transforms these into distances represented as perceptual maps. 4 Correspondence analysis: uses nonmetric data and evaluates either linear or nonlinear relationships in an effort to develop a perceptual map representing the association between objects (firms, products, etc.) and a set of descriptive characteristics of the objects.
GUIDELINES FOR MULTIVARIATE ANALYSES AND INTERPRETATION:
1 Establish practical significance as well as statistical significance – practical significance asks the question, "So what?"; applicable to both managerial and academic contexts.
2 Sample size affects all results – small samples provide too little statistical power or too easily "overfit" the data; large samples make the statistical tests overly sensitive.
3 Know your data – multivariate analyses require an even more rigorous examination of the data because the influence of outliers, violations of assumptions, and missing data can be compounded across several variables to create substantial effects.
4 Strive for model parsimony – irrelevant variables usually increase a technique's ability to fit the sample data, but at the expense of overfitting the sample data and making the results less generalizable to the population; even though irrelevant variables typically do not bias the estimates of the relevant variables, they can mask the true effects due to an increase in multicollinearity.
5 Look at your errors – the starting point for diagnosing the validity of the obtained results and an indication of the remaining unexplained relationships.
6 Simplify your models by separation – estimate separate models when possible (e.g., in the presence of moderators).
7 Validate your results – use split-sample or cross-validation to assess the generalizability of any model.
A STRUCTURED APPROACH TO MULTIVARIATE MODEL BUILDING:
Stage 1: Define the research problem, objectives, and multivariate technique(s) to be used. Stage 2: Develop the analysis plan. Stage 3: Evaluate the assumptions underlying the multivariate technique(s). Stage 4: Estimate the multivariate model and assess overall model fit. Stage 5: Interpret the variate(s). Stage 6: Validate the multivariate model.
CH2: EXAMINING YOUR DATA

THE CHALLENGE OF BIG DATA RESEARCH EFFORT:
1 Data management – many consider it the most daunting challenge; many times the majority of the research effort is expended in this task; complexity arises from a) merging disparate sources of data and b) the use of unstructured data.
2 Data quality – the true "value" of an analysis may rest in data quality; conceptualized in eight dimensions; many times "hidden" in the basic nature of the data (e.g., binary measures).
GRAPHICAL EXAMINATION:
1 The fundamental tool in data examination is graphical examination. Types of graphical tools: • Shape – histogram, bar chart, box-and-whisker plot, stem-and-leaf plot. • Relationships – scatterplot, outliers.
2 Preliminary examination of the data: univariate profiling – examining the shape of the distribution (histogram and the normal curve, stem-and-leaf diagram, frequency distribution). 3 Bivariate profiling – examining the relationship between variables (box-and-whisker plots, scatterplot) and examining group differences. 4 Multivariate profiling. New measures of association – a wide range of measures beyond the Pearson correlation now exists (parametric and nonparametric). 5 New measures from data mining: • Hoeffding's D – a nonparametric measure of association based on departures from independence. • dCor (distance correlation) – a distance-based measure of association that is also more sensitive to nonlinear patterns in the data. • MIC (mutual information correlation) – a pattern-matching approach amenable to identifying both nonlinear relationships and a range of distinct patterns.
MISSING DATA:
Missing data – information not available for a subject (or case) about whom other information is available; typically occurs when a respondent fails to answer one or more questions in a survey. Is it systematic or random? Impact: reduces the sample size available for analysis and can distort results. The researcher's concern is to identify the patterns and relationships underlying the missing data in order to stay as close as possible to the original distribution of values when any remedy is applied.
Two major issues are experiencing a resurgence of interest: • the wide range of data sources now being used in analysis, and • the expanded availability and improved usability of model-based methods of imputation. There is a corresponding increase in: • studies and applications across a wide range of disciplines, and • the availability of model-based methods in all major software.
FOUR-STEP PROCESS FOR IDENTIFYING MISSING DATA:
Step 1: Determine the type of missing data. Ignorable missing data – expected and part of the research design: • sample – a form of missing data where the excluded data remain part of the population; • part of data collection – e.g., skip patterns; • censored data – some data not yet observed (e.g., survival data). Not-ignorable missing data – data which must be addressed in the analysis: • known processes – identified due to procedural factors (e.g., data entry or data management); • unknown processes – primarily related to the respondent, where the important characteristic is the level of randomness (e.g., straight-lining, lack of attention). Levels of missingness – three levels: • item-level – missing data for an individual variable; • construct-level – missing data for an entire set of questions about a specific construct; • person-level – missing data related to an individual's willingness or ability to provide responses.
Step 2: Determine the extent of missing data. Basic question: is the extent or amount of missing data low enough not to affect the results, even if it operates in a nonrandom manner? Levels of analysis – percentage of data missing by: • variable – the most common form of assessment; • case – the amount of missing data across all variables for a case. Guidelines for deleting variables and/or cases: • 10 percent or less is generally acceptable – cases or observations with 10% or less missing data are amenable to any imputation strategy. • Sufficient sample size – be sure the missing data remedy provides an adequate sample size. • Cases with missing data on the dependent variable(s) are typically deleted. • When deleting a variable, ensure that alternative variables, hopefully highly correlated, are available to represent the intent of the original variable. • Perform the analysis both with and without the deleted cases or variables.
Step 3: Diagnose the randomness of the missing data processes. Levels of randomness: • Missing at Random (MAR) – the missing values of Y depend on X, but not on Y. Example: the observed Y values represent a random sample of the actual Y values for each value of X, but the observed data for Y do not necessarily represent a truly random sample of all Y values. • Missing Completely at Random (MCAR) – the observed values of Y are truly a random sample of all Y values, with no underlying association to the other observed variables; characterized as "purely haphazard missingness." • Not Missing at Random (NMAR) – a distinct, non-random pattern of missing data that is not related to any other variables. Example: all individuals with high income had missing data.
Diagnostic tests for levels of randomness. t-test of missingness – a test of differences between cases with missing data and cases without missing data on the other variables: 1) for a specific variable (e.g., X1), create two groups of cases – those with missing values on X1 and those with valid values on X1; 2) compare the two groups with a t-test for differences on the other variables in the analysis (e.g., X2, X3, …); 3) differences indicate MAR processes, no differences indicate MCAR processes. Little's MCAR test – analyzes the pattern of missing data on all variables and compares it with the pattern expected for a random missing data process; if no significant differences are found, the missing data can be classified as MCAR. MAR or MCAR? Useful for selecting a remedy, but less impactful when using model-based methods.
Step 4: Select the imputation method.
Imputation of MCAR data using only valid data – if MCAR, several approaches are available: 1) Using only valid data: a) complete case approach – use only cases with no missing data; b) all-available data – calculate imputed values based on all valid pairwise information. 2) Using known replacement data: • hot- or cold-deck imputation; • case substitution. 3) Calculating replacement values: • mean substitution – replaces missing values with the mean of the variable calculated from all valid responses; • regression imputation – predicts the missing values of a variable based on its relationship to the other variables in the dataset.
Imputation of a MAR missing data process – the best remedy is some form of model-based approach. Two forms of model-based imputation rely upon MAR relationships to estimate the missing data: • Maximum likelihood and EM – a single-step process of missing data estimation and model estimation; there is no imputation of individual cases, rather direct estimation of the means and covariance matrix. • Multiple imputation – estimation of imputed values for the missing data of individual cases by a specified model; multiple sets of imputed values are calculated, each set varying by adding a random element to the imputed values, and a separate dataset is formed for estimation; model estimates are made for each imputed dataset and then combined for the final model estimates.
Choosing between maximum likelihood and multiple imputation: multiple imputation uses conventional techniques for model estimation, while maximum likelihood is limited in the methods to which it applies.
Choosing imputation based on the extent of missing data: • Under 10% – any of the imputation methods can be applied; the complete case method has been shown to be the least preferred. • 10% to 20% – all-available, hot-deck and case-substitution, and regression methods are most preferred for MCAR data, whereas model-based methods are necessary with MAR missing data processes. • Over 20% – if imputation is necessary, the preferred methods are a) the regression method for MCAR situations and b) model-based methods when MAR missing data occur.
Choosing imputation based on the type of missing data process: • MCAR – any imputation method can provide unbiased estimates if the MCAR conditions are met, but the model-based methods are preferred. • MAR – only model-based methods.
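A minimal sketch of two of the replacement-value options above (mean substitution and a regression-style, model-based imputation), assuming pandas and scikit-learn; the DataFrame `df` and its column names are illustrative:

# Mean substitution vs. iterative (regression-based) imputation on a toy frame.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"X1": [1.0, 2.0, np.nan, 4.0],
                   "X2": [2.1, np.nan, 3.9, 5.2],
                   "X3": [0.5, 1.4, 2.2, 3.1]})

# Mean substitution: replace each missing value with the variable's mean.
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)

# Model-based style: each variable with missing data is predicted from the
# other variables, iterating until the estimates stabilize.
model_imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                             columns=df.columns)
print(mean_imputed.round(2))
print(model_imputed.round(2))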
OUTLIERS:
Outlier – an observation/response with a unique combination of characteristics, identifiable as distinctly different from the other observations/responses.
Contexts for defining outliers: • Pre-analysis context (a member of a population) – the focus is on each case as compared to the other observations under study. • Post-analysis context (meeting analysis expectations) – defines "normal" as the expectations (e.g., predicted values, group membership predictions, etc.) generated by the analysis of interest.
Impacts of outliers: • Practical impacts – outliers can have a substantial impact on the results of any analysis. • Substantive impacts – non-representative outliers can distort results and make them less generalizable to the population. Outliers can be: • good – they identify perhaps small but unique portions of the sample that should be included; • bad – they distort results and impact generalizability.
Classifying outliers by type of impact: • error outliers – differ from the expected values generated by the analysis; • interesting outliers – different enough to generate insight into the analysis; • influential outliers – different enough to substantively impact the results.
Reasons for outlier designation: • procedural error • extraordinary event • extraordinary observations • observations unique in their combination of values.
Detecting outliers: • Standardize the data and then identify outliers in terms of the number of standard deviations. • Examine the data using box plots, stem-and-leaf plots, and scatterplots. • Use multivariate detection (Mahalanobis D²). Univariate methods – examine all metric variables to identify unique or extreme observations: • for small samples (80 or fewer observations), outliers are typically defined as cases with standard scores of 2.5 or greater; • for larger sample sizes, increase the threshold value of standard scores up to 4; • if standard scores are not used, identify cases falling outside ranges corresponding to 2.5 versus 4 standard deviations, depending on the sample size. Bivariate methods – focus on specific variable relationships, such as independent versus dependent variables: • use scatterplots with confidence intervals at a specified alpha level. Multivariate methods – best suited for examining a complete variate, such as the independent variables in regression or the variables in factor analysis: • threshold levels for the D²/df measure should be very conservative (.005 or .001), resulting in values of 2.5 for small samples versus 3 or 4 in larger samples.
Impact of dimensionality – increased dimensionality (i.e., an increased number of variables) dramatically affects outlier detection and designation in three ways: 1) Distance measures become less useful – higher levels of dimensionality create a "natural" dispersion among observations that makes distance measures less useful for identifying outlying observations. 2) Impact of irrelevant variables – as dimensionality increases, the presence of irrelevant variables becomes more likely, confounding the ability to identify outliers. 3) Comparability of dimensions – as dimensionality increases through the use of multiple sources of data, especially unstructured data, methods for assessing comparability among observations become more difficult.
Dealing with outliers: 1) Outlier designation – researcher judgment should guide the designation of outliers rather than a strictly empirical designation. 2) Outlier description and profiling – outliers should be described on the variables used to compare observations; profiles on additional variables should be generated when possible to provide more insight into the character of the outliers. 3) Retention versus deletion – outliers should be retained unless demonstrable proof indicates that they are truly aberrant and not representative of any observations in the population; if possible, generate results with and without the outliers to assess their impact; methods to minimize outlier influence (e.g., robust methods) are available.
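A minimal sketch of the multivariate (Mahalanobis D²) screening described above, using numpy/scipy only; the data matrix is simulated and the .001 cutoff is one of the conservative levels named in the text:

# Flag multivariate outliers by comparing Mahalanobis D**2 against a
# chi-square threshold at a conservative significance level (.001).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # stand-in for the metric variables
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # D**2 for each case

threshold = chi2.ppf(1 - 0.001, df=X.shape[1])       # df = number of variables
outliers = np.where(d2 > threshold)[0]
print("D2/df for flagged cases:", (d2[outliers] / X.shape[1]).round(2))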
TESTING THE ASSUMPTIONS OF MULTIVARIATE ANALYSIS:
Need for testing of assumptions – assumptions are the foundation for making statistical inferences and interpreting results. The need is increased in multivariate analysis because the complexity of the analysis 1) makes the potential distortions and biases more potent when the assumptions are violated, and 2) may mask the indicators of assumption violations that would be apparent in simpler univariate analyses. Important note: assumptions must be tested twice – • for the individual variables, to understand the basic sources of problems, and • for the variate, to assess the combined effect across all variables.
Four important statistical assumptions:
1 Normality – comparison of the distribution to the normal distribution; the basis for statistical inference from the sample to the population. Univariate versus multivariate normality: • univariate normality – each individual variable; • multivariate normality – combinations of variables. Impacts of assumption violations: • shape of the distribution – skewness versus kurtosis; • impact of sample size – increased sample size reduces the detrimental effects. Testing for normality: • visual check of the histogram or normal probability plot; • statistical tests of skewness and kurtosis. Remedies: most often some form of data transformation.
2 Homoscedasticity – the variance of the error terms appears constant over the range of the predictor variables; heteroscedasticity is when the error terms have increasing or modulating variance; analysis of residuals best illustrates this point. Impact of heteroscedasticity – inflates or deflates standard errors. Sources: • variable type – common in percentages or proportions; • a skewed distribution of one or both variables. Tests: • graphical tests; • statistical tests – a) Levene test (univariate), b) Box's M (multivariate). Remedies: • transformation of the variable(s); • use of heteroscedasticity-consistent standard errors (HCSE).
3 Linearity – the relationship is represented by a straight line (i.e., a constant unit change (slope) of the dependent variable for a constant unit change of the independent variable). Nonlinear relationships can be very well defined, but are seriously understated unless a) the data are transformed to a linear pattern, or b) explicit model components are used to represent the nonlinear portion of the relationship.
4 Non-correlated errors – prediction errors are uncorrelated with each other. Correlated errors arise from a process that must be treated much like missing data: a) the researcher must first identify and define the "causes" among variables, either internal or external to the dataset (e.g., grouping or time series); b) if they are not found and remedied, serious biases can occur in the results, many times unknown to the researcher. Remedies: a) inclusion of the omitted causal factor underlying the correlation of errors; b) applying specialized model forms (e.g., multi-level linear models).
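A quick sketch of checking two of the four assumptions (normality of an individual variable and homoscedasticity across groups via the Levene test named above), assuming scipy; the data are simulated for illustration:

# Skewness/kurtosis plus a formal normality test, and a Levene test for
# unequal variances across two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(size=200)                              # a skewed variable
g1, g2 = rng.normal(0, 1, 100), rng.normal(0, 3, 100)    # unequal variances

print("skew:", round(stats.skew(x), 2), "kurtosis:", round(stats.kurtosis(x), 2))
print("Shapiro-Wilk p-value:", round(stats.shapiro(x).pvalue, 4))   # small p -> non-normal
print("Levene p-value:", round(stats.levene(g1, g2).pvalue, 4))     # small p -> heteroscedastic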
DATA TRANSFORMATIONS:
Transformations provide a means of modifying variables for one of four reasons: 1) Enhancing statistical properties – primarily to achieve normality, homoscedasticity or linearity. 2) Ease of interpretation – • standardization, performed across cases to provide a common metric for comparison; • centering, performed within-case to allow comparison across variables. 3) Representing specific relationship types – transformed variables represent unique relationships (e.g., elasticity). 4) Simplification – • binning: categorization of values into a smaller number of categories (i.e., reduced cardinality); a) dichotomization is frequently employed to form two groups (e.g., a mean split); b) extreme groups define three categories and eliminate the middle group to accentuate differences; • smoothing: use of response-surface methods or other techniques to represent generalized patterns in the data.
Guidelines for transforming data – when explanation is important, beware of transformation: 1) To judge the potential impact of a transformation, calculate the ratio of the variable's mean to its standard deviation: a) noticeable effects should occur when the ratio is less than 4; b) when the transformation can be performed on either of two variables, select the variable with the smallest ratio. 2) Transformations are generally applied to the independent variables except in the case of heteroscedasticity: a) heteroscedasticity can be remedied only by transformation of the dependent variable in a dependence relationship; if a heteroscedastic relationship is also nonlinear, the dependent variable, and perhaps the independent variables, must be transformed. 3) Transformations may change the interpretation of the variables: a) for example, transforming variables by taking their logarithm translates the relationship into a measure of proportional change (elasticity) – always be sure to explore thoroughly the possible interpretations of the transformed variables; b) use variables in their original (untransformed) format when profiling or interpreting results.
Dummy variable – a nonmetric independent variable that has two distinct levels coded 0 and 1; dummy variables act as replacement variables to enable multi-category (3 or more) nonmetric variables to be used as metric variables.
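A small sketch of the transformations and dummy-variable coding just described, assuming pandas; the variable names and values are illustrative only:

# Log transform, standardization, mean-centering, and dummy coding.
import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [120.0, 80.0, 300.0, 45.0],
                   "region": ["north", "south", "west", "north"]})

df["log_sales"] = np.log(df["sales"])                                       # proportional-change (elasticity) reading
df["sales_std"] = (df["sales"] - df["sales"].mean()) / df["sales"].std()    # standardized (mean 0, sd 1)
df["sales_ctr"] = df["sales"] - df["sales"].mean()                          # mean-centered

# A 3-category nonmetric variable becomes two 0/1 dummies; the omitted
# category ("north" here) is the reference category for interpretation.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(pd.concat([df, dummies], axis=1))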
CH3: EXPLORATORY FACTOR ANALYSIS (EFA)
An interdependence technique whose primary purpose is to define the underlying structure among the variables in the analysis. • EFA is a summarization and data reduction technique that does not have independent and dependent variables; it is an interdependence technique in which all variables are considered simultaneously.
Types of factor analysis: • Exploratory factor analysis (EFA) – used to discover the factor structure of a construct and examine its reliability; it is data driven. • Confirmatory factor analysis (CFA) – used to confirm the fit of a hypothesized factor structure to the observed (sample) data; it is theory driven.
STAGE 1: OBJECTIVES OF EFA. Specifying the unit of analysis: • variables – to summarize characteristics, use R-factor analysis; • cases – grouping cases, similar to cluster analysis, is termed Q-factor analysis. Data summarization versus data reduction. Variable selection – three elements: 1) variable specification; 2) factors are always produced; 3) factors require multiple variables. Using factor analysis with other multivariate techniques.
STAGE 2: DESIGNING AN EFA. Variable selection – specialized methods exist for the use of dummy variables, but a small number of "dummy variables" can be included in a set of metric variables that are factor analyzed. Sample size – a minimum sample size of 50 observations. Correlation matrix – a correlation matrix among variables ("R analysis") is most often used, although a similarity/distance matrix among cases ("Q analysis") is also an option.
Rules of thumb (ROT 3-1) – factor analysis design: 1) Factor analysis is performed most often only on metric variables, although specialized methods exist for the use of dummy variables; a small number of "dummy variables" can be included in a set of metric variables that are factor analyzed. 2) If a study is being designed to reveal factor structure, strive to have at least five variables for each proposed factor. 3) For sample size: • the sample must have more observations than variables; • the minimum absolute sample size should be 50 observations; • strive to maximize the number of observations per variable, with a desired ratio of 5 observations per variable.
STAGE 3: ASSUMPTIONS IN FACTOR ANALYSIS. 1) Conceptual: the factor solution is assumed to be homogeneous across subsamples. 2) Statistical issues: • a statistically significant Bartlett's test of sphericity (sig. < .05) indicates that sufficient correlations exist among the variables to proceed with an exploratory factor analysis; • the KMO measure predicts whether data are likely to factor well, based on correlations and partial correlations; a KMO statistic exists for each individual variable as well as an overall value, and it can vary from 0 to 1.0; the KMO value must exceed .50 for both the overall test and each individual variable.
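A minimal sketch of the Stage 3 checks and a small EFA, assuming the third-party factor_analyzer package (its calculate_bartlett_sphericity, calculate_kmo and FactorAnalyzer interfaces); the simulated six-item DataFrame is purely illustrative:

# Bartlett's test (want p < .05), KMO (want > .50), then a two-factor
# solution with a varimax (orthogonal) rotation.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer, calculate_bartlett_sphericity, calculate_kmo

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 2))
items = np.column_stack([base[:, 0] + rng.normal(0, .5, 200) for _ in range(3)] +
                        [base[:, 1] + rng.normal(0, .5, 200) for _ in range(3)])
df = pd.DataFrame(items, columns=[f"v{i}" for i in range(1, 7)])

chi2_val, p_val = calculate_bartlett_sphericity(df)
kmo_per_item, kmo_overall = calculate_kmo(df)
print("Bartlett p:", round(p_val, 4), " overall KMO:", round(kmo_overall, 2))

fa = FactorAnalyzer(n_factors=2, rotation="varimax")
fa.fit(df)
print(pd.DataFrame(fa.loadings_, index=df.columns).round(2))   # rotated loadings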
STAGE 4: DERIVING FACTORS AND ASSESSING OVERALL FIT.
1 Partitioning the variance: • common variance – variance of a variable that is shared with all other variables in the analysis; • unique variance – composed of specific variance (which cannot be explained by the correlations with the other variables but reflects the variable's unique characteristics) and error variance (due to unreliability in the data-gathering process). Optimal scaling is a process for deriving interval measurement properties for variables which were originally nominal or ordinal measures.
Principal component analysis (used when data reduction is the primary concern and prior knowledge suggests that specific and error variance represent a relatively small proportion of the total variance) versus common factor analysis (objective is to identify the latent dimensions or constructs represented in the common variance of the original variables; used when the researcher has little knowledge about the specific and error variance).
Stopping rules – criteria for the number of factors to extract: • A priori criterion – the researcher knows how many factors to extract and specifies that number. • Latent root criterion (Kaiser rule) – factors having latent roots (eigenvalues) greater than 1 are considered significant; most applicable to principal components analysis, where the diagonal value representing the amount of variance for each variable is 1.0; less accurate with a small number of variables or lower communalities. • Percentage of variance – extract enough components to achieve a specified cumulative percentage of total variance extracted. • Scree test – identify the optimum number of factors that can be extracted before the amount of unique variance begins to dominate the common variance structure; the elbow is the point at which the curve first begins to straighten out. • Parallel analysis – an empirical criterion based on the specific characteristics of the data, comparing observed eigenvalues against those from randomly generated data.
Rules of thumb: • The principal component analysis model is most appropriate when data reduction is paramount. • The decision on the number of retained factors should be based on several considerations: factors with eigenvalues greater than 1.0; enough factors to reach a specified percentage of variance explained (usually 60%); factors which have eigenvalues greater than those from randomly generated data (i.e., factors above the threshold established by parallel analysis); and more factors when there is heterogeneity among sample subgroups.
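A small numpy-only sketch of two stopping rules named above (the latent-root/Kaiser rule and a simple parallel analysis); the data are simulated and the implementation is a rough illustration, not the textbook procedure:

# Compare observed correlation-matrix eigenvalues against 1.0 (Kaiser) and
# against the average eigenvalues from random data of the same size.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
X[:, 1] += X[:, 0]; X[:, 3] += X[:, 2]          # build in two correlated pairs
obs_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

rand_eigs = np.mean([np.sort(np.linalg.eigvalsh(
    np.corrcoef(rng.normal(size=X.shape), rowvar=False)))[::-1]
    for _ in range(100)], axis=0)

print("Kaiser rule keeps:", int((obs_eigs > 1.0).sum()), "factors")
print("Parallel analysis keeps:", int((obs_eigs > rand_eigs).sum()), "factors")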
STAGE 5: INTERPRETING THE FACTORS – a three-step process:
1) Estimate the factor matrix – compute the unrotated factor loadings (the correlation of each variable with each factor).
2) Factor rotation – the ultimate effect of rotating the factor matrix is to redistribute the variance from earlier factors to later ones to achieve a simpler, theoretically more meaningful factor pattern. Alternative methods: • Orthogonal rotation – when the research goal is data reduction to either a smaller number of variables or a set of uncorrelated measures for subsequent use in other multivariate techniques: Quartimax (simplifies rows), Varimax (simplifies columns), Equimax (a combination). • Oblique rotation – for obtaining several theoretically meaningful factors or constructs because, realistically, very few constructs in the "real world" are uncorrelated.
3) Assess the significance of factor loadings: • equivalent to zero – less than ±.10; • minimal level – in the range of ±.30 to ±.40; • practically significant – values of ±.50 or greater; • well-defined structure – exceeding ±.70. A smaller loading is sufficient given either a larger sample size or a larger number of variables; a larger loading is needed given a factor solution with more factors.
Interpreting a factor matrix – a five-step process: 1) Examine the factor matrix of loadings. 2) Identify cross-loadings: a) identify potential cross-loadings; b) compute the ratio of the squared loadings; c) designate the pair of loadings based on the ratio – between 1.0 and 1.5: problematic cross-loading; between 1.5 and 2.0: potential cross-loading; greater than 2.0: ignorable cross-loading. 3) Assess the communalities of the variables – they should be greater than 0.5. 4) Respecify the factor model if needed: • ignore the problematic variable; • delete variables from the analysis; • use a different rotational approach; • extract a different number of factors; • change the method of extraction. 5) Label the factors – variables with higher loadings are considered more important and have greater influence on the name or label selected to represent a factor.
STAGE 6: VALIDATION OF EXPLORATORY FACTOR ANALYSIS. • Use of replication or a confirmatory perspective. • Assessing factor structure stability. • Detecting influential observations.
STAGE 7: ADDITIONAL USES OF EFA RESULTS.
A) Selecting surrogate variables for subsequent analysis. Advantages: • simple to administer and interpret. Disadvantages: • does not represent all "facets" of a factor; • prone to measurement error.
B) Creating summated scales. Advantages: • a compromise between the surrogate variable and factor score options; • reduces measurement error; • represents multiple facets of a concept; • easily replicated across studies. Disadvantages: • includes only the variables that load highly on the factor; • not necessarily orthogonal; • requires extensive analysis of reliability and validity issues.
Assessing summated scales: a) Unidimensionality – never create a summated scale without first assessing its unidimensionality with exploratory or confirmatory factor analysis. b) Reliability – should exceed a threshold of .70, although a .60 level can be used in exploratory research; the threshold should be raised as the number of items increases, especially as the number of items approaches 10 or more (a reliability computation is sketched below). c) Validity: • convergent validity – the scale correlates with other similar scales; • discriminant validity – the scale is sufficiently different from other related scales; • nomological validity – the scale "predicts" as theoretically suggested.
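A minimal sketch of the reliability check against the .70 threshold above, computing Cronbach's alpha by hand with numpy; the scale items are simulated for illustration:

# Cronbach's alpha for a summated scale, computed from its item scores.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: cases x items matrix of scores for one scale."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(4)
truth = rng.normal(size=(150, 1))
scale_items = truth + rng.normal(0, .7, size=(150, 4))   # four items tapping one construct
print(round(cronbach_alpha(scale_items), 2))             # compare against the .70 threshold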
C) Computing factor scores. Advantages: • represents all variables loading on the factor; • the best method for complete data reduction; • by default, factors (and factor scores) are orthogonal and can avoid complications caused by multicollinearity. Disadvantages: • interpretation is more difficult since all variables contribute through their loadings; • difficult to replicate across studies.
CH5: MULTIPLE REGRESSION
A statistical technique that can be used to analyze the relationship between a single dependent (criterion) variable and several independent (predictor) variables. Variables: • dependent – metric; • independent – metric or transformed nonmetric (through dummy-variable coding).
MULTIPLE REGRESSION DECISION PROCESS
Stage 1: Objectives of multiple regression. The researcher must consider three primary issues: 1) The appropriateness of the research problem – a predictive purpose (maximize predictive accuracy by ensuring the validity of the set of independent variables, or model comparison by comparing two or more sets of independent variables to ascertain the predictive power of each variate) versus an explanatory purpose (the relative importance of the independent variables, assessed through the magnitude and direction of each independent variable's effect, the nature of the relationships with the dependent variable, and the nature of the relationships among the independent variables, i.e., multicollinearity). 2) Specification of a statistical relationship – multiple regression is appropriate when the researcher is interested in a statistical relationship, not a functional relationship (a functional relationship expects no error in prediction and calculates an exact value; in a statistical relationship some random component is always present – error in predicting the dependent variable – and an average value is estimated). 3) Selection of the dependent and independent variables: a) support the selection with conceptual or theoretical reasoning; b) measurement error; c) specification error. Measurement error is problematic but can be addressed through either of two approaches: summated scales, which mitigate measurement error, or structural equation modeling procedures, which can directly accommodate measurement error. Specification error is the exclusion of relevant and the inclusion of irrelevant independent variables; when in doubt, include potentially irrelevant variables (they can only confuse interpretation) rather than omit a relevant variable (which can bias all regression estimates).
Stage 2: Research design of a multiple regression analysis. Issues to consider: 1) Sample size. 2) Creating additional variables: • Nonmetric variables can only be included in a regression analysis by creating dummy variables; dummy variables can only be interpreted in relation to their reference category. • Moderator effect – when the moderator variable, a second independent variable, changes the form of the relationship between another independent variable and the dependent variable; the moderator term is a compound variable formed by multiplying X1 by the moderator X2 (X1X2); the coefficient (b3) of the interaction/moderator term indicates the unit change in the effect of X1 as X2 changes, and the coefficients (b1, b2) of the two independent variables now represent their effects when the other independent variable is zero. • Mediator effect – when the effect of an independent variable may "work through" an intervening variable (the mediating variable) to predict the dependent variable.
Stage 3: Assumptions in multiple regression analysis – four primary assumptions: 1) linearity of the phenomenon measured; 2) homoscedasticity (constant variance of the error terms); 3) normality of the error term distribution; 4) independence of the error terms.
STAGE 4: ESTIMATING THE REGRESSION MODEL AND ASSESSING OVERALL MODEL FIT.
Variable specification – using variables in their original form allows direct measures of the variables of interest, but as the number of variables increases, interpretability may become problematic. Dimensional reduction can be either software controlled or user controlled: • software controlled – the software independently forms the dimensional reduction and then proceeds with the analysis (e.g., principal components regression); • user controlled – the researcher performs some form of dimensional reduction (e.g., exploratory factor analysis) and forms composites which are then substituted for the original variables in the analysis.
Components of model fit: • Total sum of squares (SST) – the total amount of variation that exists to be explained by the independent variables; SST = SSE + SSR. • Sum of squared errors (SSE) – the variance in the dependent variable not accounted for by the regression model (the residual); the objective is to obtain the smallest possible sum of squared errors as a measure of prediction accuracy. • Sum of squares regression (SSR) – the amount of improvement in explanation of the dependent variable attributable to the independent variables. • F statistic – the statistical significance of the overall model; significance means it is unlikely that the sample would produce a large R² when the population R² is actually zero; a rule of thumb is that the probability must be < .05 for statistical significance. • R² (coefficient of determination) – ranges from 0 to 1.0, with a large R² indicating that the linear relationship works well. • Adjusted R² – adjusts R² for the number of independent variables relative to the sample size.
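A minimal sketch of fitting an OLS model and reading off the fit measures just listed, assuming statsmodels; the simulated data and coefficients are illustrative:

# Fit OLS and report R-squared, adjusted R-squared, and the overall F test.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, 100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print("R2:", round(model.rsquared, 3), " adj. R2:", round(model.rsquared_adj, 3))
print("F:", round(model.fvalue, 1), " p(F):", round(model.f_pvalue, 4))   # want p < .05
print(model.params.round(2))   # intercept and unstandardized coefficients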
Influential observations – include all observations that lie outside the general patterns of the data set or have a disproportionate effect on the regression results. Three basic types, based upon the nature of their impact on the results: • outliers – observations that have large residual values and can be identified only with respect to a specific regression model; • leverage points – observations that are distinct from the remaining observations based on their independent variable values; • influential observations – the broadest category, including all observations that have a disproportionate effect on the regression results; influential observations potentially include outliers and leverage points, but may include other observations as well. Impacts: reinforcing, conflicting, shifting.
Identifying influential observations: Step 1 – examine residuals (defined by a) the cases used to calculate them and b) the type of standardization) and partial regression plots (which depict the relationship of a variable controlling for the other variables). Step 2 – identify leverage points: observations substantially different on one or more independent variables; diagnostic measures include the hat matrix and Mahalanobis distance (D²). Step 3 – single-case diagnostics. Step 4 – select influential observations: a) no issues – fit well and no extreme values; b) high leverage, but not an outlier – very different on the IVs, but still predicted well by the model; c) outliers, but acceptable leverage – high residual, but no extreme values on the IVs; d) outliers and high leverage – poor prediction and quite different on the IVs.
STAGE 5: INTERPRETING THE REGRESSION VARIATE. Key functions: 1) Prediction. 2) Explanation – interpretation with the regression coefficients, the primary measure of the relative impact and importance of the independent variables in their relationship with the dependent variable. Standardizing the regression coefficients: beta coefficients convert all variables to a common scale and variability, the most common being a mean of zero (0.0) and a standard deviation of one (1.0).
Multicollinearity – a relationship between two (collinearity) or more (multicollinearity) independent variables; multicollinearity occurs when any single independent variable is highly correlated with a set of other independent variables. Steps in assessing and addressing multicollinearity: 1) understand the newer measures of correlation which incorporate multicollinearity; 2) assess the degree of multicollinearity; 3) determine its impact on the results; 4) apply the necessary remedies if needed. Three measures of correlation: bivariate (zero-order) correlation, semi-partial (part) correlation, and partial correlation.
Identifying multicollinearity: • Variance inflation factor (VIF) – measures how much the variance of the regression coefficients is inflated by multicollinearity; the square root of the VIF is the expected increase in the standard error of the coefficient. • Tolerance – the amount of variance of an independent variable that is not explained by the other independent variables (i.e., the independent variable is treated as a dependent variable predicted by all the other independent variables). VIF and Tolerance are inversely related: VIF = 1 / Tolerance (a computation is sketched below).
Effects of multicollinearity. Impacts on estimation: a decrease in explained variance, singularity, reversal of the signs of coefficients, and increases in standard errors. Impacts on explanation: since coefficients only represent unique explanation, multicollinearity can obscure the total effect of a variable, which requires newer measures of relative importance. Diagnostics: • bivariate correlations – values of .70 or higher may result in problems, and lower values may be problematic if they are higher than the correlations with the dependent variable; • Tolerance or VIF – tolerance values of .20 or lower, corresponding to a VIF of 5 or more, almost always indicate problems with multicollinearity; VIF values of even 3 to 5 may result in interpretation or estimation problems, particularly when the relationships with the dependent variable are weaker.
Remedies for multicollinearity: • delete the collinear variable(s); • apply dimensional reduction, such as composites from exploratory factor analysis; • use specific estimation techniques – Bayesian regression or principal components regression; • do nothing – particularly if the model is used solely for prediction, but this is still risky.
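A minimal sketch of the VIF/Tolerance diagnostic, assuming statsmodels; the variables (x1, x2, x3) are simulated, with x2 deliberately collinear with x1:

# Compute VIF and Tolerance (= 1/VIF) for each independent variable.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.3, 200)          # nearly redundant with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF={vif:.2f}  Tolerance={1/vif:.2f}")   # flag VIF >= 5 (Tolerance <= .20)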
STAGE 6: VALIDATION OF THE RESULTS. • Additional or split samples. • Calculating the PRESS (prediction sum of squares) statistic. • Comparing regression models. • Forecasting.
ILLUSTRATION OF A REGRESSION ANALYSIS. Stage 1: objectives. Stage 2: research design – thirteen independent variables (X6 to X18); meets the minimum ratio of observations per variable (7:1) with adequate power. Stage 3: assumptions – • linearity: graphical analysis did not reveal nonlinear relationships; • homoscedasticity: only two variables (X6 and X17) had minimal violations; • normality: six variables indicated violations, thus requiring further analysis.
CH8: LOGISTIC REGRESSION
A specialized form of regression designed to predict and explain a binary (two-group) categorical variable rather than a metric dependent measure.
Characteristics: • Its variate is similar to regular regression and is made up of metric independent variables. • It is less affected than discriminant analysis when the basic assumptions, particularly normality of the independent variables, are not met.
When logistic regression is preferred: • Discriminant analysis assumes multivariate normality and equal variance–covariance matrices across groups, and these assumptions are often not met; logistic regression does not face these strict assumptions and is much more robust when they are violated, making its application appropriate in many situations. • Even if the assumptions are met, some researchers prefer logistic regression because it is similar to multiple regression: it has straightforward statistical tests, similar approaches to incorporating metric and nonmetric variables and nonlinear effects, and a wide range of diagnostics.
DECISION PROCESS FOR LOGISTIC REGRESSION
Stage 1: Objectives of logistic regression – explanation and classification.
Stage 2: Research design for logistic regression. Use of the logistic curve: • Since the dependent variable can take on only two values (1 or 0), a relationship form is needed that is nonlinear, so that the predicted values will never fall below 0 or above 1. • This is not possible with multiple regression, hence the use of the logistic curve. • A binary outcome also violates the assumptions of multiple regression, requiring a different model form as well. Sample size: should be about 400.
Stage 3: Assumptions of logistic regression – an advantage of the technique is that it makes few: • The primary assumption is the independence of observations, which, if violated, requires some form of hierarchical/nested model approach. • An inherent assumption that should be addressed with the Box–Tidwell test is the linearity of the independent variables, especially continuous variables, with the outcome.
Stage 4: Estimation of the logistic regression model and assessing overall fit. Uses the specific form of the logistic curve, which is S-shaped, to keep predicted values within the range of 0 to 1. Estimating the logistic regression model involves two basic steps: • transforming a probability into odds and logit values, and • model estimation using a maximum likelihood approach, not least squares as in regular multiple regression. The logistic transformation has two basic steps: • restating a probability as odds, then • calculating the logit values. Maximum likelihood estimation maximizes the likelihood that an event will occur – the event being that a respondent is assigned to one group versus another; the basic measure of how well the maximum likelihood estimation procedure fits is the likelihood value. Comparisons of likelihood values follow three steps: • estimate a null model, which acts as the "baseline" for making comparisons of improvement in model fit; • estimate the proposed model, the model containing the independent variables to be included in the logistic regression; • compare the two likelihood values to assess the improvement in fit.
Measures of model fit: • Global null hypothesis test – a test for the significance of any of the estimated coefficients. • Pseudo R² measures – interpreted in a manner similar to the coefficient of determination in multiple regression; different pseudo R² measures vary widely in magnitude and no one version has been deemed most preferred; for all pseudo R² measures, however, the values tend to be much lower than for multiple regression models.
Issues in model estimation: 1) small sample size; 2) complete separation; 3) quasi-complete separation (the zero-cell effect).
Predictive accuracy – the ability to classify observations into the correct outcome group. Classification generates four outcomes: • true negatives • true positives • false negatives • false positives. Predictive accuracy of actual outcomes: • sensitivity – the true positive rate, the percentage of positive outcomes correctly predicted; • specificity – the true negative rate, the percentage of negative outcomes correctly predicted. Predictive accuracy of predicted outcomes: • PPV (positive predictive value) – the percentage of positive predictions that are correct; • NPV (negative predictive value) – the percentage of negative predictions that are correct.
Two types of casewise diagnostics, similar to multiple regression: • Residuals – both Pearson and deviance residuals reflect standardized differences between the predicted probabilities and the outcome values (0 and 1); values above ±2 merit further attention. • Influential observations – influence measures reflect the impact on model fit and estimated coefficients if an observation is deleted from the analysis; comparable to the measures found in multiple regression.
Stage 5: Interpretation of the results. In logistic regression a zero coefficient corresponds to odds of 1.0 – no change. Wald statistic – a measure of statistical significance for each estimated coefficient, so that hypothesis testing can occur just as it does in multiple regression. Interpreting the coefficients: • Original logistic coefficient – the estimated parameter from the logistic model that reflects the change in the logged odds value (logit) for a one-unit change in the independent variable; it is similar to a regression weight or discriminant coefficient. • Exponentiated logistic coefficient – the antilog of the logistic coefficient, used for interpretation purposes in logistic regression; the exponentiated coefficient minus 1.0 equals the percentage change in the odds.
Stage 6: Validation of the results – involves ensuring both the internal and external validity of the results. • Holdout versus estimation samples – the most common form of estimating external validity is the creation of a holdout or validation sample and calculation of the hit ratio. • Cross-validation – typically achieved with a jackknife or "leave-one-out" process of calculating the hit ratio.
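A minimal sketch tying together the estimation and accuracy measures above, assuming statsmodels and scikit-learn; the simulated data, coefficients and the 0.5 classification cutoff are illustrative:

# Fit a logit model by maximum likelihood, exponentiate the coefficients,
# and compute sensitivity, specificity, PPV and NPV from the confusion matrix.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 2))
logit_true = -0.5 + 1.2 * X[:, 0] - 0.7 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit_true)))        # binary outcome

res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)         # maximum likelihood estimation
print("odds ratios:", np.exp(res.params).round(2))        # exponentiated coefficients

pred = (res.predict(sm.add_constant(X)) >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("sensitivity:", round(tp / (tp + fn), 2), " specificity:", round(tn / (tn + fp), 2))
print("PPV:", round(tp / (tp + fp), 2), " NPV:", round(tn / (tn + fn), 2))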
CH4: CLUSTER ANALYSIS
A group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess.
Conceptual development with cluster analysis: • Data reduction – reduces the population to a smaller number of homogeneous groups. • Hypothesis generation – a means of developing or assessing hypotheses. Necessity of conceptual support: strong conceptual support for the existence of clusters helps negate the criticisms that • cluster analysis is descriptive, atheoretical, and non-inferential, • cluster analysis will always create clusters, regardless of the actual existence of any structure, and • the cluster solution is not generalizable because it is totally dependent upon the cluster variate.
Three basic questions: • How do we measure similarity? We require a method of simultaneously comparing observations on the clustering variables; several methods are possible, including the correlation between objects or a measure of their proximity in two-dimensional space such that the distance between observations indicates similarity. • How do we form clusters? No matter how similarity is measured, the procedure must group the observations that are most similar into a cluster, thereby determining the cluster membership of each observation for each set of clusters formed. • How many groups do we form? The final task is to select one cluster solution (i.e., set of clusters) as the final solution; in doing so, the researcher faces a trade-off between fewer clusters with less within-group homogeneity and a larger number of clusters with more within-group homogeneity.
Two key elements: 1) Subjectivity in selecting the final solution – while there are empirical diagnostic measures to assist the researcher in selecting the final solution(s), there are no methods by which one solution is deemed optimal; it still falls to the researcher to make the final decision as to the number of clusters to accept as the final solution. 2) Judgment required of the researcher – the researcher's judgment in selecting the characteristics to be used, the methods of combining clusters, and even the interpretation of cluster solutions makes any final solution unique to that researcher.
DECISION PROCESS:
Stage 1: Objectives of cluster analysis. Primary goal: to partition a set of objects into two or more groups based on the similarity of the objects for a set of specified characteristics (the cluster variate). Two key issues: • the research questions being addressed, and • the variables used to characterize objects in the clustering process. Research questions: how to form the taxonomy, how to simplify the data, and which relationships can be identified. Selection of clustering variables – two issues: 1) conceptual considerations – variables characterize the objects being clustered and relate specifically to the objectives of the cluster analysis; 2) practical considerations – always use the "best" variables available (i.e., little measurement error, etc.). Cluster analysis is used for: • taxonomy description – identifying natural groups within the data; • data simplification – the ability to analyze groups of similar observations instead of all individual observations; • relationship identification – the simplified structure from cluster analysis portrays relationships not revealed otherwise. Theoretical, conceptual and practical considerations must be observed when selecting clustering variables: only variables that relate specifically to the objectives of the cluster analysis should be included, since "irrelevant" variables cannot be excluded once the analysis begins. Clustering high-dimensional data creates difficulties in establishing object similarity and in ensuring variable relevancy.
Illustration (HBAT): Stage 1 – objectives: the primary objective is to develop a taxonomy that segments objects (HBAT customers) into groups with similar perceptions; once identified, strategies with different appeals can be formulated for the separate groups – the requisite basis for market segmentation. Stages 2 (detecting outliers, defining similarity, sample size, standardization) and 3 (sample representation, multicollinearity): research design and assumptions in cluster analysis. Stage 4: hierarchical and nonhierarchical methods. Stage 5: profiling the hierarchical/nonhierarchical cluster results. Stage 6: validation and profiling the clusters.
solutions makes any final solution unique to that researcher.DECISION accurate cluster memberships.Should The Cluster Analysis Be and Nonhierarchical Methods. Stage 5:Profiling the Hierarchical Cluster
PROCESS:Stage 1: Objectives of Cluster Analysis:Primary Goal• to partition Respecified:Primary focus• Identification of single object or very small clusters Results/ nonhierrarchial method. Stage 6: Validation and Profiling the
a set of objects into two or more groups based on the similarity of the objects for a that represent disparate observations that do not match the research objectives.• Clusters. Ch6:ReEmergence of Experimentation:Experimentation long been
set of specified characteristics (the cluster variate).Two key issues:• The research Similar considerations to outlier identification and many times operate on the the foundational principle of the scientificmethod of research.Yet while social
questions being addressed, and• The variables used to characterize objects in the same conditions in the sample. If respecification occurs• Should reanalyze sciences have been less frequent users of experimentation in the past, that
clustering process.Research Question: How to form the taxonomy,How to remaining data, especially when using hierarchical procedures.Determining the trend is being reversed.• Traditional randomized laboratory experiments
simplify the data,Which relationships can be identified.Selection of Clustering Number of Clusters: Stopping rules• Criteria used with hierarchical techniques becoming increasingly common inconsumer behavior and behavioral economics.•
Variables: Two Issues in Variable Selection .1. Conceptual considerations - to identify potential cluster solutions.• Foundational principle – a natural increase Field experiments have become increasingly common both in academic
Variables characterize the objects being clustered- Relate specifically to the in heterogeneity comes from the reduction in number of clusters.• Common to all settingand the business sector.What Distinguishes The Experimental
objectives of the cluster analysis.2. Practical considerations.-Should always use stopping rules:-• evaluating the trend in heterogeneity across cluster solutions to Approach?Primary objective: Focus on a single or small number of “causes” to
the “best” variables available (i.e., little measurement error, etc.).Cluster analysis identify marked increases. • substantive increases in this trend indicate relatively establisha very specific “effect.”Basic elements of an experiment• Treatment à
is used for:• Taxonomy description – identifying natural groups within the distinct clusters were joined and that the cluster structure before joining is a Outcome.• Treatment/factor – a conceptual “cause” of a specific outcome (i.e.,
data.• Data simplification – the ability to analyze groups of similar observations potential candidate for the final solution.• Issues in applying stopping rules- The cause à effect).• Experimental design – research approach based on developing a
instead of all individual observations.• Relationship identification – the ad hoc procedures must be computed by the researcher and often involve fairly narrowly focusedresearch program which can be tested in a hypothesis or set of
simplified structure from cluster analysis portrays relationships not revealed complex approaches.• Many times measures are software-specific.Two Classes of hypotheses.• Distinguished from the techniques/methods typically used in
otherwise.• Theoretical, conceptual and practical considerations must be observed Stopping Rules:Class 1: Measures of Heterogeneity Change• measures analyzing this type ofdata (e.g., ANOVA and MANOVA) by its rigid research
when selecting clustering variables for cluster analysis:• Only variables that relate heterogeneity change between cluster solutions at each successive decrease in the controls to isolate theproposed “cause” as the only factor which has a relationship
specifically to objectives of the cluster analysis are included, since “irrelevant” number of clusters. A cluster solution is a candidate for the final cluster solution with the outcome.MANOVA Defined:Multivariate extension of ANOVA for
variables can not be excluded from the analysis once it begins.• Variables are when the heterogeneity change measure makes a sudden jump.• Measures of assessing the differences betweengroup means on more than one dependent
selected which characterize the individuals (objects) being clustered.Stage 2: heterogeneity change -• Percentage Changes in Heterogeneity – simple variable at the same time.Basic elements• A variate is tested for equality.-
Research Design in Cluster Analysis: Types of Variables Included: •Can percentage change in heterogeneity.• Measures of variance change – use of root Actually two variates are compared – one for the dependent variables and another
employ either metric or non-metric, but generally not in mixed fashion.• Multiple mean square standard deviation (RMSSTD) to compare solutions.• Statistical for theindependent variables.- The dependent variable variate is of more interest
measures of similarity for each type.Number of Clustering Variables• Can measures of heterogeneity change – pseudo T2 statistic compares goodness-of-fit because the metric-dependent measurescan be combined in a linear combination,
suffer from “curse of dimensionality” when large number of variables analyzed.• between k and k-1 clusters. Thus, large pseudo T2 value at six clusters indicates a as we have already seen in multiple regression anddiscriminant analysis.- The
Can have impact with as few as 20 variables.Relevancy of Clustering Variables• seven cluster solution is the possible final solution.Class 2: Direct Measures of unique aspect of MANOVA is that the variate optimally combines the multiple
No internal method of ascertaining the relevancy of clustering variables.• Heterogeneity• directly measure heterogeneity of each cluster solution and then dependentmeasures into a single value that maximizes the differences across
Researcher should always include only those variables with strongest conceptual allow analyst to evaluate each cluster solution against a criterion measure.• groups.• Cells/groups- Formed by combinations of independent variables
support. Detecting Outliers: They should be removed if the outlier represents: Measures of heterogeneity:- Comparative cluster heterogeneity – the cubic (factorial designs) (e.g., 2 x 2 = four cells).The relationship between the
Aberrant observations not representative of the population. Observations of small clustering criterion (CCC) is a SAS measure of the deviation of the clusters from univariate and multivariate procedures is shown below:
or insignificant segments within the population. They should be retained if the an expected multivariate normal distribution. Choose cluster solution(s) with high
outlier represents: an under-sampling/poor representation of relevant groups in values of CCC.- Statistical significance of cluster variation – pseudo F statistic
the population. In this case, the sample should be augmented to ensure measures the separation among all the clusters by the ratio of between-cluster
representation of these groups.Outliers can be identified: • Finding observations variance (separation of clusters) to withincluster variance (homogeneity of
with large distances from all other observations – pairwise similarities or clusters). Higher values indicate a possible cluster solution.- Internal validation
summated measure of squared differences from mean of each clustering variable.• index – characterize a cluster solution on two dimensions: separation and
Graphic profile diagrams or parallel coordinate graphs highlighting outlying compactness. Common measure is the Dunn index ratio, the ratio between the
cases.• Their appearance in cluster solutions as single-member or very small minimal within-cluster distance to maximal between-cluster distance. Higher
clusters.Interobject similarity• an empirical measure of correspondence, or values indicate better solutions.Rules of Thumb – Deriving The Final Cluster
resemblance, between objects to be clustered. • calculated across the entire set of Solution: • Measures of heterogeneity change: • These measures, whether they
clustering variables to allow for the grouping of observations and their be percentage changes in heterogeneity, measures of variance change (RMSSTD)
Differences Between MANOVA and Discriminant Analysis: To some extent, they
comparison to each other.Three methods most widely used in applications of or statistical measure of change (pseudo T2), all evaluate the change in
are “mirror images” of each other:• The dependent variables in MANOVA (a set
cluster analysis: • Distance measures – most often used.• Correlational measures heterogeneity when moving from k to k - 1 clusters.• Candidates for a final cluster
– less often used as they measure patterns, not distance.• Association measures – solution are those cluster solutions which preceded a large increase in of metric variables) are the independent variables in discriminant analysis.• The
applicable for non-metric clustering variables.Types of distance measure: Many heterogeneity by joining two clusters (i.e., a large change in heterogeneity going single nonmetric dependent variable of discriminant analysis becomes the
different distance measures, most common are:- Euclidean (straight line) distance from k to k –1 clusters would indicate that the k cluster solution is better).• Direct independent variable in MANOVA.• Both use the same methods in forming the
is the most common measure of distance.- Squared Euclidean distance is the sum measures of heterogeneity:• These measures directly reflect the compactness variates and assessing the statistical significance between groups.Differences are
of squared distances and is the recommended measure for the centroid and Ward’s and separation of a specific cluster solution. These measures are compared across in the Research Objectives• In discriminant analysis, the groups formed by the
methods of clustering.-Mahalanobis distance (D2) accounts for variable a range of cluster solutions, with the cluster solution(s) exhibiting more nonmetric dependent measure are assumed as given and interest is in the set of
intercorrelations and weights each variable equally. When variables are highly compactness and separation being preferred.• Among the most prevalent metric independent variables for their ability to discriminate among the groups.
intercorrelated, Mahalanobis distance is most appropriate.Data Standardization: measures are the CCC (cubic clustering criterion), a statistical measure of cluster • In MANOVA, the set of metric variables used as dependent measures are
should be standardized whenever possible to avoid problems resulting from the variation (pseudo F statistic) or the internal validation index (Dunn's assumed given and interest is in finding the nonmetric variable(s) that form
use of different scale values among clustering variables.Two approaches to index).Additional Approaches to Clustering: Density-based approach• groups with the greatest differences on the set of dependent
standardization• Relative to other cases:-most common standardization is Z Fundamental principle – clusters can be identified by “dense” clusters of objects measures.DECISION PROCESS FOR MANOVA:Stage 1: Objectives of MANOVA-
scores.• Relative to other responses within an object-If groups are to be identified within the sample, separated by regions of lower object density.• Researcher Objectives• Analyze a dependence relationship represented as the differences in
according to an individual’s response style, then within-case or row-centering must decide• ε, the radius around a point that defines a point’s neighborhood, a set of dependent measure across groups formed by one or more nonmetric
standardization is appropriate.Stage 3: Assumptions in Cluster Analysis : and• the minimum number of objects (minObj) necessary within a neighborhood independent measures.• To provide insights into the nature and predictive
Structure Exists• Since cluster analysis will always generate a solution, to define it a cluster• Has advantages of• Ability to identify clusters of any power of the independent measures as well as the interrelationships and
researcher must assume that a “natural” structure of objects exists which is to be arbitrary shape• Ability to process very large samples,• Requires specification of
differences in the multiple dependent measures.Benefits• Controlling the
identified by the technique.Representativeness of the Sample• Must be only two parameters,• No prior knowledge of number of clusters,• Explicit
experiment-wide error rate when there is some degree of intercorrelation
confident that the obtained sample is truly representative of the designation of outliers as separate from objects assigned to clusters,• Applicable
among dependent variables.• More statistical power than ANOVA when 5 or less
population.Impact of multicollinearity• Multicollinearity among subsets of to a “mixed” set of clustering variables (i.e., both metric and nonmetric). Model-
dependent variables.• May detect combined effects among dimensions (i.e.,
variables is an implicit “weighting” of the clustering variables• Potential remedies Based Approach• varies from other approaches in that it is a statistical model
for multicollinear subsets of variables- Reduce the variables to equal numbers in versus algorithmic.• uses differing probability distributions of objects as the basis variates) of the dependent measures.Types of Multivariate Questions Suitable
each set of correlated measures.- Use a distance measure that compensates for the for forming groups rather than groupings of similarity in distance or high density.• for MANOVA: Multiple Univariate Questions• Control for overall experiment-
correlation, like Mahalanobis Distance.- Take a proactive approach and include basic model – mixture model where objects are assumed to be represented by a wide error rate with overall test followed up by univariate tests .Structured
only cluster variables that are not highly correlated.Stage 4: Deriving Clusters mixture of probability distributions (known as components), each representing a Multivariate Questions• Specific relationships between dependent measures
and Assessing Overall Fit: Hierarchical• Most common approach is where all different cluster.• Advantages• Can be applied to any combination of clustering requires joint testing.Intrinsically Multivariate Questions• Principal concern is
objects start as separate clusters and then are joined sequentially such that each variables (metric and/or nonmetric).• Statistical tests are available to compare how the whole set of independent variables differ across groups, with combined
step forms a new cluster by joining two clusters at a time until only a single different models and determine best model fit to define best cluster solution.•
cluster remains. Two types:Agglomerative Methods • Buildup: all observations Missing data can be directly handled.• No scaling issues or transformations of Measures: Researchers should include only dependent variables that have
start as individual clusters, join together sequentially.A multi-step process• Start variables needed.• Once the cluster solution is finalized, can include strong theoretical support• Inappropriate or irrelevant variables can not be
with all observations as their own cluster.• Using the selected similarity measure antecedent/predictor and outcome/validation variables.Stage 5: Interpretation of empirically removed (e.g., no sequential inclusion method) and thus can impact
and agglomerative algorithm, combine the two most similar observations into a the Clusters: Interpretation:• Involves examining each cluster in terms of the
new cluster, now containing two observations.• Repeat the clustering procedure cluster variate to name or assign a label accurately describing the nature of the correlations are very high (> .7 or .8), they have a tendency to create
using the similarity measure/agglomerative algorithm to combine the two most clusters. • The cluster centroid, a mean profile of the cluster on each clustering redundancies and reduce the statistical efficiency. This may become especially
similar observations or clusters (i.e., combinations of observations) into another variable, is particularly useful in the interpretation stage.- Interpretation involves impactful if multiple composite variates (i.e., discriminant functions) are formed
new cluster.• Continue the process until all observations are in a single examining the distinguishing characteristics of each cluster’s profile and when analyzing two or more independent variables. • If correlations are very low
cluster.Most widely used algorithms• Single Linkage (nearest neighbor) – shortest identifying substantial differences between clusters-Cluster solutions failing to (< .3), the analyst should consider running separate ANOVAs and adjusting for
distance from any object in one cluster to any object in the other.• Complete show substantial variation indicate other cluster solutions should be examined.•
the experiment-wide error (e.g., Bonferroni adjustment) Stage 2: Issues in the
Linkage (farthest neighbor) – based on maximum distance between observations The cluster centroid should also be assessed for correspondence with the
Research Design of MANOVA: MANOVA is Applicable to Multiple Research
in each cluster.• Average Linkage – based on the average similarity of all researcher’s prior expectations based on theory or practical experience.Stage 6:
ApproachesExperimentation: Randomization of Groups• Controlled experiment
individuals in a cluster. • Centroid Method – measures distance between cluster Validation and Profiling of the Clusters: Validation is essential in cluster
– respondents are randomly assigned to treatment or control group to ensure
centroids.• Ward’s Method – based on the total sum of squares within analysis since the clusters are descriptive of structure and require additional
clusters. Divisive Methods • Breakdown: initially all observations in a single Validation and Profiling of the Clusters: Validation is essential in cluster
cluster, then divided into smaller clusters. Non-hierarchical• the number of validates a cluster solution by creating two subsamples (randomly splitting the controlled experiment in a “natural” setting (i.e., outside a controlled laboratory
clusters is specified by the analyst and then the set of objects are formed into that sample) and then comparing the two cluster solutions for consistency with respect environment) in an attempt to increase external validity, but also raising threats
set of groupings.1. Determine number of clusters to be extracted2. Specify to number of clusters and the cluster profiles.• Criterion validity – achieved by to internal validity.Experimentation: Non-Randomization • Quasi-experiment –
cluster seeds.3. Assign each observation to one of the seeds based on examining differences on variables not included in the cluster analysis but for similar to a controlled experiment except it lacks assignment to the two groups
similarity.• Sequential Threshold = selects one seed point, develops cluster; then which there is a theoretical and relevant reason to expect variation across the through randomization.• Natural experiment – non-random form of
selects next seed point and develops cluster, and so on. Observation cannot be clusters.Profiling A Cluster Solution: Describing the characteristics of each experimental research where the treatment occurs naturally (e.g., natural
reassigned to another cluster following its original assignment.• Parallel cluster on a set of additional variables (not the clustering variables) to further disaster, etc.).Non-experimental• Observational study (cross-sectional study) –
Threshold = sets all seed points simultaneously, then develops clusters.• understand the differences between clusters• Examples include descriptive non-random research design limited since respondent selection is not controlled
Optimization = allow for re-assignment of observations based on the sequential variables (e.g., demographics) as well as other outcome-related measures.• by the researcher and confounds are very difficult to identify and then account
proximity of observations to clusters formed during the clustering process.Pros Provides insight to researchers as to nature and character of the clusters.Clusters for in the analysis.Types of Variables in Experimental Research: Basic
and Cons of Nonhierarchical Methods: Pros• Results are less susceptible to:• should differ on these relevant dimensions. This typically involves the use of Relationship: primary interest of the research• Treatment/factor – independent
outliers in the data,• the distance measure used, and• the inclusion of irrelevant or discriminant analysis or ANOVA.Rules of Thumb – Deriving the Final Cluster variable which is hypothesized as the reason/”cause” of the outcome variable.•
inappropriate variables.• Can easily analyze very large data sets.Cons• Best SolutionThere is no single objective procedure to determine the ‘correct’ number Outcome – dependent variable which represents the values arising from the
results require knowledge of seed points.• Difficult to guarantee optimal solution.• of clusters. Rather the researcher must evaluate alternative cluster solutions on the different levels of the treatments/factors.• Main effect – individual effect of a
Generates typically only spherical and more equally sized clusters.• Less efficient following considerations to select the “best” solution:• Single-member or treatment variable on the dependent variable(s).Stage 3: Assumptions of
in examining wide number of cluster solutions.Pros and Cons of Hierarchical extremely small clusters are generally not acceptable and should generally be
ANOVA and MANOVA. Stage 4: Estimation of the MANOVA Model and
Methods: Pros• Simplicity – generates tree-like structure which is simplistic eliminated.• For hierarchical methods, several ad hoc stopping rules are available
Assessing Overall Fit Stage 5: Interpretation of the MANOVA Results Stage 6:
portrayal of process.• Measures of similarity – multiple measures to address many to indicate the number of clusters based on the rate of change in a total similarity
Validation of the Results
situations.• Speed – generate entire set of cluster solutions in single measure as the number of clusters increases or decreases or measures of
analysis.Cons• Permanent combinations – once joined, clusters are never heterogeneity.• All clusters should be significantly different across the set of
separated.• Impact of outliers – outliers may appear as single object or very small clustering variables.• Cluster solutions ultimately must have theoretical validity
clusters.• Large samples – not amenable to very large samples, may require assessed through external validation.Implications for Big Data Analytics:
samples of large populations. Hierarchical clustering solutions are preferred Primary advantage• Simplification by reducing the large number of observations
when:• A wide range, even all, alternative clustering solutions is to be examined.• into a much smaller number of groupings from which the general nature and
The sample size is moderate (under 300-400, not exceeding 1,000) or a sample of character of the entire dataset can be observed.Challenges• Increasing sample
the larger dataset is acceptable.Nonhierarchical clustering methods are sizes pose difficulties for clustering methods, particularly hierarchical methods.•
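The combination approach described above (a hierarchical run to suggest the number of clusters and the seed points, followed by a nonhierarchical run for the final memberships) can be illustrated with a short sketch. This is only a hedged illustration, not the textbook's own procedure: the data matrix `X`, the chosen `k`, and the use of SciPy and scikit-learn are all assumptions.

```python
# Hedged sketch: hierarchical step to pick k and seeds, then k-means for final clusters.
# X is assumed to be an (n_samples x n_vars) array of standardized clustering variables.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # placeholder data; replace with real standardized data

# 1. Hierarchical clustering (Ward's method) to examine the agglomeration schedule.
Z = linkage(X, method="ward")
heterogeneity = Z[:, 2]                 # merge distances; look for a large jump to choose k
k = 3                                   # assumed choice after inspecting the jump in heterogeneity

# 2. Profile the hierarchical clusters and use their centroids as seed points.
labels_h = fcluster(Z, t=k, criterion="maxclust")
seeds = np.vstack([X[labels_h == g].mean(axis=0) for g in range(1, k + 1)])

# 3. Nonhierarchical (k-means) run from those seeds for more accurate final memberships.
km = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0).fit(X)
print(km.labels_[:10], km.cluster_centers_.round(2))
```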
• Multiple regression has a single metric dependent variable.
• Used when there is a theory.
• Some overlap among independent variables is acceptable, but not too much (multicollinearity is harmful because it can reverse results, flipping the sign of the relationship between x and y from positive to negative). Theory tells whether overlap is expected; that is why we should test for multicollinearity issues (a VIF check is sketched below).
• Univariate – one variable (in multiple regression, the dependent variable).
• Simple regression has a single independent variable.
• Text mining – gaining knowledge from existing data (internet, etc.).
• Nature of relationships (usually we assume a linear form).
• How to select dependent and independent variables.
• Focus on how the dependent variable relates to the independent variables.
• We care most about measurement error in the dependent variable.
• It is rare to see measurement error in multiple regression.
• Specification error is easy to avoid as long as you have good theory (the right perspective for choosing variables).
• You can add potentially irrelevant variables intentionally to see whether they change the results.
• Common method bias arises when the same measurement method is used for all variables.
• Generalize the research to the real world.
• Detailed, interesting factor effects can be found using MANOVA and ANOVA rather than multiple regression.
• Homoscedasticity – the variance of the errors should be constant across the data.
• Test the assumptions before you run the analysis.
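As flagged in the multicollinearity bullet above, a VIF/tolerance check is the usual diagnostic. The sketch below is a hedged illustration only; the DataFrame `df` and its column names are invented placeholders, and statsmodels is assumed to be available.

```python
# Hedged sketch: VIF and tolerance for a set of candidate predictors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["income", "profession", "social_class"])

X = sm.add_constant(df)                 # add intercept so VIFs are computed correctly
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns, name="VIF",
)
tolerance = 1.0 / vif                   # tolerance = 1 / VIF
print(pd.concat([vif, tolerance.rename("tolerance")], axis=1))
# The square root of a VIF approximates how much that coefficient's standard error is inflated.
```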
Class notes (continued):
• Use standardized data when comparing across studies.
• A log transformation can be used if there is too much heteroscedasticity.
• Linear regression is preferable.
• Forward, backward, and stepwise are the variable-selection methods we will use most.
• Deal with the multicollinearity problem first, before using the original form of the variables.
• In multiple regression we use the confirmatory (simultaneous) approach.
• The constrained approach is rarely used because it applies only when we know what to expect.
• TSS = SSE + SSR, so R² = SSR/TSS = 1 − SSE/TSS.
• When you run multiple regression there will be two sets of results, for multiple regression and MANOVA.
• When you run MANOVA there will also be two sets of results, for multiple regression and MANOVA.
• Just ignore the additional one.
• For experiments, use MANOVA.
• For secondary data, use multiple regression.
• Test the ANOVA result before jumping into multiple regression.
• Significance should be greater than 0.5 even though the MANOVA result is okay.
• R² – the bigger the better.
• Adjusted R² is lower than R².
• If the t and p values are significant but the effect is very small, practical significance is poor.
• C, f and d can be problematic.
• Explanation is theory-based, but if there is no theory, aim at least for prediction.
• When multicollinearity occurs, adopt a new measurement.
• Nowadays VIF is up to 0.30.
• Relative importance of independent variables: x1 family income, x2 profession, x3 bank, x4 social class – these kinds of variables are comparable.
• In finance, we use secondary data.
• First we check multicollinearity, then the relative importance of comparable variables.
• Summated constructs (several variables combined to represent one construct) are better than single-item constructs.
• If we want to add additional measures, we need to collect more data and do more research.
• In multivariate analysis there is no need to distinguish shared and unique variance.
• How can we know whether the model produces a valid result?
 - Additional or split samples: if we have multiple data sets we can test the model on them and the results should be similar.
 - When testing hypotheses, the best wording is "confirm" or "support", not "prove".
 - Comparing regression models: if the difference is huge, the data are not good – no significant result.
 - Choose variables that stay updated over time, so that others can use your model in the near future.
• The higher the R², the more valid the model.
• If VIF is above 0.5 there is multicollinearity even though R² is okay (DON'T TRUST THE R²).
• EFA – an interdependence technique (a brief factor-extraction sketch closes these notes).

Q1: When should I delete the "star" outliers and when should I delete the "circle" outliers? You should always delete the outliers represented as "stars". For the "circles", however, deleting all of them may result in the loss of too many observations, so you can decide for yourself whether to delete them. A good option is to delete only the observations that are repeatedly marked as circles on many items and keep the ones marked as a circle on only one item. Q2: Why use multivariate analysis? Multivariate analysis is a tool for finding patterns and relationships among several variables simultaneously. It lets us predict the effect a change in one variable will have on other variables. … This gives multivariate analysis a decisive advantage over other forms of analysis. Q3: Why is knowledge of measurement scales important in using multivariate analysis? Since the type of measurement (nonmetric/qualitative or metric/quantitative) must be defined by the researcher for each variable, how the data are defined has a great effect on what they represent. For the computer the values are just numbers; on a nominal scale, 1 can be assigned to male and 0 to female, but the data represent gender and the researcher determines how they can be analyzed. Nonmetric data (ordinal and nominal scales) are used as independent variables in most multivariate techniques, so knowledge of measurement scales is important in order not to use the wrong scale for a variable. As in the example above, 0 and 1 are just nonmetric codes for gender; they have no numerical meaning (females are not "less than" males, and we cannot calculate a meaningful mean). Q4: What is the difference between component analysis and common factor analysis? In common factor analysis the original variables are defined as linear combinations of the factors, and the goal is to explain the covariances or correlations between the variables. Principal components analysis is used to reduce the data to a smaller number of components. Q5: When should multiple regression be used? It is used when we want to predict the value of a variable based on the values of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes the outcome, target, or criterion variable). Q6: When should a linear regression be used? Simple linear regression is appropriate when the dependent variable Y has a linear relationship to the independent variable X. To check this, make sure that the X–Y scatterplot is linear and that the residual plot shows a random pattern. Q8: How do you use regression coefficients? When the regression line is linear (y = ax + b), the regression coefficient is the constant (a) that represents the rate of change of one variable (y) as a function of changes in the other (x); it is the slope of the regression line. Q7: Should we use factor scores or summated ratings in follow-up analysis? It is hard to answer without context. Usually a summated rating is used when the items are designed as a summated scale, but that is rare. For latent constructs, my field usually uses factor scores; however, in some fields, such as economics, summated ratings are sometimes used, perhaps as a way to put weight on the items considered important. A summated rating adds the scores of all the items into a single score for the construct, so if the items are measured on different scales (for example, one item is 1 to 5 and another is 1 to 9), the item on the 1-to-9 scale will have a higher weight when they are added.

Logistic regression (class notes): The log is used to handle the exponential form; x^0 = 1. Sample size – likelihood is not the same as probability; more than 400 observations is better for achieving good results. Classification generates four outcomes: true positives and true negatives are correct results, while false positives and false negatives are incorrect results; these problems can be reduced by increasing the sample size. Predictive accuracy of actual outcomes – if only sensitivity is good, specificity will be low. Use of aggregated data – if we have multiple binary dependent variables, each with two levels, the cells multiply (2 x 2 x 2). Assumptions of logistic regression – as in multiple regression, we do not want our sample to be contaminated; the observations need to be independent, and a hierarchical test can be run to check this. How to keep data uncontaminated? More research. Fitting the logistic curve to sample data – the curve moving toward 1 indicates success and toward 0 indicates failure; odds and log odds are not the same as probability (the count of the outcome we want divided by the overall count is the probability). Maximum likelihood – the better the model we estimate, the better the maximum likelihood; we do not want too many observations falling in the middle of the curve, which leads to a poorly fitted relationship. If we do not know what hypothesis to test, go through two steps: the null hypothesis should be rejected, which leads to support for the alternative hypothesis. −2LL shows whether the model is good or not. Measure of model fit – we use a pseudo R² in logistic regression, not the usual R². Estimation issues – there are three problematic issues: estimation cannot be decided when the sample size is low; 0 and 1 are not defined in the log odds; and the third, most frequently encountered problem is quasi-complete separation (N = 1, meaning no control group). Influential observations – delete the influential observations. Two forms of coefficients – it is okay to use the original logistic coefficients if we do not know how to use the exponentiated logistic coefficients. (A hedged fitting sketch follows below.)
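The logistic-regression notes mention −2LL, the pseudo R², and the two forms of coefficients. The following is a hedged sketch of how those quantities might be read off a fitted model; the simulated data and variable names are assumptions, and statsmodels is assumed to be available.

```python
# Hedged sketch: fitting a binary logit and reading -2LL, pseudo R2, and exponentiated coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))                        # notes suggest 400+ observations is preferable
logit_p = 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))      # simulated binary outcome

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)  # maximum likelihood estimation
print("-2LL:", -2 * model.llf)                       # lower is better when comparing nested models
print("McFadden pseudo R2:", model.prsquared)        # pseudo R2, not the OLS R2
print("Original (logit) coefficients:", model.params.round(3))
print("Exponentiated coefficients (odds ratios):", np.exp(model.params).round(3))
```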
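The final note lists EFA as an interdependence technique. As a hedged closing sketch (simulated data; scikit-learn assumed), an exploratory factor extraction might look like this:

```python
# Hedged sketch: exploratory factor analysis as an interdependence technique
# (no dependent variable; we only look for structure among the observed items).
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 2))                       # two simulated underlying factors
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.5 * rng.normal(size=(300, 6))  # six observed items

Xz = StandardScaler().fit_transform(X)                   # standardize before factoring
fa = FactorAnalysis(n_components=2, random_state=0).fit(Xz)
print("Estimated loadings:\n", fa.components_.T.round(2))
print("Factor scores (first rows):\n", fa.transform(Xz)[:3].round(2))
```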