
Chapter 1: Overview of Multivariate Methods data (Hot/Internal or cold deck/external imputation: substitute a value from another of factors that

on: substitute a value from another of factors that exist within a set of variable and belong to which construct. -CFA is Dummy variable Binary metric variable used to represent a single category of a
Multivariate Analysis: all statistical methods simultaneously analyze multiple source for the missing values vs Case substitution: entire observations w/ missing used to provide a confirmatory test of our measurement theory. It is theory driven. - nonmetric variable. Eigenvalue Column sum of squared loadings for a factor; also
measurements on each individual or object under investigation; any simultaneous data are replaced by choosing another non-sampled observation) How to get CFA in SPSS -Enter exact number of factor in fixed number of factors. - referred to as the latent root. It represents the amount of variance accounted for by
analysis of > 2 variables Variate: a linear combination of variables with empirically 3) Calculating replacement value (Mean substitution: replace missing values for a How to get CFA in SPSS: Choose Factor Analysis > Extraction > Fixed number of a factor.EQUIMAX One of the orthogonal factor rotation methods that is a
determined weights of a set of variables specified by researchers. Weights are variable w/ the mean value of that variable calculated from all valid response vs “compromise” between the VARIMAX and QUARTIMAX approaches, but is not widely
factors > factor to extract 1
determined by multivariate technique to best achieve specific objective. Multivariate Regression imputation: predict the missing values of a variable based on its used. Factor Linear combination (variate) of the original variables. Factors also
technique: extension of univariate analysis (analysis of single distribution) & relationship to other variables in the dataset) Chapter 3: EFA Factor Analysis represent the underlying dimensions (constructs) that summarize or account for the
bivariate analysis (cross-classification, correlation, analysis of variance, and simple ROT2-3::Imputation of Missing Data: 1) Under 10% – Any of the imputation methods EFA: an interdependence technique whose primary purpose is to define the original set of observed variables. Factor loadings Correlation between the original
regression used to analyze two variables) Variate value (Y’) = w1X1 + w2X2 + w3X3 can be applied when missing data is this low, although the complete case method has underlying structure among the variables in the analysis; examines interrelationships variables and the factors, and the key to understanding the nature of a particular
+ . . . + wnXn (where Xn is the observed variable and wn is the weight determined by been shown to be the least preferred. 2) 10 to 20% – The increased presence of among a large no. of variables and attempts to explain in terms of common factor. Squared factor loadings indicate what percentage of the variance in an
the multivariate technique) Types of data: 1) Nonmetric data (Qualitative) missing data makes the all available, hot deck case substitution and regression underlying dimensions ( factors or components); A summarization and data original variable is explained by a factor. Factor rotation Process of manipulation or
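As a minimal illustration of the variate formula above, the Python sketch below computes Y’ = w1X1 + w2X2 + w3X3 for each respondent; the data and weights w1-w3 are hypothetical (in practice the multivariate technique estimates the weights to meet its objective).

import numpy as np

# Hypothetical observed variables: one row per respondent, one column per X.
X = np.array([[3.0, 5.0, 2.0],
              [4.0, 1.0, 6.0]])
# Hypothetical weights; a real technique (regression, discriminant, etc.) derives these.
w = np.array([0.4, 0.3, 0.3])

variate = X @ w   # Y' = w1*X1 + w2*X2 + w3*X3 for each respondent
print(variate)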
describes differences in type by indicating the presence or absence of a characteristic methods most preferred for MCAR data, while model-based methods are necessary reduction technique that does not have independent and dependent variables, but is adjusting the factor axes to achieve a simpler and pragmatically more meaningful
or property; only represent categories or classes 1.1) Nominal (categorical) scale with MAR missing data processes 3) Over 20% – If it is necessary to impute missing an interdependence technique in which all variables are considered simultaneously. factor solution. QUARTIMAX A type of orthogonal factor rotation method focusing
only label represent categories or classes and don’t imply amounts of an attribute or data when the level is over 20%, the preferred methods are: the regression method Factor Analysis Decision Process: Stage 1: Objectives1. EFA used to discover the on simplifying the rows of a factor matrix. Generally considered less effective than
characteristic e.g. sex, religions 1.2 ) Ordinal scale can be compared, only the order for MCAR situations // model-based methods when MAR missing data occurs. factor structure of a construct and examine its reliability. It is data driven. CFA used the VARIMAX rotation. Q factor analysis Forms groups of respondents or cases based
of the values ; only the order will be known but not the amount of difference btw to confirm the fit of the hypothesized factor structure to the observed (sample) data. on their similarity on a set of characteristics. R factor analysis Analyzes relationships
Outliers: an observation/response with a unique combination of characteristics It is theory driven 2. Specify the unit of analysis( Variables: R-factor analysis to
values e.g. ranking, yes-no-greater 2) Metric data (Quantitative) used when subjects among variables to identify groups of variables forming latent dimensions (factors).
identifiable as distinctly different from the other observations/responses. Types: 1) analyze a set of variables to identify the dimensions for the variable that are
differ in amount or degree on a particular attribute 2.1) Interval Scales Interval scales Surrogate variable Selection of a single variable with the highest factor loading to
Error 2) Interesting 3 )Influential. Occurs bc 1) Procedural Error (data entry error/ a latent/not easily observed vs Cases: Q factor analysis to combine or condense large
and ratio scales provide the highest level of measurement precision, permitting represent a factor in the data reduction stage instead of using a summated scale or
mistake in coding) is better if it’s detected and cleaned on data cleaning stage 2) no. of cases into distinctly different groups w/thin a larger population; use cluster
nearly any mathematical operation to be performed. These two scales have constant factor score VARIMAX The most popular orthogonal factor rotation methods
Extraordinary Event (hurricane occurs when we try to track average daily rainfall) – analysis to group indi cases) 3. Data summarization (derive underlying dimensions
units of measurement, so differences between any two adjacent points on any part focusing on simplifying the columns in a factor matrix. Generally considered superior
should decide whether the event fits the research objective or not. If not, should be that describe the data in a much smaller number of concepts than the original indi
of the scale are equal. This uses an arbitrary zero point e.g. time, temperature, no to other orthogonal factor rotation methods in achieving a simplified factor
deleted. 3) Extraordinary Observations (unique profile that unexpectedly occurs) – variables)and/or reduction (deriving empirical value (factor or summated scale score)
true zero, not always relatable 2.2) Ratio scales represent the highest form of structure.
could be represent an emerging element, must use judgement in retention/deletion for each dimension (factor) and then substituting this value for the original values in
measurement precision because they possess the advantages of all lower scales plus Chapter 3: EFA Tutorial
selection 4) Observations unique in their combination of values (the observation falls subsequent analysis)4. Using factor analysis with other techniques Stage 2: Designing
an absolute zero point. All mathematical operations are permissible with ratio-scale Factor Analysis Steps: 1. Click on Analyze > Dimension Reduction > Factor Select> put
within the ordinary range of values on each variables) – should retain unless there is a Factor Analysis Three Basic Decisions: 1. Calculation of input data – R vs. Q analysis.
measurements, relatable e.g. height, weight (0.0) Measurement error: degree to it in “variable box” 2. Click on Descriptive and Check: Initial Solution, KMO &
specific evidence that not stated this outlier as valid member of population**Impact 2. Design of study in terms of number of variables, measurement properties of
which the observed values are not representative of the true values; caused by data Bartletts, Anti-image, and Continue 3. Click on Extraction and then [Methods: Select
of outliers: 1) distance measures become less useful 2) distort results >> irrelevant
entry errors, imprecision of the measurement, inability of respondents to accurately “Principal components”] -> Check: Correlation Matrix, Unrotated factor solution,
variable. Dealing with Outliers: 1) Identify outliers (designation): after finding ROT3-1::Factor Analysis Design: 1) Factor analysis is performed most often only on
provide info. Validity: the degree to which a measure accurately represents what it is Based on Eigenvalue “type 1 in the box”, Scree plot (optional) 4. Click on Rotation
outliers in detecting outliers steps, select only observations that show real metric variables, although specialized methods exist for the use of dummy variables; a
supposed to. E.g. if we want to measure discretionary income, we should not ask and Check: a. VARIMAX, b. Rotated Solution, c. Max Iteration 25 -> Continue 6. Click
uniqueness compared to the rest of the Q:When should I delete the "star" outliers small number of “dummy variables” can be included in a set of metric variables that are
about total household income Reliability: the degree to which the observed variable on Options and Check “Sorted by size” Reliability Analysis Steps All items remained
and when should I delete the "circle" outliers? You should always delete the outliers factor analyzed 2) If a study is being designed to reveal factor structure, strive to have
measures the true value and is error free thus, it is the opposite of measurement from the results of factor analysis will be adopted to proceed the reliability test. 1.
represented as "stars". However, for the "circles", sometimes deleting all of them will at least five variables for each proposed factor 3) For sample size: • The sample must
error. If the same measure is asked repeatedly, e.g., more reliable measures will Click on Analyze -> Scale >Reliability Analysis -> Select the (items) variables from the
result in the loss of too many observations. You can decide by yourself whether to have more observations than variables • The minimum absolute sample size should be
show greater consistency. left panel (but you do need to make sure that the items are the formal results of
delete or not. A good option is to only delete the observations that repeatedly 50 observations • Strive to maximize the number of observations per variable, with a
Summated scale: method of combining several variables that measure the same factor analysis steps) and move it to the right panel “Items box” 2. Click on Statistics -
marked as circle in many items and keep the ones that are marked as circle for only 1 desired ratio of 5 observations per variable
concept into a single variable in an attempt to increase the reliability of the > Check: Scale if item deleted, Correlations -> Continue -> OK
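The reliability steps above use SPSS; as a hedged alternative sketch, the Python code below computes Cronbach’s alpha and corrected item-to-total correlations for a hypothetical three-item scale (the item names q1-q3 and all responses are invented).

import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical Likert responses (rows = respondents, columns = items of one scale)
items = pd.DataFrame({"q1": [4, 5, 3, 4, 2, 5],
                      "q2": [4, 4, 3, 5, 2, 4],
                      "q3": [5, 5, 2, 4, 1, 4]})

print("Cronbach's alpha:", round(cronbach_alpha(items), 3))   # rule of thumb: > .60/.70
for col in items.columns:
    rest = items.drop(columns=col).sum(axis=1)                # total score without the item
    print(col, "item-to-total r =", round(items[col].corr(rest), 3))   # rule of thumb: > .50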
item. Stage 3: Assumptions in Factor Analysis 1)Conceptual: assume a homogeneity of sub
measurement through multivariate measurement. The separate variables are EFA RT
ROT2-4::Outlier Detection: 1) Univariate methods: Examine all metric variables to sample factor solution 2)Addressing Multicollinearity :Multicollinearity is assessed
summed and then their total or average score is used in the analysis for which KMO > 0.5 predicts if data are likely to Item-to-total correlation > 0.50
identify unique or extreme observations • For small samples (80 or fewer using MSA (measure of sampling adequacy). MSA is measured by the Kaiser-Meyer-
several variables are joined in a composite measure to represent a concept. factor well based on correlation and correlation between an individual item
observations), outliers typically are defined as cases with standard scores of 2.5 or Olkin (KMO) statistic. As a measure of sampling adequacy, the KMO predicts if data
Objective is to avoid the use of only a single variable to represent a concept and partial correlation + can be used to and the total score without that item. to
greater • For larger sample sizes, increase the threshold value of standard scores up to are likely to factor well based on correlation and partial correlation. KMO can be
instead use several variables representing differing facets of the concept to obtain a identify which variables to drop from the check if any item in the set of tests is
4 • If standard scores are not used, identify cases falling outside the ranges of 2.5 used to identify which variables to drop from the factor analysis because they lack
more well-rounded perspective. thus enables the researcher to more precisely factor analysis because they lack inconsistent with the averaged behavior of
versus 4 standard deviations, depending on the sample size 2) Bivariate methods: multicollinearity. There is a KMO statistic for each individual variable, and their sum
specify the desired responses >> place total reliance on the “average to a set of multicollinearity the others, and thus can be discarded.
Focus on specific variable relationships, such as the independent versus dependent is the KMO overall statistic. KMO varies from 0 to 1.0 Overall KMO should be .50
related responses. Component Matrix “Factor Loading” > Cronbach’s alpha (α) > 0.60 exploratory &
variables • Use scatterplots with confidence intervals at a specified alpha level 3) to proceed with factor analysis. If it is not, remove the variable with the lowest
Variable Specification: specifying the variable whether 1) use original/indi variables 0.6 > 0.70 confirmatory Measure of reliability
Multivariate methods: Best suited for examining a complete variate, such as the individual KMO statistic value one at a time until KMO overall rises above .50, and
- retain most detailed attribute, but may suffer from multicollinearity or 2) perform see correlation between the original that ranges from 0 to 1, with values of .60
independent variables in regression or the variables in factor analysis • Threshold each individual variable KMO is above .50. • Homogeneity of sample factor solutions
dimensional reduction - finding combinations of the indi variables tht captures variables and the factors, and the key to to .70 deemed the lower limit of
levels for the D2/df measure should be conservative (.005 or .001), resulting in values (MSA Measure calculated both for the entire correlation matrix and each individual
multicollinearity among set of variables and allow for a single composite value understanding the nature of a factor acceptability
of 2.5 (small samples) versus 3 or 4 in larger samples variable evaluating the appropriateness of applying factor analysis. Values above .50
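A small sketch of the univariate rule in ROT 2-4 above (|z| > 2.5 for 80 or fewer observations, up to 4 for larger samples); the data are hypothetical and the cut-offs follow the text.

import numpy as np
from scipy import stats

x = np.array([12, 14, 15, 13, 16, 14, 15, 13, 14, 15,
              16, 13, 14, 15, 14, 13, 16, 15, 14, 42], dtype=float)
z = stats.zscore(x, ddof=1)                       # standard scores

threshold = 2.5 if len(x) <= 80 else 4.0          # small vs. large sample cut-off
flagged = np.where(np.abs(z) > threshold)[0]
print("flagged case indices:", flagged)           # the extreme value 42 is flagged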
representing the set of variables . Tot. Var. Explained – “Eigen value” Total Factor loadings: plays a large role in that
population 2) Describe outliers: generate profile of each outlier, identify the for either the entire matrix or an individual variable indicate appropriateness.)
Variable Selection: identifying the variables to be included in the analysis . > 1 to determine the number of factors to factor. The relationship of each var to the
variable(s) that makes it an outlier. Assign them to the outlier classes to help on
Multicollinearity: the degree of correlation among the variables in the variate may ROT3-2::Testing Assumptions of Factor Analysis: 1) A strong conceptual foundation extract underlying factor (eg. XX has high factor
retain/delete decision. 3) Delete/Retain: Outliers should be retained except there
result in a confounding effect (a situation in which a relationship btw exposure and needs to support the assumption that a structure does exist before the factor analysis Tot. Var. Explained - “Cumulative %” > loading means it is the strongest
are proof that show they doesn’t represent any observation in the population. If
outcome is distorted by the presence of another variable) in the interpretation of the is performed 2) A statistically significant Bartlett’s test of sphericity (sig. < .05) indicates 60% way of expressing frequency association to the underlying
they do portray a representative element or segment of the population, they should
indi variables of the variate; the measure of the shared variance with other variables that sufficient correlations exist among the variables to proceed 3) Measure of distribution. It calculates the percentage var)Communality: amount of var in each
be retained to ensure generalizability to the entire population. As outliers are
in the variate sampling adequacy (MSA) values must exceed .50 for both the overall test and each of the cumulative frequency within each var accounted for by all components//
deleted, the researcher runs the risk of improving the multivariate analysis but
Types of Statistical Error: 1)Type I error/ Alpha (α): probability of rejecting the null individual variable; variables with values less than .50 should be omitted from the interval, much as relative frequency strength of factor in explaining each var.
limiting its generalizability. Some techniques are less affected by violating certain
hypo when it is actually true; false positive. By specifying an alpha level,researcher factor analysis one at a time, with the smallest one being omitted each time distribution calculates the percentage of Low communality means the var has little
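A hedged Python sketch of the KMO/MSA and Bartlett checks described above, using the third-party factor_analyzer package rather than SPSS; the two-factor data set and the variable names x1-x6 are simulated for illustration only.

import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity

# Simulate six items driven by two latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
pattern = np.array([[.8, 0], [.7, 0], [.9, 0], [0, .8], [0, .7], [0, .9]])
df = pd.DataFrame(latent @ pattern.T + rng.normal(scale=0.5, size=(200, 6)),
                  columns=[f"x{i}" for i in range(1, 7)])

chi_sq, p_value = calculate_bartlett_sphericity(df)   # sig. < .05 -> enough correlation
kmo_per_item, kmo_overall = calculate_kmo(df)         # both should exceed .50
print("Bartlett p =", round(p_value, 4), "| overall KMO =", round(kmo_overall, 3))

fa = FactorAnalyzer(n_factors=2, rotation="varimax")  # EFA with an orthogonal rotation
fa.fit(df)
print(pd.DataFrame(fa.loadings_, index=df.columns).round(2))   # factor loadings
print("communalities:", np.round(fa.get_communalities(), 2))   # rule of thumb: > .50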
assumptions, which is termed robustness. Stage 4: Deriving Factors and Assessing Overall Fit 1)Selecting the factor extraction
sets acceptable limits for error and indicates probability of concluding that frequency. common with other variable. Cumulative
Testing multivariate analysis Assumptions: foundation for making statistical method: common variance (variance of a variable that is shared w/ all other variables
significance exists when it really doesn’t 2) Type II error/Beta (β): the probability of (Explained variance): the %of var
inferences & results. Four important assumptions 1. Normality: referring to the in the analysis) vs. unique variance (Specific variance: can’t be explained by
not rejecting the null hypo when it is actually false; accept false Power or 1 – β: accounted for by the n components. Item-
shape of the data distribution for an individual metric variable and its correlations to other variables vs Error variance: due to unreliability in data gathering
probability of correctly rejecting the null hypo when it should be rejected Type I and Communality > 0.5 Total amount of to-correlation: the α if item deleted, we
correspondence to the normal distribution. Univariate normality vs multivariate process, measurement error)2)Determining number of factors to represent the data
Type II errors are inversely related Type I error becomes more restrictive (moves variance an original variable share with all can see that when remove any item and α
normality. Nonnormality: kurtosis- the peakedness or flatness of the distribution 3) Principal component (data reduction is a primary concern, prior knowledge
closer to zero) as probability of a Type II error increases. Reducing Type I errors other variables included in the analysis is lower, so reconsider not to remove.
compared w/ normal distribution. (+)value indicates a relatively peaked distribution, suggests that specific and error variance represent a relatively small proportion of
reduces the power of the statistical test. Cronbach’s Alpha (α), the items are
(-) value indicates a relatively flat distribution. Skewness: distribution is unbalanced & the total variance) vs Common factor analysis (obj = to identify latent dimensions or
shift to one side. Impact: serious effects in small samples (fewer than 50 cases), but reliable
constructs represented in common variance of the original variables )
the impact effectively diminishes when sample sizes reach 200 cases or more. How Some Issues w/ EFA: KMO does not appear in the results Problem: If the correlation
Stopping Rule: Scree test: identify the optimum no. of factors that can be extracted bf
to test normality: 1) Graphical analyses: check of histogram (compare observed data matrix is nonpositive definite (some of the eigenvalues of the correlation matrix are
the amount of unique variance begins to dominate the common variance structure.
w/ a distribution) or normal probability plot/straight diagonal line/z-value not positive numbers) and the KMO will not be displayed. Why: This may happen if
Elbow: the point at which the curve first begins to straighten out
(comparing the cumulative distribution of actual data values w/ cumulative variables depend on other variables, such as one variable being a product or sum of
Parallel Analysis: empirical measure based on the specific characteristics
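A sketch of the stopping rules just listed (latent root / eigenvalue > 1, scree, and parallel analysis); it works directly from the correlation matrix of a simulated data set, so all names and values are illustrative assumptions rather than output from the notes.

import numpy as np

def eigenvalues_observed_vs_parallel(data, n_sims=100, seed=0):
    # Observed eigenvalues of the correlation matrix, plus the mean eigenvalues of
    # same-sized random-normal data sets (the parallel-analysis benchmark).
    n, p = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rng = np.random.default_rng(seed)
    sims = np.empty((n_sims, p))
    for i in range(n_sims):
        random_data = rng.normal(size=(n, p))
        sims[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(random_data, rowvar=False)))[::-1]
    return observed, sims.mean(axis=0)

rng = np.random.default_rng(1)
data = rng.normal(size=(150, 5)) + rng.normal(size=(150, 1))  # one shared component
obs, par = eigenvalues_observed_vs_parallel(data)
print("observed eigenvalues:", np.round(obs, 2))
print("parallel benchmark  :", np.round(par, 2))
# Retain factors whose eigenvalue exceeds both 1.0 and the parallel benchmark.
print("factors to retain:", int(np.sum((obs > 1.0) & (obs > par))))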
distribution of a normal distribution). If below or above 0, it means not normal.if the other variables. Solution: Go back and check and try to delete the troublesome
Stage 5: Interpreting the Factors 1) Factor Interpretation: 1.1)Estimate the Factor
actual data closely follow diagonal, it’s normal 2) Check Kurtosis: peakedness items. Always delete 1 item each time and observe the changes. Factor Rotation:
Statistical Power (1 – β): is determined by 1) Effect size: actual magnitude of effect Matrix (compute unrotated factor loadings – correlation of variable and factor)
/flatness of distribution compared to normal distribution. Remedies: data There are two ways of rotations Orthogonal rotations (varimax, quartimax, equimax)
of interest; helps determine whether the observed relationship (difference or 1.2)Factor Rotation (employ rational method to achieve simpler and theoretically
transformation 2. Homoscedasticity Variance of the error terms appears constant constrain the factors to be uncorrelated vs Oblique rotations (direct oblimin, promax)
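To make the normality checks concrete, the sketch below uses scipy on a deliberately skewed, hypothetical variable; the Shapiro-Wilk test is an added common statistical check, not one named in the notes.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=300)             # right-skewed by construction

print("skewness:", round(stats.skew(x), 2))           # 0 for a symmetric distribution
print("excess kurtosis:", round(stats.kurtosis(x), 2))   # + = peaked, - = flat
w, p = stats.shapiro(x)
print("Shapiro-Wilk p:", round(p, 4))                 # small p -> nonnormal

x_log = np.log(x + 1)                                 # remedy: data transformation
print("skewness after log transform:", round(stats.skew(x_log), 2))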
correlation) is meaningful. e.g. If a weight loss firm claims its program leads to an more meaningful factor solutions) 1.3) Factor Interpretation and Respecification of
over a range of predictor variables. It is desirable bc the variance of the dependent permit the factors to be correlated with one another. Solutions: Try oblique rotation
average weight loss of 25 pounds, the 25 pounds = effect size 2) Alpha (α): as alpha factor model, if needed may involve Deletion of variables from analysis, Desire to use
variable being explained in the dependence relationship should not be concentrated first, if no correlation is detected → try orthogonal rotation **Run Oblique rotation>
becomes more restrictive (smaller), power decreases . Typical a different rotational approach, Need
in only a limited range of the independent values. Opposite is termed Check Component Correlation Matrix>If all correlation coefficients are greater than
α = 0.05. 3) Sample size: at any given alpha level, increased sample sizes produce greater power; power reaches acceptable levels at sample sizes of 100 or more (with moderate effect sizes) for alpha levels of both .05 and .01.
ROT3-3::Choosing Factor Models and Number of Factors: 1) Although both
Heteroscedasticity: The result of heteroscedasticity is to cause the predictions to be 0.32 → factors are correlated → should use oblique rotation. Factor Analysis
component and common factor analysis models yield similar results in common
better at some levels of the independent variable than at others. heteroscedasticity Option: Apart from “sort by size and exclude cases listwise” > You can choose
** when small effect size, statistical tests have little power research settings (30 or more variables or communalities of .60 for most variables): •
are result of non-normality in one or more variables. How to test Homoscedasticity Suppress Small Coefficients to have a better look of the matrix but should not be
E.g. if the effect size is small a sample of 200 with an alpha of .05 still has only a 50 The component analysis model is most appropriate when data reduction is paramount
1. Graphical: boxplots; 2. Statistical test: Levene test (univariate), Box’s M higher than 0.3. Cross-loading: When 1 item is loaded on more than 1 factor, or in
percent chance of significant differences being found. ** if the researcher expects that • The common factor model is best in well-specified theoretical applications 2) Any
(multivariate). 3. Linearity: the concept that the model possesses the properties of other words, the difference between the loadings across factors is less than 0.3
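A quick sketch of the Levene test named above for checking homoscedasticity across groups, run on two hypothetical groups with deliberately different spread (scipy; all values invented).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10, scale=1.0, size=60)
group_b = rng.normal(loc=10, scale=3.0, size=60)      # larger variance on purpose

stat, p = stats.levene(group_a, group_b)
print(f"Levene W = {stat:.2f}, p = {p:.4f}")           # p < .05 suggests heteroscedasticity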
the effect sizes will be small the study must have much larger sample sizes and/or decision on the number of factors to be retained should be based on several
additivity and homogeneity. Linear models predict values that fall in a straight line by (strict) or 0.2 (more flexible).If the difference is too small, you can delete the item.If
less restrictive alpha levels (e.g., .10). considerations: • Use of several stopping criteria to determine the initial number of
having a constant unit change (slope) of the dependent variable for a constant unit the difference is not too small, and one loading meets requirement in one factor,
factors to retain: - Factors with eigenvalues greater than 1.0 - A predetermined
change of the independent variable. Y = b0 + b1X1 + e, the effect of a change of 1 in X1 you can consider to categorize it into the factor with higher loadings. Negative
number of factors based on research objectives and/or prior research - Enough factors
is to add b1 (a constant) units to Y. How to test: 1)Examine scatterplots of the Loading: When consider the value of a loading or coefficient, we need to consider
to meet a specified percentage of variance explained, usually 60% or higher - Factors
ROT::STATISTIC power analysis: 1) Researcher should design study to achieve a power variables to identify nonlinear patterns in the data. 2) Run a simple regression two components: The magnitude (absolute value)- When we say a coefficient must
shown by the scree test to have substantial amounts of common variance (i.e., factors
level of 0.80 at the desired significance level. 2) More stringent significance levels (eg analysis to examine the residuals (=Portion of a dependent variable not explained by be greater than 0.6, it means the absolute value should be greater than 0.6. The sign
before inflection point) - More factors when heterogeneity is present among sample
0.01 instead of 0.05) require a larger samples to achieve the desire power level 3) a multivariate technique) which reflect the unexplained portion of the dependent (+ or -) showing the direction the item is related to the factor. In case negative
subgroups • Consideration of several alternative solutions (one more and one less
power can be increased by choosing a less stringent α (alpha) level ( eg 0.10 instead of variable. Remedies for Nonlinearity:transform one or both variables to achieve loadings appear, you should reconsider your item statement and your data, most
factor than the initial solution) to ensure the best structure is identified
0.05) 4) Smaller effect sizes require large sample sizes to achieve the desired power 5) linearity or explicit model components used to represent the nonlinear portion of the cases, it is because of your item is a reversed one. Cronbach Alpha: based on
to extract a different number of factors. -Desire to change method of extraction)
An increase in power is most likely achieved by increasing the sample size relationship. 3) Use curve fitting that reflect non-linear elements 4. Non-correlated standardized item. Result: If the resultsare closed to criteria, you might consider to
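A hedged sketch of the power guidelines in the ROT above, using statsmodels’ power module for an independent-samples t test (one possible reading of the rule; the effect sizes used are illustrative).

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect_size in (0.2, 0.5, 0.8):                    # small, moderate, large
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
    print(f"effect size {effect_size}: ~{n:.0f} cases per group for power .80")

# A stricter alpha (.01) raises the required sample size for the same power.
print(round(analysis.solve_power(effect_size=0.5, alpha=0.01, power=0.80)))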
2)Factor Extraction 3) Rotation of Factors :The ultimate effect of rotating the factor
Errors Correlated errors arise from a process that must be treated much like missing keep the items instead of deleting. Factor: One factor should have at least 2 items
Multivariate Technique: 1) Dependence technique: a variable or set of variables is matrix is to redistribute the variance from earlier factors to later ones to achieve a
data. Correlated error occurs bc 1) data collection process 2) time series data (data (Either delete the item, which means deleting the factor, or Try to force the item into
identified as the dependent variable to be predicted/explained by other independent simpler, theoretically more meaningful factor pattern. Orthogonal: axes are
for any time period is highly related to the data time periods both bf and afterward. the most correlated factor). Eigen Value: Consider to delete Eigen value in the
variables. Can be categorized by two characteristics: 1.1) number of dependent maintained at 90 degrees. vs Oblique: axes are not maintained at 90 degrees).
How to identify: 1) finding differences in the prediction errors in the two groups 2) summary table (optional)
variables (single, several dependent var) 1.2) type of measurement scale employed Orthogonal Rotation Methods- Quartimax (simplify rows) - Varimax (simplify
for time series data, see any patterns when we order/sort the data. Remedies: Chapter 5: Multiple Regression
by the variables (metric, nonmetric) /level of attribute & respondent Conjoint columns) - Equimax (combination 4) Significance of Factor Loadings (Practical
including the omitted causal factor into the multivariate analysis. 5. Autocorrelation Simple Regression: when obj of regression involves a single independent variable.
/metric (DV & IVs) → Multiple regression // nonmetric DV & metric IVs → multiple
can’t YX check by seeing D Multiple Regression: a statistical technique that can be used to analyze the
discriminant analysis & linear probability models // metric DVs & nonmetric IVs →
Data Transformations: A variable may have an undesirable characteristic that relationship between a single dependent (criterion) variable and several independent
MANOVA // nonmetric DV & metric IVs → Logistic regression // nonmetric variables,
detracts from its use in a multivariate technique. A transformation, such as taking the (predictor) variables. Formula: Y’ = b0 + b1X1 + b2X2 + . . . + bnXn + e Y=Dependent
(DVs&IVs) → dummy variable → canonical analysis both // if a set of dependent or ROT3-4::Choosing Factor Rotation Methods: 1) Orthogonal rotation methods: • Are Variable, b0=intercept (constant), b1= (regression coefficient or parameter estimate
logarithm or square root of the variable, creates a transformed variable that is more
independent variable relationships is postulated → structural equation modeling 2) the most widely used rotational methods • Are the preferred method when the change in Y for a one-unit change in that X), X1 = Independent Variable 1, e = prediction
suited to portraying the relationship. Transformations may be applied to either the
Interdependence technique: analysis of all variables in the set w/out distinction btw research goal is data reduction to either a smaller number of variables or a set of error (residual). (A variate value (Y’) is calculated for each respondent. The Y’ value is
dependent or independent variables, or both. The need and specific type of
dependent and independent variables. analyzes the structure of the uncorrelated measures for subsequent use in other multivariate techniques 2) Oblique a linear combination of the entire set of variables that best achieves the statistical
transformation may be based on theoretical reasons (e.g., transforming a known
interrelationships among a large number of variables to determine a set of common rotation methods: Are best suited to the goal of obtaining several theoretically objective. Impact of Multicollinearity: Collinearity: the association measured as the
nonlinear relationship) or empirical reasons (e.g., problems identified through
underlying dimensions (factors).  EFA&CFA // each object is similar to the other meaningful factors or constructs, because, realistically, few constructs in the real correlation btw two independent variables. Multicollinearity: the correlation among
graphical or statistical means). Help achieve 4 outcomes 1) enhancing statistical
objects in the cluster  cluster analysis // identifies “unrecognized” dimensions  world are uncorrelated three or more independent variables and evidenced when one is regressed against
properties – achieve normality, homoscedasticity, linearity 2) ease of interpretation-
perceptual mapping(Multidimensional scaling) // correspondence analysis others. Impact measures of predictive power: the impact of multicollinearity is to
standardization(provide common metric for comparison) vs centering (allow ROT3-5::Assessing Factor Loadings: 1) Although factor loadings of ±.30 to ±.40 are
Guidelines for Multivariate Analysis: 1) Practical Significance: means of assessing reduce any single independent variable’s unique predictive power by the extent to
comparison across variables) 3) representing specific relationship types 4) minimally acceptable, values greater than ±.50 are generally considered necessary for
multivariate analysis results based on their substantive findings whether the result is which it’s associated w/ other independent variables. If collinearity increases, the
simplification ( binning: categorization of values into smaller no. of categories vs practical significance 2) To be considered significant: • A smaller loading is needed
useful. 2) Statistical significance: determines whether the result is attributable to unique variance explained by each independent variable decrease >> shared
smoothing: use of response surface method to represent generalized patterns in given either a larger sample size or a larger number of variables being analyzed • A
chance in achieving the research objectives 3) Sample Size – large or small affect prediction percentage increases. Multicollinearity has the impact of reducing the
data larger loading is needed given a factor solution with a larger number of factors,
result 3) Know Your Data – outlier, assumption violation, missing data can create impacted variable’s regression coefficients. Favors variables w/ low multicollinearity:
ROT: Transforming Data especially in evaluating the loadings on later factors 3) Statistical tests of significance
substantial effects 4) Strive for Model Parsimony- simple model w/ greater bc it will maximize the prediction from the given no. of independent variables
Guideline/Rules of Thumb 2–6 (Transforming Data) for factor loadings are generally conservative and should be considered only as starting
explanatory predictive power 5)Look at error 6)simplify model 7)Validate results Multiple Regression Decision Process: Stage 1: Objectives of Multiple Regression
-To judge the potential impact of a transformation, calculate the ratio of the variable’s points needed for including a variable for further consideration
Chapter2: Examining Your Data researcher must consider three primary issues: 1) the appropriateness of the
mean to its standard deviation: ROT3-6::Interpreting the Factors: 1) An optimal structure exists when all variables
Graphical Examination: 1) Univariate profiling: examine the shape of research problem (Predictive purpose- maximize predictive accuracy by ensuring the
**Noticeable effects should occur when the ratio is less than 4. have high loadings only on a single factor 2) Variables that cross-load (load highly on
distribution(histogram) - the frequencies are plotted to examine the shape of the validity of the set of independent variables , model comparison by comparing two or
**When the transformation can be performed on either of two variables, select the two or more factors) are usually deleted unless theoretically justified or the objective
distribution of values 2) Bivariate profiling: examining relationship between more independent variables to ascertain the predictive power of each variate vs
variable with the smallest ratio. is strictly data reduction 3) Variables should generally have communalities of greater
variables, use scatterplot; examining group differences, use box-whisker plot 3) Explanation purpose – relative imp of independent variables by assessing magnitude
-Transformations should be applied to the independent variables except in the case of than .50 to be retained in the analysis 4) Respecification of a factor analysis can
Multivariate profiles: use multivariate graphical display - Shape: Histogram, Bar and direction of each independent variables, nature of relationship w/ dependent
heteroscedasticity. include such options as the following: • Deleting a variable(s) • Changing rotation
Chart, Box & Whisker plot, Stem and Leaf plot - Relationships: Scatterplot, Outliers variables, nature of relationships among independent variables (multicollinearity) ) 2)
- Heteroscedasticity can be remedied only by the transformation of the dependent methods • Increasing or decreasing the number of factors
Missing Data: Info not available for subject/case about whom other info is available. specification of a statistical relationship (Multiple regression is appropriate when
variable in a dependence relationship. If a heteroscedastic relationship is also Stage 6: Validation of Factor Analysis 1)Replication vs Confirmatory Perspective
Occurs when respondents fail to answer one or more questions in a survey. Impact: 1) reduce research is interested in statistical relation , not functional relationship(expect no
nonlinear, the dependent variable, and perhaps the independent variables, must be 2)Assessing Factor Structure Stability (large sample provide more confidence as to
sample size 2) distorts results. 4 Step to Identify Missing Data and Applying error in prediction, only calculate the exact value). Some random component is
transformed. generalizability and stability) 3) Detecting Influential Observations(estimate model
Remedies Step 1: Determine type of missing data 1) ignorable missing data - take always present (error in predicting variable) in the relationship being examined;
-Transformations may change the interpretation of the variables. For example, w/ and w/out observations identified as outliers to assess impacts on the results)
data from sample population rather than whole population, data collection calculate the average value) 3) selection of the dependent and independent
transforming variables by taking their logarithm translates the relationship into a Stage 7: Additional uses of Factor Analysis Results 1)Selecting Surrogate Variables
instrument(skip pattern), censored data(data not yet observed) 2) or not ignorable variables (a. support theory either conceptual or theoretical b. measurement error c.
measure of proportional change (elasticity). Always be sure to explore thoroughly the 2)Creating Summated Scales 3) Computing Factor Scores
missing data: known process (procedural factor-data entry/mgt), unknown process specification error). Measurement error: problematic can be addressed through
possible interpretations of the transformed variables.
(the respondent refuse to respond a certain question) Step 2 : Determine the extent ROT3-7::Summated Scales: 1) A summated scale is only as good as the items used either of two approaches: Summated scales- mitigate measurement error, or
Use variables in their original (untransformed) format when profiling or interpreting
of missing data determine whether the extent/amount of missing data is low enough to represent the construct; even though it may pass all empirical tests, it is useless Structural equation modeling procedures-can directly accommodate measurement
results.
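A small sketch of the mean-to-standard-deviation rule from ROT 2-6 above and of a log remedy, on a hypothetical skewed income variable (all numbers invented).

import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1.0, size=500)   # skewed by construction

ratio = income.mean() / income.std(ddof=1)
print("mean/sd ratio:", round(ratio, 2))               # < 4 -> transformation likely helps

log_income = np.log(income)                            # log transform toward normality
print("ratio after log:", round(log_income.mean() / log_income.std(ddof=1), 2))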
to not affect the result, even if it operates in a nonrandom manner by tabulating 1) % without theoretical justification 2) Never create a summated scale without first error. Specification error: the exclusion of relevant and inclusion of irrelevant
of variables with missing data 2) no. of cases w/ missing data. if the extent missing Dummy Variables: a nonmetric independent variable that has two (or more) distinct assessing its unidimensionality with exploratory or confirmatory factor analysis 3) independent variables. **when in doubt, include potentially irrelevant variables
data is high → step 3, if it’s low → step 4 levels that are coded 0 and 1(dichotomous). These variables act as replacement Once a scale is deemed unidimensional, its reliability score, as measured by (they can only confuse interpretation) rather than omitting a relevant variable (which
ROT2-1::How Much Missing Data Is Too Much?: 1) Missing data under 10% for an variables to enable non-metric variables to be used as metric variables. To account Cronbach’s alpha: • Should exceed a threshold of .70, although a .60 level can be can bias all regression estimates)
individual case or observation can generally be ignored (acceptable), except when the for L levels of a nonmetric variable,L- 1 dummy variables are needed. For example, used in exploratory research • The threshold should be raised as the number of
missing data occurs in a specific nonrandom fashion (e.g. concentration in a specific gender is measured as male or female and could be represented by two dummy items increases, especially as the number of items approaches 10 or more 4) With Stage 2: Research Design of a Multiple Regression Analysis Issues to consider
set of questions, attrition at the end of the questionnaire) 2) no.of cases with no variables (X1 and X2). When the respondent is male, X1 = 1 and X2 = 0 Likewise, reliability established, validity should be assessed in terms of: • Convergent validity 1)Sample Size 2) Creating additional variables
missing data must be sufficient for the selected analysis technique if replacement when the respondent is female, X1 = 0 and X2 = 1. However, when X1 = 1, we know scale correlates with other like scales • Discriminant validity scale is sufficiently ROT4-2::Sample Size Considerations: Statistical power1) Simple regression can be
values will not be substituted (imputed) for the missing data. that X2 must equal 0. Thus, we need only one variable, either X1 or X2, to represent different from other related scales • Nomological validity scale “predicts” as effective with a sample size of 20, but maintaining power at .80 in multiple
the variable gender. If a nonmetric variable has three levels, only two dummy theoretically suggested
Step 3: Diagnose the randomness of the missing data processes 1) Missing at regression requires a minimum sample of 50 and preferably 100 observations for
variables are needed.We always have one dummy variable less than the number of
random (MAR): Classification of missing data applicable when missing values of Y ROT3-8::Representing Factor Analysis in Other Analyses: 1) The Single Surrogate most research situations. Generalizability 2) The minimum ratio of observations to
levels
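The L-1 dummy-coding rule above can be sketched with pandas; the three-level variable and its labels are hypothetical.

import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "central", "south", "north"]})
# Three levels -> only two dummies; the omitted level becomes the reference category.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=int)
print(dummies)   # columns region_north and region_south; "central" is the reference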
depend on X, but not on Y. When missing data are MAR, observed data for Y are a Variable Advantages: • Simple to administer and interpret Disadvantages: • Does not variables is 5 to 1, but the preferred ratio is 15 or 20 to 1, and this should increase
Simple approach to understand data: - Tabulation = a listing of how respondents
truly random sample for the X values in the sample, but not a random sample of all Y represent all “facets” of a factor • Prone to measurement error 2) Factor Scores when stepwise estimation is used. 3) Maximizing the degrees of freedom improves
answered all possible answers to each question. This typically is shown in a
values due to missing values of X. 2) Missing completely at random (MCAR): when Advantages • Represent all variables loading on the factor • Best method for complete generalizability and addresses both model parsimony and sample size concerns.
frequency table. - Cross Tabulation = a listing of how respondents answered two or
missing values of Y are not dependent on X. When missing data are MCAR, observed data reduction • Are by default orthogonal and can avoid complications caused by ROT4-3::Variable Transformations: 1) Nonmetric variables can only be included in a
more questions. This typically is shown in a two-way frequency table to enable
values of Y are a truly random sample of all Y values, with no underlying process that multicollinearity Disadvantages • Interpretation more difficult because all variables regression analysis by creating dummy variables. 2) Dummy variables can only be
comparisons between groups. - ANOVA= a statistic that tests for significant
lends bias to the observed data. contribute through loadings • Difficult to replicate across studies 3) Summated Scales interpreted in relation to their reference category. 3) Adding an additional
differences between two means.
ROT2-2::Deletions Based on Missing Data: 1) Variables with as little as 15 percent Advantages • Compromise between the surrogate variable and factor score options • polynomial term represents another inflection point in the curvilinear relationship.
Chapter 2: Tutorial Class
missing data are candidates for deletion , but higher levels of missing data (20% to replicated across studies Disadvantages • Include only the variables that load highly on 4) Quadratic (second-order) and cubic (third-order) polynomials are generally sufficient
30%) can often be remedied 2) Be sure the overall decrease in missing data is large replicated across studies Disadvantages • Include only the variables that load highly on to represent most curvilinear relationships. 5) Assessing the significance of a
Frequencies Click on Statistic box > Check on Minimum, maximum, and arrange in
enough to justify deleting an individual variable or case 3) Cases with missing data for the factor and excludes those having little or marginal impact • Not necessarily polynomial or interaction term is accomplished by evaluating incremental R2, not
dispersion part > continue>ok. How to check outliers via Descriptive Statistics
dependent variable(s) typically are deleted to avoid any artificial increase in orthogonal • Require extensive analysis of reliability and validity issues the significance of individual coefficients, due to high multicollinearity.
relationships with independent variables Click on Analyze > Descriptive Statistics > Explore>> insert items into dependent
list>> In statistics choose outliers>>continue>>OK Dealing w/ missing values: VOCABS Anti-image correlation matrix Matrix of the partial correlations among Moderator Effect: When the moderator variable, a second independent variable,
Step 4: Select the imputation method Imputation is the process of estimating the variables after factor analysis, representing the degree to which the factors explain changes the form of the relationship between another independent variable and the
Click on Transform>Replace missing value> insert items w/ missing values into
missing value based on valid values of other variables and/or cases in the sample. each other in the results. The diagonal contains the measures of sampling adequacy dependent variable. The moderator term is a compound variable formed by
box (new variable) >In Method choose Series mean>OK for each variable, and the off-diagonal values are partial correlations among variables
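Besides the SPSS series-mean replacement shown above, a hedged Python sketch of two Step-4 options, mean substitution and regression-style imputation, using scikit-learn on a tiny invented data frame:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the class below)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"x1": [2.0, 4.0, np.nan, 5.0, 3.0],
                   "x2": [1.0, 2.0, 2.5, np.nan, 1.5]})

# Mean substitution: replace each missing value with the variable's mean.
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)

# Regression-style imputation: predict missing values from the other variables.
reg_imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                           columns=df.columns)
print(mean_imputed)
print(reg_imputed)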
Nonmetric variables are not amenable to imputation bc even though estimates of multiplying X1 by the moderator X2, >> (X1X2). The coefficient (b3) of the
Factor analysis Bartlett test of sphericity Statistical test for the overall significance of all correlations interaction/moderator term indicates the unit change in the effect of X 1 as X2
the missing data can be made w/ estimates such as the mean of all valid values,
no comparable measures are available for nonmetric variables Imputation of a MAR Exploratory factor analysis (EFA): EFA can be conducted without knowing how many within a correlation matrix. Cluster analysis Multivariate technique with the changes. The coefficients (b1, b2) of the two independent variables now represent the
Missing Data Process: use model-based approach (Maximum likelihood and EM: factors really exist or which variables belong to which constructs. -Explore how many objective of grouping respondents or cases with similar profiles on a defined set of effects when the other independent variable is zero. Mediator effect: when the
direct estimation of means and covariance matrix vs Multiple imputation: generating factors are needed to best present the data. –EFA= the factors were derived from characteristics. Communality Total amount of variance an original variable share effect of an independent variable may “work through” an intervening variable (the
multiple datasets w/ imputed data differing in each dataset). Imputation of a MCAR statistical results, not from theory. It is data driven. Factors can only be named after with all other variables included in the analysis. Component analysis Factor model in mediating variable) to predict the dependent variable
Missing Data Process: 1) Use only valid data (Complete case approach: include only which the factors are based on the total variance. Correlation matrix Table showing Stage 3: Assumptions in Multiple Regression Analysis 1) Linearity of the
the factor analysis is performed. How to get EFA in SPSS: Choose Factor
observations with complete data; use only case w/ no missing data vs Using all the intercorrelations among all variables. Cronbach’s alpha Measure of reliability phenomenon measured 2) Homoscedasticity: Constant variance of the error terms
Analysis → Extraction → Based on Eigenvalue > 1 that ranges from 0 to 1, with values of .60 to .70 deemed the lower limit of
available data: imputes the distribution characteristics/relationships from every valid 3) Normality of the error term distribution 4) Independence of the error terms.
-Confirmatory factor analysis (CFA):CFA is similar to EFA in some respects, but acceptability. Cross-loading A variable has two more factor loadings exceeding the
value; the imputation process occur by using the obtained correlation on just the
philosophically it is quite different. -Researcher must specify both the number threshold value deemed necessary for inclusion in the factor interpretation process.
case w/ valid data as representative for entire sample) 2) Using known replacement
ROT4-4:: Assessing Statistical Assumptions: 1) Testing assumptions must be done not only The partial F-test is simply a statistical test for the additional contribution to change in the dependent variable shown by: Percentage change in odds = objects being clustered. •relate specifically to the obj. 2. Practical considerations:
for each dependent and independent variable, but for the variate as well. 2) Graphical prediction accuracy of a variable above that of the variables already in the equation (Exponentiated coefficient - 1.0) × 100 always use the BEST var.
analyses (i.e., partial regression plots, residual plots, and normal probability plots) are the Chapter 05: Multiple Regression Tutorial /Multiple Regression Step by Step: ROT7-2::Model estimation and Model Fit: 1) Although stepwise estimation may
most widely used methods of assessing assumptions for the variate. 3) Remedies for
Stage 2: Research Design: Types of Var included: metric & non-metric, but not mixed.
1. Compute mean scores: compute the mean score of each factor in independent seem “optimal” by selecting the most parsimonious set of maximally discriminating Number of clustering var: can have impact as few as 20 var. “Curse of
problems found in the variate must be accomplished by modifying one or more independent
variables construct and the factor of the dependent construct. Should pick up the final formal variables, beware of the impact of multicollinearity on the assessment of each dimensionality” when large number of Var analyzed. Relevancy of Clustering Var:
Linearity: relationship btw dependent and independent variables represents the degree
to which the change in the dependent variable is associated w/ the independent numeric expression, enter target variable>OK 2. How to produce results Step1: Click
variable. Examined through residual plots and comparison to null plot Q: how to Analyze >> Regression >> Linear >> Select a factor of Dependent construct from the only the statistically significant discriminant functions, but consider if nonsignificant the relevance of each group.*Increase sample size may pose probs for hierarchy
determine which independent variable to select for corrective action - use partial left panel to “Dependent” >> Select all the factors of independent constructs from functions (with sig. levels up to 0.3) add explanatory power. clustering methods & required “hybrid” approach.* Detecting outlier: should be
regression plot: shows the relationship of a single independent variable to the the left panel to “Independent” Step 2: Click on “Statistics” >> Select “R squared ROT7-3::Assessing Predictive Accuracy: There are multiple criteria for comparison to removed if outlier present aberrant observation not population, insignificant
dependent variable, controlling for the effects of all other independent change”, “Collinearity Diagnostics”, “Durbin-Watson”, and “Covariance Matrix” >> the hit ratio: 1) The maximum chance criterion for evaluating the hit ratio is the most segment w/n population. Retain if represent under-sampling, poor representation of
variables .Corrective action 1) data value transformation 2)include polynomial in Click on “Continue.” Step3: Click on Method” >>Stepwise>> OK For next round, you conservative, giving the highest baseline value to exceed. 2) Be cautious in using the relevant group in population. Defining and measuring Interobject similarity:
regression model 3) non-linear regression method. Homoscedasticity (constant spread of the
residuals): Diagnosis with graphical plots or simple statistical tests. Adjusted R-square, Durbin-Watson (D-W) and other related values. ANOVA: F-value
Remedies include: 1) variable transformation 2) weighted least squares - weight and significant level (sig.) Coefficients: Standardized Coefficient (Beta—β), t-value, p- establishing the comparison standard and is the most popular 4) The actual 2. Correlation: less used, measure patterns not distance.3. Distance measure: most
each observation based on its variance and mitigates the variation in variance of value, and VIF (VIF range) Excluded Variables (Deletion/Reduction) need to compare predictive accuracy (hit ratio) should exceed the any criterion value by at least 25% often used with higher representing greater dissimilarity (distance between case).3.1
residual seen in heteroscedasticity 3) heteroscedasticity-consistent standard errors w/ t-value (>1.96) and p-value criteria (p <0.05 or <0.001 or 0.01) The chi-square test is used to evaluate the reduction in the log likelihood value. Euclidean(straight line) most common used 3.2 squared Euclidean: sum of square
(HCSE) - estimates of standard errors are corrected for any heteroscedasticity that (Coefficient)Beta (β) > 0.015 represents However, these statistical tests are particularly sensitive to sample size. Nagelkerke distance.3.3 Mahalanobis(D2) var intercorrelations and weights each var equally.
may be present. Normality: normal probability plot- the standardized residuals are how many units in the dependent variable R2 Change In stepwise, each time of R2 proposed a modification that had the range of 0 to 1. Both of these additional Data Standardization: Rescaling data test to have common scale 2 Approach to
compared w/ normal distribution. Normal distribution makes straight diagonal line will increase if the independent variable adding more independent variables into a measures are interpreted as reflecting the amount of variation accounted for by the standardize: 1. Relative to other cases: most common is Z score 2. Relative to other
and plotted residuals are compared w/ diagonal. If distribution is normal, the residual increases by 1 unit. It shows the model, the program compares the new logistic model, with 1.0 indicating perfect model fit. *The Wald statistic is used to responses w/n an object. If groups are to be identified according to an individual’s
line closely follows the diagonal. Regression considered robust to violations of magnitude of the relationship. When model with the previous one. R squared assess significance in a manner similar to the t test used in multiple regression. response style, then within-case or row-centering standardization is appropriate.
normality when sample size exceeds 200. Independence of error term: Predicted interpreting Beta, you need to pay change represents the improvement of R PREDICTIVE ACCURACY: based on cut-off value is 0.50 is applicable when the group Stage 3:Assumption -1. Structure exists: assume that ”natural” structure of object
values are not correlated to any variables in the analysis. Residuals plotted versus attention to the sign of Beta to check squared gained by adding those new sizes of the dependent variable are equal; generate 4 outcomes (TN, TP,FN,FP). exists. 2. Representativeness of sample: obtained sample is truly representative of
offending variables. Offending variables 1) time series data - represent the consistent with the direction of the independent variables. The R-square Accuracy: no. of true positive and true negative divided by the total samples. In population 3. Impact of multicollinearity: multicoll among subset of var is an implicit
observations on the same unit over multiple occasions 2)cluster hypothesized relationship. Also, Beta change is tested with an F-test, which is assessing discriminant analysis, that is, measuring how well group membership is weighting of cluster var. • potential remedies •reduce the var to equal number in
data/grouping/sequencing variable - when data is hierarchically distributed e.g. coefficient of different independent referred to as the F-change. A significant predicted and developing a hit ratio(accuracy of +and - combined), which is the each set of correlated measure. •use distance measure that compensate for the
student and classroom Stage 4: Estimating the Regression Model and Assessing variables can be compared with each F-change means that the variables added percentage correctly classified. Chi-Square-Based Measure: This test provides a correlation •take proactive approach and include only cluster var that are not highly
Overall Model Fit Variable specification 1)Use variables in their original form: Allows other. By doing that, we know which in that step significantly improved the comprehensive measure of predictive accuracy that is based not on the likelihood correlated. Stage 4: Deriving cluster& assess overall fit-1.Hierarchical: all obj start at
for use of direct measures of the variables of interest. As the number of variables independent variables have stronger prediction. value, but rather on the actual prediction of the dependent variable. The appropriate separate then join tgt, each step form new cluster joining by 2 at a time and only
increases, interpretability may become problematic.2) Dimensional reduction: either effect on the dependent variable. use of this test requires an appropriate sample size. Hosmer and Lemeshow test is a single cluster remain. 2.Non-hierarchical Approach: 2 types of Hierarchical
software controlled (principal component regression) or user controlled (EFA). Adj.R2 > 0.1 modified version of R-
Variable selection 1) user-controlled 1.1) confirmatory: allow direct testing of a pre- squared that has been adjusted for the outcomes starts as individual and join tgt sequentially (•Single linkage: nearest in one cluster,
specified model 1.2) Combinatorial (All-Possible-Subsets): allowing the researcher to number of predictors in the model. The R2 > 0.1 represent the proportion of Stage 5: Interpretation of the Results logistic regression model results in coefficients •complete linkage:farthest: max distance between observation in each cluster
review the entire set of roughly equivalent models in terms of predictive accuracy. adjusted R-squared increases only if the variance in the dependent variable for the independent variables much like regression coefficients and quite different •average: avg similarity of all individual in cluster, less affect by outlier •centroid
2)Software controlled 2.1) Sequential search method 2.1.1) Forward inclusion and new independent variable improves the explained by the independent variables. from the loadings of discriminant analysis. The Wald statistic provides the statistical :distance btw cluster, less affected by outlier •Ward’s : total sum of square w/n
backward elimination: trial and error process for finding the best regression model more than would be expected by The higher the R squared, the better the significance for each estimated coefficient so that hypothesis testing can occur just as cluster easily distort by outlier) 2.Divisive: Breakdown from single cluster then
estimates. Forward inclusion: builds regression equation starting w/ a single chance. It decreases when a predictor model can predict the dependent it does in multiple regression. Directionality of the Relationship A positive divided into smaller.(Pro: simplicity(tree like structure),speed, measure similarityto
independent variable 2.1.2)Backward elimination: starts w/ regression equation improves the model by less than expected variable. (how much X can explain Y) relationship means an increase in the independent variable is associated with an address in many situations, Con: permanent combi: once join never separate Outlier
including all independent variables and then deletes independent variables that do by chance. Show % the means explain the increase in the predicted probability, and vice versa. But the direction of the may appear, require Large sample) Non-hierarchical: 1.Determine number of cluster
not contribute significantly 2.1.3) Stepwise: variables possibly removed once variance
included in regression equation Q: Dif btw forward inclusion & backward coefficients. Original coefficient signs indicate the direction of the relationship.
VIF range < 10 (ideally < 3) shows us the to one of the seed based on similarity.(Pro: less sensitive to outlier, distance
elimination and stepwise: stepwise has the ability to add or delete variables at each DW (1.5 - 2.5) shows us the severity of Exponentiated coefficients are interpreted differently since they are the logarithms
severity of multicollinearity. measure, inclusion of irrelevant or inappropriate var & can easily analyze very large
stage. Once a variable is deleted in forward or backward, the action can't be reversed at a
Multicollinearity happens when data, Con: require knowledge of seed point, diff guarantee optimal solution, less
later stage. 2.2) Constrained: A variant of sequential methods whereby variables severe, that means the assumption that coefficients above 1.0 represent a positive relationship (=a relationship w/ no and
independent variables are highly efficient, spherical and equally sized cluster ) Selecting btw hierar and non-hierar:
regression weights are constrained to maximize parsimony in the final model results. each observation is independent is values less than 1.0 represent negative relationships. Magnitude of the Relationship
correlated with each other. If greater than Hierarchical: •sample size is moderate 300-400 not exceed 1000, wide range of
2.2.1) Ridge regression: method employing shrinkage; the regression estimates are violated (Cannot YX). If w/thin this (to determine how much the odds will change given a one-unit-change in the
1 (predictor is moderately correlated, VIF alternative clustering solution. Non-Hierarchical: know the number of cluster and
shrunk based on the tuning parameter or ridge estimator. Obj: to bring the range, there’s no first order linear auto
must <3. If the VIF>10 the regression is initial seed points can be specified, concern about outlier since this method less
regression estimates that are inflated by the effects of multicollinearity closer to correlation in data magnitude of metric independent variables is interpreted differently for original and
poorly estimated due to multicollinearity. susceptible to outlier. Determine Number of cluster: Stopping rule: criteria used w/
their true value. 2.2.2) Lasso: adding a variable selection component, which comes exponentiated logistic coefficients:
P-Value < 0.05 (ANOVA, Sig ***p<0.001, F-Value≥ 4 (ANOVA) means that the hierarchical technique to identify potential cluster solution, Foundation principle: a
from the addition of additional constraint on the estimated coefficient. Result= small
** p<0.01, *p<0.05) indicates how the regression model explain a significant Original logistic coefficients – are less useful since the reflect the change in the logit natural increase in heterogeneity comes from the reduction in number of cluster.
coefficient can be reduced to 0 >> eliminating them from the variate.
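For reference, a minimal Python sketch of the shrinkage behind ridge and the variable-selection behaviour of lasso, assuming scikit-learn; the data and alpha values are illustrative only, not part of the original notes.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=200)      # deliberately collinear pair
y = 2 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)    # shrinkage controlled by the tuning (ridge) parameter alpha
lasso = Lasso(alpha=0.1).fit(X, y)    # extra constraint lets some coefficients reach exactly 0

print("Ridge:", ridge.coef_.round(3))  # all coefficients shrunk toward zero, none exactly zero
print("Lasso:", lasso.coef_.round(3))  # redundant/weak predictors dropped (coefficient = 0)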
hypothesis significances. Represents the amount of the variance (variance = data (logged odds) value. Exponentiated coefficients : directly reflect the magnitude of Deriving The Final Cluster Solution: No single objective procedure to determine the
ROT4-5::Estimation Techniques: 1) No matter which estimation technique is chosen, the change in the odds value. But their impact is multiplicative and a coefficient of correct number of clusters; rather the researcher can evaluate alternative cluster
probability of type I error (when we think spread, differentiate among variables; if
theory must be a guiding factor in evaluating the final regression model because: 1.0 denotes no change (1.0 times the independent variable = no change). solutions on two general types of stopping rules:1.Measures of heterogeneity change
that there is a relationship even though sig. at P.. level spread enough)
• Confirmatory Specification, the only method to allow direct testing of a pre- :percentage changes in heterogeneity, when moving from k to k - 1
there is not). For example, a p-value of T-Value > 1.96 means coefficient level. Stage 6: Validation of the Results ensuring the external as well as internal validity of
specified model, is also the most complex from the perspectives of specification error, clusters .Candidates for a final cluster solution are a large increase in heterogeneity
0.015 means there is a 1.5% chance of finding the check sign negative or positive
model parsimony and achieving maximum predictive accuracy. • Sequential search by joining two clusters 2.Direct measures of heterogeneity :reflect the compactness
relationship we found when none exists in the population. relationship of Y and X
(e.g., stepwise), while maximizing predictive accuracy, represents a completely and separation of a specific cluster solution. These measures are compared across a
utilizing a procedure that repeatedly processes the estimation sample. External
“automated” approach to model estimation, leaving the researcher almost no control range of cluster solutions, with the cluster solution(s) exhibiting more compactness
validity is supported when the hit ratio of the selected approach exceeds the
over the final model specification. • Combinatorial estimation, while considering all and separation being preferred. Among the most prevalent measures are the CCC
comparison standards that represent the predictive accuracy expected by chance.
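For the hit-ratio benchmarks mentioned here (maximum chance, proportional chance, and the 25% rule), a small sketch with purely hypothetical classification results:

import numpy as np

actual    = np.array([1] * 60 + [0] * 40)                         # hypothetical two-group outcome
predicted = np.array([1] * 50 + [0] * 10 + [0] * 32 + [1] * 8)    # hypothetical model predictions

hit_ratio = np.mean(actual == predicted)          # percentage correctly classified
p = np.mean(actual == 1)                          # proportion in group 1
proportional_chance = p**2 + (1 - p)**2           # C_pro: considers all group sizes
maximum_chance = max(p, 1 - p)                    # C_max: size of the largest group

print(f"hit ratio = {hit_ratio:.2%}")
print(f"proportional chance = {proportional_chance:.2%}; 25% rule target = {1.25 * proportional_chance:.2%}")
print(f"maximum chance = {maximum_chance:.2%}")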
possible models, still removes control from the researcher in terms of final model (cubic clustering criterion), a statistical measure of cluster variation (pseudo F
The goodness-of-fit for a logistic regression model can be assessed in two ways: (1)
specification even though the researcher can view the set of roughly equivalent statistic) or the internal validation index (Dunn's
using pseudo R2 values, similar to that found in multiple regressions, and (2)
models in terms of predictive accuracy. 2) No single method is “Best” and the prudent
examining predictive accuracy (i.e., the classification matrix in discriminant analysis) EFA
strategy is to use a combination of approaches to capitalize on the strengths of each
to reflect the theoretical basis of the research question. Chapter 08: Step by step of Logistic Regression When all criteria fit>> reliability test
Components of Model Fit: 1) Total Sum of Squares (SST): total amount of variation 1.Compute mean score(same like multiple regression) IF run FA and see criteria doesn’t fit delete
that exists to be explained by the independent variables. TSS = the sum of SSE and 2. Label or Divide Group Mean (i.e., 1 and 2) convert the dependent variable into that item out 1) communality – delete the
SSR. 2)Sum of Squared Errors (SSE): the variance in the dependent variable not dichotomous variable by using K-Means Cluster method (KMC) (Analysis >> Classify item w/ unmet criteria 2)component
accounted for by the regression model = residual. Obj is to obtain the smallest >> K-Means Cluster then After clicking on K-Means Cluster >> select and enter “var” matrix or factor loading 3)meaning related
possible sum of squared errors as a measure of prediction accuracy. 3) Sum of to “Variable” box >> click on “Save” >> click on “Cluster membership” box >> Click on Interpretation: item deleted bc of having
Squares Regression (SSR): the amount of improvement in explanation of the “Continue” Then, click on “Options” >> Click on “Initial cluster centers”, “ANOVA communalities of xxx, which is lower than
How to interpret the result? :1) Check the Correlation between all variables Click on criteria, Cronbach’s alpha meet criteria =
dependent variable attributable to the independent variables. We want SSR to be tables”, and “Cluster information for each case” box >> click on “Continue” >> “OK.”
Analyze > Correlate > Bivariate, then enter all variables 2) Report the multiple item is reliable
good. Then, go back to data view in SPSS, able to see K-Means >> change name of cluster
regression table How to name factor: 1) read questionnaire
Measures of model fit: 1) F statistic is used to determine if the overall regression that we made
Method: Stepwise (this method will choose the best IND variables into equation and item or 2) see which item gives the highest
model is statistically significant. If the F statistic is significant, it means it is unlikely
exclude the bad IND variables, then it generates models with different variable sets). factor loading>>name based on what is
your sample will produce a large R2 when the population R2 is zero. To be considered
SAMPLE Multiple Regression INTERPRETATION: In the 1st and 2nd model, use the that item abt.
statistically significant, a rule of thumb is there must be <.05 probability for
stepwise method. This method will make the variables in the model could explain a
statistical significance 2) R2 (Coefficient of Determination) – strength of overall
significant amount of additional variance. In the model 1, we found that the R² of
variate relationship. If the R2 is statistically significant, we then evaluate the strength
our model is 0.514 with the adjusted R² = .511. This means that the linear regression Interpretation: Based on the result of K-Mean Cluster, the KMC will be separated
of the linear association between the dependent variable and the several
explains 51.4% of the variance. into 2 clusters (we consider 2 groups). The 1st group has 62 cases and the mean of
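As a cross-check outside SPSS, a minimal statsmodels sketch (simulated data, not the study's variables) showing where R², adjusted R², the overall F test, and the coefficient t and p values come from:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                      # two metric independent variables
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.rsquared, model.rsquared_adj)          # R-square and adjusted R-square
print(model.fvalue, model.f_pvalue)                # overall F test of the regression model
print(model.params, model.tvalues, model.pvalues)  # coefficients with their t and p values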
independent variables. R2, also called the coefficient of determination, is used to
(There is no cut-off value of R². In the business field, R² bigger than 0.1 indicates an
measure the strength of the overall relationship. R2 ranges from 0 to 1.0. The sign (+
acceptable explanation of the variance, but in many cases a lower R² is accepted). In
or -) indicates the direction of the relationship. The value can range from +1 to -1,
Anova table, The F-test is highly significant (F=209.036, p<0.001), which means the regression results(Analyze >> Regression >> Binary Logistic Then, select Cluster
with +1 indicating a perfect positive relationship, 0 indicating no relationship, and -1
regression model explains a significant amount of the variance in Dependent number case “dependent var” and enter it to “Dependent” panel and select
indicating a perfect negative or reverse relationship. It represents the amount of the
variable. (Criteria for F test: F>4, p<0.05). The “Behavior Innovation” significantly “independent var” mean score to “Covariates” panel 4. check and select the results:
dependent variable “explained” by the independent variables combined. A large R2
predicted the “Technology Innovation” with the coefficient β = 0.717*** (p<0.001)
indicates the straight line works well while a small R2 indicates it does not work well.
(criteria for beta: t>1.96, p<0.05) .Durbin-Watson test for auto-correlation test: The
Even though an R2 is statistically significant, it does not mean it is practically
Durbin-Watson d = 1.612, which is between the two critical values of 1.5 < d < 2.5.
significant. We also must ask whether the results are meaningful. For example, is the
Therefore, we can assume that there is no first order linear auto-correlation in our
value of knowing you have explained 4 % of the variation worth the cost of collecting
multiple regression data. For the multicollinearity test in our multiple regression
and analyzing the data? 3) Adjusted R2 is based on no. of independent variables
model, the VIF = 1, there is no multicollinearity among factors. (If the VIF is greater
relative to sample size >> makes allowance for degree of freedom of each model.
than 1, the predictors may be moderately correlated, and VIF should <3. If the Discriminant Analysis single, non-metric (categorical) dependent variable is
Fewer observations will reduce adjusted R2 Significant Test for Regression
VIF>10, you can assume that the regression coefficients are poorly estimated due to predicted by several metric independent variables. Examples of Dependent
Coefficient: a statistically based probability estimates of whether the estimated
multicollinearity). In any model, if there is at least 1 beta significant and other criteria Variables: Gender (Male vs. Female); Culture (USA vs. Outside USA) (Dependence
coefficient across large no. of samples of a certain size will indeed be different from
are met, you could conclude that the hypothesis is supported.
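The Durbin-Watson and VIF checks reported above can be reproduced as follows; a sketch assuming statsmodels, with simulated data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.5, 0.4, 0.0]) + rng.normal(size=300)

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()

# VIF per predictor (skip the constant); rule of thumb in these notes: < 10, ideally < 3
vifs = [variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])]
print("VIF:", np.round(vifs, 2))

# Durbin-Watson near 2 (roughly 1.5-2.5) suggests no first-order autocorrelation of residuals
print("Durbin-Watson:", round(durbin_watson(fit.resid), 3))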
zero 1) Establishing confidence interval: desired alpha=0.05, standard error:
3) Check for normality of residuals with a normal P-P plot: Step1: Click on Analyze by several metric independent variables. This technique is similar to discriminant
expected sampling error of coefficient, confidence interval: no. of standard error
>Regression >Linear >Plots Step2: Choose Scatter 1: Y-> *ZRESID X as *ZPRED In analysis but relies on calculations more like regression (Dependence techniques)
based on alpha level times the value of the standard error 2) Applying confidence
Standardized Residual Plots, choose Histogram and Normal Probability Plot. Result: Manova Several metric dependent variables are predicted by a set of nonmetric
interval: statistical significance established if confidence interval does not include
The plot shows that the points generally follow the normal (diagonal) line with no As long as the Chi-square is significant and at least one Wald test is significant, (categorical) independent variables. Canonical Analysis Several metric dependent
zero , sample size has direct effect on standard error (increase sample size will
strong deviations. This indicates that the residuals are normally distributed. you can conclude that the hypothesis is supported. #OddRatio in %=((Exp(B)) - variables are predicted by several metric independent variables. (Dependence
decrease standard error), must ensure practical significance when using large sample
Different between stepwise and enter method? Stepwise method allows the 1.0)*100 techniques) Conjoint Analysis is used to understand respondents’ preferences for
size. Influential Observations: all observation that lie outside general pattern of data
program to test many competing models. It starts with a model with only 1 products and services. In doing this, it determines the importance of both: attributes
set and has disproportionate effect on the regression result. Types: 1) Outliers: Chi Squared test sig., with p<0.05 tests for significant differences between the
independent variable, then add more variables in the subsequent models to see if and levels of attributes based on a smaller subset of combinations of attributes and
observations that have large residual values and can be identified only with respect frequency distributions for two (or more) categorical variables (non-metric) in a
there is any improvement. If the program finds that adding 1 more certain variable levels. • Typical Applications: Soft Drinks, Candy Bars, Cereals, Beer, Apartment
to a specific regression model. 2) Leverage points: observations that are distinct cross-tabulation table. Used to evaluate the reduction in the log likelihood value.
will not improve much the predictive power of the model, then that variable will be Buildings/Condos, Solvents/Cleaning Fluids. (Dependence techniques) Structural
from the remaining observations based on their independent variable values. (>6.63)
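For the logistic coefficients and the odds-ratio conversion used below (#OddRatio in % = (Exp(B) - 1.0)*100), a minimal sketch assuming statsmodels and simulated binary data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=400)
p = 1 / (1 + np.exp(-(0.5 + 0.8 * x)))      # logistic (S-shaped) probability within 0 and 1
y = rng.binomial(1, p)                       # binary (two-group) dependent variable

logit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)   # maximum likelihood estimation
b = logit.params[1]                          # original coefficient B (logged-odds scale)
exp_b = np.exp(b)                            # exponentiated coefficient, Exp(B)
print(f"B = {b:.3f}, Exp(B) = {exp_b:.3f}, change in odds = {(exp_b - 1.0) * 100:.1f}%")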
dropped and not showed in the result. Enter method gives full control to the Equations Modeling (SEM) Estimates multiple, interrelated dependence
3)Influential observations are the broadest category, including all observations that Cox and Snell R2 (range from 0-0.75) OR Nagelkerke R2 (range from 0-1) OR
researcher. We can decide exactly which variables are included in the model. With relationships based on two components: 1. Measurement Model 2. Structural Model
have a disproportionate effect on the regression results. Influential observations “Pseudo” R square A value of overall model fit that can be calculated for logistic
this method, we can test the model that has all independent variables included. (Dependence techniques)
potentially include outliers and leverage points but may include other observations regression; comparable to the R2 measure used in multiple regression. higher values
Which criteria are used to conclude whether a hypothesis is supported or not? The Overview: The “what” and “why” of factor analysis
as well. Impact: 1) Reinforcing: reinforcing general pattern and lowering standard indicating greater model fit. No cut-off number of “pseudo” R-square
most important indices are Beta, t-value and p-value, R-squared and adjusted R Factor analysis is a method of data reduction. by seeking underlying unobservable
error of the prediction and coefficient. It's a leverage point but has small/zero residual -2 Log Likelihood approaching 0 indicates better fit (perfect fit -> likelihood=1, -2LL=0).
squared. A hypothesis is supported only when all of those indices satisfy the criteria. (latent) variables that are reflected in the observed variables (manifest variables).
value bc it’s predicted well by regression model 2) Conflicting: has effect that is Reflected in pseudo R square. No special criteria
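The Cox & Snell and Nagelkerke values can be reproduced directly from the null and fitted log-likelihoods; a worked sketch with hypothetical -2LL values:

import numpy as np

def pseudo_r2(ll_null, ll_model, n):
    # Cox & Snell R2 = 1 - (L0/LM)^(2/n); Nagelkerke rescales it to a 0-1 range
    cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))
    nagelkerke = cox_snell / (1 - np.exp((2 / n) * ll_null))
    return cox_snell, nagelkerke

# hypothetical example: -2LL(null) = 260, -2LL(model) = 180, n = 200
cs, nk = pseudo_r2(ll_null=-130.0, ll_model=-90.0, n=200)
print(round(cs, 3), round(nk, 3))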
DW and VIF range are just for reference. It is better if they satisfy the criteria but if Factor analysis is a technique that requires a large sample size (300,500,1000). Factor
contrary to the general pattern of remaining data but still have small residuals 3)
they don't, that does not necessary mean the hypothesis is not supported. The beta β: relationship between the independent variables and the dependent analysis is based on the correlation matrix of the variables involved, and correlations
Shifting: affect all results in similar manner 4 groups of influential observations: 1)
Chapter 08: Logistic Regression: a specialized form of regression that is formulated
no issues: fit well and no extreme values 2)high leverage but no outlier: very
to predict and explain a binary (two-group) categorical variable rather than a metric beta. Its value is depended on Wald tests. per variable is necessary to avoid computational difficulties.
different but still predicted well by the model 3) outlier but acceptable leverage:
dependent measure. Q: Why Logistic Regression is preferred? ANS: 1) Logistic The Wald number is used to determine statistical significance for each of the Kaiser-Meyer-Olkin Measure of Sampling Adequacy – This measure varies between 0
high residual, no extreme value 4)outlier and high leverage: poor prediction and
regression does not face these strict assumptions, unlike discriminant analysis 2)
quite different << we want to delete this type. Corrective Actions: 1)An error in
similar to multiple regression (incorporating metric and nonmetric variables and p<0.05. HOWEVER, Wald method has been the subject of extensive criticism by null hypothesis that the correlation matrix is an identity matrix. An identity matrix is
observations or data entry : remedy by correcting the data or deleting the case 2)A
nonlinear effects and a wide range of diagnostics) statisticians for exaggerating results. So, for getting a safe result, you should use matrix in which all of the diagonal elements are 1 and all off diagonal elements are 0.
valid but exceptional observation that is explainable by an extraordinary situation:
Logistic Regression Decision Process: Stage 1: Objectives of Logistic Regression 1) p<0.025 as suggested from some articles. You want to reject this null hypothesis. **should pass KMO & Bartlett bf doing FA
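A short sketch of the KMO and Bartlett checks, assuming the third-party factor_analyzer package and illustrative item data (the item names are hypothetical):

import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

rng = np.random.default_rng(4)
latent = rng.normal(size=300)
items = pd.DataFrame({f"item{i}": latent + rng.normal(scale=0.8, size=300) for i in range(1, 7)})

chi2, p_value = calculate_bartlett_sphericity(items)   # want p < .05 (reject the identity-matrix hypothesis)
kmo_per_item, kmo_overall = calculate_kmo(items)       # want overall KMO close to 1
print(f"Bartlett chi2 = {chi2:.1f}, p = {p_value:.4f}, overall KMO = {kmo_overall:.3f}")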
remedy by deletion of the case unless variables reflecting the extraordinary situation
Explanation: Identifying the independent variables that impact group membership in Hit-Ratio>62.5% (@ percentage correct). It is calculated as the number of objects in Communalities – This is the proportion of each variable’s variance that can be
are included in the regression equation 3)An exceptional observation with no likely
the dependent variable; Estimating the importance of each independent variable in the diagonal of the classification matrix divided by the total number of objects. Also explained by the factors t is also noted as h2 and can be defined as the sum of
explanation: presents a special problem bc there is no reason for deleting the case,
explaining group membership 2) Classification: Establishing a classification system known as the percentage correctly classified. squared factor loadings for the variables.Initial Eigenvalues – Eigenvalues are the
but its inclusion cannot be justified either, suggesting analyses w/ and w/out the
based on the logistic model for determining group membership Stage 2: Research SAMPLE INTERPRETATION variances of the factors. Total – This column contains the eigenvalues. The first
observations to make a complete assessment, and 4) An ordinary observation in its
Design for Logistic Regression factor will always account for the most variance. % of Variance – This column
individual characteristics but exceptional in its combination of characteristics: Chi-square and its significance level (sig.) identify the significance of overall model. In
1)Representation of the Binary Dependent Variable (Represents discrete events or contains the percent of total variance accounted for by each factor.Extraction Sums
indicates modifications to the conceptual basis of the regression model and should this case, χ2(df=4) = 94.245, p < .001. The model explained a significant amount of
objects in only two classes -dependent variable can only take on two values (1 or 0), of Squared Loadings – The number of rows in this panel of the table correspond to
be retained. Stage 5: Interpreting the Regression Variate the variance in dependent variables (52.1% of Nagelkerke R2) and correctly classify
need a relationship form that is non-linear so that the predicted values will never be the number of factors retained.Rotation Sums of Squared Loadings – The values in
Multicollinearity: relationship btw two (collinearity) and more (multicollinearity) 78.5% of all cases. Competence Upgrading has significant impact on Technology
below 0 or above 1) 2) Sample Size (400 to achieve best results with maximum this panel of the table represent the distribution of the variance after the varimax
independent variables. Occurs when any single independent variable is Innovation (beta=0.199, Wald=9.497 with p<0.001) Behavior Innovation also has
likelihood estimation, smaller samples sizes less efficient in model estimation. MLE rotation. Varimax rotation tries to maximize the variance of each of the factors.
highly correlated w/ a set of other independent variables. Measure of correlation significant impact on Technology Innovation (beta=0.197, Wald=29.036 with p<0.001
requires larger samples such that, all things being equal, logistic regression will Factor Matrix – This table contains the unrotated factor loadings, which are the
incorporating multicollinearity: 1) Bivariate/zero-order correlation: association btw
require a larger sample size than multiple regression. Overall optimal: 400 minimal: correlations between the variable and the factor. Because these are correlations,
two variables, not accounting for the variation shared w/ any other variables 2) semi-
at least 10 per group) 3) Use of Aggregated Data: (logistic regression can analyze possible values range from -1 to +1. Logistic Regression Overall Percentage – This
partial/ part correlation: unique predictive effect 3) partial correlation: incremental
aggregated data if all the independent variables are nonmetric- e.g. binary variables gives the percent of cases for which the dependent variables was correctly predicted
predictive effect. Multicollinearity Identification: 1) Variance Inflation Factor (VIF):
would result in 8 combinations (2 x 2 x 2) Stage 3: Assumptions of Logistic given the model. B – This is the coefficient for the constant (also called the
measures how much the variance of the regression coefficients is inflated by
Regression Advantage of logistic regression: the result of the general lack of “intercept”) in the null model.
multicollinearity problems. If VIF equals 1, there is no correlation between the
independent measures. A VIF between 1 and 5 is an indication of some association
between predictor variables, but generally not enough to cause problems. A
requires Heteroscedasticity of the IND variables as does multiple regression Primary Wald and Sig. – This is the Wald chi-square test that tests the null hypothesis that
between predictor variables, but generally not enough to cause problems. A
assumption is independence of observation. Assumption should be addressed with the constant equals 0. hypothesis is rejected because the p-value (listed in the
maximum acceptable VIF value would be 5; anything higher would indicate a CHAPTER 4 – Cluster Analysis Group objects based on characteristic they possess:
Box-Tidwell test - Stage 4: Estimation of the Logistic Regression Model and Assessing column called “Sig.”) is smaller than the critical p-value of .05 (or .01). Hence, we
problem with multicollinearity. 2)Tolerance the amount of variance in an obj in the same cluster are similar w/in a cluster but different form obj in others.
Overall Fit conclude that the constant is not 0. Usually, this finding is not of interest to
independent variable that is not explained by the other independent variables. If the Descriptive, a theoretical, non-inferential. Cluster will always create cluster
other variables explain a lot of the variance of a particular independent variable we **logistic curve (S-shaped) within range of 0 and 1 researchers.
regardless any structure. The solution not generalizable (depends on cluster variate).
have a problem with multicollinearity. Thus, small values for tolerance indicate 1)Transforming a probability into odds and logit values- logit values are at 1 and 0. 2) df – This is the degrees of freedom for the Wald chi-square test.
Cluster Variate: set of clustering var used to measure similarity. Concept: reduce
problems of multicollinearity. The minimum cutoff value for tolerance is typically Model estimation using a maximum likelihood approach (Maximizes the likelihood Exp(B) – This is the exponentiation of the B coefficient, which is an odds ratio. This
population to smaller no of homogeneous group. •It has been referred to Q analysis,
0.20. That is, the tolerance value must be smaller than 0.20 to indicate a problem of that an event will occur) .Comparison of likelihood values: Estimate a Null Model, value is given by default because odds ratios can be easier to interpret than the
typology construct, numerical taxonomy. ROT :Cluster analysis is used for
multicollinearity. Estimate Proposed Model, Assess 2LL Difference(=-2loglikelihood; the minimum coefficient, which is in log-odds units. Score and Sig. – This is a Score test that is used
•Taxonomy des: identify natural group w/n data• Data simplification: analyze group
Residual Plots 1) Histogram of standardized residuals: enables you to determine if value for -2LL is 0 which corresponds to a perfect fit where likelihood=1 thus the to predict whether or not an independent variable would be significant in the model.
of similar observation instead of all individual. •Relationship identification: simplified
the errors are normally distributed. 2) Normal probability plot enables you to lower -2LL value, the better the fit of the model ) . Measures of Model Fit: Global Null Why use multivariate analysis? multivariate analysis is a tool to find patterns and
structure from cluster analysis portrays relationships not revealed otherwise.
determine if the errors are normally distributed. It compares the observed (sample) Hypothesis Test(reflects that at least one of the estimated coefficient is relationships between several variables simultaneously. It lets us predict the effect a
Conceptual and practical consideration: •only var that relate specifically to objective
standardized residuals against the expected standardized residuals from a normal significant ,which is similar to F test in multiple regression), Pseudo R2 Measures change in one variable will have on other variables. ... This gives multivariate analysis
are included irrelevant var cannot be excluded from analysis once it begins. •Var are
distribution. 3) Scatter Plot of residuals: can be used to test regression assumptions. (Interpreted in a manner similar to the coefficient of determination in multiple a decisive advantage over other forms of analysis. 2.Why is knowledge of
selected which characterize the individual being clustered •The essence of
It compares the standardized predicted values of the dependent variable against the regression; Different pseudo R2 measures vary widely in terms of magnitude and no measurement scales important in using multivariate analysis? Since the type of
approaches is the classification of data as suggested by natural grouping . 3 Basic
standardized residuals from the regression equation. If the plot exhibits a random one version has been deemed most preferred; ranges from 0.0 to 1.0). Issues in measurement should be defined by the researcher (even it will be
Question: 1. How do we measure similarity? Method of comparing observation on
pattern, then this indicates no identifiable violations of the assumptions underlying Model Estimation: Small sample sizes- Difficult to accurately estimate coefficients nonmetric/qualitative or metric/quantitative) for each variable, in MV analysis
the clustering var. Var w/n same cluster = stronger relation than between cluster.
regression analysis. and standard errors; complete separation = Dependent variable is perfectly defining data has great effect what the data present. For computers, the values are
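The three residual plots just described can also be produced outside SPSS; a sketch assuming statsmodels and matplotlib with simulated data (the second panel uses a Q-Q version of the normal probability plot):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=200)
fit = sm.OLS(y, sm.add_constant(X)).fit()

std_resid = fit.get_influence().resid_studentized_internal   # standardized residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].hist(std_resid, bins=20)                              # 1) histogram of standardized residuals
sm.qqplot(std_resid, line="45", ax=axes[1])                   # 2) normality plot against the diagonal
axes[2].scatter(fit.fittedvalues, std_resid, s=10)            # 3) residuals vs. predicted values
axes[2].axhline(0, color="grey")
plt.tight_layout()
plt.show()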
Correlation between objects, measure of their proximity in two-dimensional space
Stage6: Validation of the result: ensure that model represent general population predicted by an independent variable ; quasi-complete separation (zero cells effect) = just numbers; in nominal scales, 1 can be assigned to male, and 0 to female. But the
such that the distance between observation indicates similarity. 2.How do we form
(generalizability) and appropriate for the situation in which it will be used One or more of the groups defined by the nonmetric independent variable have
cluster? Group observation that are most similar into cluster.3. How many groups do
(transferability) 1)Additional or split sample: comparing results to ensure counts of zero ; when there is no controlled group nonmetric data (ordinal and nominal scales) used as independent variable in most
we form? Depends but few cluster> less homogeneity w/n cluster, larger
comparability of results across different samples 2)calculating PRESS statistics: ROT7-1::Logistic Regression: 2) Sample size are primarily focused on the size of each MV techniques. Knowledge of measurement scales important in order not to use
cluster>more homogeneity. Criticism: 1.subjectivity in selecting final decision: no
employ jackknife procedure to calculate measure of predictive fit and is used to group, which should have 10 times the number of estimated model coefficients 4) wrong scale for the variables. As exemplified above, 0 and 1 are just nonmetric data
optimal solution final decision by researcher. 2.Judgment required of the researcher
calculate P2 (coefficient of prediction) 3)comparing regression model: R2 will increase Model significance tests are made with a chi-square test on the differences in the log to present gender; not having any numerical meaning like female are less than male,
in selecting characteristic to be used, the method of combining cluster,
as variables added 4) forecasting: ensure comparability of new data to data set used likelihood values (-2LL) between two models 5) Coefficients are expressed in two or we cannot calculate any mean value 3.What is the difference between
interpretation of cluster solution makes the final solution unique to that researcher.
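For the hierarchical vs. non-hierarchical procedures covered in this chapter, a minimal sketch assuming scipy and scikit-learn, with illustrative two-variable data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=0, size=(50, 2)), rng.normal(loc=4, size=(50, 2))])  # two latent groups

# Hierarchical (agglomerative), Ward's method; alternatives: "single", "complete", "average", "centroid"
Z = linkage(X, method="ward")
labels_h = fcluster(Z, t=2, criterion="maxclust")     # cut the tree at 2 clusters

# Non-hierarchical: k-means with a specified number of clusters (seeds chosen by the algorithm)
labels_k = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(np.bincount(labels_h)[1:], np.bincount(labels_k))   # cluster sizes from each approach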
in estimation forms: original and exponentiated to assist in interpretation 6) Interpretation of the component analysis and common factor analysis? In factor analysis, the original
Stage1: Objective: •partition set of obj into 2 or more groups based on similarity. Key issue:
VOCAB: Beta coefficient Standardized regression coefficient that allows for a direct coefficients for direction and magnitude is as follows: • Direction can be directly variables are defined as linear combinations of the factors the goal in factor analysis
research question being address & var used to characterized obj in cluster process.
comparison between coefficients as to their relative explanatory power of the assessed in the original coefficients (positive or negative signs) or indirectly in the is to explain the covariance’s or correlations between the variables. Use principal
Selecting cluster Var: cluster var represent the sole means of measuring similarity
dependent variable. Degrees of freedom (df) Value calculated from the total number exponentiated coefficients (less than 1 are negative, greater than 1 are positive • components analysis to reduce the data into a smaller number of components.
among obj. 2 issues in Var selection 1. Conceptual consideration: Var characterize the
of observations minus the number of estimated parameters Partial F (or t) values Magnitude is best assessed by the exponentiated coefficient, with the percentage 6.When should multiple regression be used? It is used when we want to predict the
value of a variable based on the value of two or more other variables. The variable
we want to predict is called the dependent variable (or sometimes, the outcome,
target or criterion variable). 7.When should a linear regression be used? Simple
linear regression is appropriate when the following conditions are satisfied. The
dependent variable Y has a linear relationship to the independent variable X. To
check this, make sure that the XY scatterplot is linear and that the residual plot
shows a random pattern. 8.How do you use regression coefficients? when the
regression line is linear (y = ax + b) the regression coefficient is the constant (a) that
represents the rate of change of one variable (y) as a function of changes in the other
(x); it is the slope of the regression line. 9.Should we use factors scores or
summated ratings in follow up analysis? It’s hard to answer without context.
Usually a summated rating is used when the items are designed as a summated scale, but
it is very rare. Usually for latent constructs, in my field, they use factor score.
However, in some fields, like economics, sometimes I see summated rating. I don’t
really know why they use it. Maybe it is a way to put weight on the items that they
think are important. Summated rating is when the score of all items add up into a
score of the construct, right? So if the items are measured on different scales. For
example an item is 1 to 5, another is 1 to 9. When they add up, the item with 1 to 9
scale will have higher weight
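To make the contrast concrete, a small sketch assuming the third-party factor_analyzer package; the items, scales, and construct are hypothetical:

import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(7)
latent = rng.normal(size=300)
items = pd.DataFrame({
    "item_1to5": np.clip(np.round(3 + latent), 1, 5),         # items measured on different scales
    "item_1to7": np.clip(np.round(4 + 1.5 * latent), 1, 7),
    "item_1to9": np.clip(np.round(5 + 2.0 * latent), 1, 9),
})

# Summated rating: mean (or sum) of the items -- standardize first so wider scales do not dominate
z_items = (items - items.mean()) / items.std()
summated = z_items.mean(axis=1)

# Factor score: weights each item by its loading on the single extracted factor
fa = FactorAnalyzer(n_factors=1, rotation=None).fit(items)
factor_score = fa.transform(items)[:, 0]

print(round(np.corrcoef(summated, factor_score)[0, 1], 3))    # typically very highly correlated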
Three converging trends:
1) Rise of big data: volume, variety, velocity (faster), veracity (accuracy), variability: impact on organizational
decisions and analysts. 2) Statistical models: structured (small to large dataset, theory based):
explanation of what is happening now, specific model, inference through statistical tests vs
data mining (unstructured, very large dataset): 'predictive accuracy' based on analysis of the
past, algorithms (neural network, decision tree, support vector machine) 3) Causal
inference: strong statement of cause-effect in nonexperimental situations / drawing
conclusions about causal connections based on conditions of occurrence.