
UNIT – 1

OVERVIEW OF MULTIVARIATE STATISTICS


Short answer questions
1. What is multivariate analysis?
2. List the multivariate techniques.
3. Define factor analysis.
4. How do you set up a multiple regression?
5. Define regression.
6. How do you set up a logistic regression?
7. Define canonical correlation.
8. Define PCA.
9. Define statistics.
Long answer questions
1. Elaborate the following:
I) PCA II) Factor analysis III) Conjoint analysis
PCA:
 Each principal component is a linear combination of the original
features.
 The components are orthogonal, i.e., the correlation between any pair
of components is zero.
 The importance of each component decreases from 1 to n: the 1st PC
has the most importance, and the nth PC has the least.
Factor analysis
 To reduce a large number of variables to a smaller number of factors
for modeling purposes, where the large number of variables precludes
modeling all the measures individually. As such, factor analysis is
integrated in structural equation modeling (SEM), helping create the
latent variables modeled by SEM.
 To select a subset of variables from a large set, based on which
original variables have the highest correlations with the principal
component factors.
 To create a set of factors to be treated as uncorrelated variables, as
one approach to handling multicollinearity in regression.

Conjoint analysis:

https://conjointly.com/guides/what-is-conjoint-analysis/

Conjoint analysis is a popular method of product and pricing research that
uncovers consumers’ preferences and uses that information to help:
 Select product features.
 Assess sensitivity to price.
 Forecast market shares.
 Predict adoption of new products or services.
2. Explain the design steps in PCA.

The steps to perform PCA are the following:
 Standardize the data.
 Compute the covariance matrix of the features from the dataset.
 Perform eigendecomposition on the covariance matrix.
 Order the eigenvectors in decreasing order based on the
magnitude of their corresponding eigenvalues.
 Determine k, the number of top principal components to select.
 Construct the projection matrix from the chosen number of top
principal components.
 Compute the new k-dimensional feature space.
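The steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the small two-feature dataset is invented for the example.

```python
import numpy as np

# Toy 2-feature dataset, invented for illustration.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7],
              [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Step 1: standardize the data (zero mean, unit variance per feature).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features.
cov = np.cov(Z, rowvar=False)

# Step 3: eigendecomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: order eigenvectors by decreasing eigenvalue magnitude.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5-6: choose k and build the projection matrix from the top-k
# eigenvectors.
k = 1
W = eigvecs[:, :k]

# Step 7: project into the new k-dimensional feature space.
X_pca = Z @ W
print(X_pca.shape)  # (10, 1)
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, which is why the explicit reordering step is needed.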
3. Explain the design steps in multivariate model building.
4. Elaborate the steps in factor analysis.
1. Collect data: choose relevant variables.
2. Extract initial factors (via principal components).
3. Choose the number of factors to retain.
4. Choose an estimation method; estimate the model.
5. Rotate and interpret.
6. (a) Decide on changes that need to be made (e.g., drop items, include
items); (b) repeat (4) and (5).
7. Construct scales and use them in further analysis.
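Step 2 above, initial factor extraction by the principal component method, can be sketched as follows. This is a hedged NumPy illustration (not a full factor-analysis routine): loadings are taken as eigenvectors of the correlation matrix scaled by the square roots of their eigenvalues, and the data are synthetic.

```python
import numpy as np

# Synthetic data: four variables driven by one common factor.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(100, 1)) for _ in range(4)])

# Correlation matrix and its eigendecomposition.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Retain factors with eigenvalue > 1 (the common Kaiser rule).
n_factors = int(np.sum(eigvals > 1))

# Initial loadings: eigenvectors scaled by sqrt of their eigenvalues.
loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
print(n_factors, loadings.shape)
```

With four strongly correlated variables, one factor dominates, so a single factor is retained here; real analyses would follow this with rotation (step 5).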
5. Explain in brief the multivariate techniques.

UNIT – 2
DATA CLEANING AND MULTIVARIATE TECHNIQUES
Short answer questions
1. Define untidy and tidy data.
2. Define outliers.
3. Write about different graphical examinations of data.
4. Define dummy variables.
5. Write short notes on winsorization.
6. List the assumptions of multivariate analysis.
7. Define metric and nonmetric data.
8. List the graphs for good visualization.
9. Define data cleaning.
10. Define imputation of data.

Long answer questions

1. Elaborate the steps in the KNN method.
The parameter k in KNN refers to the number of labeled points
(neighbors) considered for classification. The value of k indicates the
number of these points used to determine the result. Our task is to calculate
the distance and identify which categories are closest to our unknown entity.
Steps:

Step-1: Select the number K of neighbors.

Step-2: Calculate the Euclidean distance from the new point to each
labeled point.

Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.

Step-4: Among these K neighbors, count the number of data points in
each category.

Step-5: Assign the new data point to the category for which the number
of neighbors is maximum.

Step-6: Our model is ready.
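The steps above can be sketched as a minimal classifier in plain Python; the training points and labels below are toy values for illustration.

```python
from collections import Counter
import math

# A minimal KNN classifier following the steps above.
def knn_predict(train, labels, query, k=3):
    # Step 2: Euclidean distance from the query to every labeled point.
    dists = [math.dist(p, query) for p in train]
    # Step 3: take the K nearest neighbors.
    nearest = sorted(range(len(train)), key=lambda i: dists[i])[:k]
    # Step 4: count the data points in each category among the neighbors.
    votes = Counter(labels[i] for i in nearest)
    # Step 5: assign the category with the maximum neighbor count.
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters.
train = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train, labels, (2, 2), k=3))  # A
print(knn_predict(train, labels, (6, 5), k=3))  # B
```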

2. Define visualization. Explain in brief any three graphs.

The term ‘data visualization’ refers to any visual display of data that
helps us understand the underlying data better. This can be a plot or figure of
some sort or a table that summarizes the data. Generally, there are a few
characteristics of all good plots:
 Clearly labeled axes.
 Text that is large enough to see.
 Axes that are not misleading.
 Data that are displayed appropriately, considering the type of data you
have.

Histogram
Histograms are helpful when you want to better understand what values
you have in your dataset for a single set of numbers. For example, if you
had a dataset with information about many people, you may want to know
how tall the people in your dataset are. To quickly visualize this, you could
use a histogram.
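The counts behind such a histogram can be computed directly. This is a small sketch using `numpy.histogram`, with invented height values:

```python
import numpy as np

# Invented height values (in cm) for illustration.
heights = [150, 152, 155, 160, 161, 163, 165, 165, 168,
           170, 171, 174, 175, 178, 180, 182, 185, 190]

# A histogram is just these binned counts drawn as bars.
counts, edges = np.histogram(heights, bins=4)
print(counts)  # number of people falling in each bin
print(edges)   # the bin boundaries
```

Plotting libraries perform exactly this binning before drawing the bars, which is why the choice of bin count can change how the distribution looks.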
Density plot
Density plots are smoothed versions of histograms, visualizing the
distribution of a continuous variable. These plots effectively visualize the
shape of the distribution and, unlike histograms, are not sensitive to the
number of bins chosen for visualization.

Scatterplot

Scatterplots are helpful when you have numerical values for two
different pieces of information and you want to understand the relationship
between those pieces of information. Here, each dot represents a different
person in the dataset. The dot’s position on the graph represents that
individual’s height and weight. Overall, in this dataset, we can see that, in
general, the more someone weighs, the taller they are. Scatterplots,
therefore, help us better understand, at a glance, the relationship between
two sets of numbers.
Boxplot

Boxplots also summarize numerical values across a category; however,
instead of just comparing the heights of the bars, they give us an idea of the
range of values that each category can take. For example, if we wanted to
compare the heights of men to the heights of women, we could do that with
a boxplot.
Line Plots

The basic plots we’ll discuss here are line plots. Line plots are most
effective at showing a quantitative trend over time.

Barplot

When you only have a single categorical variable that you want
broken down and quantified by category, a barplot will be ideal. For
example, if you wanted to look at how many females and how many males
you have in your dataset, you could use a barplot. The comparison in heights
between bars clearly demonstrates that there are more females in this dataset
than males.
Grouped Barplot

Grouped barplots, like simple barplots, demonstrate the counts for a
group; however, they break this down by an additional categorical variable.
For example, here we see the number of individuals within each category
along the x-axis, but these data are further broken down by gender (an
additional categorical variable). Comparisons between bars that are side-by-
side are made most easily by our visual system. So, it’s important to ensure
that the bars you want viewers to be able to compare most easily are next to
one another in this plot type.

Stacked Barplot

Another common variation on barplots is the stacked barplot. Stacked
barplots take the information from a grouped barplot but stack the bars on
top of one another. This is most helpful when the bars add up to 100%, such
as in a survey response where you’re measuring the percent of respondents
within each category. Otherwise, it can be hard to compare the groups
within each bar.
3. Explain imputation of missing data with the central tendency method.
Missing data, or missing values, occur when you don’t have data stored for
certain variables or participants. Data can go missing due to incomplete data
entry, equipment malfunctions, lost files, and many other reasons.

Missing data cause problems when a machine learning model is applied to
the dataset: most machine learning models cannot process data with
missing values. Common reasons for missing data include:
 Respondents hesitate to put down the information.
 Survey information is not that valid.
 Hesitation in sharing the information.
 People may have died (recorded as NaN).

Mean/ Median /Mode imputation

We solve this missing-value problem by replacing the NaN values with
the mean, median, or mode:
Mean - preferred if the data is numeric and not skewed.
Median - preferred if the data is numeric and skewed.
Mode - preferred if the data is a string (object) or numeric.
Mean: It is the average value.

In this, we calculated the mean(average) of all the observed data and got 51
as the mean and replaced it in place of missing values.

Median: It is the midpoint value.

In this, we calculated the median (center value) of all the present values and
got 58 as the median, and replaced it in place of missing data.
Mode: It is the most common value.

In this, we calculated the mode (most frequently occurring value) of all the
present values and got 67 as mode, and replaced it in place of missing
values.
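As a sketch, the three replacements can be computed with Python's `statistics` module. The dataset behind the 51/58/67 figures in the text is not shown here, so invented toy values (with `None` marking missing entries) are used instead:

```python
import statistics

# Toy data, invented for illustration; None marks a missing value.
data = [45, 58, 67, None, 67, 40, None, 72, 61]
observed = [x for x in data if x is not None]

mean_val = statistics.mean(observed)      # average of observed values
median_val = statistics.median(observed)  # middle of sorted observed values
mode_val = statistics.mode(observed)      # most frequent observed value

# Replace each missing value with, e.g., the median.
imputed = [median_val if x is None else x for x in data]
print(mean_val, median_val, mode_val)
print(imputed)
```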

Types
 Missing Completely at Random (MCAR)
 Missing at Random (MAR)
 Missing Not at Random (MNAR)

4. Define outlier. Explain how to identify outliers with a boxplot graph.

Outlier
An outlier is an observation that lies an abnormal distance from other values
in a random sample from a population.
Interquartile Range (IQR)

Example:
The data set of N = 90 ordered observations shown below is examined for
outliers: 30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451,
453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550,
559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637,
638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794,
802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005,
1068, 1441

 Median = (N+1)/2 largest data point = the average of the 45th
and 46th ordered points = (559 + 560)/2 = 559.5
 Lower quartile = .25(N+1)th ordered point = 22.75th ordered
point = 411 + .75(436-411) = 429.75
 Upper quartile = .75(N+1)th ordered point = 68.25th ordered
point = 739 +.25(752-739) = 742.25
 Interquartile range = 742.25 - 429.75 = 312.5
 Lower inner fence = 429.75 - 1.5 (312.5) = -39.0
 Upper inner fence = 742.25 + 1.5 (312.5) = 1211.0
 Lower outer fence = 429.75 - 3.0 (312.5) = -507.75
 Upper outer fence = 742.25 + 3.0 (312.5) = 1679.75
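The fence computation above can be reproduced in Python using the same (N+1)-point interpolation as the worked example:

```python
# The 90 ordered observations from the worked example above.
data = sorted([
    30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
    336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448,
    451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527,
    548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618,
    621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
    766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925,
    953, 991, 1000, 1005, 1068, 1441])

def quantile_n_plus_1(x, p):
    """The p*(N+1)-th ordered point, interpolating between neighbors."""
    h = p * (len(x) + 1)
    k = int(h)          # whole part: the k-th ordered point (1-based)
    frac = h - k        # fractional part used for interpolation
    return x[k - 1] + frac * (x[k] - x[k - 1])

q1 = quantile_n_plus_1(data, 0.25)   # 429.75
q3 = quantile_n_plus_1(data, 0.75)   # 742.25
iqr = q3 - q1                        # 312.5
lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3.0 * iqr, q3 + 3.0 * iqr

# Points beyond the inner fences are flagged as outliers on a boxplot.
outliers = [v for v in data if v < lower_inner or v > upper_inner]
print(outliers)  # [1441]
```

Note that library quantile functions (e.g., `numpy.quantile`) default to different interpolation rules, so their Q1/Q3 may differ slightly from the (N+1) method used here.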
5. Explain incorporating nonmetric data with dummy variables, with
functions in R.

Dummy Variable:

A dummy variable is a numerical variable used in regression analysis


to represent subgroups of the sample in your study. In research design, a
dummy variable is often used to distinguish different treatment groups. In
the simplest case, we would use a 0,1 dummy variable where a person is
given a value of 0 if they are in the control group or a 1 if they are in the
treated group. Dummy variables are useful because they enable us to use a
single regression equation to represent multiple groups.
Functions in R:
Dummy variables can be created with the ifelse() function or with the
dummy_cols() function (from the fastDummies package).
Using the ifelse() function
The ifelse() function performs a test and, based on the result of the test,
returns the true value or the false value as provided in the parameters of the
function. Using this function, a dummy variable can be created accordingly.
Syntax:
ifelse(test, yes, no)

Parameters:
test: the test condition
yes: the value returned if the test condition is satisfied
no: the value returned if the test condition is not satisfied

Example code:
# Using the built-in PlantGrowth dataset (columns: weight, group)
pg <- PlantGrowth
# Print
cat("Original dataset:\n")
head(pg, 20)
# Create dummy variables from the 'group' factor
pg$group_trt1 <- ifelse(pg$group == "trt1", 1, 0)
pg$group_trt2 <- ifelse(pg$group == "trt2", 1, 0)
# Print
cat("After creating dummy variables:\n")
head(pg, 20)
