MVDA - QUESTION BANK (1)
Conjoint analysis:
https://conjointly.com/guides/what-is-conjoint-analysis/
UNIT – 2
DATA CLEANING AND MULTIVARIATE TECHNIQUES
Short answers questions
1. Define untidy and tidy data?
2. Define outliers?
3. Describe the different graphical examinations of data.
4. Define dummy variables?
5. Write short notes about winsorization?
6. List the assumptions of multivariate analysis.
7. Define metric and nonmetric data?
8. List the graphs for good visualization.
9. Define data cleaning?
10. Define imputation of data?
Step-4: Among these k neighbors, count the number of data points in
each category.
Step-5: Assign the new data point to the category that has the largest
number of neighbors.
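The counting and assignment steps above can be sketched in R as follows. This is a minimal illustration, not a full classifier: the toy data, the variable names, the choice of k = 3, and the use of Euclidean distance to find the neighbors are all assumptions made for the example.

```r
# Hypothetical toy training data: two numeric features and a class label.
train <- data.frame(
  x = c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0),
  y = c(1.0, 2.0, 1.5, 6.0, 7.0, 6.5),
  label = c("A", "A", "A", "B", "B", "B")
)
new_point <- c(x = 2.5, y = 2.0)  # the point to classify (assumed)
k <- 3                            # number of neighbors (assumed)

# Euclidean distance from the new point to every training point.
d <- sqrt((train$x - new_point["x"])^2 + (train$y - new_point["y"])^2)

# Indices of the k nearest neighbors.
nearest <- order(d)[seq_len(k)]

# Step-4: count the neighbors that fall in each category.
votes <- table(train$label[nearest])

# Step-5: assign the category with the largest count.
predicted <- names(votes)[which.max(votes)]
print(votes)
print(predicted)
```

Note that which.max() returns the first maximum, so ties are broken by category order; choosing an odd k for two classes avoids ties.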
Histogram
Histograms are helpful when you want to better understand what values
you have in your dataset for a single set of numbers. For example, if you
had a dataset with information about many people, you may want to know
how tall the people in your dataset are. To quickly visualize this, you could
use a histogram.
Density plot
Density plots are smoothed versions of histograms that visualize the
distribution of a continuous variable. These plots show the shape of the
distribution effectively and, unlike histograms, are not sensitive to the
number of bins chosen for visualization.
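A short R sketch of the difference between the two plots, using simulated heights (the data, seed, and bin counts are assumptions for illustration). The calls use plot = FALSE so that only the computed objects are returned; calling plot() on any of them draws the figure.

```r
set.seed(1)
# Hypothetical heights in cm for 200 people.
heights <- rnorm(200, mean = 170, sd = 10)

# The same data binned two ways: the histogram changes with the bin count.
# 'breaks' is treated by hist() as a suggestion for the number of bins.
h_coarse <- hist(heights, breaks = 5, plot = FALSE)
h_fine <- hist(heights, breaks = 40, plot = FALSE)
cat("Bins (coarse):", length(h_coarse$counts), "\n")
cat("Bins (fine):  ", length(h_fine$counts), "\n")

# A kernel density estimate has no bins. It does have its own smoothing
# parameter (the bandwidth, 'bw'), chosen automatically by default.
dens <- density(heights)
cat("Bandwidth used:", dens$bw, "\n")

# To draw: plot(h_coarse); plot(h_fine); plot(dens)
```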
Scatterplot
Scatterplots are helpful when you have numerical values for two
different pieces of information and you want to understand the relationship
between those pieces of information. Here, each dot represents a different
person in the dataset. The dot’s position on the graph represents that
individual’s height and weight. In this dataset, we can see that, in
general, the more someone weighs, the taller they are. Scatterplots
therefore help us understand, at a glance, the relationship between two
sets of numbers.
Boxplot
Line plot
Line plots are most effective at showing a quantitative trend over time.
Barplot
When you have a single categorical variable that you want broken down
and quantified by category, a barplot is ideal. For example, if you wanted
to look at how many females and how many males you have in your dataset,
you could use a barplot. The difference in height between the bars clearly
shows that there are more females than males in this dataset.
Grouped Barplot
Stacked Barplot
We solve this missing value problem by replacing the NaN values with
the mean, median, or mode.
Mean - It is preferred if the data is numeric and not skewed.
Median - It is preferred if the data is numeric and skewed.
Mode - It is preferred if the data is a string (object) or numeric.
Mean: It is the average value.
Here, we calculated the mean (average) of all the observed values, got 51,
and used it to replace the missing values.
Median: It is the center value.
Here, we calculated the median (center value) of all the observed values,
got 58, and used it to replace the missing values.
Mode: It is the most common value.
Here, we calculated the mode (most frequently occurring value) of all the
observed values, got 67, and used it to replace the missing values.
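The three replacement strategies can be sketched in R as below. The vector scores is hypothetical and is not the dataset behind the figures quoted above; stat_mode() and impute() are helper names invented for this example, since base R has no built-in function for the statistical mode.

```r
# Hypothetical numeric data with two missing values.
scores <- c(45, 67, NA, 58, 67, 30, NA, 62)

# Mean and median of the observed (non-missing) values.
mean_value <- mean(scores, na.rm = TRUE)
median_value <- median(scores, na.rm = TRUE)

# Helper (assumed name): most frequent observed value.
# Works for numeric and character vectors; ties go to the first value seen.
stat_mode <- function(v) {
  v <- v[!is.na(v)]
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}
mode_value <- stat_mode(scores)

# Helper (assumed name): replace every NA with the given value.
impute <- function(v, value) {
  v[is.na(v)] <- value
  v
}

mean_imputed <- impute(scores, mean_value)
median_imputed <- impute(scores, median_value)
mode_imputed <- impute(scores, mode_value)

print(mean_imputed)
print(median_imputed)
print(mode_imputed)
```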
Types
Missing Completely at Random (MCAR)
Missing Data Not at Random (MNAR)
Example:
The data set of N = 90 ordered observations shown below is examined for
outliers: 30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451,
453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550,
559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637,
638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794,
802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005,
1068, 1441
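One common way to examine these observations is the boxplot rule, which flags any value more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule in R. The notes do not say which method was intended, so the choice of rule and the 1.5 multiplier are assumptions; the quartiles come from R's default quantile() method, and other quartile conventions shift the fences slightly.

```r
# The 90 ordered observations from the example.
obs <- c(30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
         336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448,
         451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527,
         548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618,
         621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
         766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925,
         953, 991, 1000, 1005, 1068, 1441)

# Quartiles and interquartile range (R's default quantile method).
q1 <- unname(quantile(obs, 0.25))
q3 <- unname(quantile(obs, 0.75))
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed rule).
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers <- obs[obs < lower_fence | obs > upper_fence]
cat("Fences:", lower_fence, upper_fence, "\n")
cat("Flagged:", outliers, "\n")
```

Under these assumptions only the largest value, 1441, falls outside the fences; the smallest value, 30, stays inside the lower fence.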
Dummy Variable:
Syntax: ifelse(test, yes, no)
Parameters:
test: represents the test condition
yes: represents the value returned where the test condition is
satisfied
no: represents the value returned where the test condition is not
satisfied
Example code:
# Using the built-in PlantGrowth dataset (columns: weight, group)
pg <- PlantGrowth
# Print
cat("Original dataset:\n")
head(pg, 20)
# Create one dummy variable for each level of group (ctrl, trt1, trt2)
pg$group_ctrl <- ifelse(pg$group == "ctrl", 1, 0)
pg$group_trt1 <- ifelse(pg$group == "trt1", 1, 0)
pg$group_trt2 <- ifelse(pg$group == "trt2", 1, 0)
# Print
cat("After creating dummy variables:\n")
head(pg, 20)
Original dataframe: