
Questions

March 28, 2024

1 Short questions

1. State Mahalanobis distance and its applications.


A. The Mahalanobis distance is a measure of the distance between a point P and a distribution N(x | µ, Σ).
Formula
DM(P, X) = √((P − µ)^T Σ^{-1} (P − µ))
Where:
• DM(P, X) is the Mahalanobis distance of point P from the multivariate distribution X.
• P is the vector representing the point.
• µ is the mean vector of the distribution X.
• Σ is the covariance matrix of the distribution.
Usage (a computational sketch follows):
• Outlier detection: points that lie far from the distribution in Mahalanobis distance are flagged as anomalies.
• Classification: assigning a point to the class whose distribution it is closest to, as in discriminant analysis.
• Cluster analysis: a scale-aware alternative to Euclidean distance when features are correlated.
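As an illustrative sketch (the 2-D sample data and the query point below are made up), the distance can be computed with NumPy and SciPy's scipy.spatial.distance.mahalanobis, which takes the inverse covariance matrix:

import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # sample from a 2-D distribution (made up)
mu = X.mean(axis=0)                  # mean vector of the distribution
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))  # Sigma^{-1}

p = np.array([2.0, -1.5])            # query point P (made up)
print(mahalanobis(p, mu, cov_inv))   # sqrt((P - mu)^T Sigma^{-1} (P - mu))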
2. Define vector space.
A. A vector space, denoted by V, is a set of vectors (v1, v2, ...) with the following essential closure properties:
• Addition: for any two vectors vi, vj ∈ V, i ≠ j, the sum vk = vi + vj is also in V.
• Scalar multiplication: for any vector vi ∈ V and α ∈ R, vk = α · vi is in V.
• Null vector: V contains the zero vector, 0 ∈ V.
3. Mention qualitative and quantitative attributes in an engineering college dataset.
A. Qualitative Attributes:
(a) Qualification of Faculty (e.g., PhD, Masters, Bachelors)
(b) Department (e.g., Computer Science, Electrical Engineering)
Quantitative Attributes:
(a) Student GPA (Grade Point Average)
(b) Faculty Experience (in years)
4. Define the cost function of the linear regression model. How can you optimize it?
A. The linear regression model is represented as
Ỹ = X · θ (1)
where θ represents the model parameters and X, Y, Ỹ represent the independent, actual dependent, and predicted dependent variables, respectively. The deviation between the actual and the model-predicted value is given by
ϵ = (Y − Ỹ) (2)
The squared error, J(X, Y, θ) = ϵ^T · ϵ, is called the cost function (sometimes also called the error function). This cost function represents the sum of squared errors between each actual value of the dependent variable and the model-predicted value, over every point of the training data set. In the ordinary least-squares fit of the linear model we minimize J(X, Y, θ); since the cost function is convex, an explicit solution is obtained by equating the gradient to zero, ∇θ J(θ, X, Y) = 0. This is the same as minimizing
J(θ) = ϵ^T · ϵ = (Y − Ỹ)^T · (Y − Ỹ) (3)
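A minimal sketch of this optimization in Python, on synthetic data (the coefficients and noise level are made up), solves the normal equation that results from setting the gradient to zero:

import numpy as np

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # bias column + one feature
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 1, 50)         # noisy synthetic targets

theta = np.linalg.solve(X.T @ X, X.T @ y)  # gradient = 0  =>  theta = (X^T X)^{-1} X^T y
eps = y - X @ theta                        # residuals epsilon = Y - Ỹ
print(theta, eps @ eps)                    # fitted parameters and cost J = eps^T eps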

5. State the differences between a Random Forest and a Decision Tree.
A. Comparison between a single Decision Tree and a Random Forest:
• Training Data: a single tree uses the entire dataset without replacement; in a random forest, each tree is trained on a bootstrap sample.
• Feature Selection: a single tree considers all features during training; a random forest randomly selects a subset of features for each tree.
• Data Subset: a single tree is trained on the entire dataset; random forest trees are trained on different subsets of the data due to bootstrapping.
• Prediction: a single tree is prone to overfitting due to noise in the data; a random forest is less prone to overfitting.
• Diversity: a single tree has limited diversity because it sees all features and the full dataset; a random forest gains diversity among trees from the subsets of features and data.
• Generalization: a single tree may not generalize well to new, unseen data; a random forest tends to generalize better.
• Model Complexity: a single tree can become complex depending on the data; a random forest is an ensemble of simpler trees that collectively form a complex model.
• Robustness: a single tree is sensitive to variations in the training data; a random forest is more robust to variations due to ensemble averaging and feature randomness.
6. List out any six ordinal-type attributes to be considered in a college dataset.
A.
• Academic Rank (e.g., Excellent, Very Good, Good, Fair)
• Course Grade (e.g., A, B, C, D, F)
• Satisfaction Level (e.g., Very Satisfied, Satisfied, Neutral, Dissatisfied, Very Dissatisfied)
• Student Performance (e.g., Excellent, Good, Average, Below Average, Poor)
• Course Difficulty Level (e.g., Very Easy, Easy, Moderate, Difficult, Very Difficult)
• Likelihood of Recommending the College (e.g., Definitely Recommend, Likely to Recommend, Neutral, Unlikely to Recommend, Definitely Not Recommend)
7. How do you fill missing values in the dataset?
A. Two key aspects of data preprocessing are handling missing values and cleaning noise:
• Missing Value Handling: missing values can arise due to reasons like data entry errors, sensor malfunctions, or intentional data omissions. Techniques for handling missing values include:
– Imputation: replace missing values with a suitable estimate, such as the mean, median, or mode of the column (see the sketch below).
– Advanced methods: utilize advanced techniques like K-nearest neighbors (KNN) imputation or matrix completion algorithms for more accurate estimation.
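A minimal pandas sketch of mean and mode imputation (the column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({
    "age": [18, 20, None, 22, 19],
    "grade": ["A", "B", "C", None, "A"],
})
df["age"] = df["age"].fillna(df["age"].mean())           # numeric: mean imputation
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])  # categorical: mode imputation
print(df)
# For the advanced route, sklearn.impute.KNNImputer performs KNN imputation.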
8. Why do lazy learners take more time to predict the solution?
A. Lazy learners, also known as lazy learning algorithms, are a type of machine learning algorithm that delays generalization until a new instance must be classified or a prediction made. These algorithms store the training instances themselves and do not attempt to construct a general internal model. When a prediction is required for a new instance, the lazy learner retrieves the most similar instances from the training data and uses them to make a prediction. There are a few reasons why lazy learners may take more time to predict solutions compared to eager learners (those that build a model during training):
(a) Computational Overhead: lazy learners typically incur a computational overhead at prediction time because they need to compare the new instance with all instances in the training set to identify the most similar ones. This can be particularly time-consuming for large datasets or high-dimensional data.
(b) No Pre-Computation: eager learners pre-compute a model during training, which can make prediction faster since they don't need to refer back to the original training data. In contrast, lazy learners don't pre-compute any model and rely on the training instances themselves, leading to potentially longer prediction times.
(c) Dynamic Nature: lazy learners adapt to changes in the training data dynamically, as they do not construct a fixed model during training. While this can be advantageous in scenarios where the data distribution is constantly changing, it can also lead to longer prediction times, as the algorithm may need to re-evaluate a larger portion of the training data for each new instance.
9. What could the value of k be for a k-NN classifier with 'n' samples?
A. Odd values for binary classification: in binary classification it is often recommended to use odd values of k to avoid ties when voting, for example odd k anywhere from 1 up to n/2 (for even n) or (n − 1)/2 (for odd n); a common rule of thumb is k ≈ √n, rounded to an odd number. Smaller values of k tend to result in more complex models with low bias but high variance, while larger values of k lead to simpler models with higher bias but lower variance. Choose k based on the balance between bias and variance that is suitable for your dataset. Larger values of k also require more computational resources for prediction, as they involve calculating distances to more neighbors; consider the computational cost and efficiency when choosing the value of k, especially for large datasets. Overall, there is no one-size-fits-all answer for the appropriate value of k in k-NN; a sketch of a practical selection procedure follows.
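One practical way to pick k, sketched here with scikit-learn on synthetic data (the candidate grid is arbitrary), is cross-validated grid search:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},       # odd k avoids voting ties
    cv=5,
)
search.fit(X, y)
print(search.best_params_)   # the k with the best cross-validated accuracy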
10. How does multivariate analysis help?
A. Multivariate analysis plays a crucial role in exploring, analyzing, and interpreting complex datasets across various domains, leading to better decision-making, insight generation, and problem-solving.
(a) Understanding Relationships: multivariate analysis helps in understanding the relationships between multiple variables simultaneously. By examining these relationships, researchers can identify patterns, dependencies, and interactions among variables.
(b) Dimensionality Reduction: multivariate techniques help reduce the dimensionality of the dataset by transforming the original variables into a smaller set of uncorrelated variables. This simplification makes it easier to visualize and interpret the data.
(c) Predictive Modeling: multivariate analysis is often used in predictive modeling tasks where the goal is to predict the value of one or more target variables based on a set of predictor variables.
(d) Cluster Analysis: multivariate techniques such as cluster analysis help in identifying groups or clusters within the dataset based on the similarity or dissimilarity between observations. This is useful for segmentation and classification tasks in various fields, including marketing, biology, and social sciences.
(e) Pattern Recognition: multivariate analysis helps in recognizing and extracting patterns from complex datasets. Techniques like Neural Networks, Support Vector Machines (SVM), and Decision Trees are commonly used for pattern recognition tasks in fields such as image processing, speech recognition, and bioinformatics.

2 Major questions

(a) Describe two data preprocessing techniques with a sample dataset.
A. The two data preprocessing techniques considered here are:
• Missing Value Handling: missing value handling techniques replace missing values with substituted values, such as the mean, median, or mode (or estimates from more advanced techniques) of the feature in which the value is missing.
Student dataset with missing values:

Student ID | Age | Grade | Attendance
1          | 18  | A     | 90%
2          | 20  | B     | 80%
3          | NaN | C     | 95%
4          | 22  | B     | NaN
5          | 19  | A     | 85%

The means of the Age and Attendance columns, over the observed values, are 19.75 and 87.5%, so these two values are used to impute the missing entries (with the age rounded to 20).
Result after handling missing values:

Student ID | Age | Grade | Attendance
1          | 18  | A     | 90%
2          | 20  | B     | 80%
3          | 20  | C     | 95%
4          | 22  | B     | 87.5%
5          | 19  | A     | 85%
• Standardization:
Standardization is a data transformation technique used to rescale numerical features to have a mean of 0 and a standard deviation of 1. To standardize a feature, subtract the mean of the feature from each data point and then divide by the standard deviation of the feature.
Sample dataset:

Employee ID | Salary (in $) | Years of Experience
1           | 50000         | 3
2           | 60000         | 5
3           | 45000         | 2
4           | 70000         | 8
5           | 55000         | 4

The mean and σ of each feature, taken separately, are calculated as below. For the salary feature:
µ_salary = (1/5) · Salary^T · 1 = (50000 + 60000 + 45000 + 70000 + 55000)/5 = 56000 (4, 5)
X − 1 · µ_salary = (50000 − 56000, 60000 − 56000, 45000 − 56000, 70000 − 56000, 55000 − 56000)^T = (−6000, 4000, −11000, 14000, −1000)^T (6, 7)
σ_salary = √((1/5) · (X − 1 · µ_salary)^T · (X − 1 · µ_salary)) = √((36 + 16 + 121 + 196 + 1)/5) · 1000 ≈ 8602 (8, 9)
standardized salary = (X − 1 · µ_salary)/σ_salary = (−6000, 4000, −11000, 14000, −1000)^T / 8602 = (−0.6975, 0.465, −1.2788, 1.6275, −0.1163)^T (10, 11)
• Result after standardization: the Salary column becomes (−0.6975, 0.465, −1.2788, 1.6275, −0.1163); the Years of Experience column is standardized in exactly the same way (a scikit-learn sketch follows).
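The same result can be reproduced with scikit-learn's StandardScaler, which divides by the population standard deviation exactly as above (a sketch; only the salary column is shown):

import numpy as np
from sklearn.preprocessing import StandardScaler

salary = np.array([[50000.0], [60000.0], [45000.0], [70000.0], [55000.0]])
z = StandardScaler().fit_transform(salary)  # (x - mu) / sigma, with sigma taken over n = 5
print(z.ravel())                            # matches the standardized salary vector in (11)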
(b) Detail the steps involved in hypothesis testing.
A. Steps in Hypothesis Testing. Hypothesis testing involves several steps to determine the validity of a statistical hypothesis (a SciPy sketch follows the list).
i. Formulate Hypotheses: define the null hypothesis (H0) and the alternative hypothesis (H1).
ii. Choose Significance Level (α): select the level of significance to determine the threshold for rejecting H0.
iii. Select Test Statistic: choose an appropriate test statistic based on the type of data and the hypothesis being tested.
iv. Collect Data: gather data from the sample or population under study.
v. Compute Test Statistic: calculate the value of the test statistic using the collected data.
vi. Determine Critical Region: determine the critical region based on the chosen significance level and test statistic.
vii. Make Decision: compare the test statistic to the critical region and decide whether or not to reject H0.
viii. Draw Conclusion: based on the decision, draw conclusions about the hypothesis being tested.
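A small sketch of these steps using a one-sample t-test in SciPy (the sample values, the hypothesized mean of 50, and α = 0.05 are all illustrative); here the decision is made from the p-value rather than an explicit critical region, which is equivalent:

import numpy as np
from scipy import stats

sample = np.array([51.2, 49.8, 50.5, 52.1, 48.9, 50.7, 51.5, 49.4])  # made-up data
alpha = 0.05                                  # step ii: significance level
# step i: H0: population mean = 50  vs  H1: mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)  # steps iii-v
if p_value < alpha:                           # steps vi-vii: decision
    print("Reject H0")
else:
    print("Fail to reject H0")                # step viii: conclusion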
(c) What are the steps involved in the SVM procedure for classification? Mention the methods for classification model evaluation.
A. Steps involved in the SVM procedure for classification (a scikit-learn sketch follows):
i. Data Collection: gather labeled training data consisting of input features and corresponding class labels.
ii. Data Preprocessing: perform necessary preprocessing steps such as normalization, feature scaling, and handling missing values.
iii. Model Training: train the SVM model using the preprocessed training data.
iv. Model Evaluation: evaluate the trained SVM model using validation data or cross-validation techniques to assess its performance.
v. Parameter Tuning: fine-tune model parameters (e.g., regularization parameter, kernel type) to optimize model performance.
vi. Final Model Deployment: deploy the optimized SVM model for making predictions on new, unseen data.
Evaluation of a classification model:
• Accuracy: percentage of correctly classified instances out of total instances.
• Precision: proportion of true positive predictions out of all positive predictions.
• Recall: proportion of true positive predictions out of all actual positive instances.
• F1 Score: harmonic mean of precision and recall, providing a balance between the two metrics.
• Confusion Matrix: tabulates true positive, true negative, false positive, and false negative predictions.
• ROC Curve: plots the true positive rate against the false positive rate for various classification thresholds.
• Area Under the ROC Curve (AUC): quantifies the model's ability to distinguish between classes.
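A sketch of the full workflow with scikit-learn, using a built-in dataset as a stand-in for collected data (the RBF kernel and C = 1.0 are illustrative defaults):

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                         # i. data collection
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))  # ii-iii. preprocess + train
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)                                       # iv. evaluation
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))                         # precision, recall, F1, accuracy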
(d) Detail univariate, bivariate, and multivariate data analysis, with relevant examples.
A. (A short plotting sketch follows the list.)
• Univariate Data Analysis: univariate analysis focuses on examining one variable at a time. It involves analyzing the distribution, summary statistics, and visualizations of a single variable. Examples: (i) Exam Scores: analyzing the distribution of exam scores for a class, calculating measures like mean, median, and mode, and visualizing with histograms or box plots. (ii) Temperature Data: studying the distribution of temperatures in a city, calculating summary statistics, and creating frequency distributions or density plots.
• Bivariate Data Analysis: bivariate analysis explores the relationship between two variables. It aims to understand how one variable changes with respect to another. Examples: (i) Height and Weight: investigating the relationship between height and weight in a population, using scatter plots and correlation analysis. (ii) Advertising Spending and Sales: examining how advertising spending affects sales, analyzing with scatter plots and regression analysis.
• Multivariate Data Analysis: multivariate analysis involves the simultaneous examination of multiple variables to understand complex relationships and patterns. Examples: (i) Customer Segmentation: using demographic, purchasing behavior, and geographic data to segment customers into different groups for targeted marketing strategies. (ii) Stock Market Analysis: analyzing various factors such as company performance, industry trends, and economic indicators to predict stock prices using techniques like regression analysis or machine learning algorithms.
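A short sketch of univariate and bivariate views with pandas and Matplotlib (the exam-score data is synthetic):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "score": rng.normal(70, 10, 100),          # synthetic exam scores
    "hours_studied": rng.uniform(0, 20, 100),  # synthetic second variable
})
print(df["score"].describe())                  # univariate: summary statistics
df["score"].plot.hist(title="Exam scores")     # univariate: histogram
df.plot.scatter(x="hours_studied", y="score")  # bivariate: scatter plot
plt.show()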
(e) Give the Jaccard coefficient.
A. The Jaccard coefficient gives the similarity between two vectors having binary attributes. Jaccard similarity:
Jaccard Similarity(X, Y) = |X ∩ Y| / |X ∪ Y| = (X^T · Y) / (X^T · X + Y^T · Y − X^T · Y)
(for binary vectors). The formula is shown for two vectors X and Y, both expressed in binary format; the numerator counts the attributes present in both vectors and the denominator counts those present in either.
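A minimal sketch of the binary-vector form (the example vectors are arbitrary):

import numpy as np

x = np.array([1, 1, 0, 1, 0])
y = np.array([1, 0, 0, 1, 1])
intersection = x @ y                  # X^T . Y
union = x @ x + y @ y - intersection  # X^T.X + Y^T.Y - X^T.Y
print(intersection / union)           # 2 / 4 = 0.5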
(f) What are the steps involved in web scraping?
A. Web scraping steps (a sketch follows the list):
i. Identify Target Website: choose the website from which you want to extract data.
ii. Inspect HTML Structure: understand the structure of the HTML to locate data elements.
iii. Use a Scraping Library: utilize a scraping library like BeautifulSoup or Scrapy in Python.
iv. HTTP Requests: send HTTP requests to the website to retrieve HTML content.
v. Parse HTML: parse the HTML content to extract relevant data.
vi. Data Processing: clean and process the extracted data as needed.
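A sketch of steps iii-vi with requests and BeautifulSoup (the URL and the h1 tag are placeholders; a real scraper would target the inspected structure of the chosen site):

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                         # placeholder target website
response = requests.get(url, timeout=10)            # iv. HTTP request
soup = BeautifulSoup(response.text, "html.parser")  # v. parse HTML
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]
print(headings)                                     # vi. process extracted data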
(g) Explain multicollinearity. How is it resolved in multiple linear regression?
A. Definition: multicollinearity refers to the situation in which two or more independent variables in a regression model are highly correlated with each other. Important techniques to address multicollinearity and resolve its impact on multiple linear regression include the following (a detection-and-mitigation sketch follows):
Feature Selection: removing one or more highly correlated variables from the model to reduce multicollinearity.
Feature Transformation: transforming the variables using techniques like Principal Component Analysis (PCA) or Partial Least Squares (PLS) regression to create new, uncorrelated variables.
Regularization: using regularization techniques such as Ridge Regression or LASSO Regression, which penalize large coefficients and help mitigate the effects of multicollinearity.
Collecting More Data: sometimes multicollinearity arises due to a limited amount of data; collecting more data can help reduce the correlation between variables and improve the stability of the regression coefficients.
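A sketch showing how multicollinearity can be detected with variance inflation factors (VIF) and mitigated with Ridge regression; the synthetic data deliberately makes two predictors nearly collinear, and statsmodels and scikit-learn are assumed to be installed:

import numpy as np
from sklearn.linear_model import Ridge
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)  # nearly collinear with x1 (by design)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=100)

print([variance_inflation_factor(X, i) for i in range(X.shape[1])])  # large VIFs flag the problem
print(Ridge(alpha=1.0).fit(X, y).coef_)     # regularization stabilizes the coefficients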
(h) What are the steps in building a k-NN classifier?
A. Steps involved in the k-NN classifier (a from-scratch sketch follows the list):
• Choose the Value of k: decide on an appropriate value of k based on factors like data characteristics, the bias-variance trade-off, and computational considerations.
• Calculate Distances: compute the distance between the test point and each data point in the training set. Common distance metrics include Euclidean distance, Manhattan distance, or others.
• Find the k Nearest Neighbors: identify the k data points with the smallest distances to the test point.
• Majority Vote: for classification tasks, assign the class label of the test point based on the majority class among its k nearest neighbors.
• Prediction: for regression tasks, predict the value of the test point based on the average (or weighted average) of the values of its k nearest neighbors.
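A from-scratch sketch of these steps on a toy dataset (the points and labels are made up):

from collections import Counter
import numpy as np

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0]])
y_train = np.array(["A", "A", "A", "B", "B"])
x_test = np.array([2.5, 2.5])
k = 3                                                  # choose the value of k

dists = np.linalg.norm(X_train - x_test, axis=1)       # calculate Euclidean distances
nearest = np.argsort(dists)[:k]                        # find the k nearest neighbors
print(Counter(y_train[nearest]).most_common(1)[0][0])  # majority vote -> prediction "A"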
(i) Describe any two Python libraries used in data science projects.
A. Important packages applicable to data science include, but are not limited to, the following two; a short combined sketch follows the descriptions.
Pandas: widely used for large data sets.
• Data manipulation
• Analysis
Its DataFrame structure simplifies
• data handling
• cleaning
• filtering
Matplotlib (2D plots): widely used for visualization of data, before and after preprocessing, and for analytics.
• Bar Chart: suitable for displaying and comparing individual categories or groups.
• Line Chart: ideal for visualizing trends or patterns in data, especially over a continuous variable like time.
• Scatter Plot: effective for showing relationships or correlations between two variables.
• Pie Chart: good for illustrating the proportion of each category in a whole.
• Histogram: useful for displaying the distribution of a single continuous variable.
• Box Plot: helpful for visualizing the distribution and identifying outliers in a dataset.
• Heatmap: great for representing the correlation between multiple variables in a matrix format.
Matplotlib (3D plots): also used for visualizing data in three dimensions.
• Scatter Plot in 3D: visualizes individual data points in a 3D space.
• Line Plot in 3D: connects data points with lines in a 3D space.
• Surface Plot: visualizes a surface in 3D.
• Wireframe Plot: represents a 3D surface with lines connecting the data points.
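A short sketch combining the two libraries (the enrollment figures are invented):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "department": ["CSE", "ECE", "MECH", "CIVIL"],  # invented enrollment data
    "students": [240, 180, 120, 90],
})
df = df.dropna()                                    # pandas: cleaning/filtering
df.plot.bar(x="department", y="students", title="Enrollment by department")
plt.show()                                          # matplotlib renders the bar chart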

19

You might also like