
EDA VIVA QUESTIONS

Q1: What is data?

Data refers to raw facts, figures, and statistics collected or stored for reference or analysis. It
can be in various forms, such as numbers, text, images, or multimedia.

Q2: What are the different types of data?

Data can be categorized into structured, semi-structured, and unstructured types. Structured
data is organized and follows a predefined format, like data in databases. Semi-structured data
has some organization but doesn't fit neatly into tables or rows, like XML or JSON files.
Unstructured data lacks any predefined structure and includes text documents, images, videos,
etc.

Q3: What is data analysis?

Data analysis is the process of inspecting, cleansing, transforming, and modeling data to
discover useful information, inform conclusions, and support decision-making.

Q4: What is the difference between data and information?

Data refers to raw facts or figures, whereas information is data that has been processed,
organized, or structured in a meaningful context, making it useful for decision-making or
understanding.

Q5: What is data visualization?

Data visualization is the graphical representation of data to communicate information clearly and efficiently. It helps users analyze and interpret complex datasets by presenting them in a visual format, such as charts, graphs, maps, or infographics.

Q6: What is big data?

Big data refers to large and complex datasets that traditional data processing applications
struggle to handle efficiently. It is characterized by volume, velocity, and variety, and often
requires specialized technologies and techniques for storage, processing, and analysis.

Q7: What is data mining?

Data mining is the process of discovering patterns, correlations, or insights from large datasets
using various techniques, including statistical analysis, machine learning, and artificial
intelligence. It aims to extract valuable information or knowledge from data for decision-
making or prediction purposes.

Q8: What is descriptive analytics?

Descriptive analytics involves analyzing historical data to understand and summarize what has
happened in the past. It focuses on describing data patterns, trends, and characteristics using
various statistical and visualization techniques.

Q9: What are the measures of central tendency?

Measures of central tendency are statistical measures used to describe the center or average of
a dataset. The most common measures are mean, median, and mode. The mean is the average
value, the median is the middle value when the data is arranged in ascending order, and the
mode is the value that appears most frequently.
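
As a quick illustration, the three measures can be computed with Python's built-in statistics module; the sample values below are made up for the example.

import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))    # arithmetic average -> 5.0
print(statistics.median(data))  # middle value of the sorted data -> 4.0
print(statistics.mode(data))    # most frequent value -> 3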

Q10: How is variance calculated?

Variance measures the dispersion or spread of a dataset around its mean. It is calculated by
taking the average of the squared differences between each data point and the mean.
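
A minimal sketch of that calculation, using made-up values; note that this is the population variance (dividing by n), while the sample variance divides by n - 1 (statistics.variance).

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)  # 5.0

# average of the squared differences between each point and the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

print(variance)                    # 4.0
print(statistics.pvariance(data))  # same result from the standard library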

Q11: What is a histogram?

A histogram is a graphical representation of the distribution of a dataset. It consists of bars where the height of each bar represents the frequency or proportion of data values falling within a particular interval or "bin" on the horizontal axis.
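
A minimal plotting sketch with matplotlib; the data are synthetic and the bin count of 20 is an arbitrary choice for the example.

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=0, scale=1, size=1000)  # synthetic sample

plt.hist(values, bins=20, edgecolor="black")  # 20 equal-width bins
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of a synthetic sample")
plt.show()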

Q12: What is skewness in statistics?

Skewness measures the asymmetry of the probability distribution of a dataset. A distribution is symmetric if it looks the same on both sides of its center. Positive skewness indicates that the right tail of the distribution is longer or fatter than the left tail, while negative skewness indicates the opposite.
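
A minimal sketch using scipy.stats.skew; the two small samples are made up so that one has a long right tail and the other a long left tail.

from scipy.stats import skew

right_skewed = [1, 1, 2, 2, 3, 3, 4, 10]     # long right tail
left_skewed = [1, 8, 9, 9, 10, 10, 11, 11]   # long left tail

print(skew(right_skewed))  # positive value
print(skew(left_skewed))   # negative value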

Q13: What is Euclidean distance?

Euclidean distance is a measure of the straight-line distance between two points in Euclidean
space. It is the length of the line segment connecting the two points.

Q14: How is Euclidean distance calculated in two dimensions?

In two dimensions, the Euclidean distance between two points (x1, y1) and (x2, y2) can be calculated using the distance formula: Distance = √((x2 − x1)² + (y2 − y1)²)
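
For example, the distance between the made-up points (1, 2) and (4, 6) can be checked in Python:

import math

p = (1, 2)
q = (4, 6)

# apply the distance formula directly
print(math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2))  # 5.0

# math.dist (Python 3.8+) computes the same thing
print(math.dist(p, q))  # 5.0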

Q15: What is supervised learning?


Supervised learning is a type of machine learning where the algorithm learns from labelled
data, which means the input data has corresponding output labels. The goal is to learn a
mapping from input features to output labels so that the model can make predictions on new,
unseen data.

Q16: What are some examples of supervised learning tasks?

Examples of supervised learning tasks include:

• Classification: Predicting discrete class labels (e.g., spam or not spam, dog breed recognition).

• Regression: Predicting continuous numerical values (e.g., house prices, stock prices). Both tasks are sketched in the example after this list.
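
The sketch below shows both tasks with scikit-learn; the toy features and labels are made up purely for illustration.

from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: discrete target labels (0 or 1)
X_cls = [[1.0], [2.0], [3.0], [4.0]]
y_cls = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[3.5]]))   # predicted class label

# Regression: continuous numerical target
X_reg = [[1.0], [2.0], [3.0], [4.0]]
y_reg = [1.1, 1.9, 3.2, 3.9]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5.0]]))   # predicted numerical value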

Q17: What is unsupervised learning?

Unsupervised learning is a type of machine learning where the algorithm learns patterns or
relationships in the input data without explicit supervision or labelled output. The goal is to
explore the structure of the data and discover hidden patterns or groupings.

Q18: What are some examples of unsupervised learning tasks?

Examples of unsupervised learning tasks include:

• Clustering: Grouping similar data points together based on their features (e.g., customer segmentation); a minimal sketch follows this list.

• Dimensionality reduction: Reducing the number of features while preserving the essential information in the data (e.g., principal component analysis).
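
A minimal clustering sketch with scikit-learn's KMeans; the six toy points and the choice of two clusters are assumptions made for the example.

from sklearn.cluster import KMeans

# no labels are provided; the algorithm groups the points on its own
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # learned group centres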

Q19: What is overfitting in machine learning?

Overfitting occurs when a machine learning model learns the training data too well, capturing
noise or random fluctuations in the data rather than the underlying patterns. As a result, the
model performs well on the training data but poorly on new, unseen data.

Some indicators of overfitting include:

• High accuracy or performance on the training data but significantly lower performance on the testing data, as illustrated in the sketch below.

• A model with complex decision boundaries that closely fit the training data points but fail to generalize well to new data.
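
A minimal sketch of that check: compare the training and test scores of a deliberately unconstrained model. The synthetic dataset, the label noise (flip_y), and the choice of a decision tree are all illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# an unconstrained tree can memorise the noisy training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # close to 1.0
print("test accuracy:", tree.score(X_test, y_test))     # noticeably lower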


Q20: What is underfitting in machine learning?

Underfitting occurs when a machine learning model is too simple to capture the underlying structure of the data, resulting in poor performance on both the training data and new, unseen data.

Some indicators of underfitting include:

• Low accuracy or performance on both the training data and testing data.

• A model that fails to capture the complexity or variability in the data, resulting in high bias and low variance.

Q21: What is cross-validation in machine learning?

Cross-validation is a technique used to assess the performance and generalization ability of a machine learning model. It involves partitioning the dataset into multiple subsets, training the model on different subsets, and evaluating its performance on the remaining data. This process helps estimate how well the model will perform on unseen data.

Q22: What are the common types of cross-validation?

Common types of cross-validation include:

• k-fold cross-validation: The dataset is divided into k subsets, and the model is trained and evaluated k times, each time using a different subset as the testing data and the remaining subsets as the training data.

• Leave-one-out cross-validation (LOOCV): Each data point is used as the testing data once, with the remaining data points used for training.

Q23: How does k-fold cross-validation work?

In k-fold cross-validation, the dataset is divided into k equally sized subsets (or folds). The
model is trained k times, with each iteration using a different fold as the testing data and the
remaining folds as the training data. The final performance metric is calculated by averaging
the performance across all k iterations.
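
A minimal sketch of 5-fold cross-validation with scikit-learn; the built-in iris dataset and logistic regression model are arbitrary choices for the example.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5 train/evaluate rounds, one per fold
print(scores)         # one accuracy value per fold
print(scores.mean())  # performance averaged across the k folds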

Q24: What is frequentist statistics?

Frequentist statistics is an approach to statistical inference based on the frequency or proportion of events occurring in repeated, independent experiments. It relies on the concept of probability as the limit of the relative frequency of events.

Q25: What is linear regression?



Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. With a single independent variable, the equation is represented as y = mx + b (a minimal fitting sketch follows the list of terms below), where:

• y is the dependent variable (target)

• x is the independent variable (feature)

• m is the slope of the line

• b is the y-intercept
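
A minimal fitting sketch with NumPy's least-squares polyfit; the x and y values are made up so that the underlying relationship is roughly y = 2x.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, b = np.polyfit(x, y, deg=1)  # least-squares fit of a degree-1 polynomial
print(m, b)                     # slope close to 2, intercept close to 0
print(m * 6.0 + b)              # prediction for a new x value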

Q26: What are Regression Shrinkage Methods?

Regression Shrinkage Methods, also known as regularization techniques, are approaches used
to prevent overfitting in regression models by adding a penalty term to the loss function. They
include techniques like Ridge Regression, Lasso Regression, and Elastic Net.

Q27: What is Ridge Regression?

Ridge Regression is a regression technique that adds a penalty term to the loss function,
proportional to the square of the magnitude of the coefficients. This penalty term helps to shrink
the coefficients towards zero, reducing the model's complexity and preventing overfitting.

Q28: What is Lasso Regression?

Lasso Regression, or Least Absolute Shrinkage and Selection Operator, is a regression technique that adds a penalty term to the loss function, proportional to the absolute value of the coefficients. Lasso tends to shrink some coefficients to exactly zero, effectively performing variable selection and producing sparse models.

Q29: What is Elastic Net?

Elastic Net is a regression technique that combines the penalties of Ridge Regression and Lasso
Regression. It adds both the L1 (Lasso) and L2 (Ridge) penalties to the loss function, allowing
for variable selection while still handling multicollinearity.
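
The sketch below fits all three shrinkage methods with scikit-learn and counts how many coefficients each one drives to exactly zero; the synthetic dataset and the penalty strengths (alpha, l1_ratio) are illustrative values, not recommendations.

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    zero_coefs = sum(coef == 0 for coef in model.coef_)
    print(type(model).__name__, "coefficients shrunk to exactly zero:", zero_coefs)

Typically Ridge shrinks coefficients without zeroing them, while Lasso and Elastic Net set some to exactly zero, which is the variable-selection behaviour described above.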

Q30: When should you use Regression Shrinkage Methods?


Regression Shrinkage Methods are useful when dealing with high-dimensional data or
situations where multicollinearity is present. They help to prevent overfitting and improve the
generalization performance of regression models.

Q31: What are Tree-based Methods?

Tree-based Methods are machine learning techniques that use decision trees as the primary
model structure. They include algorithms like Decision Trees, Random Forests, Gradient
Boosting Machines (GBM), and Extreme Gradient Boosting (XGBoost).

Q32: What is a Decision Tree?

A Decision Tree is a supervised learning algorithm that recursively splits the dataset into
subsets based on the values of input features, aiming to minimize impurity or maximize
information gain at each split. It results in a tree-like structure where each internal node
represents a decision based on a feature, and each leaf node represents a class label or
prediction.
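
A minimal sketch that fits a small tree on a built-in dataset and prints the learned splits; the depth limit and impurity criterion are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# max_depth limits the recursive splitting; criterion selects the impurity measure
tree = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=0).fit(X, y)
print(export_text(tree))  # the internal decisions and leaf predictions as text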

Q33: What are Random Forests?

Random Forests are an ensemble learning technique that builds multiple decision trees
independently and combines their predictions through averaging or voting. Each tree is trained
on a random subset of the training data and a random subset of features, which helps reduce
overfitting and improve generalization performance.

Q34: What is bagging in machine learning?

Bagging, short for Bootstrap Aggregating,
is an ensemble learning technique that combines multiple base learners, typically decision
trees, trained on different bootstrap samples of the training data. It aims to reduce variance and
improve stability by averaging or voting the predictions of individual models.
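
A minimal bagging sketch with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the dataset and the number of estimators are arbitrary choices for the example.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# 50 trees, each trained on a bootstrap sample; predictions are combined by voting
bagger = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(bagger.predict(X[:3]))  # combined prediction for the first three samples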


Q35: How does boosting differ from bagging?

Boosting differs from bagging in that it sequentially builds a series of base learners, with each learner focusing on examples that were misclassified or had high residuals by the previous learners. Boosting aims to reduce bias and improve performance by iteratively improving the model's predictive ability.

Q36: What is Principal Components Analysis (PCA)?

PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving the most important information in the data. It achieves this by finding the principal components, which are the orthogonal directions that capture the maximum variance in the data.

PCA works by finding the eigenvectors (principal components) of the covariance matrix of the data and projecting the data onto these eigenvectors. The first principal component captures the most variance in the data, with each subsequent component capturing less variance but being orthogonal to the previous ones.
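
A minimal sketch projecting the 4-feature iris dataset onto its first two principal components; the choice of two components is just for illustration.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 150 samples, 4 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # shape (150, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)   # share of variance captured by each component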

Q37: What are some applications of PCA?

• Dimensionality reduction: PCA can be used to reduce the number of features in high-dimensional datasets while preserving most of the information.

• Data visualization: PCA can be used to visualize high-dimensional data in lower-dimensional space, making it easier to explore and interpret.

• Noise reduction: PCA can help remove noise or redundant information from data, leading to better performance in subsequent analysis tasks.

Q38: What is classification in machine learning?

Classification is a supervised learning task where the goal is to predict the class label or
category of a new instance based on its features. It involves learning a mapping from input
features to discrete class labels or categories.

Q39: What are some common classification algorithms?

Some common classification algorithms include:

• Logistic Regression: A linear model used for binary classification tasks.

• Decision Trees: Tree-based models that partition the feature space into regions based on the values of input features.

• k-Nearest Neighbors (KNN): A non-parametric algorithm that classifies instances based on the majority vote of their k nearest neighbors in feature space.

• Support Vector Machines (SVM): A model that finds the hyperplane that best separates the classes in feature space.

Q40: What is a Support Vector Machine (SVM)?



A Support Vector Machine (SVM) is a supervised learning algorithm used for classification,
regression, and outlier detection. It works by finding the hyperplane that best separates the
classes in feature space while maximizing the margin between the classes.

Q41: What is a Kernel?

A kernel is a function that computes a similarity (an inner product) between data points in a transformed feature space without explicitly performing the transformation. Kernels are commonly used in algorithms like Support Vector Machines (SVMs) to implicitly map the input data into a higher-dimensional space, where the data may be more easily separable. Common types of kernels include (a brief sketch follows the list):

• Linear Kernel: Computes the dot product of the input features in the original space.

• Polynomial Kernel: Computes the dot product of the input features after applying polynomial transformations to them.

• Radial Basis Function (RBF) Kernel: Computes the similarity between data points based on their distance in the feature space, typically using a Gaussian function.

• Sigmoid Kernel: Computes the similarity between data points using a hyperbolic tangent function.

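A brief sketch of how the kernel is selected in scikit-learn's SVC; the two-moons dataset and default hyperparameters are assumptions made only to compare the kernels above.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 3))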
