EDA VIVA QUESTIONS
Data refers to raw facts, figures, and statistics collected or stored for reference or analysis. It
can be in various forms, such as numbers, text, images, or multimedia.
Data can be categorized into structured, semi-structured, and unstructured types. Structured
data is organized and follows a predefined format, like data in databases. Semi-structured data
has some organization but doesn't fit neatly into tables or rows, like XML or JSON files.
Unstructured data lacks any predefined structure and includes text documents, images, videos,
etc.
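As a rough illustration in Python (the records below are made up), the three types differ in how much schema the data itself carries:

```python
import json

# Structured: fixed schema, every record has the same fields
row = {"id": 1, "name": "Asha", "age": 30}

# Semi-structured: self-describing but irregular, e.g. JSON
text = '{"id": 2, "name": "Ravi", "tags": ["new", "vip"]}'
record = json.loads(text)   # fields may vary from record to record
print(record["tags"])       # ['new', 'vip']

# Unstructured: no predefined fields, e.g. free text
review = "The product arrived late but works well."
print(len(review.split()))  # any structure must be inferred, e.g. a word count
```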
Data analysis is the process of inspecting, cleansing, transforming, and modeling data to
uncover useful information, draw conclusions, and support decision-making.
Data refers to raw facts or figures, whereas information is data that has been processed,
organized, or structured in a meaningful context, making it useful for decision-making or
understanding.
Big data refers to large and complex datasets that traditional data processing applications
struggle to handle efficiently. It is characterized by volume, velocity, and variety, and often
requires specialized technologies and techniques for storage, processing, and analysis.
Data mining is the process of discovering patterns, correlations, or insights from large datasets
using various techniques, including statistical analysis, machine learning, and artificial
intelligence. It aims to extract valuable information or knowledge from data for decision-
making or prediction purposes.
Descriptive analytics involves analyzing historical data to understand and summarize what has
happened in the past. It focuses on describing data patterns, trends, and characteristics using
various statistical and visualization techniques.
Measures of central tendency are statistical measures used to describe the center or average of
a dataset. The most common measures are mean, median, and mode. The mean is the average
value, the median is the middle value when the data is arranged in ascending order, and the
mode is the value that appears most frequently.
Variance measures the dispersion or spread of a dataset around its mean. It is calculated by
taking the average of the squared differences between each data point and the mean.
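As a quick illustration, all of these measures can be computed with Python's built-in statistics module (a minimal sketch; the sample values are made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))       # 5.0  -> average value
print(statistics.median(data))     # 4.5  -> middle value of the sorted data
print(statistics.mode(data))       # 4    -> most frequent value

# Population variance: average squared deviation from the mean
print(statistics.pvariance(data))  # 4.0
# Sample variance divides by n - 1 instead of n
print(statistics.variance(data))   # ~4.571
```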
Euclidean distance is a measure of the straight-line distance between two points in Euclidean
space. It is the length of the line segment connecting the two points.
In two dimensions, the Euclidean distance between two points (x₁, y₁) and (x₂, y₂) can be
calculated using the distance formula: Distance = √((x₂ − x₁)² + (y₂ − y₁)²)
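A small Python sketch of this formula, generalized to points of any dimension; math.dist in the standard library computes the same quantity directly:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0, since 3^2 + 4^2 = 25
print(math.dist((1, 2), (4, 6)))           # 5.0 (Python 3.8+)
```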
Supervised learning is a type of machine learning where the algorithm learns from labelled
data, which means the input data has corresponding output labels. The goal is to learn a
mapping from input features to output labels so that the model can make predictions on new,
unseen data.
Classification: Predicting discrete class labels (e.g., spam or not spam, dog breed
recognition).
Regression: Predicting continuous numerical values (e.g., house prices, stock prices).
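A minimal supervised-learning sketch, assuming Python with scikit-learn installed; the iris dataset and logistic regression are illustrative choices, not prescribed by the text:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # features and class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)      # classification: discrete labels
clf.fit(X_train, y_train)                    # learn from labelled data
print(clf.score(X_test, y_test))             # accuracy on new, unseen data
```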
Unsupervised learning is a type of machine learning where the algorithm learns patterns or
relationships in the input data without explicit supervision or labelled output. The goal is to
explore the structure of the data and discover hidden patterns or groupings.
Clustering: Grouping similar data points together based on their features (e.g., customer
segmentation).
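A small clustering sketch, again assuming scikit-learn; the six 2-D points are made up to form two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [8, 8], [8.5, 9], [0.5, 1.5], [9, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no labels used
print(km.labels_)           # cluster assignment discovered for each point
print(km.cluster_centers_)  # one centre near (1, 1.5), one near (8.5, 8.5)
```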
Overfitting occurs when a machine learning model learns the training data too well, capturing
noise or random fluctuations in the data rather than the underlying patterns. As a result, the
model performs well on the training data but poorly on new, unseen data.
High accuracy or performance on the training data but significantly lower performance
on the testing data.
A model with complex decision boundaries that closely fit the training data points but
fail to generalize well to new data.
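One way to see this symptom in practice, sketched with scikit-learn; the synthetic dataset and the unpruned decision tree are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y adds label noise the model should not memorize
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # unpruned
print(tree.score(X_train, y_train))  # typically 1.0 on the training data
print(tree.score(X_test, y_test))    # noticeably lower on unseen data
```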
Underfitting occurs when a machine learning model is too simple to capture the underlying
structure of the data, resulting in poor performance on both the training data and new, unseen
data.
Low accuracy or performance on both the training data and testing data.
A model that fails to capture the complexity or variability in the data, resulting in high
bias and low variance.
k-fold cross-validation: The dataset is divided into k subsets, and the model is trained
and evaluated k times, each time using a different subset as the testing data and the
remaining subsets as the training data.
Leave-one-out cross-validation (LOOCV): Each data point is used as the testing data
once, with the remaining data points used for training.
In k-fold cross-validation, the dataset is divided into k equally sized subsets (or folds). The
model is trained k times, with each iteration using a different fold as the testing data and the
remaining folds as the training data. The final performance metric is calculated by averaging
the performance across all k iterations.
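A brief sketch of both schemes, assuming scikit-learn; cross_val_score handles the fold splitting, repeated training, and per-fold scoring:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(scores.mean())                         # average over the 5 folds

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # LOOCV
print(loo_scores.mean())                     # average over n single-point folds
```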
Linear regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to the observed
data. With a single independent variable, the equation is represented as y = mx + b, where y is
the dependent variable, x is the independent variable, m is the slope, and b is the y-intercept.
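A minimal least-squares fit in Python; numpy.polyfit with degree 1 recovers m and b (the data points below are made up):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

m, b = np.polyfit(x, y, deg=1)  # slope and intercept of the best-fit line
print(m, b)                     # m ~ 2.0, b ~ 0.1
print(m * 6 + b)                # prediction for x = 6, roughly 12
```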
Regression Shrinkage Methods, also known as regularization techniques, are approaches used
to prevent overfitting in regression models by adding a penalty term to the loss function. They
include techniques like Ridge Regression, Lasso Regression, and Elastic Net.
Ridge Regression is a regression technique that adds a penalty term to the loss function,
proportional to the square of the magnitude of the coefficients. This penalty term helps to shrink
the coefficients towards zero, reducing the model's complexity and preventing overfitting.
Elastic Net is a regression technique that combines the penalties of Ridge Regression and Lasso
Regression. It adds both the L1 (Lasso) and L2 (Ridge) penalties to the loss function, allowing
for variable selection while still handling multicollinearity.
Regression Shrinkage Methods are useful when dealing with high-dimensional data or
situations where multicollinearity is present. They help to prevent overfitting and improve the
generalization performance of regression models.
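A short sketch of the three shrinkage methods, assuming scikit-learn; the alpha values are arbitrary illustrations, not tuned choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

for model in (Ridge(alpha=1.0),                      # L2 penalty
              Lasso(alpha=1.0),                      # L1 penalty, sparse coefs
              ElasticNet(alpha=1.0, l1_ratio=0.5)):  # mix of L1 and L2
    model.fit(X, y)
    n_zero = sum(c == 0 for c in model.coef_)
    print(type(model).__name__, "zeroed coefficients:", n_zero)
```

The L1 penalty can drive coefficients exactly to zero (variable selection), which is why Lasso and Elastic Net typically report zeroed coefficients while Ridge only shrinks them.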
Tree-based Methods are machine learning techniques that use decision trees as the primary
model structure. They include algorithms like Decision Trees, Random Forests, Gradient
Boosting Machines (GBM), and Extreme Gradient Boosting (XGBoost).
A Decision Tree is a supervised learning algorithm that recursively splits the dataset into
subsets based on the values of input features, aiming to minimize impurity or maximize
information gain at each split. It results in a tree-like structure where each internal node
represents a decision based on a feature, and each leaf node represents a class label or
prediction.
Random Forests are an ensemble learning technique that builds multiple decision trees
independently and combines their predictions through averaging or voting. Each tree is trained
on a random subset of the training data and a random subset of features, which helps reduce
overfitting and improve generalization performance.
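A quick comparison sketch, assuming scikit-learn; iris and the hyperparameters shown are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())    # single decision tree
print(cross_val_score(forest, X, y, cv=5).mean())  # averaged ensemble of trees
```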
Bagging, short for Bootstrap Aggregating, is an ensemble learning technique that combines
multiple base learners, typically decision trees, trained on different bootstrap samples of the
training data. It aims to reduce variance and improve stability by averaging or voting the
predictions of the individual models.
Boosting differs from bagging in that it sequentially builds a series of base learners, with each
learner focusing on examples that were misclassified or had high residuals under the previous
learners. Boosting aims to reduce bias and improve performance by iteratively improving the
model's predictive ability.
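A rough side-by-side sketch, assuming scikit-learn; BaggingClassifier stands in for bagging and GradientBoostingClassifier for boosting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)  # parallel trees
boosting = GradientBoostingClassifier(random_state=0)         # sequential fits

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```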
Principal Component Analysis (PCA) transforms data into a set of uncorrelated components
ordered by the amount of variance they explain. Common applications include:
Dimensionality reduction: PCA can be used to reduce the number of features in high-
dimensional datasets while preserving most of the information.
Noise reduction: PCA can help remove noise or redundant information from data,
leading to better performance in subsequent analysis tasks.
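A minimal PCA sketch, assuming scikit-learn; iris's four features are projected onto two principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # 150 x 4 -> 150 x 2
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```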
Classification is a supervised learning task where the goal is to predict the class label or
category of a new instance based on its features. It involves learning a mapping from input
features to discrete class labels or categories.
Some common classification algorithms include:
Decision Trees: Tree-based models that partition the feature space into regions based
on the values of input features.
Support Vector Machines (SVM): A model that finds the hyperplane that best separates
the classes in feature space.
A Support Vector Machine (SVM) is a supervised learning algorithm used for classification,
regression, and outlier detection. It works by finding the hyperplane that best separates the
classes in feature space while maximizing the margin between the classes.
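A minimal SVM classification sketch, assuming scikit-learn's SVC; the linear kernel and C value are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="linear", C=1.0)  # maximise the margin between classes
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))   # accuracy on held-out data
```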
Kernels are commonly used in algorithms like Support Vector Machines (SVMs) to implicitly
map the input data into a higher-dimensional space, where the data may be more easily
separable. Common kernels include:
Linear Kernel: Computes the dot product of the input features in the original space.
Polynomial Kernel: Computes the dot product of the input features after applying a
polynomial transformation.
Radial Basis Function (RBF) Kernel: Computes the similarity between data points based
on the distance between them, using a Gaussian function.
Sigmoid Kernel: Computes the similarity between data points using a hyperbolic
tangent function.
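These kernels can be compared directly through SVC's kernel parameter, assuming scikit-learn; iris is an illustrative dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    svm = SVC(kernel=kernel)
    score = cross_val_score(svm, X, y, cv=5).mean()
    print(kernel, round(score, 3))  # cross-validated accuracy per kernel
```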