ML Quiz Questions

Q16: What is the primary goal of the Business Understanding phase in CRISP-DM?

A16: To fully comprehend the project's objectives and requirements from a business perspective, and to convert this knowledge into a data mining problem definition.

Q17: Why is the Data Understanding phase crucial in CRISP-DM? A17: It allows for an initial exploration
of the data, helping to understand its structure, quality, and potential quirks or patterns that may
influence subsequent analysis.

Q18: In which phase of CRISP-DM would you handle missing values and outliers? A18: Data Preparation.
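
For instance, a minimal pandas sketch of both tasks (the 'income' column and all of its values are invented for illustration):

    import pandas as pd

    df = pd.DataFrame({"income": [42_000, 55_000, None, 61_000, 1_000_000]})

    # Impute the missing value with the column median
    df["income"] = df["income"].fillna(df["income"].median())

    # Clip extreme outliers to the 1st-99th percentile range
    low, high = df["income"].quantile([0.01, 0.99])
    df["income"] = df["income"].clip(low, high)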

Q19: How does the Evaluation phase in CRISP-DM differ from the Modeling phase? A19: While the
Modeling phase focuses on building various models using algorithms, the Evaluation phase concentrates
on assessing the models' quality and validity to ensure they meet the business objectives.

Q20: Why is deployment considered a crucial phase in CRISP-DM, especially from a business
perspective? A20: Deployment ensures the model's findings are applied to business operations, thus
driving strategic or tactical actions. A model that isn't deployed doesn't deliver business value.

Q21: How would you iterate in the CRISP-DM process? A21: After evaluation, if a model does not meet business objectives or needs refinement, one might return to the Data Preparation or even the Data Understanding phase. Iteration is a fundamental aspect of CRISP-DM, emphasizing a cyclical, rather than strictly linear, process.

Q22: What role does feedback play in the CRISP-DM methodology? A22: Feedback is essential for
refining models and ensuring they remain relevant over time. As business needs evolve or as new data
becomes available, models might need adjustments.

Q23: Why might you return to the Business Understanding phase after deploying a model? A23:
Feedback from the deployment might reveal new business questions, changing priorities, or insights that
necessitate a revised or entirely new analysis.

Q24: In which CRISP-DM phase would feature engineering typically occur? A24: Data Preparation.

Q25: Why is understanding business context vital for data miners, even if they are more technically
oriented? A25: Business context provides clarity on which problems are essential to solve, helps
prioritize tasks, and ensures the solutions are aligned with business goals. It ensures the technical work
delivers tangible business value.

Q26: Can you name a risk of skipping the Data Understanding phase in CRISP-DM? A26: Skipping Data
Understanding might lead to oversight of critical issues in the data, such as missing values,
inconsistencies, or biases. Such issues, if unnoticed, can significantly impact the model's performance.

Q27: How can the Evaluation phase in CRISP-DM help in avoiding the pitfall of overfitting? A27: The Evaluation phase allows for testing the model on a separate dataset (a validation or test set) that it hasn't seen during training. If a model performs well on the training set but poorly on the test set, overfitting might be a concern.
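
A minimal scikit-learn sketch of this check (the synthetic dataset and the 1-NN model are purely illustrative):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

    # A large gap between these two scores suggests overfitting
    print("train accuracy:", model.score(X_train, y_train))
    print("test accuracy:", model.score(X_test, y_test))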

Q28: How does the CRISP-DM process ensure that data mining efforts align with business goals? A28: By
beginning with the Business Understanding phase and regularly revisiting and refining the business
objectives, CRISP-DM ensures that the technical efforts are always aligned with the business's goals and
priorities.

Q29: You've been given a dataset with 10,000 rows, and upon analysis, you realize there are 1,500
missing values in one of the key columns. In which CRISP-DM phase would you decide how to handle
these missing values?

A29: Data Preparation.

Q30: You've developed a predictive model for forecasting sales. After deploying it, the sales team
provides feedback that its predictions are not aligning with ground realities. What CRISP-DM phase
should you revisit first?

A30: Evaluation, followed by Business Understanding to recheck the objectives and possibly Data
Understanding to delve deeper into the data.

Q31: At the end of a project, you want to summarize and detail all the insights and patterns you've
discovered from your analysis. In which CRISP-DM phase would you do this?

A31: Deployment.

Q32: You're starting a new data mining project. Your first step is to understand what the business aims to
achieve and what problems it aims to solve with the data. Which CRISP-DM phase are you in?

A32: Business Understanding.

Q33: Your company wants to launch a new product. Before doing so, they'd like to analyze historical
sales data to understand which regions might be most receptive to the new product. What CRISP-DM
phase is this process initiating with?

A33: Business Understanding.

Q34: During analysis, you discover that the data from one of the sources is in a different format than
others, causing inconsistencies. In which CRISP-DM phase would you address this?

A34: Data Preparation.

Q35: You've just completed your model, and before deploying it, you want to test its performance on
new, unseen data. Which CRISP-DM phase does this refer to?

A35: Evaluation.

Q36: After deploying your churn prediction model, you realize that customer behavior trends are
changing rapidly, making your model outdated in just a few months. In light of CRISP-DM, how should
you address this?

A36: Reiterate the CRISP-DM process, revisiting phases like Data Understanding and Data Preparation to
accommodate the new trends and then re-evaluating the model.

Q37: You are provided with an extremely large dataset, and it's not feasible to analyze all of it. In which
CRISP-DM phase would you decide on a strategy to sample or reduce the data size for analysis?

A37: Data Understanding.

Q38: Your e-commerce business wants to understand why a specific product has high returns. They have
data on customer reviews, purchase history, and return records. What CRISP-DM phase would you
initially delve into to find potential reasons?

A38: Business Understanding, followed by Data Understanding to explore the data sources and their
potential insights.

Q39: After analyzing customer feedback, you decide to incorporate additional external data, like weather
patterns, to improve your retail sales prediction model. In which CRISP-DM phase would this data
integration happen?

A39: Data Preparation.

Q40: Before finalizing your predictive model, you want to try different algorithms and techniques to see
which one provides the best results. Which CRISP-DM phase are you in?

A40: Modeling.

Q1: What is the basic idea behind the kNN algorithm?

A1: kNN classifies a data point based on how its neighbors are classified.

Q2: What role does the parameter 'k' play in kNN?

A2: 'k' determines the number of nearest neighbors to consider for classification or regression.
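
For instance, a minimal scikit-learn sketch, where 'k' maps to the n_neighbors parameter (the toy points are made up):

    from sklearn.neighbors import KNeighborsClassifier

    X = [[1, 1], [1, 2], [4, 4], [5, 4]]  # training points
    y = [0, 0, 1, 1]                      # their class labels

    # 'k' = 3: the three nearest training points vote on the label
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(knn.predict([[2, 2]]))  # majority of its 3 neighbors is class 0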

Q3: How does kNN handle multi-class classification problems?

A3: kNN considers the majority class among the 'k' nearest neighbors as the prediction.

Q4: Why is feature scaling important in kNN?

A4: Because kNN relies on distance measures, differing scales between features can bias the results.
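
A common pattern is therefore to standardize features before fitting; a scikit-learn sketch (the toy age/income values are invented):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Without scaling, income (tens of thousands) would swamp age (tens)
    # in every distance calculation.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

    X = [[25, 40_000], [32, 52_000], [47, 61_000], [51, 58_000]]
    y = [0, 0, 1, 1]
    knn.fit(X, y)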

Q5: How does kNN differ from k-means clustering?

A5: kNN is a supervised learning method for classification or regression, while k-means is an
unsupervised clustering technique.

Q6: What happens when 'k' is set to 1 in kNN?

A6: The classification of a data point is directly assigned based on its closest neighbor's class.

Q7: Why might a very large 'k' value be problematic in kNN?

A7: A large 'k' may include irrelevant points, causing noise and potentially less accurate predictions.

Q8: How do we choose the best 'k' for kNN?

A8: Using techniques like cross-validation to test the performance of various 'k' values.
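
A minimal sketch with scikit-learn's cross_val_score on the Iris data (the candidate 'k' values are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Score each candidate 'k' with 5-fold cross-validation
    for k in [1, 3, 5, 7, 9]:
        score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(k, score.mean())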

Q9: Can kNN be used for regression?

A9: Yes, by averaging or taking the median of the 'k' nearest neighbors' values.
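
For example, with scikit-learn's KNeighborsRegressor (toy one-dimensional data):

    from sklearn.neighbors import KNeighborsRegressor

    X = [[1], [2], [3], [4], [5]]
    y = [1.0, 2.1, 2.9, 4.2, 5.1]

    # The prediction is the mean of the 3 nearest neighbors' target values
    reg = KNeighborsRegressor(n_neighbors=3).fit(X, y)
    print(reg.predict([[3.5]]))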

Q10: What is a common distance metric used in kNN?

A10: Euclidean distance, though others like Manhattan distance can also be used.
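
In scikit-learn terms, the default is the Minkowski metric with p=2, which is exactly Euclidean distance:

    from sklearn.neighbors import KNeighborsClassifier

    knn_euclidean = KNeighborsClassifier(n_neighbors=5)  # metric="minkowski", p=2
    knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")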

Q11: How does kNN handle missing values?

A11: Common strategies include imputation or using distance metrics that handle missing values.

Q12: Why might kNN be slow on large datasets?

A12: Because it must calculate distances between the input point and every point in the dataset.

Q13: What's a drawback of using kNN for high-dimensional data?

A13: The "curse of dimensionality" can make distances less discriminative, affecting kNN's performance.

Q14: How does weighting affect kNN predictions?

A14: Weights can assign more importance to closer neighbors, influencing the final prediction.
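
In scikit-learn this is controlled by the weights parameter:

    from sklearn.neighbors import KNeighborsClassifier

    # weights="uniform" (the default): every neighbor's vote counts equally.
    # weights="distance": closer neighbors get proportionally larger votes.
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance")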

Q15: Why is it not always ideal to choose an even 'k' in kNN for binary classification?

A15: An even 'k' can lead to ties, requiring additional methods to break them.

Q16: In what scenarios might kNN be less effective?

A16: When the dataset is large, high-dimensional, or when features have differing scales.

Q17: Does kNN make assumptions about the underlying data distribution?

A17: No, kNN is non-parametric and makes no assumptions about data distribution.

Q18: Can kNN handle categorical data?

A18: Yes, provided an appropriate distance metric is used, such as the Hamming distance for categorical or binary attributes.

Q19: How does increasing 'k' affect the bias-variance trade-off in kNN?

A19: Increasing 'k' typically increases bias but reduces variance.

Q20: How can the performance of kNN be improved for large datasets?

A20: Using data structures like KD-Trees or Ball Trees to speed up the nearest neighbor search.
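
In scikit-learn this corresponds to the algorithm parameter ("auto", the default, picks a structure heuristically):

    from sklearn.neighbors import KNeighborsClassifier

    knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
    knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")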

Q21: How does the choice of distance metric in kNN affect its performance in high-dimensional spaces?

A21: In high-dimensional spaces, many distance metrics, like Euclidean, become less effective due to the
curse of dimensionality, making points seem equidistant.

Q22: Why might a very small 'k' lead to a high variance in the model?

A22: A smaller 'k' makes the algorithm more sensitive to noise in the data, causing overfitting and high
variance.

Q23: How does kNN handle imbalanced datasets?

A23: kNN can be biased towards the majority class in imbalanced datasets, leading to suboptimal
classifications for the minority class.

Q24: How does the locality-sensitive hashing (LSH) technique relate to kNN?

A24: LSH is used to approximate nearest neighbor searches in high-dimensional spaces, making kNN
searches faster.

Q25: How might one address the computational inefficiencies of kNN?

A25: Using techniques like dimensionality reduction, tree-based structures, or approximated methods can speed up kNN.

Q26: How does the radius-based nearest neighbor approach differ from kNN?

A26: Instead of considering a fixed number 'k' of neighbors, it considers all points within a fixed radius.
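
A minimal scikit-learn sketch (the points and the radius are made up):

    from sklearn.neighbors import RadiusNeighborsClassifier

    X = [[0, 0], [0, 1], [1, 0], [5, 5]]
    y = [0, 0, 0, 1]

    # Every training point within distance 1.5 votes, however many there are
    rnn = RadiusNeighborsClassifier(radius=1.5).fit(X, y)
    print(rnn.predict([[0.5, 0.5]]))  # three class-0 points fall in the radius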

Q27: How can you ensure that the kNN model is robust to outliers?

A27: By choosing a higher 'k' value and potentially combining with distance-weighting, the influence of
outliers can be reduced.

Q28: Why might cosine similarity be more appropriate than Euclidean distance for text data in kNN?

A28: Cosine similarity measures the cosine of the angle between two vectors, making it more suited for
sparse, high-dimensional data like text.
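
A minimal sketch combining TF-IDF features with a cosine-metric kNN in scikit-learn (the four toy documents and their spam/ham labels are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    docs = ["cheap meds now", "meeting moved to noon",
            "win cheap prizes now", "noon meeting agenda"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)  # sparse, high-dimensional matrix

    # The cosine metric requires brute-force neighbor search in scikit-learn
    knn = KNeighborsClassifier(n_neighbors=1, metric="cosine",
                               algorithm="brute").fit(X, labels)
    print(knn.predict(vec.transform(["cheap prizes"])))  # nearest doc is spam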

Q29: How would kNN's performance be affected if features have different units or scales?

A29: Different scales can disproportionately impact distance calculations, making certain features
dominate the decision over others.

Q30: In what situations might ensemble methods improve the performance of kNN?

A30: Ensembles like Bagging can reduce variance and noise sensitivity in kNN, especially with smaller 'k'
values.
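
For instance, bagging a 1-NN model with scikit-learn (the Iris data and hyperparameters are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Ten 1-NN models, each fit on a bootstrap sample, vote together,
    # averaging away some of the variance of a single small-'k' model.
    bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=1),
                            n_estimators=10, random_state=0).fit(X, y)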

Q31: How might the k-d tree data structure become inefficient in very high-dimensional data for kNN
searches?

A31: Due to the curse of dimensionality, the k-d tree partitions become less effective in separating points
in high-dimensional spaces.

Q32: Why is "lazy learning" an appropriate descriptor for kNN?

A32: kNN doesn't build an explicit model during training but "defers" computation until prediction, thus
termed "lazy".

Q33: Can feature engineering improve the performance of kNN? Why or why not?

A33: Yes, creating meaningful features can improve distance calculations, aiding in better classifications or predictions.

Q34: How do distance-weighted kNN and majority voting kNN differ?

A34: Distance-weighted kNN assigns weights based on distances, giving closer neighbors more influence,
while majority voting considers all neighbors equally within the 'k'.

Q35: Why can distance-weighted kNN be computationally more expensive?

A35: It requires the computation of distances and their corresponding weights for aggregation, making it
more computationally intensive.
