
Business Analytics and Data Mining Modeling Using R

1. What type of analytics uses statistical and machine learning techniques?


a. Decision Making
b. Prescriptive
c. Descriptive
d. Predictive
2. Interviewing all members of a given population is called
a. A sample.
b. A Gallup poll.
c. A census.
d. A Nielsen audit.
3. Which function is used to print all variable names in a data frame df in R?
a. names()
NOT SURE

b. names(df)
c. df.names()
d. names("df")
4. Which of the following data is put into a formula to produce commonly accepted results?
a. Raw.
b. Processed.
c. Synchronized.
d. All of the above.
5. What would you use to compare the frequency distributions of more than one set of data?
a. Box plots.
b. Frequency distribution.
c. Frequency polygon.
d. Line graph.
6. Which of the following is considered best for prediction?
a. Linear regression
b. Logistic regression
c. CART
d. Naïve Bayes
7. Which of the following metrics measures the ‘goodness of fit’ of a regression model?
a. Mean absolute deviation
b. Root mean squared error
c. R-squared
d. The total sum of squared errors
8. Which statement is true about prediction problems?
a. The output attribute must be categorical.
b. The output attribute must be numeric.
c. The resultant model is designed to determine future outcomes.
d. The resultant model is designed to classify current behavior.
9. Which of the following assumptions of multiple linear regression can be relaxed in data mining?
a. Noise follows a normal distribution
b. Observations are independent
c. Linear relationship holds true
d. Heteroscedasticity

10. What is dummy coding for categorical variables? Explain with an example.


DUMMY CODING

Dummy coding converts a categorical variable into a series of dichotomous variables (variables that
can take only the value zero or one). ... Any level of the categorical variable can be selected as the
reference level.
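
The question asks for an example, so here is a minimal sketch in plain Python (standard library only). The function name dummy_code, the fuel-type values and the choice of "Petrol" as the reference level are all hypothetical and chosen only for illustration. Each non-reference level gets its own 0/1 column, and the reference level is represented by zeros in every column.

```python
def dummy_code(values, reference=None):
    """Return (column_names, rows) of 0/1 dummy variables.

    The reference level gets no column of its own: it is encoded
    as a row of zeros. If no reference is given, the first level
    in sorted order is used (an arbitrary choice for this sketch).
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    if reference not in levels:
        raise ValueError("reference level not present in the data")
    kept = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


# Hypothetical categorical variable with three levels.
fuel = ["Petrol", "Diesel", "CNG", "Petrol"]
columns, rows = dummy_code(fuel, reference="Petrol")

for value, row in zip(fuel, rows):
    print(value, dict(zip(columns, row)))
```

With three levels, two dummy variables are enough: a row with zeros in both columns can only be the reference level.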
11. What is partitioning? Describe its types.
On a hard disk, partitioning is the process of writing the sectors
that make up the partition table. The partition table holds
information about each partition, including sector size, position
relative to the primary partition, the types of partitions present,
and the operating systems installed. When a partition is created,
it is given a volume name, which allows it to be easily identified.
There are three types of partitions: primary partitions, extended
partitions and logical drives.
-A primary partition is a partition on which you can install an operating system. A
primary partition with an operating system installed on it is used to load the OS
when the computer starts.
-An extended partition is a partition that can be divided into additional logical
drives. Unlike a primary partition, an extended partition does not need a drive
letter or a file system of its own. Instead, the operating system is used to create
logical drives within the extended partition.
-A logical drive is a drive space that is logically created on top of a physical hard
disk drive. A logical drive is a separate partition with its own parameters and
functions, and it operates independently. A logical drive can also be called a
logical drive partition or logical disk partition.
12. What do you mean by Dimension Reduction Techniques? Describe any two Dimension Reduction
Techniques.
Dimensionality reduction refers to techniques that reduce the number of
input variables in a dataset.
A larger number of input features often makes a predictive modeling task more
challenging, a problem generally referred to as the curse of dimensionality.
High-dimensionality statistics and dimensionality reduction techniques are
often used for data visualization. Nevertheless, these techniques can also be used
in applied machine learning to simplify a classification or regression dataset in
order to better fit a predictive model.

Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction technique (algorithm) that
transforms a set of p correlated variables into a smaller number
k (k < p) of uncorrelated variables called principal components
while retaining as much of the variation in the original dataset as
possible. In the context of Machine Learning (ML), PCA is an
unsupervised machine learning algorithm that is used for
dimensionality reduction.
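
To make the idea concrete, here is a minimal sketch in plain Python, assuming made-up data with p = 2 correlated variables that are reduced to k = 1 component. It uses power iteration on the sample covariance matrix, which is only one of several ways to obtain the first principal component and is not necessarily what a library routine does. All function and variable names are hypothetical.

```python
import math


def center_columns(data):
    """Subtract each column's mean so that every variable has mean zero."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in data]


def covariance_matrix(centered):
    """Sample covariance matrix (divides by n - 1) of already centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(p)] for i in range(p)]


def first_component(cov, iterations=100):
    """Power iteration: leading unit eigenvector and eigenvalue of cov.

    Assumes cov is non-zero and that the start vector is not
    orthogonal to the leading eigenvector.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j]
                     for i in range(p) for j in range(p))
    return v, eigenvalue


# Made-up data: the second variable is roughly twice the first.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
centered = center_columns(data)
cov = covariance_matrix(centered)
component, eigenvalue = first_component(cov)

# Share of the total variance retained by the single component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# One score per observation: the data projected onto the component.
scores = [sum(x * c for x, c in zip(row, component)) for row in centered]

print("first principal component:", component)
print("share of variance retained:", explained)
```

Because the two variables move together almost perfectly, a single component retains nearly all of the variation, which is the situation in which reducing p variables to k < p loses little information.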


Factor Analysis (FA)


Factor Analysis (FA) and Principal Component Analysis (PCA) are
both dimensionality reduction techniques. However, the main
objective of Factor Analysis is not just to reduce the dimensionality
of the data. Factor Analysis is a useful approach for finding latent
variables, which are not directly measured in a single variable but
rather inferred from other variables in the dataset. These latent
variables are called factors.


13. Define Performance Metrics. What is the need for performance metrics? What are the types of
Performance Metrics based on the classification matrix?
PERFORMANCE METRICS: -
Performance metrics are measures that a business tracks to determine whether target objectives
and goals are being met. Productivity, profit margin, scope and cost are some examples. A business
has different areas, and each area has its own key performance metrics.
NEED FOR PERFORMANCE METRICS: -
Performance metrics are used to measure the behavior, activities, and performance of a
business. They should take the form of data measured within a defined range, forming a basis
that supports the achievement of overall business goals.

The most commonly used performance metrics for classification problems are as follows:

• Accuracy.
• Confusion Matrix.
• Precision, Recall, and F1 score.
• ROC AUC.
• Log-loss.
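
Several of the metrics listed above follow directly from the four cells of a two-class confusion (classification) matrix. The sketch below, in plain Python, assumes a binary problem with made-up actual and predicted labels where 1 is the positive class. The function names are hypothetical. ROC AUC and log-loss need predicted probabilities rather than hard labels, so they are left out.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators; returning 0.0 is a convention of this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for ten records (1 = positive class, 0 = negative class).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
print("TP, FP, FN, TN:", counts)
print(metrics)
```

Accuracy counts both kinds of correct prediction, while precision and recall look only at the positive class, which is why they can differ sharply from accuracy when the classes are unbalanced.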

14. Define Data Mining. Describe all Phases in a typical Data Mining effort.
15. What do you mean by Datasets? Describe all 4 types of Datasets.

Data Sets Meaning


A data set is an ordered collection of data. A data set can consist of a group of tables,
schemas and other objects. The data are organized according to a certain model that helps
to process the needed information. A data set is any permanently saved collection of
information that usually contains either case-level data, gathered data, or statistical
guidance level data.

Types of Data Sets


In Statistics, we have different types of data sets available for different types of information. They
are:

• Numerical data sets
• Bivariate data sets
• Multivariate data sets
• Categorical data sets
• Correlation data sets
Let us discuss these data sets with examples.

Numerical Data Sets


A numerical data set is a data set in which the data are expressed in numbers rather than natural
language. Numerical data is sometimes called quantitative data, and the set of all the quantitative
data is called the numerical data set. Because numerical data is always in number form, we can
perform arithmetic operations on it.
Example:

• Weight and height of a person
• The count of RBC in a medical report
• Number of pages present in a book

Bivariate Data Sets


A data set that has two variables is called a bivariate data set. It deals with the relationship
between the two variables. A bivariate data set usually contains two types of related data.
Example:

1. The percentage score and age of the students in a class. Score and age can be
considered as two variables.
2. The sales of ice cream versus the temperature on that day. Here the two variables are
ice cream sales and temperature.

(Note: If you have only one set of data, say temperature, then it is called a univariate
data set.)
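
One common way to summarize the relationship between the two variables of a bivariate data set is the Pearson correlation coefficient. The sketch below, in plain Python, uses made-up temperature and ice cream sales figures purely for illustration, and the function name pearson_r is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither variable is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up bivariate data: one pair of values per day.
temperature_c = [20, 24, 28, 32, 36]
ice_cream_sales = [110, 150, 185, 230, 270]

r = pearson_r(temperature_c, ice_cream_sales)
print("correlation between temperature and sales:", r)
```

A value of r close to +1 indicates that the two variables rise together, a value close to -1 that one falls as the other rises, and a value near 0 that there is little linear relationship.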

Multivariate Data Sets


A multivariate data set is a data set with multiple variables. When a data set contains three or
more variables, it is called a multivariate data set. In other words, a multivariate data set
consists of individual measurements that are acquired as a function of three or more variables.
Example: If we have to measure the length, width, height and volume of a rectangular box, we have to
use multiple variables to distinguish between those entities.

Categorical Data Sets


Categorical data sets represent features or characteristics of a person or an object. A
categorical data set consists of categorical variables, also called qualitative variables. A
categorical variable that can take exactly two values is termed a dichotomous variable, while
categorical variables with more than two possible values are called polytomous variables.
Qualitative/categorical variables are often assumed to be polytomous unless otherwise specified.
Example:

• A person's gender (male or female)
• Marital status (married/unmarried)
