Joy Jeet

1. Handling missing values is a common problem in
data mining. Missing data can bias the results of
machine learning models and/or reduce their accuracy.
There are different ways of handling missing values, including
deleting the records with missing values, imputing the missing values, and
using "missingness" itself as a feature. Imputation methods include
mean imputation, median imputation, mode imputation, and
regression imputation, among others.
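As a minimal sketch of two of the strategies above — mean imputation and "missingness as a feature" — using pandas and scikit-learn (the column names and values here are made up for illustration):

```python
# Mean imputation and a missingness indicator, on a tiny made-up DataFrame.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, None, 47.0, 31.0],
    "income": [50000.0, 62000.0, None, 58000.0],
})

# Mean imputation: replace each missing value with its column mean.
mean_imputer = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

# "Missingness as a feature": record where a value was missing
# before imputing, so the model can use that signal.
df["age_missing"] = df["age"].isna().astype(int)

print(df_mean)
print(df[["age", "age_missing"]])
```

Median and mode imputation work the same way with `strategy="median"` or `strategy="most_frequent"`.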

2. Decision tree algorithms are widely used in a variety
of machine learning applications, including:
 Fraud detection: Decision tree algorithms can be
used to identify fraudulent transactions and other
types of anomalous behavior.
 Risk assessment: Decision tree algorithms can be
used to assess the risk of different events, such as
loan defaults or customer churn.
 Medical diagnosis: Decision tree algorithms can be
used to help doctors diagnose diseases and other
medical conditions.
 Marketing: Decision tree algorithms can be used to
segment customers and target them with
personalized marketing campaigns.
 Decision trees are a type of machine-learning
algorithm that can be used for both classification and
regression tasks. The algorithm works by
recursively splitting the data into smaller and smaller
subsets based on the feature values. At each node,
the algorithm chooses the feature that best splits the
data into groups with different target values.
 Hyperparameter tuning helps to find the optimal values
for the hyperparameters of the model, such as the
learning rate, the number of hidden layers, etc., which
can improve the performance of the model.
 Regularization, on the other hand, is a technique used to
prevent the model from overfitting to the training data.
Regularization techniques such as weight decay
(L2 regularization) or dropout can be used for this purpose.
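The two ideas above — fitting a decision tree and tuning its hyperparameters — can be sketched with scikit-learn. The iris dataset and the particular parameter grid are illustrative choices, not something the text specifies:

```python
# Decision tree classification with a small hyperparameter grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Tune two tree hyperparameters; limiting max_depth and raising
# min_samples_leaf also acts as regularization against overfitting.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]},
    cv=5,
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```

Each split in the fitted tree is the feature/threshold pair that best separates the target classes at that node, as described above.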

3.
 A data warehouse is a system that collects, integrates, and
stores data from various sources for reporting and analysis
purposes. It is designed to support decision-making and
business intelligence by providing a historical and
comprehensive view of the data. A data warehouse is different
from a regular database, which is mainly used for
transactional processing and daily operations.
The main benefits of a data warehouse are:
 It enables better business analytics and insights by allowing
users to query and explore the data in various ways.
 It improves the performance and speed of queries by storing
the data in an optimized and organized manner.
 It enhances the reliability and accuracy of the data by applying
data cleaning and integration techniques.
 It provides a historical perspective of the data by preserving
the data over long periods of time.
The main characteristics of a data warehouse are:
 It is subject-oriented, meaning it focuses on a specific topic
or domain, such as sales, marketing, or customers, rather than
the entire organization's operations.
 It is integrated, meaning it combines data from different
sources and formats into a consistent and unified schema.
 It is time-variant, meaning it keeps historical data as well as
current data, and allows for analysis of data over time.
 It is non-volatile, meaning it does not change or delete data
once it is stored, but only adds new data.

4. In the top-down approach, the data warehouse is designed
first, followed by the creation of data marts. The essential
components of this approach include external sources, a staging
area, the data warehouse, data marts, and data mining. The
advantages of this approach include improved data
consistency, easier maintenance, better scalability, and a
consistent dimensional view across data marts.
In contrast,
the bottom-up approach involves creating data marts first,
followed by the creation of the data warehouse. The essential
components of this approach include source systems, a staging
area, data marts, and the data warehouse. The advantages of
this approach include faster development, better user
acceptance, and better alignment with business processes.

5. Comparing two objects with one nominal
attribute means comparing the values of this
attribute. In that case, similarity is traditionally
defined as 1 if the attribute values match and
as 0 otherwise. A dissimilarity is defined
in the opposite way: 0 if the attribute values
match, and 1 if they do not.
Example: Suppose we have two fruits, an apple and a banana. If the
color of the apple is red and the color of the banana is yellow,
the values do not match, so the dissimilarity for this attribute is 1
(and the similarity is 0). With several nominal attributes, the
overall dissimilarity is the fraction of attributes that do not
match: if the two fruits differ on one of two attributes, the
dissimilarity would be 1/2 or 0.5.
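The mismatch-fraction idea above can be sketched as a short function (the fruit attributes are made up for illustration):

```python
# Nominal-attribute dissimilarity: the fraction of attribute
# positions where the two objects' values differ.
def nominal_dissimilarity(a, b):
    """Fraction of positions where the two attribute tuples differ."""
    assert len(a) == len(b)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

# Single attribute (color): red vs. yellow do not match -> 1.0.
print(nominal_dissimilarity(("red",), ("yellow",)))  # 1.0

# Two attributes, only one differing -> 1/2 = 0.5.
print(nominal_dissimilarity(("red", "round"), ("yellow", "round")))  # 0.5
```

Similarity is simply one minus this value.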

6. Min-max scaling is a technique that transforms numerical
values to a range between 0 and 1. It is useful for bringing
features with different units and ranges onto a common scale
(note, however, that it is sensitive to outliers, since they
determine the minimum and maximum). The
formula for min-max scaling is: scaled value = (current value -
minimum value) / (maximum value - minimum value). For
example, if a feature has values ranging from 1 to 10, and the
current value is 5, the scaled value would be (5 - 1) / (10 - 1) =
4/9, or about 0.44.

Min-Max normalization is a data normalization technique used
in data mining and machine learning, which transforms the
original data linearly to scale it into a specific range, usually
0 to 1.
The formula for Min-Max normalization is:
x' = (x - min) / (max - min), where
 x is an original value,
 x' is the normalized value,
 min is the smallest value in the feature column,
 max is the largest value in the feature column.

This method is useful for data with different scales and ranges,
as it helps to bring all values into a consistent range. It is
often used as preprocessing for classification algorithms. However,
it is important to note that normalization should be applied only
to the input features, not the target variable.
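The formula above can be sketched for a single feature column (the values are made up for illustration):

```python
# Min-max scaling of one feature column: x' = (x - min) / (max - min).
def min_max_scale(values):
    """Scale a list of numbers linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

feature = [1, 5, 10]
scaled = min_max_scale(feature)
print(scaled)  # 5 maps to (5 - 1) / (10 - 1) = 4/9, about 0.44
```

The minimum always maps to 0 and the maximum to 1, which is why a single extreme outlier compresses all the other scaled values toward one end.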

7. An Enterprise Data Warehouse (EDW) is a centralized
repository that stores and manages all the historical business
data of an enterprise. The information usually comes from
different systems like Enterprise Resource Planning (ERP)
systems, Customer Relationship Management (CRM)
platforms, finance applications, Internet of Things (IoT)
devices, and mobile and online systems.

A data mart is a subset of a data warehouse that is focused
on a single subject or line of business. A data mart is
designed to make specific data available to a defined group of
users, such as a business unit or a department. A data mart
draws data from fewer sources than a data warehouse, which
can include internal operational systems, a central data
warehouse, and external data.

Virtual warehouses are compute clusters that power modern
data warehouses, acting as an on-demand resource. They are
independent compute resources that can be leveraged at any time for
SQL execution and DML (Data Manipulation Language) operations and
then turned off when they are not needed. Virtual warehouses are
different from other data warehouse models such as data marts and
enterprise data warehouses.

Data warehouse modeling is the process of designing the
schemas of the detailed and summarized information of the data
warehouse. The goal of data warehouse modeling is to develop a
schema describing the reality, or at least a part of it, which the data
warehouse is needed to support. Data warehouse modeling is an
essential stage of building a data warehouse for two main reasons.
Firstly, through the schema, data warehouse clients can visualize the
relationships among the warehouse data and so use them with greater
ease. Secondly, a well-designed schema allows an effective data
warehouse structure to emerge, helping to decrease the cost of
implementing the warehouse and improve the efficiency of using it.
