
Q1) Compare Bagging and Boosting of a classifier.

1. Definition: Bagging is the simplest way of combining predictions that belong to the same type; Boosting combines predictions that belong to different types.
2. Aim: Bagging aims to decrease variance, not bias; Boosting aims to decrease bias, not variance.
3. Weight: In Bagging, each model receives equal weight; in Boosting, models are weighted according to their performance.
4. When to use: If the classifier is unstable (high variance), apply Bagging; if the classifier is stable and simple (high bias), apply Boosting.
5. Algorithm: Bagging is a parallel ensemble method; Boosting is a sequential ensemble method.
6. Base learners: In Bagging, base learners are trained independently; in Boosting, they are trained sequentially.
7. Error reduction: Bagging reduces variance; Boosting reduces bias and variance.
8. Diversity: Bagging uses bootstrap sampling for diversity; Boosting adjusts sample weights to focus on hard samples.
9. Performance: Bagging is often more accurate on average; Boosting can achieve higher accuracy but is prone to overfitting.
10. Training speed: Bagging is faster due to parallelization; Boosting is slower due to its sequential nature.
11. Robustness: Bagging is less sensitive to noisy data; Boosting is more sensitive to noisy data.
12. Example algorithms: Bagging: Random Forest; Boosting: AdaBoost, Gradient Boosting Machines.
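As a quick illustration, the two approaches can be compared with scikit-learn's BaggingClassifier and AdaBoostClassifier; the sketch below (assuming scikit-learn is installed; the synthetic dataset and hyperparameters are arbitrary) trains both on the same data:

```python
# Minimal sketch comparing bagging and boosting with scikit-learn
# (synthetic data and settings are illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: independent trees trained in parallel on bootstrap samples.
bagging = BaggingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Boosting: weak learners trained sequentially, re-weighting hard samples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

print("Bagging accuracy:", bagging.score(X_test, y_test))
print("Boosting accuracy:", boosting.score(X_test, y_test))
```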
Q2) Compare star schema, snowflake schema, and star constellation schema.
• Structure: Star: single central fact table surrounded by dimension tables. Snowflake: similar to star schema, but dimension tables are normalized into multiple related tables. Star constellation: multiple fact tables connected by shared dimension tables.
• Normalization: Star: less normalized, dimension tables are denormalized. Snowflake: more normalized, dimension tables are normalized. Constellation: less normalized than snowflake, but more than star.
• Complexity: Star: simple and easy to understand. Snowflake: more complex due to normalized dimension tables. Constellation: moderate complexity due to multiple fact tables.
• Performance: Star: typically offers better query performance due to denormalization. Snowflake: may suffer from increased join operations impacting performance. Constellation: performance varies based on query and schema design.
• Data integrity: Star: simplified data integrity management. Snowflake: improved data integrity management due to normalization. Constellation: data integrity management can be complex with multiple fact tables.
• Scalability: Star: good scalability for simpler data models. Snowflake: scalability may be affected by increased joins. Constellation: scalability depends on schema design and query workload.
• Maintenance: Star: easier to maintain due to simpler structure. Snowflake: more complex to maintain due to normalization. Constellation: requires careful maintenance due to multiple fact tables.
• Storage: Star: requires more storage space due to denormalization. Snowflake: may require less storage space due to normalization. Constellation: storage requirements vary based on schema design.
• Use cases: Star: suitable for simpler data warehousing needs. Snowflake: suitable for complex data warehousing needs. Constellation: suitable for scenarios with multiple related facts.
• Example applications: Star: sales reporting, customer analytics. Snowflake: inventory management, financial analysis. Constellation: healthcare analytics, retail sales analysis.
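To make the star-schema layout concrete, here is a small, hypothetical pandas sketch (table and column names are invented) with one fact table joined to two denormalized dimension tables:

```python
# Hypothetical star schema in pandas: one fact table, two dimension tables.
import pandas as pd

# Dimension tables (denormalized, as in a star schema).
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "product_name": ["Pen", "Book"],
                            "category": ["Stationery", "Stationery"]})
dim_store = pd.DataFrame({"store_id": [10, 20], "city": ["Pune", "Mumbai"]})

# Central fact table referencing the dimensions by key.
fact_sales = pd.DataFrame({"product_id": [1, 2, 1],
                           "store_id": [10, 10, 20],
                           "units_sold": [5, 2, 7]})

# A typical star-schema query: join the fact table to its dimensions and aggregate.
report = (fact_sales.merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["category", "city"])["units_sold"].sum())
print(report)
```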

Q3) Explain multilevel and multidimensional association rules with suitable examples.


Multilevel Association Rules

Multilevel association rules extend traditional association rule mining to consider multiple levels of
abstraction in the data.
• Uniform support and reduced support are two techniques used in multilevel association rule mining to
discover patterns efficiently.
Uniform Support:
• Uniform support ensures that the support threshold remains consistent across all levels of abstraction.
Example:
Suppose we have a dataset containing sales transactions of a retail store with the following hierarchy: {Region,
Store, Department, Product}.

- Uniform Support: 0.1 (10%)


- Minimum support threshold is set at 10% for all levels of the hierarchy.

At each level, we identify frequent itemsets and generate association rules with support greater than or equal
to 10%.
- Region Level: Frequent itemsets for regions with support >= 10%
- Store Level: Frequent itemsets for stores within each region with support >= 10%
- Department Level: Frequent itemsets for departments within each store with support >= 10%
- Product Level: Frequent itemsets for products within each department with support >= 10%

Reduced Support:
• Reduced support involves decreasing the minimum support threshold as we move down the levels of
abstraction in the hierarchy.
• This approach acknowledges that lower levels of the hierarchy may have fewer transactions and,
therefore, require lower support thresholds to mine meaningful association rules.

Example:
Continuing with the retail dataset example:
- Reduced Support:
- Region Level: 0.2 (20%)
- Store Level: 0.15 (15%)
- Department Level: 0.1 (10%)
- Product Level: 0.05 (5%)
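Continuing the retail example, here is a minimal pure-Python sketch (transactions, item names, and thresholds are invented toy values) showing how an itemset's support is checked against a different threshold at each level of the hierarchy:

```python
# Toy sketch of reduced support across hierarchy levels (all values invented).
transactions = [
    {"Electronics", "Laptop"},
    {"Electronics", "Phone"},
    {"Electronics", "Laptop"},
    {"Clothing", "Shirt"},
]
# Reduced support: the threshold shrinks as we move down the hierarchy.
min_support = {"department": 0.10, "product": 0.05}

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Department level: checked against the higher threshold.
print(support({"Electronics"}) >= min_support["department"])         # True
# Product level: checked against the lower threshold.
print(support({"Electronics", "Laptop"}) >= min_support["product"])  # True
```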

Multidimensional Association Rules:


Association rule mining helps us find relationships in large datasets.
In multidimensional association:
• A multidimensional association rule involves more than one dimension or predicate.
• Numeric attributes should be discretized.
• Attributes can be categorical or quantitative.
• Quantitative attributes are numeric and have an implicit ordering among values (often with a concept hierarchy).
1. Using static discretization of quantitative attributes:
• Discretization is static and happens before mining.
• Discretized attributes are treated as categorical.
• The Apriori algorithm is used to find all frequent k-predicate sets (this requires k or k+1 table scans).
• Every subset of a frequent predicate set must also be frequent.
Example:
If, in a data cube, the 3-D cuboid (age, income, buys) is frequent, this implies that (age, income), (age, buys), and (income, buys) are also frequent.

2. Using dynamic discretization of quantitative attributes:

• This approach is known as mining quantitative association rules.
• Numeric attributes are dynamically discretized during the mining process.
Example:
age(X, "20..25") ∧ income(X, "30K..41K") ⇒ buys(X, "laptop computer")
(The discretized (age, income) pairs can be visualized as a 2-D grid of tuples.)
3. Distance-based discretization with clustering:
This is a dynamic discretization process that considers the distance between data points of interest.

There are two steps involved in the mining process.

• Intervals of the attributes involved are found by performing clustering.


• Association rules are acquired by searching for groups of clusters that occur together.
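As a rough illustration of the static-discretization idea, here is a small, hypothetical pandas sketch (the data, bin edges, and column names are invented) that discretizes age and income and then measures the support and confidence of the rule above:

```python
# Hypothetical sketch of a multidimensional (quantitative) rule check:
# age(X, "20..25") AND income(X, "30K..41K") => buys(X, "laptop").
import pandas as pd

df = pd.DataFrame({
    "age": [22, 24, 31, 23, 40],
    "income": [32000, 40000, 55000, 35000, 60000],
    "buys_laptop": [True, True, False, True, False],
})
# Static discretization of the quantitative attributes.
df["age_bin"] = pd.cut(df["age"], bins=[19, 25, 35, 45],
                       labels=["20..25", "26..35", "36..45"])
df["income_bin"] = pd.cut(df["income"], bins=[29000, 41000, 70000],
                          labels=["30K..41K", "42K..70K"])

antecedent = (df["age_bin"] == "20..25") & (df["income_bin"] == "30K..41K")
support = (antecedent & df["buys_laptop"]).mean()
confidence = (antecedent & df["buys_laptop"]).sum() / antecedent.sum()
print(f"support={support:.2f}, confidence={confidence:.2f}")
```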
Q4) Define data mining. Explain the KDD process with the help of a suitable diagram.
Data mining
• Data mining is the process of sorting through large data sets to identify patterns and relationships that
can help solve business problems through data analysis.

KDD process:
• KDD stands for Knowledge Discovery in Databases.
• It is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets.

• Data Cleaning − In this step, noise and inconsistent data are removed.
Data cleaning techniques include:
• Data discrepancy detection
• Data transformation tools
• Data Integration − In this step, multiple data sources are combined.
Data integration techniques include:
• Data migration tools
• Data synchronization tools
• ETL (Extract, Transform, Load) tools
• Data Selection − In this step, the data relevant to the analysis task is selected.
Techniques used at this stage include:
• Neural networks
• Decision trees
• Naive Bayes
• Clustering
• Regression
• Data Transformation − In this step, data is converted into a format or structure suitable for mining.
Data transformation techniques include:
• Data mapping
• Code generation
• Data Mining − In this step, intelligent methods (mining algorithms) are applied in order to extract data patterns.
• Pattern Evaluation − In this step, the discovered patterns are evaluated.
Pattern evaluation techniques include:
• Summarization
• Visualization
• Knowledge Presentation − In this step, the mined knowledge is represented to the user.
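The steps above can be strung together on toy data; the following is a minimal, hypothetical pandas sketch (all table and column names are invented), not a real KDD implementation:

```python
# Toy sketch of the KDD steps on invented data (pandas only).
import pandas as pd

# Two sources to integrate.
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [100, None, 250, 80]})
customers = pd.DataFrame({"cust_id": [1, 2, 3], "segment": ["A", "B", "A"]})

# 1) Data cleaning: drop noisy/missing values.
sales = sales.dropna(subset=["amount"])
# 2) Data integration: combine the two sources.
data = sales.merge(customers, on="cust_id")
# 3) Data selection: keep only the attributes relevant to the analysis.
data = data[["segment", "amount"]].copy()
# 4) Data transformation: e.g. min-max normalize the numeric attribute.
data["amount_norm"] = (data["amount"] - data["amount"].min()) / (
    data["amount"].max() - data["amount"].min())
# 5) Data mining + pattern evaluation: a simple summarization per segment.
print(data.groupby("segment")["amount"].mean())
```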

Q5) Define data warehousing. Explain data warehouse architecture with the help of a diagram.
• A data warehouse is a heterogeneous collection of different data sources organised under a unified schema.
• There are two approaches for constructing a data warehouse:
o Top-down approach
o Bottom-up approach
1. Top-down approach:

1. External Sources –
• An external source is any source from which data is collected, irrespective of the type of data.
• Data can be structured, semi-structured, or unstructured.
2. Staging Area –
Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse.
For this purpose, an ETL tool is recommended (a small ETL sketch in code is given at the end of this answer).
• E (Extract): data is extracted from the external data sources.
• T (Transform): data is transformed into the standard format.
• L (Load): data is loaded into the data warehouse after being transformed into the standard format.
3. Data Warehouse –
• After cleansing, the data is stored in the data warehouse as a central repository.
• The warehouse itself stores the metadata, while the actual data is stored in the data marts.

4. Data Marts –
• A data mart is also part of the storage component.
• It stores the information of a particular function of an organisation that is handled by a single authority.
• There can be as many data marts in an organisation as there are functions.
• We can also say that a data mart contains a subset of the data stored in the data warehouse.

5. Data Mining –
• The practice of analysing the big data present in the data warehouse is data mining.
• It is used to find the hidden patterns present in the database or data warehouse with the help of data mining algorithms.

Bottom-up approach:

1. First, the data is extracted from external sources (same as in the top-down approach).
2. Then, the data goes through the staging area (as explained above) and is loaded into data marts instead of the data warehouse.
3. The data marts are created first and provide reporting capability; each data mart addresses a single business area.
4. These data marts are then integrated into the data warehouse.
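The ETL step used in the staging area can be sketched as follows; this is a minimal, hypothetical example in pandas (file paths, formats, and validation rules are placeholders, not a real ETL tool):

```python
# Minimal, hypothetical ETL sketch for the staging area (all paths are placeholders).
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """E (Extract): pull raw data from an external source, here a CSV file."""
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """T (Transform): validate the data and convert it into the standard format."""
    clean = raw.dropna()                                          # drop incomplete records
    clean.columns = [c.strip().lower() for c in clean.columns]    # standardize column names
    return clean

def load(df: pd.DataFrame, path: str) -> None:
    """L (Load): write the standardized data into warehouse storage (a file here)."""
    df.to_csv(path, index=False)

# Usage (illustrative only):
# load(transform(extract("external_source.csv")), "warehouse/sales.csv")
```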
Q5) What is an outlier? List the types of outliers. Describe the methods used for outlier analysis.
Outlier: An outlier is an individual data point that lies distant from the other points in the dataset.
Types of outliers:
• Global outliers: deviate significantly from the entire dataset.
• Contextual outliers: are only considered outliers in a specific context.
• Collective outliers: a group of data points that together form an outlier.


Methods used for outlier analysis:

1. Supervised Methods –
• Supervised methods model both data normality and abnormality.
• Domain experts examine and label a sample of the underlying data.
• Outlier detection can then be modeled as a classification problem.
• The task is to learn a classifier that can recognize outliers.
• The labeled sample is used for training and testing.

2. Unsupervised Methods –
• In many applications, objects labeled as "normal" or "outlier" are not available.
• Therefore, an unsupervised learning approach has to be used.
• An unsupervised outlier detection method assumes that normal objects follow patterns far more frequently than outliers do.
• Normal objects do not have to fall into a single group sharing high similarity; instead, they can form several groups, where each group has its own features.
• Collective outliers, however, share high similarity within a small area.
• Unsupervised methods cannot identify such outliers efficiently.

3. Semi-Supervised Methods –
• In several applications, although obtaining some labeled instances is possible, the number of such labeled instances is small.
• We may encounter cases where only a small set of the normal and outlier objects is labeled, while the rest of the data is unlabeled.
• Semi-supervised outlier detection methods were developed to tackle such cases.
• Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning approaches.
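As a simple illustration of the unsupervised setting, here is a small z-score sketch on invented data (the threshold of 2 standard deviations is a common convention, not a fixed rule):

```python
# Simple sketch of unsupervised outlier detection using z-scores (toy data).
import numpy as np

data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])  # 25.0 is a global outlier
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]  # flag points far from the mean
print("Detected outliers:", outliers)
```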

Q6) Explain the concept of information gain and Gini value used in the decision tree algorithm.
1. Decision Tree Algorithm:
• It's a method used in machine learning for
classification and regression tasks.
• It makes decisions based on splitting the data into
smaller subsets using attributes/features.
2. Information Gain:
• It measures how much a particular feature contributes to reducing uncertainty about the outcome.
• In other words, it calculates how much information a feature gives us about the class/target variable.
• Information gain = entropy of the parent node minus the weighted average entropy of the child nodes created by the split.
• Higher information gain means the feature is more useful for splitting the data.

3. Gini Value:
• It's a measure of impurity or uncertainty in a set of data: Gini(S) = 1 − Σ p_i², where p_i is the proportion of class i in S.
• In the context of decision trees, it helps to find the best split by measuring how mixed the classes are in the subsets created by splitting on a particular feature.
• Lower Gini values indicate purer subsets and therefore better splits.

4. Step-by-step:
- Start: Begin with all the data in one node.
- Calculate Gini or Entropy: Measure the impurity of this node using Gini value or entropy.
- Split: Iterate through each feature and calculate the information gain or decrease in Gini value if we split on
that feature.
- Choose Best Split: Select the feature that gives the highest information gain or the largest decrease in Gini impurity.
- Split Data: Partition the data based on the chosen feature.
- Repeat: Recursively repeat the process for each partition until certain stopping criteria are met (like
maximum tree depth or minimum number of samples in a node).
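The two measures can be computed directly; the following is a small pure-Python sketch (toy class labels, not tied to any particular dataset):

```python
# Sketch of the impurity measures behind decision-tree splits.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(S) = 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfect split on some feature
print(information_gain(parent, split))                # 1.0: maximal gain
print(gini(parent), [gini(ch) for ch in split])       # 0.5 vs pure children
```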

Q7) Write and explain the Bayes classification algorithm.
