Feature | Bagging | Boosting
Aim | Aims to decrease variance, not bias | Aims to decrease bias, not variance
When to Apply | If the classifier is unstable (high variance), apply bagging | If the classifier is stable and simple (high bias), apply boosting
Base Learners | Trained independently | Trained sequentially
Error Reduction | Reduces variance | Reduces bias and variance
Training Speed | Faster due to parallelization | Slower due to sequential nature
Example Algorithms | Random Forest | AdaBoost, Gradient Boosting Machines
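A minimal scikit-learn sketch contrasting the two (the dataset and hyperparameters are illustrative; assumes scikit-learn >= 1.2 for the estimator keyword):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Bagging: deep trees (low bias, high variance) trained independently, in parallel
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                n_estimators=50, n_jobs=-1, random_state=0)

    # Boosting: shallow stumps (high bias) trained sequentially, each correcting the last
    boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                  n_estimators=50, random_state=0)

    print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
    print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())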
Q2) Compare star schema, snowflake schema, and star constellation schema.
Feature | Star Schema | Snowflake Schema | Star Constellation Schema
Structure | Single central fact table surrounded by dimension tables | Similar to star schema, but dimension tables are normalized into multiple related tables | Multiple fact tables connected by dimension tables
Normalization | Less normalized; dimension tables are denormalized | More normalized; dimension tables are normalized | Less normalized than snowflake, but more than star
Complexity | Simple and easy to understand | More complex due to normalized dimension tables | Moderate complexity due to multiple fact tables
Performance | Typically offers better query performance due to denormalization | May suffer from increased join operations impacting performance | Performance varies based on query and schema design
Data Integrity | Simplified data integrity management | Improved data integrity management due to normalization | Data integrity management can be complex with multiple fact tables
Scalability | Good scalability for simpler data models | Scalability may be affected by increased joins | Scalability depends on schema design and query workload
Maintenance | Easier to maintain due to simpler structure | More complex to maintain due to normalization | Requires careful maintenance due to multiple fact tables
Storage | Requires more storage space due to denormalization | May require less storage space due to normalization | Storage requirements vary based on schema design
Use Cases | Suitable for simpler data warehousing needs | Suitable for complex data warehousing needs | Suitable for scenarios with multiple related facts
Example Applications | Sales reporting, customer analytics | Inventory management, financial analysis | Healthcare analytics, retail sales analysis
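A small pandas sketch of the star-schema idea, with a central fact table joined to denormalized dimension tables (all table names, columns, and values are hypothetical):

    import pandas as pd

    # Dimension tables (denormalized, as in a star schema)
    dim_product = pd.DataFrame({"product_id": [1, 2],
                                "product": ["Milk", "Bread"],
                                "category": ["Dairy", "Bakery"]})
    dim_store = pd.DataFrame({"store_id": [10, 20],
                              "store": ["S1", "S2"],
                              "region": ["North", "South"]})

    # Central fact table referencing the dimensions by key
    fact_sales = pd.DataFrame({"product_id": [1, 1, 2],
                               "store_id": [10, 20, 10],
                               "amount": [3.5, 3.5, 2.0]})

    # A typical star-schema query: join facts to dimensions, then aggregate
    report = (fact_sales.merge(dim_product, on="product_id")
                        .merge(dim_store, on="store_id")
                        .groupby(["region", "category"])["amount"].sum())
    print(report)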
Uniform Support:
• Uniform support uses the same minimum support threshold at every level of abstraction in the hierarchy.
At each level, we identify frequent itemsets and generate association rules with support greater than or equal to 10%.
- Region Level: Frequent itemsets for regions with support >= 10%
- Store Level: Frequent itemsets for stores within each region with support >= 10%
- Department Level: Frequent itemsets for departments within each store with support >= 10%
- Product Level: Frequent itemsets for products within each department with support >= 10%
Reduced Support:
• Reduced support involves decreasing the minimum support threshold as we move down the levels of
abstraction in the hierarchy.
• This approach acknowledges that lower levels of the hierarchy may have fewer transactions and,
therefore, require lower support thresholds to mine meaningful association rules.
Example:
Continuing with the retail dataset example:
- Reduced Support:
- Region Level: 0.2 (20%)
- Store Level: 0.15 (15%)
- Department Level: 0.1 (10%)
- Product Level: 0.05 (5%)
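A small pure-Python sketch of reduced-support mining over such a hierarchy (the transactions and the Region/Store/Department/Product path encoding are hypothetical):

    from collections import Counter
    from itertools import combinations

    # Each item is encoded as a "Region/Store/Department/Product" path
    transactions = [
        {"North/S1/Grocery/Milk", "North/S1/Grocery/Bread"},
        {"North/S2/Grocery/Milk"},
        {"South/S3/Electronics/TV", "South/S3/Grocery/Bread"},
    ]

    def frequent_itemsets(transactions, min_sup, max_len=2):
        # Return itemsets whose support (fraction of transactions) >= min_sup
        n = len(transactions)
        counts = Counter()
        for t in transactions:
            for k in range(1, max_len + 1):
                for combo in combinations(sorted(t), k):
                    counts[combo] += 1
        return {iset: c / n for iset, c in counts.items() if c / n >= min_sup}

    # Reduced support: the threshold drops at each lower level of abstraction
    levels = {"Region": 0.20, "Store": 0.15, "Department": 0.10, "Product": 0.05}
    for depth, (level, min_sup) in enumerate(levels.items(), start=1):
        # Roll items up to the current level by truncating their paths
        rolled = [{"/".join(i.split("/")[:depth]) for i in t} for t in transactions]
        print(level, frequent_itemsets(rolled, min_sup))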
KDD process:
• KDD stands for Knowledge Discovery in Databases.
• It is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets.
• Data Cleaning − In this step, noise and inconsistent data are removed.
There are different data cleaning techniques:
• Data discrepancy detection
• Data transformation tools
• Data Integration − In this step, multiple data sources are combined.
There are different data integration techniques:
• Data migration tools
• Data synchronization tools
• ETL (Extract, Transform, Load)
• Data Selection − In this step, the data relevant to the analysis is chosen.
There are different data selection techniques:
• Neural networks
• Decision trees
• Naive Bayes
• Clustering
• Regression
• Data Transformation − In this step, data is converted into a format or structure suitable for mining.
There are different data transformation techniques:
• Data mapping
• Code generation
• Data Mining − In this step, intelligent methods (algorithms) are applied in order to extract data patterns.
• Pattern Evaluation − In this step, the extracted patterns are evaluated.
There are different pattern evaluation techniques:
• Summarization
• Visualization
• Knowledge Presentation − In this step, the discovered knowledge is represented to the user.
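A compact end-to-end sketch of these steps in Python (the file names, columns, and the choice of k-means clustering are illustrative assumptions, not part of the KDD definition):

    import pandas as pd
    from sklearn.cluster import KMeans

    # Data Cleaning + Integration: load two hypothetical sources, combine, de-noise
    sales = pd.read_csv("sales.csv")
    customers = pd.read_csv("customers.csv")
    df = sales.merge(customers, on="customer_id").dropna().drop_duplicates()

    # Data Selection: keep only the attributes relevant to the analysis
    X = df[["age", "annual_spend"]]

    # Data Transformation: standardize to a common scale
    X = (X - X.mean()) / X.std()

    # Data Mining: apply an algorithm to extract patterns (clustering here)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Pattern Evaluation / Knowledge Presentation: summarize the result
    print(pd.Series(labels).value_counts())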
Q5) Define data warehousing. Explain data warehouse architecture with the help of a diagram.
• A data warehouse is a collection of data from heterogeneous sources, organised under a unified schema.
• There are 2 approaches for constructing a data warehouse:
o Top-down approach
o Bottom-up approach
1. Top-down approach:
1. External Sources –
• An external source is a source from which data is collected, irrespective of the type of data.
• Data can be structured, semi-structured, or unstructured.
2. Staging Area –
Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse.
For this purpose, it is recommended to use an ETL tool (a minimal sketch appears after this list).
• E (Extract): Data is extracted from the external data source.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into the data warehouse after being transformed into the standard format.
3. Data Warehouse –
• After cleansing, the data is stored in the data warehouse as the central repository.
• It actually stores the metadata, while the actual data is stored in the data marts.
4. Data Marts –
• A data mart is also a part of the storage component.
• It stores the information of a particular function of an organisation, handled by a single authority.
• There can be as many data marts in an organisation as there are functions.
• We can also say that a data mart contains a subset of the data stored in the data warehouse.
5. Data Mining –
• The practice of analysing the big data present in the data warehouse is data mining.
• It is used to find hidden patterns present in the database or in the data warehouse with the help of data mining algorithms.
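A minimal ETL sketch for the staging step above, in Python (SQLite stands in for the warehouse; the source file and column names are hypothetical):

    import sqlite3
    import pandas as pd

    # Extract: pull raw data from an external source
    raw = pd.read_csv("external_sales.csv")

    # Transform: validate and convert into the warehouse's standard format
    raw["order_date"] = pd.to_datetime(raw["order_date"])
    raw["amount"] = raw["amount"].astype(float)
    clean = raw.dropna(subset=["order_id"]).drop_duplicates(subset="order_id")

    # Load: append the standardized rows into the central repository
    with sqlite3.connect("warehouse.db") as conn:
        clean.to_sql("fact_sales", conn, if_exists="append", index=False)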
Bottom-up approach:
1. First, the data is extracted from external sources (same as in the top-down approach).
2. Then, the data goes through the staging area (as explained above) and is loaded into data marts instead of the data warehouse.
3. The data marts are created first and provide reporting capability; each addresses a single business area.
4. These data marts are then integrated into the data warehouse.
Q5) What is an outlier? List the types of outliers. Describe methods used for outlier analysis.
Outlier: An outlier is an individual data point that is distant from the other points in the dataset.
Types of outliers:
1. Global Outliers: deviate significantly from the entire dataset.
2. Contextual Outliers: deviate significantly only within a specific context (e.g., a particular time or location).
3. Collective Outliers: a subset of points that deviates significantly as a group, even if the individual points do not.
Methods used for outlier analysis:
1. Supervised Methods –
• Outlier detection is modeled as a classification problem: a model is trained on data labeled as "normal" or "outlier".
2. Unsupervised Methods –
• No labels are available; these methods assume that normal objects are far more frequent than outliers and flag objects that deviate from the bulk of the data (e.g., clustering-based or proximity-based detection); see the sketch after this list.
3. Semi-Supervised Methods –
• In several applications, although obtaining some labeled instances is possible, the number of such labeled instances is small.
• We can encounter cases where only a small group of the normal and outlier objects are labeled, while the rest of the data are unlabeled.
• Semi-supervised outlier detection methods were developed to tackle such cases.
• Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning approaches.
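As an unsupervised example, a brief sketch using scikit-learn's IsolationForest (the synthetic data and contamination rate are illustrative):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # the bulk of the data
    outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])          # two injected global outliers
    X = np.vstack([normal, outliers])

    # fit_predict returns -1 for points judged to be outliers, 1 otherwise
    labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
    print("flagged indices:", np.where(labels == -1)[0])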
3. Gini Value:
• It's a measure of impurity or uncertainty in a set of data.
• In the context of decision trees, it helps to find the best split by measuring how mixed the classes are in
the subsets created by splitting on a particular feature.
• Lower Gini values indicate better splits.
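• For a node whose classes occur with proportions p1, ..., pk, the Gini value is Gini = 1 − (p1² + ... + pk²). For example, a node with a 40%/60% class mix has Gini = 1 − (0.4² + 0.6²) = 0.48, while a pure node has Gini = 0.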
4. Step-by-step:
- Start: Begin with all the data in one node.
- Calculate Gini or Entropy: Measure the impurity of this node using Gini value or entropy.
- Split: Iterate through each feature and calculate the information gain or the decrease in Gini value if we split on that feature.
- Choose Best Split: Select the feature that gives the highest information gain or the largest decrease in Gini value (i.e., the lowest resulting impurity); see the sketch after this list.
- Split Data: Partition the data based on the chosen feature.
- Repeat: Recursively repeat the process for each partition until certain stopping criteria are met (like
maximum tree depth or minimum number of samples in a node).
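A short Python sketch of one such split search on a toy dataset (the data and the single-feature threshold split are illustrative):

    from collections import Counter

    def gini(labels):
        # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_threshold(values, labels):
        # Try each candidate threshold; keep the one with the lowest weighted Gini
        n = len(values)
        best = (None, float("inf"))
        for t in sorted(set(values)):
            left = [y for x, y in zip(values, labels) if x <= t]
            right = [y for x, y in zip(values, labels) if x > t]
            if not left or not right:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            if weighted < best[1]:
                best = (t, weighted)
        return best

    values = [1, 2, 3, 4, 5, 6]
    labels = ["A", "A", "A", "B", "B", "B"]
    print(best_threshold(values, labels))  # -> (3, 0.0): a perfect split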