DWDM IT-32 DATAWAREHOUSING & DATAMINING
Total No. of Pages : 6
[5948]-302
M.C.A. (Management)
IT - 32 : DATA WAREHOUSING AND DATA MINING
(2020 Pattern) (Semester - III)
Time : 2½ Hours] [Max. Marks : 50
Instructions to the candidates:
1) All questions are compulsory.
2) Draw neat & labelled diagrams wherever necessary.
Difference between Data Warehousing and Online Transaction Processing (OLTP):
1) A data warehouse stores a large amount of historical data, whereas OLTP holds current data.
2) Data warehousing is used for analyzing the business, whereas OLTP is used for running the business.
3) In data warehousing, the size of the database is around 100 GB to 2 TB, whereas in OLTP the size of the database is around 10 MB to 100 GB.
4) In data warehousing, denormalized data is present, whereas in OLTP normalized data is present.
5) A data warehouse uses query processing, whereas OLTP uses transaction processing.
6) A data warehouse is subject-oriented, whereas OLTP is application-oriented.
7) In data warehousing, data redundancy is present, whereas in OLTP there is no data redundancy.

OR
a) Explain the architecture of a Data warehouse with a neat diagram. [5]

--> A data warehouse is a heterogeneous collection of different data sources organised under a unified schema. There are two approaches for constructing a data warehouse: the top-down approach and the bottom-up approach, explained below.

1. Top-down approach:
This approach is defined by Inmon as follows: the data warehouse is a central repository for the complete organisation, and data marts are created from it only after the complete data warehouse has been created.

(Diagram: data flow through the warehouse architecture.)

The essential components are discussed below:

1. External Sources –
An external source is a source from where data is collected irrespective of the type of data. Data can be structured, semi-structured and unstructured as well.

2. Stage Area –
Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
E (Extract): Data is extracted from the external data source.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data warehouse after transforming it into the standard format.

3. Data warehouse –
After cleansing, the data is stored in the data warehouse as the central repository. It actually stores the metadata, while the actual data gets stored in the data marts. Note that the data warehouse stores the data in its purest form in this top-down approach.

4. Data Marts –
A data mart is also a part of the storage component. It stores the information of a particular function of an organisation which is handled by a single authority. There can be as many data marts in an organisation as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.

5. Data Mining –
The practice of analysing the big data present in the data warehouse is data mining. It is used to find the hidden patterns present in the database or in the data warehouse with the help of data mining algorithms.

Advantages of Top-Down Approach –
1. Since the data marts are created from the data warehouse, it provides a consistent dimensional view of the data marts.
2. This model is considered the strongest model for business changes; that is why big organisations prefer to follow this approach.
3. Creating a data mart from the data warehouse is easy.
4. It combines data from multiple sources and provides a more complete view of the organization's data.

Disadvantages of Top-Down Approach –
1. The cost and the time taken in designing and maintaining it are very high.
2. Complexity: The top-down approach can be complex to implement and maintain, particularly for large organizations with complex data needs. The design and implementation of the data warehouse and data marts can be time-consuming and costly.
3. Limited user involvement: The top-down approach can be dominated by IT departments, which may lead to limited user involvement in the design and implementation process. This can result in data marts that do not meet the specific needs of business users.
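The E-T-L sequence described under the Stage Area component can be sketched in a few lines of Python. This is only a minimal illustration, assuming pandas is available; the file name sales_export.csv, the column names and the SQLite warehouse file are invented for the example and do not come from the question paper.

```python
# Minimal ETL sketch: extract from a hypothetical CSV export, transform with
# pandas, and load into a local SQLite "warehouse" table.
import sqlite3
import pandas as pd

# E (Extract): pull raw data from the external source (hypothetical file).
raw = pd.read_csv("sales_export.csv")

# T (Transform): bring the data into a standard format.
raw.columns = [c.strip().lower() for c in raw.columns]   # uniform column names
raw["sale_date"] = pd.to_datetime(raw["sale_date"])      # standard date type
raw = raw.dropna(subset=["sale_amount"])                  # basic cleansing

# L (Load): write the transformed rows into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales_fact", conn, if_exists="append", index=False)
```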
a) Name the different OLAP architectures. Pick any two (2) and describe in detail with advantages. [5]

--> OLAP (Online Analytical Processing) is a type of software that helps in analyzing information from multiple databases at a particular time. OLAP is simply a multidimensional data model and also applies querying to it.

Types of OLAP Servers:
1. Relational OLAP
2. Multi-Dimensional OLAP
3. Hybrid OLAP
4. Transparent OLAP

Relational OLAP (ROLAP): Star Schema Based
ROLAP is based on the premise that data need not be stored multi-dimensionally to be viewed multi-dimensionally, and that it is possible to exploit the well-proven relational database technology to handle the multidimensionality of data. In ROLAP, data is stored in a relational database. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. ROLAP can handle large amounts of data and can leverage functionalities inherent in the relational database.

Advantages
1. Can handle large amounts of information: the data size limitation of ROLAP technology depends on the data size of the underlying RDBMS, so ROLAP itself does not restrict the data amount.
2. An RDBMS already comes with a lot of features, so ROLAP technologies (which work on top of the RDBMS) can leverage these functionalities.

Disadvantages
1. Performance can be slow: each ROLAP report is a SQL query (or multiple SQL queries) in the relational database, so the query time can be prolonged if the underlying data size is large.
2. Limited by SQL functionalities: ROLAP technology relies upon developing SQL statements to query the relational database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP): Cube-Based
MOLAP stores data on disk in a specialized multidimensional array structure. OLAP is performed on it relying on the random access capability of the arrays. Array elements are determined by dimension instances, and the fact data or measured value associated with each cell is usually stored in the corresponding array element. In MOLAP, the multidimensional array is usually stored in a linear allocation according to a nested traversal of the axes in some predetermined order.

But unlike ROLAP, where only records with non-zero facts are stored, all array elements are defined in MOLAP. As a result, the arrays generally tend to be sparse, with empty elements occupying a greater part of them. Since both storage and retrieval costs are important while assessing online performance efficiency, MOLAP systems typically include provisions such as advanced indexing and hashing to locate data while performing queries over sparse arrays. MOLAP cubes offer fast data retrieval, are optimal for slicing and dicing, and can perform complex calculations. All calculations are pre-generated when the cube is created.
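The point that each slice or dice operation simply adds a WHERE clause can be shown with a tiny sketch. The sales_fact table and its columns are hypothetical; the helper only illustrates how a ROLAP layer might build the SQL.

```python
# Sketch: a ROLAP-style dice operation is just extra WHERE predicates on a
# relational query. Table and column names are made up for illustration.
def dice(base_query: str, filters: dict) -> str:
    """Append one WHERE predicate per selected dimension value."""
    conditions = [f"{column} = '{value}'" for column, value in filters.items()]
    if conditions:
        return base_query + " WHERE " + " AND ".join(conditions)
    return base_query

# Dicing the (hypothetical) sales_fact table on year 2023 and region 'West':
print(dice("SELECT * FROM sales_fact", {"year": "2023", "region": "West"}))
```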
Q3) a) What are Discretization and concept Hierarchy generation process? Give an example for each. [5]

--> Discretization in data mining
Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of data become easy. In other words, data discretization is a method of converting attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends upon the way in which the operation proceeds; it works on the top-down splitting strategy and the bottom-up merging strategy.

Now, we can understand this concept with the help of an example. Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Concept hierarchy generation
A concept hierarchy organizes data by mapping low-level concepts to high-level concepts. For example, in computer science there are different types of hierarchical systems; a document placed in a folder in Windows at a specific place in the tree structure is the best example of a computer hierarchical tree model. There are two types of hierarchy: top-down mapping, and the second one is bottom-up mapping.

Let's understand this concept hierarchy for the dimension location with the help of an example. A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.

Top-down mapping
Top-down mapping generally starts at the top with some general information and ends at the bottom with the specialized information.

Bottom-up mapping
Bottom-up mapping generally starts at the bottom with some specialized information and ends at the top with the generalized information.
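A quick, hedged sketch of unsupervised (equal-width) discretization of the Age attribute above, assuming pandas; the bin count and the three labels are illustrative choices, not part of the question.

```python
# Equal-width binning of the Age values from the example above.
import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                  31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Convert the continuous values into a finite set of intervals (3 equal-width bins).
intervals = pd.cut(ages, bins=3)
labelled = pd.cut(ages, bins=3, labels=["young", "mature", "old"])  # illustrative labels

print(pd.concat([ages, intervals, labelled], axis=1))
```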
Previously, organizations had to build lots of infrastructure for data warehousing, but today cloud computing technology has amazingly reduced the effort as well as the cost of building data warehousing for businesses. Data warehouses and their tools are moving from physical data centers to cloud-based data warehouses. Many large organizations still operate data through the traditional way of data warehousing, but clearly the future of the data warehouse is in the cloud. The cloud-based data warehousing tools are fast, efficient, highly scalable, and available on a pay-per-use basis.

There are various cloud-based data warehousing tools available, so it becomes difficult to select the top data warehouse tools according to our project requirements. Following are the top 8 data warehousing tools:

1. Amazon Redshift:
Amazon Redshift is a cloud-based, fully managed, petabyte-scale data warehouse by Amazon. It starts with just a few hundred gigabytes of data and scales to petabytes or more. This enables the use of data to accumulate new insights for businesses and customers. It is a relational database management system (RDBMS), therefore it is compatible with other RDBMS applications. Amazon Redshift offers quick querying capabilities over structured data using SQL-based clients and business intelligence (BI) tools through standard ODBC and JDBC connections. Amazon Redshift is built around industry-standard SQL, with additional functionality to manage massive datasets and support high-performance analysis and reporting of these data. It helps to work quickly and easily with data in open formats, and integrates with and connects to the AWS ecosystem. No alternative cloud data warehouse tool makes it as straightforward to query data and write data back to the data lake in open formats. It focuses on simple use and accessibility: MySQL and other SQL-based systems are among the most popular and easily usable interfaces for database management, and Redshift's query-based system makes platform adoption and acclimatization a light breeze. It is incredibly quick when it comes to loading data and querying it for analytical and reporting functions. Redshift features a massively parallel processing (MPP) design that permits loading data at a very high speed.

2. Microsoft Azure:
Azure is a cloud computing platform that was launched by Microsoft in 2010. Microsoft Azure is a cloud computing service provider for building, testing, deploying, and managing applications and services through Microsoft-managed data centers. Azure is a public cloud computing platform that offers Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The Azure cloud platform provides more than 200 products and cloud services such as Data Analytics, Virtual Computing, Storage, Virtual Network, Internet Traffic Manager, Web Sites, Media Services, Mobile Services, Integration, etc. Azure facilitates simple portability and a genuinely compatible platform between on-premise and public cloud. Azure provides a range of cross-connections including virtual private networks (VPNs), caches, content delivery networks (CDNs), and ExpressRoute connections to improve usability and performance. Microsoft Azure provides a secure base across physical infrastructure and operational security. Azure App offers a completely managed web hosting service that helps in building web applications, services, and RESTful APIs. It offers a variety of plans to meet the requirements of any application, from small to globally scaled web applications. Running virtual machines or containers in the cloud is one of the most popular applications of Microsoft Azure.

3. Google BigQuery:
BigQuery is a serverless data warehouse that allows scalable analysis over petabytes of data. It is a Platform as a Service that supports querying with the help of ANSI SQL. It additionally has built-in machine learning capabilities. BigQuery was announced in 2010 and made available for use in 2011. Google BigQuery is a cloud-based big data analytics web service for processing very large, read-only data sets. BigQuery is designed for analyzing data running into billions of rows by simply employing SQL-like syntax, and it can run advanced analytical SQL-based queries over big sets of data. BigQuery is not developed to substitute relational databases or to serve easy CRUD operations and queries; it is oriented towards running analytical queries. It is a hybrid system that enables the storage of information in columns, while also taking on additional NoSQL-style features, like flexible data types and nested fields. BigQuery can be a better option than Redshift when paying by the hour does not suit the workload, and it may also be the best solution for data scientists running ML or data mining operations, since they deal with extremely large datasets. Google Cloud also offers a set of auto-scaling services that enable you to build a data lake that integrates with your existing applications, skills, and IT investments. In BigQuery, most of the time is spent on metadata/initiation, but the actual execution time is very small.

4. Snowflake:
Snowflake is a cloud computing-based data warehouse built on top of the Amazon Web Services or Microsoft Azure cloud infrastructure. The Snowflake design allows storage and compute to scale independently, so customers can use and pay for storage and computation individually. In Snowflake, data processing is simplified: users can do data blending, analysis, and transformations against varied forms of data structures with one language, SQL. Snowflake offers dynamic, scalable computing power with charges based strictly on usage. With Snowflake, computation and storage are fully separate, and the storage cost is the same as storing the data on Amazon S3. AWS tried to address this issue by introducing Redshift Spectrum, which allows querying data that exists directly on Amazon S3, but it is not as seamless as Snowflake. With Snowflake, we can clone a table, a schema, or even a database in no time while occupying no extra space. This is because the cloned table creates pointers that point to the stored data rather than copying the actual data; in other words, the cloned table only stores data that differs from its original table.

5. Micro Focus Vertica:
Micro Focus Vertica is developed for use in data warehouses and other big data workloads where speed, scalability, simplicity, and openness are crucial to the success of analytics. It is a self-monitored MPP database and offers scalability and flexibility that other tools do not. It runs on commodity hardware, therefore the database can be scaled as required. It provides significant in-database advanced analytics capabilities to improve query performance over traditional relational database systems and unverified open-source offerings. Vertica is a column-oriented relational database; therefore, it might not qualify as a NoSQL database, a NoSQL database being best described as a non-relational, shared-nothing, horizontally scalable database without ACID guarantees. Vertica differs from a normal RDBMS in that it stores data grouped on disk by column instead of by row; Vertica reads only the columns referenced by the query, rather than scanning the complete table as row-oriented databases must do. Vertica offers an advanced unified analytical warehouse that allows the organization to keep up with the size and complexity of growing data volumes. With Vertica, businesses can perform tasks like predictive maintenance, customer retention, financial compliance, network optimization, and much more.

OR

a) Explain the different data sources for data warehouse and methods of data collection. [5]

--> Data warehouses are designed to store and manage data from various sources to support business intelligence and analytical processes. Data can come from diverse origins, and methods for collecting and integrating this data can vary. Here's an overview of different data sources for data warehouses and the methods of data collection:

1. Operational Databases:
Online Transaction Processing (OLTP) Systems: These are the primary systems where daily business transactions are recorded. Data is often collected from OLTP databases for analytical purposes.

Methods of Data Collection from Operational Databases:
ETL (Extract, Transform, Load): ETL processes are used to extract data from operational databases, transform it to fit the data warehouse schema, and load it into the data warehouse.
Change Data Capture (CDC): CDC techniques track changes in operational databases, capturing new and modified data to keep the data warehouse up to date.

2. External Data Sources:
Third-party Data Providers: Organizations can purchase external data, such as market research data, demographic data, or industry-specific data, to enrich their analytics.

Methods of Data Collection from External Data Sources:
Data Feeds: Organizations can receive data feeds directly from third-party providers, either as batch files or through APIs.
Web Scraping: Web scraping techniques can be used to collect data from websites and online sources.

3. Legacy Systems:
Older Systems: Historical data might be stored in legacy systems, which need to be integrated into the data warehouse for historical analysis.
Data Migration: Data can be migrated from legacy systems using ETL processes or custom data conversion tools.

4. Cloud-Based Services:
SaaS (Software as a Service) Applications: Data generated by SaaS applications like CRM, marketing automation, or ERP systems can be integrated into the data warehouse.

Methods of Data Collection from Cloud-Based Services:
APIs: Many SaaS applications provide APIs to access data, making it possible to extract and load data into the data warehouse.
Webhooks: Some SaaS applications support webhooks to push data to the data warehouse when specific events occur.

5. Social Media and User-Generated Content:
Social Media Platforms: Data from social media platforms, forums, and user-generated content can provide insights into customer sentiment and behavior.

Methods of Data Collection from Social Media and User-Generated Content:
APIs: Social media platforms often provide APIs for accessing their data, which can be integrated into the data warehouse.
Web Scraping: Web scraping techniques can be employed to collect data from social media sites and online communities.
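As a rough illustration of incremental collection, the sketch below pulls only rows changed since the last load; the orders table and the updated_at column are assumptions made for the example, not names taken from the question paper.

```python
# CDC-style incremental extraction: fetch only rows modified after the last load.
import sqlite3

def extract_changes(conn: sqlite3.Connection, last_load_time: str):
    """Return rows from the (hypothetical) orders table changed since last_load_time."""
    query = "SELECT * FROM orders WHERE updated_at > ?"
    return conn.execute(query, (last_load_time,)).fetchall()

# Example usage against an operational SQLite database:
# with sqlite3.connect("operational.db") as conn:
#     changed_rows = extract_changes(conn, "2024-01-01 00:00:00")
```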
Q4) a) Consider the data set given below; compute the support for the item sets {e}, {b, d} and {b, d, e}.

Transaction ID    Items Bought
0001              {a, d, e}
0024              {a, b, c, e}
0012              {a, b, d, e}
0031              {a, c, d, e}
0015              {b, c, e}
0022              {b, d, e}
0029              {c, d}
0040              {a, b, c}
0033              {a, d, e}
0038              {a, b, e}

b) Using the result from problem a) above, compute the confidence for the association rules {b, d} → {e} and {e} → {b, d}. [5]
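The support and confidence values asked for in Q4 can be computed mechanically; the short sketch below does exactly that for the ten transactions listed above (the values it prints are noted in the comments).

```python
# Support and confidence for the itemsets in Q4, computed from the listed transactions.
transactions = [
    {"a", "d", "e"}, {"a", "b", "c", "e"}, {"a", "b", "d", "e"},
    {"a", "c", "d", "e"}, {"b", "c", "e"}, {"b", "d", "e"},
    {"c", "d"}, {"a", "b", "c"}, {"a", "d", "e"}, {"a", "b", "e"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

s_e   = support({"e"})            # 8/10 = 0.8
s_bd  = support({"b", "d"})       # 2/10 = 0.2
s_bde = support({"b", "d", "e"})  # 2/10 = 0.2

conf_bd_e = s_bde / s_bd          # confidence({b, d} -> {e}) = 1.0
conf_e_bd = s_bde / s_e           # confidence({e} -> {b, d}) = 0.25
print(s_e, s_bd, s_bde, conf_bd_e, conf_e_bd)
```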
OR

a) A consultancy wants to categorise MCA students into classes as Excellent, Good, and Average. The data collected from the students are their average percentage in MCA-I year and the result of the aptitude test conducted by the consultancy. Solve the problem using the decision tree algorithm. [5]

b) Using Bayesian classification, classify the sample data {6, 43} as male or female. Training data is given. [5]

Person    Height    Weight
Male      6.2       82
Male      5.11      65
Male      5.7       58
Male      5.11      55
Female    4.10      42
Female    5.5       50
Female    5.0       43
Female    5.75      50
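A hedged sketch of Bayesian classification for this training data, using scikit-learn's Gaussian naive Bayes; treating the sample {6, 43} as height 6 and weight 43 is an assumption about the question's intent.

```python
# Gaussian naive Bayes on the height/weight training data from the question.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[6.2, 82], [5.11, 65], [5.7, 58], [5.11, 55],
              [4.10, 42], [5.5, 50], [5.0, 43], [5.75, 50]])
y = np.array(["Male", "Male", "Male", "Male",
              "Female", "Female", "Female", "Female"])

model = GaussianNB().fit(X, y)
print(model.predict([[6, 43]]))         # predicted class for the sample {6, 43}
print(model.predict_proba([[6, 43]]))   # class probabilities behind that decision
```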
Q5) a) Construct an FP-Tree using the FP-Tree algorithm to find frequent patterns for the given data. [5]

TID    Items
100    {f, a, c, d, g, i, m, p}
200    {a, b, c, f, l, m, o}
300    {b, f, h, j, o}
500    {a, f, c, e, l, p, m, n}

-->
Step 1: Scan the data and count the support of each item.
Count the support (frequency) of each item in the dataset:
f: 4, a: 3, c: 4, d: 1, g: 1, i: 1, m: 3, p: 4, b: 3, l: 2, o: 2, h: 1, j: 1, k: 1, s: 1, e: 1, n: 1

Step 2: Filter items based on minimum support.
Select only those items whose support is greater than or equal to the minimum support threshold (let's say min_support = 2):
f: 4, a: 3, c: 4, m: 3, p: 4, b: 3, l: 2, o: 2

Step 3: Sort items based on support.
Sort the selected items in descending order of support:
f: 4, c: 4, p: 4, a: 3, b: 3, m: 3, l: 2, o: 2

Step 4: Construct the FP-Tree.
Initialize an empty tree and insert the transactions into the tree in a way that maintains the item order and support.
(FP-Tree diagram: each transaction is inserted below the root with its frequent items in the sorted order above, sharing common prefixes.)

Step 5: Mine frequent patterns from the FP-Tree.
Traverse the FP-Tree to generate frequent patterns:
f: 4, f-c: 3, f-c-a: 3, f-c-a-b: 1, f-c-a-b-m: 1, f-c-p: 1, f-p: 1, f-a: 3, f-a-c: 3, f-a-c-m: 1, f-a-c-m-b: 1, f-a-b: 1, f-a-b-m: 1, f-a-l: 1, f-c-b: 2, f-c-b-m: 1, f-b: 2, f-b-m: 1

These are the frequent patterns found in the dataset using the FP-Tree algorithm. Note that the support count is mentioned for each pattern. You can filter the patterns based on a minimum support threshold to get the final set of frequent patterns.
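Steps 1 to 3 of the FP-Tree construction (counting supports, filtering by the minimum support of 2, and re-ordering each transaction) can be checked with a small sketch; it runs only on the four transactions listed in the question.

```python
# Steps 1-3 of FP-Tree construction for the transactions listed in the question.
from collections import Counter

transactions = {
    100: ["f", "a", "c", "d", "g", "i", "m", "p"],
    200: ["a", "b", "c", "f", "l", "m", "o"],
    300: ["b", "f", "h", "j", "o"],
    500: ["a", "f", "c", "e", "l", "p", "m", "n"],
}
MIN_SUPPORT = 2

# Step 1: count the support of every item.
counts = Counter(item for items in transactions.values() for item in items)

# Step 2: keep only items whose support reaches the threshold.
frequent = {item: c for item, c in counts.items() if c >= MIN_SUPPORT}

# Step 3: rewrite each transaction with its frequent items in descending support
# order; these ordered lists are what get inserted into the FP-Tree.
order = sorted(frequent, key=frequent.get, reverse=True)
ordered = {tid: [i for i in order if i in items] for tid, items in transactions.items()}
print(frequent)
print(ordered)
```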
b) Explain Hierarchical clustering using examples. [5]

--> A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:
1. Identify the two clusters which are closest together, and
2. Merge the two most comparable clusters. We continue these steps until all the clusters are merged together.

In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).

Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical representation of the clusters in a dataset. The method starts by treating each data point as a separate cluster and then iteratively combines the closest clusters until a stopping criterion is reached. The result of hierarchical clustering is a tree-like structure, called a dendrogram, which illustrates the hierarchical relationships among the clusters.

Hierarchical clustering has a number of advantages over other clustering methods, including:
1. The ability to handle non-convex clusters and clusters of different sizes and densities.
2. The ability to handle missing data and noisy data.
3. The ability to reveal the hierarchical structure of the data, which can be useful for understanding the relationships among the clusters.

However, it also has some drawbacks, such as:
1. The need for a criterion to stop the clustering process and determine the final number of clusters.
2. The computational cost and memory requirements of the method can be high, especially for large datasets.
3. The results can be sensitive to the initial conditions, linkage criterion, and distance metric used.

In summary, hierarchical clustering is a method of data mining that groups similar data points into clusters by creating a hierarchical structure of the clusters. This method can handle different types of data and reveal the relationships among the clusters; however, it can have a high computational cost and its results can be sensitive to some conditions.

1. Agglomerative: Initially consider every data point as an individual cluster, and at every step merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is considered an individual entity or cluster.

(Figure – Agglomerative Hierarchical clustering)
(Figure – Divisive Hierarchical clustering)

Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other clusters.
Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, therefore we merge them in the second step; similarly with clusters (D) and (E), and at last we get the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters ([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)].
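A hedged sketch of agglomerative hierarchical clustering with SciPy; the six 2-D points standing in for the clusters A to F are invented purely for illustration.

```python
# Agglomerative (bottom-up) clustering of six illustrative points A-F.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0],   # A
                   [1.2, 1.1],   # B
                   [1.3, 1.0],   # C
                   [5.0, 5.0],   # D
                   [5.1, 5.2],   # E
                   [9.0, 9.0]])  # F

# Every point starts as its own cluster; the closest pairs are merged step by
# step (single linkage = nearest-neighbour distance).
Z = linkage(points, method="single")
print(Z)                                       # the merge sequence (dendrogram data)
print(fcluster(Z, t=3, criterion="maxclust"))  # cut the tree into 3 flat clusters
```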
OR

b) What are agent based and database based approaches in web mining? Explain with example. [5]

--> Agent-based and database-based approaches are two different methods used in web mining to gather, analyze, and extract valuable information from the World Wide Web. Here's an explanation of both approaches with examples:

Agent-Based Approach:
Agent-based web mining involves the use of software agents or bots that autonomously navigate the web, gather data, and perform various tasks, such as data retrieval, data filtering, and data processing. These agents are designed to mimic human behavior or follow predefined rules to interact with web resources.

Example: Web Crawlers
Search Engine Crawlers: Search engines like Google, Bing, and Yahoo employ web crawlers that automatically browse and index web pages.

Database-Based Approach:
The database-based approach focuses on organizing web data into structured databases so that it can be queried and analyzed.

Content Management Systems: Many websites and content management systems (CMS) store web content and metadata in structured databases. This allows for easy content retrieval, searching, and presentation. For example, WordPress stores blog posts, categories, and tags in a database, making it simple to query and display content on a website.

Example: Web Usage Mining
Web Usage Mining involves collecting and analyzing user interaction data on a website, such as clickstream data, session logs, and user profiles. This information is often stored in a database, making it easier to identify user behavior patterns and optimize website content.
Total No. of Questions : 5]    SEAT No. :
P6989    [5865] - 302    [Total No. of Pages : 5
M.C.A. (Management)
IT 32 : DATA WAREHOUSING AND DATA MINING
(2020 Pattern) (Semester - III)
Time : 2½ Hours]    [Max. Marks : 50
Instructions to the candidates:
1) All questions are compulsory.
2) Draw neat & labelled diagram wherever necessary.

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the global organization's ongoing operations. This is done by excluding data that are not useful concerning the subject and including all data needed by the users to understand the subject.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve files from 3 months, 6 months, 12 months, or even older data from a data warehouse. This differs from a transaction system, where often only the most current file is kept.

Structured Schema
Data in a data warehouse is typically organized into a structured schema, often in the form of a star or snowflake schema. This schema is optimized for querying and reporting, making it easier to analyze the data.
Need for Data Warehouse
A data warehouse is needed for the following reasons:

A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. The schemas used in a data warehouse are discussed below.

Star Schema
Each dimension in a star schema is represented with only a one-dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
(Diagram: star schema with the sales fact table and the time, item, branch, and location dimension tables.)

Fact Constellation Schema
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
The shipping fact table also contains two measures, namely dollars sold and units sold.
It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
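A small sketch of the star schema idea using pandas: one sales fact table joined to time and item dimension tables. The table contents are invented for illustration.

```python
# Star schema in miniature: a fact table referencing dimension tables by key.
import pandas as pd

dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
dim_item = pd.DataFrame({"item_key": [10, 20], "item_name": ["Pen", "Book"]})
fact_sales = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 20, 10],
    "dollars_sold": [100.0, 250.0, 80.0],
    "units_sold": [10, 5, 8],
})

# An analytical query is a join from the fact table to its dimensions
# followed by an aggregation.
report = (fact_sales
          .merge(dim_time, on="time_key")
          .merge(dim_item, on="item_key")
          .groupby(["quarter", "item_name"])[["dollars_sold", "units_sold"]]
          .sum())
print(report)
```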
OR

a) Describe ETL. What are the tasks to be performed during data transformation? [6]

--> ETL stands for Extract, Transform, Load, which is a process commonly used in data warehousing and data integration. It involves extracting data from various sources, transforming it to fit a common schema or structure, and then loading it into a data warehouse or another target system for analysis and reporting. ETL is essential for ensuring data quality and consistency in a data warehouse. The three main steps in ETL are as follows:

Extract:
1. Data Extraction: Extraction is the initial step of ETL. It involves retrieving data from diverse source systems, which can include databases, flat files, external applications, and more. Data extraction can be achieved using methods like batch processing, real-time streaming, or change data capture (CDC), depending on the source system's capabilities and the ETL tool in use.

Transform:
2. Data Cleansing: Data cleansing is a crucial task in data transformation. It involves identifying and rectifying data quality issues, such as missing values, duplicate records, and inaccuracies. Techniques like data validation, data profiling, and outlier detection are applied to ensure that the data is accurate and reliable.

Load:
Loading is the process of inserting the transformed data into a target data repository, which is often a data warehouse. This repository is designed to facilitate querying, reporting, and analysis. Loading can involve tasks such as:
Creating or updating tables and data structures in the data warehouse.
Managing data partitions to optimize query performance.
Populating indexes and metadata for efficient data retrieval.
Handling incremental data loads, where only new or changed data is added.

In addition to the ETL process, data preprocessing encompasses various techniques to prepare data for analysis, which include:
Handling Missing Data: Imputing missing values or removing rows/columns with missing data.
Removing Duplicates: Identifying and eliminating duplicate records.
Outlier Detection and Handling: Identifying and addressing outlier values that deviate significantly from the norm.
Data Transformation.
Data Integration: Combining data from multiple sources into a unified dataset.
Data Aggregation: Summarizing or aggregating data to a higher level of granularity, e.g., monthly sales from daily sales data.
Data Splitting: Dividing data into training and testing sets for machine learning or model evaluation.
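The preprocessing tasks listed above can be sketched with pandas; the tiny DataFrame and its column names are assumptions used only to illustrate each step.

```python
# Data preprocessing sketch: missing data, duplicates, outliers, aggregation, splitting.
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02", "2024-02-01"]),
    "sales": [100.0, None, 120.0, 5000.0],
})

df["sales"] = df["sales"].fillna(df["sales"].median())    # handle missing data
df = df.drop_duplicates()                                  # remove duplicates

# Outlier detection and handling: drop values far outside the IQR range.
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["sales"] >= q1 - 1.5 * iqr) & (df["sales"] <= q3 + 1.5 * iqr)]

# Data aggregation: summarise daily sales to monthly granularity.
monthly = df.set_index("date")["sales"].resample("M").sum()

# Data splitting: divide the rows into training and testing sets.
train = df.sample(frac=0.8, random_state=1)
test = df.drop(train.index)
print(monthly, len(train), len(test))
```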
b) What is OLAP? Describe the characteristics of OLAP. [4]

--> OLAP, or online analytical processing, is a method in computing that solves complex analytical queries. This business intelligence tool processes large amounts of data from a data mart, data warehouse or other data storage unit. OLAP uses cubes to display multiple categories of data and aggregate values.

Characteristics of OLAP:
7. OLAP facilitates interactive query and complex analysis for the users.
8. OLAP provides the ability to perform intricate calculations and comparisons.
The knowledge base is helpful in the entire process of data mining. It might be helpful to guide the search or evaluate the interestingness of the result patterns. The knowledge base may even contain user views and data from user experiences that might be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the result more accurate and reliable. The pattern assessment module regularly interacts with the knowledge base to get inputs, and also to update it.

b) Apply the FP-Tree algorithm to construct an FP-Tree and find the frequent itemsets for the dataset given below (minimum support = 30%). [7]

Transaction ID    List of Products
1                 Apple, Berries, Coconut
2                 Berries, Coconut, Dates
3                 Coconut, Dates
4                 Berries, Dates
5                 Apple, Coconut
6                 Apple, Coconut, Dates

--> To construct an FP-Tree and find frequent itemsets using the FP-Growth algorithm, we need to follow these steps:
1. Scan the dataset to count the support of each item and identify frequent items.
2. Sort the frequent items in descending order of support.
3. Re-scan the dataset and construct the FP-Tree.
4. Generate conditional FP-Trees for each frequent item.
5. Recursively mine the FP-Tree to find frequent itemsets.

Let's go through these steps for the given dataset with a minimum support of 30%:

Step 1: Count Support and Identify Frequent Items
We first count the support of each item:
Apple: 3, Berries: 3, Coconut: 5, Dates: 4
Now, let's identify the frequent items (support >= 30% * 6 = 1.8): Apple: 3, Berries: 3, Coconut: 5, Dates: 4. All four items are frequent.

Step 2: Sort Frequent Items
Sort the frequent items in descending order of support:
Coconut (5), Dates (4), Apple (3), Berries (3)

Step 3: Construct the FP-Tree
We construct the FP-Tree by processing the transactions one by one, inserting each transaction with its items in the sorted order above:
Transaction 1: Apple, Berries, Coconut
Transaction 2: Berries, Coconut, Dates
Transaction 3: Coconut, Dates
Transaction 4: Berries, Dates
Transaction 5: Apple, Coconut
Transaction 6: Apple, Coconut, Dates
(FP-Tree diagram after each insertion.)

Step 4: Generate Conditional FP-Trees
To find frequent itemsets, we need to generate conditional FP-Trees for each frequent item. We'll start with Coconut.
(Conditional FP-Tree for Coconut.)
Now, we can mine the conditional FP-Tree for frequent itemsets.

Step 5: Recursively Mine the FP-Tree
We mine the conditional FP-Trees for frequent itemsets, starting with the frequent itemsets involving Coconut:
Coconut, Dates (support: 3)
Now, let's check the other frequent items: Dates, Apple, and Berries.
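The frequent itemsets can be cross-checked programmatically, assuming the third-party mlxtend library is installed (pip install mlxtend); this is only a verification sketch, not part of the model answer.

```python
# FP-Growth verification of the fruit dataset at 30% minimum support.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

dataset = [["Apple", "Berries", "Coconut"],
           ["Berries", "Coconut", "Dates"],
           ["Coconut", "Dates"],
           ["Berries", "Dates"],
           ["Apple", "Coconut"],
           ["Apple", "Coconut", "Dates"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)

# Frequent itemsets at the 30% minimum support used in the question.
print(fpgrowth(onehot, min_support=0.3, use_colnames=True))
```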
a) Explain data mining techniques in brief. [3]

--> Data mining includes the utilization of refined data analysis tools to find previously unknown, valid patterns and relationships in large data sets.

Data Mining Techniques
Data mining techniques can be classified by different criteria, as follows:

Classification: This technique is used to obtain important and relevant information about data and metadata. This data mining technique helps to classify data into different classes.

Clustering: Clustering analysis is a data mining technique to identify similar data. This technique helps to recognize the differences and similarities between the data. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.

Association:
o Confidence: This measurement technique measures how often item B is purchased when item A is purchased as well.
Confidence = (transactions containing both item A and item B) / (transactions containing item A)

Outlier detection: This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets have outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous fields like network intrusion detection, credit or debit card fraud detection, detecting outlying readings in wireless sensor network data, etc.

Sequential Patterns: The sequential pattern technique is specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the importance of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc. In other words, this technique of data mining helps to discover or recognize similar patterns in transaction data over some time.

Prediction: Prediction uses a combination of other data mining techniques such as trends, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
b) How does the KNN algorithm work? [7]
Apply the KNN classification algorithm for the given dataset and predict the class for X (P1 = 3, P2 = 7), with K = 3.

P1    P2    Class
7     7     False
7     4     False
3     4     True
1     4     True

--> The k-Nearest Neighbors (KNN) algorithm is a supervised machine learning algorithm used for classification and regression tasks. It makes predictions based on the similarity between data points. Here's how the KNN algorithm works:

Select the Number of Neighbors (k): You need to specify the number of nearest neighbors, denoted as 'k', that will be used to make predictions. This is typically an odd number, to avoid ties in classification tasks.

Choose a Distance Metric: KNN uses a distance metric, such as Euclidean distance, Manhattan distance, or cosine similarity, to measure the similarity between data points. The choice of distance metric depends on the nature of the data.

Training: KNN does not explicitly build a model during training. Instead, it stores the training dataset.

Prediction for Classification: To classify a new data point, KNN finds the k-nearest neighbors from the training dataset based on the chosen distance metric. It does this by calculating the distance between the new data point and all data points in the training set. The k-nearest neighbors are the data points with the shortest distances.

Majority Voting: For classification, KNN assigns the class label to the new data point based on majority voting among its k-nearest neighbors. The class label that occurs most frequently among the k neighbors is the predicted class for the new data point.

Prediction for Regression: In regression tasks, KNN predicts a continuous value for the new data point by averaging the target values of its k-nearest neighbors. The algorithm assigns the predicted class (for classification) or predicted value (for regression) to the new data point.

Evaluate Performance: After making predictions, you can evaluate the model's performance using various metrics like accuracy, precision, recall, F1-score, or Mean Squared Error (MSE), depending on the type of task.

To apply the K-Nearest Neighbors (KNN) classification algorithm to predict the class for a new data point X with features (P1 = 3, P2 = 7), and given K = 3, we need to calculate the distances between X and all the data points in the dataset, and then use majority voting to determine the class.

Let's calculate the distances between X and the existing data points:

Distance between X (3, 7) and (7, 7):
Distance = sqrt((3 - 7)^2 + (7 - 7)^2) = sqrt(16) = 4.0

Distance between X (3, 7) and (7, 4):
Distance = sqrt((3 - 7)^2 + (7 - 4)^2) = sqrt(16 + 9) = sqrt(25) = 5.0

Distance between X (3, 7) and (3, 4):
Distance = sqrt((3 - 3)^2 + (7 - 4)^2) = sqrt(9) = 3.0

Distance between X (3, 7) and (1, 4):
Distance = sqrt((3 - 1)^2 + (7 - 4)^2) = sqrt(4 + 9) = sqrt(13) ≈ 3.61

Now, we have calculated the distances for all data points. Let's find the three nearest neighbors for X:

Nearest neighbor: (3, 4) with a distance of 3.0.
Second nearest neighbor: (1, 4) with a distance of 3.61.
Third nearest neighbor: (7, 7) with a distance of 4.0.

Majority Voting: Among these three neighbors, two of them belong to the class "True" and one belongs to the class "False". Since K = 3, we choose the majority class among these neighbors, which is "True". Therefore, the predicted class for X (P1 = 3, P2 = 7) with K = 3 is "True".

Final Prediction: X (P1 = 3, P2 = 7) is predicted to belong to the "True" class based on the KNN algorithm with K = 3.
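A quick cross-check of the hand-worked example, assuming scikit-learn is available; the four training points, the query point (3, 7) and K = 3 come from the question.

```python
# KNN (K = 3) on the P1/P2 dataset from the question.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])
y = np.array(["False", "False", "True", "True"])

knn = KNeighborsClassifier(n_neighbors=3)      # Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[3, 7]]))                   # expected to agree with the manual result: "True"
print(knn.kneighbors([[3, 7]]))                # distances and indices of the 3 nearest neighbours
```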
Q5) a) What is text mining? Explain the process of text mining. [4]

--> Text mining is a component of data mining that deals specifically with unstructured text data. It involves the use of natural language processing (NLP) techniques to extract useful information and insights from large amounts of unstructured text data. Text mining can be used as a preprocessing step for data mining or as a standalone process for specific tasks.

By using text mining, the unstructured text data can be transformed into structured data that can be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.

What is the common usage of Text Mining?
Text mining is widely used in various fields, such as natural language processing, information retrieval, and social media analysis. It has become an essential tool for organizations to extract insights from unstructured text data and make data-driven decisions.

"Extraction of interesting information or patterns from data in large databases is known as data mining."
Text mining is a process of extracting useful information and nontrivial patterns from a large volume of text databases.

Conventional Process of Text Mining
1. Gathering unstructured information from various sources accessible in various document formats, for example plain text, web pages, PDF records, etc.
2. Pre-processing and data cleansing tasks are performed to distinguish and eliminate inconsistency in the data. The data cleansing process makes sure to capture the genuine text; it is performed to eliminate stop words, apply stemming (the process of identifying the root of a certain word) and index the data.
3. Processing and controlling tasks are applied to review and further clean the data set.
4. Pattern analysis is implemented in the Management Information System.
5. Information processed in the above steps is utilized to extract important and applicable data for a powerful and convenient decision-making process and trend analysis.
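As a small illustration of the pre-processing stage (stop-word removal and conversion of raw text into structured data), here is a hedged sketch with scikit-learn; the three example sentences are invented.

```python
# Turning raw text into a structured term-frequency matrix.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Customers love the new product",
    "The product review mentions fast delivery",
    "Delivery was slow but customers were patient",
]

vectorizer = CountVectorizer(stop_words="english", lowercase=True)
matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the extracted vocabulary
print(matrix.toarray())                    # structured data ready for mining
```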
The K-Means algorithm proceeds as follows:

Assignment: Assign each data point to the nearest centroid. This is done by calculating the Euclidean distance (or another distance metric) between each data point and all centroids and assigning the data point to the cluster associated with the nearest centroid.

Update: Recalculate the centroids of each cluster by taking the mean of all data points assigned to that cluster.

Repeat: Repeat the assignment and update steps until the centroids no longer change significantly or until a predefined number of iterations is reached.

Termination: The algorithm terminates when the centroids stabilize, and the clusters are formed.

Now, let's apply the K-Means algorithm to group visitors to a website into two groups using their ages: 15, 16, 19, 20, 21, 28, 35, 40, 42, 44, 60, 65, with initial centroids 16 and 28 for the two groups.

Iteration 1:
Assignment: Calculate the distance of each data point to the centroids and assign each data point to the nearest centroid. For this iteration, the ages 15, 16, 19, 20 and 21 fall in Cluster 1 (nearest to centroid 16) and the remaining ages fall in Cluster 2 (nearest to centroid 28).
Update: Now, we recalculate the centroids for each group:
New Centroid 1: (15 + 16 + 19 + 20 + 21) / 5 = 18.2
New Centroid 2: (28 + 35 + 40 + 42 + 44 + 60 + 65) / 7 ≈ 44.86
Repeat: We repeat the assignment and update steps until the centroids no longer change significantly.
Termination: Once the centroids stabilize, the algorithm terminates. The data points are now divided into two clusters based on their ages.

--> Let's apply the K-Means algorithm to the given data set with K = 2 clusters and the data points D = {2, 3, 4, 10, 11, 12, 20, 25, 30}.

Initial Centroids: We need to start by selecting initial centroids for the two clusters. For simplicity, we can start with two random data points as initial centroids. Let's choose 3 and 25 as the initial centroids.

Iteration 1:
Assignment: Calculate the distance of each data point to the centroids and assign each data point to the nearest centroid. For this iteration:
Data points closer to 3: {2, 3, 4, 10, 11, 12}
Data points closer to 25: {20, 25, 30}
Update: Calculate the mean of the data points in each cluster to get new centroids.
New Centroid 1 = (2 + 3 + 4 + 10 + 11 + 12) / 6 = 7
New Centroid 2 = (20 + 25 + 30) / 3 = 25

As you can see, the cluster assignments do not change in the second iteration, which means the K-Means algorithm has converged.

The final clustering for K = 2 is as follows:
Cluster 1 (Centroid 1 = 7): {2, 3, 4, 10, 11, 12}
Cluster 2 (Centroid 2 = 25): {20, 25, 30}

These are the two clusters formed by the K-Means algorithm for the given data set with K = 2.
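A hedged verification of the worked example with scikit-learn, starting from the same initial centroids 3 and 25.

```python
# K-Means (K = 2) on D = {2, 3, 4, 10, 11, 12, 20, 25, 30} with fixed initial centroids.
import numpy as np
from sklearn.cluster import KMeans

D = np.array([[2], [3], [4], [10], [11], [12], [20], [25], [30]])
init_centroids = np.array([[3.0], [25.0]])

kmeans = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(D)

print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # converges to centroids of about 7 and 25
```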
Web mining applies data mining techniques to extract useful patterns and knowledge from the web by dealing with web-based records and services, server logs, and hyperlinks. The main goal of web mining is to find the patterns in web data by collecting and analyzing data to get important insights. Web mining is simply data mining that digs information out of the web; there are several algorithmic techniques used to find data from the web.

There are various types of web mining, which are as follows −

Web Content Mining − Web content mining is a procedure of web mining in which essential descriptive data is extracted from websites (WWW). Content involves audio, video, text documents, hyperlinks, and structured records. Web contents are designed to deliver records to users in the form of text, lists, images, videos, and tables. The function of content mining is data extraction, where structured data is copied from unstructured websites. The goal is to support data aggregation over several websites by utilizing the extracted structured data.

Web Structure Mining − Web structure mining is one of the core techniques of web mining that deals with the hyperlink structure. Structure mining essentially shows the structured summary of the website. It recognizes relationships among linked web pages of websites. Structure mining analyzes the hyperlinks of the website to assemble informative records and sort them out by elements like similarities and relationships. Mining implemented at the document level is called intra-page mining, and hyperlink-level mining is called inter-page mining.

Web Usage Mining − Web usage mining is used to extract useful records, information, and knowledge from weblog data, and helps in identifying the user access patterns for web pages.