
Total No. of Questions : 5]                                   SEAT No. :
PA-2562                                                       [Total No. of Pages : 6
[5948]-302
M.C.A. (Management )
IT - 32 : DATA WAREHOUSING AND DATA MINING
(2020 Pattern) (Semester - III)
Time : 2½ Hours] [Max. Marks : 50
Instructions to the candidates:
1) All questions are compulsory.
2) Draw neat & labelled diagrams wherever necessary.

Q2) a) Discuss the schemas in Data warehousing with the help of an employee database example. [5]

--> A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also needs to maintain a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. The schemas used in a data warehouse are discussed below.

Star Schema
- Each dimension in a star schema is represented with only one dimension table.
- This dimension table contains the set of attributes.
- The following diagram shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
- There is a fact table at the center. It contains the keys to each of the four dimensions.
- The item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
- The fact table also contains the measures, namely dollars sold and units sold.

Note: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia, so the entries for such cities cause redundancy along the attributes province_or_state and country.

Snowflake Schema
- Some dimension tables in the Snowflake schema are normalized.
- The normalization splits the data up into additional tables.
- Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
- The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the Snowflake schema, redundancy is reduced; therefore, the schema becomes easier to maintain and saves storage space.
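The star schema described above can be made concrete with a few table definitions. The sketch below is illustrative only: the table and column names (sales_fact, dim_item, and so on) are assumptions chosen to mirror the example, not part of the original answer.

```python
import sqlite3

# Illustrative star schema: one central fact table keyed to four dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT,
                           brand TEXT, supplier_key INTEGER);
CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);

-- The fact table holds foreign keys to all four dimensions plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
print("tables:", [r[0] for r in
      conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])
```

In a snowflake variant, the supplier attributes would be moved out of dim_item into a separate supplier dimension table referenced by supplier_key.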
b) Give 5 differences between transactional data and warehouse data based on their characteristics. [5]

--> Difference between Data Warehousing (DWH) and Online Transaction Processing (OLTP):

Data Warehousing (DWH) | Online Transaction Processing (OLTP)
It is a technique that gathers or collects data from different sources into a central repository. | It is a technique used for detailed day-to-day transaction data, which keeps changing every day.
It is designed for the decision-making process. | It is designed for the business transaction process.
It stores a large amount of data (historical data). | It holds current data.
It is used for analyzing the business. | It is used for running the business.
The size of the database is around 100 GB to 2 TB. | The size of the database is around 10 MB to 100 GB.
Denormalized data is present. | Normalized data is present.
It uses query processing. | It uses transaction processing.
It is subject-oriented. | It is application-oriented.
Data redundancy is present. | There is no data redundancy.

OR

a) Explain the architecture of a Data warehouse with a neat diagram. [5]

--> A data warehouse is a heterogeneous collection of different data sources organised under a unified schema. There are two approaches for constructing a data warehouse: the top-down approach and the bottom-up approach. The top-down approach is explained below.

1. Top-down approach:
This approach is defined by Inmon as follows: the data warehouse is a central repository for the complete organisation, and data marts are created from it after the complete data warehouse has been created.

The essential components are discussed below:
1. External Sources – An external source is a source from where data is collected, irrespective of the type of data. Data can be structured, semi-structured or unstructured.
2. Stage Area – Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
   - E (Extract): Data is extracted from the external data source.
   - T (Transform): Data is transformed into the standard format.
   - L (Load): Data is loaded into the data warehouse after transforming it into the standard format.
3. Data warehouse – After cleansing, the data is stored in the data warehouse as the central repository. It actually stores the metadata, while the actual data gets stored in the data marts. Note that the data warehouse stores the data in its purest form in this top-down approach.
4. Data Marts – A data mart is also a part of the storage component. It stores the information of a particular function of an organisation which is handled by a single authority. There can be as many data marts in an organisation as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
5. Data Mining – The practice of analysing the big data present in the data warehouse is data mining. It is used to find the hidden patterns that are present in the database or data warehouse with the help of data mining algorithms.

Advantages of Top-Down Approach –
1. Since the data marts are created from the data warehouse, it provides a consistent dimensional view of the data marts.
2. This model is considered the strongest model for business changes; that is why big organisations prefer to follow this approach.
3. Creating a data mart from the data warehouse is easy.
4. It combines data from multiple sources and provides a more complete view of the organization's data.

Disadvantages of Top-Down Approach –
1. The cost and time taken in designing the warehouse and maintaining it are very high.
2. Complexity: The top-down approach can be complex to implement and maintain, particularly for large organizations with complex data needs. The design and implementation of the data warehouse and data marts can be time-consuming and costly.
3. Limited user involvement: The top-down approach can be dominated by IT departments, which may lead to limited user involvement in the design and implementation process. This can result in data marts that do not meet the specific needs of business users.

b) Name the different OLAP architectures. Pick any two (2) and describe them in detail with advantages. [5]

--> OLAP (Online Analytical Processing) is a type of software that helps in analyzing information from multiple databases at a particular time. OLAP is simply a multidimensional data model that also supports querying over it.

Types of OLAP Servers:
- Relational OLAP
- Multi-Dimensional OLAP
- Hybrid OLAP
- Transparent OLAP

Relational OLAP (ROLAP): Star Schema Based

ROLAP is based on the premise that data need not be stored multi-dimensionally to be viewed multi-dimensionally, and that it is possible to exploit the well-proven relational database technology to handle the multidimensionality of data. In ROLAP, data is stored in a relational database. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. ROLAP can handle large amounts of data and can leverage functionalities inherent in the relational database.

Advantages
- Can handle large amounts of information: the data size limitation of ROLAP depends on the data size of the underlying RDBMS, so ROLAP itself does not restrict the data amount.
- The RDBMS already comes with a lot of features, so ROLAP technologies (which work on top of the RDBMS) can make use of these functionalities.

Disadvantages
- Performance can be slow: each ROLAP report is an SQL query (or multiple SQL queries) against the relational database, so query time can be prolonged if the underlying data size is large.
- Limited by SQL functionalities: ROLAP relies on generating SQL statements to query the relational database, and SQL statements do not suit all needs.
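To make the "slicing and dicing is a WHERE clause" point concrete, here is a small self-contained sketch; the sales table, its columns, and the data are invented for illustration and are not part of the original answer.

```python
import sqlite3

# Hypothetical ROLAP-style fact table; a real ROLAP tool would generate
# this SQL automatically from the user's cube operations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, item TEXT, city TEXT, dollars_sold REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (2023, "laptop", "Pune",   1200.0),
    (2023, "phone",  "Mumbai",  800.0),
    (2024, "laptop", "Pune",   1500.0),
    (2024, "phone",  "Pune",    900.0),
])

# Slice: fix one dimension (year = 2024) -- equivalent to adding a WHERE clause.
for row in conn.execute(
        "SELECT item, SUM(dollars_sold) FROM sales WHERE year = 2024 GROUP BY item"):
    print("slice 2024:", row)

# Dice: restrict several dimensions at once (year and city).
for row in conn.execute(
        "SELECT SUM(dollars_sold) FROM sales WHERE year = 2024 AND city = 'Pune'"):
    print("dice 2024/Pune:", row)
```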
Multidimensional OLAP (MOLAP): Cube-Based

MOLAP stores data on disk in a specialized multidimensional array structure. OLAP is performed on it relying on the random-access capability of the arrays. Array elements are determined by dimension instances, and the fact data or measured value associated with each cell is usually stored in the corresponding array element. In MOLAP, the multidimensional array is usually stored in a linear allocation according to a nested traversal of the axes in some predetermined order. Unlike ROLAP, where only records with non-zero facts are stored, all array elements are defined in MOLAP; as a result, the arrays generally tend to be sparse, with empty elements occupying a greater part of them. Since both storage and retrieval costs are important when assessing online performance efficiency, MOLAP systems typically include provisions such as advanced indexing and hashing to locate data while performing queries and to handle sparse arrays. MOLAP cubes offer fast data retrieval, are optimal for slicing and dicing, and can perform complex calculations. All calculations are pre-generated when the cube is created.

Advantages
- Excellent performance: a MOLAP cube is built for fast information retrieval and is optimal for slicing and dicing operations.
- Can perform complex calculations: all evaluations have been pre-generated when the cube is created. Hence, complex calculations are not only possible, they return quickly.

Disadvantages
- Limited in the amount of information it can handle: because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself.
- Requires additional investment: cube technology is generally proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, additional investments in human and capital resources are likely needed.
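As a rough illustration of the array storage that MOLAP relies on, the sketch below (my own example, with hypothetical dimensions and values) stores a measure in a dense NumPy array and answers a slice and a roll-up by plain array indexing and summation.

```python
import numpy as np

# Hypothetical cube: dollars_sold indexed by (year, item, city).
years, items, cities = ["2023", "2024"], ["laptop", "phone"], ["Pune", "Mumbai"]
cube = np.zeros((len(years), len(items), len(cities)))  # dense array; empty cells stay 0

cube[0, 0, 0] = 1200.0   # 2023, laptop, Pune
cube[1, 0, 0] = 1500.0   # 2024, laptop, Pune
cube[1, 1, 1] =  800.0   # 2024, phone, Mumbai

# Slice on year = 2024: just index the first axis, no SQL needed.
print("2024 slice:\n", cube[1])
# Roll-up over cities: aggregate along the city axis.
print("sales by year and item:\n", cube.sum(axis=2))
```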
Q3) a) What are the Discretization and concept hierarchy generation processes? Give an example for each. [5]

--> Discretization in data mining

Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of the data become easy. In other words, data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends on the way the operation proceeds; it works on a top-down splitting strategy or a bottom-up merging strategy.

We can understand this concept with the help of an example. Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Discretization of the Age attribute:

Before discretization | 1, 5, 4, 9, 7 | 11, 14, 17, 13, 18, 19 | 31, 33, 36, 42, 44, 46 | 70, 74, 77, 78
After discretization  | Child         | Young                  | Mature                 | Old

Another example is analytics, where we gather the static data of website visitors. For example, all visitors who visit the site with an IP address of India are shown under the country level "India".
Concept hierarchy generation

The term hierarchy represents an organizational structure or mapping in which items are ranked according to their levels of importance. In other words, a concept hierarchy refers to a sequence of mappings from a set of low-level concepts to higher-level concepts. For example, in computer science there are different types of hierarchical systems: a document placed in a folder in Windows at a specific place in the tree structure is a good example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.

Let us understand the concept hierarchy for the dimension location with the help of an example. A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.

Top-down mapping
Top-down mapping generally starts at the top with some general information and ends at the bottom with the specialized information.

Bottom-up mapping
Bottom-up mapping generally starts at the bottom with some specialized information and ends at the top with the generalized information.

b) Explain the tools used for data warehouse development. [5]

--> A data warehouse is a data management system that is used for storing, reporting, and data analysis. It is the primary component of business intelligence and is also known as an enterprise data warehouse. Data warehouses are central repositories that store data from one or more heterogeneous sources. They are analytical tools built to support decision-making and reporting for users across many departments. A data warehouse creates a single, unified source of truth for an entire organization and stores historical data about the business so that it can be analyzed and insights can be extracted from it.

Previously, organizations had to build a lot of infrastructure for data warehousing, but today cloud computing technology has dramatically reduced both the effort and the cost of building data warehouses for businesses. Data warehouses and their tools are moving from physical data centers to cloud-based data warehouses. Many large organizations still operate data warehouses in the traditional way, but clearly the future of the data warehouse is in the cloud. Cloud-based data warehousing tools are fast, efficient, highly scalable, and available on a pay-per-use basis. There are many cloud-based data warehousing tools available, so it becomes difficult to select the right tools for our project requirements. Some of the top data warehousing tools are:

1. Amazon Redshift:
Amazon Redshift is a cloud-based, fully managed, petabyte-scale data warehouse from Amazon. It starts with just a few hundred gigabytes of data and scales to petabytes or more. This enables the use of data to gain new insights for businesses and customers. It is a relational database management system (RDBMS), and is therefore compatible with other RDBMS applications. Amazon Redshift offers quick querying capabilities over structured data using SQL-based clients and business intelligence (BI) tools via standard ODBC and JDBC connections. Amazon Redshift is built around industry-standard SQL, with additional functionality to manage massive datasets and support high-performance analysis and reporting of that data. It makes it easy to work with data in open formats, integrates simply with the AWS ecosystem, and can query and export data to and from the data lake. Few alternative cloud data warehouse tools make it as straightforward to query data and write data back to the data lake in open formats. It focuses on simple use and accessibility: MySQL and other SQL-based systems are among the most popular and easily usable interfaces for database management, and Redshift's simple query-based system makes platform adoption and acclimatization a breeze. It is very quick when it comes to loading data and querying it for analytical and reporting functions. Redshift features a massively parallel processing (MPP) design that permits loading data at very high speed.

2. Microsoft Azure:
Azure is a cloud computing platform that was launched by Microsoft in 2010. Microsoft Azure is a cloud computing service provider for building, testing, deploying, and managing applications and services through Microsoft-managed data centers. Azure is a public cloud computing platform that offers Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The Azure cloud platform provides more than 200 products and cloud services such as Data Analytics, Virtual Computing, Storage, Virtual Network, Internet Traffic Manager, Web Sites, Media Services, Mobile Services, Integration, and more. Azure facilitates simple portability and a genuinely compatible platform between on-premise and public cloud. Azure provides a range of cross-connections including virtual private networks (VPNs), caches, content delivery networks (CDNs), and ExpressRoute connections to improve usability and performance. Microsoft Azure provides a secure base across physical infrastructure and operational security. Azure App Service offers a fully managed web hosting service that helps in building web applications, services, and RESTful APIs, with a variety of plans to meet the requirements of any application, from small sites to globally scaled web applications. Running virtual machines or containers in the cloud is one of the most popular uses of Microsoft Azure.

3. Google BigQuery:
BigQuery is a serverless data warehouse that allows scalable analysis over petabytes of data. It is a Platform as a Service that supports querying with ANSI SQL, and it additionally has built-in machine learning capabilities. BigQuery was announced in 2010 and made available for use in 2011. Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets. BigQuery is designed for analyzing data that runs to billions of rows using an SQL-like syntax, and it can run complex analytical SQL-based queries over big sets of data. BigQuery is not designed to substitute for relational databases or for simple CRUD operations and queries; it is oriented towards running analytical queries. It is a hybrid system that stores information in columns, but it also takes on additional NoSQL features, such as flexible data types and nested fields. BigQuery can be a better option than Redshift for workloads where paying per use is preferable, and it may also be the best solution for data scientists running ML or data mining operations, since they deal with extremely large datasets. Google Cloud also offers a set of auto-scaling services that enable you to build a data lake that integrates with your existing applications, skills, and IT investments. In BigQuery, most of the time is spent on metadata/initiation, while the actual execution time is very small.

4. Snowflake:
Snowflake is a cloud-based data warehouse built on top of the Amazon Web Services or Microsoft Azure cloud infrastructure. The Snowflake design allows storage and compute to scale independently, so customers can use and pay for storage and computation separately. In Snowflake, data processing is simplified: users can do data blending, analysis, and transformations against varied forms of data structures with one language, SQL. Snowflake offers dynamic, scalable computing power with charges based strictly on usage. With Snowflake, computation and storage are fully separate, and the storage cost is the same as storing the data on Amazon S3. AWS tried to address this issue by introducing Redshift Spectrum, which allows querying data that resides directly on Amazon S3, but it is not as seamless as Snowflake. With Snowflake, we can clone a table, a schema, or even a database almost instantly while occupying no extra space. This is because the cloned table creates pointers that point to the stored data rather than copying the actual data; in other words, the clone only stores data that differs from its original table.

5. Micro Focus Vertica:
Micro Focus Vertica is developed for use in data warehouses and other big data workloads where speed, scalability, simplicity, and openness are crucial to the success of analytics. It is a self-monitored MPP database and offers scalability and flexibility that other tools do not. It runs on commodity hardware, so the database can be scaled as required. It provides significant in-database advanced analytics capabilities to improve query performance over traditional relational database systems and unproven open-source offerings. Vertica is a column-oriented relational database, so it does not qualify as a NoSQL database (a NoSQL database is best described as a non-relational, shared-nothing, horizontally scalable database without ACID guarantees). Vertica differs from a normal RDBMS in the way it stores data: it groups data on disk by column instead of by row, so Vertica reads only the columns referenced by the query rather than scanning the complete table as row-oriented databases must do. Vertica offers an advanced unified analytical warehouse that allows organizations to keep up with the size and complexity of huge data volumes. With Vertica, businesses can perform tasks like predictive maintenance, customer retention, regulatory compliance, network optimization, and much more.

OR

a) Explain the different data sources for a data warehouse and the methods of data collection. [5]

--> Data warehouses are designed to store and manage data from various sources to support business intelligence and analytical processes. Data can come from diverse origins, and the methods for collecting and integrating this data can vary. Here is an overview of different data sources for data warehouses and the methods of data collection:

1. Operational Databases:
Online Transaction Processing (OLTP) Systems: These are the primary systems where daily business transactions are recorded. Data is often collected from OLTP databases for analytical purposes.

Methods of Data Collection from Operational Databases:
ETL (Extract, Transform, Load): ETL processes are used to extract data from operational databases, transform it to fit the data warehouse schema, and load it into the data warehouse.
Change Data Capture (CDC): CDC techniques track changes in operational databases, capturing new and modified data to keep the data warehouse up to date.

2. External Data Sources:
Third-party Data Providers: Organizations can purchase external data, such as market research data, demographic data, or industry-specific data, to enrich their analytics.

Methods of Data Collection from External Data Sources:
Data Feeds: Organizations can receive data feeds directly from third-party providers, either as batch files or through APIs.
Web Scraping: Web scraping techniques can be used to collect data from websites and online sources.

3. Legacy Systems:
Older Systems: Historical data might be stored in legacy systems, which need to be integrated into the data warehouse for historical analysis.

Methods of Data Collection from Legacy Systems:
Data Migration: Data can be migrated from legacy systems using ETL processes or custom data conversion tools.

4. Cloud-Based Services:
SaaS (Software as a Service) Applications: Data generated by SaaS applications such as CRM, marketing automation, or ERP systems can be integrated into the data warehouse.

Methods of Data Collection from Cloud-Based Services:
APIs: Many SaaS applications provide APIs to access data, making it possible to extract and load data into the data warehouse.
Webhooks: Some SaaS applications support webhooks to push data to the data warehouse when specific events occur.

5. Social Media and User-Generated Content:
Social Media Platforms: Data from social media platforms, forums, and user-generated content can provide insights into customer sentiment and behavior.

Methods of Data Collection from Social Media and User-Generated Content:
APIs: Social media platforms often provide APIs for accessing their data, which can be integrated into the data warehouse.
Web Scraping: Web scraping techniques can be employed to collect data from social media sites and online communities.

Q4) a) Consider the data set given below and compute the support for the item sets {e}, {b, d} and {b, d, e}. [5]

Transaction ID | Items Bought
0001 | {a, d, e}
0024 | {a, b, c, e}
0012 | {a, b, d, e}
0031 | {a, c, d, e}
0015 | {b, c, e}
0022 | {b, d, e}
0029 | {c, d}
0040 | {a, b, c}
0033 | {a, d, e}
0038 | {a, b, e}

b) Using the result from problem a) above, compute the confidence for the association rules {b, d} -> {e} and {e} -> {b, d}. [5]
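The worked solution is not reproduced in this extract, so as a cross-check only, the short script below counts the requested supports and confidences directly from the ten transactions listed above.

```python
transactions = [
    {"a", "d", "e"}, {"a", "b", "c", "e"}, {"a", "b", "d", "e"}, {"a", "c", "d", "e"},
    {"b", "c", "e"}, {"b", "d", "e"}, {"c", "d"}, {"a", "b", "c"},
    {"a", "d", "e"}, {"a", "b", "e"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

for s in [{"e"}, {"b", "d"}, {"b", "d", "e"}]:
    print(sorted(s), "support =", support(s))

# confidence(X -> Y) = support(X union Y) / support(X)
print("conf({b,d} -> {e})  =", support({"b", "d", "e"}) / support({"b", "d"}))
print("conf({e} -> {b,d})  =", support({"b", "d", "e"}) / support({"e"}))
```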
OR

a) A consultancy wants to categorise MCA students into the classes Excellent, Good, and Average. The data collected from the students are their average percentage in MCA-I year and the result of the aptitude test conducted by the consultancy. Solve the problem using the decision tree algorithm. [5]

-->

b) Use Bayesian classification to classify the sample data {6, 43} as male or female. The training data is given below. [5]

Person | Height | Weight
Male   | 6.2    | 82
Male   | 5.11   | 65
Male   | 5.7    | 58
Male   | 5.11   | 55
Female | 4.10   | 42
Female | 5.5    | 50
Female | 5.0    | 43
Female | 5.75   | 50
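No worked solution for the Bayesian part appears in this extract. As a hedged sketch only, the snippet below shows how the sample {6, 43} (height, weight) could be classified with a Gaussian Naive Bayes model fitted on the training table, taking the height and weight values exactly as given.

```python
from sklearn.naive_bayes import GaussianNB

# Training data from the table above: (height, weight) -> class label.
X = [[6.2, 82], [5.11, 65], [5.7, 58], [5.11, 55],   # Male rows
     [4.10, 42], [5.5, 50], [5.0, 43], [5.75, 50]]   # Female rows
y = ["Male"] * 4 + ["Female"] * 4

model = GaussianNB().fit(X, y)

sample = [[6, 43]]          # the sample to classify: height 6, weight 43
print("predicted class:", model.predict(sample)[0])
print("class probabilities:", dict(zip(model.classes_, model.predict_proba(sample)[0])))
```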

Q5) a) Apply the FP-Tree algorithm to find the frequent patterns for the given data. [5]

Transaction ID | Items Bought
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

--> The steps of the FP-Tree algorithm for the given data are as follows.

Step 1: Scan the data and count the support of each item.
Count the support (frequency) of each item in the dataset:

f: 4, a: 3, c: 4, d: 1, g: 1, i: 1, m: 3, p: 3, b: 3, l: 2, o: 2, h: 1, j: 1, k: 1, s: 1, e: 1, n: 1

Step 2: Filter items based on minimum support.
Select only those items whose support is greater than or equal to the minimum support threshold (say min_support = 2):

f: 4, a: 3, c: 4, m: 3, p: 3, b: 3, l: 2, o: 2

Step 3: Sort items based on support.
Sort the selected items in descending order of support:

f: 4, c: 4, a: 3, b: 3, m: 3, p: 3, l: 2, o: 2

Step 4: Construct the FP-Tree.
Initialize an empty tree and insert each transaction into the tree with its items in the sorted order, maintaining the item order and support counts:

```
root
|
f(4)
|  \
c(3)   p(1)
|      |
a(3)   m(1)
|      |
b(1)   o(1)
|      |
m(1)   l(1)
|
b(2)
|
a(1)
```

Step 5: Mine frequent patterns from the FP-Tree.
Traverse the FP-Tree to generate frequent patterns:

f: 4, c: 4, f-c: 3, f-c-a: 3, f-c-a-b: 1, f-c-a-b-m: 1, f-c-p: 1, f-p: 1, f-a: 3, f-a-c: 3, f-a-c-m: 1, f-a-c-m-b: 1, f-a-b: 1, f-a-b-m: 1, f-a-l: 1, f-c-b: 2, f-c-b-m: 1, f-b: 2, f-b-m: 1

These are the patterns generated from the FP-Tree, with the support count mentioned for each pattern. The patterns are then filtered against the minimum support threshold to obtain the final set of frequent patterns.
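As a quick way to double-check the Step 1 counts and the item ordering used when building the tree, here is a small standalone counter (my own sketch, standard library only).

```python
from collections import Counter

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = 2

# Step 1: count item frequencies across all transactions.
counts = Counter(item for t in transactions for item in t)
print("all counts:", dict(counts))

# Steps 2-3: keep frequent items and sort them by descending support,
# which is the insertion order used when building the FP-tree.
frequent = {i: c for i, c in counts.items() if c >= min_support}
order = sorted(frequent, key=lambda i: (-frequent[i], i))
print("frequent items in FP-tree order:", [(i, frequent[i]) for i in order])
```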
b) Explain Hierarchical clustering using examples. [5]

--> A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then it repeatedly executes the following steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most comparable clusters.
These steps are continued until all the clusters are merged together.

In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view). Hierarchical clustering is thus a method of cluster analysis in data mining that creates a hierarchical representation of the clusters in a dataset: the method starts by treating each data point as a separate cluster and then iteratively combines the closest clusters until a stopping criterion is reached. The result is a tree-like structure, the dendrogram, which illustrates the hierarchical relationships among the clusters.

Hierarchical clustering has a number of advantages over other clustering methods, including:
1. The ability to handle non-convex clusters and clusters of different sizes and densities.
2. The ability to handle missing data and noisy data.
3. The ability to reveal the hierarchical structure of the data, which can be useful for understanding the relationships among the clusters.

However, it also has some drawbacks, such as:
1. The need for a criterion to stop the clustering process and determine the final number of clusters.
2. The computational cost and memory requirements of the method can be high, especially for large datasets.
3. The results can be sensitive to the initial conditions, linkage criterion, and distance metric used.

In summary, hierarchical clustering groups similar data points into clusters by creating a hierarchical structure of the clusters. It can handle different types of data and reveal the relationships among the clusters; however, it can have a high computational cost, and the results can be sensitive to some conditions.

There are two types of hierarchical clustering:

1. Agglomerative: Initially, consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is considered an individual entity or cluster. At every iteration, the clusters merge with other clusters until one cluster is formed.


The algorithm for Agglomerative Hierarchical Clustering is:
- Calculate the similarity of one cluster with all the other clusters (calculate the proximity matrix).
- Consider every data point as an individual cluster.
- Merge the clusters that are highly similar or close to each other.
- Recalculate the proximity matrix for each cluster.
- Repeat steps 3 and 4 until only a single cluster remains.

Let us see a graphical representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the algorithm works; no calculation has been performed below, and all the proximities among the clusters are assumed.
Suppose we have six data points A, B, C, D, E, and F.
(Figure – Agglomerative Hierarchical clustering)
- Step-1: Consider each letter as a single cluster and calculate the distance of each cluster from all the other clusters.
- Step-2: In the second step, comparable clusters are merged together to form a single cluster. Suppose cluster (B) and cluster (C) are very similar to each other, so we merge them in this step; similarly for clusters (D) and (E). We then get the clusters [(A), (BC), (DE), (F)].
- Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters ((DE) and (F)) together to form the new clusters [(A), (BC), (DEF)].
- Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
- Step-5: Finally, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].

2. Divisive: Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we take all of the data points as a single cluster and, in every iteration, we separate the data points that are not comparable from the cluster. In the end, we are left with N clusters.
(Figure – Divisive Hierarchical clustering)
large volumes of web data in databases, which can be queried and
analyzed more efficiently. This approach relies on structured storage and
the use of database management systems to handle web data.

b)What are agent based and database based approaches in web mining?
Example: Web Usage Mining
Explain with example. [5]
--> Agent-based and database-based approaches are two different methods used
in web mining to gather, analyze, and extract valuable information from Web Usage Mining involves collecting and analyzing user interaction data on a
the World Wide Web. Here's an explanation of both approaches with website, such as clickstream data, session logs, and user profiles. This
examples: information is often stored in a database, making it easier to identify user
behavior patterns and optimize website content.
Figure – Divisive Hierarchical clustering
Figure – Agglomerative Hierarchical clustering Agent-Based Approach:
 Step-1: Consider each alphabet as a single cluster and calculate the Agent-based web mining involves the use of software agents or bots that
distance of one cluster from all the other clusters. OR
autonomously navigate the web, gather data, and perform various tasks, Content Management Systems: Many websites and content management
 Step-2: In the second step comparable clusters are merged together to
such as data retrieval, data filtering, and data processing. These agents are systems (CMS) store web content and metadata in structured databases.
form a single cluster. Let’s say cluster (B) and cluster (C) are very
designed to mimic human behavior or follow predefined rules to interact This allows for easy content retrieval, searching, and presentation. For
similar to each other therefore we merge them in the second step
with web resources. example, WordPress stores blog posts, categories, and tags in a database,
similarly to cluster (D) and (E) and at last, we get the clusters [(A), (BC),
(DE), (F)] making it simple to query and display content on a websit
 Step-3: We recalculate the proximity according to the algorithm and
merge the two nearest clusters([(DE), (F)]) together to form new clusters Example: Web Crawlers
as [(A), (BC), (DEF)]

Search Engine Crawlers: Search engines like Google, Bing, and Yahoo employ
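As a toy illustration of the agent-based approach, the sketch below implements one minimal crawler step: it parses an HTML page (hard-coded here so the example runs offline, with hypothetical URLs) and extracts the links an agent would follow next. Real crawlers of course add fetching, politeness rules, and indexing.

```python
from html.parser import HTMLParser

# A tiny stand-in for a downloaded web page (hypothetical URLs).
PAGE = """
<html><body>
  <a href="https://example.com/products">Products</a>
  <a href="https://example.com/blog">Blog</a>
  <p>No link here.</p>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collects href targets, i.e. the frontier a crawler agent would visit next."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed(PAGE)
print("links to crawl next:", parser.links)
```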

Total No. of Questions : 5]                                   SEAT No. :
P6989                                                         [Total No. of Pages : 5
[5865]-302
M.C.A. (Management)
IT 32 : DATA WAREHOUSING AND DATA MINING
(2020 Pattern) (Semester - III)
Time : 2½ Hours]                                              [Max. Marks : 50
Instructions to the candidates:
1) All questions are compulsory.
2) Draw neat & labelled diagrams wherever necessary.

Q2) a) What is a Data warehouse? Explain the need and characteristics of a Data warehouse. [5]

--> What is a Data Warehouse?
A data warehouse is a centralized repository that is used for storing and managing large volumes of data from various sources within an organization. It is designed to support business intelligence (BI) activities, data analysis, and reporting. Data warehouses are a critical component in the field of data management and are used to consolidate and integrate data from different operational systems, making it available for analysis and decision-making.

Characteristics of a Data Warehouse

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view of a particular subject, such as customer, product, or sales, instead of the organization's global ongoing operations. This is done by excluding data that are not useful for the subject and including all data needed by the users to understand the subject.

Integrated
A data warehouse integrates various heterogeneous data sources like RDBMSs, flat files, and online transaction records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even further back from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.

Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from the source operational RDBMS. Operational updates of the data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed. It usually requires only two procedures for data access: the initial loading of data and access to the data. Therefore, the data warehouse does not require transaction processing, recovery, and concurrency-control capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.

Structured Schema
Data in a data warehouse is typically organized into a structured schema, often in the form of a star or snowflake schema. This schema is optimized for querying and reporting, making it easier to analyze the data.

Need for a Data Warehouse

A data warehouse is needed for the following reasons:
1. Business users: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.
2. Storing historical data: A data warehouse is required to store time-variant data from the past. This input is used for various purposes.
3. Making strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
4. Data consistency and quality: By bringing data from different sources to a common place, the user can effectively bring uniformity and consistency to the data.
5. High response time: A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and a quick response time.

b) Explain the schemas of a Data warehouse. [5]

--> A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also needs to maintain a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas.

Star Schema
- Each dimension in a star schema is represented with only one dimension table.
- This dimension table contains the set of attributes.
- The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
- There is a fact table at the center. It contains the keys to each of the four dimensions.
- The fact table also contains the measures, namely dollars sold and units sold.
- The item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.

Snowflake Schema
- Some dimension tables in the Snowflake schema are normalized.
- The normalization splits the data up into additional tables.
- Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
- The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.

Fact Constellation Schema
- A fact constellation has multiple fact tables. It is also known as a galaxy schema.
- The following diagram shows two fact tables, namely sales and shipping.
- The sales fact table is the same as that in the star schema.
- The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
- The shipping fact table also contains two measures, namely dollars sold and units sold.
- It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

OR

a) Explain the Kimball Life Cycle diagram in detail. [5]

--> The Kimball Lifecycle is an approach to designing, building, and evolving a data warehouse. It was developed by Ralph Kimball, a prominent figure in the field of data warehousing. The Kimball Lifecycle diagram represents the various stages and iterative steps involved in building and evolving a data warehouse. Below is a detailed explanation of each stage in the Kimball Lifecycle:

1. Business Understanding: The process begins with a deep understanding of the organization's business objectives, goals, and requirements. This phase involves close collaboration with business stakeholders to identify the key business questions and the data needed for decision-making.
2. Requirements Definition: In this stage, the specific data requirements are documented. Data requirements are translated into a set of business and technical requirements. These requirements include the data sources, data quality expectations, and reporting needs.

3. Dimensional Modeling: Dimensional modeling is a critical step in the Kimball Lifecycle. It involves designing the structure of the data warehouse, including the creation of fact tables (containing measures) and dimension tables (containing descriptive attributes). A star schema or snowflake schema design is typically used to structure the data for efficient querying and reporting.

4. ETL (Extract, Transform, Load): This phase involves the extraction of data from source systems, transforming it into the required format, and loading it into the data warehouse. ETL processes are responsible for cleaning, aggregating, and integrating data from different sources.

5. Data Warehouse Database Design: The data warehouse database is designed and implemented to support the schema defined in the dimensional modeling phase. This includes defining table structures, indexing, and optimizing for query performance.

6. Business Intelligence (BI) Development: BI tools and applications are developed in this stage to enable end users to access and analyze the data stored in the data warehouse. This may involve creating dashboards, reports, and OLAP cubes.

7. Testing and Quality Assurance: Rigorous testing is conducted to ensure data accuracy, the integrity of the ETL processes, and report functionality. This includes unit testing, integration testing, and user acceptance testing.

8. Deployment: Once the data warehouse and associated BI tools are thoroughly tested and validated, they are deployed for production use. Users can begin accessing and utilizing the data for reporting and analysis.

9. Training and User Education: Training sessions are conducted to educate end users and business analysts on how to use the data warehouse and BI tools effectively. This step is crucial for ensuring the successful adoption of the data warehouse.

10. Rollout and Support: After deployment, the data warehouse and BI environment are monitored and supported to address any issues, optimize performance, and incorporate user feedback.

11. Maintenance and Growth: Data warehouses are not static; they require ongoing maintenance, enhancements, and the incorporation of new data sources or business requirements. This phase involves continuously adapting and evolving the data warehouse to meet changing business needs.

12. Retirement or Archiving: As data ages and becomes less relevant for business analysis, it may be archived or retired from the data warehouse to reduce storage costs and maintain performance.

The Kimball Lifecycle is an iterative process, meaning that as new business requirements arise or data sources change, the cycle may begin again from the Business Understanding stage. This methodology emphasizes user involvement, a focus on business goals, and a pragmatic approach to data warehousing, making it a widely adopted framework in the industry.

b) What is a Data warehouse? Explain the properties of the Data warehouse architecture. [5]

--> A data warehouse is a centralized repository that is used for storing and managing large volumes of data from various sources within an organization. It is designed to support business intelligence (BI) activities, data analysis, and reporting. Data warehouses are a critical component in the field of data management and are used to consolidate and integrate data from different operational systems, making it available for analysis and decision-making. There are two approaches for constructing a data warehouse: the top-down approach and the bottom-up approach. The top-down approach is explained below.

1. Top-down approach:
This approach is defined by Inmon as follows: the data warehouse is a central repository for the complete organisation, and data marts are created from it after the complete data warehouse has been created.

The essential components are discussed below:
1. External Sources – An external source is a source from where data is collected, irrespective of the type of data. Data can be structured, semi-structured or unstructured.
2. Stage Area – Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
   - E (Extract): Data is extracted from the external data source.
   - T (Transform): Data is transformed into the standard format.
   - L (Load): Data is loaded into the data warehouse after transforming it into the standard format.
3. Data warehouse – After cleansing, the data is stored in the data warehouse as the central repository. It stores the metadata, while the actual data gets stored in the data marts. The data warehouse stores the data in its purest form in this top-down approach.
4. Data Marts – A data mart is also a part of the storage component. It stores the information of a particular function of an organisation which is handled by a single authority. There can be as many data marts in an organisation as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
5. Data Mining – The practice of analysing the big data present in the data warehouse is data mining. It is used to find the hidden patterns present in the database or data warehouse with the help of data mining algorithms.

Q3) a) What is ETL? Explain data preprocessing techniques in detail. [6]

--> What is ETL?
ETL stands for Extract, Transform, Load, which is a process commonly used in data warehousing and data integration. It involves extracting data from various sources, transforming it to fit a common schema or structure, and then loading it into a data warehouse or another target system for analysis and reporting. ETL is essential for ensuring data quality and consistency in a data warehouse. The three main steps in ETL are as follows:

Extract:
Extraction involves retrieving data from one or more source systems, which can include databases, flat files, web services, APIs, or other data repositories. The goal is to collect raw data from these disparate sources.

Transform:
Transformation is the process of converting the extracted data into a format that is consistent, cleaned, and suitable for analysis. Transformations may include:
- Data cleaning: handling missing data, removing duplicates, and correcting errors.
- Data integration: combining data from multiple sources and resolving any inconsistencies.
- Data normalization: scaling numerical values to a common range to ensure they are directly comparable.

Load:
Loading is the process of inserting the transformed data into a target data repository, which is often a data warehouse. This repository is designed to facilitate querying, reporting, and analysis. Loading can involve tasks such as:
- Creating or updating tables and data structures in the data warehouse.
- Managing data partitions to optimize query performance.
- Populating indexes and metadata for efficient data retrieval.
- Handling incremental data loads, where only new or changed data is added.

Data Preprocessing Techniques in Detail:

In addition to the ETL process, data preprocessing encompasses various techniques to prepare data for analysis, which include:

Data Cleaning:
- Handling Missing Data: imputing missing values or removing rows/columns with missing data.
- Removing Duplicates: identifying and eliminating duplicate records.
- Outlier Detection and Handling: identifying and addressing outlier values that deviate significantly from the norm.

Data Transformation:
- Normalization: scaling numerical features to a common range, such as between 0 and 1.
- Encoding Categorical Data: converting categorical data into a numerical format.
- Feature Engineering: creating new features from existing ones.
- Binning or Discretization: grouping continuous data into bins or categories.
- Logarithm and Power Transformations: applying logarithms or power functions to the data.

Data Reduction:
- Dimensionality Reduction: reducing the number of features while preserving essential information, often using techniques like PCA or LDA.
- Sampling: reducing the dataset size by selecting a representative subset of data points.

Data Integration:
- Combining data from multiple sources into a unified dataset.

Data Aggregation:
- Summarizing or aggregating data to a higher level of granularity, e.g., monthly sales from daily sales data.

Data Splitting:
- Dividing data into training and testing sets for machine learning or model evaluation.
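A few of these techniques (missing-value handling, duplicate removal, and min-max normalization) are shown in the short pandas sketch below; the tiny DataFrame is made up for the example.

```python
import pandas as pd

# Hypothetical raw data with a missing value and a duplicate row.
df = pd.DataFrame({
    "customer": ["A", "B", "B", "C"],
    "age":      [25, None, None, 40],
    "spend":    [200.0, 500.0, 500.0, 800.0],
})

df = df.drop_duplicates()                       # removing duplicates
df["age"] = df["age"].fillna(df["age"].mean())  # imputing missing data
# Min-max normalization: scale 'spend' into the range [0, 1].
df["spend_norm"] = (df["spend"] - df["spend"].min()) / (df["spend"].max() - df["spend"].min())
print(df)
```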

b) What is OLAP? Describe the characteristics of OLAP. [4]

--> OLAP, or Online Analytical Processing, is a method in computing that solves complex analytical problems. This business intelligence tool processes large amounts of data from a data mart, data warehouse or other data storage unit. OLAP uses cubes to display multiple categories of data.

The main characteristics of OLAP are as follows:
1. Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. It helps in carrying out slice and dice operations.
2. Multi-user support: Since OLAP techniques are shared, an OLAP system should provide the normal database operations, including retrieval, update, concurrency control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-ends. The OLAP operations should sit between the data sources (e.g., data warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from the data sources.
5. OLAP provides for distinguishing between zero values and missing values so that aggregates are computed correctly.
6. An OLAP system should ignore all missing values and compute correct aggregate values.
7. OLAP facilitates interactive querying and complex analysis for the users.
8. OLAP provides the ability to perform intricate calculations and comparisons.

OR

a) Describe ETL. What are the tasks to be performed during data transformation? [6]

--> ETL stands for Extract, Transform, Load, a process commonly used in data warehousing and data integration. It involves extracting data from various sources, transforming it to fit a common schema or structure, and then loading it into a data warehouse or another target system for analysis and reporting. ETL is essential for ensuring data quality and consistency in a data warehouse.

How ETL Works
ETL consists of three separate phases: extraction, transformation, and loading. The main tasks performed during data transformation are as follows:

1. Data Extraction: Extraction is the initial step of ETL. It involves retrieving data from diverse source systems, which can include databases, flat files, external applications, and more. Data extraction can be achieved using methods like batch processing, real-time streaming, or change data capture (CDC), depending on the source system's capabilities and the ETL tool in use.

2. Data Cleansing: Data cleansing is a crucial task in data transformation. It involves identifying and rectifying data quality issues, such as missing values, duplicate records, and inaccuracies. Techniques like data validation, data profiling, and outlier detection are applied to ensure that the data is accurate and reliable.


3. Data Validation: Data validation checks are performed to ensure that data conforms to predefined rules and constraints. Validation rules can include data type checks, range checks, and format checks to ensure data integrity.

4. Data Transformation: This is the core of the data transformation process. It involves reshaping and reformatting the data to meet the requirements of the target system or data warehouse. Transformation tasks can include data aggregation, pivoting, unpivoting, merging, splitting, and data normalization to ensure data consistency and readiness for analysis.

5. Data Aggregation: Data aggregation is the process of summarizing detailed data into higher-level categories. This can be necessary for creating summary reports and improving query performance in the data warehouse.

6. Data Normalization: Data normalization involves converting data into a consistent format, for example standardizing date formats or unit conversions to ensure data consistency.

7. Data Deduplication: Data deduplication is the removal of duplicate records from the dataset. Deduplication helps maintain data accuracy and prevents redundancy in the data warehouse.

What are the basic operations of OLAP? [4]

--> Online Analytical Processing (OLAP) supports several fundamental operations that enable users to interactively analyze and explore multidimensional data efficiently. The basic OLAP operations include:

1. Roll-Up (Drill-Up): Roll-up, also known as drill-up, allows users to aggregate data from a lower level of granularity to a higher level within a dimension hierarchy. For example, you can roll up monthly sales data to view quarterly or yearly summaries.

2. Drill-Down (Roll-Down): Drill-down, or roll-down, enables users to explore data from a higher level of granularity to a lower level within a dimension hierarchy. For instance, you can drill down from yearly revenue to view monthly or daily details.

3. Slice: Slicing involves selecting a single "slice" of data across one dimension, keeping all other dimensions constant. This allows users to focus on a specific subset of the data for analysis.

4. Dice: Dicing is the process of creating a subcube by selecting specific dimensions and members from a multidimensional cube. Users can focus on a smaller, more specialized portion of the data.

5. Pivot (Rotate): Pivoting, also known as rotating, allows users to view data from a different perspective by changing the orientation of the dimensions in the analysis. For example, you can pivot a report to view products by region rather than regions by product.
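The roll-up, slice, and pivot operations can be mimicked directly with pandas on a small invented sales table, as in the sketch below (a relational OLAP tool would issue the equivalent GROUP BY / WHERE SQL).

```python
import pandas as pd

# Hypothetical monthly sales at (month, product, region) granularity.
sales = pd.DataFrame({
    "month":   ["2024-01", "2024-02", "2024-04", "2024-05"],
    "product": ["laptop",  "laptop",  "phone",   "laptop"],
    "region":  ["West",    "West",    "North",   "North"],
    "revenue": [100.0,     120.0,     80.0,      90.0],
})
sales["quarter"] = pd.to_datetime(sales["month"]).dt.to_period("Q")

# Roll-up: aggregate months up to quarters within the time hierarchy.
print(sales.groupby(["quarter", "product"])["revenue"].sum())

# Slice: fix one dimension (region == "West") and keep the others.
print(sales[sales["region"] == "West"])

# Pivot: rotate the view to products-by-region.
print(sales.pivot_table(index="product", columns="region", values="revenue", aggfunc="sum"))
```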
ensure data consistency. product. This segment commonly employs stake measures that cooperate with the data
The actual source of data is the Database, data warehouse, World Wide Web mining modules to focus the search towards fascinating patterns. It might utilize a
(WWW), text files, and other documents. You need a huge amount of historical stake threshold to filter out discovered patterns. On the other hand, the pattern
data for data mining to be successful. Organizations typically store data in evaluation module might be coordinated with the mining module, depending on
7. Data Deduplication: o Data deduplication is the removal of duplicate records
from the dataset. Deduplication helps maintain data accuracy and prevents Q4) a)What is Data mining? Explain the architecture of Data mining. [3] databases or data warehouses. Data warehouses may comprise one or more the implementation of the data mining techniques used. For efficient data mining,
redundancy in the data warehouse. databases, text files spreadsheets, or other repositories of data. Sometimes, even it is abnormally suggested to push the evaluation of pattern stake as much as
--> Data mining is a significant method where previously unknown and plain text files or spreadsheets may contain information. Another primary source of possible into the mining procedure to confine the search to only fascinating
potentially useful information is extracted from the vast amount of data. The data data is the World Wide Web or the internet. patterns.
mining process involves several components, and these components constitute a
data mining system architecture. Different processes: Graphical User Interface:
What are the basic operations of OLAP? [4]
--> Online Analytical Processing (OLAP) supports several Data Mining Architecture Before passing the data to the database or data warehouse server, the data must The graphical user interface (GUI) module communicates between the data mining
fundamental operations that enable users to interactively be cleaned, integrated, and selected. As the information comes from various system and the user. This module helps the user to easily and efficiently use the
analyze and explore multidimensional data efficiently. The basic The significant components of data mining systems are a data source, data mining sources and in different formats, it can't be used directly for the data mining system without knowing the complexity of the process. This module cooperates
OLAP operations include: engine, data warehouse server, the pattern evaluation module, graphical user procedure because the data may not be complete and accurate. So, the first data with the data mining system when the user specifies a query or a task and displays
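The five operations can be mimicked on a tiny table with pandas. The sales cube
below is a made-up example, not data taken from the question.

import pandas as pd

# A tiny hypothetical sales cube with time, item and region dimensions.
sales = pd.DataFrame({
    "year":   [2022, 2022, 2022, 2023, 2023, 2023],
    "month":  ["Jan", "Feb", "Feb", "Jan", "Jan", "Mar"],
    "item":   ["Pen", "Pen", "Book", "Pen", "Book", "Book"],
    "region": ["West", "West", "East", "East", "West", "East"],
    "sales":  [100, 120, 80, 90, 150, 60],
})

# Roll-up: aggregate month-level rows up to yearly totals.
rollup = sales.groupby("year")["sales"].sum()

# Drill-down: move back to a finer granularity (year -> month).
drilldown = sales.groupby(["year", "month"])["sales"].sum()

# Slice: fix a single dimension (year = 2022) and keep the rest.
slice_2022 = sales[sales["year"] == 2022]

# Dice: select specific members on two or more dimensions.
dice = sales[(sales["item"] == "Pen") & (sales["region"] == "West")]

# Pivot (rotate): view items by region instead of region by item.
pivot = sales.pivot_table(values="sales", index="item", columns="region", aggfunc="sum")

print(rollup, drilldown, slice_2022, dice, pivot, sep="\n\n")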
Q4) a) What is Data mining? Explain the architecture of Data mining. [3]

--> Data mining is a significant method where previously unknown and potentially
useful information is extracted from the vast amount of data. The data mining
process involves several components, and these components constitute a data
mining system architecture.

Data Mining Architecture

The significant components of data mining systems are a data source, data mining
engine, data warehouse server, the pattern evaluation module, graphical user
interface, and knowledge base.

Data Source:
The actual source of data is the database, data warehouse, World Wide Web
(WWW), text files, and other documents. You need a huge amount of historical data
for data mining to be successful. Organizations typically store data in databases or
data warehouses. Data warehouses may comprise one or more databases, text files,
spreadsheets, or other repositories of data. Sometimes, even plain text files or
spreadsheets may contain information. Another primary source of data is the World
Wide Web or the internet.

Different processes:
Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources
and in different formats, it cannot be used directly for the data mining procedure,
because the data may not be complete and accurate. So, the data first needs to be
cleaned and unified. More information than needed will be collected from various
data sources, and only the data of interest has to be selected and passed to the
server. These procedures are not as easy as they sound. Several methods may be
performed on the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:
The database or data warehouse server contains the actual data that is ready to be
processed. The server is responsible for retrieving the relevant data based on the
user's data mining request.

Data Mining Engine:
The data mining engine is a major component of any data mining system. It
contains several modules for performing data mining tasks, including association,
characterization, classification, clustering, prediction, and time-series analysis. In
other words, the data mining engine is the core of the data mining architecture. It
comprises the instruments and software used to obtain insights and knowledge from
the data collected from various data sources and stored within the data warehouse.

Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring the
interestingness of discovered patterns using a threshold value. It collaborates with
the data mining engine to focus the search on interesting patterns. It typically
employs interestingness measures that cooperate with the data mining modules to
direct the search towards interesting patterns, and it may use an interestingness
threshold to filter out discovered patterns. Alternatively, the pattern evaluation
module may be integrated with the mining module, depending on the
implementation of the data mining techniques used. For efficient data mining, it is
highly recommended to push the evaluation of pattern interestingness as deep as
possible into the mining procedure so as to confine the search to only interesting
patterns.

Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining
system and the user. This module helps the user use the system easily and
efficiently without knowing the complexity of the process. It cooperates with the
data mining system when the user specifies a query or a task and displays the
results.

Knowledge Base:
The knowledge base is helpful in the entire data mining process. It may be used to
guide the search or to evaluate the interestingness of the resulting patterns. The
knowledge base may even contain user views and data from user experiences that
can be helpful in the data mining process. The data mining engine may receive
inputs from the knowledge base to make the results more accurate and reliable, and
the pattern evaluation module interacts with the knowledge base regularly to get
inputs and to update it.
b) Apply the FP-Tree algorithm to construct an FP-Tree and find the frequent
itemsets for the dataset given below (minimum support = 30%). [7]

Transaction ID    List of Products
1                 Apple, Berries, Coconut
2                 Berries, Coconut, Dates
3                 Coconut, Dates
4                 Berries, Dates
5                 Apple, Coconut
6                 Apple, Coconut, Dates

--> To construct an FP-Tree and find frequent itemsets using the FP-Growth
algorithm, we follow these steps:

1. Scan the dataset to count the support of each item and identify frequent items.
2. Sort the frequent items in descending order of support.
3. Re-scan the dataset and construct the FP-Tree.
4. Generate conditional pattern bases (conditional FP-Trees) for each frequent item.
5. Recursively mine the conditional pattern bases to find frequent itemsets.

Let's go through these steps for the given dataset with a minimum support of 30%.

Step 1: Count Support and Identify Frequent Items
We first count the support of each item:

Apple: 3
Berries: 3
Coconut: 5
Dates: 4

Now, let's identify the frequent items (support >= 30% * 6 = 1.8, i.e. a count of at
least 2). All four items qualify:

Apple: 3
Berries: 3
Coconut: 5
Dates: 4

Step 2: Sort Frequent Items
Sort the frequent items in descending order of support:

Coconut (5)
Dates (4)
Apple (3)
Berries (3)

Step 3: Construct the FP-Tree
We process the transactions one by one, with the items in each transaction ordered
as in Step 2 (Coconut, Dates, Apple, Berries).

Transaction 1: Apple, Berries, Coconut -> ordered: Coconut, Apple, Berries
FP-Tree after this transaction:
Root
  Coconut (1)
    Apple (1)
      Berries (1)

Transaction 2: Berries, Coconut, Dates -> ordered: Coconut, Dates, Berries
FP-Tree after this transaction:
Root
  Coconut (2)
    Apple (1)
      Berries (1)
    Dates (1)
      Berries (1)

Transaction 3: Coconut, Dates -> ordered: Coconut, Dates
FP-Tree after this transaction:
Root
  Coconut (3)
    Apple (1)
      Berries (1)
    Dates (2)
      Berries (1)

Transaction 4: Berries, Dates -> ordered: Dates, Berries (a new branch from the root)
FP-Tree after this transaction:
Root
  Coconut (3)
    Apple (1)
      Berries (1)
    Dates (2)
      Berries (1)
  Dates (1)
    Berries (1)

Transaction 5: Apple, Coconut -> ordered: Coconut, Apple
FP-Tree after this transaction:
Root
  Coconut (4)
    Apple (2)
      Berries (1)
    Dates (2)
      Berries (1)
  Dates (1)
    Berries (1)

Transaction 6: Apple, Coconut, Dates -> ordered: Coconut, Dates, Apple
Final FP-Tree:
Root
  Coconut (5)
    Apple (2)
      Berries (1)
    Dates (3)
      Berries (1)
      Apple (1)
  Dates (1)
    Berries (1)

Step 4: Generate Conditional Pattern Bases
For each frequent item (taken from the least frequent upwards), we collect the prefix
paths in the tree that lead to it:

Berries: {Coconut, Apple}: 1, {Coconut, Dates}: 1, {Dates}: 1
Apple:   {Coconut}: 2, {Coconut, Dates}: 1
Dates:   {Coconut}: 3
Coconut: (empty, since Coconut sits directly under the root)

Now we can mine these conditional pattern bases for frequent itemsets.
Step 5: Recursively Mine the Conditional Pattern Bases
From each conditional pattern base we keep only the combinations whose count is at
least 2 (30% of 6 transactions = 1.8):

From Berries: {Berries, Coconut} (support: 2) and {Berries, Dates} (support: 2).
From Apple:   {Apple, Coconut} (support: 3).
From Dates:   {Dates, Coconut} (support: 3).
From Coconut: no further itemsets.

So, the frequent itemsets with a minimum support of 30% are:

Coconut (support: 5)
Dates (support: 4)
Apple (support: 3)
Berries (support: 3)
Coconut, Dates (support: 3)
Coconut, Apple (support: 3)
Coconut, Berries (support: 2)
Dates, Berries (support: 2)

These are the frequent itemsets for the given dataset.
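As a cross-check of the result above, a brute-force support count over the six
transactions (independent of the FP-tree itself) yields the same itemsets. This is only
a verification sketch, not part of the FP-Growth procedure.

from itertools import combinations

# The six transactions from the question.
transactions = [
    {"Apple", "Berries", "Coconut"},
    {"Berries", "Coconut", "Dates"},
    {"Coconut", "Dates"},
    {"Berries", "Dates"},
    {"Apple", "Coconut"},
    {"Apple", "Coconut", "Dates"},
]
min_count = 0.30 * len(transactions)   # 30% of 6 transactions = 1.8

items = sorted(set().union(*transactions))
for size in range(1, len(items) + 1):
    for itemset in combinations(items, size):
        support = sum(1 for t in transactions if set(itemset) <= t)
        if support >= min_count:
            print(set(itemset), "support =", support)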
OR

a) Explain data mining techniques in brief. [3]

--> Data mining includes the utilization of refined data analysis tools to find
previously unknown, valid patterns and relationships in huge data sets. These tools
can incorporate statistical models, machine learning techniques, and mathematical
algorithms, such as neural networks or decision trees. Thus, data mining
incorporates analysis and prediction.

Drawing on methods and technologies from the intersection of machine learning,
database management, and statistics, professionals in data mining have devoted
their careers to better understanding how to process and draw conclusions from
huge amounts of data, but what are the methods they use to make it happen? In
recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.

Data Mining Techniques

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example multimedia,
spatial data, text data, time-series data, World Wide Web data, and so on.

ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example object-oriented
database, transactional database, relational database, and so on.

iii. Classification of data mining frameworks as per the kind of knowledge
discovered:
This classification depends on the types of knowledge discovered or the data
mining functionalities, for example discrimination, classification, clustering, and
characterization. Some frameworks are extensive frameworks offering several data
mining functionalities together.

iv. Classification of data mining frameworks according to the data mining
techniques used:
This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, and
data-warehouse-oriented or database-oriented approaches.

1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. It helps to classify data into different classes.

2. Clustering:
Clustering is a division of information into groups of connected objects. Describing
the data by a few clusters loses certain fine details but achieves simplification; the
data is modelled by its clusters. Historically, clustering is rooted in statistics,
mathematics, and numerical analysis. From a machine learning point of view,
clusters relate to hidden patterns, the search for clusters is unsupervised learning,
and the resulting framework represents a data concept. From a practical point of
view, clustering plays an extraordinary role in data mining applications such as
scientific data exploration, text mining, information retrieval, spatial database
applications, CRM, web analysis, computational biology, and medical diagnostics.

In other words, clustering analysis is a data mining technique used to identify
similar data. It helps to recognize the differences and similarities between the data.
Clustering is very similar to classification, but it involves grouping chunks of data
together based on their similarities.

3. Regression:
Regression analysis is used in the data mining process to identify and analyze the
relationship between variables in the presence of other factors. It is used to define
the probability of a specific variable. Regression is primarily a form of planning
and modeling; for example, we might use it to project certain costs depending on
other factors such as availability, consumer demand, and competition. Primarily, it
gives the exact relationship between two or more variables in the given data set.

4. Association Rules:
This data mining technique helps to discover a link between two or more items. It
finds hidden patterns in the data set. Association rules are if-then statements that
show the probability of interactions between data items within large data sets in
different types of databases. Association rule mining has several applications and is
commonly used to discover sales correlations in transactional data or in medical
data sets.

The way the algorithm works is that you have various data, for example a list of
grocery items that you have been buying for the last six months. It calculates the
percentage of items being purchased together.

There are three major measurement techniques (a small sketch follows them):

o Lift: This measurement technique compares the confidence of the rule with how
often item B is purchased overall:
Lift(A -> B) = Confidence(A -> B) / ((transactions containing B) / (entire dataset))

o Support: This measurement technique measures how often the items are
purchased together, compared to the overall dataset:
Support(A -> B) = (transactions containing both A and B) / (entire dataset)
o Confidence: This measurement technique measures how often item B is purchased
when item A is purchased as well:
Confidence(A -> B) = (transactions containing both A and B) / (transactions containing A)
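A minimal sketch of how the three measures are computed; the basket list below is
invented purely for illustration and evaluates the rule Bread -> Butter.

# Made-up market-basket data used only to illustrate support, confidence and lift.
baskets = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Jam"},
    {"Milk", "Butter"},
    {"Bread", "Butter", "Jam"},
]
n = len(baskets)

both   = sum(1 for b in baskets if {"Bread", "Butter"} <= b)   # A and B together
with_a = sum(1 for b in baskets if "Bread" in b)               # transactions with A
with_b = sum(1 for b in baskets if "Butter" in b)              # transactions with B

support    = both / n                   # (A and B) / entire dataset       -> 0.6
confidence = both / with_a              # (A and B) / transactions with A  -> 0.75
lift       = confidence / (with_b / n)  # confidence / support of B        -> 0.9375
print(support, confidence, lift)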
5. Outlier Detection:
This type of data mining technique relates to the observation of data items in the
data set which do not match an expected pattern or expected behavior. It may be
used in various domains like intrusion detection, fraud detection, etc. It is also
known as outlier analysis or outlier mining. An outlier is a data point that diverges
too much from the rest of the dataset, and the majority of real-world datasets contain
outliers. Outlier detection plays a significant role in the data mining field and is
valuable in numerous areas such as network intrusion identification, credit or debit
card fraud detection, and detecting outliers in wireless sensor network data.

6. Sequential Patterns:
The sequential pattern technique is a data mining technique specialized for
evaluating sequential data to discover sequential patterns. It comprises finding
interesting subsequences in a set of sequences, where the interestingness of a
sequence can be measured in terms of different criteria like length, occurrence
frequency, etc. In other words, this technique helps to discover or recognize similar
patterns in transaction data over time.

7. Prediction:
Prediction uses a combination of other data mining techniques such as trend
analysis, clustering, classification, etc. It analyzes past events or instances in the
right sequence to predict a future event.
b) How does the KNN algorithm work? [7]
Apply the KNN classification algorithm to the given dataset and predict the class
for X (P1 = 3, P2 = 7), with K = 3.

P1   P2   Class
7    7    False
7    4    False
3    4    True
1    4    True

--> The k-Nearest Neighbors (KNN) algorithm is a supervised machine learning
algorithm used for classification and regression tasks. It makes predictions based on
the similarity between data points. Here's how the KNN algorithm works:

Select the Number of Neighbors (k):
You need to specify the number of nearest neighbors, denoted as 'k', that will be
used to make predictions. This is typically an odd number to avoid ties in
classification tasks.

Choose a Distance Metric:
KNN uses a distance metric, such as Euclidean distance, Manhattan distance, or
cosine similarity, to measure the similarity between data points. The choice of
distance metric depends on the nature of the data.

Training:
KNN does not explicitly build a model during training. Instead, it stores the training
dataset.

Prediction for Classification:
To classify a new data point, KNN finds the k nearest neighbors from the training
dataset based on the chosen distance metric. It does this by calculating the distance
between the new data point and all data points in the training set. The k nearest
neighbors are the data points with the shortest distances.

Majority Voting:
For classification, KNN assigns the class label to the new data point based on
majority voting among its k nearest neighbors. The class label that occurs most
frequently among the k neighbors is the predicted class for the new data point.

Prediction for Regression:
In regression tasks, KNN predicts a continuous value for the new data point by
averaging the target values of its k nearest neighbors. The algorithm assigns the
predicted class (for classification) or predicted value (for regression) to the new data
point.

Evaluate Performance:
After making predictions, you can evaluate the model's performance using various
metrics like accuracy, precision, recall, F1-score, or Mean Squared Error (MSE),
depending on the type of task.

To apply the KNN classification algorithm and predict the class for the new data
point X (P1 = 3, P2 = 7) with K = 3, we calculate the distances between X and all
the data points in the dataset and then use majority voting to determine the class.

Distance between X (3, 7) and (7, 7):
Distance = sqrt((3 - 7)^2 + (7 - 7)^2) = sqrt(16) = 4.0

Distance between X (3, 7) and (7, 4):
Distance = sqrt((3 - 7)^2 + (7 - 4)^2) = sqrt(16 + 9) = sqrt(25) = 5.0

Distance between X (3, 7) and (3, 4):
Distance = sqrt((3 - 3)^2 + (7 - 4)^2) = sqrt(9) = 3.0

Distance between X (3, 7) and (1, 4):
Distance = sqrt((3 - 1)^2 + (7 - 4)^2) = sqrt(4 + 9) = sqrt(13) ≈ 3.61

Now that we have calculated all the distances, the three nearest neighbors of X are:

Nearest neighbor: (3, 4) with a distance of 3.0.
Second nearest neighbor: (1, 4) with a distance of 3.61.
Third nearest neighbor: (7, 7) with a distance of 4.0.

Among these three neighbors, two belong to the class "True" and one belongs to the
class "False". Since K = 3, we choose the majority class among these neighbors,
which is "True". Therefore, the predicted class for X (P1 = 3, P2 = 7) with K = 3 is
"True".

Final Prediction: X (P1 = 3, P2 = 7) is predicted to belong to the "True" class based
on the KNN algorithm with K = 3.
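A compact sketch of the computation just shown, using the four training points from
the question; it reproduces the distances of the three nearest neighbours and the
majority vote.

from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

# Training data from the question: (P1, P2) -> class label.
train = [((7, 7), "False"), ((7, 4), "False"), ((3, 4), "True"), ((1, 4), "True")]
x, k = (3, 7), 3

# Sort the training points by their distance to X and keep the k nearest.
neighbours = sorted(train, key=lambda p: dist(x, p[0]))[:k]
for point, label in neighbours:
    print(point, label, round(dist(x, point), 2))   # (3,4) 3.0, (1,4) 3.61, (7,7) 4.0

# Majority vote among the k nearest neighbours.
prediction = Counter(label for _, label in neighbours).most_common(1)[0][0]
print("Predicted class:", prediction)               # True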
Q5 A) a) What is text mining? Explain the process of text mining. [4]

--> Text mining is a component of data mining that deals specifically with
unstructured text data. It involves the use of natural language processing (NLP)
techniques to extract useful information and insights from large amounts of
unstructured text data. Text mining can be used as a preprocessing step for data
mining or as a standalone process for specific tasks.

By using text mining, unstructured text data can be transformed into structured data
that can be used for data mining tasks such as classification, clustering, and
association rule mining. This allows organizations to gain insights from a wide
range of data sources, such as customer feedback, social media posts, and news
articles.

What is the common usage of text mining?
Text mining is widely used in various fields, such as natural language processing,
information retrieval, and social media analysis. It has become an essential tool for
organizations to extract insights from unstructured text data and make data-driven
decisions.

"Extraction of interesting information or patterns from data in large databases is
known as data mining." Text mining, in turn, is the process of extracting useful
information and nontrivial patterns from a large volume of text databases.

Conventional Process of Text Mining
 Gathering unstructured information from various sources available in different
document formats, for example plain text, web pages, PDF records, etc.
 Pre-processing and data cleansing tasks are performed to identify and eliminate
inconsistency in the data. The data cleansing process makes sure to capture the
genuine text; it includes removing stop words, stemming (the process of
identifying the root of a certain word) and indexing the data.
 Processing and controlling tasks are applied to review and further clean the data
set.
 Pattern analysis is implemented in the Management Information System.
 The information processed in the above steps is utilized to extract important and
applicable data for a powerful and convenient decision-making process and for
trend analysis (a small preprocessing sketch follows this list).
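A plain-Python sketch of the gathering and pre-processing steps described above;
the documents and the stop-word list are assumptions made up for illustration.

import re
from collections import Counter

# A made-up document collection standing in for the gathered text.
docs = [
    "Text mining extracts useful patterns from large text collections.",
    "Data mining and text mining often work together on customer feedback.",
]

STOP_WORDS = {"a", "an", "the", "and", "from", "on", "of", "often"}   # assumed list

def preprocess(text):
    # Cleansing: lowercase the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop-word removal (a simple stand-in for full stop-word removal and stemming).
    return [t for t in tokens if t not in STOP_WORDS]

# Indexing: term frequencies across the cleaned collection.
term_counts = Counter(t for d in docs for t in preprocess(d))
print(term_counts.most_common(5))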
b) Explain the K-means algorithm. Apply the K-means algorithm to group visitors
to a website into two groups using their ages:
15, 16, 19, 20, 21, 28, 35, 40, 42, 44, 60, 65
(Consider initial centroids 16 and 28 for the two groups.) [6]

--> K-Means is a popular unsupervised machine learning algorithm used for
clustering data points into groups or clusters. The goal of the K-Means algorithm is
to partition the data into K clusters, where each data point belongs to the cluster
with the nearest centroid. The steps of the K-Means algorithm are as follows (a short
sketch of the procedure in code is given after this answer):

Initialization: Choose the number of clusters (K) and initialize K centroids.
Centroids are points that represent the centers of the clusters. They can be chosen
randomly or by some other method; in this case the initial centroids are given as 16
and 28.

Assignment: Assign each data point to the nearest centroid. This is done by
calculating the Euclidean distance (or another distance metric) between each data
point and all centroids and assigning the data point to the cluster associated with the
nearest centroid.

Update: Recalculate the centroid of each cluster by taking the mean of all data
points assigned to that cluster.

Repeat: Repeat the assignment and update steps until the centroids no longer
change significantly or until a predefined number of iterations is reached.

Termination: The algorithm terminates when the centroids stabilize, and the
clusters are formed.

Now, let's apply the K-Means algorithm to group the visitors using their ages
15, 16, 19, 20, 21, 28, 35, 40, 42, 44, 60, 65 with initial centroids 16 and 28.

Initial centroids:
Centroid 1: 16
Centroid 2: 28

Iteration 1 - Assignment:
We calculate the distance of each data point to the initial centroids and assign it to
the nearest centroid:
Data point 15 is closer to Centroid 1 (distance = 1).
Data point 16 coincides with Centroid 1 (distance = 0), so it is assigned to Centroid 1.
Data points 19, 20 and 21 are closer to Centroid 1.
Data points 28, 35, 40, 42, 44, 60 and 65 are closer to Centroid 2.

Iteration 1 - Update:
New Centroid 1: (15 + 16 + 19 + 20 + 21) / 5 = 18.2
New Centroid 2: (28 + 35 + 40 + 42 + 44 + 60 + 65) / 7 ≈ 44.86

Iteration 2 - Assignment:
With the updated centroids, data point 28 is now closer to Centroid 1
(|28 - 18.2| = 9.8) than to Centroid 2 (|28 - 44.86| ≈ 16.86), so it moves to the first
cluster. The other assignments do not change.

Iteration 2 - Update:
New Centroid 1: (15 + 16 + 19 + 20 + 21 + 28) / 6 ≈ 19.83
New Centroid 2: (35 + 40 + 42 + 44 + 60 + 65) / 6 ≈ 47.67

Iteration 3 - Assignment:
No data point changes its cluster, so the centroids no longer change.

Termination:
Once the centroids stabilize, the algorithm terminates. The visitors are now divided
into two groups based on their ages:

Cluster 1 (Centroid 1 ≈ 19.83): 15, 16, 19, 20, 21, 28
Cluster 2 (Centroid 2 ≈ 47.67): 35, 40, 42, 44, 60, 65

These are the two groups of website visitors based on their ages determined by the
K-Means algorithm.
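A short sketch of the procedure described above, written for one-dimensional data.
Run with the ages and the initial centroids 16 and 28 it reproduces the iterations in
this answer; the same function run with D = {2, 3, 4, 10, 11, 12, 20, 25, 30} and
initial centroids 3 and 25 reproduces the next answer as well.

def kmeans_1d(points, centroids, max_iter=100):
    # Plain 1-D K-Means: assign each point to its nearest centroid, recompute
    # the centroids as cluster means, and repeat until they stop moving.
    # (Assumes no cluster ever becomes empty, which holds for these data.)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:    # converged: centroids did not change
            break
        centroids = new_centroids
    return centroids, clusters

ages = [15, 16, 19, 20, 21, 28, 35, 40, 42, 44, 60, 65]
centroids, clusters = kmeans_1d(ages, [16, 28])
print(centroids)   # approximately [19.83, 47.67]
print(clusters)    # [[15, 16, 19, 20, 21, 28], [35, 40, 42, 44, 60, 65]]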
OR

a) Apply the K-means algorithm to the given data set, where K is the number of
clusters: D = {2, 3, 4, 10, 11, 12, 20, 25, 30}, K = 2. [6]

--> Let's apply the K-Means algorithm to the given data set with K = 2 clusters and
the data points D = {2, 3, 4, 10, 11, 12, 20, 25, 30}.

Initial Centroids:
We start by selecting initial centroids for the two clusters. For simplicity, we can
take two of the data points as the initial centroids; let's choose 3 and 25.

Iteration 1 - Assignment:
Calculate the distance of each data point to the centroids and assign each data point
to the nearest centroid. For this iteration:
Data points closer to 3: {2, 3, 4, 10, 11, 12}
Data points closer to 25: {20, 25, 30}
Iteration 1 - Update:
Calculate the mean of the data points in each cluster to get the new centroids:
New Centroid 1 = (2 + 3 + 4 + 10 + 11 + 12) / 6 = 7.0
New Centroid 2 = (20 + 25 + 30) / 3 = 25.0

Iteration 2 - Assignment:
Calculate the distance of each data point to the updated centroids and assign each
data point to the nearest centroid. For this iteration:
Data points closer to 7.0: {2, 3, 4, 10, 11, 12}
Data points closer to 25.0: {20, 25, 30}

Iteration 2 - Update:
Calculate the mean of the data points in each cluster to get the new centroids:
New Centroid 1 = (2 + 3 + 4 + 10 + 11 + 12) / 6 = 7.0
New Centroid 2 = (20 + 25 + 30) / 3 = 25.0

As you can see, the centroids did not change in the second iteration, which means
the K-Means algorithm has converged.

The final clustering for K = 2 is as follows:
Cluster 1 (Centroid 1 = 7.0): {2, 3, 4, 10, 11, 12}
Cluster 2 (Centroid 2 = 25.0): {20, 25, 30}

These are the two clusters formed by the K-Means algorithm for the given data set
with K = 2.
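The kmeans_1d sketch given after the previous answer reproduces this grouping too;
as a quick standalone check of the centroid arithmetic, the two cluster means are:

cluster_1 = [2, 3, 4, 10, 11, 12]
cluster_2 = [20, 25, 30]
print(sum(cluster_1) / len(cluster_1))   # 42 / 6 = 7.0
print(sum(cluster_2) / len(cluster_2))   # 75 / 3 = 25.0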
b) What are the different types of web mining? [4]

--> Web mining is the process of using data mining techniques to extract useful
patterns, trends, and information, generally from web-based records and services,
server logs, and hyperlinks. The main goal of web mining is to find patterns in web
data by collecting and analyzing the data to obtain important insights.

There are various types of web mining, which are as follows:

Web Content Mining − Web content mining is a procedure in which essential,
descriptive data is extracted from websites (WWW). Content includes audio, video,
text documents, hyperlinks, and structured records. Web contents are designed to
deliver information to users in the form of text, lists, images, videos, and tables. The
function of content mining is data extraction, where structured data is extracted
from unstructured websites. The goal is to support data aggregation over several
websites by utilizing the extracted structured data.

Web Structure Mining − Web structure mining is one of the core techniques of web
mining and deals with the hyperlink structure of the web. Structure mining
essentially shows a structured summary of a website; it recognizes relationships
among the linked web pages of websites. Web mining is simply data mining that
extracts information from the web, and several algorithmic techniques are used to
find data on the web. Structure mining analyzes the hyperlinks of a website to
assemble informative records and sort them out by elements such as similarities and
relationships. Mining implemented at the document level is called intra-page
mining, while mining at the hyperlink level is called inter-page mining.

Web Usage Mining − Web usage mining is used to extract useful records,
information, and knowledge from weblog data, and it helps in identifying user
access patterns for web pages.