Research Activity 1


Mark Walter E. Bilaro
MR. DIONNEL NAVA
BSCS – 3A
Data Mining

Research the definition, applications, and algorithms of the following data mining techniques:

• Process Mining
• Data Stream Mining
• Data Scraping
• Bibliomining

ANSWERS:

1. Process Mining
a. Definition
i. Process mining is defined as an analytical discipline for discovering,
monitoring, and improving business processes as they actually are, not as you
think they might be.
ii. Process mining applies data science to discover, validate and improve
workflows. By combining data mining and process analytics, organizations can
mine log data from their information systems to understand the performance of
their processes, revealing bottlenecks and other areas of improvement. Process
mining leverages a data-driven approach to process optimization, allowing
managers to remain objective in their decision-making around resource
allocation for existing processes.
iii. Process mining focuses on different perspectives, such as control-flow,
organizational, case, and time. While much of the work around process mining
focuses on the sequence of activities — which is the control-flow — the other
perspectives also provide valuable information for management teams.
Organizational perspectives can surface the various resources within a process,
such as individual job roles or departments, and the time perspective can
demonstrate bottlenecks by measuring the processing time of different events
within a process.
iv. Types of process mining – there are three types of process mining: discovery,
conformance, and enhancement.
1. Discovery: Process discovery uses event log data to create a process model
without outside influence. Under this classification, no previous process models
would exist to inform the development of a new process model. This type of
process mining is the most widely adopted.
2. Conformance: Conformance checking confirms if the intended process model is
reflected in practice. This type of process mining compares a process description
to an existing process model based on its event log data, identifying any
deviations from the intended model.

3. Enhancement: This type of process mining has also been referred to as
extension, organizational mining, or performance mining. In this class
of process mining, additional information is used to improve an existing
process model. For example, the output of conformance checking can
assist in identifying bottlenecks within a process model, allowing
managers to optimize an existing process.
b. Applications
i. Education: Process mining can help identify effective course curriculums by
monitoring and evaluating student performance and behaviors, such as how
much time a student spends viewing class materials.
ii. Finance: Financial institutions have used process mining software to improve
inter-organizational processes, audit accounts, increase income, and broaden their
customer base.
iii. Public works: Process mining has been used to streamline the invoice process
for public works projects, which involve various stakeholders, such as
construction companies, cleaning businesses, and environmental bureaus.
iv. Software Development: Since engineering processes are typically
disorganized, process mining can help to identify a clearly documented process.
It can also help IT administrators monitor the process, allowing them to verify
that the system is running as expected.
v. Healthcare: Process mining provides recommendations for reducing the
treatment processing time of patients.
vi. E-commerce: It can provide insight into buyer behaviors and provide accurate
recommendations to increase sales.
vii. Manufacturing: Process mining can help to assign the appropriate resources
depending on the attributes of the case (in this context, the product), allowing
managers to transform their business operations. They can gain insight into
production times and reallocate resources, such as storage space, machines, or
workers, accordingly.
c. Algorithm
i. Alpha Miner is the first algorithm that bridges the gap between event logs, or
observed data, and the discovery of a process model. Alpha Miner can build
process models solely based on event logs by understanding the relations and
causalities between the steps of processes.

For example, today, several activities are executed in parallel rather than
sequentially to minimize the execution time of a process. Alpha Miner can detect
such parallel activities because it understands start and end events by using
double timestamps. Alpha Miner first identifies the traces involved in completing
activities A to C separately, and then restructures them into a single model
according to the relations between them.

To do this, Alpha Miner generates:

1. A Petri net model in which all the transitions are visible, unique, and
correspond to the classified events.
2. The initial marking — Describes the status of the Petri net model when
the execution starts.
3. The final marking — Describes the status of the Petri net model when
the execution ends.

The execution of the process starts from the events included in the initial
marking and finishes at the events included in the final marking.

Some of the characteristics of the algorithm:

1. It does not support classification: The algorithm does not recognize that
one process is the same as another and logs them as independent events.
2. It does not handle noisy data well. Noise is irrelevant or meaningless data
that occurs due to data quality issues (for example, data entry errors or
incomplete records) and negatively affects the accuracy and simplicity of the
process models.
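As an illustration only, the following minimal sketch shows how Alpha Miner discovery is typically run with the open-source pm4py library (the log file name is a placeholder); it yields the three artifacts listed above: a Petri net, an initial marking, and a final marking.

```python
import pm4py

# Placeholder event log file; any XES log with case, activity, and
# timestamp attributes would work here.
log = pm4py.read_xes("event_log.xes")

# Alpha Miner discovery: returns the Petri net plus its initial and
# final markings, i.e. where execution starts and where it ends.
net, initial_marking, final_marking = pm4py.discover_petri_net_alpha(log)

pm4py.view_petri_net(net, initial_marking, final_marking)
```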

ii. Heuristics Miner is a noise-tolerant algorithm; therefore, it is applied to
demonstrate process behavior in noisy data. The Heuristics Miner only considers
the order of the events within a case. For instance, it creates an event log table
with fields such as case ID, originator of the activity, timestamp, and the
activities that are considered during the mining. The timestamp of an activity is
used to calculate the ordering of events.

This technique includes three steps:

1. The construction of the dependency graph
2. The construction of the input and output expressions for each activity
3. The search for long-distance dependency relations

Some of the characteristics of the algorithm:

1. Takes the frequency of events into account
2. Can only describe events that are either exclusively dependent on each
other (AND), or completely independent from one another (XOR)
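A minimal sketch, again assuming the pm4py library and a placeholder log file: the dependency_threshold parameter is the knob that filters out infrequent (noisy) relations, which is what makes the algorithm noise tolerant.

```python
import pm4py

log = pm4py.read_xes("event_log.xes")  # placeholder file name

# Discover a heuristics net; relations whose dependency measure falls
# below the threshold are treated as noise and dropped from the model.
heuristics_net = pm4py.discover_heuristics_net(log, dependency_threshold=0.5)

pm4py.view_heuristics_net(heuristics_net)
```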

iii. Fuzzy Miner is one of the youngest algorithms and is suitable for mining less
structured processes that exhibit a large amount of unstructured and conflicting
behavior (it turns spaghetti-like models into more concise models). The processes
may be unstructured and conflicting because they might include information about:
1. Activities related to locations in a topology (like towns or road crossings)
2. Precedence relating to the traffic connections (like railways or motorways)

Fuzzy Miner applies a variety of techniques, such as removing unimportant edges,
clustering highly correlated nodes into a single node, and removing isolated node
clusters. Fuzzy Miner models cannot be converted to other types of process
modeling languages such as Business Process Model and Notation (BPMN) or
Business Process Execution Language (BPEL).
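The full Fuzzy Miner is more involved, but the core simplification idea (dropping insignificant edges from the directly-follows graph) can be sketched in a few lines of plain Python with a toy log; the traces and the threshold value below are illustrative assumptions.

```python
from collections import Counter

# Toy traces; each is an ordered list of activities from one case.
traces = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["A", "B", "C", "D"],
    ["A", "E", "D"],
]

# Count directly-follows edges across all traces.
edges = Counter()
for trace in traces:
    for src, dst in zip(trace, trace[1:]):
        edges[(src, dst)] += 1

# Keep only edges whose relative frequency (a crude "significance" score)
# exceeds a cut-off; infrequent edges are abstracted away, which is the
# simplification idea behind fuzzy mining.
total = sum(edges.values())
threshold = 0.10
simplified = {edge: count for edge, count in edges.items() if count / total >= threshold}

for (src, dst), count in sorted(simplified.items()):
    print(f"{src} -> {dst}: {count}")
```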

iv. Inductive Miner detects splits which are the conditions or connecting steps
between the first and the final stage of the process in the log. There are different
types of splits in an event log:
1. Sequential
2. Parallel
3. Concurrent
4. Loop

Once the Inductive Miner identifies the splits, it recurses on sub-logs until it
finds a base case. Inductive Miner models give a unique label to each visible
transition. These models use hidden transitions specifically for loop splits.
Based on the Inductive Miner algorithm, a Petri net or process tree can be
produced to map the process flow.
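A minimal sketch of this discovery step, assuming pm4py and a placeholder log file: the algorithm returns a process tree built from the recursive splits, which can then be converted to a Petri net.

```python
import pm4py

log = pm4py.read_xes("event_log.xes")  # placeholder file name

# Recursive split detection yields a process tree ...
tree = pm4py.discover_process_tree_inductive(log)

# ... which can be converted into an equivalent Petri net for analysis.
net, initial_marking, final_marking = pm4py.convert_to_petri_net(tree)

pm4py.view_process_tree(tree)
```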

v. Genetic Miner is an algorithm derived from biology, specifically from natural
selection. It helps deal with noise and incompleteness in process models. The way
it operates is as follows:
1. Reading the event log.
2. Building an initial representation that defines the search space of the genetic
algorithm.
3. Calculating the fitness of each process in this initial representation; the
fitness measure evaluates the quality of a process model (a point in the search
space) against the event log.
4. Stopping and returning the fittest models once a stopping criterion is met.
5. Otherwise, generating the next models by genetic operators, which ensure that
all points in the search space defined by the internal representation can be
reached by running the genetic algorithm.
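A toy sketch of that loop in plain Python, not the actual Genetic Miner representation: candidate models are simple sets of causal relations, fitness is the fraction of observed directly-follows behavior a candidate covers, and mutation is the only genetic operator used; the log and all parameter values are illustrative assumptions.

```python
import random

# Toy event log: each trace is a sequence of activities.
log = [["A", "B", "C"], ["A", "C", "B"], ["A", "B", "C"]]
activities = sorted({a for t in log for a in t})
all_relations = [(x, y) for x in activities for y in activities if x != y]

# Directly-follows pairs observed in the log.
observed = {(s, d) for t in log for s, d in zip(t, t[1:])}

def fitness(model):
    """Fraction of observed behaviour that the candidate model allows."""
    return len(observed & model) / len(observed)

def mutate(model):
    """Flip one random relation in or out of the candidate model."""
    relation = random.choice(all_relations)
    return model ^ {relation}

# Initial random population of candidate models (sets of causal relations).
population = [set(random.sample(all_relations, k=3)) for _ in range(20)]

for _ in range(50):                        # generations
    population.sort(key=fitness, reverse=True)
    parents = population[:10]              # selection: keep the fittest half
    children = [mutate(random.choice(parents)) for _ in range(10)]
    population = parents + children        # next generation

best = max(population, key=fitness)
print("fitness:", fitness(best), "relations:", sorted(best))
```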

2. Data Stream Mining
a. Definition
i. Data Stream Mining (also known as stream learning) is the process of extracting
knowledge structures from continuous, rapid data records. A data stream is an
ordered sequence of instances that in many applications of data stream mining can
be read only once or a small number of times using limited computing and storage
capabilities.
ii. Data stream mining can be considered a subfield of data mining, machine
learning, and knowledge discovery.
b. Applications
i. In many data stream mining applications, the goal is to predict the class or value
of new instances in the data stream given some knowledge about the class
membership or values of previous instances in the data stream. Machine learning
techniques can be used to learn this prediction task from labeled examples in an
automated way.
ii. Often, concepts from the field of incremental learning are applied to cope with
structural changes, on-line learning, and real-time demands. In many applications,
especially those operating within non-stationary environments, the distribution
underlying the instances or the rules underlying their labeling may change over
time; that is, the goal of the prediction (the class or target value to be
predicted) may change over time. This problem is referred to as concept drift, and
detecting it is a central issue in data stream mining (a simplified drift detector
is sketched after the examples below).
iii. Other challenges that arise when applying machine learning to streaming data
include partially labeled and delayed-labeled data, recovery from concept drifts,
and temporal dependencies.
iv. Examples:
1. computer network traffic
2. phone conversations
3. ATM transactions
4. web searches
5. sensor data
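The drift detector below is a deliberately simplified, self-contained sketch (a windowed mean comparison); production stream-mining libraries provide more principled detectors such as ADWIN or DDM, so treat this only as an illustration of the idea.

```python
from collections import deque

def detect_drift(stream, window=50, threshold=0.3):
    """Flag a drift when the mean of the most recent window moves far away
    from the mean of the previous window. A deliberately simple stand-in
    for principled detectors such as ADWIN or DDM."""
    reference, recent = deque(maxlen=window), deque(maxlen=window)
    for i, x in enumerate(stream):
        recent.append(x)
        if len(recent) == window:
            if reference and abs(sum(recent) / window - sum(reference) / window) > threshold:
                yield i                    # index where the drift is flagged
                reference.clear()
            reference.extend(recent)
            recent.clear()

# Synthetic stream whose underlying distribution shifts halfway through.
synthetic_stream = [0.2] * 200 + [0.8] * 200
print(list(detect_drift(synthetic_stream)))   # prints [249]
```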
c. Algorithm
i. Data stream mining techniques are implemented to extract patterns and insights
from a data stream. A vast range of algorithms is available for stream mining.
There are four main families of algorithms used in data stream mining.

ii. The four families and typical algorithms within each are:
1. Classification
a. Lazy classifier or k-NN
b. Naive Bayes
c. Decision trees
d. Logistic regression
e. Ensembles
2. Regression
a. Lazy classifier or k-NN
b. Naive Bayes
c. Decision trees
d. Logistic regression
e. Ensembles
3. Clustering
a. K-means
b. Hierarchical clustering
c. Density-based clustering
4. Frequent Pattern Mining
a. Apriori
b. Eclat
c. FP-Growth
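A minimal prequential (test-then-train) classification sketch, assuming the River stream-learning library and its bundled Phishing dataset; a Hoeffding tree is the streaming counterpart of the decision trees listed above.

```python
from river import datasets, metrics, tree

model = tree.HoeffdingTreeClassifier()   # incremental decision tree
metric = metrics.Accuracy()

# Each instance is seen once: predict first, then learn from it
# (prequential evaluation), matching the one-pass nature of streams.
for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)
    if y_pred is not None:
        metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)
```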

3. Data Scraping
a. Definition
i. Data scraping, also known as web scraping, is the
process of importing information from a website
into a spreadsheet or local file saved on your
computer. It’s one of the most efficient ways to get
data from the web, and in some cases to channel
that data to another website.
b. Applications
i. Contact scraping - a lot of websites contain email
addresses and phone numbers in plaintext. By scraping locations like an online
employee directory, a scraper is able to aggregate contact details for bulk
mailing lists, robo calls, or malicious social engineering attempts. This is one of
the primary methods both spammers and scammers use to find new targets.
1. Research for web content/business intelligence.
ii. Price scraping - by scraping pricing data, competitors are able to aggregate
information about their competition. This can allow them to formulate a unique
advantage.
1. Pricing for travel booker sites/price comparison sites.
iii. Market Research - web scraping can be used for market research by
companies. High-quality web scraped data obtained in large volumes can be
very helpful for companies in analyzing consumer trends and understanding
which direction the company should move in the future.
1. Finding sales leads/conducting market research by crawling public data
sources (for example, Yell and Twitter)
iv. Content scraping - content can be pulled from a website in order to replicate
the unique advantage of a particular product or service that relies on content.
For example, a product like Yelp relies on reviews; a competitor could scrape all
the review content from Yelp and reproduce the content on their own site,
pretending the content is original.
1. Sending product data from an e-commerce site to another online vendor
(for example, Google Shopping, Lazada, Shopee)
v. Email Marketing – Companies can also use web scraping for email marketing.
They can collect email addresses from various sites using web scraping and then
send bulk promotional and marketing emails to all the people owning these
addresses.

vi. News Monitoring – Web scraping news sites can provide detailed reports on
the current news to a company. This is even more essential for companies that
are frequently in the news or that depend on daily news for their day-to-day
functioning. After all, news reports can make or break a company in a single
day.
c. Algorithm
i. The layout of a website, which is the presentation of data, is described using
Hypertext Markup Language (HTML). An HTML document basically consists
of four types of elements: document structure, block, inline, and interactive
elements.
ii. There is a Document Type Definition (DTD) for each version of HTML which
describes how the elements are allowed to be nested. It is structured as a
grammar in extended Backus-Naur form. There is a strict and a transitional
version of the DTD for backward compatibility. The most common abstract model
for HTML documents is a tree. An example of an HTML document modeled as a tree
is shown below:
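In place of the original figure, the tree model can be illustrated with a short Python sketch that parses a toy document with BeautifulSoup (an assumed, commonly used parser) and prints each element indented by its depth:

```python
from bs4 import BeautifulSoup   # third-party parser, used here as an example

html_doc = """
<html>
  <body>
    <h1>Title</h1>
    <div>
      <p>Relevant Data</p>
      <p>Other Data</p>
    </div>
  </body>
</html>
"""

def print_tree(node, depth=0):
    """Recursively print each element tag, indented by its depth in the tree."""
    print("  " * depth + node.name)
    for child in node.find_all(recursive=False):
        print_tree(child, depth + 1)

soup = BeautifulSoup(html_doc, "html.parser")
print_tree(soup.html)
```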

iii. Web Scraping Fundamentals - The basic principle of web scraping is to create
a program that, in an automated fashion, retrieves unstructured data from a
website and adds structure to it. This data is often encapsulated as plain text
within the HTML of a website, which is the standard language in which websites
are written [2].

There are a multitude of ways in which web scraping can be implemented, many of
which are discussed in the paper Web Scraping: State-of-the-Art and Areas of
Application [3]. The complexity of these implementations varies widely and
depends mainly on one key factor: how general the web scraping algorithm is. The
static nature of scraping results in binary outcomes, either 100% accuracy or 0%.
To avoid this, generalization is key.

Commonly used browsers such as Internet Explorer, Google Chrome, and Firefox
interpret the HTML code and display the sites with which we are familiar. The
code displayed in listing 1.1 shows the essence of HTML, namely tags [4]. The
tags restrict which type of content is allowed to be written between the opening
and closing tag; an example would be only being allowed to write plain text
between the paragraph tags <p> and </p>. They also create "paths" that can be
followed, leading to certain parts of the code. The path to the paragraph with
"Relevant Data" can look something like this: html → body → p[0]; this path
would lead to the first node with the tag p, obtaining "Relevant Data". These
paths are often defined through CSS selectors or XPaths [5], which are different
ways of navigating the HTML structure. XPath stands for XML Path Language; it is
a query language used to select nodes in an XML document, in this case the HTML.

The simplest practice right now is to create a site-specific scraper in which a
person has identified the parts of the HTML in which the sought-after data
resides, and then create a program that goes to those exact locations and
retrieves it. These exact locations are specified through the aforementioned
paths. Retrieving the HTML of a website can be achieved through methods such as
HTTP requests, or through headless browsers such as Puppeteer. Puppeteer
provides an easy way of interacting with browsers programmatically, which can,
in turn, allow more data to be collected; for example, scrolling down the page
can load additional items into the HTML. This is the easiest way of implementing
scraping; however, it has some major flaws.
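A minimal sketch of such a site-specific scraper using HTTP requests, assuming the Python requests and lxml packages; the URL and the XPath are hypothetical placeholders that would be replaced with values found by inspecting the target page.

```python
import requests
from lxml import html

# Hypothetical target URL and XPath; both would be replaced with the
# site-specific values identified by inspecting the page's HTML.
URL = "https://example.com/products"
XPATH = "//div[@class='product']/p/text()"

response = requests.get(URL, timeout=10)
response.raise_for_status()

# Parse the retrieved HTML into a tree and follow the path to the data.
tree = html.fromstring(response.content)
items = tree.xpath(XPATH)

for item in items:
    print(item.strip())
```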

iv. Web Scraping Flaws – As mentioned in the previous section, there are some
flaws with the easier ways of implementing scraping. Firstly, it has to be done
manually for each site, which might be, depending on the scope, infeasible.
Secondly, as soon as the HTML code for the site changes, the algorithm will
break, because the exact path that previously located the wanted element no
longer leads to an element. This results in quite a bit of maintenance work, as
the algorithms need to be updated potentially as often as the website is altered.
v. Advanced Web Scraping – The flaws mentioned above have resulted in an interest
in developing more advanced scraping algorithms, to reduce the need for fragile
and maintenance-heavy scrapers. Machine learning is often used to create advanced
scraping algorithms, as it is well suited to the task of generalizing. The two
aspects of scraping with which machine learning can help are the classification
of the text data on the site and the recognition of patterns within the HTML
structure. When the objective is a site-specific scraper, the limited amount of
data is a large factor. The aim is to be able to set this up using only what is
found on the page as data for the model. This limitation requires the site to
have repetitive structures in the HTML code, both to extract enough data for the
model to work with and to ease the identification of these areas.

vi. HTML Nodes – Nodes in the HTML code are created each time a new tag is added.
A node's content is what lies between its opening and closing tags; thus its
children are all the tags added within its body. In the example referenced as
listing 1.2, the "video" node has 4 children: title, length, noise, and creator.
When HTML code is parsed with an HTML parser such as HtmlAgilityPack [6], it is
common for these nodes to contain a lot of metadata, such as their position in
the code, how many children they have, and so on. The selection of nodes within
the HTML can be done in various ways, meaning there are multiple languages
designed to select these nodes; one of these is XPath [5]. Nodes whose inner-text
attribute contains text are the nodes that potentially contain the sought-after
data.
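A small sketch of that last point, using the lxml package on a hypothetical "video"-style snippet (the tags and class names are illustrative): it walks every node and keeps only those whose own inner text is non-empty, reporting each node's child count as an example of the metadata a parser exposes.

```python
from lxml import html

snippet = """
<div class='video'>
  <span class='title'>Cat video</span>
  <span class='length'>03:15</span>
  <span class='creator'>Alice</span>
</div>
"""

tree = html.fromstring(snippet)

# Walk every node; keep the ones whose own text is non-empty, since those
# are the candidates that may hold the sought-after data.
for node in tree.iter():
    text = (node.text or "").strip()
    if text:
        print(f"<{node.tag}> children={len(node)} text={text!r}")
```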

4. Bibliomining
a. Definition
i. Bibliomining is the application of statistical and pattern-recognition tools
to large amounts of data associated with library systems in order to aid
decision-making or justify services. The term "bibliomining" comes from the
combination of bibliometrics and data mining, which are the two main toolsets
used for analysis.
ii. It is the use of a combination of data mining, data warehousing, and
bibliometrics for the purpose of analyzing library services.
b. Applications
i. First a data warehouse must be created. This is done by compiling information
on the resources, such as titles and authors, subject headings, and descriptions
of the collections. Then the demographic surrogate information is organized.
Finally, the library information (such as the librarian, whether or not the
information came from the reference desk or circulation desk, and the location
of the library) is obtained.

Once this is organized, the data can be processed and analyzed. This can be done
via a few methods, such as Online Analytical Processing (OLAP), using a data
mining program, or through data visualization.

(Figure: Bibliomining application model)

ii. Bibliomining is used to discover patterns in what people are reading and
researching and allows librarians to target their community better. Bibliomining
can also help library directors focus their budgets on resources that will be
utilized. Another use is to determine when people use the library more often, so
staffing needs can be adequately met. Combining bibliomining with other research
techniques, such as focus groups, surveys, and cost-benefit analysis, will help
librarians to get a better picture of their patrons and their needs.
c. Algorithm
i. Bibliomining is essentially a synonym for data mining in libraries, and different experts are
involved in the process. Domain experts, that is librarians or library specialists, identify the data
sources required to provide specific services, to resolve management issues, or to support
decision-making in the library. Data mining experts or database specialists take responsibility for
collecting data, cleaning data, and transferring data or building data warehouses. Data mining
experts then select proper tools to discover meaningful patterns that predict or describe different
patrons or clusters of demographic groups that exhibit certain characteristics. Under this process,
the bibliomining application can be depicted as in Figure 2. For example, suppose librarians would
like to improve the efficiency of the circulation service. They need to know how to arrange
manpower at the desk. Library specialists may propose analyzing the time spans of historical
circulation records to get helpful information. They thus identify the past 5 academic years'
borrowing records and patron data as the required data sources for data mining. Data mining
experts or database specialists then begin to dig out the required historical circulation data from
the backup data repositories of the S and I library automation systems. However, the formats or
types of fields used to record circulation data are not completely identical, such as 10 characters
for I's student ID but 8 for S's. Data mining experts or database specialists must do data
conversion by SQL programming to resolve the case. After the required data are loaded into a SQL
Server database, the data mining experts or database specialists, according to their experience
mining similar cases, apply an association algorithm to mine the database and obtain the results.
Once the resulting information is verified and validated by library specialists and librarians, and
the library administrators agree with it, the bibliomining case is finished.
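A brief sketch of the kind of cleaning and time-span analysis described in this example, assuming the pandas library and a hypothetical CSV export named circulation_records.csv with student_id and checkout_time columns:

```python
import pandas as pd

# Hypothetical circulation export; the file and column names are assumptions.
records = pd.read_csv("circulation_records.csv", parse_dates=["checkout_time"])

# Data cleaning step analogous to the ID-format problem described above:
# pad shorter student IDs so both systems use a uniform 10-character format.
records["student_id"] = records["student_id"].astype(str).str.zfill(10)

# Count checkouts per hour of day to see when desk staffing is most needed.
busy_hours = (
    records["checkout_time"].dt.hour
    .value_counts()
    .sort_index()
)
print(busy_hours)
```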
