
UNIT 3:

Cluster Analysis
Cluster Analysis is the process of finding groups of similar objects so that they form clusters. It is an
unsupervised machine-learning technique that acts on unlabelled data: data points that are similar to one
another are grouped together, and all the objects in a group belong to the same cluster.
Cluster:
The given data is divided into different groups by combining similar objects; such a group is called a
cluster. A cluster is simply a collection of similar data objects grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as
cars, buses, bicycles, etc. Because this is unsupervised learning, there are no class labels such as Car or Bike
for the vehicles; all the data is mixed together and has no structure.
Our task is to convert this unlabelled data into labelled data, and this can be done using clusters.
The main idea of cluster analysis is to arrange the data points into clusters, such as a cars cluster that
contains all the cars, a bikes cluster that contains all the bikes, and so on.
In short, it is the partitioning of similar objects, applied to unlabelled data.
Properties of Clustering:
1. Clustering Scalability: Today there are vast amounts of data, so clustering algorithms must deal with
huge databases. To handle such extensive databases, the clustering algorithm should be scalable; if it is not,
we may not get appropriate results on large data.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as well as small
datasets.
3. Algorithm Usability with Multiple Data Kinds: Clustering algorithms should work with different kinds
of data. They should be capable of dealing with discrete, categorical, interval-based, and binary data,
among others.
4. Dealing with Unstructured Data: Some databases contain missing values and noisy or erroneous data.
If an algorithm is sensitive to such data, it may produce poor-quality clusters. It should therefore be able to
handle unstructured data and give it structure by organising it into groups of similar data objects. This
makes it easier for the data expert to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable.
Interpretability reflects how easily the results can be understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method: It is used to partition the data in order to form clusters. If "n" partitions are made
of the "p" objects of the database, then each partition is represented by a cluster and n < p. The two
conditions which need to be satisfied by this partitioning clustering method are:
 Each object must belong to exactly one group.
 Each group must contain at least one object.
In the partitioning method there is a technique called iterative relocation, in which objects are moved
from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is
created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed.
There are two types of approaches for the creation of hierarchical decomposition, they are:
 Agglomerative Approach: The agglomerative approach is also known as the bottom-up
approach. Initially, each object forms its own separate group. The method then keeps
merging the objects or groups that are close to one another, i.e. those that exhibit similar
properties. This merging process continues until the termination condition holds.
 Divisive Approach: The divisive approach is also known as the top-down approach. In this
approach, we start with all the data objects in a single cluster. This cluster is then divided
into smaller clusters by repeated iteration. The iteration continues until the termination
condition is met or until each cluster contains a single object.
Once a group is split or merged, the step can never be undone; hierarchical clustering is therefore a rigid,
relatively inflexible method.
The two approaches which can be used to improve the quality of hierarchical clustering in data mining are:

 One should carefully analyze the linkages between objects at every level of the hierarchical
partitioning.
 One can integrate hierarchical agglomeration with other clustering techniques. In this
approach, the objects are first grouped into micro-clusters, and then macro-clustering is
performed on the micro-clusters.
Density-Based Method: The density-based method focuses on density. In this method, a cluster keeps
growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point
within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number
of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells
that form a grid structure. A major advantage of the grid-based method is its fast processing time, which
depends only on the number of cells in each dimension of the quantized space rather than on the number
of data objects.
Model-Based Method: In the model-based method, a model is hypothesized for each of the clusters in
order to find the best fit of the data to the given model. A density function is used to locate the clusters
for a given model. It reflects the spatial distribution of the data points and also provides a way to
automatically determine the number of clusters based on standard statistics, taking outliers or noise into
account. It therefore yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by the incorporation of
application or user-oriented constraints. A constraint refers to the user expectation or the properties of the
desired clustering results. Constraints provide us with an interactive way of communication with the
clustering process. The user or the application requirement can specify constraints.
Applications of Cluster Analysis:
 Cluster analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
 It helps marketers discover distinct groups in their customer base, so that they can characterize
their customer groups based on purchasing patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities, and gain insight into structures inherent in populations.
 It helps in the identification of areas of similar land use in an earth-observation database, and
in the identification of groups of houses in a city according to house type, value, and
geographic location.
 It helps in information discovery by classifying documents on the web.
 It is also used in outlier-detection applications such as the detection of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the distribution
of data and to observe the characteristics of each cluster.
Partitioning Method (K-Means) in Data Mining
Partitioning Method: This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters to be
generated. Given a database D containing N objects, the partitioning method constructs K user-specified
partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms
come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and
CLARA (Clustering Large Applications). Here we will look at the working of the K-Means algorithm in
detail.
K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the
user and partitions the dataset containing N objects into K clusters so that the similarity among the data
objects inside a cluster (intracluster similarity) is high, while the similarity of data objects to objects
outside their cluster (intercluster similarity) is low. The similarity of a cluster is measured with respect to
the mean value of the cluster. K-Means is a type of squared-error algorithm. At the start, K objects are
chosen at random from the dataset, each representing a cluster mean (centre). Each of the remaining data
objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of
each cluster is then calculated from the objects assigned to it.
Algorithm: K-Means
Input:
K: The number of clusters into which the dataset has to be divided
D: A dataset containing N objects

Output:
A set of K clusters
Method:
1. Randomly select K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the newly assigned objects.
4. Repeat steps 2 and 3 until no change occurs.

Figure – K-Means clustering flowchart

Figure – K-Means clustering

Example: Suppose we want to group the visitors to a website using just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between Iteration 3 and Iteration 4, so we stop. Therefore the K-Means algorithm gives
us the two clusters (16-29) and (36-66).
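The same procedure can be written as a short program. The following is a minimal Python sketch of 1-D K-Means on the age data above; as an assumption for reproducibility, the initial centroids 16 and 22 are fixed to match the example instead of being chosen at random.

# Minimal 1-D K-Means sketch for the age example above.
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]           # initial centroids, fixed to match the example

while True:
    # Assignment step: each age goes to the cluster with the nearest centroid.
    clusters = [[], []]
    for age in ages:
        nearest = min(range(len(centroids)), key=lambda c: abs(age - centroids[c]))
        clusters[nearest].append(age)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # no change, so the algorithm has converged
        break
    centroids = new_centroids

print(centroids)   # approximately [20.5, 48.89]
print(clusters)    # [[16, ..., 29], [36, ..., 66]]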
Hierarchical clustering in data mining

Hierarchical clustering refers to an unsupervised learning procedure that determines successive clusters
based on previously defined clusters. It works by grouping data into a tree of clusters. Hierarchical
clustering starts by treating each data point as an individual cluster. The endpoint is a set of clusters,
where each cluster is distinct from the others and the objects within each cluster are broadly similar to
one another.

There are two types of hierarchical clustering

o Agglomerative Hierarchical Clustering


o Divisive Clustering

Agglomerative hierarchical clustering

Agglomerative clustering is one of the most common types of hierarchical clustering used to group similar
objects into clusters. Agglomerative clustering is also known as AGNES (Agglomerative Nesting). In
agglomerative clustering, each data point acts as an individual cluster, and at each step data objects are
grouped in a bottom-up manner. Initially, each data object is in its own cluster. At each iteration, clusters
are combined with other clusters until one cluster is formed.

Agglomerative hierarchical clustering algorithm

1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (compute the proximity matrix).
3. Merge the most similar (closest) clusters.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.

Let's understand this concept with the help of a graphical representation using a dendrogram.

The demonstration below shows how the algorithm works. No calculations are carried out here; all the
proximities among the clusters are assumed.

Let's suppose we have six different data points P, Q, R, S, T, V.


Step 1:

Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance between the
individual cluster from all other clusters.

Step 2:

Now, merge the comparable clusters into single clusters. Let's say cluster Q and cluster R are most similar
to each other, as are S and T, so we merge them in this step. We are left with the clusters [(P), (QR), (ST), (V)].

Step 3:

Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST), (V)]
together to form new clusters as [(P), (QR), (STV)]

Step 4:

Repeat the same process. The clusters (STV) and (QR) are comparable and are combined to form a new
cluster. Now we have [(P), (QRSTV)].

Step 5:

Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)]
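The merging process above can also be reproduced with a library. Below is a minimal sketch using SciPy's hierarchical clustering (assuming NumPy and SciPy are installed); the six 2-D points are illustrative stand-ins for P, Q, R, S, T, V, not values taken from the example.

# Agglomerative (bottom-up) clustering sketch with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D points standing in for P, Q, R, S, T, V.
points = np.array([[1, 1], [2, 1], [2, 2], [8, 8], [8, 9], [12, 12]])

# Build the merge tree using single linkage (distance between nearest members).
Z = linkage(points, method="single")

# Cut the tree into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 3]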

Divisive Hierarchical Clustering

Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In divisive
hierarchical clustering, all the data points start out in a single cluster, and in every iteration the data points
that are not similar to the rest are separated from their cluster. The separated data points are treated as
individual clusters. Finally, we are left with N clusters.
Advantages of Hierarchical clustering
o It is simple to implement and gives the best output in some cases.
o It is easy and results in a hierarchy, a structure that contains more information.
o It does not need us to pre-specify the number of clusters.

Disadvantages of hierarchical clustering


o It breaks the large clusters.
o It is Difficult to handle different sized clusters and convex shapes.
o It is sensitive to noise and outliers.
o A merge or split, once performed, can never be undone.
Density-based clustering in data mining

Density-based clustering refers to a method that is based on a local cluster criterion, such as density-connected
points. In this tutorial, we will discuss density-based clustering with examples.

What is Density-based clustering?

Density-based clustering is one of the most popular unsupervised learning methodologies used in
model building and machine learning algorithms. Data points in the low-density regions that separate two
clusters are considered noise. The surroundings within a radius ε of a given object are known as the
ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number of
objects, MinPts, then the object is called a core object.

Density-Based Clustering - Background

There are two parameters used to define density-based clusters:

Eps: the maximum radius of the neighborhood.

MinPts: the minimum number of points required in the Eps-neighborhood of a point.

NEps(i) = { k ∈ D | dist(i, k) <= Eps }

Directly density-reachable:

A point i is directly density-reachable from a point k with respect to Eps and MinPts if

i ∈ NEps(k)

Core point condition:

|NEps(k)| >= MinPts


Density-reachable:

A point i is density-reachable from a point j with respect to Eps and MinPts if there is a chain of points
i1, ..., in with i1 = j and in = i such that each i(m+1) is directly density-reachable from im.

Density-connected:

A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o such that
both i and j are density-reachable from o with respect to Eps and MinPts.

Working of Density-Based Clustering

Suppose a set of objects is denoted by D'. An object i is directly density-reachable from an object j only if
it is located within the ε-neighborhood of j and j is a core object.

An object i is density-reachable from an object j with respect to ε and MinPts in a given set of objects D'
only if there is a chain of objects i1, ..., in with i1 = j and in = i such that each i(m+1) is directly
density-reachable from im with respect to ε and MinPts.

An object i is density-connected to an object j with respect to ε and MinPts in a given set of objects D'
only if there is an object o in D' such that both i and j are density-reachable from o with respect to ε
and MinPts.
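A tiny sketch of the Eps-neighbourhood and the core-object test defined above is shown below; the points and the Eps/MinPts values are made up for illustration.

# Eps-neighbourhood and core-object check (illustrative values only).
import math

D = [(1, 1), (1, 2), (2, 1), (8, 8)]
Eps, MinPts = 1.5, 3

def n_eps(i):
    # N_Eps(i) = { k in D | dist(i, k) <= Eps }
    return [k for k in D if math.dist(i, k) <= Eps]

p = (1, 1)
print(n_eps(p))                    # [(1, 1), (1, 2), (2, 1)]
print(len(n_eps(p)) >= MinPts)     # True, so p satisfies the core point condition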
Major Features of Density-Based Clustering

The primary features of Density-based clustering are given below.

o It is a one-scan method (the database is examined only once).
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary shape.

Density-Based Clustering Methods

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based
notion of a cluster and can identify clusters of arbitrary shape in a spatial database containing noise and outliers.
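As a rough illustration, the following sketch runs DBSCAN with scikit-learn (assuming it is installed); eps corresponds to Eps, min_samples to MinPts, and the points are invented for the example.

# DBSCAN sketch with scikit-learn: two dense groups plus one noise point.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [25, 25]])

model = DBSCAN(eps=2.0, min_samples=2).fit(X)
print(model.labels_)   # e.g. [0 0 0 1 1 1 -1]; label -1 marks noise/outliers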

OPTICS

OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an ordering of the
database with respect to its density-based clustering structure. This cluster ordering contains information
equivalent to the density-based clusterings obtained from a wide range of parameter settings. OPTICS is
useful for both automatic and interactive cluster analysis, including determining an intrinsic clustering
structure.

DENCLUE

DENCLUE is a density-based clustering method by Hinneburg and Keim. It enables a compact mathematical
description of arbitrarily shaped clusters in high-dimensional data, and it works well for data sets with a
large amount of noise.
Dealing with large databases
Here are some tips for making the most of your large data sets.

Cherish your data


"Keep your raw data raw: don't manipulate it without having a copy," says Teal. She recommends storing
your data somewhere that creates automatic backups and that other laboratory members can access, while
abiding by your institution's rules on consent and data privacy.

Because you won't need to access these data often, says Teal, "you can use storage options where it can cost
more money to access the data, but storage costs are low" — for instance, Amazon's Glacier service. You
could even store the raw data on duplicate hard drives kept in different locations. Storage costs for large data
files can add up, so budget accordingly.

Visualize the information


As data sets get bigger, new wrinkles emerge, says Titus Brown, a bioinformatician at the University of
California, Davis. "At each stage, you're going to be encountering new and exciting messed-up behaviour."
His advice: "Do a lot of graphs and look for outliers." Last April, one of Brown's students analysed
transcriptomes — the full set of RNA molecules produced by a cell or organism — from 678 marine
microorganisms such as plankton (L. K. Johnson et al. GigaScience 8, giy158; 2019). When Brown and his
student charted average values for transcript length, coverage and gene content, they noticed that some
values were zero — showing where the computational workflow had failed and had to be re-run.

Show your workflow


When particle physicist Peter Elmer helps his 11-year-old son with his mathematics homework, he has to
remind him to document his steps. "He just wants to write down the answer," says Elmer, who is executive
director of the Institute for Research and Innovation in Software for High Energy Physics at Princeton
University in New Jersey. Researchers working with large data sets can benefit from the same advice that
Elmer gave his son: "Showing your work is as important as getting to the end."

This means recording your entire data workflow — which version of the data you used, the clean-up and
quality-checking steps, and any processing code you ran. Such information is invaluable for documenting
and reproducing your methods. Eric Lyons, a computational biologist at the University of Arizona in
Tucson, uses the video-capture tool asciinema to record what he types into the command line, but lower-tech
solutions can also work. A group of his colleagues, he recalls, took photos of their computer screen's display
and posted them on the lab's group on Slack, an instant-messaging platform.
Use version control
Version-control systems allow researchers to understand precisely how a file has changed over time, and
who made the changes. But some systems limit the sizes of the files you can use. Harvard Dataverse (which
is open to all researchers) and Zenodo can be used for version control of large files, says Alyssa Goodman,
an astrophysicist and data-visualization specialist at Harvard University in Cambridge, Massachusetts.
Another option is Dat, a free peer-to-peer network for sharing and versioning files of any size. The system
maintains a tamper-proof log that records all the operations you perform on your file, says Andrew Osheroff,
a core software developer at Dat in Copenhagen. And users can direct the system to archive a copy of each
version of a file, says Dat product manager Karissa McKelvey, who is based in Oakland, California. Dat is
currently a command-line utility, but "we've been actively revamping", says McKelvey; the team hopes to
release a more user-friendly front end later this year.

Record metadata
"Your data are not useful unless people — and 'future you' — know what they are," says Teal. That's the
job of metadata, which describe how observations were collected, formatted and organized. Consider which
metadata to record before you start collecting, Lyons advises, and store that information alongside the data
— either in the software tool used to collect the observations or in a README or another dedicated file.
The Open Connectome Project, led by Joshua Vogelstein, a neuro-statistician at Johns Hopkins University in
Baltimore, Maryland, logs its metadata in a structured plain-text format called JSON. Whatever your
strategy, try to think long-term, Lyons says: you might one day want to integrate your data with those of
other labs. If you're proactive with your metadata, that integration will be easier down the line.

Automate, automate, automate


Big data sets are too large to comb through manually, so automation is key, says Shoaib Mufti, senior
director of data and technology at the Allen Institute for Brain Science in Seattle, Washington. The
institute's neuroinformatics team, for instance, uses a template for brain-cell and genetics data that accepts
information only in the correct format and type, Mufti says. When it's time to integrate those data into a
larger database or collection, data-quality-assurance steps are automated using Apache Spark and Apache
HBase, two open-source tools, to validate and repair data in real time. "Our entire suite of software tools to
validate and ingest data runs in the cloud, which allows us to easily scale," he says. The Open Connectome
Project also provides automated quality assurance, says Vogelstein — this generates visualizations of
summary statistics that users can inspect before moving forward with their analyses.

Make computing time count


Large data sets require high-performance computing (HPC), and many research institutes now have their
own HPC facilities. The US National Science Foundation maintains the national HPC network XSEDE,
which includes the cloud-based computing network Jetstream and HPC centres across the country.
Researchers can request resource allocations at xsede.org, and create trial accounts
at go.nature.com/36ufhgh. Other options include the US-based ACI-REF network, NCI Australia,
the Partnership for Advanced Computing in Europe and ELIXIR networks, as well as commercial providers
such as Amazon, Google and Microsoft.
But when it comes to computing, time is money. To make the most of his computing time on the
GenomeDK and Computerome clusters in Denmark, Guojie Zhang, a genomics researcher at the University
of Copenhagen, says his group typically runs small-scale tests before migrating its analyses to the HPC
network. Zhang is a member of the Vertebrate Genomes Project, which is seeking to assemble the genomes
of about 70,000 vertebrate species. "We need millions or even billions of computing hours," he says.

Capture your environment


To replicate an analysis later, you won't just need the same version of the tool you used, says Benjamin
Haibe-Kains, a computational pharmaco-genomicist at the Princess Margaret Cancer Centre in Toronto,
Canada. You'll also need the same operating system, and all the same software libraries that the tool
requires. For this reason, he recommends working in a self-contained computing environment — a Docker
container — that can be assembled anywhere. Haibe-Kains and his team use the online platform Code
Ocean (which is based on Docker) to capture and share their virtual environments; other options
include Binder, Gigantum and Nextjournal. "Ten years from now, you could still run that pipeline exactly
the same way if you need to," Haibe-Kains says.

Don’t download the data


Downloading and storing large data sets is not practical. Researchers must run analyses remotely, close to
where the data are stored, says Brown. Many big-data projects use Jupyter Notebook, which creates
documents that combine software code, text and figures. Researchers can 'spin up' such documents on or
near the data servers to do remote analyses, explore the data, and more, says Brown. Jupyter Notebook is not
particularly accessible to researchers who might be uncomfortable using a command line, Brown says, but
there are more user-friendly platforms that can bridge the gap, including Terra and Seven Bridges
Genomics.

Start early
Data management is crucial even for young researchers, so start your training early. "People feel like they
never have time to invest," Elmer says, but "you save yourself time in the long run". Start with the basics of
the command line, plus a programming language such as Python or R, whichever is more important to your
field, he says. Lyons concurs: "Step one: get familiar with data from the command line." In November,
some of his collaborators who were not fluent in command-line usage had trouble with genomic data
because chromosome names didn't match across all their files, Lyons says. "Having some basic
command-line skills and programming let me quickly correct the chromosome names."
Cluster software
What is Clustering Software?
Clustering software enables you to configure servers for redundancy to prevent downtime and data loss. A
primary server is clustered with one or more secondary servers. Clustering software monitors the health of
the application and, if it detects a failure, moves application operation to a secondary server in the cluster in
a process called a failover. IT professionals rely on clustering to eliminate a single point of failure and
minimize the risk of downtime. In fact, 86 percent of all organizations are operating their HA applications
with some kind of clustering or high availability mechanism in place.

Types of Cluster Management Software


There are a variety of cluster management software solutions available for Windows and Linux
distributions.

Products, such as SUSE Linux Enterprise High Availability Extension (HAE), Red Hat Cluster Suite, Oracle
Real Application Clusters (RAC), and Windows Server Failover Clustering (WSFC), only support one
operating system. Moreover, Linux open-source HA extensions require a high degree of technical skill,
creating complexity and reliability issues that challenge most operators.

How SIOS Clustering Software Provides High Availability for Windows and Linux Clusters
If you are running an essential application in a Windows or Linux environment, you may want to consider
SIOS Technology Corporation's high availability software clustering products for consistent HA/DR
regardless of OS.

In a Windows environment, SIOS DataKeeper Cluster Edition seamlessly integrates with and extends
Windows Server Failover Clustering (WSFC) by providing a performance-optimized, host-based data
replication mechanism. While WSFC manages the software cluster, SIOS performs the replication to enable
disaster protection and ensure zero data loss in cases where shared storage clusters are impossible or
impractical, such as in cloud, virtual, and high-performance storage environments.

In a Linux environment, the SIOS LifeKeeper for Linux provides a tightly integrated combination of high
availability failover clustering, continuous application monitoring, data replication, and configurable
recovery policies, protecting your essential applications from downtime and disasters.

Whether you are in a Windows or Linux environment, SIOS products free your IT team from the complexity
and challenges of computing infrastructures. They provide the intelligence, automation, flexibility, high
availability, and ease-of-use IT managers need to protect essential applications from downtime or data loss.
With over 80,000 licenses sold, SIOS is used by many of the world's largest companies.

Here is one case study that discusses how a leading Hospital Information Systems (HIS) provider deployed
SIOS DataKeeper Cluster Edition to improve high availability and network bandwidth in their Windows
cluster environment.
How One HIS Provider Improved RPO and RTO With SIOS DataKeeper Clustering Software
This leading HIS provider has more than 10,000 U.S.-based health care organizations (HCOs) using a
variety of its applications, including patient care management, patient self-service, and revenue
management. To support these customers, the organization had more than 20 SQL Server clusters located in
two geographically dispersed data centers, as well as a few smaller servers and SQL Server log shipping
for disaster recovery (DR).

The organization has a large customer base and vast IT infrastructure and needed a solution that could
handle heavy network traffic and eliminate network bandwidth problems when replicating data to its DR
site. The organization also needed to improve its Recovery Point Objective (RPO) and Recovery Time
Objective (RTO) to reduce the volume of data at risk and get IT operations back up and running faster after
a disaster or system failure. RPO is the maximum amount of data loss that can be tolerated when a server
fails, or a disaster happens. RTO is the maximum tolerable duration of any outage.

To address these challenges, this organization chose SIOS DataKeeper Cluster Edition, which provides
seamless integration with WSFC, making it possible to create SANless clusters.

Once SIOS DataKeeper Cluster Edition passed the organization's stringent POC testing, the IT team
deployed the solution in the company's production environment. The team deployed SIOS across a
three-node cluster comprised of two SAN-based nodes in the organization's primary, on-premises data
center and one SANless node in its remote DR site.

The SIOS solution synchronizes replication across the three nodes in the cluster and eliminates the
bandwidth issues at the DR site, improving both RPO and RTO and reducing the cost of bandwidth. Today,
the organization uses SIOS DataKeeper Cluster Edition to protect their SQL Server environment across
more than 18 cluster nodes.

How SIOS Clustering Software Delivers Reliable HA


SIOS software is an essential part of your cluster solution, protecting your choice of Windows or Linux
environments in any configuration (or combination) of physical, virtual and cloud (public, private, and
hybrid) environments without sacrificing performance or availability.

If you need fast, efficient, replication to transfer data across low-bandwidth local or wide area networks,
SIOS DataKeeper protects essential Windows environments, including Microsoft SQL Server, Oracle,
SharePoint, Lync, Dynamics, and Hyper-V from downtime and data loss in a physical, virtual, or cloud
environment.

SIOS LifeKeeper for Linux supports all major Linux distributions, including Red Hat Enterprise Linux,
SUSE Linux Enterprise Server, CentOS, and Oracle Linux and accommodates a wide range of storage
architectures.

SIOS products uniquely protect any Windows- or Linux-based application operating in physical, virtual,
cloud or hybrid cloud environments and in any combination of site or disaster recovery scenarios.
Applications such as SAP and databases, including Oracle, SQL Server, DB2, SAP HANA and many others,
benefit from SIOS software. The "out-of-the-box" simplicity, configuration flexibility, reliability,
performance, and cost-effectiveness of SIOS products set them apart from other clustering software.
Search engines:
Search Engines

A search engine is an online answering machine, used to search, understand, and organize the content in its
database based on the search query (keywords) entered by end-users (internet users). To display search
results, a search engine first finds the relevant results in its database, sorts them into an ordered list based on
its search algorithm, and displays them to the end-user. The ordered list of results is commonly known as a
Search Engine Results Page (SERP).

Google, Yahoo!, Bing, YouTube, and DuckDuckGo are some popular examples of search engines.

In our search engine tutorial, we are going to discuss the following topics -

o Advantages of Search Engine
o Disadvantages of Search Engine
o Components of Search Engine
o How do search engines work
o Search Engine Processing
o Search Engine (Google) algorithm updates
o Most Popular Search Engines in the world

Advantages of Search Engine/ Characteristics of Search engines

Searching for content on the Internet has become one of the most popular activities all over the world. In the
current era, the search engine is an essential part of everyone's life because it offers various popular ways to
find valuable, relevant, and informative content on the Internet.

A list of advantages of search engines is given below -


1. Time-Saving

A search engine helps us to save time in the following two ways -

o Eliminate the need to find information manually.


o Perform search operations at a very high speed.

2. Variety of information

Search engines offer a variety of resources to obtain relevant and valuable information from the Internet.
By using a search engine, we can get information in various fields such as education, entertainment, games,
etc. The information we get from a search engine can be in the form of blogs, PDFs, PPTs, text, images,
videos, and audio.

3. Precision

Search engines are able to provide precise results for specific queries.

4. Free Access

Most search engines, such as Google, Bing, and Yahoo, allow end-users to search their content for free.
There is no restriction on the number of searches, so all end-users (students, job seekers, IT employees, and
others) can spend as much time as they need searching for valuable content to fulfill their requirements.

5. Advanced Search

Search engines allow us to use advanced search options to get relevant, valuable, and informative results.
Advanced search makes our searches more flexible as well as sophisticated. For example, when you want to
search within a specific site, type "site:" (without quotes) followed by the site's web address.

Suppose we want to search for a Java tutorial on javaTpoint; then type "java site:www.javatpoint.com" to
get the result quickly.

To search educational institution sites (colleges and universities) for B.Tech in computer science
engineering, use "computer science engineering site:.edu" to get the result.

6. Relevance

Search engines allow us to search for relevant content based on a particular keyword. For example, the site
"javatpoint" scores highly for the term "java tutorial" because a search engine sorts its result pages by the
relevance of the content; that is why we see the highest-scoring results at the top of the SERP.

Disadvantages of Search Engine

There are the following disadvantages of Search Engines -


o Sometimes the search engine takes too much time to display relevant, valuable, and informative
content.
o Search engines, especially Google, frequently update their algorithms, and it is very difficult to
know exactly which algorithm Google is running at any time.
o It can make end-users over-reliant, as they turn to search engines even for their smallest queries.

Components of Search Engine

There are the following four basic components of Search Engine -

1. Web Crawler

Web Crawler is also known as a search engine bot, web robot, or web spider. It plays an essential role in
search engine optimization (SEO) strategy. It is mainly a software component that traverses the web,
downloading and collecting the information available on the pages it visits.

Note: Googlebot is the most popular web crawler.

There are the following web crawler features that can affect the search results -

o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
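To make the crawler component described above concrete, here is a minimal, illustrative crawler sketch using only the Python standard library. The start URL is a placeholder, the link extraction is a crude regex, and a real crawler would also need robots.txt handling, rate limiting, and deduplication of page content.

# Minimal, illustrative web-crawler sketch (standard library only).
import re
import urllib.request
from collections import deque

def crawl(start_url, max_pages=3):
    seen, queue, pages = set(), deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                        # skip pages that fail to download
        pages[url] = html                   # store the downloaded content
        # Extract absolute links with a crude regex (illustration only).
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            queue.append(link)
    return pages

# pages = crawl("https://example.com")     # placeholder start URL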

2. Database

The search engine database is a type of non-relational database. It is the place where all the web
information is stored. It holds a large number of web resources. Some of the most popular search engine
databases are Amazon Elastic Search Service and Splunk.

There are the following two database variable features that can affect the search results:

o Size of the database


o The freshness of the database

3. Search Interfaces

Search Interface is one of the most important components of a search engine. It is the interface between the
user and the database, and it helps users run search queries against the database.

There are the following features of search interfaces that affect the search results -

o Operators
o Phrase Searching
o Truncation

4. Ranking Algorithms

The ranking algorithm determines the order in which matching web pages are displayed; Google, for example, ranks web pages using its own search algorithms.

There are the following ranking features that affect the search results -

o Location and frequency


o Link Analysis
o Clickthrough measurement

How do search engines work

The following tasks are done by every search engine -

1. Crawling

Crawling is the first stage in which a search engine uses web crawlers to find, visit, and download the web
pages on the WWW (World Wide Web). Crawling is performed by software robots, known as "spiders" or
"crawlers." These robots are used to review the website content.

2. Indexing

The index is an online library of websites, used to sort, store, and organize the content found during
crawling. Once a page is indexed, it can appear as a result for the queries to which it is most valuable and
relevant.

3. Ranking and Retrieval

The ranking is the last stage of the search engine. It is used to provide the piece of content that is the best
answer to the user's query, displaying the best content at the top of the results.

Search Engine Processing/ Search Engine Functionality

There are the following two major search engine processing functions -

1. Indexing process

Indexing is the process of building a structure that enables searching.

Indexing process contains the following three blocks -

i. Text acquisition
It is used to identify and store documents for indexing.

ii. Text transformation

It is the process of transforming documents into index terms or features.

iii. Index creation

Index creation takes the output of text transformation and creates the indexes or data structures that enable
fast searching.

2. Query process

The query process produces the list of documents relevant to a user's search query.

There are the following three tasks of the Query process -

i. User interaction

User interaction provides an interface between the users who search the content and the search engine.

ii. Ranking

The ranking is the core component of the search engine. It takes query data from the user interaction and
generates a ranked list of data based on the retrieval model.

iii. Evaluation

Evaluation is used to measure and monitor the effectiveness and efficiency. The evaluation result helps us to
improve the ranking of the search engine.
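To make these two processes concrete, here is a minimal Python sketch of an inverted index and a query over it. The three documents and the tokenization rule (lower-casing plus a whitespace split) are simplifying assumptions, and ranking is omitted entirely.

# Indexing process (text transformation + index creation), then a simple query.
from collections import defaultdict

docs = {
    1: "Java tutorial for beginners",
    2: "Python tutorial and examples",
    3: "Java and Python comparison",
}

def terms(text):
    # Text transformation: lower-case and split into terms.
    return text.lower().split()

# Index creation: map each term to the set of documents containing it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in terms(text):
        inverted_index[term].add(doc_id)

def search(query):
    # Query process: intersect the posting lists of the query terms.
    postings = [inverted_index.get(t, set()) for t in terms(query)]
    return set.intersection(*postings) if postings else set()

print(search("java tutorial"))   # {1}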
Search Engine Architecture:
Architecture
The search engine architecture comprises the three basic layers listed below:
 Content collection and refinement.
 Search core
 User and application interfaces

Search Engine Processing


Indexing Process
The indexing process comprises the following three tasks:
 Text acquisition
 Text transformation
 Index creation
Text acquisition
It identifies and stores documents for indexing.
Text Transformation
It transforms documents into index terms or features.
Index Creation
It takes the index terms created by text transformation and creates data structures to support fast searching.
Query Process
The query process comprises the following three tasks:
 User interaction
 Ranking
 Evaluation
User interaction
It supports creation and refinement of the user query and displays the results.
Ranking
It uses the query and indexes to create a ranked list of documents.
Evaluation
It monitors and measures the effectiveness and efficiency. It is done offline.
Examples
The following are several of the search engines available today:

 Google: It was originally called BackRub. It is the most popular search engine globally.
 Bing: It was launched in 2009 by Microsoft. It is the latest web-based search engine and also delivers
Yahoo's results.
 Ask: It was launched in 1996 and was originally known as Ask Jeeves. It includes support for match,
dictionary, and conversation questions.
 AltaVista: It was launched by Digital Equipment Corporation in 1995. Since 2003, it has been powered
by Yahoo technology.
 AOL.Search: It is powered by Google.
 LYCOS: It is a top-5 internet portal and the 13th largest online property according to Media Matrix.
 Alexa: It is a subsidiary of Amazon and is used for providing website traffic information.
Ranking of web pages:
PageRank is a method for rating Web pages objectively and mechanically, effectively measuring the human
interest and attention devoted to them. Web search engines have to cope with inexperienced users and with
pages engineered to manipulate conventional ranking functions. Ranking methods that simply count easily
replicated features of Web pages are vulnerable to such manipulation.
The task is to take advantage of the hyperlink structure of the Web to produce a global importance ranking
of every Web page. This ranking is called PageRank.
The Web can be viewed as a graph with about 150 million nodes (Web pages) and 1.7 billion edges
(hyperlinks). If Web pages A and B link to page C, A and B are called the backlinks of C. In general,
highly linked pages are more important than pages with few backlinks; however, the importance of the
linking pages also matters, so a few backlinks from important pages can outweigh many backlinks from
unimportant ones.
For instance, a Web page with a single backlink from Yahoo should be ranked higher than a page with
multiple backlinks from unknown or private sites. A Web page has a high rank if the sum of the ranks of its
backlinks is high.
The following is a simplified version of PageRank. Let u and v be Web pages, let Bu be the set of pages
that point to u, and let Nv be the number of links going out of v. Let c < 1 be a normalization factor. Then
a simple ranking R, which is a simplified version of PageRank, is defined as:

R(u) = c * Σ_{v ∈ Bu} R(v) / Nv

The rank of a page is divided evenly among its forward links, and these shares contribute to the ranks of the
pages it points to. The equation is recursive, but there is an issue with this simplified function.
If two Web pages point only to each other while some other Web page points to one of them, a loop is
formed during the iteration. This loop accumulates rank but never distributes any rank outward. Such a
trap, formed by loops in the graph with no outgoing edges, is known as a rank sink.
The PageRank algorithm begins by converting every URL in the database into an integer ID. The next
phase is to store each hyperlink in a database using the integer IDs to identify the Web pages. The iteration
is started after sorting the link structure by parent ID and removing dangling links.
A good initial assignment should be selected to speed up convergence. The weights from the current time
step are kept in memory, and the previous weights are accessed on disk in linear time. After the weights
have converged, the dangling links are inserted back and the rankings are recalculated. The computation
performs well, but can be made quicker by relaxing the convergence criteria and using more effective
optimization approaches.
PageRank (PR) is an algorithm

PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results.
PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring
the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough estimate of
how important the website is. The underlying assumption is that more important websites are likely to
receive more links from other websites.
It is not the only algorithm used by Google to order search engine results, but it is the first algorithm that
was used by the company, and it is the best-known.



Algorithm
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person
randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections
of documents of any size. It is assumed in several research papers that the distribution is evenly divided
among all documents in the collection at the beginning of the computational process. The PageRank
computations require several passes through the collection, called "iterations", to adjust the approximate
PageRank values to more closely reflect the theoretical true value.
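The iterative computation can be sketched in a few lines of Python. The toy link graph below is made up for illustration, and a damping factor d (the standard way of handling rank sinks) is combined with the simplified rule R(u) = c Σ R(v)/Nv described earlier.

# Simplified PageRank iteration on a tiny, made-up link graph.
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85                                       # damping factor
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}    # start from a uniform distribution

for _ in range(50):                            # fixed number of iterations for simplicity
    new_rank = {p: (1 - d) / len(pages) for p in pages}
    for v, outs in links.items():
        share = d * rank[v] / len(outs)        # v splits its rank among its out-links
        for u in outs:
            new_rank[u] += share
    rank = new_rank

print(rank)   # page C accumulates the most rank in this toy graph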
The search engine history:

The first well documented search engine that searched content files, namely FTP files, was Archie,
which debuted on 10 September 1990. Prior to September 1993, the World Wide Web was entirely
indexed by hand. There was a list of webservers edited by Tim Berners-Lee and hosted on the CERN
webserver.

Ever since the world wide web became the engine of our lives, search has been the holy grail for developers
and companies. Beginning with Archie in 1990, considered the first search engine, moving on to Excite and
Lycos and Infoseek, by the mid 90s there was a veritable flood of search engines, particularly after Google
showed how it should be done in 1996. The complexity of the algorithms was now matched only by the
voracious appetite of searchers as the number of pages to be indexed ran into billions. Invariably, a lot of
them positioned themselves as specialized engines—for kids or jobs or tech or entertainment. Then came the
deep web search engines like http://www.deepdyve.com/ which indexed obscure and often not-easy to find
content.
Post-Google, there were the much touted "Google killers" including Cuil (pronounced Cool) and Dogpile.
While the former is no more, the latter is now just a Google clone. Unbelievably, there have also been those
that have tried to go the human-powered search way! With a million-plus spam pages being generated every
day besides the billions of legitimate ones, you would imagine most humans would be daunted.
As the original super spider, AltaVista, shuts down, here's a brief history of some of the better known search
engines through the years:
1990: Archie—the very first search engine
1991: Veronica and Jughead
1992: Vlib
1993: Excite and World wide web wanderer
1994: AltaVista, Galaxy, Yahoosearch, Infoseek, Webcrawler, Lycos
1995: Looksmart
1996: Google, HotBot, Inktomi
1997: Ask.com
1998: MSN; dmoz
1999: Alltheweb
2005: Snap
2006: Microsoft Livesearch
2008: Cuil
2009: Microsoft Bing
Enterprise Search:

Enterprise search is a valuable tool for businesses since it allows employees to perform instant searches
within the company's knowledge base. Enterprise search software decreases the amount of time it takes for
an employee to find the necessary information, leaving more time for higher value-added tasks. This is
especially important for today's lean, digital, agile organizations that strive to get the optimal performance
from their teams.

We answered all your enterprise search-related questions:

What is an enterprise search?

Enterprise search is a form of search that helps employees find data from one or multiple databases with a
single search query. The data can be in any format and can come from anywhere inside the company - in
databases, document management systems, e-mail servers, on paper, and so on.

The relation between Enterprise Search and Knowledge Management

Knowledge management is the process where value is derived from knowledge by making it accessible to
everyone within an organization.

For practical knowledge management, the combination of internal data and web-focused search tools has a
crucial role. Enterprise search enables these features in a single search query. Therefore enterprise search
can be a key driver for successful knowledge management.
Why is it important now?

Capturing data has never been easier. It is less costly than it used to be, and most enterprises capture data as
part of their operations. However, making data usable is as important as capturing it: in terms of work
productivity, the data needs to be easy and accessible to find.

How does it work?

Enterprise search engines require data preparation. Once data is ready for the search engine, users input text
queries and receive formatted results.

Content Awareness

Content awareness, also called "content collection", is the process of connecting the databases that the
search engine can access.

Content Processing

Incoming content from different databases arrives in different formats such as XML, HTML, office
document formats, or plain text. In this step of enterprise search, documents are converted to plain text
using document filters so that they can be searched efficiently. The content processing phase also includes
tokenization; for example, characters are converted to lower case to enable fast case-insensitive search.
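A small sketch of this content-processing step is shown below. The document "filters" here are trivial placeholders (a real system would use proper PDF, HTML, or Office filters), and the tokenization is just a lower-cased word split.

# Content processing sketch: format-specific filters -> plain text -> tokens.
import re

def html_filter(raw):
    return re.sub(r"<[^>]+>", " ", raw)   # crude tag stripping, for illustration only

def plain_filter(raw):
    return raw

filters = {"html": html_filter, "txt": plain_filter}

def process(raw, fmt):
    text = filters[fmt](raw)
    # Tokenization with lower-casing for case-insensitive search.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(process("<p>Quarterly REPORT 2023</p>", "html"))   # ['quarterly', 'report', '2023']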

Indexing

After the content is processed, documents are stored in an index. The index records all the terms, together
with information about ranking and the frequency of each term.

Serving results to user queries

The search system compares the query to the saved index and returns matching results. It returns entries
that contain what the user entered as a query, and may also return similar results.

What are its use cases?

Enterprise search engines have some common use cases that increase the efficiency of research processes.
We listed the five most common use cases for you:

 Knowledge management: Applying enterprise search eases and improves the process of knowledge
management within the organization. In other words, if the organization has many documents in
archives, you are better off using a search engine to find the right document.
 Contact Experts: You don't need to know people's full names if you are looking for experts within
the organization. You can filter according to attributes and experience to find experts.
 Talent Search: Enterprise search engines can match candidates with job descriptions from the
database of potential candidates.
 Intranet Search: It helps intranet users locate the information they need from the organization's
shared drives and databases.
 Insight Engines: Insight engines are an evolved version of enterprise search, since insight engines
can leverage AI capabilities when processing search queries.

What is the difference between enterprise search and insight engine?

Enterprise search engines and insight engines serve the same purpose: to show results for business users'
queries. However, insight engines are more advanced platforms; they combine data with machine learning
algorithms to process content, so they can provide more relevant and personalized results for users.
Enterprise search, on the other hand, converts content to plain text by using document filters.

What are enterprise search best practices?


 Autocompletion of queries: The autosuggest feature improves the user experience of your
enterprise search engine. Make sure you choose a tool that offers a list of possible completed words
and phrases when users start typing.

 Apply search analytics: Collect query data of users so that you can gain insights about your search
engine performance and the topics researched by your users.
 Evaluate your team’s talent: Assess the capability of your business to implement a solution. If your
team is a total stranger to search engine architecture, you can hire a third party integration specialist
to help you implement the solution.

What is the maturity of your enterprise search engine?

With advancements in technology, enterprise search applications have become smarter. Search engines
offer different levels of search capability. Before investing in a new enterprise solution, you should assess
your current search level and identify your requirements based on your business's goals.
Enterprise Search Engine Software.

Businesses can use both open source and proprietary solutions. Each enterprise search vendor has unique
pros and cons. Here is the list of vendors divided into two groups:

Open Source Software


 Apache Solr
 Elasticsearch
 Sphinx
 Terrier
 Xapian

Closed Source Software


 Addsearch
 Algolia
 Commvault
 Concept Searching Limited
 Copernic
 Coveo
 Dassault Exalead
 Dieselpoint
 dtSearch Corp.
 Funnelback
 Hyland
 IBM Watson Explorer
 Lookeen
 Lucidworks Fusion
 MarkLogic
 Micro Focus IDOL
 Microsoft Bing
 Mindbreeze
 Oracle Secure Enterprise Search
 SAP
 SLI Systems
 Swiftype Enterprise Search
 TEXIS
 Varonis DatAnswers
 ZL Technologies
