Cluster Analysis
Cluster analysis is the process of finding groups of similar objects and forming them into clusters. It is an
unsupervised machine learning technique that operates on unlabelled data: data points that belong together
are grouped so that all the objects in a group are similar to one another.
Cluster:
The given data is divided into groups by combining similar objects. Such a group is called a cluster: a
collection of similar data objects grouped together.
For example, consider a dataset of vehicles containing information about cars, buses, bicycles, etc. Because
this is unsupervised learning, there are no class labels such as Car or Bike; all the data is mixed together
with no structure. Our task is to convert this unlabelled data into labelled data, and this can be done using
clusters: cluster analysis arranges the data points into groups such as a cars cluster containing all the cars,
a bikes cluster containing all the bikes, and so on. In short, it partitions unlabelled data into groups of
similar objects.
Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering must deal with huge
databases. To handle extensive databases, the clustering algorithm should be scalable; if it is not, it cannot
process the full data, which can lead to wrong results.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as well as
small-sized data.
3. Usability with multiple data kinds: A clustering algorithm should be capable of dealing with different
types of data, such as discrete, categorical, interval-based, and binary data.
4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data.
Algorithms that are sensitive to such data may produce poor-quality clusters, so a clustering algorithm
should be able to handle unstructured data and give it structure by organising it into groups of similar data
objects. This makes it easier for the data expert to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method: It is used to make partitions on the data in order to form clusters. If "n" partitions
are made on "p" objects of the database, then each partition is represented by a cluster, with n ≤ p. Two
conditions must be satisfied by this partitioning clustering method:
Each object must belong to exactly one group.
Each group must contain at least one object.
The partitioning method also uses a technique called iterative relocation, in which an object is moved from
one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is
created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed.
There are two approaches for creating the hierarchical decomposition:
Agglomerative Approach: Also known as the bottom-up approach. Initially, each object forms a
separate group. The method then keeps merging the objects or groups that are close to one
another, meaning that they exhibit similar properties. This merging continues until the
termination condition holds.
Divisive Approach: Also known as the top-down approach. We start with all the data objects in
the same cluster. Through continuous iteration, the cluster is divided into smaller clusters until
the termination condition is met or each cluster contains a single object.
Once a group is split or merged, the step can never be undone; hierarchical clustering is therefore a rigid, inflexible method.
Two approaches can be used to improve the quality of hierarchical clustering in data mining:
Carefully analyze the linkages between objects at every partitioning of the hierarchical
clustering.
Integrate hierarchical agglomeration with other clustering techniques: first group the objects
into micro-clusters, then perform macro-clustering on the micro-clusters.
Density-Based Method: This method is based on the notion of density. A given cluster keeps growing as
long as the density in its neighbourhood exceeds some threshold; that is, for each data point within a given
cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells
that form a grid structure. A major advantage of this method is its fast processing time, which depends
only on the number of cells in each dimension of the quantized space, not on the number of data objects.
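The quantization step above can be sketched in a few lines of Python. This is a minimal 1-D illustration with made-up points and a made-up helper name (`grid_cells`), not a full grid-based clustering algorithm: each point is mapped to a fixed-width cell, and occupied cells then serve as the cluster regions.

```python
# Minimal sketch: quantize 1-D points into fixed-width grid cells.
# Each occupied cell collects the points that fall into it.
def grid_cells(points, cell_width):
    cells = {}
    for p in points:
        # integer cell index = which cell of width `cell_width` p falls in
        cells.setdefault(int(p // cell_width), []).append(p)
    return cells

print(grid_cells([1.2, 1.8, 5.1, 5.9, 9.5], cell_width=2.0))
# {0: [1.2, 1.8], 2: [5.1, 5.9], 4: [9.5]}
```

Note that the work done per point is constant, which is why the processing time of grid-based methods depends on the number of cells rather than on pairwise comparisons between objects.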
Model-Based Method: In the model-based method, a model is hypothesized for each cluster in order to find
the data that best fits that model. A density function is used to locate the clusters for a given model; it
reflects the spatial distribution of the data points and also provides a way to automatically determine the
number of clusters based on standard statistics, taking outliers and noise into account. It therefore yields
robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by the incorporation of
application or user-oriented constraints. A constraint refers to the user expectation or the properties of the
desired clustering results. Constraints provide us with an interactive way of communication with the
clustering process. The user or the application requirement can specify constraints.
Applications of Cluster Analysis:
Cluster analysis is widely used in market research, pattern recognition, data analysis, and image
processing.
It helps marketers discover distinct groups in their customer base and characterize those groups
by their purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar
functionalities, and gain insight into structures inherent in populations.
It helps identify areas of similar land use in an earth-observation database, and groups of houses
in a city according to house type, value, and geographic location.
It helps classify documents on the web for information discovery.
It is used in outlier-detection applications such as the detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data and to observe the characteristics of each cluster.
Partitioning Method (K-Means) in Data Mining
Partitioning Method: This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. The data analyst specifies the number of clusters to be generated.
Given a database D containing N objects, the partitioning method constructs K user-specified partitions of
the data, in which each partition represents a cluster and a particular region. Popular partitioning
algorithms include K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). Here we
look at the working of the K-Means algorithm in detail.
K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user
and partitions the dataset containing N objects into K clusters so that the similarity among data objects
inside a cluster (intracluster similarity) is high, while the similarity between data objects in different
clusters (intercluster similarity) is low. The similarity of a cluster is measured with respect to the mean
value of the cluster; K-Means is a squared-error algorithm. At the start, K objects are chosen at random
from the dataset, each representing a cluster mean (centre). Every remaining object is assigned to the
nearest cluster based on its distance from the cluster mean, and the mean of each cluster is then
recalculated with the newly added objects.
Algorithm: K-Means
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects
Output:
A dataset of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat steps 2 and 3 until no change occurs.
Figure – K-Means clustering flowchart.
Example: Suppose we want to group the visitors to a website using just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between iterations 3 and 4, so we stop. K-Means therefore yields the two clusters
(16-29) and (36-66).
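The iterations above can be reproduced with a minimal 1-D K-Means sketch. This is a simplified illustration (the helper name `kmeans_1d` and the fixed initial centres are ours), not a production implementation; it assumes no cluster ever becomes empty, which holds for this data.

```python
# Minimal 1-D K-Means: assign each point to the nearest centre,
# recompute the centres, and stop when nothing changes.
def kmeans_1d(data, centers, max_iter=100):
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for x in data:
            # step 2: assign x to the cluster with the nearest mean
            idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        # step 3: recalculate each cluster mean (assumes non-empty clusters)
        new_centers = [sum(c) / len(c) for c in clusters]
        if new_centers == centers:  # step 4: no change -> stop
            break
        centers = new_centers
    return centers, clusters

ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centers, clusters = kmeans_1d(ages, [16.0, 22.0])
print(clusters[0])                              # [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
print(round(centers[0], 2), round(centers[1], 2))  # 20.5 48.89
```

Running this with the same random initial centres (16 and 22) converges to exactly the clusters and means shown in iterations 3 and 4 above.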
Hierarchical clustering in data mining
Hierarchical clustering is an unsupervised learning procedure that determines successive clusters based on
previously defined clusters, grouping the data into a tree of clusters. It starts by treating each data point as
an individual cluster. The endpoint is a set of clusters in which each cluster is distinct from the others and
the objects within each cluster are broadly similar to one another.
Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar
objects into clusters. It is also known as AGNES (Agglomerative Nesting). In agglomerative clustering,
each data point initially acts as an individual cluster, and at each step clusters are merged in a bottom-up
fashion until a single cluster remains.
1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (compute the proximity matrix).
3. Merge the most similar clusters.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat steps 3 and 4 until a single cluster remains.
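The steps above can be sketched as a small Python function. This is an illustrative single-linkage version on made-up 1-D points (the function name `agglomerative` and the data are assumptions); real implementations update the proximity matrix incrementally rather than recomputing it.

```python
# Agglomerative (bottom-up) clustering with single linkage:
# repeatedly merge the two closest clusters until target_k remain.
def agglomerative(points, target_k=1):
    # step 1: each point starts as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        # steps 2-3: find the pair of clusters at minimum distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # step 4: rebuild the cluster list with the merged pair
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

print(agglomerative([1, 2, 9, 10, 25], target_k=2))
```

Stopping at `target_k=1` reproduces the full dendrogram-building process described above; stopping earlier cuts the tree at a chosen number of clusters.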
Let's understand this concept with the help of a graphical representation using a dendrogram.
The demonstration below shows how the actual algorithm works; no calculation is done here, and all the
proximities among the clusters are assumed.
Step 1:
Consider each letter (P, Q, R, S, T, V) as an individual cluster and find the distance of each individual
cluster from all the other clusters.
Step 2:
Now, merge the comparable clusters into a single cluster. Say cluster Q and cluster R are similar to each
other, so we merge them in this step, and likewise S and T. We get the clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST), (V)]
together to form new clusters as [(P), (QR), (STV)]
Step 4:
Repeat the same process. The clusters (QR) and (STV) are comparable and are combined to form a new
cluster. We now have [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)]
Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In divisive
clustering, all the data points start in a single cluster, and at every iteration the data points that are not
similar are separated from their cluster and treated as individual clusters. Finally, we are left with N
clusters.
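The divisive (top-down) direction can be sketched as follows. This is an illustrative 1-D version (the split rule, the function name `divisive`, and the data are assumptions): start from one cluster and repeatedly split the cluster at its widest internal gap.

```python
# Divisive (top-down) clustering sketch for 1-D data: split at the
# widest gap between consecutive sorted values until target_k clusters.
# Assumes target_k <= number of points.
def divisive(points, target_k):
    clusters = [sorted(points)]  # everything starts in a single cluster
    while len(clusters) < target_k:
        best = None
        # find the cluster with the widest gap between neighbours
        for ci, c in enumerate(clusters):
            for i in range(len(c) - 1):
                gap = c[i + 1] - c[i]
                if best is None or gap > best[0]:
                    best = (gap, ci, i)
        _, ci, i = best
        c = clusters.pop(ci)
        # split that cluster into two at the widest gap
        clusters += [c[:i + 1], c[i + 1:]]
    return clusters

print(divisive([1, 2, 9, 10, 25], target_k=2))  # [[1, 2, 9, 10], [25]]
```

Letting `target_k` equal the number of points continues the splitting until each point is its own cluster, the N-cluster endpoint described above.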
Advantages of Hierarchical clustering
o It is simple to implement and gives the best output in some cases.
o It is easy and results in a hierarchy, a structure that contains more information.
o It does not need us to pre-specify the number of clusters.
Density-based clustering in data mining
Density-based clustering refers to a method based on a local cluster criterion, such as density-connected
points. In this tutorial, we will discuss density-based clustering with examples.
Density-based clustering is one of the most popular unsupervised learning methodologies used in model
building and machine learning algorithms. Data points in the low-density regions separating two clusters
are considered noise. The surroundings within a radius ε of a given object are known as the
ε-neighbourhood of the object. If the ε-neighbourhood of an object contains at least a minimum number of
objects, MinPts, then the object is called a core object.
MinPts: the minimum number of points required in the Eps-neighbourhood of a point for it to count as a
core point.
Directly density-reachable: a point i is directly density-reachable from a point k with respect to Eps and
MinPts if i belongs to NEps(k) and k is a core point, i.e., |NEps(k)| ≥ MinPts.
Density-reachable: a point i is density-reachable from a point j with respect to Eps and MinPts if there is a
chain of points p1, ..., pn with p1 = j and pn = i such that each pm+1 is directly density-reachable from pm.
Density-connected:
A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o such that
both i and j are density-reachable from o with respect to Eps and MinPts.
Suppose a set of objects is denoted by D'. An object i is directly density-reachable from an object j only if
i is located within the ε-neighbourhood of j and j is a core object.
An object i is density-reachable from an object j with respect to ε and MinPts in a given set of objects D'
only if there is a chain of objects p1, ..., pn with p1 = j and pn = i such that each pm+1 is directly
density-reachable from pm with respect to ε and MinPts.
An object i is density-connected to an object j with respect to ε and MinPts in a given set of objects D'
only if there is an object o in D' such that both i and j are density-reachable from o with respect to ε and
MinPts.
Major Features of Density-Based Clustering
o It is a single-scan method: it needs to examine the database only once.
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary shape.
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a density-
based notion of cluster and discovers clusters of arbitrary shape in spatial databases with noise.
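A compact sketch of the DBSCAN procedure on 1-D data, tying together the definitions above (core objects, ε-neighbourhoods, density-reachability). The data, parameters, and function name are illustrative; real implementations use spatial indexes rather than this O(n²) neighbour search, and here a point counts itself in its own neighbourhood.

```python
# DBSCAN sketch: grow a cluster from each unvisited core object by
# following density-reachability; non-core, non-reachable points are noise.
def dbscan(points, eps, min_pts):
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)
    cluster_id = 0

    def neighbours(i):
        # eps-neighbourhood of point i (includes i itself)
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:   # not a core object
            labels[i] = NOISE
            continue
        labels[i] = cluster_id     # start a new cluster from this core object
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # noise reclaimed as a border point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster_id
            j_neigh = neighbours(j)
            if len(j_neigh) >= min_pts:     # j is also core: keep expanding
                queue.extend(j_neigh)
        cluster_id += 1
    return labels

data = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 12.0]
print(dbscan(data, eps=0.5, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1 of whatever shape the data dictates, while the isolated point at 12.0 is labelled -1 (noise), exactly the behaviour the feature list above describes.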
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It computes an ordering of the
database with respect to its density-based clustering structure. This cluster ordering contains information
equivalent to the density-based clusterings obtained for a broad range of parameter settings. OPTICS is
useful for both automatic and interactive cluster analysis, including determining an intrinsic clustering
structure.
DENCLUE
DENCLUE stands for DENsity-based CLUstEring. It is a clustering method based on a set of density
distribution functions.
Because you won't need to access these data often, says Teal, "you can use storage options where it can cost
more money to access the data, but storage costs are low", for instance, Amazon's Glacier service. You
could even store the raw data on duplicate hard drives kept in different locations. Storage costs for large data
files can add up, so budget accordingly.
This means recording your entire data workflow — which version of the data you used, the clean-up and
quality-checking steps, and any processing code you ran. Such information is invaluable for documenting
and reproducing your methods. Eric Lyons, a computational biologist at the University of Arizona in
Tucson, uses the video-capture tool asciinema to record what he types into the command line, but lower-tech
solutions can also work. A group of his colleagues, he recalls, took photos of their computer screen's display
and posted them on the lab's group on Slack, an instant-messaging platform.
Use version control
Version-control systems allow researchers to understand precisely how a file has changed over time, and
who made the changes. But some systems limit the sizes of the files you can use. Harvard Dataverse (which
is open to all researchers) and Zenodo can be used for version control of large files, says Alyssa Goodman,
an astrophysicist and data-visualization specialist at Harvard University in Cambridge, Massachusetts.
Another option is Dat, a free peer-to-peer network for sharing and versioning files of any size. The system
maintains a tamper-proof log that records all the operations you perform on your file, says Andrew Osheroff,
a core software developer at Dat in Copenhagen. And users can direct the system to archive a copy of each
version of a file, says Dat product manager Karissa McKelvey, who is based in Oakland, California. Dat is
currently a command-line utility, but "we've been actively revamping", says McKelvey; the team hopes to
release a more user-friendly front end later this year.
Record metadata
"Your data are not useful unless people, and 'future you', know what they are," says Teal. That's the
job of metadata, which describe how observations were collected, formatted and organized. Consider which
metadata to record before you start collecting, Lyons advises, and store that information alongside the data
— either in the software tool used to collect the observations or in a README or another dedicated file.
The Open Connectome Project, led by Joshua Vogelstein, a neuro-statistician at Johns Hopkins University in
Baltimore, Maryland, logs its metadata in a structured plain-text format called JSON. Whatever your
strategy, try to think long-term, Lyons says: you might one day want to integrate your data with those of
other labs. If you're proactive with your metadata, that integration will be easier down the line.
Start early
Data management is crucial even for young researchers, so start your training early. "People feel like they
never have time to invest," Elmer says, but "you save yourself time in the long run". Start with the basics of
the command line, plus a programming language such as Python or R, whichever is more important to your
field, he says. Lyons concurs: "Step one: get familiar with data from the command line." In November,
some of his collaborators who were not fluent in command-line usage had trouble with genomic data
because chromosome names didn't match across all their files, Lyons says. "Having some basic
command-line skills and programming let me quickly correct the chromosome names."
Cluster software
What is Clustering Software?
Clustering software enables you to configure servers for redundancy to prevent downtime and data loss. A
primary server is clustered with one or more secondary servers. Clustering software monitors the health of
the application and, if it detects a failure, moves application operation to a secondary server in the cluster in
a process called a failover. IT professionals rely on clustering to eliminate a single point of failure and
minimize the risk of downtime. In fact, 86 percent of all organizations are operating their HA applications
with some kind of clustering or high availability mechanism in place.
Products, such as SUSE Linux Enterprise High Availability Extension (HAE), Red Hat Cluster Suite, Oracle
Real Application Clusters (RAC), and Windows Server Failover Clustering (WSFC), only support one
operating system. Moreover, Linux open-source HA extensions require a high degree of technical skill,
creating complexity and reliability issues that challenge most operators.
How SIOS Clustering Software Provides High Availability for Windows and Linux Clusters
If you are running an essential application in a Windows or Linux environment, you may want to consider
SIOS Technology Corporation's high availability software clustering products for consistent HA/DR
regardless of OS.
In a Windows environment, SIOS DataKeeper Cluster Edition seamlessly integrates with and extends
Windows Server Failover Clustering (WSFC) by providing a performance-optimized, host-based data
replication mechanism. While WSFC manages the software cluster, SIOS performs the replication to enable
disaster protection and ensure zero data loss in cases where shared storage clusters are impossible or
impractical, such as in cloud, virtual, and high-performance storage environments.
In a Linux environment, the SIOS LifeKeeper for Linux provides a tightly integrated combination of high
availability failover clustering, continuous application monitoring, data replication, and configurable
recovery policies, protecting your essential applications from downtime and disasters.
Whether you are in a Windows or Linux environment, SIOS products free your IT team from the complexity
and challenges of computing infrastructures. They provide the intelligence, automation, flexibility, high
availability, and ease-of-use IT managers need to protect essential applications from downtime or data loss.
With over 80,000 licenses sold, SIOS is used by many of the world's largest companies.
Here is one case study that discusses how a leading Hospital Information Systems (HIS) provider deployed
SIOS DataKeeper Cluster Edition to improve high availability and network bandwidth in their Windows
cluster environment.
How One HIS Provider Improved RPO and RTO With SIOS DataKeeper Clustering Software
This leading HIS provider has more than 10,000 U.S.-based health care organizations (HCOs) using a
variety of its applications, including patient care management, patient self-service, and revenue
management. To support these customers, the organization had more than 20 SQL Server clusters located in
two geographically dispersed data centers, as well as a few smaller servers and SQL Server log shipping
for disaster recovery (DR).
The organization has a large customer base and vast IT infrastructure and needed a solution that could
handle heavy network traffic and eliminate network bandwidth problems when replicating data to its DR
site. The organization also needed to improve its Recovery Point Objective (RPO) and Recovery Time
Objective (RTO) to reduce the volume of data at risk and get IT operations back up and running faster after
a disaster or system failure. RPO is the maximum amount of data loss that can be tolerated when a server
fails, or a disaster happens. RTO is the maximum tolerable duration of any outage.
To address these challenges, this organization chose SIOS DataKeeper Cluster Edition, which provides
seamless integration with WSFC, making it possible to create SANless clusters.
Once SIOS DataKeeper Cluster Edition passed the organization's stringent POC testing, the IT team
deployed the solution in the company's production environment. The team deployed SIOS across a
three-node cluster comprising two SAN-based nodes in the organization's primary, on-premises data center
and one SANless node in its remote DR site.
The SIOS solution synchronizes replication across the three nodes in the cluster and eliminates the
bandwidth issues at the DR site, improving both RPO and RTO and reducing the cost of bandwidth. Today,
the organization uses SIOS DataKeeper Cluster Edition to protect their SQL Server environment across
more than 18 cluster nodes.
If you need fast, efficient replication to transfer data across low-bandwidth local or wide area networks,
SIOS DataKeeper protects essential Windows environments, including Microsoft SQL Server, Oracle,
SharePoint, Lync, Dynamics, and Hyper-V from downtime and data loss in a physical, virtual, or cloud
environment.
SIOS LifeKeeper for Linux supports all major Linux distributions, including Red Hat Enterprise Linux,
SUSE Linux Enterprise Server, CentOS, and Oracle Linux and accommodates a wide range of storage
architectures.
SIOS products uniquely protect any Windows- or Linux-based application operating in physical, virtual,
cloud or hybrid cloud environments and in any combination of site or disaster recovery scenarios.
Applications such as SAP and databases, including Oracle, SQL Server, DB2, SAP HANA and many others,
benefit from SIOS software. The "out-of-the-box" simplicity, configuration flexibility, reliability,
performance, and cost-effectiveness of SIOS products set them apart from other clustering software.
Search Engines
A search engine is an online answering machine used to search, understand, and organize results in its
database based on the search query (keywords) entered by end-users (internet users). To display search
results, a search engine first finds the relevant results in its database, sorts them into an ordered list using
its search algorithm, and displays them to the end-user. This ordered list of results is known as a Search
Engine Results Page (SERP). Google, Yahoo!, Bing, YouTube, and DuckDuckGo are popular examples of
search engines.
Searching for content on the Internet has become one of the most popular activities all over the world. In
the current era, the search engine is an essential part of everyone's life because it offers various popular
ways to find valuable, relevant, and informative content on the Internet.
2. Variety of information
The search engine offers a variety of resources for obtaining relevant and valuable information from the
Internet. Using a search engine, we can find information in fields such as education, entertainment, and
games, in the form of blogs, PDFs, presentations, text, images, videos, and audio.
3. Precision
All search engines have the ability to provide more precise results.
4. Free Access
Most search engines, such as Google, Bing, and Yahoo, allow end-users to search their content for free,
with no restriction on the number of searches. All end users (students, job seekers, IT employees, and
others) can therefore spend as much time as they need searching for valuable content to fulfill their
requirements.
5. Advanced Search
Search engines allow us to use advanced search options to get relevant, valuable, and informative results.
Advanced search results make our searches more flexible as well as sophisticated. For example, when you
want to search for a specific site, type "site:" without quotes followed by the site's web address.
Suppose we want to search for java tutorial on javaTpoint then type "java site:www.javatpoint.com" to
get the advanced result quickly.
To search about education institution sites (colleges and universities) for B.Tech in computer science
engineering, then use "computer science engineering site:.edu." to get the advanced result.
6. Relevance
Search engines allow us to search for relevant content based on a particular keyword. For example, a site
such as javatpoint scores higher for the term "java tutorial" because a search engine sorts its result pages
by the relevance of the content; that is why the highest-scoring results appear at the top of the SERP.
1. Web Crawler
Web Crawler is also known as a search engine bot, web robot, or web spider. It plays an essential role in
search engine optimization (SEO) strategy. It is mainly a software component that traverses the web, then
downloads and collects information over the Internet.
There are the following web crawler features that can affect the search results -
o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
2. Database
The search engine database is a type of Non-relational database. It is the place where all the web
information is stored. It has a large number of web resources. Some most popular search engine databases
are Amazon Elastic Search Service and Splunk.
There are the following two database variable features that can affect the search results:
3. Search Interfaces
The search interface is one of the most important components of a search engine. It is an interface between
the user and the database, and it helps users search the database using queries.
There are the following features Search Interfaces that affect the search results -
o Operators
o Phrase Searching
o Truncation
4. Ranking Algorithms
Ranking algorithms are used by search engines such as Google to order web pages by relevance according
to the search algorithm.
1. Crawling
Crawling is the first stage in which a search engine uses web crawlers to find, visit, and download the web
pages on the WWW (World Wide Web). Crawling is performed by software robots, known as "spiders" or
"crawlers." These robots are used to review the website content.
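Conceptually, crawling is a graph traversal: the crawler starts from seed pages and follows links breadth-first, visiting each page once. The sketch below runs over an in-memory "web" (the page names and link structure are made up for illustration; a real crawler would fetch URLs over HTTP and respect robots.txt).

```python
# Toy crawler sketch: breadth-first traversal of an in-memory link graph.
from collections import deque

def crawl(web, seed):
    seen, queue = {seed}, deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)            # "download" / visit the page
        for link in web.get(page, []):
            if link not in seen:      # never visit the same page twice
                seen.add(link)
                queue.append(link)
    return order

web = {"home": ["about", "blog"], "blog": ["post1"], "about": []}
print(crawl(web, "home"))  # ['home', 'about', 'blog', 'post1']
```

The `seen` set is what keeps a crawler from looping forever on cyclic links, and the visit order here is exactly the breadth-first order a simple spider would use.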
2. Indexing
Indexing builds an online library of websites, used to sort, store, and organize the content found during
crawling. Once a page is indexed, it can appear among the results for the queries to which it is most
valuable and relevant.
3. Ranking
Ranking is the last stage of the search engine. It provides the piece of content that best answers the user's
query, displaying the best content at the top of the results.
1. Indexing process
i. Text acquisition
It is used to identify and store documents for indexing.
ii. Index creation
Index creation takes the output of text transformation and creates the indexes or data structures that enable
fast searching.
2. Query process
The query process produces a list of documents based on a user's search query.
i. User interaction
User interaction provides an interface between the users who search the content and the search engine.
ii. Ranking
The ranking is the core component of the search engine. It takes query data from the user interaction and
generates a ranked list of data based on the retrieval model.
iii. Evaluation
Evaluation is used to measure and monitor the effectiveness and efficiency of the search engine. The
evaluation results help us to improve its ranking.
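The query process above (user interaction, ranking, results) can be sketched with a toy retrieval model. The documents and the term-frequency scoring below are illustrative assumptions, not any real engine's internals.

```python
# Toy query process: match a user's query against documents and
# return a ranked list (illustrative term-frequency retrieval model).

def tokenize(text):
    """Split text into lowercase terms."""
    return text.lower().split()

def rank(query, documents):
    """Score each document by how often it contains the query terms."""
    q_terms = tokenize(query)
    scores = {}
    for doc_id, text in documents.items():
        terms = tokenize(text)
        scores[doc_id] = sum(terms.count(t) for t in q_terms)
    # Highest-scoring documents first; drop documents with no match.
    return sorted((d for d, s in scores.items() if s > 0),
                  key=lambda d: -scores[d])

docs = {
    "page1": "search engines crawl and index the web",
    "page2": "ranking orders results for a user query",
    "page3": "gardening tips for spring",
}
print(rank("search query", docs))  # → ['page1', 'page2']
```

Real retrieval models weight terms by rarity and document length rather than raw counts, but the overall flow (tokenize the query, score against the index, sort by score) is the same.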
Search Engine Architecture:
The search engine architecture comprises the three basic layers listed below:
Content collection and refinement.
Search core
User and application interfaces
Search Engine | Description
Google | It was originally called BackRub. It is the most popular search engine globally.
Bing | It was launched in 2009 by Microsoft. It is the latest web-based search engine that also delivers Yahoo's results.
Ask | It was launched in 1996 and was originally known as Ask Jeeves. It includes support for math, dictionary, and conversational questions.
LYCOS | It is a top 5 internet portal and the 13th largest online property according to Media Metrix.
Alexa | It is a subsidiary of Amazon used for providing website traffic information.
Ranking of web pages:
PageRank is a method for rating Web pages objectively and mechanically, effectively measuring the human
interest and attention devoted to them. Web search engines have to cope with inexperienced users and with
pages engineered to manipulate conventional ranking functions. Rating methods that count only replicable
features of Web pages are prone to manipulation.
The task is to take advantage of the hyperlink structure of the Web to produce a global importance ranking
of every Web page. This ranking is called PageRank.
The Web can be modeled as a graph with about 150 million nodes (Web pages) and 1.7 billion edges
(hyperlinks). If Web pages A and B link to page C, A and B are called the backlinks of C. In general,
highly linked pages are more important than pages with few backlinks; moreover, a backlink from an
important page counts for more than a backlink from an obscure one.
For instance, a Web page with a single backlink from Yahoo should be ranked higher than a page with
multiple backlinks from unknown or private sites. A Web page has a high rank if the sum of the ranks of its
backlinks is high.
The following is a simplified version of PageRank. Let u and v be Web pages, let B_u be the set of
pages that point to u, and let N_v be the number of links from v. Let c < 1 be a normalization
factor. Then a simple ranking R, a simplified interpretation of PageRank, is

R(u) = c ∑_{v ∈ B_u} R(v) / N_v
The rank of a page is divided evenly among its forward links, contributing to the ranks of the pages it
points to. The equation is recursive, but there is an issue with this simplified function.
If two Web pages point to each other but to no other page, while some other Web page points to one of them,
a loop is formed during the iteration. This loop accumulates rank but never distributes any rank outward.
Such a trap, formed by a loop of pages with no outward edges, is known as a rank sink.
The PageRank algorithm begins by converting every URL in the database into an integer ID. The
next phase is to store each hyperlink in a database, using the integer IDs to identify the Web pages. The
iteration begins after sorting the link structure by parent ID and removing dangling links.
A good initial assignment should be selected to speed up convergence. The weights for the current time
step are kept in memory, while the previous weights are accessed on disk in linear time. After the weights
have converged, the dangling links are inserted back and the rankings are recalculated. The computation
performs well but can be made faster by relaxing the convergence criteria and using more effective
optimization approaches.
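The iteration described above can be sketched in Python. This is a minimal illustration of the simplified formula R(u) = c ∑ R(v)/N_v, with a damping factor added to avoid rank sinks; the link graph, damping value, and tolerance below are illustrative assumptions, not the production implementation.

```python
# Simplified PageRank by power iteration (illustrative sketch).
# links[v] lists the pages that v points to; damping avoids rank sinks.

def pagerank(links, damping=0.85, tol=1e-8, max_iter=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # uniform initial assignment
    for _ in range(max_iter):
        new = {p: (1 - damping) / n for p in pages}
        for v, outlinks in links.items():
            for u in outlinks:                    # v contributes R(v)/N_v to each u
                new[u] += damping * rank[v] / len(outlinks)
        converged = sum(abs(new[p] - rank[p]) for p in pages) < tol
        rank = new
        if converged:
            break
    return rank

# A and B both link to C, so C ends up with the highest rank.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # → C
```

With damping, rank that would otherwise pool in a sink is redistributed uniformly each step, so the iteration converges even on graphs containing loops.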
PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results.
PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring
the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough estimate of
how important the website is. The underlying assumption is that more important websites are likely to
receive more links from other websites.
It is not the only algorithm used by Google to order search engine results, but it is the first algorithm that
was used by the company, and it is the best-known.
The first well documented search engine that searched content files, namely FTP files, was Archie,
which debuted on 10 September 1990. Prior to September 1993, the World Wide Web was entirely
indexed by hand. There was a list of webservers edited by Tim Berners-Lee and hosted on the CERN
webserver.
Ever since the world wide web became the engine of our lives, search has been the holy grail for developers
and companies. Beginning with Archie in 1990, considered the first search engine, moving on to Excite and
Lycos and Infoseek, by the mid-90s there was a veritable flood of search engines, particularly after Google
(which began as the BackRub research project in 1996) showed how it should be done. The complexity of the
algorithms was now matched only by the
voracious appetite of searchers as the number of pages to be indexed ran into billions. Invariably, a lot of
them positioned themselves as specialized engines—for kids or jobs or tech or entertainment. Then came the
deep web search engines like http://www.deepdyve.com/ which indexed obscure and often not-easy to find
content.
Post-Google, there were the much touted "Google killers," including Cuil (pronounced "Cool") and Dogpile.
While the former is no more, the latter is now just a Google clone. Unbelievably, there have also been those
that have tried to go the human-powered search way! With a million plus spam pages being generated every
day besides the billions of legitimate ones, you would imagine most humans would be daunted.
As the original super spider, AltaVista, shuts down, here's a brief history of some of the better-known search
engines through the years:
1990: Archie—the very first search engine
1991: Veronica and Jughead
1992: Vlib
1993: Excite and World Wide Web Wanderer
1994: AltaVista, Galaxy, Yahoo Search, Infoseek, WebCrawler, Lycos
1995: Looksmart
1996: Google, HotBot, Inktomi
1997: Ask.com
1998: MSN; dmoz
1999: Alltheweb
2005: Snap
2006: Microsoft Livesearch
2008: Cuil
2009: Microsoft Bing
Enterprise Search:
Enterprise search is a valuable tool for businesses since it allows employees to perform instant searches
within the company's knowledge base. Enterprise search software decreases the amount of time it takes for
an employee to find the necessary information, leaving more time for higher value-added tasks. This is
especially important for today's lean, digital, agile organizations that strive to get the optimal performance
from their teams.
Enterprise search is a type of search that helps employees find data from one or multiple databases with a
single search query. The searched data can be in any format and from anywhere inside the company - in
databases, document management systems, e-mail servers, on paper, and so on.
Knowledge management is the process where value is derived from knowledge by making it accessible to
everyone within an organization.
For practical knowledge management, the combination of internal data and web-focused search tools has a
crucial role. Enterprise search enables these features in a single search query. Therefore enterprise search
can be a key driver for successful knowledge management.
Why is it important now?
Capturing data has never been easier. It is less costly than it used to be, and most enterprises capture data as
part of their operations. However, organizing data is as important as capturing it: for work productivity,
information needs to be easy and quick to find.
Enterprise search engines require data preparation. Once data is ready for the search engine, users input text
queries and receive formatted results.
Content Awareness
Content awareness, also called "content collection," is the process of connecting the databases that the
search can access.
Content Processing
Incoming content from different databases arrives in different formats such as XML, HTML, office document
formats, or plain text. In this step of enterprise search, documents are converted to plain text using
document filters so they can be searched efficiently. The content processing phase also includes
normalization steps such as tokenization; for example, characters are converted to lower case to enable fast
case-insensitive search.
Indexing
After the content is processed, documents are stored in an index. This index contains all terms, along with
information about each term's ranking and frequency.
The search system compares the query to the saved index and returns matching results. The search returns
entries that contain what the user entered as a query, and may also return similar results.
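The content processing and indexing steps above can be sketched as a minimal inverted index. Document filters for formats like XML or HTML are elided; only tokenization and case folding are shown, and the document names and contents are illustrative assumptions.

```python
# Minimal inverted index: content processing (tokenize, case-fold),
# indexing (term -> documents with frequency), and query matching.
from collections import defaultdict

def process(text):
    """Content processing: tokenize and convert to lower case."""
    return text.lower().split()

def build_index(documents):
    """Indexing: map each term to the documents containing it, with frequency."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in process(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Compare the query to the saved index and return matching documents."""
    hits = defaultdict(int)
    for term in process(query):
        for doc_id, freq in index.get(term, {}).items():
            hits[doc_id] += freq
    return sorted(hits, key=lambda d: -hits[d])

docs = {
    "memo.txt": "Quarterly Sales report",
    "wiki.html": "sales process for the enterprise",
    "note.txt": "lunch menu",
}
idx = build_index(docs)
print(search(idx, "SALES"))  # → ['memo.txt', 'wiki.html']
```

Because both the documents and the query pass through the same `process` step, the upper-case query "SALES" still matches documents containing "Sales" or "sales" - the case-insensitive behavior described above.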
Enterprise search engines have some common use cases that increase the efficiency of research processes.
We listed the five most common use cases for you:
Knowledge management: Applying enterprise search eases and improves the process of knowledge
management within the organization. In other words, if the organization has many documents in
archives, it is better to use a search engine to find the right document.
Contact Experts: You don't need to know people's full names if you are looking for experts within
the organization. You can filter according to attributes and experience to find experts.
Talent Search: Enterprise search engines can match candidates with job descriptions from the
database of potential candidates.
Intranet Search: It helps intranet users locate the information they need from the organization's
shared drives and databases.
Insight Engines: Insight engines are an evolved version of enterprise search since insight engines
can leverage AI capabilities to search queries.
Enterprise search engines and insight engines serve the same purpose: to show results for business users'
queries. However, insight engines are more advanced platforms; they combine data and machine learning
algorithms to process content so that they can provide more relevant and personalized results for users.
Enterprise search, on the other hand, converts content to plain text using document filters.
Apply search analytics: Collect query data of users so that you can gain insights about your search
engine performance and the topics researched by your users.
Evaluate your team's talent: Assess the capability of your business to implement a solution. If your
team is unfamiliar with search engine architecture, you can hire a third-party integration specialist
to help you implement the solution.
With advancements in technology, enterprise search applications have become smarter. Engines differ in their
levels of search capability. Before investing in a new enterprise solution, you should assess your
current search level and identify your requirements based on your business's goals.
Enterprise Search Engine Software.
Businesses can use both open source and proprietary solutions. Each enterprise search vendor has unique
pros and cons. Here is the list of vendors divided into two groups: