Topic Mining Over Asynchronous Text Sequences
1. INTRODUCTION
More and more text sequences are being generated in various forms, such as news
streams, weblog articles, emails, instant messages, research paper archives, web forum
discussion threads, and so forth. To discover valuable knowledge from a text sequence, the
first step is usually to extract topics from the sequence with both semantic and temporal
information, which are described by two distributions, respectively: a word distribution
describing the semantics of the topic and a time distribution describing the topic’s intensity
over time. In many real-world applications, we are facing multiple text sequences that are
correlated with each other by sharing common topics. Intuitively, the interactions among
these sequences could provide clues to derive more meaningful and comprehensive topics
than those found by using information from each individual sequence alone. This intuition was
confirmed by recent work that utilized the temporal correlation over multiple text
sequences to explore the semantic correlation among common topics. The method proposed
therein relied on a fundamental assumption that different sequences are always synchronous
in time (or, in their terminology, coordinated), meaning that the common topics share the
same time distribution over different sequences.
However, this assumption is too strong to hold in all cases. Rather, asynchronism
among multiple sequences, i.e., documents from different sequences on the same topic have
different time stamps, is actually very common in practice. For instance, in news feeds, there
is no guarantee that news articles covering the same topic are indexed by the same time
stamps. There can be hours of delay for news agencies, days for newspapers, and even weeks
for periodicals, because some sources try to provide first-hand flashes shortly after the
incidents, while others provide more comprehensive reviews afterward. Another example is
research paper archives, where the latest research topics are closely followed by newsletters
and communications within weeks or months; full versions may then appear in conference
proceedings, which are usually published annually; and finally in journals, where a paper
may take more than a year to appear after submission.
We formally define our problem of mining common topics from multiple asynchronous text
sequences. We introduce a generative topic model which incorporates both temporal and
semantic information in the given text sequences. We derive our objective function, which
maximizes the likelihood subject to certain constraints.
We address the problem of mining common topics from multiple asynchronous text
sequences. To the best of our knowledge, this is the first attempt to solve this
problem.
We formalize our problem by introducing a principled probabilistic framework and
propose an objective function for our problem.
We develop a novel alternate optimization algorithm to maximize the objective
function with a theoretically guaranteed (local) optimum.
The effectiveness and advantage of our method are validated by an extensive
empirical study on two real-world data sets.
The key idea of our approach is to utilize the semantic and temporal correlation among
sequences and to build up a mutual reinforcement process. We start with extracting a set of
common topics from given sequences using their original time stamps. Based on the
extracted topics and their word distributions, we update the time stamps of documents in all
sequences by assigning each document to its most relevant topic. This step reduces the
asynchronism among the sequences. Then, after synchronization, we refine the common
topics according to the new time stamps.
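The mutual reinforcement process above can be sketched as an alternating loop. The following Python sketch is purely illustrative: the toy relevance score, the seeding rule, and the mean-time synchronization step are our simplifications, not the paper's actual probabilistic estimation.

```python
from collections import Counter

def topic_score(doc_words, topic_words):
    # Toy relevance of a document to a topic: inner product of word counts.
    return sum(doc_words[w] * topic_words.get(w, 0) for w in doc_words)

def extract_topics(docs, k):
    # Stand-in for probabilistic topic extraction: seed k topics with
    # evenly spaced documents (by time stamp), then fold every document
    # into its closest topic, accumulating words and time stamps.
    ordered = sorted(docs, key=lambda d: d["time"])
    seeds = ordered[::max(1, len(ordered) // k)][:k]
    topics = [{"words": Counter(s["words"]), "times": [s["time"]]} for s in seeds]
    for d in docs:
        t = max(topics, key=lambda tp: topic_score(d["words"], tp["words"]))
        t["words"].update(d["words"])
        t["times"].append(d["time"])
    return topics

def synchronize(docs, topics):
    # Reduce asynchronism: move each document's time stamp to the mean
    # time of its most relevant topic.
    for d in docs:
        t = max(topics, key=lambda tp: topic_score(d["words"], tp["words"]))
        d["time"] = sum(t["times"]) / len(t["times"])

def mine_common_topics(docs, k, iters=3):
    # Alternate topic extraction and time-stamp synchronization.
    topics = extract_topics(docs, k)
    for _ in range(iters):
        synchronize(docs, topics)
        topics = extract_topics(docs, k)
    return topics
```

After a few iterations, documents about the same topic from differently delayed sequences end up with aligned time stamps, which is the intended synchronization effect.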
1. Text Mining:
Text mining is the analysis of data contained in natural language text. The application
of text mining techniques to solve business problems is called text analytics.
2. LITERATURE SURVEY
Topic mining has been extensively studied in the literature, starting with the Topic
Detection and Tracking (TDT) project which aimed to find and track topics (events) in news
sequences with clustering-based techniques. Later on, probabilistic generative models were
introduced, such as Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet
Allocation (LDA), and their derivatives.
In many real applications, text collections carry generic temporal information and,
thus, can be considered as text sequences. To capture the temporal dynamics of topics,
various methods have been proposed to discover topics over time in text sequences.
However, these methods were designed to extract topics from a single sequence. In those
that adopted a generative model, the time stamp of each individual document was modeled
as a random variable, either discrete or continuous; it was then assumed that, given a
document in the sequence, its time stamp was generated conditionally independently of its
words. Other authors introduced hyper-parameters that evolve over time through state-transfer
models: for each time slice, a hyper-parameter is assigned a state by a probability
distribution conditioned on the state of the former time slice. The time dimension of the
sequence was cut into slices, and topics were discovered from the documents in each slice
independently. As a result, in multiple-sequence cases, topics in each sequence can only be
estimated separately, and the potential correlation between topics in different sequences,
both semantic and temporal, cannot be fully explored. The semantic correlation between
different topics in static text collections has been considered, and other work similarly
explored common topics in multiple static text collections. A very recent work first proposed
a topic mining method that aimed to discover common (bursty) topics over multiple text
sequences. That approach is different from ours: it tried to find topics that share a common
time distribution over different sequences by assuming that the sequences are synchronous, or
coordinated. Based on this premise, documents with the same time stamps are combined
across different sequences so that the word distributions of topics in individual sequences
can be discovered. By contrast, in our work we aim to find topics that are common in
semantics while having asynchronous time distributions in different sequences.
[1] D.M. Blei and J.D. Lafferty, “Dynamic Topic Models,” Proc. Int’l Conf. Machine
Learning (ICML), pp. 113-120, 2006.
Managing the explosion of electronic document archives requires new tools for
automatically organizing, searching, indexing, and browsing large collections. Recent
research in machine learning and statistics has developed new techniques for finding patterns
of words in document collections using hierarchical probabilistic models (Blei et al., 2003;
McCallum et al., 2004; Rosen-Zvi et al., 2004; Griffiths and Steyvers, 2004; Buntine and
Jakulin, 2004; Blei and Lafferty, 2006). These models are called “topic models” because the
discovered patterns often reflect the underlying topics which combined to form the
documents. Such hierarchical probabilistic models are easily generalized to other kinds of
data; for example, topic models have been used to analyze images (Fei-Fei and Perona, 2005;
Sivic et al., 2005), biological data (Pritchard et al., 2000), and survey data (Erosheva, 2002).
In an exchangeable topic model, the words of each document are assumed to be independently
drawn from a mixture of multinomials. The mixing proportions are randomly drawn for each
document; the mixture components, or topics, are shared by all documents. Thus, each
document reflects the components with different proportions. These models are a powerful
method of dimensionality reduction for large collections of unstructured documents.
Moreover, posterior inference at the document level is useful for information retrieval,
classification, and topic-directed browsing. Treating words exchangeably is a simplification
that is consistent with the goal of identifying the semantic themes within each document.
For many collections of interest, however, the implicit assumption of exchangeable
documents is inappropriate. Document collections such as scholarly journals, email, news
articles, and search query logs all reflect evolving content. For example, the Science article
“The Brain of Professor Laborde” may be on the same scientific path as the article
“Reshaping the Cortical Motor Map by Unmasking Latent Intracortical Connections,” but the
study of neuroscience looked much different in 1903 than it did in 1991. The themes in a
document collection evolve over time, and it is of interest to explicitly model the dynamics of
the underlying topics. In this document, we develop a dynamic topic model which captures
the evolution of topics in a sequentially organized corpus of documents. We demonstrate its
applicability by analyzing over 100 years of OCR’ed articles from the journal Science, which
was founded in 1880 by Thomas Edison and has been published through the present. Under
this model, articles are grouped by year, and each year’s articles arise from a set of topics that
have evolved from the last year’s topics.
In the subsequent sections, we extend classical state space models to specify a statistical
model of topic evolution. We then develop efficient approximate posterior inference
techniques for determining the evolving topics from a sequential collection of documents.
Finally, we present qualitative results that demonstrate how dynamic topic models allow the
exploration of a large document collection in new ways, and quantitative results that
demonstrate greater predictive accuracy when compared with static topic models.
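The generative side of such a dynamic topic model can be illustrated with a minimal sketch (Python; the drift variance, vocabulary size, and random-walk form are illustrative assumptions, not the paper's inference procedure): the natural parameters of a topic's word distribution drift from year to year, and each year's distribution is recovered with a softmax.

```python
import math
import random

def softmax(beta):
    # Map natural parameters to a probability distribution over words.
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]
    z = sum(exps)
    return [e / z for e in exps]

def evolve_topic(beta0, n_years, sigma=0.1, seed=0):
    # Dynamic-topic-style drift: beta_t = beta_{t-1} + Gaussian noise,
    # and the year-t word distribution is softmax(beta_t).
    rng = random.Random(seed)
    betas = [list(beta0)]
    for _ in range(1, n_years):
        betas.append([b + rng.gauss(0, sigma) for b in betas[-1]])
    return [softmax(b) for b in betas]
```

Each year's topic is a valid distribution that stays close to the previous year's, which is exactly the smooth evolution the model is designed to capture.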
[2] G.P.C. Fung, J.X. Yu, P.S. Yu, and H. Lu, “Parameter Free Bursty Events Detection in
Text Streams,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 181-192, 2005.
A fundamental problem in text data mining is to extract meaningful structure from document
streams that arrive continuously over time. E-mail and news articles are two natural examples
of such streams, each characterized by topics that appear, grow in intensity for a period of
time, and then fade away. The published literature in a particular research field can be seen to
exhibit similar phenomena over a much longer time scale. Underlying much of the text
mining work in this area is the following intuitive premise: that the appearance of a topic in a
document stream is signalled by a “burst of activity,” with certain features rising sharply in
frequency as the topic emerges.
The goal of the present work is to develop a formal approach for modeling such
“bursts,” in such a way that they can be robustly and efficiently identified, and can provide an
organizational framework for analyzing the underlying content. The approach is based on
modeling the stream using an infinite-state automaton, in which bursts appear naturally as
state transitions; it can be viewed as drawing an analogy with models from queueing theory
for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested
representation of the set of bursts that imposes a hierarchical structure on the overall stream.
Experiments with e-mail and research paper archives suggest that the resulting structures
have a natural meaning in terms of the content that gave rise to them. Documents can be
naturally organized by topic, but in many settings we also experience their arrival over time.
E-mail and news articles provide two clear examples of such document streams: in both
cases, the strong temporal ordering of the content is necessary for making sense of it, as
particular topics appear, grow in intensity, and then fade away again. Over a much longer
time scale, the published literature in a particular research field can be meaningfully
understood in this way as well, with particular research themes growing and diminishing in
visibility across a period of years.
The analysis of the underlying burst patterns, moreover, reveals a latent hierarchical
structure that often has a natural meaning in terms of the content of the stream.
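A minimal two-state version of this burst automaton can be sketched as follows (Python; the batched binomial-style costs, the parameters `s` and `gamma`, and the Viterbi formulation are a simplified rendering of Kleinberg's model, not the full infinite-state automaton):

```python
import math

def burst_states(rel, total, s=2.0, gamma=1.0):
    # Two-state sketch of Kleinberg's burst automaton on batched data:
    # state 0 emits relevant documents at the base rate p0, state 1 at
    # the elevated rate p1 = s * p0; entering the burst state costs
    # gamma, leaving it is free. A Viterbi pass finds the cheapest
    # state sequence.
    p0 = sum(rel) / sum(total)
    p1 = min(0.9999, s * p0)

    def cost(burst, r, d):
        # Negative log-likelihood of r relevant documents out of d.
        p = p1 if burst else p0
        return -(r * math.log(p) + (d - r) * math.log(1 - p))

    best = [0.0, gamma]  # cheapest cost so far of ending in state 0 / 1
    back = []            # predecessor state for each (time, state)
    for r, d in zip(rel, total):
        stay0, enter0 = best[0], best[1]          # dropping out of a burst is free
        stay1, enter1 = best[1], best[0] + gamma  # entering a burst costs gamma
        back.append((0 if stay0 <= enter0 else 1,
                     1 if stay1 <= enter1 else 0))
        best = [min(stay0, enter0) + cost(0, r, d),
                min(stay1, enter1) + cost(1, r, d)]
    # Backtrack the cheapest state sequence.
    state = 0 if best[0] <= best[1] else 1
    states = []
    for ptr in reversed(back):
        states.append(state)
        state = ptr[state]
    return states[::-1]
```

On a stream whose relevant-document counts spike in the middle, the automaton labels exactly the spiking interval as a burst.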
[3] A. Krause, J. Leskovec, and C. Guestrin, “Data Association for Topic Intensity
Tracking,” Proc. Int’l Conf. Machine Learning (ICML), pp. 497-504, 2006.
When following a news event, the content and the temporal information are both important
factors in understanding the evolution and the dynamics of the news topic over time. When
recognizing human activity, the observed person often performs a variety of tasks in parallel,
each with a different intensity, and this intensity changes over time. Both examples have in
common a notion of classification: e.g., classifying documents into topics, and actions into
activities. Another common point is the temporal aspect: the intensity of each topic or
activity changes over time. In a stream of incoming email, for example, we want to associate
each email with a topic, and then model bursts and changes in the frequency of emails of
each topic. A simple approach to this problem would be to first associate each email with a
topic using some supervised, semi-supervised, or unsupervised (clustering) method, thus
segmenting the joint stream into a stream for each topic. Then, using only data from each
individual topic, we could identify bursts and changes in topic activity over time. In this
traditional view (Kleinberg, 2003), the data association (topic segmentation) problem and
the burst detection (intensity estimation) problem are viewed as two distinct tasks. However,
this separation seems unnatural and introduces additional bias to the model. We present a
unified model of what was traditionally viewed as two separate tasks: data association and
intensity tracking of multiple topics over time. In the data association part, the task is to
assign a topic (a class) to each data point, and the intensity tracking part models the bursts
and changes in intensities of topics over time. Our approach to this problem combines an
extension of Factorial Hidden Markov models for topic intensity tracking with exponential
order statistics for implicit data association. Experiments on text and email datasets show
that the interplay of classification and topic intensity tracking improves the accuracy of both
classification and intensity tracking. Even a little noise in topic assignments can mislead the
traditional algorithms; however, our approach detects correct topic intensities even with
30% topic noise. We combine the tasks of data association and intensity tracking into a
single model, where we allow the temporal information to influence classification. The
intuition is that by using temporal information the classification would improve, and by
improved classification the topic intensity and topic content evolution tracking also benefit.
[4] Z. Li, B. Wang, M. Li, and W.-Y. Ma, “A Probabilistic Model for Retrospective News
Event Detection,” Proc. Ann. Int’l ACM SIGIR Conf. Research and Development in
Information Retrieval (SIGIR), pp. 106-113, 2005.
A news event is defined as a specific thing that happens at a specific time and place [1], and
may be consecutively reported by many news articles over a period. Retrospective news Event
Detection (RED) is defined as the discovery of previously unidentified events in a historical
news corpus [12]. RED has many applications, such as detecting earthquakes that happened
in the last ten years from historical news articles. Although RED has been studied for many
years, it is still an open problem [12]. A news article contains two kinds of information:
contents and timestamps. Both are very helpful for the RED task, but most previous research
focuses on finding better utilizations of the contents [12]. The usefulness of time information
is often ignored, or at least used in unsatisfactory ways. Based on these observations, we
explore RED from the following two aspects. On the one hand, we consider better
representations of news articles and events, which should effectively model both the
contents and time information. On the other hand, we notice that previous work gave little
consideration to modelling events probabilistically. As a result, in this paper we propose a
probabilistic model for RED, in which both contents and time information are utilized.
Furthermore, based on it, we build a RED system, HISCOVERY (HIStory disCOVERY),
which can provide a vivid multimedia representation of the results of event detection.
Our main contributions include:
1). Proposing a multi-modal RED algorithm, in which both the contents and time information
of news articles are modelled explicitly and effectively.
2). Proposing an approach to determine the approximate number of events from the articles'
count-over-time distribution.
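The second contribution, reading off the approximate number of events from the count-over-time distribution, might be sketched as simple peak counting (Python; the moving-average smoothing and the local-maximum rule are our assumptions, not the paper's actual estimator):

```python
def estimate_event_count(counts, window=1):
    # Smooth the articles-per-time-unit counts with a moving average,
    # then treat each local maximum of the smoothed curve as one
    # candidate event.
    smoothed = []
    for i in range(len(counts)):
        lo, hi = max(0, i - window), min(len(counts), i + window + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    peaks = [
        i for i in range(1, len(smoothed) - 1)
        if smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]
    ]
    return len(peaks)
```

Two separate surges of news coverage then produce an estimate of two events.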
[5] Q. Mei and C. Zhai, “Discovering Evolutionary Theme Patterns from Text: An Exploration
of Temporal Text Mining,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data
Mining (KDD), pp. 198-207, 2005.
Temporal Text Mining (TTM) is concerned with discovering temporal patterns in text
information collected over time. Since most text information bears some time stamps, TTM
has many applications in multiple domains, such as summarizing events in news articles and
revealing research trends in scientific literature. In this paper, we study a particular TTM task:
discovering and summarizing the evolutionary patterns of themes in a text stream. We define
this new text mining problem and present general probabilistic methods for solving it
through (1) discovering latent themes from text; (2) constructing an evolution graph of
themes; and (3) analyzing life cycles of themes. Evaluation of the proposed methods on two
domains (i.e., news articles and literature) shows that the proposed methods can discover
interesting evolutionary theme patterns effectively. In many application domains, we
encounter a stream of text in which each text document has some meaningful timestamp. For
example, a collection of news articles about a topic and research papers in a subject area can
both be viewed as natural text streams with publication dates as time stamps. In such stream
text data, there often exist interesting temporal patterns. For example, an event covered in
news articles generally has an underlying temporal and evolutionary structure consisting of
themes (i.e., subtopics) characterizing the beginning, progression, and impact of the event,
among others. Similarly, in research papers, research topics may also exhibit evolutionary
patterns; the study of one topic in some time period may have stimulated the study of
another topic after that period. In all these cases, it would be very useful if we could
discover, extract, and summarize these evolutionary theme patterns (ETP) automatically.
Indeed, such patterns not only are useful by themselves, but also would facilitate organization
and navigation of the information stream according to the underlying thematic structures.
Consider, for example, the Asian tsunami disaster that happened at the end of 2004. A query
to Google News (http://news.google.com) returned more than 80,000 online news articles
about this event within one month (Jan. 17 through Feb. 17, 2005). It is generally very difficult
to navigate through all these news articles. For someone who has not been keeping track of
the event but wants to know about the disaster, a summary of this event would be extremely
useful. Ideally, the summary would include both the major subtopics about the event and any
threads corresponding to the evolution of these themes. For example, the themes may include
the report of the happening of the event, the statistics of victims and damage, the aid from
around the world, and the lessons from the tsunami. A thread can indicate when each theme
starts, reaches its peak, and breaks off, as well as which subsequent themes it influences. A
timeline-based theme structure as shown in Figure 1 would be a very informative summary of
the event, which also facilitates navigation through the themes.
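Step (2), constructing an evolution graph of themes, can be illustrated by linking themes in adjacent time slices whose word distributions are close in KL divergence (a Python sketch; the threshold value and the flat shared-vocabulary representation are our assumptions):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL divergence between two word distributions on a shared vocabulary.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def theme_evolution_edges(slices, threshold=0.5):
    # slices[t] is the list of theme word distributions in time slice t.
    # Link theme j in slice t+1 back to theme i in slice t whenever
    # KL(theme_j || theme_i) is below the threshold; the resulting
    # (t, i, j) triples are the edges of the evolution graph.
    edges = []
    for t in range(len(slices) - 1):
        for i, old in enumerate(slices[t]):
            for j, new in enumerate(slices[t + 1]):
                if kl_divergence(new, old) < threshold:
                    edges.append((t, i, j))
    return edges
```

A theme in a later slice is thus connected only to the earlier theme it plausibly evolved from, not to unrelated ones.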
Example:
Consider research articles as the input, selected from two different organizations. The
two organizations yield two text sequences sharing the same topics. Assuming that these
text sequences are always synchronous is not realistic here.
Drawbacks
The existing approach assumes that all sequences are synchronous (coordinated) in time,
an assumption that rarely holds in practice.
Proposed System
The proposed system applies correlation analysis across multiple text sequences during
the extraction process, which yields high-quality extraction results. It is a new investigation
procedure for extracting data efficiently and works within a probabilistic framework. The
process starts with topic discovery: common topics are first extracted using the documents'
original time stamps. After topic extraction, time synchronization is performed based on the
mutual sharing of document content, adjusting each document's time stamp according to its
most relevant topic. This reduces the asynchronism among the text sequences and yields
highly relevant content as the final output. The same process is then repeated as a
refinement step.
Advantages:
A feasibility study compares the proposed system against the old running system. Any
system is feasible given unlimited resources and infinite time. There are three aspects in the
feasibility study portion of the preliminary investigation:
Technical Feasibility
Operational Feasibility
Economical Feasibility
A system that can be developed technically, and that will be used if installed, must still
be a good investment for the organization. In the economical feasibility study, the
development cost of creating the system is evaluated against the ultimate benefit derived
from the new system. Financial benefits must equal or exceed the costs.
The system is economically feasible. It does not require any additional hardware or
software. Since the interface for this system is developed using the existing resources and
technologies available at NIC, there is only nominal expenditure, and economical feasibility
is certain.
Proposed projects are beneficial only if they can be turned into information systems
that meet the organization’s operating requirements. The operational feasibility aspects of
the project are an important part of the project implementation. Some of the important
issues raised to test the operational feasibility of a project include the following:
So there is no question of resistance from the users that can undermine the possible
application benefits.
The well-planned design would ensure the optimal utilization of the computer resources and
would help in the improvement of performance status.
The current system developed is technically feasible. It is a web based user interface
for audit workflow at NIC-CSD. Thus it provides an easy access to the users. The database’s
purpose is to create, establish and maintain a workflow among various entities in order to
facilitate all concerned users in their various capacities or roles. Permission to the users
would be granted based on the roles specified. Therefore, it provides the technical guarantee
of accuracy, reliability, and security. The software and hardware requirements for the
development of this project are few, and are already available in-house at NIC or available
for free as open source.
The work for the project is done with the current equipment and existing software
technology. Necessary bandwidth exists for providing a fast feedback to the users
irrespective of the number of users using the system.
Tools Used:
Java(1.7)
Servlet
JSP
Java:
Java is a general-purpose, object-oriented programming language; compiled Java
bytecode runs on any platform with a Java Virtual Machine.
Servlet:
The servlet is a Java programming language class used to extend the capabilities of a
server.
Although servers can respond to any types of requests, they are commonly used to
extend the applications hosted by web servers.
Servlets can be thought of as the server-side counterpart of Java applets: they run on
servers instead of in web browsers.
JSP:
JSP is a technology that helps software developers create dynamically generated web
pages based on HTML, XML, or other document types.
Released in 1999 by Sun Microsystems, JSP is similar to PHP, but it uses the Java
programming language.
Both communities share many similar problems and are therefore potentially helpful to
each other.
In this current study we consider some existing frameworks for data mining,
including database perspective and inductive databases approach, the reductionist statistical
and probabilistic approaches, data compression approach, and constructive induction
approach. We consider their advantages and limitations, analyzing what these approaches
account for in data mining research and what they do not.
Ives et al. (1980) consider an information system (IS) in an organizational environment that
is further surrounded by an external environment. According to the framework an
information system itself includes three environments: user environment, IS development
environment, and IS operations environment. Drawing an analogy to this framework we
consider a data mining system as a special kind of adaptive information system that processes
data and helps to make use of it. Adaptation in this context is important because of the fact
that the data mining system is often aimed to produce solutions to various real-world
problems, and not to a single problem. On the one hand, a data mining system is equipped
with a number of techniques to be applied to a problem at hand. On the other hand, there
exist a number of different problems, and current research has shown that no single technique
can dominate all others over all possible data-mining problems (Wolpert and
MacReady, 1996). Nevertheless, many empirical studies report that a technique or a group of
techniques can perform significantly better than any other technique on a certain data-mining
problem or a group of problems (Kiang, 2003). Therefore, data mining research can be
viewed as a continuous, never-ending development process of a DM system towards the
efficient solution of an ever-wider range of problems.
In Boulicaut et al. (1999) an inductive databases framework for the data mining and
knowledge discovery in databases (KDD) modeling was introduced. The basic idea here is
that a data-mining task can be formulated as locating interesting sentences from a given logic
that are true in the database. Discovering knowledge from data can then be viewed as
querying the set of interesting sentences. Therefore the term “an inductive database” refers to
such a type of databases that contains not only the data but a theory about the data as well
(Boulicaut et al., 1999).
This approach has some logical connection to the idea of deductive databases, which
contain normal database content and additionally a set of rules for deriving new facts from
the facts already present in the database. This is a common inner data representation. For a
database user, all the facts derivable from the rules are presented, as they would have been
actually stored there. In a similar way, there is no need to have all the rules that are true about
the data stored in an inductive database. However, a user may imagine that all these rules are
there, although in reality, the rules are constructed on demand. The description of an
inductive database consists of a normal relational database structure with an additional
structure for performing generalizations. It is possible to design a query language that works
on inductive databases (Boulicaut et al., 1998).
In Mannila (2000) two simple approaches to the theory of data mining are analysed.
The first one is the reductionist approach of viewing data mining as statistics. Generally, it is
possible to consider the task of data mining from the statistical point of view, emphasizing
the fact that DM techniques are applied to larger datasets than in statistics. In this
situation the analysis of the appropriate statistics literature, where strong analytical
background is accumulated, would solve most of the data mining problems. Many data
mining tasks naturally may be formulated in statistical terms, and many statistical
contributions may be used in data mining in a quite straightforward manner. The second
approach discussed by Mannila (2000) is a probabilistic approach. Generally, many data
mining tasks can be seen as the task of finding the underlying joint distribution of the
variables in the data. Good examples of this approach would be a Bayesian network or a
hierarchical Bayesian model, which give a short and understandable representation of the
joint distribution. Data mining tasks dealing with clustering and/or classification fit easily
into this approach. However, it should be admitted that data mining researchers with a
computer science background typically have rather little education in statistics, and this is
one reason why achievements from statistics are not used to the extent that would be
possible.
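A miniature example of this probabilistic view is naive Bayes classification, which models the joint distribution P(class, features) under an independence assumption (an illustrative Python sketch, not tied to any particular system discussed here):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    # examples: list of (class_label, list_of_features). We estimate
    # P(c) and P(f | c) from counts, i.e. the joint P(c, f1..fk) under
    # the naive independence assumption P(c, f1..fk) = P(c) * prod P(fi | c).
    class_counts = Counter(c for c, _ in examples)
    feature_counts = defaultdict(Counter)
    for c, feats in examples:
        feature_counts[c].update(feats)
    return class_counts, feature_counts

def classify(model, feats):
    class_counts, feature_counts = model
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        total = sum(feature_counts[c].values())
        p = cc / n
        for f in feats:
            # Laplace smoothing keeps unseen features from zeroing p.
            p *= (feature_counts[c][f] + 1) / (total + len(feature_counts[c]) + 1)
        if p > best_p:
            best, best_p = c, p
    return best
```

Classification and clustering both reduce to reasoning about such a joint distribution, which is exactly the point the probabilistic approach makes.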
A deeper consideration of data mining and statistics shows that the volume of the data
being analysed and the different backgrounds of researchers are probably not the most
important factors that make the difference between the areas. Data mining is an applied area
of science, and limitations in available computational resources are a big issue when
applying results from statistics to data mining. The other important issue is that data mining
approaches emphasize database integration, simplicity of use, and understandability of
results. Last but not least, Mannila (2000) points out that the theoretical framework of
statistics does not concern itself much with data analysis as a process, which generally
includes data understanding, data preparation, data exploration, results evaluation, and
visualization steps. However, there are people (mainly with a strong statistical background)
who equate DM with applied statistics, because many tasks of DM may be perfectly
represented in terms of statistics.
A data compression approach to data mining can be stated in the following way:
compress the dataset by finding some structure or knowledge for it, where knowledge is
interpreted as a representation that allows coding the data using a smaller number of bits. For
example, the minimum description length (MDL) principle (Mehta et al., 1995) can be used
to select among different encodings, accounting for both the complexity of a model and its
predictive accuracy. Machine learning practitioners have used the MDL principle in different
interpretations to recommend that even when a hypothesis is not the most empirically
successful among those available, it may still be the one to choose if it is simple enough.
The idea is to trade consistency with the training examples against empirical adequacy
in terms of predictive success, as, for example, in accurate decision tree construction.
Bensusan (2000) connects this to another methodological issue, namely that theories should
not be ad hoc, that is, they should not overfit the examples used to build them. Simplicity is
the remedy for being ad hoc, both in the recommendations of the philosophy of science and
in the practice of machine learning. The data compression approach is also connected with
the rather old Occam’s razor principle, introduced in the 14th century. The most commonly
used formulation of this principle in data mining is: “when you have two competing models
which make exactly the same predictions, the simpler one is the better.”
Many (if not all) data mining techniques can be viewed in terms of the data compression
approach. For example, association rules and pruned decision trees can be viewed as ways of
providing compression of parts of the data. Clustering approaches can also be considered as a
way of compressing the dataset. There is a connection to Bayesian theory for modelling the
joint distribution – any compression scheme can be viewed as providing a distribution on the
set of possible instances of the data. However, in order to produce a structure that would be
comprehensible to the user, it is necessary to select such compression method(s) that is (are)
based on concepts that are easy to understand.
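As a toy illustration of the MDL trade-off described above, the following sketch compares two hypothetical decision trees by total description length. All numbers are invented for illustration, and `descriptionLength` is a hypothetical helper, not part of any DM toolkit:

```java
// Illustrative MDL comparison: prefer the model whose total description
// length (model cost plus cost of encoding the data given the model)
// is smallest, even if it is not the most accurate model available.
public class MdlExample {

    // Total description length in bits: model complexity plus the bits
    // needed to encode the exceptions (misclassified training examples).
    static double descriptionLength(double modelBits, int errors, double bitsPerError) {
        return modelBits + errors * bitsPerError;
    }

    public static void main(String[] args) {
        // A large tree: 500 bits to encode, only 2 training errors.
        double complexTree = descriptionLength(500.0, 2, 10.0); // 520 bits
        // A pruned tree: 80 bits to encode, 12 training errors.
        double prunedTree = descriptionLength(80.0, 12, 10.0);  // 200 bits

        // MDL prefers the pruned tree despite its lower empirical accuracy.
        System.out.println(complexTree > prunedTree ? "pruned" : "complex");
    }
}
```

The pruned tree wins here because its smaller model cost outweighs the extra bits needed to encode its additional training errors.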
Constructive induction is a learning process that consists of two intertwined phases: one
is responsible for constructing the "best" representation space, and the other is
concerned with generating hypotheses in the space that was found. Constructive induction methods are
classified into three categories: data-driven (information from the training examples is used),
hypothesis-driven (information from the analysis of the form of intermediate hypotheses is
used) and knowledge-driven (domain knowledge provided by experts is used) methods. Any
kind of inference strategy (including induction, abduction, analogy and other forms of non-
truth-preserving and non-monotonic inference) can potentially be used. However, the focus
is usually on operating on higher-level data concepts and theoretical terms rather than on pure data.
Michalski (1997) considers constructive operators (which expand the representation space by attribute
generation) and destructive operators (which contract the representation space by feature selection or feature
abstraction) that can be applied to produce a representation space better than
the original one. Bensusan (1999) showed that too many theoretical terms can
impair induction. This vindicates an old piece of advice from the philosophy of science: avoid adding
unnecessary metaphysical baggage to a theory. Theoretical terms are often contrasted with
observational terms. It is generally accepted that the more data we have, the better the model we
can construct. However, this is not true for the higher-level concepts that constitute a theory.
Many data mining techniques that apply wrapper/filter approaches to combine feature
selection, feature extraction or feature construction processes (as means of dimensionality
reduction and/or as a search for a better representation of the problem) with a classifier
or another type of learning algorithm can be considered constructive induction approaches.
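A data-driven constructive operator of the kind described above can be sketched as follows. This is a hypothetical helper (not from any named system) that expands the representation space by generating a new attribute, here the ratio of two existing numeric attributes, for every training example:

```java
// Sketch of a constructive (space-expanding) operator: append a derived
// attribute ratio = value[a] / value[b] to every example, so that a
// downstream learner can exploit the new higher-level feature.
public class FeatureConstruction {

    // Returns a copy of the data with one extra column holding the ratio
    // of attribute a to attribute b (0 when the denominator is zero).
    static double[][] addRatioFeature(double[][] data, int a, int b) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            double[] row = new double[data[i].length + 1];
            System.arraycopy(data[i], 0, row, 0, data[i].length);
            row[row.length - 1] = data[i][b] == 0 ? 0 : data[i][a] / data[i][b];
            out[i] = row;
        }
        return out;
    }
}
```

A destructive operator would do the opposite, dropping or merging columns to contract the representation space.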
Generations of DM systems
The history of data mining systems' development to date spans three main stages or generations.
The year 1989 can be associated with the first generation of data mining/KDD systems, when only a few
single-task data mining tools, such as the C4.5 decision tree algorithm (Quinlan, 1993), existed.
They were difficult to use and required significant data preparation. Most such systems were
based on a loosely coupled architecture, in which the database and the data mining subsystem
were realised as separate, independent parts. This architecture demands continuous context
switching between the data mining engine and the database (Imielinski and Mannila, 1996).
The year 1995 can then be associated with the formation of the second generation of tool suites.
Data mining, as the core part of KDD, started to be seen as "the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad,
1996, 22). Some examples of knowledge discovery systems that follow Fayyad's view of
DM as a process are SPSS Clementine, SGI MineSet, and IBM Intelligent Miner.
Numerous KDD systems have been developed since then. At the beginning of the millennium
there existed about 200 tools that could each perform several tasks (clustering, classification,
visualization) for specialized applications (therefore often called "vertical solutions")
(Piatetsky-Shapiro, 2000). This growing trend towards integrating data mining tools with
specialized applications has been associated with the third generation of DM systems.
Therefore, a good data mining system should be adaptive in solving the
current problem as it is affected by the dynamically changing environment, and should be continuously
developed towards the efficient utilization of available DM techniques.
Hardware Requirements
Software Requirements
3.1 Users:
Admin: The admin uploads the data; the uploaded data is inserted into the database so that
it can be searched easily.
User: The user logs into the application by giving a username and password, which are validated
before the user home page is shown.
Graph: The result is represented in the form of a graph, which shows how many times the searched
data occurs in a paper.
Synchronous: The data is shown as synchronous data; it gives the relationship
between one document and another.
Asynchronous: The data is shown as asynchronous data; it does not give the
relationship between one document and another.
Usability
The system is designed as a completely automated process, so there is little or no user
intervention.
Reliability
The system is more reliable because of the qualities inherited from the chosen
platform, Java. Code built using Java is more reliable.
Performance
The system is developed in a high-level language using advanced front-end and
back-end technologies, so it responds to the end user on the client system within a very short
time.
Supportability
The system is designed to be cross-platform. It is supported on a
wide range of hardware and on any software platform that has a JVM built into the
system.
4. SYSTEM DESIGN
4.1 Introduction
The purpose of the design phase is to plan a solution to the problem specified by
the requirements document. This phase is the first step in moving from the problem domain to
the solution domain. In other words, starting with what is needed, design takes us toward
how to satisfy the needs. The design of a system is perhaps the most critical factor affecting
the quality of the software; it has a major impact on the later phases, particularly testing and
maintenance. The output of this phase is the design document. This document is similar to a
blueprint for the solution and is used later during implementation, testing and maintenance.
The design activity is often divided into two separate phases: System Design and Detailed
Design.
System Design, also called top-level design, aims to identify the modules that should
be in the system, the specifications of these modules, and how they interact with each other to
produce the desired results. At the end of system design, all the major data structures, file
formats, output formats, and the major modules in the system and their specifications are
decided. During Detailed Design, the internal logic of each of the modules specified in
system design is decided. In this phase, the details of the data of a module are usually
specified in a high-level design description language, which is independent of the target
language in which the software will eventually be implemented.
In system design the focus is on identifying the modules, whereas during detailed
design the focus is on designing the logic for each of the modules. In other words, in system
design the attention is on what components are needed, while in detailed design the issue is how the
components can be implemented in software. Design is concerned with
identifying software components, specifying the relationships among them, specifying the
software structure, and providing a blueprint for the implementation phase. Modularity is one of the
desirable properties of large systems. It implies that the system is divided into several parts
in such a manner that the interaction between the parts is minimal and clearly specified.
Once the software requirements and the feasibility study report were frozen, the next
vital step was the design of the software. This is the most crucial part of software
development, and the success of any system depends upon how efficiently it has been
designed. The main design of the system goes through these phases
of system development:
Conceptual design
Detailed design
Conceptual Design
Input Design:
Once the analysis of the system has been done, it is necessary to identify the
data that must be processed to produce the outputs. Input design features can ensure
the reliability of the system and generate correct reports from accurate data. The input design
also determines whether the user can interact efficiently with the system.
Since JAVA has been chosen for this system, the user should easily understand the
screens designed. Validations are carried out easily, and the user will have no difficulty in
adding a new entry on the applet page.
Output Design:
Computer output is the most important and direct source of information for the user.
Efficient, intelligible output design should improve the system's relationship with the user and
help in decision making. A major form of output is viewing the frames at the destination
place.
Code Design:
When large volumes of data are being handled, it is important that the items to be
stored can be selected easily and quickly.
To accomplish this, each data item must have a unique specification and must be related to
other forms or items of data of the same type.
Process Design:
The purpose of the process design is to establish the communication between the data
and the code. The process takes data from the database and checks it against the
user-entered data or any other data. Once the process has finished, the output or the
report is generated based on the process results.
Detailed Design:
System analysis is the process of gathering and interpreting facts, diagnosing
problems and using the information to recommend improvements to the system. Structured
analysis is done with the aid of the detailed design, which includes:
1. Time Synchronization
Select N documents as the sequence documents. Initially, all the documents are
related to results based on a single time stamp. Across the multiple documents, the same time
stamps and the same sequences of data are identified; these sequences of data are extracted from the
different topics. Extraction of the common-topic data is possible based on the word distribution process.
Results are extracted from the different asynchronous sequences with the help of their interaction,
which works like a mining process and yields the common-topic data as meaningful data in the
output specification process.
Asynchronism:
The documents display results with different time stamps, so the documents' data must be
arranged properly. When the adjusted time environment is applied to these documents,
efficient results can be extracted in the output process.
Topic Extraction:
The extraction process is applied at all current time stamps. It yields an efficient
solution and displays the synchronous text sequences as results in the output environment. The
same process is applied as a refinement step until good results are reached, displayed
as maximized feature extraction. It gives a guaranteed solution as a
meaningful result.
Time Synchronization:
Here a matching operation is performed between two or more documents. The
sequences are updated according to time, and new sequence-based results are identified.
This produces newly discovered content-based results and yields the smallest, rewritten
results in the implementation.
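The alternation between topic extraction and time synchronization described in these modules can be illustrated with a much-simplified sketch (illustrative only, not the project's exact algorithm): given a topic-intensity series for one sequence and a common reference intensity, the synchronization step searches for the integer time shift that best aligns the two.

```java
// Simplified time-synchronization step: slide one sequence's topic
// intensity over a common reference intensity and keep the shift with
// the highest alignment score. A full method would alternate this step
// with re-estimating the topics themselves.
public class TimeSync {

    // Dot-product score of seq shifted forward by `shift` against ref.
    static double alignmentScore(double[] seq, double[] ref, int shift) {
        double s = 0;
        for (int t = 0; t < ref.length; t++) {
            int u = t - shift;                  // index into the shifted sequence
            if (u >= 0 && u < seq.length) s += seq[u] * ref[t];
        }
        return s;
    }

    // Exhaustively search shifts in [-maxShift, maxShift] for the best alignment.
    static int bestShift(double[] seq, double[] ref, int maxShift) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int shift = -maxShift; shift <= maxShift; shift++) {
            double s = alignmentScore(seq, ref, shift);
            if (s > bestScore) { bestScore = s; best = shift; }
        }
        return best;
    }
}
```

For example, a sequence whose topic burst occurs two time steps before the reference burst is assigned a shift of +2, which moves its documents onto the common time axis.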
A Data Flow Diagram is a graphical tool used to describe and analyze the movement of data through a system, manual or
automated, including the processes, the stores of data, and the delays in the system. Data Flow
Diagrams are the central tool and the basis from which other components are developed. The
transformation of data from input to output, through processes, may be described logically
and independently of the physical components associated with the system. The DFD is also
known as a data flow graph or a bubble chart.
DFDs model the proposed system. They should clearly show the requirements on
which the new system is to be built. Later, during the design activity, they are taken as the basis
for drawing the system's structure charts. The basic notation used to create DFDs is as
follows:
2. Process: People, procedures, or devices that use or produce (Transform) Data. The
physical component is not identified.
4. Data Store: Here data are stored or referenced by a process in the System.
Use case diagrams represent the functionality of the system from a user's point of view. A use
case describes a function provided by the system that yields a visible result for an actor; an
actor describes any entity that interacts with the system.
The identification of actors and use cases results in the definition of the boundary of the
system, that is, in differentiating the tasks accomplished by the system from the tasks
accomplished by its environment. The actors are outside the boundary of the system, whereas
the use cases are inside it.
A use case contains all the events that can occur between an actor and the system, together with
a set of scenarios that explain the interactions as a sequence of happenings.
Actors
Actors represent external entities that interact with the system. An actor can be a human or an
external system.
Actors are not part of the system; they represent anyone or anything that interacts with the
system.
During this activity, developers identify the actors involved; in this system they are:
User:
The user is an actor who uses the system and performs the operations, such as data
classification and performance execution, that are required of him.
Use Cases:
Use cases are used during requirements elicitation and analysis to represent the functionality
of the system. Use cases focus on the behaviour of the system from an external point of view.
[Use case diagram: actors Admin and User; use cases: Extract top-10 topical words,
Asynchronous words, Synchronous words]
Sequence diagrams are used to formalize the dynamic behavior of the system and to visualize
the communication among objects. They are useful for identifying additional objects
that participate in a use case. A sequence diagram represents the objects participating in the
interaction horizontally and time vertically.
Sequence diagrams typically show a user or actor along with the objects and components they
interact with during the execution of the use case. Each column represents an object that participates
in the interaction. Messages are shown as solid arrows, with the labels on the arrows giving the
message names. Activations are depicted as vertical rectangles. The actor who initiates the
interaction is shown in the leftmost column, and the messages coming from the actor represent
the interactions described in the use case diagram.
Messages Description:
1. The user provides the input file to the system, which is a text document file.
2. Clustering is performed among all the document files.
3. The document information and cluster information are checked; this gives the number of
clusters after performing the clustering among all the document files.
4. The correlation measure is calculated by using the cosine similarity method and other
parameters.
5. The accuracy of the CPI and K-means document clustering is checked.
6. Finally, the results of the K-means and CPI methods are compared to determine which of the
two methods is the better one.
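The cosine similarity used for the correlation measure in step 4 can be sketched as follows. This is a minimal illustration assuming both documents are represented as term-frequency vectors over the same vocabulary (a hypothetical helper, not the project's exact code):

```java
// Cosine similarity between two term-frequency vectors: the dot product
// divided by the product of the vector norms. Returns a value in [0, 1]
// for non-negative term frequencies (1 = identical direction).
public class CosineSimilarity {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        // Guard against all-zero vectors to avoid division by zero.
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Two documents with identical word distributions score 1, while documents sharing no words score 0, which makes the measure convenient for comparing clusters.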
[Sequence diagram: participants User, Registration, Login, Upload Data, Enter the Keyword,
Asynchronous Documents, Synchronous Documents, Top 10 Topical Words, Asynchronous Words,
Synchronous Words, View Graph, Admin; messages: Registration, Login, Authentication,
Login Fail, Enter Keyword, Display Graph, View Graph]
Sequence Diagram
A UML state chart is a notation for describing the sequence of states an object goes through in
response to external events. Objects have behaviour and state; the state of an object depends
on its current activity or condition. A state chart diagram shows the possible states of the
object and the transitions that cause a change in state.
[Collaboration diagram: User interacts with Registration, Login, Enter the Keyword,
Asynchronous Documents, Top 10 Topical Words, and Synchronous Words; messages:
1: Registration, 2: Login, 4: Authentication, 5: Login Fail, 6: View Asynchronous Documents,
7: Enter Keyword, 10: Find Top 10 Topical Words, 11: View Synchronous Words]
An activity diagram describes the behavior of the system in terms of activities. Activities are
modeling elements that represent the execution of a set of operations. The completion of these
operations triggers a transition to another activity. Activity diagrams are similar to flowchart
diagrams in that they can be used to represent control flow and data flow.
[Activity diagram: User, Login, Login Fail / Login Success, User Home, Logout]
Activity Diagram
[Use case diagram: System with Upload Content, Search Ke..., Asynchron..., synchron...,
View Res...]
ER Diagram:
There are usually many instances of an entity-type. Because the term entity-type is somewhat
cumbersome, most people tend to use the term entity as a synonym for this term. Entities can
be thought of as nouns. Examples: a computer, an employee, a song, a mathematical theorem.
A relationship captures how two or more entities are related to one another. Relationships can
be thought of as verbs, linking two or more nouns.
The model's linguistic aspect described above is utilized in the declarative database query
language ERROL, which mimics natural language constructs. Entities and relationships can
both have attributes. Examples: an employee entity might have a Social Security Number
(SSN) attribute; the proved relationship may have a date attribute.
Every entity (unless it is a weak entity) must have a minimal set of uniquely identifying
attributes, which is called the entity's primary key. Entity-relationship diagrams don't show
single entities or single instances of relations. Rather, they show entity sets and relationship
sets. Example: a particular song is an entity.
The collection of all songs in a database is an entity set. The eaten relationship between a
child and her lunch is a single relationship. The set of all such child-lunch relationships in a
database is a relationship set. In other words, a relationship set corresponds to a relation in
mathematics, while a relationship corresponds to a member of the relation. Certain
cardinality constraints on relationship sets may be indicated as well.
ER Diagram:
Data Dictionary:
1. User details (includes Address)
2. Login master
3. Uploaddata
5. SYSTEM IMPLEMENTATION
Introduction
The system can be implemented only after thorough testing is done and it is found to
work according to the specification. Implementation involves careful planning, investigation of the current
system and its constraints on implementation, design of methods to achieve the changeover,
and an evaluation of changeover methods, apart from planning. Two major tasks in
preparing the implementation are education and training of the users and testing of the
system.
The more complex the system being implemented, the more involved the
systems analysis and design effort required just for implementation will be. The implementation
phase comprises several activities. The required hardware and software acquisition is
carried out, and the system may require some software to be developed. For this, programs are written and
tested. The user then changes over to the new, fully tested system and the old system is
discontinued.
Implementation is the process of having systems personnel check out and put new
equipment into use, train users, install the new application, and construct any data files
needed for it.
Depending on the size of the organization that will be involved in using the
application and the risk associated with its use, system developers may choose to test the
operation in only one area of the firm, say in one department or with only one or two persons.
Sometimes they will run the old and new systems together to compare the results. In still
other situations, developers will stop using the old system one day and begin using the new
one the next. Each implementation strategy has its merits, depending on the situation.
Logics
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Vector;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class FormAction extends HttpServlet {

    private Connection con;

    public FormAction() {
        con = ConnectionFactory.getConnection();
    }

    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Remember the searched topic name for the result page
        String key = request.getParameter("title");
        HttpSession session = request.getSession();
        session.setAttribute("key", key);

        Vector<WebFormBean> vector = new Vector<WebFormBean>();
        PreparedStatement ps = null;
        ResultSet rs = null;
        try {
            // Fetch every uploaded document, ordered by its time stamp
            ps = con.prepareStatement(
                "select catagory, topicname, datas, "
              + "TO_CHAR(timestamp,'dd-MON-yyyy') from uploaddata order by timestamp");
            rs = ps.executeQuery();
            while (rs.next()) {
                WebFormBean example = new WebFormBean();
                example.setCatagory(rs.getString(1));
                example.setTopicname(rs.getString(2));
                example.setDatas(rs.getString(3));
                example.setTimestamp(rs.getString(4));
                vector.add(example);
            }
            session.setAttribute("vector", vector);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (rs != null) rs.close();
                if (ps != null) ps.close();
            } catch (Exception ignore) {
            }
        }
        RequestDispatcher rd = request.getRequestDispatcher("./SearchResult.jsp");
        rd.forward(request, response);
    }
}
Fig 5.3.8 Enter the topic name to be searched
6. SOFTWARE TESTING
Unit testing focuses on the building blocks of the software system, that is, objects and
subsystems. There are three motivations for focusing on components. First, unit testing
reduces the complexity of the overall testing activities, allowing us to focus on smaller units of
the system. Second, unit testing makes it easier to pinpoint and correct faults, given that few
components are involved in the test. Third, unit testing allows parallelism in the testing
activities, that is, each component can be tested independently of the others. Hence the
goal is to test the internal logic of the module.
In integration testing, many tested modules are combined into subsystems, which are
then tested. The goal here is to see whether the modules can be integrated properly, the emphasis
being on testing module interaction.
After structural testing and functional testing we get error-free modules. These modules are
then integrated to obtain the required results of the system. After checking a module, another
module is tested and integrated with the previous one. After the integration, the test
cases are generated and the results are tested.
Acceptance Testing
Acceptance testing is sometimes performed with realistic data from the client to demonstrate
that the software is working satisfactorily. Testing here focuses on the external behavior of the
system; the internal logic of the program is not emphasized. In acceptance testing the system
is tested with various inputs.
Types of Testing
1. Black box or functional testing
2. White box testing or structural testing
Interface errors
Errors in data structure
Performance errors
Initialization and termination errors
As shown in the following figure of black box testing, we do not think about the internal
workings; we only ask:
What is the input to our system?
What is the output for a given input to our system?
The black box is an imaginary box that hides its internal workings.
[Figure: black box testing, Input → (internal working hidden) → Output]
A test case is a set of input data and expected results that exercises a component with the
purpose of causing failures and detecting faults. A test case is an explicit set of instructions
designed to detect a particular class of defect in a software system by bringing about a
failure. A test case can give rise to many tests.
Test case format
Test Case 1: Validate Login Name
Test steps: The user enters special characters (say !@hi&*P) as the Login Name and clicks
the Submit button.
Expected result: Since the login page must not allow special characters, the error message
"Invalid Login or Password" must be displayed.

Test Case 2: Registration, Validate User Name
Test steps: The user enters the User Name on the Registration page and clicks the Submit
button.
Expected result: To verify that the User Name must be declared, an error message
"User Name must be declared" is displayed.
7. CONCLUSION
In the current project we tackle the problem of mining common topics from multiple
asynchronous text sequences. The study proposes a novel method which can automatically
discover and fix potential asynchronism among sequences and consequently extract better
common topics. The key idea of our method is to introduce a self-refinement process by
utilizing correlation between the semantic and temporal information in the sequences. It
performs topic extraction and time synchronization alternately to optimize a unified objective
function. A local optimum is guaranteed by our algorithm.
8. FUTURE ENHANCEMENTS
We justified the effectiveness of our method on two real-world data sets, with comparison to
a baseline method. Empirical results suggest that
1) Our method is able to find meaningful and discriminative topics from asynchronous text
sequences;
2) Our method significantly outperforms the baseline method, evaluated both qualitatively and
quantitatively;
3) The performance of our method is robust and stable under different parameter settings
and random initialization.
9. BIBLIOGRAPHY
[1] D.M. Blei and J.D. Lafferty, “Dynamic Topic Models,” Proc. Int’l Conf. Machine
Learning (ICML), pp. 113-120, 2006.
[2] G.P.C. Fung, J.X. Yu, P.S. Yu, and H. Lu, “Parameter Free Bursty Events Detection in
Text Streams,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 181-192, 2005.
[3] J.M. Kleinberg, “Bursty and Hierarchical Structure in Streams,” Proc. ACM SIGKDD
Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 91-101, 2002.
[4] A. Krause, J. Leskovec, and C. Guestrin, “Data Association for Topic Intensity
Tracking,” Proc. Int’l Conf. Machine Learning (ICML), pp. 497-504, 2006.
[5] Z. Li, B. Wang, M. Li, and W.-Y. Ma, “A Probabilistic Model for Retrospective News
Event Detection,” Proc. Ann. Int’l ACM SIGIR Conf. Research and Development in
Information Retrieval (SIGIR), pp. 106-113, 2005.
WEBSITES
1. www.java.sun.com
2. www.google.com