Topic Mining Over Asynchronous Text Sequences
1. INTRODUCTION
More and more text sequences are being generated in various forms, such as news
streams, weblog articles, emails, instant messages, research paper archives, web forum
discussion threads, and so forth. To discover valuable knowledge from a text sequence, the
first step is usually to extract topics from the sequence with both semantic and temporal
information, which are described by two distributions, respectively: a word distribution
describing the semantics of the topic and a time distribution describing the topic’s intensity
over time. In many real-world applications, we are facing multiple text sequences that are
correlated with each other by sharing common topics. Intuitively, the interactions among
these sequences could provide clues to derive more meaningful and comprehensive topics
than those found by using information from each individual sequence alone. This intuition was
confirmed by recent work that utilized the temporal correlation over multiple text
sequences to explore the semantic correlation among common topics. The method proposed
therein relied on a fundamental assumption that different sequences are always synchronous
in time (or, in their terminology, coordinated), meaning that the common topics share the
same time distribution over different sequences.
However, this assumption is too strong to hold in all cases. Rather, asynchronism
among multiple sequences, i.e., documents from different sequences on the same topic have
different time stamps, is actually very common in practice. For instance, in news feeds, there
is no guarantee that news articles covering the same topic are indexed by the same time
stamps. There can be hours of delay for news agencies, days for newspapers, and even weeks
for periodicals, because some sources try to provide first-hand flashes shortly after the
incidents, while others provide more comprehensive reviews afterward. Another example is
research paper archives, where the latest research topics are closely followed by newsletters
and communications within weeks or months; full versions may then appear in conference
proceedings, which are usually published annually; and finally in journals, where a paper
may take more than a year to appear after submission.
We formally define our problem of mining common topics from multiple asynchronous text
sequences. We introduce a generative topic model which incorporates both temporal and
semantic information in the given text sequences. We derive our objective function, which
maximizes the likelihood subject to certain constraints.
We address the problem of mining common topics from multiple asynchronous text
sequences. To the best of our knowledge, this is the first attempt to solve this
problem.
We formalize our problem by introducing a principled probabilistic framework and
propose an objective function for our problem.
We develop a novel alternate optimization algorithm to maximize the objective
function with a theoretically guaranteed (local) optimum.
The effectiveness and advantage of our method are validated by an extensive
empirical study on two real-world data sets.
The key idea of our approach is to utilize the semantic and temporal correlation among
sequences and to build up a mutual reinforcement process. We start with extracting a set of
common topics from given sequences using their original time stamps. Based on the
extracted topics and their word distributions, we update the time stamps of documents in all
sequences by assigning each document to its most relevant topic. This step reduces the
asynchronism among the sequences. Then, after synchronization, we refine the common
topics according to the new time stamps.
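The mutual reinforcement process above can be sketched as an alternating loop. The following Python sketch is purely illustrative: the toy relevance score, the seeding rule, and the mean-time synchronization step are our simplifications, not the paper's actual probabilistic estimation.

```python
from collections import Counter

def topic_score(doc_words, topic_words):
    # Toy relevance of a document to a topic: inner product of word counts.
    return sum(doc_words[w] * topic_words.get(w, 0) for w in doc_words)

def extract_topics(docs, k):
    # Stand-in for probabilistic topic extraction: seed k topics with
    # evenly spaced documents (by time stamp), then fold every document
    # into its closest topic, accumulating words and time stamps.
    ordered = sorted(docs, key=lambda d: d["time"])
    seeds = ordered[::max(1, len(ordered) // k)][:k]
    topics = [{"words": Counter(s["words"]), "times": [s["time"]]} for s in seeds]
    for d in docs:
        t = max(topics, key=lambda tp: topic_score(d["words"], tp["words"]))
        t["words"].update(d["words"])
        t["times"].append(d["time"])
    return topics

def synchronize(docs, topics):
    # Reduce asynchronism: move each document's time stamp to the mean
    # time of its most relevant topic.
    for d in docs:
        t = max(topics, key=lambda tp: topic_score(d["words"], tp["words"]))
        d["time"] = sum(t["times"]) / len(t["times"])

def mine_common_topics(docs, k, iters=3):
    # Alternate topic extraction and time-stamp synchronization.
    topics = extract_topics(docs, k)
    for _ in range(iters):
        synchronize(docs, topics)
        topics = extract_topics(docs, k)
    return topics
```

After a few iterations, documents about the same topic from differently delayed sequences end up with aligned time stamps, which is the intended synchronization effect.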
1. Text Mining:
Text mining is the analysis of data contained in natural language text. The application
of text mining techniques to solve business problems is called text analytics.
2. LITERATURE SURVEY
Topic mining has been extensively studied in the literature, starting with the Topic
Detection and Tracking (TDT) project which aimed to find and track topics (events) in news
sequences with clustering-based techniques. Later on, probabilistic generative models were
introduced, such as Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet
Allocation (LDA), and their derivatives.
In many real applications, text collections carry generic temporal information and,
thus, can be considered as text sequences. To capture the temporal dynamics of topics,
various methods have been proposed to discover topics over time in text sequences.
However, these methods were designed to extract topics from a single sequence. In those
that adopted a generative model, the time stamp of each individual document was modeled
as a random variable, either discrete or continuous; it was then assumed that, given a
document in the sequence, its time stamp was generated conditionally independently of its
words. Other authors introduced hyper-parameters that evolve over time through state-transfer
models: for each time slice, a hyper-parameter is assigned a state by a probability
distribution conditioned on the state of the former time slice. The time dimension of the
sequence was cut into slices, and topics were discovered from the documents in each slice
independently. As a result, in multiple-sequence cases, topics in each sequence can only be
estimated separately, and the potential correlation between topics in different sequences,
both semantic and temporal, cannot be fully explored. The semantic correlation between
different topics in static text collections has been considered, and other work similarly
explored common topics in multiple static text collections. A very recent work first proposed
a topic mining method that aimed to discover common (bursty) topics over multiple text
sequences. That approach is different from ours: it tried to find topics that share a common
time distribution over different sequences by assuming that the sequences are synchronous, or
coordinated. Based on this premise, documents with the same time stamps are combined
across different sequences so that the word distributions of topics in individual sequences
can be discovered. By contrast, in our work we aim to find topics that are common in
semantics while having asynchronous time distributions in different sequences.
[1] D.M. Blei and J.D. Lafferty, “Dynamic Topic Models,” Proc. Int’l Conf. Machine
Learning (ICML), pp. 113-120, 2006.
Managing the explosion of electronic document archives requires new tools for
automatically organizing, searching, indexing, and browsing large collections. Recent
research in machine learning and statistics has developed new techniques for finding patterns
of words in document collections using hierarchical probabilistic models (Blei et al., 2003;
McCallum et al., 2004; Rosen-Zvi et al., 2004; Griffiths and Steyvers, 2004; Buntine and
Jakulin, 2004; Blei and Lafferty, 2006). These models are called “topic models” because the
discovered patterns often reflect the underlying topics which combined to form the
documents. Such hierarchical probabilistic models are easily generalized to other kinds of
data; for example, topic models have been used to analyze images (Fei-Fei and Perona, 2005;
Sivic et al., 2005), biological data (Pritchard et al., 2000), and survey data (Erosheva, 2002).
In an exchangeable topic model, the words of each document are assumed to be independently
drawn from a mixture of multinomials. The mixing proportions are randomly drawn for each
document; the mixture components, or topics, are shared by all documents. Thus, each
document reflects the components with different proportions. These models are a powerful
method of dimensionality reduction for large collections of unstructured documents.
Moreover, posterior inference at the document level is useful for information retrieval,
classification, and topic-directed browsing. Treating words exchangeably is a simplification
that is consistent with the goal of identifying the semantic themes within each document.
For many collections of interest, however, the implicit assumption of exchangeable
documents is inappropriate. Document collections such as scholarly journals, email, news
articles, and search query logs all reflect evolving content. For example, the Science article
“The Brain of Professor Laborde” may be on the same scientific path as the article
“Reshaping the Cortical Motor Map by Unmasking Latent Intracortical Connections,” but the
study of neuroscience looked much different in 1903 than it did in 1991. The themes in a
document collection evolve over time, and it is of interest to explicitly model the dynamics of
the underlying topics. In this document, we develop a dynamic topic model which captures
the evolution of topics in a sequentially organized corpus of documents. We demonstrate its
applicability by analyzing over 100 years of OCR’ed articles from the journal Science, which
was founded in 1880 by Thomas Edison and has been published through the present. Under
this model, articles are grouped by year, and each year’s articles arise from a set of topics that
have evolved from the last year’s topics.
In the subsequent sections, we extend classical state space models to specify a statistical
model of topic evolution. We then develop efficient approximate posterior inference
techniques for determining the evolving topics from a sequential collection of documents.
Finally, we present qualitative results that demonstrate how dynamic topic models allow the
exploration of a large document collection in new ways, and quantitative results that
demonstrate greater predictive accuracy when compared with static topic models.
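The generative side of such a dynamic topic model can be illustrated with a minimal sketch (Python; the drift variance, vocabulary size, and random-walk form are illustrative assumptions, not the paper's inference procedure): the natural parameters of a topic's word distribution drift from year to year, and each year's distribution is recovered with a softmax.

```python
import math
import random

def softmax(beta):
    # Map natural parameters to a probability distribution over words.
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]
    z = sum(exps)
    return [e / z for e in exps]

def evolve_topic(beta0, n_years, sigma=0.1, seed=0):
    # Dynamic-topic-style drift: beta_t = beta_{t-1} + Gaussian noise,
    # and the year-t word distribution is softmax(beta_t).
    rng = random.Random(seed)
    betas = [list(beta0)]
    for _ in range(1, n_years):
        betas.append([b + rng.gauss(0, sigma) for b in betas[-1]])
    return [softmax(b) for b in betas]
```

Each year's topic is a valid distribution that stays close to the previous year's, which is exactly the smooth evolution the model is designed to capture.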
[2] G.P.C. Fung, J.X. Yu, P.S. Yu, and H. Lu, “Parameter Free Bursty Events Detection in
Text Streams,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 181-192, 2005.
A fundamental problem in text data mining is to extract meaningful structure from document
streams that arrive continuously over time. E-mail and news articles are two natural examples
of such streams, each characterized by topics that appear, grow in intensity for a period of
time, and then fade away. The published literature in a particular research field can be seen to
exhibit similar phenomena over a much longer time scale. Underlying much of the text
mining work in this area is the following intuitive premise: that the appearance of a topic in a
document stream is signalled by a “burst of activity,” with certain features rising sharply in
frequency as the topic emerges.
The goal of the present work is to develop a formal approach for modeling such
“bursts,” in such a way that they can be robustly and efficiently identified, and can provide an
organizational framework for analyzing the underlying content. The approach is based on
modeling the stream using an infinite-state automaton, in which bursts appear naturally as
state transitions; it can be viewed as drawing an analogy with models from queueing theory
for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested
representation of the set of bursts that imposes a hierarchical structure on the overall stream.
Experiments with e-mail and research paper archives suggest that the resulting structures
have a natural meaning in terms of the content that gave rise to them. Documents can be
naturally organized by topic, but in many settings we also experience their arrival over time.
E-mail and news articles provide two clear examples of such document streams: in both
cases, the strong temporal ordering of the content is necessary for making sense of it, as
particular topics appear, grow in intensity, and then fade away again. Over a much longer
time scale, the published literature in a particular research field can be meaningfully
understood in this way as well, with particular research themes growing and diminishing in
visibility across a period of years.
The analysis of the underlying burst patterns, moreover, reveals a latent hierarchical
structure that often has a natural meaning in terms of the content of the stream.
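A minimal two-state version of this burst automaton can be sketched as follows (Python; the batched binomial-style costs, the parameters `s` and `gamma`, and the Viterbi formulation are a simplified rendering of Kleinberg's model, not the full infinite-state automaton):

```python
import math

def burst_states(rel, total, s=2.0, gamma=1.0):
    # Two-state sketch of Kleinberg's burst automaton on batched data:
    # state 0 emits relevant documents at the base rate p0, state 1 at
    # the elevated rate p1 = s * p0; entering the burst state costs
    # gamma, leaving it is free. A Viterbi pass finds the cheapest
    # state sequence.
    p0 = sum(rel) / sum(total)
    p1 = min(0.9999, s * p0)

    def cost(burst, r, d):
        # Negative log-likelihood of r relevant documents out of d.
        p = p1 if burst else p0
        return -(r * math.log(p) + (d - r) * math.log(1 - p))

    best = [0.0, gamma]  # cheapest cost so far of ending in state 0 / 1
    back = []            # predecessor state for each (time, state)
    for r, d in zip(rel, total):
        stay0, enter0 = best[0], best[1]          # dropping out of a burst is free
        stay1, enter1 = best[1], best[0] + gamma  # entering a burst costs gamma
        back.append((0 if stay0 <= enter0 else 1,
                     1 if stay1 <= enter1 else 0))
        best = [min(stay0, enter0) + cost(0, r, d),
                min(stay1, enter1) + cost(1, r, d)]
    # Backtrack the cheapest state sequence.
    state = 0 if best[0] <= best[1] else 1
    states = []
    for ptr in reversed(back):
        states.append(state)
        state = ptr[state]
    return states[::-1]
```

On a stream whose relevant-document counts spike in the middle, the automaton labels exactly the spiking interval as a burst.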
[3] A. Krause, J. Leskovec, and C. Guestrin, “Data Association for Topic Intensity
Tracking,” Proc. Int’l Conf. Machine Learning (ICML), pp. 497-504, 2006.
When following a news event, the content and the temporal information are both important
factors in understanding the evolution and the dynamics of the news topic over time. When
recognizing human activity, the observed person often performs a variety of tasks in parallel,
each with a different intensity, and this intensity changes over time. Both examples have in
common a notion of classification: e.g., classifying documents into topics, and actions into
activities. Another common point is the temporal aspect: the intensity of each topic or
activity changes over time. In a stream of incoming email, for example, we want to associate
each email with a topic, and then model bursts and changes in the frequency of emails of
each topic. A simple approach to this problem would be to first associate each email with a
topic using some supervised, semi-supervised, or unsupervised (clustering) method, thus
segmenting the joint stream into a stream for each topic. Then, using only data from each
individual topic, we could identify bursts and changes in topic activity over time. In this
traditional view (Kleinberg, 2003), the data association (topic segmentation) problem and
the burst detection (intensity estimation) problem are viewed as two distinct tasks. However,
this separation seems unnatural and introduces additional bias to the model. We present a
unified model of what was traditionally viewed as two separate tasks: data association and
intensity tracking of multiple topics over time. In the data association part, the task is to
assign a topic (a class) to each data point, and the intensity tracking part models the bursts
and changes in intensities of topics over time. Our approach to this problem combines an
extension of Factorial Hidden Markov models for topic intensity tracking with exponential
order statistics for implicit data association. Experiments on text and email datasets show
that the interplay of classification and topic intensity tracking improves the accuracy of both
classification and intensity tracking. Even a little noise in topic assignments can mislead the
traditional algorithms; however, our approach detects correct topic intensities even with
30% topic noise. We combine the tasks of data association and intensity tracking into a
single model, where we allow the temporal information to influence classification. The
intuition is that by using temporal information the classification would improve, and by
improved classification the topic intensity and topic content evolution tracking also benefit.
[4] Z. Li, B. Wang, M. Li, and W.-Y. Ma, “A Probabilistic Model for Retrospective News
Event Detection,” Proc. Ann. Int’l ACM SIGIR Conf. Research and Development in
Information Retrieval (SIGIR), pp. 106-113, 2005.
A news event is defined as a specific thing that happens at a specific time and place [1], and
may be consecutively reported by many news articles over a period. Retrospective news Event
Detection (RED) is defined as the discovery of previously unidentified events in a historical
news corpus [12]. RED has many applications, such as detecting earthquakes that happened
in the last ten years from historical news articles. Although RED has been studied for many
years, it is still an open problem [12]. A news article contains two kinds of information:
contents and timestamps. Both are very helpful for the RED task, but most previous research
focuses on finding better utilizations of the contents [12]. The usefulness of time information
is often ignored, or at least used in unsatisfactory ways. Based on these observations, we
explore RED from the following two aspects. On the one hand, we consider better
representations of news articles and events, which should effectively model both the
contents and time information. On the other hand, we notice that previous work gave little
consideration to modelling events probabilistically. As a result, in this paper we propose a
probabilistic model for RED, in which both contents and time information are utilized.
Furthermore, based on it, we build a RED system, HISCOVERY (HIStory disCOVERY),
which can provide a vivid multimedia representation of the results of event detection.
Our main contributions include:
1). Proposing a multi-modal RED algorithm, in which both the contents and time information
of news articles are modelled explicitly and effectively.
2). Proposing an approach to determine the approximate number of events from the articles'
count-over-time distribution.
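The second contribution, reading off the approximate number of events from the count-over-time distribution, might be sketched as simple peak counting (Python; the moving-average smoothing and the local-maximum rule are our assumptions, not the paper's actual estimator):

```python
def estimate_event_count(counts, window=1):
    # Smooth the articles-per-time-unit counts with a moving average,
    # then treat each local maximum of the smoothed curve as one
    # candidate event.
    smoothed = []
    for i in range(len(counts)):
        lo, hi = max(0, i - window), min(len(counts), i + window + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    peaks = [
        i for i in range(1, len(smoothed) - 1)
        if smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]
    ]
    return len(peaks)
```

Two separate surges of news coverage then produce an estimate of two events.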
[5] Q. Mei and C. Zhai, “Discovering Evolutionary Theme Patterns from Text: An Exploration
of Temporal Text Mining,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data
Mining (KDD), pp. 198-207, 2005.
Temporal Text Mining (TTM) is concerned with discovering temporal patterns in text
information collected over time. Since most text information bears some time stamps, TTM
has many applications in multiple domains, such as summarizing events in news articles and
revealing research trends in scientific literature. In this paper, we study a particular TTM task:
discovering and summarizing the evolutionary patterns of themes in a text stream. We define
this new text mining problem and present general probabilistic methods for solving it
through (1) discovering latent themes from text; (2) constructing an evolution graph of
themes; and (3) analyzing life cycles of themes. Evaluation of the proposed methods on two
domains (i.e., news articles and literature) shows that the proposed methods can discover
interesting evolutionary theme patterns effectively. In many application domains, we
encounter a stream of text in which each text document has some meaningful timestamp. For
example, a collection of news articles about a topic and research papers in a subject area can
both be viewed as natural text streams with publication dates as time stamps. In such stream
text data, there often exist interesting temporal patterns. For example, an event covered in
news articles generally has an underlying temporal and evolutionary structure consisting of
themes (i.e., subtopics) characterizing the beginning, progression, and impact of the event,
among others. Similarly, in research papers, research topics may also exhibit evolutionary
patterns; the study of one topic in some time period may have stimulated the study of
another topic after that period. In all these cases, it would be very useful if we could
discover, extract, and summarize these evolutionary theme patterns (ETP) automatically.
Indeed, such patterns not only are useful by themselves, but also would facilitate organization
and navigation of the information stream according to the underlying thematic structures.
Consider, for example, the Asian tsunami disaster that happened at the end of 2004. A query
to Google News (http://news.google.com) returned more than 80,000 online news articles
about this event within one month (Jan. 17 through Feb. 17, 2005). It is generally very difficult
to navigate through all these news articles. For someone who has not been keeping track of
the event but wants to know about the disaster, a summary of this event would be extremely
useful. Ideally, the summary would include both the major subtopics about the event and any
threads corresponding to the evolution of these themes. For example, the themes may include
the report of the happening of the event, the statistics of victims and damage, the aid from
around the world, and the lessons from the tsunami. A thread can indicate when each theme
starts, reaches its peak, and breaks off, as well as which subsequent themes it influences. A
timeline-based theme structure as shown in Figure 1 would be a very informative summary of
the event, which also facilitates navigation through the themes.
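Step (2), constructing an evolution graph of themes, can be illustrated by linking themes in adjacent time slices whose word distributions are close in KL divergence (a Python sketch; the threshold value and the flat shared-vocabulary representation are our assumptions):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL divergence between two word distributions on a shared vocabulary.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def theme_evolution_edges(slices, threshold=0.5):
    # slices[t] is the list of theme word distributions in time slice t.
    # Link theme j in slice t+1 back to theme i in slice t whenever
    # KL(theme_j || theme_i) is below the threshold; the resulting
    # (t, i, j) triples are the edges of the evolution graph.
    edges = []
    for t in range(len(slices) - 1):
        for i, old in enumerate(slices[t]):
            for j, new in enumerate(slices[t + 1]):
                if kl_divergence(new, old) < threshold:
                    edges.append((t, i, j))
    return edges
```

A theme in a later slice is thus connected only to the earlier theme it plausibly evolved from, not to unrelated ones.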
Example:
Consider research articles as the input, selected from two different organizations. The
two organizations yield two text sequences sharing the same topics. Assuming that these
text sequences are always synchronous is not realistic here.
Drawbacks
The existing approach assumes that all sequences are synchronous (coordinated) in time,
an assumption that rarely holds in practice.
Proposed System
The proposed system applies correlation analysis across multiple text sequences during
the extraction process, which yields high-quality extraction results. It is a new investigation
procedure for extracting data efficiently and works within a probabilistic framework. The
process starts with topic discovery: common topics are first extracted using the documents'
original time stamps. After topic extraction, time synchronization is performed based on the
mutual sharing of document content, adjusting each document's time stamp according to its
most relevant topic. This reduces the asynchronism among the text sequences and yields
highly relevant content as the final output. The same process is then repeated as a
refinement step.
Advantages:
A feasibility study compares the proposed system against the old running system. Any
system is feasible given unlimited resources and infinite time. There are three aspects in the
feasibility study portion of the preliminary investigation:
Technical Feasibility
Operational Feasibility
Economical Feasibility
A system that can be developed technically, and that will be used if installed, must still
be a good investment for the organization. In the economical feasibility study, the
development cost of creating the system is evaluated against the ultimate benefit derived
from the new system. Financial benefits must equal or exceed the costs.
The system is economically feasible. It does not require any additional hardware or
software. Since the interface for this system is developed using the existing resources and
technologies available at NIC, there is only nominal expenditure, and economical feasibility
is certain.
Proposed projects are beneficial only if they can be turned into information systems
that meet the organization’s operating requirements. The operational feasibility aspects of
the project are an important part of the project implementation. Some of the important
issues raised to test the operational feasibility of a project include the following:
So there is no question of resistance from the users that can undermine the possible
application benefits.
The well-planned design would ensure the optimal utilization of the computer resources and
would help in the improvement of performance status.
The current system developed is technically feasible. It is a web based user interface
for audit workflow at NIC-CSD. Thus it provides an easy access to the users. The database’s
purpose is to create, establish and maintain a workflow among various entities in order to
facilitate all concerned users in their various capacities or roles. Permission to the users
would be granted based on the roles specified. Therefore, it provides the technical guarantee
of accuracy, reliability, and security. The software and hardware requirements for the
development of this project are few, and are already available in-house at NIC or available
for free as open source.
The work for the project is done with the current equipment and existing software
technology. Necessary bandwidth exists for providing a fast feedback to the users
irrespective of the number of users using the system.
Tools Used:
Java(1.7)
Servlet
JSP
Java:
Java is a general-purpose, object-oriented programming language; compiled Java
bytecode runs on any platform with a Java Virtual Machine.
Servlet:
The servlet is a Java programming language class used to extend the capabilities of a
server.
Although servers can respond to any types of requests, they are commonly used to
extend the applications hosted by web servers.
Servlets can be thought of as the server-side counterpart of Java applets: they run on
servers instead of in web browsers.
JSP:
JSP is a technology that helps software developers create dynamically generated web
pages based on HTML, XML, or other document types.
Released in 1999 by Sun Microsystems, JSP is similar to PHP, but it uses the Java
programming language.
Both communities share many similar problems and are therefore potentially helpful to
each other.
In this current study we consider some existing frameworks for data mining,
including database perspective and inductive databases approach, the reductionist statistical
and probabilistic approaches, data compression approach, and constructive induction
approach. We consider their advantages and limitations, analyzing what these approaches
account for in data mining research and what they do not.
Ives et al. (1980) consider an information system (IS) in an organizational environment that
is further surrounded by an external environment. According to the framework an
information system itself includes three environments: user environment, IS development
environment, and IS operations environment. Drawing an analogy to this framework we
consider a data mining system as a special kind of adaptive information system that processes
data and helps to make use of it. Adaptation in this context is important because of the fact
that the data mining system is often aimed to produce solutions to various real-world
problems, and not to a single problem. On the one hand, a data mining system is equipped
with a number of techniques to be applied to a problem at hand. On the other hand, there
exist a number of different problems, and current research has shown that no single technique
can dominate all others over all possible data-mining problems (Wolpert and
MacReady, 1996). Nevertheless, many empirical studies report that a technique or a group of
techniques can perform significantly better than any other technique on a certain data-mining
problem or a group of problems (Kiang, 2003). Therefore, data mining research can be
viewed as a continuous, never-ending development process of a DM system towards the
efficient solution of an ever-wider range of problems.
In Boulicaut et al. (1999) an inductive databases framework for the data mining and
knowledge discovery in databases (KDD) modeling was introduced. The basic idea here is
that a data-mining task can be formulated as locating interesting sentences from a given logic
that are true in the database. Discovering knowledge from data can then be viewed as
querying the set of interesting sentences. Therefore the term “an inductive database” refers to
such a type of databases that contains not only the data but a theory about the data as well
(Boulicaut et al., 1999).
This approach has some logical connection to the idea of deductive databases, which
contain normal database content and additionally a set of rules for deriving new facts from
the facts already present in the database. This is a common inner data representation. For a
database user, all the facts derivable from the rules are presented, as they would have been
actually stored there. In a similar way, there is no need to have all the rules that are true about
the data stored in an inductive database. However, a user may imagine that all these rules are
there, although in reality, the rules are constructed on demand. The description of an
inductive database consists of a normal relational database structure with an additional
structure for performing generalizations. It is possible to design a query language that works
on inductive databases (Boulicaut et al., 1998).
In Mannila (2000) two simple approaches to the theory of data mining are analysed.
The first one is the reductionist approach of viewing data mining as statistics. Generally, it is
possible to consider the task of data mining from the statistical point of view, emphasizing
the fact that DM techniques are applied to larger datasets than in statistics. In this
situation the analysis of the appropriate statistics literature, where strong analytical
background is accumulated, would solve most of the data mining problems. Many data
mining tasks naturally may be formulated in statistical terms, and many statistical
contributions may be used in data mining in a quite straightforward manner. The second
approach discussed by Mannila (2000) is a probabilistic approach. Generally, many data
mining tasks can be seen as the task of finding the underlying joint distribution of the
variables in the data. Good examples of this approach would be a Bayesian network or a
hierarchical Bayesian model, which give a short and understandable representation of the
joint distribution. Data mining tasks dealing with clustering and/or classification fit easily
into this approach. However, it should be admitted that data mining researchers with a
computer science background typically have rather little education in statistics, and this is
one reason why achievements from statistics are not used to the extent that would be
possible.
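A miniature example of this probabilistic view is naive Bayes classification, which models the joint distribution P(class, features) under an independence assumption (an illustrative Python sketch, not tied to any particular system discussed here):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    # examples: list of (class_label, list_of_features). We estimate
    # P(c) and P(f | c) from counts, i.e. the joint P(c, f1..fk) under
    # the naive independence assumption P(c, f1..fk) = P(c) * prod P(fi | c).
    class_counts = Counter(c for c, _ in examples)
    feature_counts = defaultdict(Counter)
    for c, feats in examples:
        feature_counts[c].update(feats)
    return class_counts, feature_counts

def classify(model, feats):
    class_counts, feature_counts = model
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        total = sum(feature_counts[c].values())
        p = cc / n
        for f in feats:
            # Laplace smoothing keeps unseen features from zeroing p.
            p *= (feature_counts[c][f] + 1) / (total + len(feature_counts[c]) + 1)
        if p > best_p:
            best, best_p = c, p
    return best
```

Classification and clustering both reduce to reasoning about such a joint distribution, which is exactly the point the probabilistic approach makes.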
A deeper consideration of data mining and statistics shows that the volume of the data
being analysed and the different backgrounds of researchers are probably not the most
important factors that make the difference between the areas. Data mining is an applied area
of science, and limitations in available computational resources are a big issue when
applying results from statistics to data mining. The other important issue is that data mining
approaches emphasize database integration, simplicity of use, and understandability of
results. Last but not least, Mannila (2000) points out that the theoretical framework of
statistics does not concern itself much with data analysis as a process, which generally
includes data understanding, data preparation, data exploration, results evaluation, and
visualization steps. However, there are people (mainly with a strong statistical background)
who equate DM with applied statistics, because many tasks of DM may be perfectly
represented in terms of statistics.
A data compression approach to data mining can be stated in the following way:
compress the dataset by finding some structure or knowledge for it, where knowledge is
interpreted as a representation that allows coding the data using a smaller number of bits. For
example, the minimum description length (MDL) principle (Mehta et al., 1995) can be used
to select among different encodings, accounting for both the complexity of a model and its
predictive accuracy. Machine learning practitioners have used the MDL principle in different
interpretations to recommend that even when a hypothesis is not the most empirically
successful among those available, it may still be the one to choose if it is simple enough.
The idea is to trade consistency with the training examples against empirical adequacy
in terms of predictive success, as, for example, in accurate decision tree construction.
Bensusan (2000) connects this to another methodological issue, namely that theories should
not be ad hoc, that is, they should not overfit the examples used to build them. Simplicity is
the remedy for being ad hoc, both in the recommendations of the philosophy of science and
in the practice of machine learning. The data compression approach is also connected with
the rather old Occam’s razor principle, introduced in the 14th century. The most commonly
used formulation of this principle in data mining is: “when you have two competing models
which make exactly the same predictions, the simpler one is the better.”
Many (if not all) data mining techniques can be viewed in terms of the data compression
approach. For example, association rules and pruned decision trees can be viewed as ways of
providing compression of parts of the data. Clustering approaches can also be considered as a
way of compressing the dataset. There is a connection to Bayesian theory for modelling the
joint distribution – any compression scheme can be viewed as providing a distribution on the
set of possible instances of the data. However, in order to produce a structure that would be
comprehensible to the user, it is necessary to select such compression method(s) that is (are)
based on concepts that are easy to understand.
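As a toy illustration of the MDL trade-off described above, the following sketch compares two hypothetical decision trees by total description length. All numbers are invented for illustration, and `descriptionLength` is a hypothetical helper, not part of any DM toolkit:

```java
// Illustrative MDL comparison: prefer the model whose total description
// length (model cost plus cost of encoding the data given the model)
// is smallest, even if it is not the most accurate model available.
public class MdlExample {

    // Total description length in bits: model complexity plus the bits
    // needed to encode the exceptions (misclassified training examples).
    static double descriptionLength(double modelBits, int errors, double bitsPerError) {
        return modelBits + errors * bitsPerError;
    }

    public static void main(String[] args) {
        // A large tree: 500 bits to encode, only 2 training errors.
        double complexTree = descriptionLength(500.0, 2, 10.0); // 520 bits
        // A pruned tree: 80 bits to encode, 12 training errors.
        double prunedTree = descriptionLength(80.0, 12, 10.0);  // 200 bits

        // MDL prefers the pruned tree despite its lower empirical accuracy.
        System.out.println(complexTree > prunedTree ? "pruned" : "complex");
    }
}
```

The pruned tree wins here because its smaller model cost outweighs the extra bits needed to encode its additional training errors.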
Constructive induction is a learning process that consists of two intertwined phases: one
is responsible for constructing the "best" representation space, and the other is
concerned with generating hypotheses in the space that was found. Constructive induction methods are
classified into three categories: data-driven (information from the training examples is used),
hypothesis-driven (information from the analysis of the form of intermediate hypotheses is
used) and knowledge-driven (domain knowledge provided by experts is used) methods. Any
kind of inference strategy (including induction, abduction, analogy and other forms of non-
truth-preserving and non-monotonic inference) can potentially be used. However, the focus
is usually on operating on higher-level data concepts and theoretical terms rather than on pure data.
Michalski (1997) considers constructive operators (which expand the representation space by attribute
generation) and destructive operators (which contract the representation space by feature selection or feature
abstraction) that can be applied to produce a representation space better than
the original one. Bensusan (1999) showed that too many theoretical terms can
impair induction. This vindicates an old piece of advice from the philosophy of science: avoid adding
unnecessary metaphysical baggage to a theory. Theoretical terms are often contrasted with
observational terms. It is generally accepted that the more data we have, the better the model we
can construct. However, this is not true for the higher-level concepts that constitute a theory.
Many data mining techniques that apply wrapper/filter approaches to combine feature
selection, feature extraction or feature construction processes (as means of dimensionality
reduction and/or as a search for a better representation of the problem) with a classifier
or another type of learning algorithm can be considered constructive induction approaches.
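A data-driven constructive operator of the kind described above can be sketched as follows. This is a hypothetical helper (not from any named system) that expands the representation space by generating a new attribute, here the ratio of two existing numeric attributes, for every training example:

```java
// Sketch of a constructive (space-expanding) operator: append a derived
// attribute ratio = value[a] / value[b] to every example, so that a
// downstream learner can exploit the new higher-level feature.
public class FeatureConstruction {

    // Returns a copy of the data with one extra column holding the ratio
    // of attribute a to attribute b (0 when the denominator is zero).
    static double[][] addRatioFeature(double[][] data, int a, int b) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            double[] row = new double[data[i].length + 1];
            System.arraycopy(data[i], 0, row, 0, data[i].length);
            row[row.length - 1] = data[i][b] == 0 ? 0 : data[i][a] / data[i][b];
            out[i] = row;
        }
        return out;
    }
}
```

A destructive operator would do the opposite, dropping or merging columns to contract the representation space.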
Generations of DM systems
The history of data mining systems' development to date spans three main stages or generations.
The year 1989 can be associated with the first generation of data mining/KDD systems, when only a few
single-task data mining tools, such as the C4.5 decision tree algorithm (Quinlan, 1993), existed.
They were difficult to use and required significant data preparation. Most such systems were
based on a loosely coupled architecture, in which the database and the data mining subsystem
were realised as separate, independent parts. This architecture demands continuous context
switching between the data mining engine and the database (Imielinski and Mannila, 1996).
The year 1995 can then be associated with the formation of the second generation of tool suites.
Data mining, as the core part of KDD, started to be seen as "the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad,
1996, 22). Some examples of knowledge discovery systems that follow Fayyad's view of
DM as a process are SPSS Clementine, SGI MineSet, and IBM Intelligent Miner.
Numerous KDD systems have been developed since then. At the beginning of the millennium
there existed about 200 tools that could each perform several tasks (clustering, classification,
visualization) for specialized applications (therefore often called "vertical solutions")
(Piatetsky-Shapiro, 2000). This growing trend towards integrating data mining tools with
specialized applications has been associated with the third generation of DM systems.
Therefore, a good data mining system should be adaptive in solving the
current problem as it is affected by the dynamically changing environment, and should be continuously
developed towards the efficient utilization of available DM techniques.
Hardware Requirements
Software Requirements
3.1 Users:
Admin: The admin uploads the data; the uploaded data is inserted into the database so that
it can be searched easily.
User: The user logs into the application by giving a username and password, which are validated
before the user home page is shown.
Graph: The result is represented in the form of a graph, which shows how many times the searched
data occurs in a paper.
Synchronous: The data is shown as synchronous data; it gives the relationship
between one document and another.
Asynchronous: The data is shown as asynchronous data; it does not give the
relationship between one document and another.
Usability
The system is designed as a completely automated process, so there is little or no user
intervention.
Reliability
The system is more reliable because of the qualities inherited from the chosen
platform, Java. Code built using Java is more reliable.
Performance
The system is developed in a high-level language using advanced front-end and
back-end technologies, so it responds to the end user on the client system within a very short
time.
Supportability
The system is designed to be cross-platform. It is supported on a
wide range of hardware and on any software platform that has a JVM built into the
system.
4. SYSTEM DESIGN
4.1 Introduction
The purpose of the design phase is to plan a solution to the problem specified by
the requirements document. This phase is the first step in moving from the problem domain to
the solution domain. In other words, starting with what is needed, design takes us toward
how to satisfy the needs. The design of a system is perhaps the most critical factor affecting
the quality of the software; it has a major impact on the later phases, particularly testing and
maintenance. The output of this phase is the design document. This document is similar to a
blueprint for the solution and is used later during implementation, testing and maintenance.
The design activity is often divided into two separate phases: System Design and Detailed
Design.
System Design, also called top-level design, aims to identify the modules that should
be in the system, the specifications of these modules, and how they interact with each other to
produce the desired results. At the end of system design, all the major data structures, file
formats, output formats, and the major modules in the system and their specifications are
decided. During Detailed Design, the internal logic of each of the modules specified in
system design is decided. In this phase, the details of the data of a module are usually
specified in a high-level design description language, which is independent of the target
language in which the software will eventually be implemented.
In system design the focus is on identifying the modules, whereas during detailed
design the focus is on designing the logic for each of the modules. In other words, in system
design the attention is on what components are needed, while in detailed design the issue is how the
components can be implemented in software. Design is concerned with
identifying software components, specifying the relationships among them, specifying the
software structure, and providing a blueprint for the implementation phase. Modularity is one of the
desirable properties of large systems. It implies that the system is divided into several parts
in such a manner that the interaction between the parts is minimal and clearly specified.
Once the software requirements and the feasibility study report were frozen, the next
vital step was the design of the software. This is the most crucial part of software
development, and the success of any system depends upon how efficiently it has been
designed. The main design of the system goes through these phases
of system development:
Conceptual design
Detailed design
Conceptual Design
Input Design:
Once the analysis of the system has been done, it is necessary to identify the
data that must be processed to produce the outputs. Input design features can ensure
the reliability of the system and generate correct reports from accurate data. The input design
also determines whether the user can interact efficiently with the system.
Since JAVA has been chosen for this system, the user should easily understand the
screens designed. Validations are carried out easily, and the user will have no difficulty in
adding a new entry on the applet page.
Output Design:
Computer output is the most important and direct source of information for the user.
Efficient, intelligible output design should improve the system's relationship with the user and
help in decision making. A major form of output is viewing the frames at the destination
place.
Code Design:
When large volumes of data are being handled, it is important that the items to be
stored can be selected easily and quickly.
To accomplish this, each data item must have a unique specification and must be related to
other forms or items of data of the same type.
Process Design:
The purpose of the process design is to establish the communication between the data
and the code. The process takes data from the database and checks it against the
user-entered data or any other data. Once the process has finished, the output or the
report is generated based on the process results.
Detailed Design:
System analysis is the process of gathering and interpreting facts, diagnosing
problems and using the information to recommend improvements to the system. Structured
analysis is done with the aid of the detailed design, which includes:
1. Time Synchronization
Select N documents as the sequence documents. Initially, all the documents are
related to results based on a single time stamp. Across the multiple documents, the same time
stamps and the same sequences of data are identified; these sequences of data are extracted from the
different topics. Extraction of the common-topic data is possible based on the word distribution process.
Results are extracted from the different asynchronous sequences with the help of their interaction,
which works like a mining process and yields the common-topic data as meaningful data in the
output specification process.
Asynchronism:
The documents display results with different time stamps, so the documents' data must be
arranged properly. When the adjusted time environment is applied to these documents,
efficient results can be extracted in the output process.
Topic Extraction:
The extraction process is applied at all current time stamps. It yields an efficient
solution and displays the synchronous text sequences as results in the output environment. The
same process is applied as a refinement step until good results are reached, displayed
as maximized feature extraction. It gives a guaranteed solution as a
meaningful result.
Time Synchronization:
Here a matching operation is performed between two or more documents. The
sequences are updated according to time, and new sequence-based results are identified.
This produces newly discovered content-based results and yields the smallest, rewritten
results in the implementation.
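The alternation between topic extraction and time synchronization described in these modules can be illustrated with a much-simplified sketch (illustrative only, not the project's exact algorithm): given a topic-intensity series for one sequence and a common reference intensity, the synchronization step searches for the integer time shift that best aligns the two.

```java
// Simplified time-synchronization step: slide one sequence's topic
// intensity over a common reference intensity and keep the shift with
// the highest alignment score. A full method would alternate this step
// with re-estimating the topics themselves.
public class TimeSync {

    // Dot-product score of seq shifted forward by `shift` against ref.
    static double alignmentScore(double[] seq, double[] ref, int shift) {
        double s = 0;
        for (int t = 0; t < ref.length; t++) {
            int u = t - shift;                  // index into the shifted sequence
            if (u >= 0 && u < seq.length) s += seq[u] * ref[t];
        }
        return s;
    }

    // Exhaustively search shifts in [-maxShift, maxShift] for the best alignment.
    static int bestShift(double[] seq, double[] ref, int maxShift) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int shift = -maxShift; shift <= maxShift; shift++) {
            double s = alignmentScore(seq, ref, shift);
            if (s > bestScore) { bestScore = s; best = shift; }
        }
        return best;
    }
}
```

For example, a sequence whose topic burst occurs two time steps before the reference burst is assigned a shift of +2, which moves its documents onto the common time axis.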
A Data Flow Diagram is a graphical tool used to describe and analyze the movement of data through a system, manual or
automated, including the processes, the stores of data, and the delays in the system. Data Flow
Diagrams are the central tool and the basis from which other components are developed. The
transformation of data from input to output, through processes, may be described logically
and independently of the physical components associated with the system. The DFD is also
known as a data flow graph or a bubble chart.
DFDs model the proposed system. They should clearly show the requirements on
which the new system is to be built. Later, during the design activity, they are taken as the basis
for drawing the system's structure charts. The basic notation used to create DFDs is as
follows:
2. Process: People, procedures, or devices that use or produce (Transform) Data. The
physical component is not identified.
4. Data Store: Here data are stored or referenced by a process in the System.
Use case diagrams represent the functionality of the system from a user's point of view. A use
case describes a function provided by the system that yields a visible result for an actor; an
actor describes any entity that interacts with the system.
The identification of actors and use cases results in the definition of the boundary of the
system, that is, in differentiating the tasks accomplished by the system from the tasks
accomplished by its environment. The actors are outside the boundary of the system, whereas
the use cases are inside it.
A use case contains all the events that can occur between an actor and the system, together with
a set of scenarios that explain the interactions as a sequence of happenings.
Actors
Actors represent external entities that interact with the system. An actor can be a human or an
external system.
Actors are not part of the system; they represent anyone or anything that interacts with the
system.
During this activity, developers identify the actors involved; in this system they are:
User:
The user is an actor who uses the system and performs the operations, such as data
classification and performance execution, that are required of him.
Use Cases:
Use cases are used during requirements elicitation and analysis to represent the functionality
of the system. Use cases focus on the behaviour of the system from an external point of view.
[Use case diagram: actors Admin and User; use cases: Extract top-10 topical words,
Asynchronous words, Synchronous words]
Sequence diagrams are used to formalize the dynamic behavior of the system and to visualize
the communication among objects. They are useful for identifying additional objects
that participate in a use case. A sequence diagram represents the objects participating in the
interaction horizontally and time vertically.
Sequence diagrams typically show a user or actor along with the objects and components they
interact with during the execution of the use case. Each column represents an object that participates
in the interaction. Messages are shown as solid arrows, with the labels on the arrows giving the
message names. Activations are depicted as vertical rectangles. The actor who initiates the
interaction is shown in the leftmost column, and the messages coming from the actor represent
the interactions described in the use case diagram.
Messages Description:
1. The user provides the input file to the system, which is a text document file.
2. Clustering is performed among all the document files.
3. The document information and cluster information are checked; this gives the number of
clusters after performing the clustering among all the document files.
4. The correlation measure is calculated by using the cosine similarity method and other
parameters.
5. The accuracy of the CPI and K-means document clustering is checked.
6. Finally, the results of the K-means and CPI methods are compared to determine which of the
two methods is the better one.
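The cosine similarity used for the correlation measure in step 4 can be sketched as follows. This is a minimal illustration assuming both documents are represented as term-frequency vectors over the same vocabulary (a hypothetical helper, not the project's exact code):

```java
// Cosine similarity between two term-frequency vectors: the dot product
// divided by the product of the vector norms. Returns a value in [0, 1]
// for non-negative term frequencies (1 = identical direction).
public class CosineSimilarity {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        // Guard against all-zero vectors to avoid division by zero.
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Two documents with identical word distributions score 1, while documents sharing no words score 0, which makes the measure convenient for comparing clusters.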
[Sequence diagram: participants User, Registration, Login, Upload Data, Enter the Keyword,
Asynchronous Documents, Synchronous Documents, Top 10 Topical Words, Asynchronous Words,
Synchronous Words, View Graph, Admin; messages: Registration, Login, Authentication,
Login Fail, Enter Keyword, Display Graph, View Graph]
Sequence Diagram
A UML state chart is a notation for describing the sequence of states an object goes through in
response to external events. Objects have behaviour and state; the state of an object depends
on its current activity or condition. A state chart diagram shows the possible states of the
object and the transitions that cause a change in state.
[Collaboration diagram: User interacts with Registration, Login, Enter the Keyword,
Asynchronous Documents, Top 10 Topical Words, and Synchronous Words; messages:
1: Registration, 2: Login, 4: Authentication, 5: Login Fail, 6: View Asynchronous Documents,
7: Enter Keyword, 10: Find Top 10 Topical Words, 11: View Synchronous Words]
An activity diagram describes the behavior of the system in terms of activities. Activities are
modeling elements that represent the execution of a set of operations. The completion of these
operations triggers a transition to another activity. Activity diagrams are similar to flowchart
diagrams in that they can be used to represent control flow and data flow.
[Activity diagram: User, Login, Login Fail / Login Success, User Home, Logout]
Activity Diagram
[Use case diagram: System with Upload Content, Search Ke..., Asynchron..., synchron...,
View Res...]
ER Diagram:
There are usually many instances of an entity-type. Because the term entity-type is somewhat
cumbersome, most people tend to use the term entity as a synonym for this term. Entities can
be thought of as nouns. Examples: a computer, an employee, a song, a mathematical theorem.
A relationship captures how two or more entities are related to one another. Relationships can
be thought of as verbs, linking two or more nouns.
The model's linguistic aspect described above is utilized in the declarative database query
language ERROL, which mimics natural language constructs. Entities and relationships can
both have attributes. Examples: an employee entity might have a Social Security Number
(SSN) attribute; the proved relationship may have a date attribute.
Every entity (unless it is a weak entity) must have a minimal set of uniquely identifying
attributes, which is called the entity's primary key. Entity-relationship diagrams don't show
single entities or single instances of relations. Rather, they show entity sets and relationship
sets. Example: a particular song is an entity.
The collection of all songs in a database is an entity set. The eaten relationship between a
child and her lunch is a single relationship. The set of all such child-lunch relationships in a
database is a relationship set. In other words, a relationship set corresponds to a relation in
mathematics, while a relationship corresponds to a member of the relation. Certain
cardinality constraints on relationship sets may be indicated as well.
ER Diagram:
Data Dictionary:
1. User details (includes Address)
2. Login master
3. Uploaddata
5. SYSTEM IMPLEMENTATION
Introduction
The system can be implemented only after thorough testing is done and it is found to
work according to the specification. Implementation involves careful planning, investigation of the current
system and its constraints on implementation, design of methods to achieve the changeover,
and an evaluation of changeover methods, apart from planning. Two major tasks in
preparing the implementation are education and training of the users and testing of the
system.
The more complex the system being implemented, the more involved the
systems analysis and design effort required just for implementation will be. The implementation
phase comprises several activities. The required hardware and software acquisition is
carried out, and the system may require some software to be developed. For this, programs are written and
tested. The user then changes over to the new, fully tested system and the old system is
discontinued.
Implementation is the process of having systems personnel check out and put new
equipment into use, train users, install the new application, and construct any data files
needed for it.
Depending on the size of the organization that will be involved in using the
application and the risk associated with its use, system developers may choose to test the
operation in only one area of the firm, say in one department or with only one or two persons.
Sometimes they will run the old and new systems together to compare the results. In still
other situations, developers will stop using the old system one day and begin using the new
one the next. Each implementation strategy has its merits, depending on the situation.
Logics
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Vector;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class FormAction extends HttpServlet {

    private Connection con;

    public FormAction() {
        con = ConnectionFactory.getConnection();
    }

    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Remember the searched topic name for the result page
        String key = request.getParameter("title");
        HttpSession session = request.getSession();
        session.setAttribute("key", key);

        Vector<WebFormBean> vector = new Vector<WebFormBean>();
        PreparedStatement ps = null;
        ResultSet rs = null;
        try {
            // Fetch every uploaded document, ordered by its time stamp
            ps = con.prepareStatement(
                "select catagory, topicname, datas, "
              + "TO_CHAR(timestamp,'dd-MON-yyyy') from uploaddata order by timestamp");
            rs = ps.executeQuery();
            while (rs.next()) {
                WebFormBean example = new WebFormBean();
                example.setCatagory(rs.getString(1));
                example.setTopicname(rs.getString(2));
                example.setDatas(rs.getString(3));
                example.setTimestamp(rs.getString(4));
                vector.add(example);
            }
            session.setAttribute("vector", vector);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (rs != null) rs.close();
                if (ps != null) ps.close();
            } catch (Exception ignore) {
            }
        }
        RequestDispatcher rd = request.getRequestDispatcher("./SearchResult.jsp");
        rd.forward(request, response);
    }
}
Fig 5.3.8 Enter the topic name to be searched
6. SOFTWARE TESTING
Unit testing focuses on the building blocks of the software system, that is, objects and
subsystems. There are three motivations for focusing on components. First, unit testing
reduces the complexity of the overall testing activities, allowing us to focus on smaller units of
the system. Second, unit testing makes it easier to pinpoint and correct faults, given that few
components are involved in the test. Third, unit testing allows parallelism in the testing
activities, that is, each component can be tested independently of the others. Hence the
goal is to test the internal logic of the module.
In integration testing, many tested modules are combined into subsystems, which are
then tested. The goal here is to see whether the modules can be integrated properly, the emphasis
being on testing module interaction.
After structural testing and functional testing we get error-free modules. These modules are
then integrated to obtain the required results of the system. After checking a module, another
module is tested and integrated with the previous one. After the integration, the test
cases are generated and the results are tested.
Acceptance Testing
Acceptance testing is sometimes performed with realistic data from the client to demonstrate
that the software is working satisfactorily. Testing here focuses on the external behavior of the
system; the internal logic of the program is not emphasized. In acceptance testing the system
is tested with various inputs.
Types of Testing
1. Black box or functional testing
2. White box testing or structural testing
Interface errors
Errors in data structure
Performance errors
Initialization and termination errors
As shown in the following figure of black box testing, we do not think about the internal
workings; we only ask:
What is the input to our system?
What is the output for a given input to our system?
The black box is an imaginary box that hides its internal workings.
[Figure: black box testing, Input → (internal working hidden) → Output]
A test case is a set of input data and expected results that exercises a component with the
purpose of causing failures and detecting faults. A test case is an explicit set of instructions
designed to detect a particular class of defect in a software system by bringing about a
failure. A test case can give rise to many tests.
Test case format
Test Case 1: Validate Login Name
Test steps: The user enters special characters (say !@hi&*P) as the Login Name and clicks
the Submit button.
Expected result: Since the login page must not allow special characters, the error message
"Invalid Login or Password" must be displayed.

Test Case 2: Registration, Validate User Name
Test steps: The user enters the User Name on the Registration page and clicks the Submit
button.
Expected result: To verify that the User Name must be declared, an error message
"User Name must be declared" is displayed.
7. CONCLUSION
In the current project we tackle the problem of mining common topics from multiple
asynchronous text sequences. The study proposes a novel method which can automatically
discover and fix potential asynchronism among sequences and consequently extract better
common topics. The key idea of our method is to introduce a self-refinement process by
utilizing correlation between the semantic and temporal information in the sequences. It
performs topic extraction and time synchronization alternately to optimize a unified objective
function. A local optimum is guaranteed by our algorithm.
8. FUTURE ENHANCEMENTS
We justified the effectiveness of our method on two real-world data sets, with comparison to
a baseline method. Empirical results suggest that
1) Our method is able to find meaningful and discriminative topics from asynchronous text
sequences;
2) Our method significantly outperforms the baseline method, evaluated both qualitatively and
quantitatively;
3) The performance of our method is robust and stable under different parameter settings
and random initialization.
9. BIBLIOGRAPHY
[1] D.M. Blei and J.D. Lafferty, “Dynamic Topic Models,” Proc. Int’l Conf. Machine
Learning (ICML), pp. 113-120, 2006.
[2] G.P.C. Fung, J.X. Yu, P.S. Yu, and H. Lu, “Parameter Free Bursty Events Detection in
Text Streams,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 181-192, 2005.
[3] J.M. Kleinberg, “Bursty and Hierarchical Structure in Streams,” Proc. ACM SIGKDD
Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 91-101, 2002.
[4] A. Krause, J. Leskovec, and C. Guestrin, “Data Association for Topic Intensity
Tracking,” Proc. Int’l Conf. Machine Learning (ICML), pp. 497-504, 2006.
[5] Z. Li, B. Wang, M. Li, and W.-Y. Ma, “A Probabilistic Model for Retrospective News
Event Detection,” Proc. Ann. Int’l ACM SIGIR Conf. Research and Development in
Information Retrieval (SIGIR), pp. 106-113, 2005.
WEBSITES
1. www.java.sun.com
2. www.google.com