CHAPTER-1
INTRODUCTION
Microblogging websites have evolved into a source of varied kinds of information. This is due to the nature of microblogs, which allow users to communicate with each other easily. On Twitter, users post real-time messages about their opinions on a variety of topics, discuss current issues, complain, and express positive sentiment about products they use in daily life. In fact, companies manufacturing such products have started to poll these microblogs to get a sense of the general sentiment towards their products. Many of these companies study user reactions and reply to users on microblogs. In particular, it is not clear whether, as an information source, Twitter can simply be regarded as a faster news feed that covers mostly the same information as traditional news media.
Data mining software is one of a number of analytical tools for analyzing data. It
allows users to analyze data from many different dimensions or angles, categorize it,
and summarize the relationships identified. Technically, data mining is the process of
finding correlations or patterns among dozens of fields in large relational databases.
1) Extract, transform, and load transaction data onto the data warehouse system.
5) Fast-paced and prompt access to data, along with economical processing techniques, has made data mining one of the most suitable services a company can seek.
Dept. of CSE, P V P Siddhartha institute of technology
5
1. Marketing / Retail:
Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing campaigns, etc. With these results, marketers can take a more targeted approach to selling profitable products to the right customers.
Data mining brings similar benefits to retail companies. Through market basket analysis, a store can arrange its products so that items frequently bought together are placed together, making shopping more pleasant. In addition, it helps retail companies offer discounts on particular products that will attract more customers.
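At its simplest, market basket analysis counts how often pairs of items appear together in transactions; pairs with high counts suggest products to shelve or discount together. A minimal sketch, with hypothetical transaction data and a made-up class name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class BasketPairs {
    // Count how often each pair of items appears together in a transaction.
    static Map<String, Integer> pairCounts(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> basket : transactions) {
            // Deduplicate and sort so each pair gets one canonical key.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++)
                for (int j = i + 1; j < items.size(); j++)
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> txns = Arrays.asList(
            Arrays.asList("bread", "milk"),
            Arrays.asList("bread", "milk", "eggs"),
            Arrays.asList("milk", "eggs"));
        System.out.println(pairCounts(txns).get("bread+milk")); // bought together in 2 baskets
    }
}
```

A real system would normalize these counts into support and confidence before recommending shelf placements or discounts.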
2. Finance / Banking:
Data mining gives financial institutions information about loans and credit reporting. By building a model from historical customer data, a bank or financial institution can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect credit card owners.
3. Manufacturing:
4. Governments:
5. Law enforcement:
Data mining can aid law enforcers in identifying criminal suspects as well as
apprehending these criminals by examining trends in location, crime type, habit, and
other patterns of behaviours.
6. Researchers:
Data mining can assist researchers by speeding up their data analysis, thus allowing them more time to work on other projects.
What is latent Dirichlet allocation? It is a way of automatically discovering the topics that a collection of sentences contains. For example, given a small collection of sentences about food and about cute animals and asked for 2 topics, LDA might produce something like
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point you could interpret topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point you could interpret topic B to be about cute animals)
In more detail, LDA represents documents as mixtures of topics that spit out
words with certain probabilities. It assumes that documents are produced in the
following fashion: when writing each document, you:
Decide on the number of words N the document will have (say, according to a
Poisson distribution).
Choose a topic mixture for the document (according to a Dirichlet distribution
over a fixed set of K topics). For example, assuming that we have the two food
and cute animal topics above, you might choose the document to consist of 1/3
food and 2/3 cute animals.
Generate each word w_i in the document by:
First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
Then using that topic to generate the word itself (according to the topic’s multinomial distribution; for example, the food topic might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on).
Assuming this generative model for a collection of documents, LDA then tries
to backtrack from the documents to find a set of topics that are likely to have
generated the collection.
Example:
Let’s make an example. According to the above process, when generating some particular document D, you might:
Decide that D will be 1/2 about food and 1/2 about cute animals.
Pick the first word to come from the food topic, which then gives you the
word “broccoli”.
Pick the second word to come from the cute animals topic, which gives you
“panda”.
Pick the third word to come from the cute animals topic, giving you
“adorable”.
Pick the fourth word to come from the food topic, giving you “cherries”.
Pick the fifth word to come from the food topic, giving you “eating”.
So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model, so word order does not matter).
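The generative story above can be simulated in a few lines of Java. The seed, the two toy topic vocabularies, and the uniform within-topic word choice are simplifying assumptions for illustration (real LDA draws words from a learned multinomial):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LdaGenerate {
    static final String[] FOOD = {"broccoli", "bananas", "cherries", "eating"};
    static final String[] ANIMALS = {"panda", "kitten", "adorable", "cute"};

    // Sample an n-word document from a 50/50 mixture of the two toy topics.
    static List<String> generate(long seed, int n) {
        Random rng = new Random(seed);
        List<String> words = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Step 1: pick a topic from the document's topic mixture.
            String[] topic = rng.nextDouble() < 0.5 ? FOOD : ANIMALS;
            // Step 2: pick a word from that topic (uniform here for simplicity).
            words.add(topic[rng.nextInt(topic.length)]);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", generate(7L, 5)));
    }
}
```

Running this prints a five-word bag-of-words document drawn from the two topics, much like the “broccoli panda …” example above.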
Learning:
So now suppose you have a set of documents. You’ve chosen some fixed
number of K topics to discover, and want to use LDA to learn the topic
representation of each document and the words associated to each topic. How do you
do this? One way (known as collapsed Gibbs sampling) is the following:
Go through each document, and randomly assign each word in the document to
one of the K topics.
Notice that this random assignment already gives you both topic representations
of all the documents and word distributions of all the topics (albeit not very good
ones).
So to improve on them, for each document d, go through each word w in d, and for each topic t compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w. Then reassign w a new topic, choosing topic t with probability p(topic t | document d) * p(word w | topic t); according to our generative model, this is essentially the probability that topic t generated word w.
After repeating the previous step a large number of times, you’ll eventually
reach a roughly steady state where your assignments are pretty good. So use
these assignments to estimate the topic mixtures of each document and the
words associated to each topic (by counting the proportion of words assigned
to each topic overall).
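The resampling loop described above can be sketched as a collapsed Gibbs sampler over a toy corpus of integer word ids. The corpus, the hyperparameters alpha and beta, and the class name are illustrative assumptions, not the project’s actual implementation:

```java
import java.util.Arrays;
import java.util.Random;

public class GibbsLda {
    // Collapsed Gibbs sampling for LDA over a toy corpus.
    // docs[d][i] is the vocabulary id of the i-th word of document d.
    static int[][] sample(int[][] docs, int K, int V, int iters, long seed) {
        double alpha = 0.1, beta = 0.01;            // illustrative hyperparameters
        Random rng = new Random(seed);
        int[][] z = new int[docs.length][];         // topic assignment of every word
        int[][] ndk = new int[docs.length][K];      // words in doc d assigned to topic k
        int[][] nkw = new int[K][V];                // times word w is assigned to topic k
        int[] nk = new int[K];                      // total words assigned to topic k
        for (int d = 0; d < docs.length; d++) {     // step 1: random initial assignment
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                int k = rng.nextInt(K);
                z[d][i] = k; ndk[d][k]++; nkw[k][docs[d][i]]++; nk[k]++;
            }
        }
        double[] p = new double[K];
        for (int it = 0; it < iters; it++)          // step 2: resample each word's topic
            for (int d = 0; d < docs.length; d++)
                for (int i = 0; i < docs[d].length; i++) {
                    int w = docs[d][i], old = z[d][i];
                    ndk[d][old]--; nkw[old][w]--; nk[old]--;  // remove current assignment
                    double total = 0;
                    for (int k = 0; k < K; k++) {   // p(topic|doc) * p(word|topic)
                        p[k] = (ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta);
                        total += p[k];
                    }
                    double u = rng.nextDouble() * total;      // draw the new topic
                    int kNew = K - 1; double acc = 0;
                    for (int k = 0; k < K; k++) { acc += p[k]; if (u <= acc) { kNew = k; break; } }
                    z[d][i] = kNew; ndk[d][kNew]++; nkw[kNew][w]++; nk[kNew]++;
                }
        return z;
    }

    public static void main(String[] args) {
        int[][] docs = {{0, 1, 0, 2}, {0, 1, 1, 2, 0}, {3, 4, 5, 3}, {3, 5, 4, 4}};
        System.out.println(Arrays.deepToString(sample(docs, 2, 6, 50, 1L)));
    }
}
```

The final assignments z, together with the ndk and nkw counts, give exactly the per-document topic mixtures and per-topic word distributions described in the last step above.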
CHAPTER-2
SYSTEM ANALYSIS
2.1. EXISTING SYSTEM:-
We empirically compare the content of Twitter with a traditional news medium, the New York Times, using unsupervised topic modeling. However, content analysis on Twitter has not been well studied, and standard topic models are not well suited to short tweets. To overcome this problem we use a Twitter-LDA model to discover topics from a representative sample of the entire Twitter.
CHAPTER-3
LITERATURE SURVEY
[1] Empirical study of topic modeling in Twitter
In recent years, Twitter has grown into a popular platform for Web users to communicate with each other. Because tweets are
compact and fast, Twitter has become widely used to spread and share breaking
news, personal updates and spontaneous ideas. The popularity of this new form of
social media has also started to attract the attention of researchers. Several recent
studies examined Twitter from different perspectives, including the topological
characteristics of Twitter, tweets as social sensors of real-time events, the forecast of
box-office revenues for movies, etc.
However, the explorations are still in an early stage and our understanding of
Twitter, especially its large textual content, still remains limited. Due to the nature of
microblogging, the large amount of text in Twitter may presumably contain useful
information that can hardly be found in traditional information sources.
To make use of Twitter's textual content for information retrieval tasks such as
search and recommendation, one of the first questions one may ask is what kind of
special or unique information is contained in Twitter. As Twitter is often used to
spread breaking news, a particularly important question is how the information
contained in Twitter differs from what one can obtain from other more traditional
media such as newspapers. Knowing this difference could enable us to better define
retrieval tasks and design retrieval models on Twitter and in general microblogs. To
the best of our knowledge, very few studies have been devoted to content analysis of
Twitter, and none has carried out deep content comparison of Twitter with traditional
news media.
In this study, tweets are compared with articles from a traditional news agency, namely, the New York Times, within the same time span, with the following questions in mind:
(1) Does Twitter cover similar categories and types of topics as traditional news media?
(2) Are there specific topics covered in Twitter but rarely covered in traditional news media, and vice versa?
The empirical comparison reveals the following:
(1) Twitter and traditional news media cover a similar range of topic
categories, but the distributions of different topic categories and types differ between
Twitter and traditional news media.
(2) As expected, Twitter users tweet more on personal life and pop culture
than world events.
(3) Twitter covers more celebrities and brands that may not be covered in
traditional media.
(4) Although Twitter users tweet less on world events, they do actively
retweet (forward) world event topics, which helps spread important news.
Retweets can also be used to indicate trendy topics among Web users to help
search engines refine their results.
Unlike blogs, which may be updated only once every few days, Twitter users write tweets several times in a single day. Because users can see how other users are doing, and often what they are thinking about right now, they repeatedly return to the site to check what other people are doing. The large number of updates results in numerous reports related to events. These include social events such as parties, baseball games, and presidential campaigns, as well as disastrous events such as storms, fires, traffic jams, riots, heavy rainfall, and earthquakes.
CHAPTER-4
SYSTEM REQUIREMENTS
SOFTWARE REQUIREMENTS:-
⮚ Language : JAVA
CHAPTER-5
SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE:-
Figure.2: System Architecture
The system architecture shows how the content analysis process is carried out. First, the user logs in to Twitter with a user id; users post real-time messages about their opinions on a variety of topics, discuss current issues, complain, and express positive sentiment about products they use in daily life. Twitter checks the id of the tweet: if the user is the author, the user is logged in and can post the comment.
Tweet Downloader, which is provided by the SemEval Task Organiser, contains a python script which downloads the tweets given the tweet ids.
Tweet NLP is a twitter-specific tweet tokeniser and tagger. It provides a fast and robust Java-based tokeniser and part-of-speech tagger for Twitter.
Tokenisation: After downloading the tweets using the tweet ids provided in the dataset, we first tokenise the tweets. This tool tokenises the tweet and returns the POS tags of the tweet along with a confidence score. It is important to note that this is a twitter-specific tagger, in the sense that it also tags twitter-specific entries like emoticons, hashtags and mentions.
After obtaining the tokenised and tagged tweet we move to the next step of
preprocessing.
Remove Non-English Tweets: Twitter allows more than 60 languages.
However, this work currently focuses on English tweets only. We remove the
tweets which are non-English in nature.
Replacing Emoticons: Emoticons play an important role in determining the
sentiment of the tweet. Hence we replace the emoticons by their sentiment
polarity by looking up in the Emoticon Dictionary.
Remove URL: The urls present in a tweet are shortened using TinyURL due to the limitation on tweet length. These shortened urls do not carry much information regarding the sentiment of the tweet, so they are removed.
Remove Target: The target mentions in a tweet, made using '@', are usually the twitter handles of people or organizations. This information is also not needed to determine the sentiment of the tweet, hence they are removed.
Replace Negative Mentions: Tweets contain various notions of negation. In general, words ending with 'n't' are expanded with a 'not'. Before we remove the stop words, 'not' is replaced by the word 'negation'. Negation plays a very important role in determining the sentiment of a tweet; this is discussed later in detail.
Hashtags: Hashtags are essentially a summary of the tweet and hence are very critical. In order to capture the relevant information from hashtags, all special characters and punctuation are removed before using them as a feature.
Sequence of Repeated Characters: Twitter provides a platform for users to express their opinion in an informal way, so spell correction is an important step; for example, sequences of a character repeated more than twice (as in 'happyyyy') can be reduced to a canonical form.
Numbers: Numbers are of no use when measuring sentiment. Thus, numbers obtained as tokenised units from the tokeniser are removed in order to refine the tweet content.
Nouns and Prepositions: Given a tweet token, we identify whether it is a noun by looking at the part-of-speech tag given by the tokeniser. If the majority sense (most commonly used sense) of that word is a noun, we discard the word. Noun words don't carry sentiment and thus are of no use in our experiments. The same reasoning goes for prepositions too.
Stop-word Removal: Stop words play a negative role in the task of sentiment classification. Stop words occur in both the positive and negative training sets, thus adding more ambiguity to the model formation. Also, stop words don't carry any sentiment information and thus are of no use to us. We create a list of stop words like he, she, at, on, a, the, etc. and ignore them while scoring the sentiment.
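A minimal sketch of the preprocessing steps above (emoticon replacement, URL and target removal, hashtag stripping, negation expansion, number removal, and repeated-character squeezing) using Java regular expressions. The tiny emoticon dictionary is an illustrative stand-in for a real lookup table, and the POS-based noun filtering and stop-word removal are omitted since they require a tagger and a stop-word list:

```java
import java.util.HashMap;
import java.util.Map;

public class TweetCleaner {
    // Illustrative emoticon dictionary; a real system would use a much larger table.
    static final Map<String, String> EMOTICONS = new HashMap<>();
    static {
        EMOTICONS.put(":)", "positive");
        EMOTICONS.put(":(", "negative");
    }

    static String clean(String tweet) {
        String t = tweet;
        for (Map.Entry<String, String> e : EMOTICONS.entrySet())
            t = t.replace(e.getKey(), e.getValue());        // replace emoticons with polarity
        t = t.replaceAll("https?://\\S+", "");               // remove (shortened) URLs
        t = t.replaceAll("@\\w+", "");                       // remove @target mentions
        t = t.replaceAll("#(\\w+)", "$1");                   // keep hashtag text, drop '#'
        t = t.replaceAll("\\b(\\w+)n't\\b", "$1 negation");  // expand n't into a negation marker
        t = t.replaceAll("\\d+", "");                        // remove numbers
        t = t.replaceAll("(.)\\1{2,}", "$1$1");              // squeeze >2 repeated characters
        return t.trim().replaceAll("\\s+", " ").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(clean("@john I don't like #Mondays :( http://t.co/xyz sooooo bad"));
    }
}
```

For the sample tweet this yields "i do negation like mondays negative soo bad", ready for tokenisation and stop-word removal.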
A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.
GOALS:
CHAPTER-6
SOFTWARE ENVIRONMENT
6.1. JAVA TECHNOLOGY:-
Java technology is both a programming language and a platform.
You can think of Java byte codes as the machine code instructions for the Java
Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool
or a Web browser that can run applets, is an implementation of the Java VM. Java
byte codes help make “write once, run anywhere” possible. You can compile your
program into byte codes on any platform that has a Java compiler. The byte codes
can then be run on any implementation of the Java VM. That means that as long as a
computer has a Java VM, the same program written in the Java programming
language can run on Windows 2000, a Solaris workstation, or on an iMac.
The Java platform is a suite of programs that facilitate developing and running
programs written in the Java programming language. The platform is not specific to
any one processor or operating system, rather an execution engine (called a virtual
machine) and a compiler with a set of libraries are implemented for various hardware
and operating systems so that Java programs can run identically on all of them.
There are multiple platforms, each targeting a different class of devices: Java Card:
A technology that allows small Java based applications (applets) to be run securely
on smart cards and similar small memory devices.
Java SE (Standard Edition): For general purpose use on desktop PCs, servers
and similar devices. Java EE (Enterprise Edition): Java SE plus various APIs useful
for multitier client–server enterprise applications.
The Java platform consists of several programs, each of which provides a portion of
its overall capabilities. For example, the Java compiler, which converts Java source
code into Java byte code (an intermediate language for the JVM), is provided as part
of the Java Development Kit (JDK). The Java Runtime Environment (JRE),
complementing the JVM with a just in time (JIT) compiler, converts intermediate
byte code into native machine code on the fly. An extensive set of libraries is also
part of the Java platform. The essential components in the platform are the Java
language compiler, the libraries, and the runtime environment in which Java
intermediate byte code executes according to the rules laid out in the virtual machine
specification.
The heart of the Java platform is the concept of a "virtual machine" that
executes Java byte code programs. This byte code is the same no matter what
hardware or operating system the program is running under. There is a JIT (Just In
Time) compiler within the Java Virtual Machine, or JVM. The JIT compiler
translates the Java byte code into native processor instructions at runtime and caches
the native code in memory during execution. The use of byte code as an intermediate
language permits Java programs to run on any platform that has a virtual machine
available. The use of a JIT compiler means that Java applications, after a short delay
during loading and once they have "warmed up" by being all or mostly JIT
compiled, tend to run about as fast as native programs. Since JRE version 1.2, Sun's
JVM implementation has included a just in time compiler instead of an interpreter.
Although Java programs are cross platform or platform independent, the code of the
Java Virtual Machines (JVM) that execute these programs is not. Every supported
operating platform has its own JVM.
The NetBeans IDE is primarily intended for development in Java, but also
supports other languages, in particular PHP, C/C++ and HTML5. NetBeans is
cross-platform and runs on Microsoft Windows, Mac OS X, Linux, Solaris and other
platforms supporting a compatible JVM.
Applications can install modules dynamically. Any application can include the
Update Center module to allow users of the application to download digitally signed
upgrades and new features directly into the running application. Reinstalling an
upgrade or a new release does not force users to download the entire application
again.
Window management
The NetBeans Profiler is a tool for the monitoring of Java applications: It helps
developers find memory leaks and optimize speed. Formerly downloaded separately,
it is integrated into the core IDE since version 6.0.
The Profiler is based on a Sun Laboratories research project that was named
JFluid. That research uncovered specific techniques that can be used to lower the
overhead of profiling a Java application. One of those techniques is dynamic
bytecode instrumentation, which is particularly useful for profiling large Java
applications. Using dynamic bytecode instrumentation and additional algorithms, the
NetBeans Profiler is able to obtain runtime information on applications that are too
large or complex for other profilers. NetBeans also supports Profiling Points that let
you profile precise points of execution and measure execution time.
CSS editor features comprise code completion for styles names, quick
navigation through the navigator panel, displaying the CSS rule declaration in a List
View and file structure in a Tree View, sorting the outline view by name, type or
declaration order (List & Tree).
The NetBeans IDE Bundle for Web & Java EE provides complete tools for all
the latest Java EE 6 standards, including the new Java EE 6 Web Profile, Enterprise
Java Beans (EJBs), servlets, Java Persistence API, web services, and annotations.
NetBeans also supports the JSF 2.0 (Facelets), JavaServer Pages (JSP), Hibernate,
Spring, and Struts frameworks, and the Java EE 5 and J2EE 1.4 platforms. It includes
GlassFish and Apache Tomcat. Some of its features relate specifically to Java EE.
The NetBeans IDE Bundle for Java ME is a tool for developing applications
that run on mobile devices; generally mobile phones, but this also includes
entry-level PDAs, and Java Card, among others.
The NetBeans IDE comes bundled with the latest Java ME SDK 3.0 which
supports both CLDC and CDC development. One can easily integrate third-party
emulators for a robust testing environment. You can download other Java platforms,
including the Java Card Platform 3.0, and register them in the IDE.
CHAPTER-7
SYSTEM IMPLEMENTATION
To implement our project, we use the following steps:
First, install NetBeans.
Then open NetBeans, select "Open Project" from the File menu, and select the source code folder. Right-click the project and run it.
The generative process of the Twitter-LDA model is as follows:
1. Draw the background word distribution φ^B ~ Dir(β) and the switch distribution π ~ Dir(γ); for each topic t = 1, …, T:
(a) draw φ^t ~ Dir(β)
2. For each user u = 1, …, U:
(a) draw θ^u ~ Dir(α)
(b) for each tweet s of user u, draw a topic z_{u,s} ~ Multi(θ^u); then for each word n:
i. draw y_{u,s,n} ~ Multi(π)
ii. draw w_{u,s,n} ~ Multi(φ^B) if y_{u,s,n} = 0, or w_{u,s,n} ~ Multi(φ^{z_{u,s}}) if y_{u,s,n} = 1
(2) I like apples, oranges, and avocados. I don’t like the flu or colds.
We’ll let k denote the number of topics that we think these tweets are
generated from. Let’s say there are k=2 topics. Note that there are v=8 words in
our corpus. LDA would tell us that:
And that:
We can conclude that there is a food topic and a health topic, see the words that define those topics, and view the topic composition of each tweet. Each topic in LDA is a probability distribution over the words. In our case, LDA would give k=2 distributions of size v=8, where each item of a distribution corresponds to a word in the vocabulary. For instance, let’s call one of these distributions β1. β1 lets us answer questions such as: given that our topic is Topic #1 (‘Food’), what is the probability of generating word #1 (‘Fruits’)?
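The idea that each topic is a distribution over the vocabulary can be illustrated with a small Java sketch. The vocabulary and the two distributions below are hypothetical stand-ins for values a trained model would learn:

```java
public class TopicWordLookup {
    // Hypothetical learned topic-word distributions for k=2 topics over v=8 words.
    static final String[] VOCAB = {"fruits", "apples", "oranges", "avocados", "like", "flu", "colds", "sick"};
    static final double[][] BETA = {
        {0.30, 0.25, 0.20, 0.15, 0.05, 0.02, 0.02, 0.01},  // topic #1: "food" (sums to 1)
        {0.02, 0.02, 0.02, 0.02, 0.10, 0.30, 0.27, 0.25},  // topic #2: "health" (sums to 1)
    };

    // p(word | topic): one entry of that topic's distribution over the vocabulary.
    static double probability(int topic, String word) {
        for (int w = 0; w < VOCAB.length; w++)
            if (VOCAB[w].equals(word)) return BETA[topic][w];
        return 0.0; // word not in vocabulary
    }

    public static void main(String[] args) {
        System.out.println(probability(0, "fruits")); // p('fruits' | Food) = 0.3
    }
}
```

Answering the question in the text is then a single lookup: the probability of generating ‘fruits’ from the Food topic is the first entry of β1.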
New York Times: For the NYT dataset, because the articles already have category labels, we assign topic t to category q* where q* = arg max_q p(q|t) = arg max_q p(t|q)p(q)/p(t) = arg max_q p(t|q), assuming that all categories are equally important. The larger CE(t) is, the more likely t is a noisy or background topic; we remove such noisy topics from Tnyt, and the remaining topics form the final set of NYT topics we use for our empirical comparison later.
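The assignment rule q* = arg max_q p(t|q) amounts to a simple argmax over per-category probabilities. A sketch, where the category names and p(t|q) values are made up for illustration:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class TopicCategory {
    // Assign topic t to category q* = argmax_q p(t|q), assuming a uniform prior p(q).
    static String assign(Map<String, Double> pTGivenQ) {
        return Collections.max(pTGivenQ.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<String, Double> p = new HashMap<>();  // hypothetical p(t|q) values for one topic
        p.put("Arts", 0.10);
        p.put("Sports", 0.65);
        p.put("World", 0.25);
        System.out.println(assign(p)); // Sports
    }
}
```

Dropping p(q) from the argmax is exactly the "all categories are equally important" assumption in the text: a uniform prior does not change which category maximizes the product.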
Such a comparison can reveal the differences between Twitter and traditional news media and thus help make better use of Twitter as an information source. In this section we use the discovered topics from the two datasets together with their category and type information to perform an empirical comparison between Twitter and NYT.
For Twitter, similarly, we can use the percentage of tweets belonging to each
category as a measure of the strength of that category. With the help of the
Twitter-LDA model, each tweet has been associated with a Twitter topic, and each
Twitter topic is also assigned to a particular category. We also consider an alternative
measure using the number of users interested in a topic category to gauge the strength
of a category. Only users who have written at least five tweets belonging to that topic
category are counted. We plot the distributions of topic categories in the two datasets. The overall ranges of categories are similar, but the relative degrees of presence of different topic categories are quite different between Twitter and NYT.
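The user-based strength measure, counting only users with at least five tweets in a category, can be sketched as follows. The user names and counts are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryStrength {
    // Count users with at least minTweets tweets in a category;
    // tweetsPerUser maps each user to their tweet count in that category.
    static long interestedUsers(Map<String, Integer> tweetsPerUser, int minTweets) {
        return tweetsPerUser.values().stream().filter(c -> c >= minTweets).count();
    }

    public static void main(String[] args) {
        Map<String, Integer> familyLife = new HashMap<>(); // hypothetical counts for one category
        familyLife.put("alice", 12);
        familyLife.put("bob", 3);
        familyLife.put("carol", 5);
        System.out.println(interestedUsers(familyLife, 5)); // 2 users pass the threshold
    }
}
```

The five-tweet threshold filters out users who only mention a category in passing, so the measure reflects sustained interest rather than one-off tweets.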
For example, in Twitter, Family&Life dominates while this category does not
appear in NYT (because it is a new category we added for Twitter topics and
therefore no NYT article is originally labeled with this category). Arts is commonly
strong in both Twitter and NYT. However, Style is a strong category in Twitter but
not so strong in NYT.
By Topic Types: Similarly, we can compare the distributions of different topic types in Twitter and in NYT. An interesting finding is that Twitter clearly has relatively more tweets and users talking about entity-oriented topics than NYT. In contrast, event-oriented topics are not so popular in Twitter, although they have a much stronger presence than entity-oriented topics in NYT. We suspect that many entity-oriented topics are about celebrities and brands, and these tend to attract Web users’ attention. To verify this, we inspected the entity-oriented topics in Twitter and found that, of the 19 entity-oriented topics in Twitter, 10 are indeed on celebrities and the other 9 are on brands and big companies. Note that long-standing topics are always dominating. It may be surprising to see this for NYT, but it is partly because, with the LDA model, each news article is assumed to have a mixture of topics. So even if a news article is mainly about an event, it may still have some fraction contributing to long-standing topics.
TwitterLDAmain.java:-
TwitterLDAmain.java takes the input text from the data folder and tokenizes the tweets. It then identifies the unique words, background words, the word map, and the words in topics, and writes the final analysed topics to the text label folder.
package TwitterLDA;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import Common.Stopwords;
import Common.FileUtil;
import Common.JC;
FileUtil.mkdir(new File(outputDir));
FileUtil.readLines(filelist, files);
modelSettings.clear();
new Stopwords();
Stopwords.addStopfile(stopfile);
int outputTopicwordCnt = 30;
String outputBackgroundWordsDistribution = outputDir + "BackgroundWordsDistribution.txt";
if (!new File(outputTextWithLabel).exists())
FileUtil.mkdir(new File(outputTextWithLabel));
HashMap<String, Integer> wordMap = new HashMap<String, Integer>();
users.add(tweetuser);
// ComUtil.printHash(wordMap);
if (uniWordMap.size() != wordMap.size()) {
System.out.println(wordMap.size());
System.out.println(uniWordMap.size());
System.err.println("unique word map size does not match hashmap size!");
System.exit(0);
}
// output wordMap and itemMap
wordMap.clear();
uniWordMap.clear();
// uniItemMap.clear();
// model.fake_initialize(users);
model.estimate(users, nIter);
try {
model.outputTextWithLabel(outputTextWithLabel, users,
uniWordMap);
} catch (Exception e) {
    e.printStackTrace();
}
try {
model.outputWordsInTopics(outputWordsInTopics,
uniWordMap,
outputTopicwordCnt);
} catch (Exception e1) {
    e1.printStackTrace();
}
try {
model.outputBackgroundWordsDistribution(
outputBackgroundWordsDistribution,
uniWordMap,
outputBackgroundwordCnt);
} catch (Exception e1) {
    e1.printStackTrace();
}
System.out.println("Final Done");
modelSettings.clear();
modelSettings.add("40");
modelSettings.add("1.25");
modelSettings.add("0.01");
modelSettings.add("0.01");
modelSettings.add("20");
modelSettings.add("20");
FileUtil.readLines(modelParas, inputlines);
String para = inputlines.get(i).substring(0, index).trim().toLowerCase();
String value = inputlines.get(i).substring(index + 1, inputlines.get(i).length()).trim().toLowerCase();
switch (ModelParas.valueOf(para)) {
case topics:
modelSettings.set(0, value);
break;
case alpha_g:
modelSettings.set(1, value);
break;
case beta_word:
modelSettings.set(2, value);
break;
case beta_b:
modelSettings.set(3, value);
break;
case gamma:
modelSettings.set(4, value);
break;
case iteration:
modelSettings.set(5, value);
break;
default:
break; }
}
CHAPTER-8
SYSTEM TESTING
Testing is a process which reveals errors in a program. It is the major quality measure employed during software development: during testing, the program is executed with a set of test cases, and the output of the program for each test case is evaluated to determine whether the program is performing as expected.
TEST RESULTS:-
TEST CASE: 1
Result: Fails.
Output:
TEST CASE: 2
Result: Pass
Output:
TEST CASE: 3
Result: fails
Output:
TEST CASE: 4
Result: pass
Output:
CHAPTER-9
SCREEN SHOTS
Screen shot: 1
We open NetBeans and load our project into it. After adding the library files, we run the TwitterLDAmain.java file. It then runs 100 iterations; after the iterations complete, it identifies the background words and unique words, and the analysis of each particular word is done.
Screen shot: 2
Screen shot: 3
The data model folder consists of text files. Each text file contains the tweets and retweets of a particular topic.
Screen shot: 4
We tokenize each topic, and the tokenized words are placed in the text Tokenization folder.
Screen shot: 5
The above screenshot shows the result of identifying all the background words in each topic. The recorded background words are placed in the background word distribution text file.
Screen shot: 6
The above screenshot shows the topic distribution of words in each text.
Screen shot: 7
The above screenshot shows the topic count values of words in each text.
Screen shot: 8
The above screenshot shows the unique words from all the topics.
Screen shot: 9
This screenshot shows the word map built from the tweets we took as input; for each word in those tweets, the map records how often it appears, as shown above.
Screen shot: 10
CHAPTER-10
CONCLUSION & FUTURE SCOPE
In this project we empirically compared the content of Twitter with a typical
traditional news medium, New York Times, focusing on the differences between
these two. We developed a new Twitter-LDA model that is designed for short tweets
and showed its effectiveness compared with existing models. We introduced the
concepts of topic categories and topic types to facilitate our analysis of the topical
differences between Twitter and traditional news media. Our empirical comparison
confirmed some previous observations and also revealed some new findings.
In particular, we find that Twitter can be a good source of entity oriented topics
that have low coverage in traditional news media. In the future, we will study how to
summarize and visualize Twitter content in a systematic way. Our method of
associating tweets with different categories and types may also help visualization of
Twitter content.
In the future, this project can be extended to a comparison of LSA and LDA; LDA is more powerful than the existing LSA technique for segmentation of a Twitter timeline.