CHAPTER-1
INTRODUCTION
Microblogging websites have evolved into a source of varied kinds of information. This is due to the nature of microblogs, which allow users to communicate with each other easily. On Twitter, users post real-time messages about their opinions on a variety of topics, discuss current issues, complain, and express positive sentiment about products they use in daily life. In fact, companies manufacturing such products have started to poll these microblogs to get a sense of the general sentiment towards their products. Many of these companies study user reactions and reply to users on microblogs. In particular, it is not clear whether, as an information source, Twitter can simply be regarded as a faster news feed that covers mostly the same information as traditional news media.
Data mining software is one of a number of analytical tools for analyzing data. It
allows users to analyze data from many different dimensions or angles, categorize it,
and summarize the relationships identified. Technically, data mining is the process of
finding correlations or patterns among dozens of fields in large relational databases.
1) Extract, transform, and load transaction data onto the data warehouse system.
5) Fast-paced and prompt access to data, along with economical processing techniques, has made data mining one of the most suitable services a company can seek.
Dept. of CSE, P V P Siddhartha institute of technology
5
1. Marketing / Retail:
Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing campaigns, etc. With these results, marketers can take a more targeted approach to selling profitable products to the right customers.
Data mining brings similar benefits to retail companies. Through market basket analysis, a store can arrange its products so that items frequently bought together are placed together, making shopping more pleasant. In addition, it helps retail companies offer discounts on particular products that will attract more customers.
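At its simplest, market basket analysis counts how often pairs of items appear together in transactions; pairs with high counts suggest products to shelve or discount together. A minimal sketch, with hypothetical transaction data and a made-up class name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class BasketPairs {
    // Count how often each pair of items appears together in a transaction.
    static Map<String, Integer> pairCounts(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> basket : transactions) {
            // Deduplicate and sort so each pair gets one canonical key.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++)
                for (int j = i + 1; j < items.size(); j++)
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> txns = Arrays.asList(
            Arrays.asList("bread", "milk"),
            Arrays.asList("bread", "milk", "eggs"),
            Arrays.asList("milk", "eggs"));
        System.out.println(pairCounts(txns).get("bread+milk")); // bought together in 2 baskets
    }
}
```

A real system would normalize these counts into support and confidence before recommending shelf placements or discounts.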
2. Finance / Banking:
Data mining gives financial institutions information about loans and credit reporting. By building a model from historical customer data, a bank or financial institution can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect credit card owners.
3. Manufacturing:
4. Governments:
5. Law enforcement:
Data mining can aid law enforcers in identifying criminal suspects as well as
apprehending these criminals by examining trends in location, crime type, habit, and
other patterns of behaviours.
6. Researchers:
Data mining can assist researchers by speeding up their data analysis, thus allowing them more time to work on other projects.
What is latent Dirichlet allocation? It is a way of automatically discovering the topics that a collection of sentences contains. For example, given a small collection of sentences about food and about cute animals and asked for 2 topics, LDA might produce something like
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point you could interpret topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point you could interpret topic B to be about cute animals)
In more detail, LDA represents documents as mixtures of topics that spit out
words with certain probabilities. It assumes that documents are produced in the
following fashion: when writing each document, you:
Decide on the number of words N the document will have (say, according to a
Poisson distribution).
Choose a topic mixture for the document (according to a Dirichlet distribution
over a fixed set of K topics). For example, assuming that we have the two food
and cute animal topics above, you might choose the document to consist of 1/3
food and 2/3 cute animals.
Generate each word w_i in the document by:
First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
Then using that topic to generate the word itself (according to the topic’s multinomial distribution; for example, the food topic might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on).
Assuming this generative model for a collection of documents, LDA then tries
to backtrack from the documents to find a set of topics that are likely to have
generated the collection.
Example:
Let’s make an example. According to the above process, when generating some particular document D, you might:
Decide that D will be 1/2 about food and 1/2 about cute animals.
Pick the first word to come from the food topic, which then gives you the
word “broccoli”.
Pick the second word to come from the cute animals topic, which gives you
“panda”.
Pick the third word to come from the cute animals topic, giving you
“adorable”.
Pick the fourth word to come from the food topic, giving you “cherries”.
Pick the fifth word to come from the food topic, giving you “eating”.
So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model, so word order does not matter).
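The generative story above can be simulated in a few lines of Java. The seed, the two toy topic vocabularies, and the uniform within-topic word choice are simplifying assumptions for illustration (real LDA draws words from a learned multinomial):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LdaGenerate {
    static final String[] FOOD = {"broccoli", "bananas", "cherries", "eating"};
    static final String[] ANIMALS = {"panda", "kitten", "adorable", "cute"};

    // Sample an n-word document from a 50/50 mixture of the two toy topics.
    static List<String> generate(long seed, int n) {
        Random rng = new Random(seed);
        List<String> words = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Step 1: pick a topic from the document's topic mixture.
            String[] topic = rng.nextDouble() < 0.5 ? FOOD : ANIMALS;
            // Step 2: pick a word from that topic (uniform here for simplicity).
            words.add(topic[rng.nextInt(topic.length)]);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", generate(7L, 5)));
    }
}
```

Running this prints a five-word bag-of-words document drawn from the two topics, much like the “broccoli panda …” example above.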
Learning:
So now suppose you have a set of documents. You’ve chosen some fixed
number of K topics to discover, and want to use LDA to learn the topic
representation of each document and the words associated to each topic. How do you
do this? One way (known as collapsed Gibbs sampling) is the following:
Go through each document, and randomly assign each word in the document to
one of the K topics.
Notice that this random assignment already gives you both topic representations
of all the documents and word distributions of all the topics (albeit not very good
ones).
So to improve on them, for each document d, go through each word w in d, and for each topic t compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w. Then reassign w a new topic, choosing topic t with probability p(topic t | document d) * p(word w | topic t); according to our generative model, this is essentially the probability that topic t generated word w.
After repeating the previous step a large number of times, you’ll eventually
reach a roughly steady state where your assignments are pretty good. So use
these assignments to estimate the topic mixtures of each document and the
words associated to each topic (by counting the proportion of words assigned
to each topic overall).
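The resampling loop described above can be sketched as a collapsed Gibbs sampler over a toy corpus of integer word ids. The corpus, the hyperparameters alpha and beta, and the class name are illustrative assumptions, not the project’s actual implementation:

```java
import java.util.Arrays;
import java.util.Random;

public class GibbsLda {
    // Collapsed Gibbs sampling for LDA over a toy corpus.
    // docs[d][i] is the vocabulary id of the i-th word of document d.
    static int[][] sample(int[][] docs, int K, int V, int iters, long seed) {
        double alpha = 0.1, beta = 0.01;            // illustrative hyperparameters
        Random rng = new Random(seed);
        int[][] z = new int[docs.length][];         // topic assignment of every word
        int[][] ndk = new int[docs.length][K];      // words in doc d assigned to topic k
        int[][] nkw = new int[K][V];                // times word w is assigned to topic k
        int[] nk = new int[K];                      // total words assigned to topic k
        for (int d = 0; d < docs.length; d++) {     // step 1: random initial assignment
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                int k = rng.nextInt(K);
                z[d][i] = k; ndk[d][k]++; nkw[k][docs[d][i]]++; nk[k]++;
            }
        }
        double[] p = new double[K];
        for (int it = 0; it < iters; it++)          // step 2: resample each word's topic
            for (int d = 0; d < docs.length; d++)
                for (int i = 0; i < docs[d].length; i++) {
                    int w = docs[d][i], old = z[d][i];
                    ndk[d][old]--; nkw[old][w]--; nk[old]--;  // remove current assignment
                    double total = 0;
                    for (int k = 0; k < K; k++) {   // p(topic|doc) * p(word|topic)
                        p[k] = (ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta);
                        total += p[k];
                    }
                    double u = rng.nextDouble() * total;      // draw the new topic
                    int kNew = K - 1; double acc = 0;
                    for (int k = 0; k < K; k++) { acc += p[k]; if (u <= acc) { kNew = k; break; } }
                    z[d][i] = kNew; ndk[d][kNew]++; nkw[kNew][w]++; nk[kNew]++;
                }
        return z;
    }

    public static void main(String[] args) {
        int[][] docs = {{0, 1, 0, 2}, {0, 1, 1, 2, 0}, {3, 4, 5, 3}, {3, 5, 4, 4}};
        System.out.println(Arrays.deepToString(sample(docs, 2, 6, 50, 1L)));
    }
}
```

The final assignments z, together with the ndk and nkw counts, give exactly the per-document topic mixtures and per-topic word distributions described in the last step above.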
CHAPTER-2
SYSTEM ANALYSIS
2.1. EXISTING SYSTEM:-
We empirically compare the content of Twitter with a traditional news medium, the New York Times, using unsupervised topic modeling. However, content analysis on Twitter has not been well studied, and standard topic models are not well suited to short tweets. To overcome this problem we use a Twitter-LDA model to discover topics from a representative sample of the entire Twitter.
CHAPTER-3
LITERATURE SURVEY
[1] Empirical study of topic modeling in Twitter
In recent years, Twitter has grown into a popular platform for Web users to communicate with each other. Because tweets are
compact and fast, Twitter has become widely used to spread and share breaking
news, personal updates and spontaneous ideas. The popularity of this new form of
social media has also started to attract the attention of researchers. Several recent
studies examined Twitter from different perspectives, including the topological
characteristics of Twitter, tweets as social sensors of real-time events, the forecast of
box-office revenues for movies, etc.
However, the explorations are still in an early stage and our understanding of
Twitter, especially its large textual content, still remains limited. Due to the nature of
microblogging, the large amount of text in Twitter may presumably contain useful
information that can hardly be found in traditional information sources.
To make use of Twitter's textual content for information retrieval tasks such as
search and recommendation, one of the first questions one may ask is what kind of
special or unique information is contained in Twitter. As Twitter is often used to
spread breaking news, a particularly important question is how the information
contained in Twitter differs from what one can obtain from other more traditional
media such as newspapers. Knowing this difference could enable us to better define
retrieval tasks and design retrieval models on Twitter and in general microblogs. To
the best of our knowledge, very few studies have been devoted to content analysis of
Twitter, and none has carried out deep content comparison of Twitter with traditional
news media.
In this study, tweets are compared with articles from a traditional news agency, namely, the New York Times, within the same time span, with the following questions in mind:
(1) Does Twitter cover similar categories and types of topics as traditional news media?
(2) Are there specific topics covered in Twitter but rarely covered in traditional news media, and vice versa?
The empirical comparison reveals the following:
(1) Twitter and traditional news media cover a similar range of topic
categories, but the distributions of different topic categories and types differ between
Twitter and traditional news media.
(2) As expected, Twitter users tweet more on personal life and pop culture
than world events.
(3) Twitter covers more celebrities and brands that may not be covered in
traditional media.
(4) Although Twitter users tweet less on world events, they do actively
retweet (forward) world event topics, which helps spread important news.
Retweets can also be used to indicate trendy topics among Web users to help
search engines refine their results.
Unlike blogs, which may be updated only once every few days, Twitter users write tweets several times in a single day. Because users can see how other users are doing, and often what they are thinking about right now, they repeatedly return to the site to check what other people are doing. The large number of updates results in numerous reports related to events. These include social events such as parties, baseball games, and presidential campaigns, as well as disastrous events such as storms, fires, traffic jams, riots, heavy rainfall, and earthquakes.
CHAPTER-4
SYSTEM REQUIREMENTS
SOFTWARE REQUIREMENTS:-
⮚ Language : JAVA
CHAPTER-5
SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE:-
Figure.2: System Architecture
The system architecture shows how the content analysis process is carried out. First, the user logs in to Twitter with a user id; users post real-time messages about their opinions on a variety of topics, discuss current issues, complain, and express positive sentiment about products they use in daily life. Twitter checks the id of the tweet: if the user is the author, the user is logged in and can post the comment.
Tweet Downloader, which is provided by the SemEval Task Organiser, contains a python script which downloads the tweets given the tweet ids.
Tweet NLP is a twitter-specific tweet tokeniser and tagger. It provides a fast and robust Java-based tokeniser and part-of-speech tagger for Twitter.
Tokenisation: After downloading the tweets using the tweet ids provided in the dataset, we first tokenise the tweets. This tool tokenises the tweet and returns the POS tags of the tweet along with a confidence score. It is important to note that this is a twitter-specific tagger, in the sense that it also tags twitter-specific entries like emoticons, hashtags and mentions.
After obtaining the tokenised and tagged tweet we move to the next step of
preprocessing.
Remove Non-English Tweets: Twitter allows more than 60 languages.
However, this work currently focuses on English tweets only. We remove the
tweets which are non-English in nature.
Replacing Emoticons: Emoticons play an important role in determining the
sentiment of the tweet. Hence we replace the emoticons by their sentiment
polarity by looking up in the Emoticon Dictionary.
Remove URL: The urls present in a tweet are shortened using TinyURL due to the limitation on tweet length. These shortened urls do not carry much information regarding the sentiment of the tweet, so they are removed.
Remove Target: The target mentions in a tweet, made using '@', are usually the twitter handles of people or organizations. This information is also not needed to determine the sentiment of the tweet, hence they are removed.
Replace Negative Mentions: Tweets contain various notions of negation. In general, words ending with 'n't' are expanded with a 'not'. Before we remove the stop words, 'not' is replaced by the word 'negation'. Negation plays a very important role in determining the sentiment of a tweet; this is discussed later in detail.
Hashtags: Hashtags are essentially a summary of the tweet and hence are very critical. In order to capture the relevant information from hashtags, all special characters and punctuation are removed before using them as a feature.
Sequence of Repeated Characters: Twitter provides a platform for users to express their opinion in an informal way, so spell correction is an important step; for example, sequences of a character repeated more than twice (as in 'happyyyy') can be reduced to a canonical form.
Numbers: Numbers are of no use when measuring sentiment. Thus, numbers obtained as tokenised units from the tokeniser are removed in order to refine the tweet content.
Nouns and Prepositions: Given a tweet token, we identify whether it is a noun by looking at the part-of-speech tag given by the tokeniser. If the majority sense (most commonly used sense) of that word is a noun, we discard the word. Noun words don't carry sentiment and thus are of no use in our experiments. The same reasoning goes for prepositions too.
Stop-word Removal: Stop words play a negative role in the task of sentiment classification. Stop words occur in both the positive and negative training sets, thus adding more ambiguity to the model formation. Also, stop words don't carry any sentiment information and thus are of no use to us. We create a list of stop words like he, she, at, on, a, the, etc. and ignore them while scoring the sentiment.
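A minimal sketch of the preprocessing steps above (emoticon replacement, URL and target removal, hashtag stripping, negation expansion, number removal, and repeated-character squeezing) using Java regular expressions. The tiny emoticon dictionary is an illustrative stand-in for a real lookup table, and the POS-based noun filtering and stop-word removal are omitted since they require a tagger and a stop-word list:

```java
import java.util.HashMap;
import java.util.Map;

public class TweetCleaner {
    // Illustrative emoticon dictionary; a real system would use a much larger table.
    static final Map<String, String> EMOTICONS = new HashMap<>();
    static {
        EMOTICONS.put(":)", "positive");
        EMOTICONS.put(":(", "negative");
    }

    static String clean(String tweet) {
        String t = tweet;
        for (Map.Entry<String, String> e : EMOTICONS.entrySet())
            t = t.replace(e.getKey(), e.getValue());        // replace emoticons with polarity
        t = t.replaceAll("https?://\\S+", "");               // remove (shortened) URLs
        t = t.replaceAll("@\\w+", "");                       // remove @target mentions
        t = t.replaceAll("#(\\w+)", "$1");                   // keep hashtag text, drop '#'
        t = t.replaceAll("\\b(\\w+)n't\\b", "$1 negation");  // expand n't into a negation marker
        t = t.replaceAll("\\d+", "");                        // remove numbers
        t = t.replaceAll("(.)\\1{2,}", "$1$1");              // squeeze >2 repeated characters
        return t.trim().replaceAll("\\s+", " ").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(clean("@john I don't like #Mondays :( http://t.co/xyz sooooo bad"));
    }
}
```

For the sample tweet this yields "i do negation like mondays negative soo bad", ready for tokenisation and stop-word removal.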
A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.
GOALS:
CHAPTER-6
SOFTWARE ENVIRONMENT
6.1. JAVA TECHNOLOGY:-
Java technology is both a programming language and a platform.
You can think of Java byte codes as the machine code instructions for the Java
Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool
or a Web browser that can run applets, is an implementation of the Java VM. Java
byte codes help make “write once, run anywhere” possible. You can compile your
program into byte codes on any platform that has a Java compiler. The byte codes
can then be run on any implementation of the Java VM. That means that as long as a
computer has a Java VM, the same program written in the Java programming
language can run on Windows 2000, a Solaris workstation, or on an iMac.
The Java platform is a suite of programs that facilitate developing and running
programs written in the Java programming language. The platform is not specific to
any one processor or operating system, rather an execution engine (called a virtual
machine) and a compiler with a set of libraries are implemented for various hardware
and operating systems so that Java programs can run identically on all of them.
There are multiple platforms, each targeting a different class of devices: Java Card:
A technology that allows small Java based applications (applets) to be run securely
on smart cards and similar small memory devices.
Java SE (Standard Edition): For general purpose use on desktop PCs, servers
and similar devices. Java EE (Enterprise Edition): Java SE plus various APIs useful
for multitier client–server enterprise applications.
The Java platform consists of several programs, each of which provides a portion of
its overall capabilities. For example, the Java compiler, which converts Java source
code into Java byte code (an intermediate language for the JVM), is provided as part
of the Java Development Kit (JDK). The Java Runtime Environment (JRE),
complementing the JVM with a just in time (JIT) compiler, converts intermediate
byte code into native machine code on the fly. An extensive set of libraries is also
part of the Java platform. The essential components in the platform are the Java
language compiler, the libraries, and the runtime environment in which Java
intermediate byte code executes according to the rules laid out in the virtual machine
specification.
The heart of the Java platform is the concept of a "virtual machine" that
executes Java byte code programs. This byte code is the same no matter what
hardware or operating system the program is running under. There is a JIT (Just In
Time) compiler within the Java Virtual Machine, or JVM. The JIT compiler
translates the Java byte code into native processor instructions at runtime and caches
the native code in memory during execution. The use of byte code as an intermediate
language permits Java programs to run on any platform that has a virtual machine
available. The use of a JIT compiler means that Java applications, after a short delay
during loading and once they have "warmed up" by being all or mostly JIT
compiled, tend to run about as fast as native programs. Since JRE version 1.2, Sun's
JVM implementation has included a just in time compiler instead of an interpreter.
Although Java programs are cross platform or platform independent, the code of the
Java Virtual Machines (JVM) that execute these programs is not. Every supported
operating platform has its own JVM.
The NetBeans IDE is primarily intended for development in Java, but also
supports other languages, in particular PHP, C/C++ and HTML5. NetBeans is
cross-platform and runs on Microsoft Windows, Mac OS X, Linux, Solaris and other
platforms supporting a compatible JVM.
Applications can install modules dynamically. Any application can include the
Update Center module to allow users of the application to download digitally signed
upgrades and new features directly into the running application. Reinstalling an
upgrade or a new release does not force users to download the entire application
again.
Window management
The NetBeans Profiler is a tool for the monitoring of Java applications: It helps
developers find memory leaks and optimize speed. Formerly downloaded separately,
it is integrated into the core IDE since version 6.0.
The Profiler is based on a Sun Laboratories research project that was named
JFluid. That research uncovered specific techniques that can be used to lower the
overhead of profiling a Java application. One of those techniques is dynamic
bytecode instrumentation, which is particularly useful for profiling large Java
applications. Using dynamic bytecode instrumentation and additional algorithms, the
NetBeans Profiler is able to obtain runtime information on applications that are too
large or complex for other profilers. NetBeans also supports Profiling Points that let
you profile precise points of execution and measure execution time.
CSS editor features comprise code completion for styles names, quick
navigation through the navigator panel, displaying the CSS rule declaration in a List
View and file structure in a Tree View, sorting the outline view by name, type or
declaration order (List & Tree).
The NetBeans IDE Bundle for Web & Java EE provides complete tools for all
the latest Java EE 6 standards, including the new Java EE 6 Web Profile, Enterprise
Java Beans (EJBs), servlets, Java Persistence API, web services, and annotations.
NetBeans also supports the JSF 2.0 (Facelets), JavaServer Pages (JSP), Hibernate,
Spring, and Struts frameworks, and the Java EE 5 and J2EE 1.4 platforms. It includes
GlassFish and Apache Tomcat. Some of its features relate specifically to Java EE.
The NetBeans IDE Bundle for Java ME is a tool for developing applications
that run on mobile devices; generally mobile phones, but this also includes
entry-level PDAs, and Java Card, among others.
The NetBeans IDE comes bundled with the latest Java ME SDK 3.0 which
supports both CLDC and CDC development. One can easily integrate third-party
emulators for a robust testing environment. You can download other Java platforms,
including the Java Card Platform 3.0, and register them in the IDE.
CHAPTER-7
SYSTEM IMPLEMENTATION
To implement our project, we use the following steps:
First, install NetBeans.
Then open NetBeans, select "Open Project" from the File menu, and select the source code folder. Right-click the project and run it.
The generative process of the Twitter-LDA model is as follows:
1. Draw the background word distribution φ^B ~ Dir(β) and the switch distribution π ~ Dir(γ); for each topic t = 1, …, T:
(a) draw φ^t ~ Dir(β)
2. For each user u = 1, …, U:
(a) draw θ^u ~ Dir(α)
(b) for each tweet s of user u, draw a topic z_{u,s} ~ Multi(θ^u); then for each word n:
i. draw y_{u,s,n} ~ Multi(π)
ii. draw w_{u,s,n} ~ Multi(φ^B) if y_{u,s,n} = 0, or w_{u,s,n} ~ Multi(φ^{z_{u,s}}) if y_{u,s,n} = 1
(2) I like apples, oranges, and avocados. I don’t like the flu or colds.
We’ll let k denote the number of topics that we think these tweets are
generated from. Let’s say there are k=2 topics. Note that there are v=8 words in
our corpus. LDA would tell us that:
And that:
We can conclude that there is a food topic and a health topic, see the words that define those topics, and view the topic composition of each tweet. Each topic in LDA is a probability distribution over the words. In our case, LDA would give k=2 distributions of size v=8, where each item of a distribution corresponds to a word in the vocabulary. For instance, let’s call one of these distributions β1. β1 lets us answer questions such as: given that our topic is Topic #1 (‘Food’), what is the probability of generating word #1 (‘Fruits’)?
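The idea that each topic is a distribution over the vocabulary can be illustrated with a small Java sketch. The vocabulary and the two distributions below are hypothetical stand-ins for values a trained model would learn:

```java
public class TopicWordLookup {
    // Hypothetical learned topic-word distributions for k=2 topics over v=8 words.
    static final String[] VOCAB = {"fruits", "apples", "oranges", "avocados", "like", "flu", "colds", "sick"};
    static final double[][] BETA = {
        {0.30, 0.25, 0.20, 0.15, 0.05, 0.02, 0.02, 0.01},  // topic #1: "food" (sums to 1)
        {0.02, 0.02, 0.02, 0.02, 0.10, 0.30, 0.27, 0.25},  // topic #2: "health" (sums to 1)
    };

    // p(word | topic): one entry of that topic's distribution over the vocabulary.
    static double probability(int topic, String word) {
        for (int w = 0; w < VOCAB.length; w++)
            if (VOCAB[w].equals(word)) return BETA[topic][w];
        return 0.0; // word not in vocabulary
    }

    public static void main(String[] args) {
        System.out.println(probability(0, "fruits")); // p('fruits' | Food) = 0.3
    }
}
```

Answering the question in the text is then a single lookup: the probability of generating ‘fruits’ from the Food topic is the first entry of β1.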
New York Times: For the NYT dataset, because the articles already have category labels, we assign topic t to category q* where q* = arg max_q p(q|t) = arg max_q p(t|q)p(q)/p(t) = arg max_q p(t|q), assuming that all categories are equally important. The larger CE(t) is, the more likely t is a noisy or background topic; we remove such noisy topics from Tnyt, and the remaining topics form the final set of NYT topics we use for our empirical comparison later.
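The assignment rule q* = arg max_q p(t|q) amounts to a simple argmax over per-category probabilities. A sketch, where the category names and p(t|q) values are made up for illustration:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class TopicCategory {
    // Assign topic t to category q* = argmax_q p(t|q), assuming a uniform prior p(q).
    static String assign(Map<String, Double> pTGivenQ) {
        return Collections.max(pTGivenQ.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<String, Double> p = new HashMap<>();  // hypothetical p(t|q) values for one topic
        p.put("Arts", 0.10);
        p.put("Sports", 0.65);
        p.put("World", 0.25);
        System.out.println(assign(p)); // Sports
    }
}
```

Dropping p(q) from the argmax is exactly the "all categories are equally important" assumption in the text: a uniform prior does not change which category maximizes the product.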
Such a comparison can reveal the differences between Twitter and traditional news media and thus help make better use of Twitter as an information source. In this section we use the discovered topics from the two datasets together with their category and type information to perform an empirical comparison between Twitter and NYT.
For Twitter, similarly, we can use the percentage of tweets belonging to each
category as a measure of the strength of that category. With the help of the
Twitter-LDA model, each tweet has been associated with a Twitter topic, and each
Twitter topic is also assigned to a particular category. We also consider an alternative
measure using the number of users interested in a topic category to gauge the strength
of a category. Only users who have written at least five tweets belonging to that topic
category are counted. We plot the distributions of topic categories in the two datasets. The overall ranges of categories are similar, but the relative degrees of presence of different topic categories are quite different between Twitter and NYT.
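The user-based strength measure, counting only users with at least five tweets in a category, can be sketched as follows. The user names and counts are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryStrength {
    // Count users with at least minTweets tweets in a category;
    // tweetsPerUser maps each user to their tweet count in that category.
    static long interestedUsers(Map<String, Integer> tweetsPerUser, int minTweets) {
        return tweetsPerUser.values().stream().filter(c -> c >= minTweets).count();
    }

    public static void main(String[] args) {
        Map<String, Integer> familyLife = new HashMap<>(); // hypothetical counts for one category
        familyLife.put("alice", 12);
        familyLife.put("bob", 3);
        familyLife.put("carol", 5);
        System.out.println(interestedUsers(familyLife, 5)); // 2 users pass the threshold
    }
}
```

The five-tweet threshold filters out users who only mention a category in passing, so the measure reflects sustained interest rather than one-off tweets.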
For example, in Twitter, Family&Life dominates while this category does not
appear in NYT (because it is a new category we added for Twitter topics and
therefore no NYT article is originally labeled with this category). Arts is commonly
strong in both Twitter and NYT. However, Style is a strong category in Twitter but
not so strong in NYT.
By Topic Types: Similarly, we can compare the distributions of different topic types in Twitter and in NYT. An interesting finding is that Twitter clearly has relatively more tweets and users talking about entity-oriented topics than NYT. In contrast, event-oriented topics are not so popular in Twitter, although they have a much stronger presence than entity-oriented topics in NYT. We suspect that many entity-oriented topics are about celebrities and brands, and these tend to attract Web users’ attention. To verify this, we inspected the entity-oriented topics in Twitter and found that, of the 19 entity-oriented topics in Twitter, 10 are indeed on celebrities and the other 9 are on brands and big companies. Note that long-standing topics are always dominating. It may be surprising to see this for NYT, but it is partly because, with the LDA model, each news article is assumed to have a mixture of topics. So even if a news article is mainly about an event, it may still have some fraction contributing to long-standing topics.
TwitterLDAmain.java:-
TwitterLDAmain.java takes the input text from the data folder and tokenizes the tweets. It then identifies the unique words, background words, the word map, and the words in topics, and writes the final analysed topics to the text label folder.
package TwitterLDA;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import Common.Stopwords;
import Common.FileUtil;
import Common.JC;
FileUtil.mkdir(new File(outputDir));
FileUtil.readLines(filelist, files);
modelSettings.clear();
new Stopwords();
Stopwords.addStopfile(stopfile);
int outputTopicwordCnt = 30;
String outputBackgroundWordsDistribution = outputDir + "BackgroundWordsDistribution.txt";
if (!new File(outputTextWithLabel).exists())
FileUtil.mkdir(new File(outputTextWithLabel));
HashMap<String, Integer> wordMap = new HashMap<String, Integer>();
users.add(tweetuser);
// ComUtil.printHash(wordMap);
if (uniWordMap.size() != wordMap.size()) {
System.out.println(wordMap.size());
System.out.println(uniWordMap.size());
System.err.println("unique word map size does not match hashmap size!");
System.exit(0);
}
// output wordMap and itemMap
wordMap.clear();
uniWordMap.clear();
// uniItemMap.clear();
// model.fake_initialize(users);
model.estimate(users, nIter);
try {
model.outputTextWithLabel(outputTextWithLabel, users,
uniWordMap);
} catch (Exception e) {
    e.printStackTrace();
}
try {
model.outputWordsInTopics(outputWordsInTopics,
uniWordMap,
outputTopicwordCnt);
} catch (Exception e1) {
    e1.printStackTrace();
}
try {
model.outputBackgroundWordsDistribution(
outputBackgroundWordsDistribution,
uniWordMap,
outputBackgroundwordCnt);
} catch (Exception e1) {
    e1.printStackTrace();
}
System.out.println("Final Done");
modelSettings.clear();
modelSettings.add("40");
modelSettings.add("1.25");
modelSettings.add("0.01");
modelSettings.add("0.01");
modelSettings.add("20");
modelSettings.add("20");
FileUtil.readLines(modelParas, inputlines);
String para = inputlines.get(i).substring(0, index).trim().toLowerCase();
String value = inputlines.get(i).substring(index + 1, inputlines.get(i).length()).trim().toLowerCase();
switch (ModelParas.valueOf(para)) {
case topics:
modelSettings.set(0, value);
break;
case alpha_g:
modelSettings.set(1, value);
break;
case beta_word:
modelSettings.set(2, value);
break;
case beta_b:
modelSettings.set(3, value);
break;
case gamma:
modelSettings.set(4, value);
break;
case iteration:
modelSettings.set(5, value);
break;
default:
break; }
}
CHAPTER-8
SYSTEM TESTING
Testing is a process which reveals errors in a program. It is the major quality measure employed during software development: during testing, the program is executed with a set of test cases, and the output of the program for each test case is evaluated to determine whether the program is performing as expected.
TEST RESULTS:-
TEST CASE: 1
Result: Fails.
Output:
TEST CASE: 2
Result: Pass
Output:
TEST CASE: 3
Result: fails
Output:
TEST CASE: 4
Result: pass
Output:
CHAPTER-9
SCREEN SHOTS
Screen shot: 1
We open NetBeans and load our project into it. After adding the library files, we run the TwitterLDAmain.java file. It then runs 100 iterations; after the iterations complete, it identifies the background words and unique words, and the analysis of each particular word is done.
Screen shot: 2
Screen shot: 3
The data model folder consists of text files. Each text file contains the tweets and retweets of a particular topic.
Screen shot: 4
We tokenize each topic, and the tokenized words are placed in the text Tokenization folder.
Screen shot: 5
The above screenshot shows the result of identifying all the background words in each topic. The recorded background words are placed in the background word distribution text file.
Screen shot: 6
The above screenshot shows the topic distribution of words in each text.
Screen shot: 7
The above screenshot shows the topic count values of words in each text.
Screen shot: 8
The above screenshot shows the unique words from all the topics.
Screen shot: 9
This screenshot shows the word map built from the tweets we took as input; for each word in those tweets, the map records how often it appears, as shown above.
Screen shot: 10
CHAPTER-10
CONCLUSION & FUTURE SCOPE
In this project we empirically compared the content of Twitter with a typical
traditional news medium, New York Times, focusing on the differences between
these two. We developed a new Twitter-LDA model that is designed for short tweets
and showed its effectiveness compared with existing models. We introduced the
concepts of topic categories and topic types to facilitate our analysis of the topical
differences between Twitter and traditional news media. Our empirical comparison
confirmed some previous observations and also revealed some new findings.
In particular, we find that Twitter can be a good source of entity oriented topics
that have low coverage in traditional news media. In the future, we will study how to
summarize and visualize Twitter content in a systematic way. Our method of
associating tweets with different categories and types may also help visualization of
Twitter content.
In the future, this project can be extended to a comparison of LSA and LDA; LDA is more powerful than the existing LSA technique for segmentation of a Twitter timeline.