
2006 IEEE International Conference on

Systems, Man, and Cybernetics


October 8-11, 2006, Taipei, Taiwan

A Role-based Customer Review Mining System


Wenqian Shang, Youli Qu, Houkuan Huang, Yongmin Lin, and Hongbin Dong

Manuscript submitted March 14, 2006. This work was supported in part by the Beijing Jiaotong University Science Foundation under grant 2004RC008.
Wenqian Shang is with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China (phone: 010-51683602; e-mail: shangwenqian@hotmail.com).
Youli Qu is with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China (e-mail: qyl@computer.njtu.edu.cn).
Houkuan Huang is with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China (e-mail: hkhuang@center.njtu.edu.cn).
Yongmin Lin is with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China (e-mail: linyongmin1208@tom.com).
Hongbin Dong is with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China (e-mail: donghongbin@263.net).
1-4244-0100-3/06/$20.00 ©2006 IEEE

Abstract-With the development of the World Wide Web (WWW), more and more people surf the web and more and more web sites provide forums where people can publish their reviews, so many forums contain many reviews on particular topics. Our system mainly targets customer reviews of products. The data set comes from forum data, emails, or any other form of review. The mining results can guide a company's CEO in making scientific decisions about product research and market development, and can guide customers to purchase more satisfying products. Our system mainly adopts data mining, natural language processing, and web text mining technologies. In realizing the system, we adopt role-based concepts and theory, which makes the realization more reasonable and more efficient and makes the system's functions more complete and more scientific.

I. INTRODUCTION

WITH the rapid development of the World Wide Web, e-commerce has come to flourish. More and more products are sold on the web, more and more people shop online, and more and more web sites offer forums where customers can review the products they have purchased. Customers may also send emails to the company to review their purchases, or review them in questionnaires. These reviews may contain valuable information and provide an important basis for making business decisions. Knowing the reputations of your own and/or your competitors' products is very important for market development and customer relationship management. These reviews can also guide potential customers to purchase more satisfying products.
Some research already exists in this area. The work in [1][2][3] mines open answers in questionnaire data, mainly using rule-based text classification, but it applies only to various types of survey data. The work in [4][5][6] mainly studies the mining of customer reviews from the web, using natural language processing, data mining, expert knowledge bases, etc., but it applies only to forum review data. The work in [7] adopts a Bayes algorithm based on N-grams and introduces semantic ideas. Our system differs considerably from these. It can deal not only with questionnaire data but also with web data from forums; indeed, it can deal with any form of text data. We adopt a fuzzy kNN classifier, the Apriori algorithm, natural language processing, information retrieval technology, role-based ideas, and so on, and we integrate these technologies and ideas to make the system more complete.
This paper is structured as follows: Section 2 briefly introduces the architecture and functions of the system; Section 3 discusses the main algorithms used in the system; Section 4 describes some ideas about roles and the role-based realization of the mining system; Section 5 gives the conclusion.

II. THE ARCHITECTURE AND FUNCTION OF THE SYSTEM

A. The Process Flow of the System
The process flow of the system is shown in Figure 1. The system is coded in Visual C++ under the Windows environment.

[Figure 1 (flow diagram): the Web -> real-time agent search engine -> questionnaire data, email, forum reviews, etc. -> preprocessing of review data (transform data of any form into data that the system can process) -> fuzzy kNN classifier -> review data classified by product name -> feature word extraction, opinion word extraction, prediction of the orientations of opinion words and sentences -> building of the trend analysis report on review orientation]
Fig. 1. The flow of system.

The system has four main parts: first, getting the review data from the web; second, classifying the review data into different classes according to product type; third, adopting different data mining technologies to mine the
product reputation from customer reviews; and finally, building the trend analysis report to help decision-makers make decisions or to help potential customers purchase more satisfying products.

B. The Function of Each Component
1) Real-time agent search engine
This module automatically gathers the relevant reviews or questionnaire data from the web and saves them on the local server for further processing. Moreover, this module can update itself automatically to obtain up-to-date information.
2) Preprocessing review data
This module converts review data of any form (such as questionnaire data, forum review data, email, etc.) into text data that the system can process.
3) Classifying review data
According to the product type, the system classifies review data into different types, such as MP3 player, DVD player, digital camera, cellular phone and so on. If the review data has already been classified, it does not need to be classified again and can go directly to the next step, mining the review data. Here we adopt the fuzzy kNN classifier to classify the review data.
4) Mining the review data
After classifying the review data, we can mine customers' opinions of the products for every product type. The mining process can be divided into several steps as follows:
Step 1: Part-of-speech tagging
This step tags each word with its part of speech, such as noun, verb, adjective, etc. For example:
<S><NG><W C='PRP' L='SS' T='W' S='Y'>I</W></NG><VG><W C='VBP'>am</W><W C='RB'>deeply</W></VG><W C='IN'>in</W><NG><W C='NN'>love</W></NG><W C='IN'>of</W><NG><W C='DT'>this</W><W C='NN'>MP3</W></NG><W C='.'>.</W></S>
Here we adopt the NLProcessor linguistic parser [8] to analyze each review, split the text into sentences and tag the part of speech of each word.
Step 2: Feature words extraction
What are feature words? We can explain with an example:
"The sound of this mp3 is very clear."
"The appearance of this digital camera is very awful."
Here, "sound" and "appearance" are feature words. In the system, we adopt the Apriori algorithm to extract feature words.
Step 3: Opinion words extraction
What are opinion words? Opinion words are the words with which customers evaluate products. In the example above, "clear" and "awful" are opinion words. Opinion words are often adjectives, so, starting from the feature words, we take the adjectives near the feature words to be opinion words.
Step 4: Semantic orientation identification of opinion words
What is the semantic orientation of opinion words? In the above example, "clear" has a positive orientation: it expresses the customer's favor towards the product. "Awful" has a negative orientation: it expresses the customer's dislike of the product. Here we adopt the PMI-IR algorithm to identify the semantic orientation of opinion words. If there are no negation words in a sentence, we take the semantic orientation of the opinion words as the semantic orientation of the sentence. If there are negation words in a sentence, we take the semantic orientation of the sentence to be the opposite of that of the opinion words.
5) Building of the trend analysis report
Through the above four steps, we obtain statistics on positive and negative reviews and then build the trend analysis report for decision-makers and potential customers to analyze.

III. MAIN ALGORITHM USED IN SYSTEM

In this section, we present the main algorithms used in the system. They can be described as follows:

A. Fuzzy kNN Classifier
In recent years, the kNN algorithm has attracted the attention of many researchers, and it performs well in text categorization. At present, there are two main decision rules in the kNN algorithm: the discrete-valued rule DVF (Discrete-Valued Function) and the similarity-weighted rule SWF (Similarity-Weighted Function). The most widely used is the SWF rule, and this paper mainly focuses on it. The kNN algorithm based on the SWF rule can be described as follows:
The system searches the training set for the k documents (called neighbors) that have the maximal similarity (cosine similarity) to the test document. It then scores the test document's candidate classes according to the classes to which these neighbors belong. The similarity between each neighbor document and the test document is taken as the class weight of that neighbor. The decision function can be defined as follows:

$$\mu_j(X) = \sum_{i=1}^{k} \mu_j(X_i)\,\mathrm{sim}(X, X_i) \tag{1}$$

where $\mu_j(X_i) \in \{0, 1\}$ indicates whether $X_i$ belongs to $\omega_j$ ($\mu_j(X_i) = 1$) or not ($\mu_j(X_i) = 0$); $\omega_j$ is a document class; and $\mathrm{sim}(X, X_i)$ denotes the similarity between the training document $X_i$ and the test document $X$. The decision rule is then: if $\mu_j(X) = \max_i \mu_i(X)$, then $X \in \omega_j$.
In the classical kNN algorithm, there is an obvious problem: when the density of the training data is uneven, classification precision may decrease if we consider only the order of the first k nearest neighbors and not the differences in their distances. To solve this problem, we adopt the
theory of fuzzy sets by constructing a new membership function based on document similarities as follows:

$$\mu_j(X) = \frac{\sum_{i=1}^{k} \mu_j(X_i)\,\mathrm{sim}(X, X_i)\,\dfrac{1}{\left(1-\mathrm{sim}(X, X_i)\right)^{2/(b-1)}}}{\sum_{i=1}^{k} \dfrac{1}{\left(1-\mathrm{sim}(X, X_i)\right)^{2/(b-1)}}} \tag{2}$$

where $j = 1, 2, \ldots, c$, and $\mu_j(X_i)$ is the membership of the known sample $X_i$ in class $j$: if sample $X_i$ belongs to class $j$ the value is 1, otherwise 0. From this formula, we can see that the membership uses the distance of each neighbor from the sample to be classified to weight that neighbor's effect. The parameter b adjusts the degree of the distance weighting; in this paper b takes the value 2. The fuzzy k-nearest-neighbor decision rule is then: if $\mu_j(X) = \max_i \mu_i(X)$, then $X \in \omega_j$. The reasons for improving the classical kNN into this form are given in [9]; we do not present them in detail here.

B. Apriori Algorithm
Here, we adopt the Apriori algorithm to extract feature words. The algorithm can be described as follows [10]:

    L_1 = find_frequent_1-itemsets(D);
    for (k = 2; L_{k-1} ≠ ∅; k++) {
        C_k = apriori_gen(L_{k-1}, min_sup);
        for each transaction t ∈ D {    // scan D for counts
            C_t = subset(C_k, t);       // get the subsets of t that are candidates
            for each candidate c ∈ C_t
                c.count++;
        }
        L_k = {c ∈ C_k | c.count ≥ min_sup}
    }
    return L = ∪_k L_k;

    procedure apriori_gen(L_{k-1}: frequent (k-1)-itemsets; min_sup: minimum support threshold)
        for each itemset l_1 ∈ L_{k-1}
            for each itemset l_2 ∈ L_{k-1}
                if (l_1[1] = l_2[1]) ∧ (l_1[2] = l_2[2]) ∧ ... ∧ (l_1[k-2] = l_2[k-2]) ∧ (l_1[k-1] < l_2[k-1]) then {
                    c = l_1 ⋈ l_2;      // join step: generate candidates
                    if has_infrequent_subset(c, L_{k-1}) then
                        delete c;       // prune step: remove unfruitful candidate
                    else add c to C_k;
                }
        return C_k;

    procedure has_infrequent_subset(c: candidate k-itemset; L_{k-1}: frequent (k-1)-itemsets)
        // use prior knowledge
        for each (k-1)-subset s of c
            if s ∉ L_{k-1} then
                return TRUE;
        return FALSE;

where D is the review data set, min_sup is the minimum support threshold, and L is the feature word set.

C. PMI-IR Algorithm
In this paper, we use the PMI-IR algorithm to identify the semantic orientation of opinion words. The algorithm can be described as follows [11][12][13]:
PMI-IR uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words or phrases. More specifically, the semantic orientation of a given word or phrase is calculated from a probability-based value: the mutual information between the given word or phrase and the word "excellent", minus the mutual information between the given word or phrase and the word "poor". The strength of the semantic orientation is given by the magnitude of this value. The PMI between two words, $word_1$ and $word_2$, is defined as follows [14]:

$$\mathrm{PMI}(word_1, word_2) = \log_2\left[\frac{p(word_1 \,\&\, word_2)}{p(word_1)\,p(word_2)}\right] \tag{3}$$

where $p(word_1 \,\&\, word_2)$ is the probability that $word_1$ and $word_2$ co-occur. If the words are statistically independent, then the probability that they co-occur is given by the product $p(word_1)\,p(word_2)$. The Semantic Orientation (SO) of a word or phrase is calculated as follows:

$$\mathrm{SO}(word/phrase) = \mathrm{PMI}(word/phrase, \text{"excellent"}) - \mathrm{PMI}(word/phrase, \text{"poor"}) \tag{4}$$

IV. REALIZATION OF THE MINING SYSTEM BASED ON ROLE

In realizing the system, we adopt the method of role-based collaborative software development, which can be described in detail as follows [15]:

A. Basic Role Concept
The idea of roles has been widely used for many years in behavioral science, social science and psychology, and is an important idea in many other areas [16][17]. Here we apply it to the data mining area. A role is defined as a window through which team members see the collaborative development platform. When a member plays a role, he or she possesses a special view through which he or she
understands the surroundings. Roles specify not only what the system may request of the user, but also what the user may ask of the system. A role thus includes two aspects: responsibility and right. A role is a key factor in the communication between the development platform and team members and among team members. A role can express some features of a team member: on the one hand, it delivers messages to the team member so that he or she can serve other people; on the other hand, it defines the messages the team member may request. Based on these concepts, we can define the basic role ideas as follows:
R1: A role does not depend on people;
R2: A role can be set up, changed, and deleted;
R3: A role should consider both responsibility (service interface) and right (requirement interface); defining a role means defining its responsibility and right;
R4: A role does not itself carry out the tasks prescribed in its responsibility. The team members who are playing this role carry out the tasks;
R5: As a service interface, a role is a filter that sends messages to the agents who are playing it;
R6: As a requirement interface, a role restricts the rights of the users to access the system;
R7: A role is a medium through which the development platform communicates with team members and team members communicate with each other.
Based on these concepts, a collaborative development platform Σ can be described as a 9-tuple
Σ ::= <C, O, A, M, R, E, G, S0, H>
where C is a set of classes; O is a set of objects; A is a set of agents; M is a set of messages; R is a set of roles; E is a set of surroundings; G is a set of workgroups; S0 is the initial state of the collaborative system; and H is a set of team members.
S0 expresses the initial values of all the components C, O, A, M, R and E, such as primitive classes, initial objects, initial agents, primitive roles, primitive messages and primitive environments. Team members H log in to the collaborative platform Σ, access the objects O of the platform, and then send messages through roles, thereby forming a group G in a particular environment E.

B. Assistant Support Tool
In collaborative software development activities, team members use their initial roles to communicate with other team members who play certain roles. To help team members communicate during development, we need to set up a set of convenient and flexible assistant tools, as shown in Fig. 2.

Fig. 2. Assistant support tool

The function of every component in Fig. 2 can be described as follows:
1) Class, object management tool
The class management tool permits members to set up, delete and modify classes, or to distribute messages to classes. The latest changes to a class should be reloaded into the development platform dynamically.
The object management tool permits members to set up, delete and modify objects. It works in the same way as the class management tool.
2) Task management tool
Task decomposition, role distribution and activity performance are the three basic steps in implementing collaborative software development activities, and task decomposition is the precondition for the whole collaborative activity. In a collaborative activity, a task is always decomposed into subtasks, and a subtask can be decomposed further into smaller cell tasks. Every cell task is the responsibility of a corresponding role [18]. At the same time, in object-oriented terms, a cell task can be mapped onto a message, and task performance can be seen as a role's processing of a message. The whole task can therefore be seen as a set of messages with a definite order of execution, linked together through roles like the links of a chain. From this analysis, we can express a task as t ::= <r, Mi, Mo>,
where r expresses the number of roles executing this task, 1 ≤ r ≤ n (n is the maximum number of task decompositions); Mi is the input message of the task, and when t = t1, Mi is empty (t1 is the initial task, which has no input message); Mo is the output message of the task, and when t = tn, Mo is empty (tn is the end task, which has no output message).
We use t to denote a cell task and T to denote a collective task. To make the collaborative activity highly efficient, to trace tasks and to improve task management, a convenient and flexible task management tool is needed. Team members need only use graphical user
interfaces to carry out the decomposition and management of tasks.
3) Agent management tool
The agent management tool mainly provides for the addition, deletion, modification, activation and prohibition of agents. Here, agent activation and prohibition serve to control agents' behavior and activities.
4) Role definition tool
The role definition tool mainly provides for role definition, activation and prohibition. A role is defined as a set of messages, so when defining a role, members only need to select the required messages from the class, object, agent, role, group and surroundings lists.
5) Role negotiation tool
This tool mainly supports communication between members and the role manager. If a member wants to play a role, he or she must provide an agent class and implement all the input messages of this role. At the same time, the member can request the addition of some messages.
6) Surroundings management tool
The surroundings management tool mainly provides for the creation and deletion of surroundings and manages the number of roles in each surroundings.
7) Group management tool
This tool mainly provides for the creation, deletion and modification of groups and offers a function for distributing group news.
Note: In a concrete collaborative activity, these assistant tools do not exist in isolation; they interact and collaborate closely to finish the same task.

C. The Realization of Customer Review Mining System Based on Role
Through the analysis and discussion above, using classes, objects, roles, agents, groups, surroundings, etc., together with the set of assistant tools, we can conveniently set up a role-based collaborative software development platform. Through this platform, we can realize the role-based customer review mining system. The realization process can be described as follows:
(1) First, the task management tool decomposes the whole task into single cell tasks;
(2) The role definition tool defines the project manager, system analyzer, module designer, code programmer and tester roles according to the project tasks to form the development team;
(3) The project manager uses the defined manager role to log in to the development platform. The platform offers him/her the corresponding user interfaces and information according to the roles he/she plays. With their help, the project manager carries out his/her management work on project cost and time. He/She does not need to consider concrete realization details;
(4) The system analyzer uses the analyzer role to log in to the development platform. He/She completes the project requirement document and requirement report through the specific interface that the platform offers him/her;
(5) The module designer designs the project's function modules according to the requirement documents that the system analyzer provides, such as class definitions, class inheritance and function definitions. He/She does not concern himself/herself with the code realization of classes and functions;
(6) The code programmer writes the code of the classes and functions that the module designer designs;
(7) The tester mainly carries out the tests of the cell modules and of the whole system and, after finishing the tests, submits the test report;
(8) After finishing the code, the code programmer submits it to the project group;
(9) The designer assembles the submitted code into cell modules and submits them;
(10) Finally, the system analyzer integrates all modules into a system, checks the test results, decides whether modifications are needed, and then submits the system to the customer;
(11) After accepting the submitted system, the user runs it, finds errors in the system and identifies new requirements.
Thus, the whole development of the system is finished. The project manager controls cost and time during the whole process of project development. At the same time, he/she cooperates with the system analyzer and module designer to deliver the project on schedule and within the cost budget.
In this paper, we use these concepts and ideas to guide our system realization. In the end, the whole project took much less time than anticipated, which shows that our methods are feasible and effective.

V. CONCLUSION

In this paper, we design a novel role-based customer review mining system. We apply natural language processing, information retrieval, data mining and text mining technologies in combination. At the same time, we introduce the role idea into the system, which greatly improves the efficiency of system development. In the future, we need to further improve the precision and efficiency of the system, expand its functions, and improve the algorithms it uses.

REFERENCES
[1] H. Li and K. Yamanishi, "Mining from open answers in questionnaire data," in 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, Aug. 2001, pp. 443-449.
[2] K. Yamanishi and H. Li, "Mining open answers in questionnaire data," IEEE Intelligent Systems, vol. 17, no. 5, pp. 58-63, Sep. 2002.
[3] S. Morinaga, K. Yamanishi, K. Tateishi, and T. Fukushima, "Mining product reputations on the web," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, Jul. 2002, pp. 341-349.
[4] B. Liu, M. Hu, and J. Cheng, "Opinion observer: analyzing and comparing opinions on the web," in the 14th International World Wide Web Conference, Chiba, Japan, May 2005, pp. 342-351.
[5] M. Hu and B. Liu, "Mining and summarizing customer reviews," in the conference of KDD 2004, Seattle, Washington, USA, pp. 168-177.
[6] M. Hu and B. Liu, "Mining opinion features in customer reviews," in Proceedings of the 19th National Conference on Artificial Intelligence, San Jose, USA, July 2004, pp. 755-760.
[7] K. Dave, S. Lawrence, and D. M. Pennock, "Mining the peanut gallery: opinion extraction and semantic classification of product reviews," in the conference of WWW03, Budapest, Hungary, May 2003.
[8] NL Processor - Text analysis toolkit. 2000. http://www.infogistics.com/textanalysis.html.
[9] W. Q. Shang, H. K. Huang, H. B. Zhu, et al., "An adaptive fuzzy kNN text classifier," in Proceedings of the 2006 International Conference on Computational Science, UK, May 2006.
[10] J. W. Han and M. Kamber, Data Mining: Concepts and Techniques. Beijing, China: China Machine Press, 2001.
[11] P. D. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, July 2002, pp. 417-424.
[12] P. D. Turney, "Mining the web for synonyms: PMI-IR versus LSA on TOEFL," in Proceedings of the 12th European Conference on Machine Learning, Freiburg, Germany, Sep. 2001, pp. 491-502.
[13] V. Hatzivassiloglou and K. R. McKeown, "Predicting the semantic orientation of adjectives," in Proc. 35th Annual Meeting of the ACL and the 8th Conference of the European Chapter of the ACL, Madrid, July 1997, pp. 174-181.
[14] K. W. Church and P. Hanks, "Word association norms, mutual information and lexicography," in Proceedings of the 27th Annual Conference of the ACL, Canada, June 1989, pp. 76-83.
[15] L. Y. Liu, H. B. Zhu, and C. X. Jiang, "A role-based collaborative software development framework," Science Technology and Engineering, vol. 5, no. 17, pp. 1300-1304, Sep. 2005.
[16] H. B. Zhu, "Role mechanisms in collaborative systems," International Journal of Production Research, vol. 44, no. 1, Jan. 2006, pp. 181-193.
[17] H. B. Zhu et al., "An object-oriented multimedia management model WNCH," Journal of Software, supplement, vol. 7, 1996.
[18] A. Caetano, M. Zacarias, A. R. Silva, and J. Tribolet, "A role-based framework for business process modeling," in Proceedings of the 38th Hawaii International Conference on System Sciences, 2005.
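
As an illustration of the decision rules in Section III-A, the following is a minimal Python sketch; it is not the authors' Visual C++ implementation. It assumes documents are already represented as sparse term-weight dictionaries and uses cosine similarity. It computes the SWF score of Eq. (1) and a fuzzy membership in which each neighbor's contribution μ_j(X_i)·sim(X, X_i) is scaled by 1/(1 − sim(X, X_i))^(2/(b−1)) and normalized by the sum of these scales. The function names, the toy training set, and the constant eps that guards against division by zero when sim = 1 are assumptions made for this example.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def nearest_neighbors(x, training, k):
    """Return the k (similarity, label) pairs most similar to x."""
    scored = [(cosine(x, doc), label) for doc, label in training]
    return sorted(scored, key=lambda p: p[0], reverse=True)[:k]


def swf_scores(x, training, k):
    """Eq. (1): sum the similarities of the neighbors in each class."""
    scores = defaultdict(float)
    for sim, label in nearest_neighbors(x, training, k):
        scores[label] += sim
    return dict(scores)


def fuzzy_memberships(x, training, k, b=2.0, eps=1e-9):
    """Fuzzy membership: similarity-weighted votes, each scaled by
    1 / (1 - sim)^(2/(b-1)) and normalized by the sum of the scales.
    eps is an assumption of this sketch (avoids division by zero)."""
    exponent = 2.0 / (b - 1.0)
    numer = defaultdict(float)
    denom = 0.0
    for sim, label in nearest_neighbors(x, training, k):
        scale = 1.0 / (max(1.0 - sim, eps) ** exponent)
        numer[label] += sim * scale
        denom += scale
    return {label: value / denom for label, value in numer.items()}


def classify(x, training, k, b=2.0):
    """Assign x to the class with the largest fuzzy membership."""
    memberships = fuzzy_memberships(x, training, k, b)
    return max(memberships, key=memberships.get)


# Hypothetical toy training set: term-frequency vectors with product labels.
TRAINING = [
    ({"mp3": 2, "sound": 1}, "MP3 player"),
    ({"mp3": 1, "battery": 1}, "MP3 player"),
    ({"camera": 2, "lens": 1}, "Digital camera"),
    ({"camera": 1, "picture": 2}, "Digital camera"),
]

if __name__ == "__main__":
    review = {"mp3": 1, "sound": 2}
    print(classify(review, TRAINING, k=3))
```

With b = 2 the exponent is 2, so a neighbor with similarity 0.8 is scaled 25 times more heavily than one with similarity 0, which is how the rule accounts for differences in distance rather than only the rank of the neighbors.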

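
The Apriori procedure of Section III-B can be illustrated with a compact Python sketch. Each review sentence is treated as a transaction whose items are its candidate feature words (for example, the nouns found by the part-of-speech tagging step). The function names, the use of an absolute count for min_sup, and the toy transactions are assumptions made for this example and are not taken from the system. The join step here uses set union instead of the sorted-prefix test in the pseudocode; after the prune step both produce the same candidates.

```python
from itertools import combinations


def find_frequent_1_itemsets(transactions, min_sup):
    """Count single items and keep those meeting the support threshold."""
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    return {s for s, n in counts.items() if n >= min_sup}


def has_infrequent_subset(candidate, prev_frequent):
    """Prune test: True if any (k-1)-subset of the candidate is infrequent."""
    k = len(candidate)
    return any(frozenset(s) not in prev_frequent
               for s in combinations(candidate, k - 1))


def apriori_gen(prev_frequent):
    """Join frequent (k-1)-itemsets into k-itemsets, then prune."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == len(a) + 1 and not has_infrequent_subset(union, prev_frequent):
                candidates.add(union)
    return candidates


def apriori(transactions, min_sup):
    """Return all frequent itemsets (the union of L_k over all k)."""
    transactions = [frozenset(t) for t in transactions]
    level = find_frequent_1_itemsets(transactions, min_sup)
    frequent = set()
    while level:
        frequent |= level
        candidates = apriori_gen(level)
        # Scan the transactions to count each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_sup}
    return frequent


# Hypothetical transactions: the nouns found in four review sentences.
REVIEWS = [
    {"sound", "battery"},
    {"sound", "screen"},
    {"sound", "battery", "price"},
    {"battery"},
]

if __name__ == "__main__":
    for itemset in sorted(apriori(REVIEWS, min_sup=2), key=lambda s: (len(s), sorted(s))):
        print(sorted(itemset))
```

In this toy data "screen" and "price" each appear only once, so with min_sup = 2 only "sound", "battery" and the pair {"sound", "battery"} survive as feature-word candidates.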

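
Equations (3) and (4) of Section III-C, together with the negation rule from Step 4 of Section II-B, can be worked through with a small Python sketch. In PMI-IR the probabilities are estimated from search-engine hit counts; here the counts are hypothetical numbers supplied in a dictionary, and the function names and the smoothing constant added to the co-occurrence counts are assumptions of this example. When the two PMI values are subtracted, the total number of documents and the hit count of the word itself cancel, so SO depends only on ratios of hit counts.

```python
import math


def pmi(hits_both, hits_w1, hits_w2, total):
    """Eq. (3): log2( p(w1 & w2) / (p(w1) p(w2)) ), estimated from counts."""
    p_both = hits_both / total
    return math.log2(p_both / ((hits_w1 / total) * (hits_w2 / total)))


def semantic_orientation(word, hits, total, smooth=0.01):
    """Eq. (4): PMI(word, "excellent") - PMI(word, "poor").
    smooth is an assumption of this sketch; it avoids log2(0)."""
    pos = pmi(hits[(word, "excellent")] + smooth, hits[word], hits["excellent"], total)
    neg = pmi(hits[(word, "poor")] + smooth, hits[word], hits["poor"], total)
    return pos - neg


def sentence_orientation(opinion_so, has_negation):
    """A sentence takes the opinion word's orientation, reversed under negation."""
    return -opinion_so if has_negation else opinion_so


# Hypothetical hit counts, for illustration only.
TOTAL = 1_000_000
HITS = {
    "excellent": 10_000, "poor": 10_000,
    "clear": 5_000, "awful": 5_000,
    ("clear", "excellent"): 400, ("clear", "poor"): 100,
    ("awful", "excellent"): 50, ("awful", "poor"): 800,
}

if __name__ == "__main__":
    for word in ("clear", "awful"):
        print(word, round(semantic_orientation(word, HITS, TOTAL), 2))
```

With these made-up counts "clear" co-occurs four times as often with "excellent" as with "poor", giving an orientation of about +2, while "awful" comes out at about −4. A sentence such as "the sound is not clear" would then be scored with sentence_orientation(so, has_negation=True).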