
A Framework to Predict Software Quality in Use

from Software Reviews


Issa Atoum, Bong Chih How
Faculty of Computer Science and Information Technology
Universiti Malaysia Sarawak
94300 Kota Samarahan, Sarawak, Malaysia
{Issa.Atoum@gmail.com, chbong@fit.unimas.my}

Abstract: Software reviews are a proven source of users' experience. Software quality in use concerns meeting users' needs. Current software quality models, such as McCall and Boehm, are built to support the software development process rather than users' perspectives. In this paper, opinion mining is used to extract and summarize software quality in use from software reviews. A framework to detect software quality in use as defined by the ISO/IEC 25010 standard is presented here. The framework employs opinion-feature double propagation to expand predefined lists of software quality in use features to domain-specific features. Clustering is used to group software features into quality in use characteristics. A preliminary analysis of extracted software features shows promising results in this direction.

Keywords: quality in use; software reviews; opinion mining; ISO 25010; quality model; product reviews

I. INTRODUCTION

Online product reviews are a major source of information about users' experience. They have a direct impact on peers' willingness to buy the same products. For example, if someone is willing to buy a mobile phone, s/he will look for comments about the phone on online web sites such as amazon.com or epinions.com before deciding to buy it. Software products are no exception. Many online web sites give users the opportunity to share their experience and give ideas about software enhancements. For example, cnet.com is a popular web site that hosts thousands of software products and tracks user statistics of downloads and satisfaction over time.
Reviews of popular software have increased dramatically; hence processing them is laborious and costly. Moreover, product reviews can often be confusing. For example, comments like "I just don't like this product" and "The product took forever to be here" lack constructive expressions: they are not targeted at the product itself, because some reviews consist of emotional expression and/or bias.
Software products are evaluated differently by different stakeholders according to their interests. For example, the publisher of the product may be interested in developing quality software, while users care about the whole product while it is operational. Quality, according to Garvin, from the point of view of a user is "meeting customer needs" [1]. If the software meets those needs, then it is said to have good quality.

Many software quality models, such as McCall, Boehm, Dromey and FURPS [2], [3], are built for quality from a development perspective and do not fit measuring software quality from the user's point of view [4], [5]. For users, the purpose of using software is to help them achieve particular goals, i.e., the effectiveness, efficiency and satisfaction with which users can achieve specified goals in specified environments. ISO/IEC 25010:2011 (hereafter ISO 25010) covers software quality with a model known as Systems and software Quality Requirements and Evaluation (SQuaRE). ISO 25010 includes the quality in use model, the focus of this work.
Opinion Mining (OM) can be used to identify important reviews and opinions to answer users' queries about quality. Current opinion mining is largely restricted to revealing the bipolarity of reviews: positive and negative. Under a bipolar voting mechanism, the quality of evaluations is affected by imbalanced vote bias, which makes an OM system impracticable. Thus, there is an urgent need for effective evaluation of product quality. The flaw of the polarity system is that the overall review score can easily be tampered with because of existing opinion spam. Opinion spam refers to fake or untruthful opinions that are specifically not about the product, for instance, a negative comment given because of poor delivery service or packaging.
Furthermore, a truthful reviewing process should center on the efficiency and effectiveness of the product for users. More fine-grained works address feature- or aspect-based sentiment analysis, which determines the opinions on the features of the reviewed entity such as a cell phone or tablet. However, this suffers from the same two problems mentioned previously, as it is difficult to identify relevant entities and polarity; in addition, the review may not focus on the entity itself. More importantly, to our knowledge little research has been published on opinion mining of software reviews. Mining software reviews can save users' time and can help them in the software selection process.
This paper proposes a framework to process software user reviews in order to extract one of the software quality indicators, quality in use, as defined by ISO 25010 (ISO, 2011). Table I shows the definitions of the quality in use characteristics. One major step of this research is to build a data set of software keywords, or features. In the context of our research, software features are software properties that describe software quality in use, such as the keywords "conform" and "resource", which describe the satisfaction and efficiency characteristics respectively.
TABLE I. DEFINITIONS OF QUALITY IN USE CHARACTERISTICS FROM ISO 25010.

Effectiveness: Accuracy and completeness with which users achieve specified goals (ISO 9241-11).
Efficiency: Resources expended in relation to the accuracy and completeness with which users achieve goals (ISO 9241-11).
Freedom From Risk: Degree to which a product or system mitigates the potential risk to economic status, human life, health, or the environment.
Satisfaction: Degree to which user needs are satisfied when a product or system is used in a specified context of use.
Context Coverage: Degree to which a product or system can be used with effectiveness, efficiency, freedom from risk and satisfaction in both specified contexts of use and in contexts beyond those initially explicitly identified.

Once the data are in place, a model is built utilizing topic modeling [6], [7] and opinion mining [8]. Finally, the model is evaluated against predefined criteria with users.
First, the problem is defined. Then, related works are presented. After that, the proposed approach is explained. Finally, preliminary results are presented and the paper is concluded.
II. THE PROBLEM

The problem addressed in this work can be summarized in two sub-problems:
A. Problems related to quality in use models
Although there are many software quality models, such as McCall, Boehm, Dromey and FURPS [2], [3], most of them target software product or process characteristics and do not fit measuring software quality from the user's point of view [4], [5] (quality in use). Tweaking these models to allow measuring quality in use can be a cumbersome process and is outside the scope of this work.
B. Problems related to user reviews processing

Processing a massive number of software product reviews is challenging due to the effort needed for a single review: filtering unneeded spam sentences, removing stop words and non-dictionary words, processing reviews that have a certain level of confidence among reviewers, etc. Also, processing unconstructive sentences has to be given special consideration, given that current review processing is bipolar: either positive or negative. It is very beneficial to learn to recognize review subjectivity, especially user experiences, in order to discover product quality, instead of judging solely on an aggregated positive or negative score.
Although extracting features (software aspects in our case) has been studied extensively in various domains, many grouping and summarization approaches are dictionary- or corpus-based. Dictionary-based approaches are usually huge yet cannot detect domain-dependent context features, while corpus-based approaches are domain dependent yet hard to construct.
To our knowledge, little research has been published on opinion mining of software reviews. Mining software reviews can save users' time and expedite software selection. The work of Leopairote et al. [9] provides an approach to mine software reviews based on ontology mining and rule-based classification for the ISO 9126 model.
III. RELATED WORKS

Here we present the related works that are used to build the proposed framework, grouped into WordNet, the EM algorithm, topic modeling, and feature extraction and summarization.
A. WordNet
WordNet is a lexical knowledge base of English that contains more than 155,000 words organized into a taxonomic ontology of terms (latest version 3.0 as of Feb. 2013). It contains nouns, verbs, adjectives and adverbs that are clustered into synonym sets (called synsets). Each synset can be linked to other synsets via a specific relationship entailed between concepts. The hyponym/hypernym (is-a) relationship and the meronym/holonym (part-of) relationship are common relationships in WordNet.
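As an illustration of how WordNet can support keyword expansion in a framework like this, the sketch below expands a small set of seed quality in use keywords with synonyms, hypernyms and hyponyms via NLTK's WordNet interface. The seed lists and the depth of expansion are assumptions for illustration only, not the exact lists used in this work.

```python
# Minimal sketch: expanding seed quality-in-use keywords with WordNet (NLTK).
# Assumes nltk is installed and the WordNet corpus has been downloaded
# via nltk.download('wordnet'). Seed words are illustrative, not the paper's lists.
from nltk.corpus import wordnet as wn

seeds = {
    "efficiency": ["resource", "speed", "capacity"],
    "satisfaction": ["conform", "pleasure", "comfort"],
}

def expand(word):
    """Collect lemmas from the word's synsets plus their hypernyms/hyponyms."""
    related = set()
    for synset in wn.synsets(word):
        related.update(l.name() for l in synset.lemmas())
        for neighbor in synset.hypernyms() + synset.hyponyms():
            related.update(l.name() for l in neighbor.lemmas())
    return {w.replace("_", " ").lower() for w in related}

expanded = {c: sorted(set().union(*(expand(w) for w in words)))
            for c, words in seeds.items()}

for characteristic, keywords in expanded.items():
    print(characteristic, "->", keywords[:10])  # show a few expanded keywords
```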
B. EM Algorithm
EM is a class of iterative algorithms that uses maximum a posteriori estimation to learn a classifier in problems with incomplete data [6]. In other words, EM can perform semi-supervised learning: seeded with labeled data, it estimates the incomplete data (the unlabeled documents). EM is based on two steps known as the E-step (Expectation) and the M-step (Maximization). In the E-step, the expected class assignments of the unlabeled documents are estimated from the current model parameters (class distributions), while in the M-step, the model parameters are re-estimated given these assignments. The algorithm iterates between the E-step and the M-step until the model parameters converge. As a result, each unlabeled document is labeled with the most appropriate class. In our work, the EM algorithm is initialized with the features and opinion words extracted using Qiu's algorithm [22].


C. Topic Modeling
Topic modeling methods can be intuitively viewed as clustering algorithms that cluster terms into meaningful clusters or subtopics. In probabilistic topic models [7], [10], [11], documents are a mixture of topics, where a topic is a probability distribution over words. To generate a document, one chooses a distribution over topics; for each word in that document, one chooses a topic at random according to that distribution and draws a word from that topic. A well-known topic model is LSI or LSA [12], [13]. LSA transforms text into a low-dimensional matrix and finds the most common topics that appear together in the processed text. Latent Dirichlet Allocation (LDA) is a very popular topic model [7]. It extends the Probabilistic Latent Semantic Analysis (PLSA) model [11] to address two problems: overfitting and the limitation of assigning probabilities to documents outside the training set.
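To make the generative view above concrete, the short sketch below samples a toy document from hand-made topic-word distributions: a topic mixture is drawn from a Dirichlet prior, then each word picks a topic and a word from that topic. The vocabulary and topics are invented for illustration and have no relation to the actual corpora used later.

```python
# Toy illustration of the topic-model generative process (not model fitting).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["fast", "memory", "crash", "bug", "easy", "helpful"]
# Two hand-made topics: an "efficiency-like" and a "risk-like" word distribution.
topics = np.array([
    [0.40, 0.40, 0.05, 0.05, 0.05, 0.05],   # topic 0
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],   # topic 1
])

theta = rng.dirichlet(alpha=[5.0, 5.0])    # per-document topic mixture
document = []
for _ in range(8):                          # generate 8 words
    z = rng.choice(len(topics), p=theta)    # choose a topic for this word
    w = rng.choice(vocab, p=topics[z])      # draw a word from that topic
    document.append(w)

print("theta:", np.round(theta, 2), "document:", document)
```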
D. Feature extraction, classification, and summarization
Feature/topic extraction has been discussed in the literature in many works such as [14]-[18]. Most of these works use language semantics to extract features such as nouns and noun phrases, along with their frequencies, subject to predefined thresholds. Another line of work treats user reviews as documents and applies text classification approaches such as [6], [19]-[21] to classify reviews.
Leopairote, Surarerks, and Prompoon [9] proposed a model that can extract and summarize software reviews in order to predict software quality in use. The model depends on a manually built ontology of ISO 9126 quality in use keywords and WordNet 3.0 synonym expansion.
Qiu et al. [18], [22] suggested extracting both features and opinions by propagating information between them using grammatical (syntactic) relations. The algorithm outperforms the state-of-the-art algorithms of Kanayama and Nasukawa [23], Hofmann [11], and Lafferty, McCallum, and Pereira [24].
Mukherjee and Bhattacharyya [14] extract domain-independent features from reviews and classify them by graphing dependent features/opinions and finding the shortest path to predefined cluster heads.
Jin, Huang, and Zhu [26] exploit different classification algorithms to classify product-related microblogs. Both explicit and implicit aspects are considered, and multiple kinds of feature weighting are compared. Implicit aspects are extracted using mutual information scores between adjectives and the most relevant aspect.
Zhai et al. [25] used a semi-supervised soft-constrained EM algorithm (SC-EM) that depends on seeds (labeled group examples) and a lexical similarity method to quantify the strength of features' similarity to predefined groups.
Shang et al. [28] developed a new algorithm to automatically learn and extract opinion targets in short comments. They propose a new word representation (word frequency and sentiment proportion) based on the Ku method [15]. A neural network is trained on 37 manually selected and annotated targets.

Krestel & Dokoohaki [29] propose a model to summarize


reviews according to specific ranking strategy (summarizing all
reviews, specific latent topic, focuses on positive or negative or
neural aspects). It utilizes LDA and minimizes topic
divergence.
Bhattarai, et al. [30] present a framework to automatically
detect, extract and aggregate semantically related features.
They apply sentence level syntactic information to detect
features and a corpus level information to group similar
features.
We consider the work of Leopairote, Surarerks, & Prompoon
(2012) the most nearby to our work. The difference from our
work is that we use word similarity and relatedness rather than
rule based classification and ontologies.
IV. PROPOSED FRAMEWORK

This research proposes a framework that is composed of data preparation, feature extraction, sentiment orientation and overall quality in use scoring. First, a quality in use repository is built, and then software features are expanded. After that, sentence polarities are calculated and each sentence is grouped into its quality in use characteristic. Finally, quality in use is scored to represent the overall quality in use value for a particular piece of software.
Figure 1 illustrates the structure of the proposed framework. The methodology proceeds in the following steps:
A. Data Preparation
To allow the maximum coverage of quality in use, different viewpoints are taken into consideration. These viewpoints are the ISO standard definition of quality in use and how often different keywords are used together. To achieve this goal, we suggest combining the ISO 25010 standard document, Google search results, the WordNet taxonomy and a sample of software product reviews.
Jin, Huang, & Zhu [26] exploit different classification


algorithms to classify the product related microblogs. Both
explicit and implicit aspect is considered and multiple kinds of
feature weighing are compared. Implicit aspects are extracted
using mutual information scores between adjectives and the
most relevant aspect.
Zhai et al. [25] used a semi supervised soft-constrained
Algorithm (SC-EM) that depends on seeds(labeled group
examples) lexical similarity method to quantify strength of
features similarity to predefined groups.
Shang et al. [28] develop a new algorithm to automatically
learn and extract opinion targets in short comments. They
propose new words representation (word frequency &
sentiment proportion) based on Ku method [15] .A Neural
Network is trained on 37 manually selected and annotated
targets.

Fig. 1. Proposed methodology for predicting quality in use from software user reviews (inputs: review corpus, quality in use data set, seed words and an opinion lexicon).

The reason behind this combination is to cover domain-specific keywords, using a sample of software reviews, while making sure that all the extracted keywords are fully aligned with the ISO 25010 standard. Google search can enhance the results with related words. WordNet can be utilized to expand related words and synonyms.
B. Feature Extraction
A feature is a target that is of interest to the user (a software entity name, or part of its attributes). In the context of our research, software features are software properties that describe software quality in use, such as the keywords "conform" and "resource", which describe the satisfaction and efficiency characteristics respectively. Features are extracted using double propagation of opinions and features [18], [22], which allows us to extract extra quality in use features that were not initially identified.
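As a rough illustration of the double propagation idea, the sketch below applies a few of its dependency-based rules with spaCy: a noun modified by a known opinion adjective becomes a candidate feature, and an adjective modifying a known feature becomes a candidate opinion word. The seed lists and review sentences are invented for illustration, the en_core_web_sm model is assumed to be installed, and the full rule set of Qiu et al. [18], [22] is considerably richer than this.

```python
# Minimal double-propagation sketch (a few rules only) using spaCy dependencies.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

opinion_words = {"slow", "great"}          # illustrative seed opinion words
features = {"interface"}                   # illustrative seed feature

reviews = ["The startup is slow.", "A great interface but a slow installer."]

for sentence in reviews:
    doc = nlp(sentence)
    for token in doc:
        # Rule 1: known opinion adjective modifying a noun -> new feature
        if token.dep_ == "amod" and token.lemma_.lower() in opinion_words:
            if token.head.pos_ == "NOUN":
                features.add(token.head.lemma_.lower())
        # Rule 2: adjective modifying a known feature -> new opinion word
        if token.dep_ == "amod" and token.head.lemma_.lower() in features:
            if token.pos_ == "ADJ":
                opinion_words.add(token.lemma_.lower())
        # Rule 3 (copular): "the startup is slow" -> subject becomes a feature
        if token.dep_ == "acomp" and token.lemma_.lower() in opinion_words:
            for child in token.head.children:
                if child.dep_ == "nsubj" and child.pos_ == "NOUN":
                    features.add(child.lemma_.lower())

print("features:", features)
print("opinion words:", opinion_words)
```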
C. Sentiment Orientation
Knowing the polarity of each sentence is very important so that positive and negative opinions can be aggregated. The proposed approach calculates the sentiment orientation of each sentence using a semi-supervised technique seeded with a combined list of opinion words. The polarity calculated by Qiu's method [22] is based on a combined opinion lexicon prepared by Hu and Liu [31], which consists of 654 positive and 1,098 negative opinion words; another alternative might be SentiWordNet (http://sentiwordnet.isti.cnr.it/).
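The sketch below shows the basic idea of lexicon-based sentence orientation: count matches against positive and negative opinion word lists and flip the contribution when a simple negation word precedes the match. The tiny inline lexicon and the negation handling are simplifications for illustration; the framework itself relies on the Hu and Liu lexicon [31] and Qiu's method [22].

```python
# Minimal lexicon-based sentence orientation sketch (+1 / 0 / -1).
# The inline word lists stand in for a full opinion lexicon such as Hu and Liu's.
POSITIVE = {"good", "great", "fast", "reliable", "easy"}
NEGATIVE = {"bad", "slow", "crash", "buggy", "confusing"}
NEGATIONS = {"not", "never", "no", "hardly"}

def sentence_orientation(sentence: str) -> int:
    tokens = sentence.lower().replace(".", " ").replace(",", " ").split()
    score = 0
    for i, token in enumerate(tokens):
        polarity = 1 if token in POSITIVE else -1 if token in NEGATIVE else 0
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity          # simple negation flip
        score += polarity
    # Collapse to an orientation in {+1, 0, -1}
    return (score > 0) - (score < 0)

print(sentence_orientation("The installer is fast and easy"))   # prints 1
print(sentence_orientation("It is not reliable and very slow")) # prints -1
```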
D. Feature Summarization
Given a set of reviews with many sentences, each sentence is grouped into a single quality in use characteristic such as efficiency or risk mitigation. To resolve this issue we borrow the Expectation Maximization approach of Zhai et al. [25]. Assuming feature and opinion word pairs are available from the Qiu model (the previous step), related software features are grouped and mapped to software quality in use characteristics using a modified version of the Expectation Maximization (EM) framework [32]. In this work, the initialization of the EM algorithm will be the features and opinion words extracted using Qiu's algorithm. Finally, we score quality at the software level by aggregating the polarity of the related, grouped software characteristics.
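As a simplified stand-in for the seeded grouping step, the sketch below assigns each extracted feature to the quality in use characteristic whose seed words it is most similar to, using WordNet's Wu-Palmer similarity. In the framework itself such lexical similarity only initializes the soft-constrained EM of Zhai et al. [25]; the seed lists and feature words here are illustrative assumptions.

```python
# Sketch: assign extracted features to quality-in-use characteristics by
# maximum WordNet (Wu-Palmer) similarity to per-characteristic seed words.
# Requires nltk with the WordNet corpus downloaded.
from nltk.corpus import wordnet as wn

SEEDS = {
    "efficiency": ["resource", "speed", "memory"],
    "effectiveness": ["accuracy", "goal", "task"],
    "freedom from risk": ["crash", "risk", "security"],
}

def similarity(word_a: str, word_b: str) -> float:
    """Best Wu-Palmer similarity over the noun senses of the two words."""
    best = 0.0
    for sa in wn.synsets(word_a, pos=wn.NOUN):
        for sb in wn.synsets(word_b, pos=wn.NOUN):
            best = max(best, sa.wup_similarity(sb) or 0.0)
    return best

def assign(feature: str) -> str:
    scores = {c: max(similarity(feature, s) for s in seeds)
              for c, seeds in SEEDS.items()}
    return max(scores, key=scores.get)

for feature in ["cpu", "bug", "completeness"]:
    print(feature, "->", assign(feature))
```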
E. Overall Quality in Use scoring
In this step the overall quality in use for the software is calculated. The final value of the processed reviews can be shown to the user as percentages. For example, a user may see that certain software covers efficiency at 90% and effectiveness at 70%, with an overall quality in use of 80%. One way to get the overall quality in use for a software product is to average the quality in use of each of the five characteristics using Equation 1. We defer other possible ways, such as those that depend on user preferences or characteristic weighting, to future research.
An initial investigation at this stage suggests that we might model satisfaction as a function of efficiency, effectiveness and risk mitigation, and context coverage as a function of satisfaction.

QinU = (1/|C|) * Σ_{c ∈ C} (1/|S_c|) * Σ_{s ∈ S_c} o(s)    (1)

where S_c is the set of sentences classified under quality in use characteristic c ∈ C = {efficiency, effectiveness, risk mitigation, satisfaction, context coverage}, and o(s) ∈ {+1, -1} is the positive or negative orientation of each sentence s.
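A small sketch of the aggregation step of Equation 1 is shown below, assuming each classified sentence carries an orientation of +1 or -1; the sample sentence orientations are invented, and the rescaling of the averages into percentages is an assumption made here for display.

```python
# Sketch of Equation 1: average sentence orientation per characteristic,
# then an unweighted average across the five characteristics.
# Sample orientations are invented; the percentage rescaling is an assumption.
classified = {
    "efficiency":        [1, 1, 1, -1],
    "effectiveness":     [1, -1, 1, 1, -1],
    "risk mitigation":   [1, 1],
    "satisfaction":      [1, 1, -1],
    "context coverage":  [1],
}

def q_characteristic(orientations):
    """Average orientation o(s) in {+1, -1} for one characteristic."""
    return sum(orientations) / len(orientations)

q = {c: q_characteristic(o) for c, o in classified.items()}
overall = sum(q.values()) / len(q)          # unweighted average over |C| = 5

for c, value in q.items():
    print(f"{c}: {(value + 1) / 2:.0%}")    # rescale [-1, 1] -> [0%, 100%]
print(f"overall quality in use: {(overall + 1) / 2:.0%}")
```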
V. PRELIMINARY RESULTS

Experiments have been carried out to prepare the data dictionary for quality in use. Figure 2 illustrates the proposed approach.
A. Data Preparation
Software reviews from ten different domains have been crawled from cnet.com. The fields that have been extracted are: software name, pros, cons, summary and review rating. From these domains, one was chosen at random for topic modeling (Step B); cross-checking the validity of the framework can later be done using other software domains. A training set of 200 sentences was selected at random from the pros, cons and summary fields. Each sentence was annotated by a human expert with the characteristic it belongs to. Sentences that do not have any clear characteristic were assigned to a negative example list. The whole ISO document was also taken for topic modeling. Both documents were filtered of stop words, special characters and non-English words. Other types of documents, such as those from Google search, are not considered at this stage. No synonym or word-similarity expansion using WordNet was applied.
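The cleaning step described above can be sketched as follows: special characters are stripped, stop words removed, and tokens that do not appear in an English word list are dropped, using NLTK's stopwords and words corpora (both need to be downloaded first); the example sentence is illustrative.

```python
# Sketch of the review-cleaning step: special characters, stop words,
# and non-English (non-dictionary) tokens are removed.
# Requires: nltk.download('stopwords'); nltk.download('words')
import re
from nltk.corpus import stopwords, words

STOPWORDS = set(stopwords.words("english"))
ENGLISH = {w.lower() for w in words.words()}

def clean(sentence: str) -> list:
    sentence = re.sub(r"[^a-zA-Z\s]", " ", sentence)     # drop special chars
    tokens = sentence.lower().split()
    return [t for t in tokens if t not in STOPWORDS and t in ENGLISH]

print(clean("Gr8 app!!! It crashed twice :( and teh installer is slow..."))
```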
B. Topic Modeling
The datasets prepared in Section A were used to extract initial topics using a topic model algorithm known as Latent Dirichlet Allocation (LDA) [7]. Each data set was used alone to extract a list of possible topics. The intuition of LDA is to discover the topics of a given document. LDA was chosen because the exact topic keywords are not known in advance and LDA is an unsupervised algorithm. LDA outperforms other topic modeling algorithms such as LSA and PLSA [10], [12]. LDA was used with default parameters: topics = 10, alpha = 5 (50/topics), beta = 0.01 and 20 words per topic.

Fig. 2. Building the quality in use dictionary: the ISO 25010 document and a sample software domain corpus are topic-modeled, and the resulting topics are mapped to quality in use keywords (the quality in use dictionary).
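A sketch of this topic extraction step using scikit-learn's LDA implementation with the parameters reported above (10 topics, doc-topic prior alpha = 5, topic-word prior beta = 0.01, 20 words per topic) is given below; the handful of input sentences is illustrative, standing in for the ISO document and the sampled review sentences, which in the experiments are modeled separately.

```python
# Sketch: extracting topics with LDA using the parameters reported above.
# The tiny corpus below is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "resources expended in relation to accuracy and completeness",
    "degree to which the system mitigates risk to health or environment",
    "the app is fast but crashes and loses unsaved work",
    "setup was easy and the interface stays responsive on old hardware",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(
    n_components=10,          # topics = 10
    doc_topic_prior=5.0,      # alpha = 5 (50 / topics)
    topic_word_prior=0.01,    # beta = 0.01
    random_state=0,
)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:20]           # 20 words per topic
    print(f"topic {i}:", ", ".join(terms[j] for j in top))
```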
C. Finding Relevant Keywords
Given these initial topics from the ISO document and the review sentences, we construct the quality in use characteristics by taking one word at a time and verifying the word's relevancy to the ISO 25010 quality in use characteristic definitions. Extracted topics were then categorized into three subtopics: effectiveness, efficiency and risks. Duplicate keywords were removed where found.
We have also tried to build the word list automatically. However, our preliminary results show that this is rather unsatisfactory.
TABLE II. SAMPLE EXTRACTED TOPIC KEYWORDS

Effectiveness: Achieve, Appropriateness, Able, Automatic, Control, Behavior, Change
Efficiency: accessibility, availability, background, capacity, compatible, CPU, efficiency
Risk Mitigation: alert, Bug, communication, confidentiality, Cost, crash, Disabilities

Table II shows a fragment of the manually extracted keywords categorized under the three subcategories. A first glance at the outcomes shows rather promising results. Comparing the words with the ISO document reveals that most words are indeed closely related to their subcategories. We believe that by detecting the semantics embedded in the user reviews through semantic analysis, we are able to identify different user experiences with a product, which in turn allows us to better predict the software quality.
VI. CONCLUSION

In this paper, we propose a framework to detect software quality in use, as defined by ISO 25010, from online software reviews. The framework employs LDA topic modeling to build a quality in use data set, semi-supervised learning to expand software quality in use features, and predefined opinion lexicons to calculate the polarity of sentences. We are working on developing this framework further and on applying it to other quality dimensions.

VII. REFERENCES

[1] D. A. Garvin, "What does product quality really mean?," Sloan Management Review, vol. 26, no. 1, pp. 25-43, 1984.
[2] J. A. McCall, P. K. Richards, and G. F. Walters, Factors in Software Quality. General Electric, National Technical Information Service, 1977.
[3] R. G. Dromey, "A model for software product quality," IEEE Transactions on Software Engineering, vol. 21, no. 2, pp. 146-162, Feb. 1995.
[4] D. Samadhiya, S.-H. Wang, and D. Chen, "Quality models: Role and value in software engineering," in Software Technology and Engineering (ICSTE), 2010 2nd International Conference on, 2010, vol. 1, pp. V1-320 to V1-324.
[5] R. E. Al-Qutaish, "Quality models in software engineering literature: an analytical and comparative study," Journal of American Science, vol. 6, no. 3, pp. 166-175, 2010.
[6] D. M. Blei, "Probabilistic topic models," Communications of the ACM, vol. 55, no. 4, pp. 77-84, Apr. 2012.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, Mar. 2003.
[8] L. Zhang and B. Liu, "Identifying noun product features that imply opinions," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, 2011, vol. 2, pp. 575-580.
[9] W. Leopairote, A. Surarerks, and N. Prompoon, "Software quality in use characteristic mining from customer reviews," in Digital Information and Communication Technology and its Applications (DICTAP), 2012 Second International Conference on, 2012, pp. 434-439.
[10] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50-57.
[11] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, no. 1-2, pp. 177-196, 2001.
[12] S. Deerwester and S. Dumais, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, Sep. 1990.
[13] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic analysis," Discourse Processes, vol. 25, no. 2-3, pp. 259-284, 1998.
[14] A. Mukherjee and B. Liu, "Aspect extraction through semi-supervised modeling," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jul. 2012, pp. 339-348.
[15] L. Ku, Y. Liang, and H. Chen, "Opinion extraction, summarization and tracking in news and blog corpora," in Proceedings of the AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
[16] L. Zhang, B. Liu, S. S. H. Lim, and E. O'Brien-Strain, "Extracting and ranking product features in opinion documents," in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Aug. 2010, pp. 1462-1470.
[17] T.-L. Wong, W. Lam, and T.-S. Wong, "An unsupervised framework for extracting and normalizing product attributes from multiple web sites," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 35-42.
[18] G. Qiu, B. Liu, J. Bu, and C. Chen, "Opinion word expansion and target extraction through double propagation," Computational Linguistics, vol. 37, no. 1, pp. 9-27, 2011.
[19] D. Andrzejewski and X. Zhu, "Latent Dirichlet Allocation with topic-in-set knowledge," in Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, Jun. 2009, pp. 43-48.
[20] D. M. Blei and J. D. McAuliffe, "Supervised topic models," arXiv preprint arXiv:1003.0783, 2010.
[21] M. Steyvers and T. Griffiths, "Probabilistic topic models," in Handbook of Latent Semantic Analysis, T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, Eds. Laurence Erlbaum, 2007, pp. 424-440.
[22] G. Qiu, B. Liu, J. Bu, and C. Chen, "Expanding domain sentiment lexicon through double propagation," in Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2009, pp. 1199-1204.
[23] H. Kanayama and T. Nasukawa, "Fully automatic lexicon expansion for domain-oriented sentiment analysis," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 2006, pp. 355-363.
[24] J. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of ICML '01, 2001, pp. 282-289.
[25] Z. Zhai, B. Liu, J. Wang, H. Xu, and P. Jia, "Product feature grouping for opinion mining," IEEE Intelligent Systems, vol. 27, no. 4, pp. 37-44, 2012.
[26] Q. Su, X. Xu, H. Guo, Z. Guo, X. Wu, X. Zhang, B. Swen, and Z. Su, "Hidden sentiment association in Chinese web opinion mining," in Proceedings of the 17th International Conference on World Wide Web, 2008, pp. 959-968.
[27] R. Zhang and T. Tran, "An information gain-based approach for recommending useful product reviews," Knowledge and Information Systems, vol. 26, no. 3, pp. 419-434, Mar. 2010.
[28] L. Shang, H. Wang, X. Dai, and M. Zhang, "Opinion target extraction for short comments," in PRICAI 2012: Trends in Artificial Intelligence, vol. 7458, 2012, pp. 434-439.
[29] R. Krestel and N. Dokoohaki, "Diversifying product review rankings: Getting the full picture," in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on, vol. 1, Aug. 2011, pp. 138-145.
[30] A. Bhattarai, N. Niraula, V. Rus, and K. Lin, "A domain independent framework to extract and aggregate analogous features in online reviews," in Computational Linguistics and Intelligent Text Processing, vol. 7181, A. Gelbukh, Ed. Springer Berlin / Heidelberg, 2012, pp. 568-579.
[31] M. Hu and B. Liu, "Mining and summarizing customer reviews," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168-177.
[32] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
