PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON
DATA SCIENCE
ICDATA’18
Editors
Robert Stahlbock, Gary M. Weiss, Mahmoud Abou-Nasr
Associate Editor
Hamid R. Arabnia
Copying without a fee is permitted provided that the copies are not made or distributed for direct
commercial advantage and credit to the source is given. Abstracting is permitted with credit to the source.
Please contact the publisher for other copying, reprint, or republication permission.
Data mining and machine learning are critically important if we want to learn effectively from
the tremendous amounts of data that are routinely being generated in science, engineering,
medicine, business, and other areas, in order to gain insight into processes and transactions,
extract knowledge, make better decisions, and deliver value to users and organizations. This is even
more important and challenging in an era in which scientists and practitioners face numerous
challenges caused by the exponential expansion of digital data and its diversity and complexity. The
scale and growth of data considerably outpace the technological capacities of organizations to
process and manage it. During the last decade, we have all observed new, more glorious and promising
concepts or labels emerging and slowly but steadily displacing 'data mining' from the agendas of
CTOs. It was, and still is, the time of data science, big data, advanced-/business-/customer-/data-
/predictive-/prescriptive-/…/risk-analytics, to name only a few of the terms that dominate websites,
trade journals, and the general press – although there is even a rebirth of terms such as artificial
intelligence and (machine) learning (e.g., deep learning) in academia, in companies, and even on the
agenda of political decision makers. All these concepts aim at leveraging data for a better
understanding of, and insight into, complex real-world phenomena. They all pursue this objective
using some formal, often algorithmic, procedures, at least to some extent. This is what data
miners have been doing for decades. The very idea of all those similar or identical concepts with
different labels, the idea to think of massive, omnipresent amounts of data as strategic assets, and
the aim to capitalize on these assets by means of analytic procedures is, indeed, more relevant and
topical than ever before. Although there are very helpful advances in hardware and software,
there are still many challenges to be tackled in order to realize the promise of data analytics.
Obviously, technological change is never ending and appears to be accelerating. Right now the
world seems especially focused on machine learning and data mining (not contradictory but
similar or even equivalent to data science), as these disciplines are making an ever increasing
impact on our society. Large multinational corporations are expanding their efforts in these areas
and students are flocking to computer science and related disciplines in order to learn about these
disciplines and take advantage of the many lucrative job opportunities.
The growth in all these areas has been dramatic enough to require changes in nomenclature.
Most of these ‘hot’ technologies and methods are increasingly considered part of the broad field
of data science, and there are benefits to viewing this field as a unified whole, rather than a
collection of disparate sub-disciplines. For this reason, the International Conference on Data
Mining (DMIN), which has been held for 13 consecutive years, has been renamed the
International Conference on Data Science (ICDATA). In an effort to unify the field, the former
International Conference on Advances in Big Data Analytics (ABDA) is being merged into
ICDATA. But ICDATA is still much broader than just data mining and ‘big data.’ It includes all
of the following main topics: All aspects of data mining and machine learning (tasks, algorithms,
tools, applications, etc.), all aspects of big data (algorithms, tools, infrastructure, and
applications), data privacy issues, and data management. The conference is designed to be of
equal interest to: researchers and practitioners; academics and members of industry; and computer
scientists, physical and social scientists, and business analysts.
Data science attracts innovative and influential contributions to both research and practice,
across a wide range of academic disciplines and application domains. Our conference seeks to
acknowledge and facilitate excellence in research and applications in the area of data science. Our
conference is held annually within CSCE. CSCE'18 assembles a spectrum of 20 affiliated
research conferences, workshops, and symposia into a coordinated research meeting. Each
conference has its own program committee, referees, and proceedings. Attendees
have full access to all 20 conferences' sessions, tracks, and tutorials. ICDATA seeks to reflect the
multi- and interdisciplinary nature of data mining and to facilitate the exchange and development
of novel ideas, open communication and networking amongst researchers and practitioners in
different research domains. As in previous years, we hope that the 2018 International Conference
on Data Science will provide a forum to present your research in a professional environment,
exchange ideas, and network and interact across research areas. ICDATA’18 provides an
international and multicultural experience with contributions from 22 different countries. We
consider the resulting diversity of attendees and the mixture of established and early-career
researchers to be a particular advantage of an engaging conference format.
We are very grateful to the many colleagues who helped in organizing the conference. In
particular, we would like to thank the members of the program committee of ICDATA’18 and the
members of the congress steering committee. The continuing support of the ICDATA program
committee has been essential to further improve the quality of accepted submissions and the
resulting success of the conference. The ICDATA’18 program committee members are (in
alphabetical order): Mahmoud Abou-Nasr, Kai Brüssau, Paulo Cortez, Richard de Groof, Diego
Galar, Peter Geczy, Zahid Halim, Tzung-Pei Hong, Wei-Chiang Hong, Ulf Johansson, Madjid
Khalilian, Terje Kristensen, Zhang Sen, Robert Stahlbock, Chamont Wang, Simon Wang, Gary
M. Weiss, Zijiang Yang, and Wen Zhang. Together with additional reviewers, they all did a great
job reviewing many submissions in a short time.
We would also like to thank our publicity co-chair Ashu M. G. Solo (Fellow of the British
Computer Society; Principal/R&D Engineer, Maverick Technologies America Inc.) for
circulating information on the conference, as well as www.KDnuggets.com, a platform for
analytics, data mining and data science resources, for listing ICDATA’18.
Considering the efforts everyone has invested in the quality of the review process, the
conference sessions, and the social program of ICDATA'18, we are confident that you will find
the conference stimulating and rewarding. It is a particular pleasure to provide data mining
oriented invited talks and tutorials presented by the following esteemed members of the data
mining community: Richard Dunks (Datapolitan, USA), Diego Galar (Luleå University of
Technology, Sweden), Peter Geczy (AIST, Japan), Dawud Gordon (TwoSense, USA), Andrew H.
Johnston (Mandiant, USA), and Ulf Johansson (Jönköping University, Sweden).
We express our gratitude to the keynote, invited, and individual conference/track and tutorial
speakers - the list of speakers appears on the conference web site. We would also like to thank the
following: UCMSS (Universal Conference Management Systems & Support, California, USA)
for managing all aspects of the conference; Dr. Tim Field of APC for coordinating and managing
the printing of the proceedings; and the staff of Luxor Hotel (Convention department) at Las
Vegas for the professional service they provided.
Last but not least, we wish to express again our sincere gratitude and respect towards our
colleague and friend Prof. Hamid R. Arabnia (Professor, Department of Computer Science,
University of Georgia, USA; Editor-in-Chief, Journal of Supercomputing/Springer), General
Chair and Coordinator of the federated congress, and also Associate Editor of ICDATA’18 for his
excellent and tireless support, organization and coordination of all affiliated events. His
exemplary and professional effort in 2018 and all the many years before in the steering committee
of the congress makes these events possible. We are grateful to continue our data science
conference as ICDATA’18 under the umbrella of the CSCE congress.
Thank you all for your contribution to ICDATA’18! We hope that you will experience a
stimulating conference with many opportunities for future contacts, research and applications.
Robert Stahlbock
ICDATA’18 General Conference Chair
Mining Significant Terminologies in Online Social Media Using Parallelized LDA for the Promotion of Cultural Products … 3
Richard de Groof, Haiping Xu, Jurui Zhang, Raymond Liu
Multi-label Classification of Single and Clustered Cervical Cells Using Deep Convolutional Networks … 10
Melanie Kwon, Mohammed Kuko, Vanessa Martin, Tae Hun Kim, Sue Ellen Martin, Mohammad Pourhomayoun
Credit Default Mining using Combined Machine Learning and Heuristic Approach … 16
Sheikh Rabiul Islam, William Eberle, Sheikh Khaled Ghafoor
Customer Level Predictive Modeling for Accounts Receivable to Reduce Intervention Actions … 23
Michelle LF Cheong, Wen Shi
Message Classification for Generalized Disease Incidence Detection with Topologically Derived Concept Embeddings … 42
Mark Abraham Magumba, Peter Nabende, Ernest Mwebaze
Effective Machine Learning Approach to Detect Groups of Fake Reviewers … 74
Jayesh Soni, Nagarajan Prabakar
Sentiment Probing of Social Media Data using Various Supervised Learners … 79
Sourangshu Das, S. Kami Makki
RNN as a Multivariate Arrival Process Model: Modeling and Predicting Taxi Trips … 105
Xian Lai, Gary Weiss
Open-Source Neural Network and Wavelet Transform Tools for Server Log Analysis … 130
Chunyu Liu, Tong Yu
Recursion Identify Algorithm for Gender Prediction with Chinese Names … 137
Hua Zhao, Fairouz Kamareddine
Maximizing the Processing Rate for Streaming Applications in Apache Storm … 143
Ali Al-Sinayyid, Michelle Zhu
Machine Learning Models for Predicting Fracture Strength of Porous Ceramics and Glasses … 147
Saishruthi Swaminathan, Tejash Shah, Birsen Sirkeci-Mergen, Ozgur Keles
Finding a Balance Between Interestingness and Diversity in Sequential Pattern Mining … 170
Rina Singh, Jeffrey A. Graves, Lemuel R. Waitman, Douglas A. Talbert
Effective Grouping of Unlabelled Texts using A New Similarity Measure for Spectral Clustering … 181
Arnab Roy, Tanmay Basu
Dyn2Vec: Exploiting Dynamic Behaviour using Difference Networks-based Node Embeddings for Classification … 194
Sandra Mitrovic, Jochen De Weerdt
A Generalized Method for Fault Detection and Diagnosis in SCADA Sensor Data via Classification with Uncertain Labels … 201
Md Ridwan Al Iqbal, Rui Zhao, Qiang Ji, Kristin P. Bennett
FAWCA: A Flexible-greedy Approach to find Well-tuned CNN Architecture for Image Recognition Problem … 214
Md Mosharaf Hossain, Douglas A. Talbert, Sheikh Khaled Ghafoor, Ramakrishnan Kannan
A Multivariate Linear Regression Analysis of In Vitro Testing Conditions and Brain Biomechanical Response under Shear Loads … 227
Folly Crawford, Jennifer Fisher, Osama Abuomar, Raj Prabhu
An Effective Nearest Neighbor Classification Technique Using Medoid Based Weighting Scheme … 231
Avideep Mukherjee, Tanmay Basu
SESSION: POSTER PAPERS AND EXTENDED ABSTRACTS
Yet Another Weighting Scheme for Collaborative Filtering Towards Effective Movie Recommendation … 237
Anurag Banerjee, Tanmay Basu
Predictive Hybrid Machine Learning Model for Network Intrusion Detection … 258
Ebrahim Alareqi, Khalid Abed
Acceleration of Python Artificial Neural Network in a High Performance Computing Cluster Environment … 268
Christopher Rosser, Ebrahim Alareqi, Anthony Wright, Khalid Abed
SESSION: REAL-WORLD DATA MINING APPLICATIONS, CHALLENGES, AND PERSPECTIVES + MACHINE LEARNING
Chair(s): Dr. Mahmoud Abou-Nasr, Dr. Robert Stahlbock, Dr. Gary M. Weiss
Mining Significant Terminologies in Online Social Media Using Parallelized LDA for the Promotion of Cultural Products

Richard de Groof, Haiping Xu, Jurui Zhang, and Raymond Liu
Computer and Information Science Department, University of Massachusetts Dartmouth, Dartmouth, MA, USA
Department of Marketing, University of Massachusetts Boston, Boston, MA, USA
Abstract - Despite the growing popularity of online social media, there are very few research
efforts to use online social media to study market strategies for the promotion of cultural
products. With online content being largely unregulated, Latent Dirichlet Allocation (LDA)
provides a useful mechanism for organizing textual data and deriving conclusions about the
subject matter. In this paper, we introduce a parallelized LDA, called pLDA, to analyze clustered
textual data in online social media. We use pLDA to infer the posterior of latent topics over
documents and words, and identify significant terminologies that describe the vast number of
posts. Making use of sentiment analysis, we are able to further make suggestions about the
relevant topics for promoting cultural products. Finally, we use a case study of the music
industry to demonstrate how the most relevant aspects to artist popularity can be derived.

Keywords: Cultural products, online social media, text mining, latent Dirichlet allocation,
topic modeling

Introduction

The proliferation of online social media technologies has resulted in a tremendous amount of
information becoming readily available. Social media websites such as Twitter and Facebook
provide the users an opportunity to speak their mind on pages hosted by everyone from
celebrities to artists, from young kids to pop shop owners. Due to its popularity, the potential
utility of online social media in promoting cultural products is being increasingly recognized.
The social media sites typically require artists, such as musicians, to establish an online
presence by which they may disseminate their brand names. Much of the feedback posted online
could be useful to identify trends and understand what is important in cultural products, such
as those produced by musicians. However, mining social media has been a difficult task because
so much of the information is in the form of unstructured free text. Previous work has focused
on the application of text mining techniques that require manual interpretation. Since there is
too much information to provide manual labels, there is a pressing need to develop tools to
automatically categorize text and provide meaningful interpretation. To achieve this, it is
necessary to identify the subject matter involved. One of the most popular topic modeling
techniques is called Latent Dirichlet Allocation (LDA), which is a powerful method for learning
topic distributions in text [ ]. LDA is an unsupervised topic modeling technique for mining text
data and deriving latent topic distributions. Like other unsupervised techniques, it does not
require a labeled training set for its operation. This makes it very useful in the context of a
large amount of uncategorized data, such as musicians' Facebook pages. However, to make
meaningful classifications, a user must inspect the results to determine the associations
between hidden topics and documents. In addition, the computational complexity of this
methodology may render its use limited for massive documents.

In this paper, we introduce an LDA-based approach to mining significant terminologies in online
social media for the promotion of cultural products. Our unique process uses a post-processing
step to LDA, which allows us to broadly categorize the posts with less manual intervention. Once
we know the subject matter of the posts, we may use Stanford Core NLP to provide sentiment
analysis, deriving the negative or positive orientation of each. Using meaningful indicators,
such as the Facebook numbers of followers, and solving a system of equations characterizing each
artist, we may derive the relationship between these topics and artist popularity. Generally, a
linear system of equations may indicate relevance of one or more common attributes (calculated
sentiment scores of commonly occurring subject matter) attributed to each artist. Based on
various factors that are important to culture products, artist popularity can be considered as
the dependent factor, and brand niche and audience consensus, amongst others, as independent
factors using such equations [ ]. We focus on the text content of the online social media and
use sentiment analysis to determine the orientation and thereby relevance of each independent
factor to artist popularity. To improve the efficiency of our approach, we introduce a
parallelized LDA procedure, called pLDA, to mine text-based online social media. In a case study
of mining online social media for the music industry, we show that our approach can not only
effectively identify the most relevant aspects to artist popularity according to the sentiments
expressed by the listening audience, but also run faster than the sequential version of the LDA
mechanism.

This material is based upon work supported by the President's Creative Economy (CE) Fund,
University of Massachusetts.
Related Work

Researchers have increasingly recognized that online social media offers a breadth of
information related to the promotion of cultural products. Goh et al. studied the effect of
user-generated and marketer-generated content in social media on consumers' repeated purchase
behaviors [ ]. They used commercial text mining applications to analyze text gathered from
Facebook and found that user-generated content had a more significant impact. Similarly, Kim et
al. analyzed text reviews of hotels on TripAdvisor to discover satisfiers and dissatisfiers, as
well as the reasons why people leave positive or negative reviews [ ]. They also leveraged a
feature on TripAdvisor which allows the user to leave a categorical rating. He et al. studied
the pizza industry using text posts from Facebook and Twitter in an effort to gain marketing
insight [ ]. Using existing text mining tools, they identified themes in the data that they used
to categorically compare three major pizza chains. Unlike the above work, we introduce our novel
parallelized text mining approach, and our unique procedure allows the identification of topics
to be largely automatic.

There are also a few previous research efforts focused on the use of LDA to analyze online
social media. Qiang et al. incorporated features such as "geo-tracking" to aid in the
identification of geographical topics from social media [ ]. Their method is based on the LDA
model using generation probabilities, which generates each keyword from either a local or a
global topic distribution. LDA has also been applied to musical recommendation systems.
Kinoshita et al. proposed a system to describe musical preferences by considering different tags
associated with artist genre and user preferences [ ]. They used Collaborative Filtering (CF)
based similar user selection to recommend music products to users with similar tastes to a
target artist. Such text mining methodologies have been used in a variety of contexts, but often
in their original formulation, applying the returned topic distributions directly [ ]. The
returned matrices are typically interpreted as clusters, which represent various combinations of
underlying topics. In contrast, our approach derives significant terminologies conditioned on
documents, which are highly reflective of the underlying topics. In our approach, we apply the
probability of word given document to the identification of latent topics, and consider
freely-formed text rather than information derived from tags.

Recent work has discussed the parallelization of the LDA process. Newman et al. introduced a
parallel process for LDA, which essentially divides the document set into sections and
distributes it across computation units [ ]. Similarly, Wang et al. introduced PLDA, which
operates in the same fashion but incorporates Hadoop functions [ ]. Liu et al. introduced PLDA+,
which works with a pipeline system to perform the same task [ ]. A critical aspect of these
parallel algorithms is synchronization. In previous work, inference techniques like Collapsed
Gibbs Sampling (CGS) were used on partitions of the data sets [ ]. After each processor had
performed a single iteration, the results were aggregated in a separate step before continuing.
Such results are considered an approximation to outputs of the original LDA algorithm. Since the
same word may occur across multiple documents, the true results must consider the topic
assignment to each in order to attain the true distribution. Different from the above
approaches, our method uses a real-time data synchronization mechanism for better accuracy, not
only making the results more comparable to a sequential implementation, but also providing a
speedup due to parallelization.

Other related work has focused on the application of information derived from social media
towards the music industry. For example, MuSeNet, a network of music artists around the world
linked by professional relationships, was developed as an examination of the social network
aspect of sites like Facebook used in the music artist industry [ ]. Relationships amongst
subscribers may be assembled in a graph to understand underlying phenomena. As such approaches
provide useful insights about online social media data, they are complementary to our research
efforts on analyzing the text-based social media for the promotion of cultural products.

Significant Terminology Identification

Figure [ ] shows the procedure by which text data from an online social media, such as Facebook
pages, are processed. Popular and less-known local artists were identified, and review text is
extracted using browser scripting mechanisms. Using an online tagging service, such as Last.fm,
popular tags are extracted for each artist. Assembling the tags into vectors for each artist,
k-means clustering can be used to separate the artists into similar categories, i.e., artists
with similar proportions of identical tags. We do this for two reasons: to reduce the data size
for running pLDA, and also to support our conclusions that different types of artists receive
comments reflecting different subject matter and useful information. The outputs of the pLDA
process are two matrices, namely θ, representing the topic distribution for each document input
into pLDA, and φ, representing the distribution of words over topics. For pLDA data processing,
each post represents a document, which usually discusses a single subject. For those instances
in which one subject is discussed, clustering by topic distributions associates comments
discussing the same topic together. Also, comments addressing the same multiple topics are
clustered together according to their unique topic distributions.

Once the posts and comments have been extracted and clustered according to artist tags, they are
processed using pLDA. Adapted from [ ], the matrices θ and φ can be calculated using CGS as in
Eqs. ( ) and ( ):

    θ_{m,k} = (n_{m,k}^{¬mn} + α_k) / Σ_{k'=1}^{K} (n_{m,k'}^{¬mn} + α_{k'})

    φ_{k,n} = (n_k^{(n),¬mn} + β_n) / Σ_{r=1}^{N} (n_k^{(r),¬mn} + β_r)

where n_{m,k} counts the tokens in document m assigned to topic k, n_k^{(n)} counts the
assignments of word n to topic k, the superscript ¬mn excludes the current token, and α and β
are the Dirichlet hyperparameters.
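A minimal collapsed Gibbs sampler illustrating the θ and φ estimates above can be sketched as
follows; the toy corpus, hyperparameters, and iteration count are illustrative assumptions, not
the authors' pLDA:

```python
# Toy collapsed Gibbs sampler for LDA, producing the theta/phi point
# estimates of the equations above on a made-up 4-word-vocabulary corpus.
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 2, 0]]   # word ids per document
V, K, alpha, beta = 4, 2, 0.1, 0.01

n_mk = np.zeros((len(docs), K))      # document-topic counts
n_kv = np.zeros((K, V))              # topic-word counts
z = [[int(rng.integers(K)) for _ in d] for d in docs]   # random initial topics
for m, d in enumerate(docs):
    for i, w in enumerate(d):
        n_mk[m, z[m][i]] += 1
        n_kv[z[m][i], w] += 1

for _ in range(200):                 # Gibbs sweeps
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            k = z[m][i]
            n_mk[m, k] -= 1; n_kv[k, w] -= 1   # remove current token (the "¬mn" counts)
            p = (n_mk[m] + alpha) * (n_kv[:, w] + beta) / (n_kv.sum(axis=1) + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))   # resample topic
            z[m][i] = k
            n_mk[m, k] += 1; n_kv[k, w] += 1

# Point estimates corresponding to the theta and phi equations above.
theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
phi = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
```

Each row of `theta` is one document's topic distribution and each row of `phi` is one topic's
word distribution; a parallel variant partitions `docs` across workers and synchronizes the
count tables.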
Preprocessing of Text Data

Before the online media posts are processed with pLDA, they are clustered according to artist
genre tags. As an example shown in Fig. [ ], popular tags for each artist were collected from
Last.fm, with the artists and the artist clusters displayed inside the boxes. Each tag has a
count attributed to it, which represents the number of Last.fm users who have applied that tag.
Counts of commonly occurring tags are assembled into a vector for each artist, where each vector
is normalized so that its elements sum to 1. These vectors are then clustered with k-means to
produce artist clusters composed of similarly classified musicians according to artist genre
tags.

Figure [ ]: Trends identification in cultural product popularity

Derivation of Significant Terminologies
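The tag-vector construction and clustering described above can be sketched as follows; the
artists, tag vocabulary, and counts are hypothetical examples, not Last.fm data:

```python
# Sketch of clustering artists by normalized tag-count vectors with k-means.
import numpy as np
from sklearn.cluster import KMeans

tags = ["folk", "blues", "indie", "rock"]           # common tag vocabulary
tag_counts = np.array([
    [120, 40,   0,  0],    # artist A: mostly folk/blues
    [ 90, 60,   5,  0],    # artist B
    [  0,  5, 110, 80],    # artist C: mostly indie/rock
    [  2,  0,  95, 70],    # artist D
], dtype=float)

# Normalize each artist's vector so its elements sum to 1.
vectors = tag_counts / tag_counts.sum(axis=1, keepdims=True)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(km.labels_)   # artists with similar tag proportions share a cluster label
```

Normalizing first means artists are grouped by tag proportions rather than by raw popularity, so
a lesser-known folk act clusters with a well-tagged one.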
The probability of a topic given a document can be represented equivalently as in Eq. ( ):

    P(z|d) = P(d|z) P(z) / P(d)

By performing standard matrix multiplication of θ and φ, each entry is multiplied and summed
across topics, i.e., the matching inner dimension between the matrices. So the operation
proceeds as in Eq. ( ):

    Σ_{k=1}^{K} (P(d|z_k) P(z_k) / P(d)) P(w|z_k) = Σ_{k=1}^{K} P(d,w|z_k) P(z_k) / P(d)

where k is the topic index. Eq. ( ) holds because of conditional independence of words and
documents over topics. By the definition of conditional independence and the iterations over k,
we can derive the result as in Eq. ( ):

    Σ_{k=1}^{K} P(d,w,z_k) P(z_k) / (P(z_k) P(d)) = Σ_{k=1}^{K} P(d,w,z_k) / P(d)
                                                  = P(d,w) / P(d) = P(w|d)

We interpret the conditional probability P(w|d) as the probability of word significance given
documents. It is the probability of words according to hidden topics, which are prominent in
their respective documents.

System of Popularity Equations

The significant terminology clusters are defined as in Eq. ( ). According to Definition [ ], the
multiplication of θ and φ yields the significant terminologies. Once the results of pLDA have
been clustered using k-means, we have groups of documents representing similar underlying topic
distributions. Each theta cluster is multiplied with φ from that batch of pLDA using standard
matrix multiplication. This operation iterates over topics, consistent with the procedure
outlined in Eq. ( ):

    θ_B × φ = [θ_{1,1} … θ_{1,K}; …; θ_{M,1} … θ_{M,K}] × [φ_{1,1} … φ_{1,V}; …; φ_{K,1} … φ_{K,V}],
    B = 1, …, N

where each θ_B is one of the N theta clusters, with rows over documents and columns over topics,
and φ has rows over topics and columns over words.

Once the significant terminologies have been calculated and the documents have been clustered,
we can derive patterns in the text. For example, by reviewing the posts for musicians, we may
find that they tend to fall broadly into three categories: descriptions of live shows,
descriptions and recommendations of new albums, and discussion of streaming services on which
the artists appear. The clusters are readily identified by keywords occurring throughout the
clustered documents according to word significances. The probabilities of word significances for
the various words in a cluster are summed across the set of documents to determine the greatest
probabilities. The top occurring words, as measured by word significance, are the significant
terminologies for the cluster.

Figure [ ] shows an example of methods for assigning topics to comments by significant
terminologies in the domain of the music industry. As shown in the figure, a significant
terminology for the live shows category that appears consistently is "show". The word "album"
appears in the context of albums. The words "spotify", "video", and "bandcamp" appear in the
streaming category.

Figure [ ]: Classification by occurrence of significant terminologies (a decision tree testing
whether a comment contains "record", "ep", "album", or "lp" (ALB: Album); "show" (SHO: Show); or
"spotify", "video", "bandcamp", or "iTunes" (STR: Streaming); otherwise NUL: null)

Using this approach, the clusters can be identified readily by the occurrence or non-occurrence
of the significant terminologies. As such, it greatly simplifies the classification process, and
hundreds of comments may fall into one category or another quickly and accurately by the
inclusion or exclusion of the significant terminologies.

Furthermore, Stanford Core NLP can be used to classify sentiment on a positive-negative spectrum
ranging between 0 and 4, with 0 being the most negative and 4 being the most positive. The
identified comments for artists are classified by sentiment and the results summed across
musicians, cluster, and the identified topic, as in Eq. ( ):

    streaming_i = Σ_j SentScore(Comment_{i,j}^{streaming})
    album_i     = Σ_j SentScore(Comment_{i,j}^{album})
    shows_i     = Σ_j SentScore(Comment_{i,j}^{shows})

where the subscript i indicates that this is the i-th artist. The superscript over Comment
indicates that the comment has been labeled as that topic. The subscripts i and j under Comment
indicate that this is the j-th comment belonging to the i-th artist, labeled by the superscript.
The method SentScore is a Stanford Core NLP process that returns the corresponding sentiment
score value. Once the sentiment scores are summed, they are averaged by the number of comments
attributed to that topic. This is to provide a relative value, as a large number of comments may
artificially inflate the value, while an average would better represent the overall score.

As part of the data collection process, the number of followers of each artist posted on their
social media pages are retrieved and, with the sentiment scores, assembled in a system of linear
equations as shown in Eqs. ( )-( ).
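The θ × φ multiplication and top-word selection described above can be sketched in a few lines
of NumPy; the θ, φ, and vocabulary here are small made-up examples, not outputs of pLDA:

```python
# Sketch of deriving significant terminologies: theta @ phi gives P(w|d)
# per document; summing over a document cluster and taking the largest
# entries yields that cluster's significant terminologies.
import numpy as np

vocab = np.array(["show", "album", "spotify", "video"])
theta = np.array([[0.9, 0.1],          # doc 0: mostly topic 0
                  [0.8, 0.2],          # doc 1
                  [0.1, 0.9]])         # doc 2: mostly topic 1
phi = np.array([[0.70, 0.10, 0.10, 0.10],    # topic 0 word distribution
                [0.05, 0.15, 0.50, 0.30]])   # topic 1 word distribution

p_w_given_d = theta @ phi              # each row is P(w|d) for one document
cluster_docs = [0, 1]                  # a k-means cluster of documents
significance = p_w_given_d[cluster_docs].sum(axis=0)
top_terms = vocab[significance.argsort()[::-1][:2]]
print(top_terms)                       # the cluster's significant terminologies
```

Because θ's rows and φ's rows are probability distributions, each row of the product also sums
to 1, matching the P(w|d) interpretation above.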
&DVH6WXG\
,QRUGHUWRGUDZFRQFOXVLRQVDERXWWKHPXVLFLQGXVWU\
ZHFROOHFWHGFRPPHQWVIURPDUWLVWVIHDWXUHGRQ
:80% D UDGLR VWDWLRQ RI 8QLYHUVLW\ RI 0DVVDFKXVHWWV
%RVWRQ WKDW EURDGFDVWV DQ $PHULFDQD%OXHV5RRWV)RON
PL[ DV ZHOO DV IURP D OLVW RI %RVWRQ 0DVVDFKXVHWWV ORFDO
DUWLVWV IHDWXUHG RQ WKHGHOLPDJD]LQHFRP D GDLO\ XSGDWHG )LJXUH$UWLVWFOXVWHUOHYHOWRSLFVLJQLILFDQFHE\SRSXODULW\
ZHEVLWHFRYHULQJ1RUWK$PHULFDQPXVLFVFHQHVWKURXJK
GHGLFDWHG VHSDUDWH EORJV )ROORZLQJ WKH SURFHGXUH
RXWOLQHGLQ6HFWLRQZHFODVVLILHGWKHFRPPHQWVSURGXFHG
WKH VLJQLILFDQW WHUPLQRORJLHV DQG GHULYHG WKH WRSLF
significance with regards to artist popularity. Amongst the different types of music, different factors appear to contribute to artist success. Some bands are known more for live performances and tend to promote and discuss these on their Facebook pages. Other bands are usually promoting a new album when they post to online social media.

Artist Cluster Level Analysis

Artist clusters from both the WUMB and thedelimagazine.com lists show that different types of artists have higher correlations with one topic or another. The WUMB artist cluster representing blues and folk artists has a higher correlation with the streaming services and less correlation with album. This indicates that these artists are not using online media to promote their albums sold in record stores to the same degree that they are promoting streaming services, which provide online access to their music. Similarly, the thedelimagazine.com artist group classified as hard rock indie follows the same pattern. The WUMB artist cluster folk singer-songwriter, however, does not have any correlation with streaming; these artists mostly promote new albums in the conventional fashion. In addition, the thedelimagazine.com artist cluster rock indie mostly promotes shows but has no mention of streaming.

Moving forward, to analyze a particular type of music, in the figure below we show two artist clusters with topic significance which are both described as folk by Last.fm.

Figure: Artist clusters with topic significance for folk music

Artist Level Analysis

The table below shows three artists drawn from three different artist clusters. Artist X belongs to the WUMB folk cluster, and artists Y and Z belong to the WUMB artist cluster "blues folk" and the thedelimagazine.com artist cluster "hard rock indie", respectively (both shown in the cluster figures). With a comparable number of comments, however, artist X is less than half as popular as artist Y. Similarly, artist Z is more than twice as popular as artist Y. This is largely due to their streaming scores, which are the cumulative sentiment scores for artists in the streaming category. For example, in the case of artists Y and Z, the streaming scores increase at a rate greater than double for artist Z despite there being fewer than double the comments between artists Y and Z.

Table: Comparison of artists from different clusters
Artist ID | #Followers | Streaming Score | #Comments
Artist X  |            |                 |
Artist Y  |            |                 |
Artist Z  |            |                 |
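The streaming score used above (a cumulative sentiment score over an artist's streaming-related comments) can be illustrated with a toy aggregation. This is a hypothetical sketch with made-up data and a helper name of our own; it is not the paper's pipeline:

```python
def streaming_score(comments):
    """Sum the sentiment values of an artist's comments whose
    mined topic falls in the 'streaming' category."""
    return sum(sentiment for topic, sentiment in comments
               if topic == "streaming")

# Hypothetical (topic, sentiment) pairs mined from social media comments.
artist_y = [("streaming", 2), ("album", 1), ("streaming", 1)]
artist_z = [("streaming", 2), ("streaming", 2), ("streaming", 3)]
print(streaming_score(artist_y), streaming_score(artist_z))  # 3 7
```

With this kind of per-category accumulation, an artist with fewer comments can still outscore another if the streaming-related comments carry more positive sentiment, which matches the artist X/Y/Z comparison above.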
Abstract—Cytology-based screening through the Papanicolaou test has dramatically decreased the incidence of cervical cancer. Convolutional neural networks (CNN) have been utilized to classify cancerous cervical cytology cells but have primarily focused on pre-processed nuclear details for a single cell and binary classification of normal versus abnormal cells. In this study, we developed a novel system for multiple label classification with a focus on both the nucleus and cytoplasm of single cells and cell clusters. In this retrospective study, we digitized cervical cytology slides from 104 patients. Based upon the Bethesda system, the established criteria for diagnosing cervical cytology, cells of interest were categorized. With 10-fold cross validation, our CNN algorithm demonstrated 84.5% overall accuracy, 79.1% sensitivity, and 89.5% specificity for normal versus abnormal. For 3-level classification of normal, low-grade, and high-grade, the CNN demonstrated 76.1% overall accuracy. Results show promise on the utility of CNNs to learn cervical cytology.

…thousands of cells by a trained cytotechnologist and pathologist. Also, the FDA enforces workload limits to restrict the number of slides that can be screened (100 slides per individual per 24 hours). Despite the Bethesda System, an established set of criteria and guidelines for detecting cell abnormalities, considerable subjectivity and inter-observer variability still exist [2,3].

To address these difficulties, research has been directed towards developing computer-aided reading systems, i.e., convolutional neural networks (CNN) [4-6]. By designing an automated method for detecting cancerous cells, medical professionals would be able to increase workload throughput while also reducing screening subjectivity. With the recent FDA approval of digital pathology for primary surgical pathology diagnostics [7], widespread use of automated systems in the cytopathology workflow is a possibility in the future.
carcinoma [11], detecting cancer metastases [12], mitosis detection in breast cancer [13], etc. Instead of using handcrafted features to represent the complexity of learning histology or cytology, feature engineering is done automatically by the CNN to learn multiple hierarchical representations of the data. Motivated by the successes of applying deep learning methods to other pathology medical images, several researchers conducted studies on using a deep learning approach to classify abnormal/dysplastic cells labeled according to the Bethesda system.

Bora et al. trained a CNN from scratch on AlexNet, followed by an unsupervised technique for feature selection/reduction, and developed both a binary normal/abnormal classifier and a 3-class classifier consisting of negative for intraepithelial lesion or malignancy (NILM), low-grade intraepithelial lesion (LSIL), and high-grade intraepithelial lesion (HSIL) [4]. Hyeon et al. used a pre-trained CNN (VGGNet-16 trained on ImageNet) as a feature extractor and developed a binary normal/abnormal classifier [5]. Zhang et al. similarly developed a binary normal/abnormal classifier by transferring features from a model pretrained on ImageNet and fine-tuning it on a pap smear dataset [6]. In all of these prior studies, the datasets focused primarily on discrete, individual cells and utilized the Herlev dataset, a publicly available cytology image collection [14]. However, the Herlev dataset consists of many images that lack cytoplasmic borders (Figure 1). Thus, assessment of nuclear size in relationship to the amount of cytoplasm (nuclear to cytoplasmic ratio) cannot be performed. The nuclear to cytoplasmic ratio is important to distinguish low-grade versus high-grade intraepithelial lesions.

Fig. 1. HSIL examples from the publicly available Herlev dataset. Images are focused on nuclear detail without a clear external cytoplasm boundary.

III. DATA COLLECTION
Our study data involves the retrospective review of cytology specimens from 104 LAC-USC Medical Center patients that underwent the Pap test. This data was collected in collaboration with the Department of Pathology at LAC-USC Medical Center. In this USC IRB-approved study, the cells of interest were captured via whole-slide scanning equipment without patient information or identifiers.

The specimens were prepared using the ThinPrep® liquid-based cytology preparation method and digitally captured at a 40x magnification objective, using both the Leica SCN400F and the Olympus VS120 whole slide scanners. The Leica scanner was set to output .scn format and the Olympus scanner was set to output .tif format. Configuring the image output setting was important to ensure read compatibility with OpenSlide (an open-source library for reading proprietary digital pathology image formats and annotation software) [15]. At the time of this study, the native Olympus .vsi output format was not supported by OpenSlide.

Fig. 2. Whole slide scan thumbnail image of Leica .scn, read with OpenSlide.

After capturing the whole-slide images, areas of interest were digitally annotated by two pathologists (THK, VM) according to the Bethesda system [16]. We used the annotation interface of the open-source digital pathology software QuPath to navigate the whole slide images and draw labeled bounding boxes around cells of interest [17].

A. The Bethesda System
equivalences in medical terminology are shown in Table I. Additional categories of atypical squamous cells of undetermined significance (ASCUS) and atypical squamous cells cannot exclude a HSIL (ASC-H) were also included in the dataset.

TABLE I. CERVICAL CYTOLOGY TERMINOLOGY
Bethesda System | Dysplasia system | Original CIN system | Modified CIN system
NILM (Negative for intraepithelial lesion or malignancy) | Normal | Normal | Normal
ASCUS (Atypical squamous cells of undetermined significance) | Atypia suggestive but not sufficient for LSIL | |
LSIL (Low-grade squamous intraepithelial lesion) | Mild dysplasia | CIN I | Low-grade CIN
ASC-H (Atypical squamous cells cannot exclude a HSIL) | Atypia suggestive but not sufficient for HSIL | |
HSIL (High-grade squamous intraepithelial lesion) | Moderate dysplasia / Severe dysplasia | CIN II / CIN III | High-grade CIN

In regards to computer vision tasks for new categories, it is advised to utilize either feature extraction or fine-tuning from a pre-trained network (usually trained on natural images such as ImageNet). Because the new dataset is relatively small and the nature of cytology image data is very different from ImageNet, it is advised to train an SVM classifier on activations from earlier in the network, where lower-level learned features exist (detecting edges, blobs of color, etc.) [18]. However, even for small datasets it is possible to train a smaller model from scratch by leveraging data augmentation to generate additional training data and reduce overfitting [19].

B. Dataset
Table II shows the distribution of the study's collected annotated cytology regions according to Bethesda category. The larger categories were downsampled by random selection to equalize distribution counts. Due to the relatively rare numbers of endometrial cells, these were excluded from the final dataset. Low-grade consisted of ASCUS and LSIL cases, and high-grade consisted of ASC-H and HSIL cases. The final distributions with the respective 2 and 3 classifications are shown in Table III.

TABLE III. MERGED DATASET DISTRIBUTION
Count | 2 Classifications
2298 | Normal
2298 | Abnormal
Count | 3 Classifications
1332 | NILM
1279 | Low Grade – ASCUS, LSIL
1066 | High Grade – ASC-H, HSIL
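The label merging and downsampling described above (ASCUS and LSIL merged into low-grade, ASC-H and HSIL into high-grade, and larger classes randomly subsampled to balance the distribution) can be sketched as follows. This is an illustrative helper of our own, not the paper's code; the label names follow the paper:

```python
import random

# Bethesda label -> merged 3-class label; endometrial cells are excluded.
MERGE_3 = {"NILM-Squamous": "NILM", "NILM-Endocervical": "NILM",
           "ASCUS": "Low Grade", "LSIL": "Low Grade",
           "ASC-H": "High Grade", "HSIL": "High Grade"}

def merge_and_balance(samples, rng):
    """Map Bethesda labels to merged 3-class labels, then randomly
    downsample every class to the size of the smallest one."""
    merged = {}
    for cell_id, label in samples:
        if label in MERGE_3:  # labels outside the mapping are dropped
            merged.setdefault(MERGE_3[label], []).append(cell_id)
    n = min(len(ids) for ids in merged.values())
    return {cls: rng.sample(ids, n) for cls, ids in merged.items()}

rng = random.Random(0)
samples = ([(i, "NILM-Squamous") for i in range(100)]
           + [(i, "LSIL") for i in range(100, 140)]
           + [(i, "HSIL") for i in range(140, 170)])
balanced = merge_and_balance(samples, rng)
print({cls: len(ids) for cls, ids in balanced.items()})  # every class: 30
```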
TABLE II. ORIGINAL DATASET DISTRIBUTION
Count | Bethesda System | Merged 2 Category | Merged 3 Category
11010 | NILM - Squamous | Normal | NILM
314 | NILM - Endocervical Cell | Normal | NILM
5 | NILM - Endometrial | excluded | excluded
559 | ASCUS | Abnormal | Low Grade
719 | LSIL | Abnormal | Low Grade
282 | ASC-H | Abnormal | High Grade
788 | HSIL | Abnormal | High Grade

IV. METHODS
A. Data Preprocessing and Augmentation
To prepare image files in an appropriate format for the network, we decode the RGB pixel values into floating point tensors and rescale these values (originally in the range 0 to 255) to be within the range [0, 1]. This is done by simply dividing each channel by 255. For image preprocessing, we utilized the ImageDataGenerator class of the Keras deep learning library to read files from directory and create batches of pre-processed tensors. Additionally, for data augmentation we applied random rotation, width/height shifting, and horizontal/vertical flipping. Because nucleus size and shape relative to cytoplasm is important in distinguishing low versus high grade, we avoided any shearing or zooming transformations to prevent distortion.
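The rescaling and the label-preserving augmentations above can be sketched in plain NumPy. This is a minimal illustration rather than the paper's Keras ImageDataGenerator pipeline; the function names and the shift range are our own, and shearing/zooming are deliberately absent so the nuclear-to-cytoplasmic ratio is not distorted:

```python
import numpy as np

def rescale(image_u8):
    """Convert 8-bit RGB pixel values (0-255) to floats in [0, 1]."""
    return image_u8.astype(np.float32) / 255.0

def augment(image, rng):
    """Random horizontal/vertical flips and a random width shift.
    No shear or zoom, so cell shape and N/C ratio are preserved."""
    if rng.random() < 0.5:
        image = image[:, ::-1]     # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]     # vertical flip
    shift = int(rng.integers(-10, 11))  # width shift in pixels (our choice)
    return np.roll(image, shift, axis=1)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(150, 150, 3), dtype=np.uint8)
x = augment(rescale(img), rng)
print(x.shape, x.dtype)  # (150, 150, 3) float32
```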
Layer Type | Filter size | Stride | Padding | Output Shape
Dropout | - | - | - | 6272
FC5 | - | - | - | 512

We trained this architecture on the Keras library using the Tensorflow backend with an Nvidia GeForce GTX 1060 6GB [20].

Accuracy = (True positives [TP] + True negatives [TN]) / (TP + False positives [FP] + False negatives [FN] + TN)
Sensitivity or true positive rate (TPR) = TP / (TP + FN)
Specificity or true negative rate (TNR) = TN / (FP + TN)
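The three metrics above are direct ratios of confusion-matrix counts; a small self-contained sketch (the helper names and the example counts are ours, chosen only for illustration):

```python
def accuracy(tp, fp, fn, tn):
    """Overall accuracy: correct predictions over all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)

def sensitivity(tp, fn):
    """True positive rate: abnormal cells correctly flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: normal cells correctly passed."""
    return tn / (fp + tn)

# Hypothetical counts for illustration only (not the paper's data).
tp, fp, fn, tn = 80, 10, 20, 90
print(accuracy(tp, fp, fn, tn))   # 0.85
print(sensitivity(tp, fn))        # 0.8
print(specificity(tn, fp))        # 0.9
```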
C. Training and Testing
After employing pre-processing/data augmentation on each training image using ImageDataGenerator and resizing to a (150, 150, 3) tensor shape input, we trained the model using binary cross entropy as the loss function for binary classification. For the 3-level classification, we trained the model using categorical cross entropy as the loss function. For both binary and 3-level classification, we used the RMSProp optimization algorithm with a learning rate set at 0.0001. We trained up to 130 epochs, saving weight checkpoints every 5 epochs. At the end of training, the model with the lowest validation loss was chosen to evaluate test images. These hyperparameters were chosen as an initial configuration for training the CNN, but were not necessarily tested as the optimal choices for learning.

V. RESULTS
The binary classification of normal and abnormal demonstrated 84.5% overall accuracy, 79.1% sensitivity, and 89.5% specificity averaged over 10-fold cross validation. The 3-level classification of NILM, LSIL, and HSIL demonstrated 76.1% overall accuracy (Table V).

TABLE V. RESULTS OF BINARY AND 3-LEVEL CLASSIFICATION
# of Classes | Cross Validation | Accuracy | Sensitivity | Specificity
2 | 10-fold | 84.5% | 79.1% | 89.5%
3 | 10-fold | 76.1% | - | -
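The model-selection step described above (checkpoints every 5 epochs, keep the one with the lowest validation loss) reduces to an argmin over the saved checkpoints. A schematic sketch with made-up loss values:

```python
def best_checkpoint(val_losses):
    """Given {epoch: validation loss} for saved checkpoints,
    return the epoch whose weights should be used for testing."""
    return min(val_losses, key=val_losses.get)

# Hypothetical validation losses recorded at every 5th epoch.
val_losses = {5: 0.61, 10: 0.48, 15: 0.42, 20: 0.45, 25: 0.50}
print(best_checkpoint(val_losses))  # 15
```

Picking the lowest-validation-loss checkpoint rather than the final epoch guards against evaluating a model that has already begun to overfit.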
The normalized confusion matrix in Figure 5 demonstrates the averaged category accuracies and the relationships between network misclassifications.

Fig. 5. Normalized confusion matrix for 3 level classification, representing the rounded average of each fold's confusion matrix.

VI. DISCUSSION
Our methods demonstrated an overall accuracy of 84.5% for binary abnormal/normal classification and an accuracy of 76.1% for 3-label classification. It is important to note that prior work has focused primarily on nuclear characteristics and single cells, while we have included additional parameters like the nuclear to cytoplasmic ratio, cell clusters, and the more subjective Bethesda categories of ASCUS and ASC-H. These additional parameters provide a more accurate representation of the cells visualized by the cytotechnologists and/or pathologist. Although nuclear size and characteristics are important for distinguishing normal versus abnormal cells, the nuclear to cytoplasmic ratio is critical to distinguish low-grade versus high-grade intraepithelial lesions. Low-grade dysplastic cells have a low nuclear to cytoplasmic ratio (i.e., abundant cytoplasm relative to nuclear size) while high-grade dysplastic cells have a high nuclear to cytoplasmic ratio (i.e., moderate to minimal amounts of cytoplasm compared to nuclear size).

The characteristics of a cell can be best visualized singly. However, cells of interest are often clustered or overlapping, as they are normally cohesive in their native state. Thus, consideration of both single and clustered cells is required.

The confusion matrix of Fig. 5 provides insight into what the CNN misclassifies. False positives mostly consisted of normal cells predicted to be either low-grade or high-grade intraepithelial lesions. A component of NILM consists of endocervical cells, which demonstrate a high nuclear to cytoplasmic ratio and are a known mimicker of high-grade intraepithelial lesions (Figure 6). Ideally, a screening method should have high sensitivity to detect all or most abnormalities. If implemented as a screening method, the CNN's bias towards false positives over false negatives is preferable, as a cytotechnologist and/or pathologist can verify the classification.

For the three-level classification, ASCUS was grouped with LSIL for low-grade, and ASC-H was grouped with HSIL for high-grade, because they share similar morphologic features. Although cases of ASCUS and ASC-H may represent true dysplasia, a subset of these categories also includes non-dysplastic conditions like atrophic changes or reactive/inflammatory changes. To accurately portray the Bethesda System with a CNN, additional data and network optimization are required to classify based upon 5 labels (NILM, ASCUS, LSIL, ASC-H, and HSIL).

Additionally, the final diagnosis requires assimilating multiple data points that include not only the degree of cellular morphologic abnormalities (dysplasia/atypia) but also the quantity of those changes. This component is not accounted for by our CNN trained on isolated cells and cell clusters. To address this issue, future efforts may include developing a method that assesses isolated cells and cell clusters in the context of the entire slide.

A. Limitations
As with any application of deep learning, the requirement for vast quantities of quality labeled training data hinders the feasibility of developing a robust system. Unlike natural images scraped from the internet, video, or other easily mined sources, pathology images cannot easily be collected and annotated without concerted coordination and guidance from medical professionals. Pathology images also differ in that, under the microscope, a cytotechnologist/pathologist can interactively adjust the focus to view multiple focus planes or z-stack levels when viewing cell clusters. As part of future applications, incorporating 2-3 different z-stack images of the same cells would provide another avenue of data augmentation.
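The per-fold averaging behind Fig. 5 amounts to row-normalizing each fold's confusion matrix and then averaging element-wise. A generic illustration with two hypothetical 3-class folds (not the paper's data or code):

```python
def row_normalize(cm):
    """Normalize each row of a confusion matrix so it sums to 1."""
    return [[c / sum(row) for c in row] for row in cm]

def average_matrices(mats):
    """Element-wise mean of several equally sized matrices."""
    n, rows, cols = len(mats), len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]

# Two hypothetical fold results (rows: true class, cols: predicted class).
fold1 = [[8, 1, 1], [2, 7, 1], [1, 1, 8]]
fold2 = [[9, 1, 0], [1, 8, 1], [0, 2, 8]]
avg = average_matrices([row_normalize(fold1), row_normalize(fold2)])
print([round(v, 2) for v in avg[0]])  # [0.85, 0.1, 0.05]
```

Row-normalizing before averaging keeps each fold's class sizes from skewing the displayed per-class accuracies on the diagonal.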
VII. CONCLUSION
In this study, we developed a system based on
Convolutional Neural Networks (CNN) for multiple label
classification with a focus on both nucleus and cytoplasm of
single cells and cell clusters. Considering the effectiveness of
deep learning methods being applied to visual recognition tasks
and recent studies’ successes on its application in the field of
pathology, our study examines the application of CNN in the
field of cervical cytology. Future directions may include further
data collection, incorporation of z-stack layering for data
augmentation, and comparing performance between custom
models trained from scratch and transfer learning from pre-trained networks.

Fig. 6. Endocervical cells cluster (left) compared to HSIL (right).
Abstract—Predicting potential credit default accounts in advance is challenging. Traditional statistical techniques typically cannot handle large amounts of data and the dynamic nature of fraud and humans. To tackle this problem, recent research has focused on artificial and computational intelligence based approaches. In this work, we present and validate a heuristic approach to mine potential default accounts in advance, where a risk probability is precomputed from all previous data and the risk probability for recent transactions is computed as soon as they happen. Besides our heuristic approach, we also apply a recently proposed machine learning approach that has not been applied previously to our targeted dataset [15]. As a result, we find that these applied approaches outperform existing state-of-the-art approaches.

Keywords—offline, online, default, bankruptcy

I. INTRODUCTION
In general, we can refer to a customer's inability to pay, their default on a payment, or personal bankruptcy all as potential issues of non-payment. However, each of these scenarios is the result of different circumstances. Sometimes it is due to a sudden change in a person's income source due to job loss, health issues, or an inability to work. Sometimes it is deliberate, for instance, when the customer knows that he/she is not solvent enough to use a credit card anymore, but still uses it until the card is stopped by the bank. In the latter case, it is a type of fraud, which is very difficult to predict and a big issue for creditors.

To address this issue, credit card companies try to predict potential default, or assess the risk probability, on a payment in advance. From the creditor's side, the earlier the potential default accounts are detected, the lower the losses [5]. For this reason, an effective approach for predicting a potential default account in advance is crucial for creditors if they want to take preventive actions. In addition, they could also investigate and help the customer by providing necessary suggestions to avoid bankruptcy and minimize the loss.

Analyzing millions of transactions and making a prediction based on that is time consuming, resource intensive, and sometimes error prone due to the dynamic variables (e.g., balance limit, income, credit score, economic conditions, etc.). Thus, there is a need for optimal approaches that can deal with the above constraints. In our previous work [3], we proposed an approach that precomputes all previous data (offline data) and calculates a score. Subsequently, it waits for a new transaction (online data) to occur and calculates another score as soon as the transaction occurs. Finally, all scores are combined to make a decision. We used the term OLAP data for offline data and OLTP data for online data in our previous work [3]. The main limitations of the previous work were the use of a synthetic dataset and a lack of validation of the proposed model using a publicly available, real-world dataset. Online Analytical Processing (OLAP) systems typically use archived historical data over several years from a data warehouse to gather business intelligence for decision-making and forecasting. On the other hand, Online Transaction Processing (OLTP) systems only analyze records within a short window of recent activities - enough to successfully meet the requirements of current transactions within a reasonable response time [19][3].

Currently, a variety of machine learning approaches are used to detect fraud and predict payment defaults. Some of the more common techniques include K Nearest Neighbor, Support Vector Machine, Random Forest, Artificial Immune System, Meta-Learning, Ontology Graph, Genetic Algorithms, and ensemble approaches. However, a potential approach that has not been used frequently in this area is Extremely Random Trees, or Extremely Randomized Trees (ET) [20]. This approach came about in 2006, and is a tree-based ensemble method for supervised classification and regression problems. In Extremely Randomized Trees, randomness goes further than the randomness in Random Forest. In Random Forest, the splitting attribute is determined by some criterion identifying the best attribute to split on at that level, whereas in ET the splitting attribute is chosen in an extremely random manner in terms of both variable index and splitting value. In the extreme case, this algorithm randomly picks a single attribute and cut-point at each node, which leads to totally randomized trees whose structures are independent of the target variable values in the learning sample [20]. Moreover, in ET, the whole training set is used to train each tree instead of using bagging to produce the training set as in Random Forest. As a result, ET gives a better result than Random Forest for a particular set of problems. Besides accuracy, the main strength of the ET algorithm is its computational efficiency and robustness [20]. While ET does reduce the variance at the expense of an increase in bias, we will use this algorithm as the foundation for our proposed approach.

The following sections discuss related research, followed by our proposed approach. We then present the data that will be used, and our experimental results. We then conclude with some observations and future work.
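The core difference from Random Forest described above, choosing both the split attribute and the cut-point at random, can be sketched in a few lines. This is an illustrative toy of the extreme case, not the ET implementation from [20]; the function name is ours:

```python
import random

def extremely_random_split(X, rng):
    """Pick a split the way Extremely Randomized Trees does in the
    extreme case: a uniformly random feature index and a uniformly
    random cut-point between that feature's observed min and max."""
    n_features = len(X[0])
    feature = rng.randrange(n_features)
    values = [row[feature] for row in X]
    lo, hi = min(values), max(values)
    threshold = rng.uniform(lo, hi)
    return feature, threshold

rng = random.Random(42)
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
feature, threshold = extremely_random_split(X, rng)
print(feature, threshold)
```

Because no split-quality criterion is evaluated, node construction is very cheap, which is the source of the computational-efficiency advantage the text attributes to ET.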
II. LITERATURE REVIEW computations. They mentioned that most of the available
The research work of [1][2][3][4][5] are all about personal techniques in this area are based on offline machine learning
bankruptcy or credit card default on payment prediction and techniques. Their work is the first work in this area that is
detection. In the work of [4], the authors worked on finding capable of updating a model based on new data in real time. On
financial distress from four different summarized credit datasets. the other hand, traditional algorithms require retraining the
Bankruptcy prediction and credit scoring were the primary model even if there is some new data, and the size of the data
indicators of financial distress prediction. According to the affects the computation time, storage and processing. For the
authors, a single classifier is not good enough for a classification purpose of real-time model updating, they use Online Sequential
problem of this type. So they present an ensemble approach Extreme Learning Machine (OS-ELM) and Online Adaptive
where multiple classifiers are used on the same problem and then Boosting (Online AdaBoost) methods in their experiment. They
the result from all classifiers are combined to get the final result, compared the results from above mentioned two algorithms with
and reduce Type I/II errors – crucial in the financial sector. For basic ELM and AdaBoost in terms of training efficiency and
their classification ensemble approach, they use four testing accuracy. In online AdaBoost, the weight for each weak
approaches: a) majority voting b) bagging c) boosting and 3) leaner and the weight for the new data is updated based on the
stacking. The also introduced a new approach called Unanimous error rate found in each of the iterations. The OS-ELM is based
Voting (UV) where if any of the classifiers says “yes” then it is on basic ELM which is formed from a single layer feedforward
assumed as “yes” whereas in Majority Voting (MV) at least network. Along with these algorithms, they also applied some
(n+1)/2 classifiers need to say “yes” to make the final prediction other classic algorithms such as KNN, SVM, RF, and NB.
yes. In the end, they are able to reduce the Type II error but Although KNN, SVM, and RF have shown higher accuracy, the
decrease the overall accuracy. training time was more than 100 times compared to other
algorithms. They found RF exhibits great performance in terms
In the work of [5], the authors present a system to predict of efficiency and accuracy (81.96%). In the end, both online
personal bankruptcy by mining credit card data. In their ELM and AdaBoost maintain the accuracy level of other offline
application, each original attribute is transformed either as: i) a algorithms, while significantly reducing the training time with
binary [good behavior and bad behavior] categorical attribute, or an improvement of 99% percent. They conclude that the online
ii) a multivalued ordinal [good behavior and graded bad AdaBoost has the best computational efficiency, and the offline
behavior] attribute. Consequently, they obtain two types of or classic RF has best predictive accuracy. In other words,
sequences, i.e., binary sequences and ordinal sequences. Later Online AdaBoost balances relatively better than offline or
they use a clustering technique for discovering useful patterns classic RF between computational accuracy and computational
that can help them to identify bad accounts from good accounts. speed. They mentioned two future directions of this research as
Their system performs well, however, they only use a single follows: a) incorporating concept drift to deal with the change of
source of data, whereas the bankruptcy prediction systems of new data distributions over time, which may affect the
credit bureaus use multiple data sources related to effectiveness of the online learning model, and b) sustaining the
creditworthiness. robustness of online learning for a dataset with missing records
or noise. They also mention that some other online learning
In the work of [1], they compared the accuracy of different
techniques like Adaptive Bagging could be applied and
data mining techniques for predicting the credit card defaulters.
compared in terms of speed, accuracy, stability, and robustness.
The dataset used in this research is from the UCI machine
learning repository which is based on Taiwan’s credit card Besides credit card default prediction and detection, there are
clients default cases [15]. This dataset has 30,000 instances, and lots of work on different types of credit card fraud detection.
6626 (22.1%) of these records are default cases. There are 23 Some of those are [6], [7], [8], [9], [10], [11], [12], [13], and
features in this dataset. Some of the features include credit limit, [14], where credit card transaction fraud detection are
gender, marital status, last 6 months bills, last 6 months emphasized and surveyed. Most of the transaction fraud is the
payments, etc. These are labeled data and labeled with 0 (refer direct result of stolen credit card information. Some of the
to non-default) or 1 (refers to default). From the experiment, techniques they used for credit card transaction fraud detection
based on the area ratio in the lift chart on the validation data, are as follows: Artificial Immune System, Meta-Learning,
they ranked the algorithms as follows: artificial neural network, Ontology Graph, Genetic Algorithms, etc.
classification trees, naïve Bayesian classifiers, K-nearest
neighbor classifiers, logistic regression, and discriminant So, despite the plethora of research being done in the area of
analysis. In terms of accuracy, K-nearest neighbor demonstrated credit default/fraud detection, little has been reported that
the best performance with an accuracy of 82% on the training resolves the issue of detecting default/fraud early in the process.
data and 84% on the validation or test data. To get an actual In this work, we will focus on detecting default accounts in the
probability of “default” (rather than just a discrete binary result) very early stage of credit analysis towards the discovery of a
they proposed a novel approach called Sorting Smoothing potential default on payment or even bankruptcy.
Method (SSM).
III. METHODOLOGY
In the work of [2], the authors use the same Taiwan dataset
[15] as of [1]. However, they applied a different set of algorithms We applied two different approaches to the dataset: one for
and approaches. In this research, they proposed an application of comparing results with previous standard machine learning
online learning for a credit card default detection system that approaches, and the other for validating our proposed approach.
achieves real-time model tuning with minimal efforts for The first approach is the application of different machine
learning algorithms on the dataset. We call this standard
approach the Machine Learning Approach. The second approach standard rule given X is equal to null and the R Online becomes
is based on our previous work [3], where two tests are maximum, 2) there are n related causes and all causes are valid
performed, what we call a standard test and a customer specific given both X and Y are equal and the R Online becomes zero,
test, to mine potential defaulting accounts. We will call this and 3) there are m valid causes among n related causes given R
second approach our Heuristic Approach. Online is a value in between maximum and minimum. Thus, our
proposed RISK algorithm returns the online risk probability
The Heuristic Approach predicts the credit default risk in two Ronline for a transaction.
steps. In the first step, we compute a credit default probability
score from archived transaction history using appropriate ___________________________________________________
machine learning algorithms. This score is stored in the database
and continuously updated as new transactions occur. In the RISK (SR, CSR, FS, T):
second step, as real-time transactions occur, we apply a heuristic 1. for each online transaction t of T
(applying a standard test and a customer specific test, explained 2. ViolatedRules Å StandardTest(t)
in detail in section V) to compute a risk score. This score is 3. if count of ViolatedRules is greater than 0
combined with the archived score using the equations (1) and (2) 4. Ronline Å CustomerSpecificTest (ViolatedRules)
to compute overall risk probability. 5. else
For the Machine Learning Approach, we experimented with 6. Ronline Å 0
various supervised machine learning algorithms to determine the 7. return Ronline
best algorithm. Then, for the Heuristic approach we will take ___________________________________________________
the best algorithm found from the Machine Learning Approach
and apply it only to the offline data to calculate the offline risk
probability Roffline. And whenever a new transaction occurs, we Finally, the risk probability from both online and offline data
run the two tests (Standard Test, Customer Specific Test) on the are combined using a weighted method to see whether the
online data to calculate the online risk probability Ronline. account is going to default in the near future.
Our proposed algorithm RISK shows the steps involved in
calculating Ronline..The parameters for the algorithm are standard IV. DATA
rules (SR), customer specific rules (CSR), Feature Scores (FS), In this work, we have used the “Taiwan” dataset [15] of
and a batch of online transactions (T). Details of the rules, rule Taiwan’s credit card clients’ default cases which has 23 features
and 30,000 instances, out of which 6,626 (22.1%) are default cases. The same dataset has also been used in other research work [1][2]. Some of the features of this dataset are credit limit, gender, marital status, the last 6 months' bills, the last 6 months' payments, and the last 6 months' repayment status. Records are labeled as either 0 (non-default) or 1 (default). Fig. 1 shows a snapshot of 5 random records from the dataset.

As indicated earlier, the Heuristic Approach processes two different datasets related to credit card transactions: the offline data and the online data. However, one of the issues with research in this area is that both offline and online transactional data are not publicly available. Specifically, some public datasets contain summarized customer profile and credit information, but not individual credit transactions (i.e., no single publicly available dataset contains both for the same set of customers). To tackle this issue, and to provide a relevant data source for future work in this area (something that we will make publicly available after publication), we decompose the Taiwan dataset into both offline and online datasets, as shown with the examples in Table 1 and Table 2.

The mappings, risk calculation steps, and flowcharts are described in detail in our previous work [3]. Every transaction is passed through the StandardTest function, which returns the violated standard rules (if any); these are then passed to the CustomerSpecificTest function to find the valid causes for breaking the standard rules. There is a mapping between causes and the standard rules. Different causes also carry different weights based on the following criteria: 1) the mapping between causes and standard rules, 2) the mapping between standard rules and features, and 3) the ranking of the associated features. We use the terms impact coefficient and weight interchangeably. To calculate the risk probability from online data we use the following formula:

R_online = [1 − (ΣImpactCoefficient(X) / ΣImpactCoefficient(Y))] × 100

Here, X is the set of valid causes for breaking the standard rules, and Y is the set of relevant causes (valid or invalid) for breaking the standard rules. Some of the use cases of the formula are as follows: 1) there is no valid cause for breaking a
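A minimal sketch of the R_online calculation in plain Python. The cause names, their weights, and the `impact_coefficient` mapping are hypothetical; the authors' actual cause-to-weight mapping is not published in this excerpt:

```python
def r_online(valid_causes, relevant_causes, impact_coefficient):
    """Online risk probability: the larger the share of rule
    violations left unexplained by valid causes, the higher the
    risk. X = valid causes, Y = all relevant (valid or invalid)
    causes for the broken standard rules."""
    explained = sum(impact_coefficient[c] for c in valid_causes)
    total = sum(impact_coefficient[c] for c in relevant_causes)
    return (1 - explained / total) * 100

# Hypothetical causes and weights, for illustration only.
weights = {"salary_delay": 0.5, "travel": 0.3, "new_merchant": 0.2}
risk = r_online({"salary_delay"},
                {"salary_delay", "travel", "new_merchant"},
                weights)  # half of the total impact is unexplained
```

When every relevant cause is valid the risk drops to 0; when no valid cause explains the violations it rises to 100.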
TABLE 1. OFFLINE DATASET CREATED FROM TAIWAN DATASET

account  balance_limit  sex  education  marriage  age  total_bill  total_payment  repayment  default
4663     50000          2    3          2         23   28718       1028           0          0
13181    100000         2    3          2         49   17211       2000           0          0
21600    50000          2    2          2         22   28739       800            0          0
1589     450000         2    2          2         36   201         3              -1         0
28731    70000          2    3          1         39   133413      4859           2          0

TABLE 2. ONLINE DATASET CREATED FROM TAIWAN DATASET

tid    account  amount  date        type
53665  23665    660     2015-05-29  pay
9328   9328     46963   2015-05-14  exp
37597  7597     3000    2015-05-29  pay
9495   9495     75007   2015-05-14  exp
34113  4113     5216    2015-05-29  pay

Initially, from each record (customer) in the Taiwan dataset, we created 5 online transactions of type "pay" (payment) from PAY_AMT1 to PAY_AMT5 and 5 online transactions of type "exp" (expenditure) from BILL_AMT1 to BILL_AMT5. Since BILL_AMT is the sum of all individual bills or transactions, we divided each BILL_AMT into individual transactions by following the data distribution of a real credit card transaction dataset. As shown in Table 3, BILL_AMT1 is the total bill and PAY_AMT1 the payment amount for the month of September 2005, BILL_AMT2 the total bill and PAY_AMT2 the payment amount for the month of August 2005, and so on, up to the oldest month, in this case April 2005 (BILL_AMT6 and PAY_AMT6). So, initially, PAY_AMT6 and BILL_AMT6 go into total_payment and total_bill in the offline data (Table 1). At the end of each month, total_payment and total_bill are updated with that month's total bill (BILL_AMT) and total payments (PAY_AMT).

TABLE 3. MONTH VS FEATURE MAPPING IN TAIWAN DATASET

Month      Feature (bill amount)  Feature (payment amount)
April      BILL_AMT6              PAY_AMT6
May        BILL_AMT5              PAY_AMT5
June       BILL_AMT4              PAY_AMT4
July       BILL_AMT3              PAY_AMT3
August     BILL_AMT2              PAY_AMT2
September  BILL_AMT1              PAY_AMT1

As stated previously, BILL_AMT is the summarized information of an entire month's transactions. We break this BILL_AMT down into individual transactions by following the real credit card transaction data distribution of the "Spain" dataset [16] used in [18]. We then scaled those datasets up or down as needed to convert them to the same currency scale using the formula below:

V2 = Min2 + ((Max2 − Min2) × (V1 − Min1)) / (Max1 − Min1)

where V2 = the converted value, Max2 = the ceiling of the new range, Min2 = the floor of the new range, Max1 = the ceiling of the current range, Min1 = the floor of the current range, and V1 = the value to be converted.

To ensure that a corresponding transaction distribution can be followed in the "Spain" dataset for a BILL_AMT in the "Taiwan" dataset, we used equal frequency binning to determine the ranges into which a monthly bill amount (BILL_AMT) must fall. Equal frequency binning uses an inverse cumulative distribution function (ICDF) to calculate the upper and lower ranges. As a result, we came up with on average 359,583 online transactions per month for the same 30,000 accounts (records) in the original dataset.

It should also be noted that another significant result of this work is the creation of a dataset for other researchers. As mentioned earlier, public access to credit card summary data and credit card transactional data for the same set of customers is rare. While it was necessary to create this dataset for our specific research purposes, we realize the benefit of making it public to the general research community.

V. EXPERIMENT

For our experiments, we use the Python scikit-learn library. The following sections describe the experimental setup for each of the two approaches discussed earlier.

A. Machine Learning Approach

We run different machine learning algorithms on the "Taiwan" dataset. The purpose of this test is to evaluate an improved approach in terms of the following performance evaluation metrics: accuracy, recall, F-score, and precision. We chose these metrics for two reasons: 1) they are the metrics frequently used in related research, and 2) they allow a comparison of our results with previous research using this Taiwan dataset.

Some of the algorithms we tried include K-Nearest Neighbors, Random Forest, Naïve Bayes, Gradient Boosting, and Extremely Random Trees (Extra Trees). We used the k-fold (k = 10) cross-validation technique for the testing/training split and to calculate the performance metrics. Default parameters (as set in scikit-learn) were used for all algorithms unless otherwise mentioned.

B. Heuristic Approach

This approach originated from our previous preliminary work using a synthetic dataset [3]. In this work, however, we validate the approach using the publicly available "Taiwan" dataset, divided into offline and online datasets. Besides addressing the earlier limitations (e.g., the lack of validation of the proposed model on a known, real dataset), we also found a better base algorithm (Extremely Random Trees) than before, which contributes to the calculation of the offline risk probability Roffline in our Heuristic Approach. We briefly reiterate the two tests discussed in detail in [3]:

1) Standard Test: The purpose of this test is to identify transactions that deviate from normal behavior and pass them to the next test, named the Customer Specific Test. Here, normal behavior refers to the common set of standard rules that
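Two pieces of the setup described above lend themselves to a short sketch: the range-conversion formula with equal-frequency binning used to build the online dataset, and the four evaluation metrics used to compare the approaches. This is plain Python with invented example numbers, not the authors' pipeline:

```python
def rescale(v1, min1, max1, min2, max2):
    """Map a value V1 from its current range [Min1, Max1] onto the
    new range [Min2, Max2] (the currency-scale conversion formula)."""
    return min2 + (max2 - min2) * (v1 - min1) / (max1 - min1)

def equal_frequency_edges(values, n_bins):
    """Equal-frequency bin edges: quantiles of the empirical
    distribution (the ICDF evaluated at evenly spaced probabilities),
    so each bin holds roughly the same number of transactions."""
    s = sorted(values)
    return [s[round(i * (len(s) - 1) / n_bins)] for i in range(n_bins + 1)]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-score for binary labels
    (1 = default), computed from confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if (t, p) == (1, 1))
    tn = sum(1 for t, p in zip(y_true, y_pred) if (t, p) == (0, 0))
    fp = sum(1 for t, p in zip(y_true, y_pred) if (t, p) == (0, 1))
    fn = sum(1 for t, p in zip(y_true, y_pred) if (t, p) == (1, 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score

# Invented numbers: a 5,000 bill in a 0-10,000 range mapped onto 0-200.
converted = rescale(5000, 0, 10000, 0, 200)
acc, prec, rec, f1 = classification_metrics([1, 0, 1, 0], [1, 0, 0, 0])
```

In the paper these metrics come from scikit-learn's cross-validation; the stdlib version above only illustrates what is being measured.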
In fraud or risk detection, recall is very important because we don't want to miss fraud or risks. However, maximizing recall introduces an increase in false positives, which is expected in risk analytics.

Fig. 3. Performance comparison of different approaches

Recall that we divided the dataset into offline and online datasets, as mentioned in the Data section (Section IV), consisting of bill and payment data for 6 months. Data from the first month was included in the summarized fields (i.e., total_bill and total_payment), and for the remaining 5 months we made 5 batches of offline data and 5 batches of online data. We then ran offline batch 1 and online batch 1 serially, followed by offline batch 2 and online batch 2, and so on, up to batch 5. This allows comparing the results from the newly created offline (Table 1) and online (Table 2) datasets derived from the "Taiwan" dataset with the results of the Machine Learning Approach, as shown in Fig. 3.

TABLE 4. BATCH-WISE PERFORMANCE METRICS

Fig. 4. Batch-wise performance metrics

From Table 4 and Fig. 4 we can see that recall increases as the number of batches increases. This implies that the percentage of targeted (i.e., defaulting) accounts increases (linearly) with the number of batches (i.e., as more information about the customer is known).

The computation time for both the Machine Learning Approach and calculating Roffline using Extremely Random Trees for 30,000 accounts was on average 11.24 seconds on a commodity laptop with an Intel Core i7 processor and 12 GB of RAM. Though Naïve Bayes is slightly faster than Extremely Random Trees, its accuracy, precision, recall, and F-score are lower. For the online data computation, it took on average 114.42 seconds for a batch of, on average, 359,583 transactions. For our interpretation of results, we created only one batch per month. However, nothing in our proposed approach requires batches of this size; any number of transactions per month could be used for the online data, which could lead to much smaller batches with less computation time. To verify this, we tried batches of different sizes (halving the batch size each time) and found that the computation time for the online data decreases almost linearly with the number of transactions per batch. From Fig. 5, we can see that the trendline (dotted line) almost coincides with the actual line. This demonstrates how fast this approach can process online transactions and give a decision in near real-time.

VII. CONCLUSION

In this research, we have used two approaches, Machine Learning and Heuristic, for mining default accounts from a well-known dataset. The Heuristic Approach came from our previous work [3], which we validated with actual data in this work. The main idea of the Heuristic Approach is to calculate the risk factor from recent transactional (online) data and combine the results with pre-computed risk factors from historical (offline) data in an efficient way. To make the process efficient, we only
have to process a transaction when it initially occurs, and then the combined risk factor is carried forward for future transactions. We showed that this approach can predict a default account significantly in advance, which is very cost efficient for the funding organization. In addition, we demonstrated that both approaches outperform previously reported approaches using the same dataset [1][2]. Our future plan is to improve the Heuristic Approach so that it outperforms the Machine Learning Approach in terms of all performance metrics, and to validate that with multiple datasets. Other plans include: testing and validating the model with multiple real datasets; standardizing the online vs. offline risk weight ratio (the value of λ) with multiple credit default datasets; and handling concept drift to deal with changes in the distribution of the online data over time, which may affect the effectiveness of the approach.

REFERENCES

[1] Yeh, I-Cheng, and Che-hui Lien. "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients." Expert Systems with Applications 36.2 (2009): 2473-2480.
[2] Lu, Hongya, Haifeng Wang, and Sang Won Yoon. "Real Time Credit Card Default Classification Using Adaptive Boosting-Based Online Learning Algorithm." IIE Annual Conference. Proceedings. Institute of Industrial and Systems Engineers (IISE), 2017.
[3] Islam, Sheikh Rabiul, William Eberle, and Sheikh Khaled Ghafoor. "Mining Bad Credit Card Accounts from OLAP and OLTP." Proceedings of the International Conference on Compute and Data Analysis. ACM, 2017.
[4] Liang, Deron, et al. "A novel classifier ensemble approach for financial distress prediction." Knowledge and Information Systems 54.2 (2018): 437-462.
[5] Xiong, Tengke, et al. "Personal bankruptcy prediction by mining credit card data." Expert Systems with Applications 40.2 (2013): 665-676.
[6] Ramaki, Ali Ahmadian, Reza Asgari, and Reza Ebrahimi Atani. "Credit card fraud detection based on ontology graph." International Journal of Security, Privacy and Trust Management (IJSPTM) 1.5 (2012): 1-12.
[7] Yoon, J. W., and C. C. Lee. "A data mining approach using transaction patterns for card fraud detection." Jun. 2013. [Online]. Available: arxiv.org/abs/1306.5547.
[8] RamaKalyani, K., and D. UmaDevi. "Fraud detection of credit card payment system by genetic algorithm." International Journal of Scientific & Engineering Research 3.7 (2012): 1-6.
[9] Delamaire, Linda, Hussein Abdou, and John Pointon. "Credit card fraud and detection techniques: a review." Banks and Bank Systems 4.2 (2009): 57-68.
[10] Al-Khatib, Adnan M. "Electronic payment fraud detection techniques." World of Computer Science and Information Technology Journal (WCSIT) 2.4 (2012): 137-141.
[11] Pun, Joseph King-Fung. Improving Credit Card Fraud Detection using a Meta-Learning Strategy. Diss. 2011.
[12] West, Jarrod, and Maumita Bhattacharya. "Some Experimental Issues in Financial Fraud Mining." Procedia Computer Science 80 (2016): 1734-1744.
[13] Bhattacharyya, Siddhartha, et al. "Data mining for credit card fraud: A comparative study." Decision Support Systems 50.3 (2011): 602-613.
[14] Gadi, Manoel Fernando Alonso, Xidi Wang, and Alair Pereira do Lago. "Credit card fraud detection with artificial immune system." International Conference on Artificial Immune Systems. Springer, Berlin, Heidelberg, 2008.
[15] "UCI Machine Learning Repository: Default of credit card clients Data Set." [Online]. Available: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#. Accessed: Dec. 23, 2017.
[16] "Synthetic data from a financial payment system | Kaggle." [Online]. Available: https://www.kaggle.com/ntnu-testimon/banksim1/data. Accessed: Feb. 9, 2018.
[17] "UCI Machine Learning Repository: Statlog (German Credit Data) Data Set." [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data. Accessed: Feb. 20, 2016.
[18] Lopez-Rojas, Edgar Alonso, and Stefan Axelsson. "BANKSIM: A bank payments simulator for fraud detection research." 26th European Modeling and Simulation Symposium, EMSS 2014.
[19] "Introduction to data warehousing concepts," 2014. [Online]. Available: https://docs.oracle.com/database/121/DWHSG/concept.htm#DWHSG9289. Accessed: Mar. 28, 2016.
[20] Geurts, Pierre, Damien Ernst, and Louis Wehenkel. "Extremely randomized trees." Machine Learning 63.1 (2006): 3-42.
firms, focusing on returning customers (customers with more than one invoice), and concluded that using customers' historical late payment behaviors in addition to invoice features significantly improves prediction accuracy. They extended their analysis to include customer features such as organization profile, and were able to further improve the prediction accuracy. By predicting that an invoice payment will be late, for example more than 90 days late, the company can take early preemptive actions to prevent such invoices from becoming bad debt. Finally, they compared the accuracy of a unified model (applicable to all four firms) against firm-specific models, and showed that the firm-specific models yielded higher prediction accuracy.

Following [2], references [3] and [4] are two master's theses from MIT that focused on similar work, using similar invoice and payment behavior features, and drew the same conclusion: the Random Forest model had the highest prediction accuracy for predicting whether an invoice payment will be on time or delayed, and how long the delay will be. The author of [3] added work on analyzing the characteristics of delayed invoices and problematic customers, and concluded that there was no obvious correlation between invoiced amount and invoice delay. The author of [4] also performed further analysis and concluded that customers with fewer invoices are less likely to pay late, and vice versa, and thus different models should be built for different customer groups. He also showed that prediction accuracy increases as the number of invoices per customer increases.

Apart from managing overdue invoices from the creditor's perspective, managing overdue invoices from the debtor's perspective has also been considered [5]. The authors proposed a generic methodological framework for invoice payment processing that can handle stochastic processing lead time, and built a cohort Markov chain model to simulate the process and identify bottlenecks for improvement, leading to a reduction in the payment of overdue invoices.

Our work differs from the past works in [2], [3] and [4] in a few areas. Firstly, we do not predict whether a particular invoice payment will be on time or late; instead we focus on the customer as a whole. A customer can have multiple outstanding invoices, and we predict whether he would likely pay on time or not pay at all. Secondly, we define a new pureness measure to determine whether a customer is good (pureness = 1) or bad (pureness = 0) to train our predictive models, using features related to past on-time payment behavior and organization profile, rather than features related to past late payments. We then use our model to predict for those customers who have pureness between 0 and 1 (partially paid on time), and identify those who are likely to pay on time with high probability, hoping to reduce the overall intervention actions taken. Thirdly, the past works focused on increasing intervention actions on invoices that are likely to be late, while we focus on reducing intervention actions on customers who are likely to pay on time. Finally, using our pureness measure, we can determine the relationship between computed pureness and the predicted probability of paying on time.

This paper is organized as follows. Section 3 defines our proposed pureness measure and how it is used to group the customers into three different groups. Section 4 describes the data preparation tasks, where seven database tables were provided and derived attributes were computed, resulting in a different number of customers in each group. In Section 5, we explore the data to gain some intuitive as well as interesting insights. From the interesting insights, we can support the conjecture that different billing cycles and different numbers of business years could be correlated with the number of intervention actions due to late invoice payment. In Section 6, we discuss the modeling process and the prediction results obtained. Section 7 concludes the paper.

3 Pureness Measure Definition

In [2], the authors created 14 aggregated features related to late payment behavior, including the ratio of paid invoices that were late, the ratio of the sum of paid base amounts that were late, the average days late of paid invoices that were late, the ratio of outstanding invoices that were late, the ratio of the sum of outstanding base amounts that were late, and the average days late of outstanding invoices that were late. Both [3] and [4] also included aggregated features related to late payment behavior, such as the number of delayed invoices, total amount of delayed invoices, average amount of delayed invoices, average delayed days, and their respective ratios. All these features were useful for predicting whether an invoice will be paid on time or delayed.

As the objective of our work is to predict whether a customer is likely to pay on time, we are more interested in measuring the percentage of invoices, both in terms of number and value, that were paid on time, rather than late. Thus, we have defined a new pureness measure as follows:

Pureness = W1 × (number of invoices paid on time / total number of invoices) + W2 × (sum of value of invoices paid on time / sum of value of all invoices)

where W1 and W2 are weights that can be adjusted according to how much emphasis the company wants to place on each factor.

Using this measure, we can compute the pureness of all the customers in our data set and group the customers into three groups:

• Group 1 – those who paid all their invoices within 45 days (Pureness = 1)
• Group 2 – those who did not pay their invoices after 180 days (Pureness = 0)
• Group 3 – those who partially paid their invoices within 45 days (0 < Pureness < 1)

The intention is to use Group 1 and Group 2 to train the predictive model, and then predict for those in Group 3, identifying those most likely to pay their invoices on time like those in Group 1. For customers identified as having a high probability of being like those in Group 1, the number of intervention actions can be reduced.

4 Data Preparation

We were provided with a snapshot of the data in the month of January 2015 from the logistics company, which includes seven different database tables as described in Table 1. We joined all seven tables together using the customer ID, and removed records (rows) and attributes (columns) with too many missing values. We were left with 10,562 unique customer records. For all customers, we computed their individual pureness measure and used it as the target variable for the predictive model. The number of customers in each group is given in Table 2.

Table 1. Description of the Seven Database Tables (DBT) Provided

1 - Invoice
• This table contains the invoice information for all the customers located in a specific country. One invoice record is one row, containing the invoice number, customer ID, invoice date, invoice closed date, etc.
• Using the invoice date and invoice closed date, we can compute the number of days taken to make payment. Invoices closed within 45 days are marked as on time, and those not closed after 180 days are marked as bad debt. We can then compute the total number of invoices paid on time, and the total number of invoices in the data, for each customer ID.

2 - Intervention Actions
• This table contains all the past intervention actions taken for each customer. One intervention action record is one row, containing the intervention action ID, customer ID, invoice number, intervention action description, etc.
• Note that since we only have a snapshot of the data, the intervention actions consist of all historical intervention actions, which can include invoices that are not found in DBT 1.

3 - Revenue
• This table contains the revenue associated with each invoice. One invoice record is one row, containing the invoice number, customer ID, revenue, etc.
• Using the revenue for each invoice, we can compute the total revenue of invoices paid on time, and the total revenue of all invoices in the data, for each customer ID.

4 - Payment Type
• This table contains the payment method for each invoice. One invoice record is one row, containing the invoice number, customer ID, payment type (paid by sender, paid by recipient, or paid by others), etc.
• Using the payment type, we can compute the percentage of invoices paid by sender, receiver, or others.

5 - Customer
• This table contains information related to the customers. One customer record is one row, containing the customer ID, total number of employees, year in which the company was started, industry code, etc.

6 - Air Bill
• This table contains information for each air bill. One air bill is one row, containing the air bill ID, invoice number, weight, volume, amount, whether the shipment is an envelope or a box, etc.
• An air bill (or air waybill) is a document that accompanies shipped goods to provide detailed information about the shipment and allow it to be tracked. One invoice can consist of more than one air bill. Using the weight, volume and amount of each air bill, we can compute the average weight, average volume, average amount and percentage of envelope shipments over all the air bills for all the invoices related to a customer ID.

7 - Billing Cycle
• This table contains the billing cycle and payment mode for each customer. One customer record is one row, containing the customer ID, whether payment is by cash or not, billing cycle (daily, weekly, bi-weekly, or monthly), etc.

Based on the computed pureness measure for each customer, we have 5,373 customers in Group 1 (pureness = 1), 2,747 in Group 2 (pureness = 0), and the remaining 2,442 in Group 3 (0 < pureness < 1). Among the customers in Group 1, 2,004 had only one invoice with the company. We removed such customers, as we could not confidently determine good payment behavior from a single invoice. In the end, 3,369 customers in Group 1 and 2,747 customers in Group 2 were used for training the predictive model.
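The pureness computation and grouping described in Sections 3 and 4 can be sketched as follows. This is a toy implementation: the equal weights W1 = W2 = 0.5 are an assumption (the paper leaves them tunable), and the 45/180-day labeling of invoices is taken as given upstream:

```python
def pureness(invoices, w1=0.5, w2=0.5):
    """Pureness = W1 * (on-time count / total count)
                + W2 * (on-time value / total value).
    `invoices` is a list of (value, paid_on_time) pairs;
    equal weights here are an assumption."""
    n_on_time = sum(1 for _, on_time in invoices if on_time)
    v_on_time = sum(value for value, on_time in invoices if on_time)
    total_value = sum(value for value, _ in invoices)
    return (w1 * n_on_time / len(invoices)
            + w2 * v_on_time / total_value)

def group(p):
    """Group 1: all invoices on time; Group 2: none; Group 3: partial."""
    return 1 if p == 1 else 2 if p == 0 else 3

# One of two invoices on time (half the count, a quarter of the value).
p = pureness([(100, True), (300, False)])  # 0.5*0.5 + 0.5*0.25 = 0.375
```

A customer with pureness strictly between 0 and 1 lands in Group 3, the set the trained model later predicts for.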
5 Data Exploration
With 17 attributes given in Table 3, including the unique
customer ID and the target variable pureness measure, we
explored the data to uncover some insights. The more
intuitive observations include:
Using the selected Neural Network model, we predicted the outcome for Group 3 customers; 41.89% of them were predicted to have pureness = 1 with high predicted probabilities, as shown in Table 5. With these results, the logistics company can sort the customers by predicted probability in descending order and choose to reduce the intervention actions for customers with high probabilities. This will lead to fewer intervention actions taken and lower associated costs.

Table 5. Predicted Pureness for Group 3 Customers

Pureness Prediction  Percentage of Customers
1                    41.89%
0                    58.11%
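The ranking step can be sketched as follows. The customer IDs, probabilities, and the cut-off value are hypothetical; the paper does not fix a threshold:

```python
def reduced_intervention_candidates(scores, threshold=0.9):
    """Sort customers by their predicted probability of pureness = 1
    (descending) and keep those above a cut-off -- the customers
    whose intervention actions can be reduced first."""
    ranked = sorted(scores, key=lambda item: item[1], reverse=True)
    return [cid for cid, prob in ranked if prob >= threshold]

# Hypothetical model outputs: (customer_id, predicted probability).
candidates = reduced_intervention_candidates(
    [("C01", 0.42), ("C02", 0.97), ("C03", 0.88)])
```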
7 Conclusions
We have successfully used a predictive model to predict whether a customer will pay their outstanding invoices on time with high probability, in an attempt to reduce the intervention actions taken on them, thus reducing cost and improving customer relationships. We focused on the customer level rather than on individual invoices, by including features such as the customer's billing cycle, employee level and business years. From our results, it was found that a linear relationship exists: for every 0.1 unit increase in the pureness measure, the customer is 1.132 times more likely to pay on time. This relationship is useful as a general rule of thumb to determine whether interventions should be taken for a customer, based simply on the computed pureness measure and without using the predictive model.
SentiWordNet [7] suffers from not being domain specific and is not suitable for colloquial language. Lexicons extended from seeds can be dictionary based or corpus based. Dictionary-based lexicons are mostly domain independent; relations such as synonymy and antonymy are used to expand a small list of seed words. In corpus-based lexicons, polarity is decided by co-occurrence in a corpus. The type of co-occurrence could be a conjunction of adjectives by "and/or" as in [8], appearance in the same window as in [9], or generation from an annotated corpus [10]. There are no well-known available Arabic sentiment lexicons, so many researchers have tried to build their own. El-Beltagy [11] extended 370 seed words to construct an Egyptian dialect sentiment lexicon of 4,392 terms. However, her lexicon was oriented towards politics. Abdulla [12] expanded 300 seed words to 3,479 terms using synonyms of each word. His lexicon was generic and in MSA form.

The supervised technique is mainly based on machine learning. In polarity classification, a set of labeled data is collected, and then a set of features is extracted to train the classifier. Those extracted features could be unigrams (either stemmed or lemmatized), part-of-speech tags, emoticons, or any other handcrafted features. Both the labels and the extracted features are fed to the machine learning algorithm to build the classifier model. In turn, the classifier model assigns a label to the tested input text based on its extracted features. The accuracy of a well-trained machine learning approach can outperform the lexicon-based approach given good feature selection techniques. Pang [13] was among the first to address this approach, classifying reviews into positive and negative using different classification algorithms. Since then, a large amount of research has tackled the supervised classification problem by engineering different sets of features [4]. SAMAR [14][15] is an Arabic supervised sentiment analysis system that investigates how to treat Arabic dialects. The authors suggested individual solutions for each domain and task, but found that lemmatization is the best feature in all approaches. However, in our colloquial language case, lemmatization cannot handle spelling mistakes and concatenated words.

Recent research has tackled the problem of sentiment analysis using word embeddings. In unsupervised classification, the generated word embedding can be used in a label propagation framework to induce sentiment lexicons from small sets of seed words [16]. In supervised classification, word embeddings have been used to represent features for the sentiment classifier [17]. Tang [18] proposed a Sentiment-Specific Word Embedding (SSWE). His method encodes sentiment information in the word embedding by training it on a large set of labeled tweets collected by positive and negative emotions. He reported better performance when training the embedding with a relatively small context window size. Labutov [19] suggested re-embedding an existing embedding using some labeled data. His method showed an improvement in the sentiment classification task, but he observed that the approach is most useful when the training data is small.

For the Arabic language, Altowayan [20] addressed the sentiment analysis problem using a generic word embedding. He achieved slightly better accuracy than the top handcrafted methods in both Standard Arabic and Dialectal Arabic. However, his corpus was small and generic, and he used only one embedding model with default parameters.

3. Methodology

Building our platform required three phases. The first phase is collecting a large corpus and then using it to train the word embedding. The second phase is collecting the training dataset in order to train the classifier. The third phase is tuning the embedding parameters to choose the best accuracy for the case study. In this section, we describe each phase in detail.

3.1 Building the Corpus and Training Embeddings

3.1.1 Crawling Social Media

Initially, a Facebook group called 'Proudly Made in Egypt' was crawled. This group aims to discuss Egyptian products and compare them with imported ones. By the end of 2017, this group had reached more than 610,000 members and served 12,170 feeds with 663,981 comments. This data was crawled from Facebook with its publicly available Graph API tool, and then all Facebook page URLs were extracted to be used as new seeds to crawl. After pruning, more than 1,000 new Facebook pages were obtained. Then, Arabic Facebook pages related to products were crawled, even for products not made in Egypt. In addition, hashtags were extracted from the crawled content and, after pruning, used as seeds to crawl Twitter. Finally, around 717,530 tweets were crawled. The final Social Media (SM) corpus contains around 52 million feeds, comments and tweets, with 477 million tokens.

For the purpose of comparison using a larger corpus, another three corpora were generated by combining the above SM corpus with three different MSA corpora:

• The Altowayan [20] corpus (ALTW), with 19 million tokens of Arabic news, reviews and Quran text.
• Aracorpus¹ (ARAC), with 67 million tokens of raw text from Arabic daily newspapers collected over a year between 2004 and 2005.
• An Arabic Wikipedia dump (WK) of 89 million tokens.

These three versions of the corpus help evaluate the effect of associating an MSA corpus with the colloquial one.

3.1.2 Preprocessing

The preprocessing performed on the text has a strong impact on the generated embedding. Social media users can mistakenly use similar letters interchangeably. Those spelling mistakes increase the number of unique tokens, which in turn increases the computational complexity of training the embedding. Hence, some filtrations, unifications and normalizations are applied as follows:

• Filter the corpus from repeated characters, punctuation, diacritics and elongation.

¹ http://aracorpus.e3rab.com/argistestsrv.nmsu.edu/AraCorpus/
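The normalizations listed here can be sketched with a few regular expressions. The exact rules and their order are assumptions, not the authors' implementation; characters are written as Unicode escapes for clarity:

```python
import re

def normalize(text):
    """Sketch of the preprocessing normalizations (assumed rules)."""
    text = re.sub(r"https?://\S+", " URL ", text)   # unify hyperlinks
    text = re.sub(r"\u0640", "", text)              # strip elongation (tatweel)
    text = re.sub(r"[\u0623\u0625\u0622]", "\u0627", text)  # alef variants -> bare alef
    text = re.sub(r"\u0649", "\u064A", text)        # alef maqsura -> yeh
    text = re.sub(r"\u0629", "\u0647", text)        # teh marbuta -> heh
    return re.sub(r"(.)\1{2,}", r"\1", text)        # collapse repeated characters
```

For example, the elongated colloquial spelling of جميل ("beautiful") with extra yehs collapses back to a single token, so the vocabulary does not grow with every stylistic repetition.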
• Standardize characters like ( أ, إ, )ا, (ى, )يand ( ة,)ه.
• Unify common terms such as question marks, hyperlinks, name tags (i.e. @Name), following terms (i.e. ''م, 'f' and 'up') and numbers.
• Unify emoticons and abbreviations based on their sentiment meaning.

3.1.3 Training the Word Embedding
In this paper, three different word embedding models were compared: CBOW [21], Skip-Gram [21] and GloVe [22]. Those models have gained popularity in the last few years as they outperform traditional models. The first two are predictive models based on neural networks. In CBOW, Mikolov [21] uses n words before and after the target word to predict it, while Skip-Gram uses the center word to predict the surrounding words. GloVe instead initializes vectors with co-occurrence statistics, by training on global word-word co-occurrence counts.

Initially, three embeddings of the above models are trained from the SM corpus. The same is performed with the SM+ALTW, SM+WK and SM+ARAC corpuses. In order to compare those embeddings with others, we downloaded other publicly available Arabic embeddings, like the Zahran [23] embeddings that are trained from a large amount of collected MSA corpuses, and the Aravec [24] embedding models built on top of completely different Arabic corpus domains: Twitter (TW) and the World Wide Web (WWW). Table 1 lists details of those embeddings.

Table 1: Evaluated embeddings
Corpus          Number of Tokens   Parameters
SM              477 million        (Dim 300, Win 10)
SM+ALTW [20]    496 million        (Dim 300, Win 10)
SM+ARAC         544 million        (Dim 300, Win 10)
SM+WK           567 million        (Dim 300, Win 10)
TW [24]         1,090 million      (Dim 300, Win 3)
WWW [24]        2,225.3 million    (Dim 300, Win 5)
Zahran [23]     5.8 billion        Skip-Gram (Dim 300, Win 10), CBOW (Dim 300, Win 5)

To evaluate the embeddings, intrinsic and extrinsic evaluations are used [25]. For the intrinsic evaluation, we used an analogy test and a colloquial categorization test. To conduct the analogy test, the manual translation of Mikolov's [21] test from English to Arabic provided by Zahran [23] has been used. Given that the translation is in MSA, which is different from colloquial Arabic, the test was applied by considering an answer correct if one of the top 10 predicted words matched it, as opposed to the top 5 used by [23]. The coverage (cov) and accuracy (acc) for each embedding are computed.

In addition, we generated a categorization test inspired by the Battig test set introduced by Baroni et al. [26]. Twenty colloquial terms related to products from five different categories are selected, as listed in Table 2. The term vectors generated from the embedding are clustered using the k-means algorithm and the purity of the returned results is computed. The same is applied to cluster another 14 colloquial sentiment terms into positive and negative, as listed in Table 3. The coverage and accuracy of the analogy test, and the purity results of the clustering tests, are reported in Table 4.

Table 2: Product terms
Clothes         ﻓﺴﺘﺎن ﺷﻮز ﺑﻠﻮزه ﺑﻨﻄﻠﻮن
Fruits          ﺑﻄﯿﺦ ﻓﺮاوﻟﮫ ﻣﻮز ﺗﻔﺎح
Home Appliances ﺗﻼﺟﮫ ﻣﻜﻨﺴﮫ ﺧﻼط ﺗﻠﻔﺰﯾﻮن
Cosmetics       ﻛﺤﻞ ﻣﺎﺳﻜﺮا ﺑﻠﺴﻢ ﺑﺮﻓﺎن
Furniture       ﺳﺮﯾﺮ دوﻻب ﺗﺮاﺑﯿﺰه ﻛﺮﺳﻲ

Table 3: Sentiment terms
Positive  ﯾﺠﻨﻦ ﺷﯿﻚ ﺗﺤﻔﮫ ﺟﻤﯿﻞ ﺷﺎﺑﻮ ﻣﻨﺎﺳﺐ راﻗﻲ
Negative  ﻏﺎﻟﻲ ﻣﻌﻔﻦ ردئ وﺣﺶ ﺳﺊ ﻣﻘﺮف زﻓﺖ

It is noticed that the accuracy and coverage of the analogy test for Zahran's embeddings are higher than for the social media embeddings. This is due to the use of MSA terms in the analogy questions, while the social media corpuses are mostly colloquial. However, the results of the social media corpuses are still better than those of the TW and WWW embeddings. In addition, the CBOW (CB) and Skip-Gram (SG) models with the social media corpuses outperform the GloVe model in the analogy test.

For product clustering, the four social media corpuses correctly clustered the product terms, except the SM+ARAC embedding using the Skip-Gram model. For sentiment clustering, the Skip-Gram model of the social media embeddings correctly clusters the sentiment words, except the SM+WK embedding. The GloVe model shows better results when using the purely social media corpus. The CBOW social media corpuses reported the same high purity of 92.85%.

This simple intrinsic evaluation led us to use one of the social media embeddings in the classifications. This makes sense, since colloquial terms may have different meanings in MSA. For example, the word ‘ ’ﻏﺎﻟﻲis a negative term if used in the products domain, as it means ‘expensive’, while in other domains it could be a positive term meaning ‘precious’. In addition, spelling mistakes and word variations are better captured when using a social media corpus. In order to visualize the product term and sentiment term vectors from the SM embedding, the TensorBoard2 tool is used. This tool includes an Embedding Projector that visualizes embeddings by rendering them in two or three dimensions. Figure 1 and Figure 2 show the product term and sentiment term vectors in 2D from the Skip-Gram SM embedding. As shown in the two figures, the term vectors are correctly clustered.

Figure 1: Product term vectors clustering    Figure 2: Sentiment term vectors clustering

2 https://github.com/tensorflow/tensorboard
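The purity score used in the categorization test can be computed as below. This is a minimal sketch: the cluster assignments would come from running k-means over the term vectors, which is assumed here to be given, and the `purity` helper is our own illustration rather than the authors' code.

```python
from collections import Counter

def purity(clusters, labels):
    """Purity of a clustering: for each cluster, count how many items
    carry that cluster's majority gold label; sum those counts and
    divide by the total number of items.

    clusters: list of cluster ids, one per item (e.g. k-means output)
    labels:   list of gold category labels, one per item
    """
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    majority = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return majority / len(labels)

# Toy example: 2 clusters over 4 sentiment terms. One negative term
# lands in the mostly-positive cluster, so purity = 3/4.
print(purity([0, 0, 1, 0], ["pos", "pos", "neg", "neg"]))  # -> 0.75
```

A purity of 1.0 corresponds to the "correctly clustered" cases reported above, where every cluster contains terms of a single category.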
The decision of which corpus to use to train the embedding (SM, SM+ALTW, SM+ARAC or SM+WK), which model to use (CBOW, Skip-Gram or GloVe), and with which parameters, in order to obtain the best classification accuracy, is yet to be taken. As per Schnabel [25], different tasks favor different embeddings. As one of his extrinsic evaluations, he suggested measuring the contribution of the word embedding to the sentiment classification. This is the approach adopted to decide on the corpus and the model to be used, and it will be presented in Section 3.3.

3.2 Collecting the Training Dataset

3.2.1 Selection of data
To train the classifier, a dataset selected from the crawled Facebook content is judged by humans. More than 400,000 comments are filtered: comments that are neither follows nor tags and that contain Arabic text. To limit the influence of spam, comments are divided into groups based on the last two digits of the comment id, following the recommendation of [4].

In order to select a balanced set of positive, negative and neutral sentences, an unsupervised classification method is used to preliminarily classify those comments. The well-known SentiStrength application [27] was deployed for this purpose. It uses sentiment lexicons consisting of positive, negative, negation, and intensifier terms to generate a sentiment score. SentiStrength was originally designed for English and was not adapted to Arabic [28]. In order to use it, the SentiStrength lexicons have been replaced with colloquial Arabic terms. We manually extracted a few positive and negative seeds from the most frequent terms occurring in the corpus. Then, we extended them by extracting similar words from the SM embedding. The same process is applied for negation and intensifier terms. SentiStrength is then fed those extended lexicons to calculate the sentiment score of the selected comments. A training dataset of 15,000 comments was selected with balanced positive, negative and neutral comments.

3.2.2 Judging of data
To train the classifier, three human judges manually labeled the 15,000 selected training comments as either positive, negative, neutral or containing mixed sentiments. In addition, judges labeled each comment based on three categories, namely price, quality and availability. For example, ‘high quality’ is labeled as positive quality, while ‘expensive’ is labeled as negative price. The selected training data has around 5% mixed sentiments across different categories (i.e. ‘’ﺣﺎﺟﺘﮫ ﻏﺎﻟﯿﮫ ﺑﺲ ﻛﻮﯾﺴﮫ, which means ‘high quality but expensive’). Using separate training data for each category classification allows easily classifying those mixed sentiments. Finally, comments that at least two judges agreed upon are selected. The final judged training dataset is around 13,884 comments, as reported in Table 5.

Table 5: Judges' Results
                           All      Positive  Negative  Neutral
Overall training dataset:  13,884   5,127     3,453     5,304
Price                      1,749    469       744       536
Quality                    6,407    4,238     2,018     151
Availability               1,741    1,082     476       183

3.3 Classifying Data

Machine learning has been used as a supervised binary classification task. The classification algorithm used is the Support Vector Machine (SVM). The SVM algorithm is chosen since it shows very good performance and higher accuracy in many studies directed towards sentiment analysis in many languages [13]. The feature representation used in the algorithm is a fixed-size embedding vector: the feature extracted for each sample is the average of the embedding vectors of its words [21]. Depending on the quality of the embedding, similar samples will generate similar vectors.

To choose the best embedding for the different classifications, different embeddings are trained from the social media corpus using the three models (CBOW, Skip-Gram and GloVe) with different parameters. The main training parameters are the dimension (dim) of the learned word vector, the window (win) size to be included as the context of a target word, and the number of training iterations (itr) over the corpus. Usually, the most used vector dimension is 300, with window size 5 for CBOW and 10 for Skip-Gram. In our experiments, we varied the vector dimension from 100 to 600 and the window sizes at 5 and 10 on either side of the target word, to capture the common statement length of Facebook comments. All results shown are for 15 iterations, as it was found that increasing or decreasing this number has no great impact on the resulting accuracy. For each classification, the classifier is trained on 90% of the judged results. The remaining 10% of the final judged data are used for testing, to measure the classification accuracy. The accuracy is measured by the F1-score, Precision and Recall (‘F1’, ‘Pr.’ and ‘Rec.’ respectively).
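The feature extraction just described — representing each comment by the average of its words' embedding vectors — can be sketched as follows. The toy two-dimensional embedding and the out-of-vocabulary handling (skipping unknown tokens, zero vector for all-unknown comments) are our assumptions; the paper only states that the average of the word vectors is used. The resulting vectors would then be fed to an SVM classifier.

```python
def comment_features(tokens, embedding, dim):
    """Average the embedding vectors of a comment's tokens.
    Tokens missing from the embedding vocabulary are skipped;
    an all-OOV comment maps to the zero vector."""
    vec = [0.0] * dim
    hits = 0
    for t in tokens:
        v = embedding.get(t)
        if v is None:
            continue
        hits += 1
        for i in range(dim):
            vec[i] += v[i]
    if hits:
        vec = [x / hits for x in vec]
    return vec

# Toy 2-d embedding standing in for the real 300-600-d models.
emb = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
print(comment_features(["good", "bad", "oov"], emb, 2))  # -> [0.5, 0.5]
```

Averaging yields a fixed-size vector regardless of comment length, which is what lets a standard SVM consume variable-length text.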
3.3.1 Sentiment Analysis
The sentiment classification has been applied in two separate stages. The first stage is the subjectivity classification, aiming to separate the subjective statements containing positive or negative sentiments from objective statements that are neutral, with no sentiments. The second stage is the polarity classification, separating the resulting subjective statements into positive and negative. Table 6 reports the accuracy of each classification using the different embeddings. Each classification result represents the average of 50 runs with the same random training and testing datasets. All those runs, the corpus processing and the embedding trainings are performed on the High Performance Computing (HPC)3 cluster of the Bibliotheca Alexandrina.

Results show that the F1-score of the GloVe model underperforms Skip-Gram and CBOW in both the subjectivity and polarity classifications. For dimensions less than 300, it is noticed that the F1-score of the polarity classification is slightly better with the Skip-Gram model, while the F1-score of the subjectivity classification is better with the CBOW model. For dimensions larger than 300, both models have almost the same accuracy, with the best results achieved at dimension 600.

Table 6: Accuracies of polarity and subjectivity tasks
                        CBOW              Skip-Gram         GloVe
Task         Dim Win  F1%  Pr%  Rec%   F1%  Pr%  Rec%   F1%  Pr%  Rec%
Polarity     100 10   91.1 91.0 91.3   91.6 91.7 91.6   91.3 91.4 91.1
             200 10   92.4 92.4 92.4   92.4 92.3 92.5   92.1 92.1 92.1
             300 10   92.8 92.9 92.8   92.9 92.8 93.0   92.6 92.5 92.7
             400 10   93.0 92.9 93.1   93.1 93.2 93.1   92.7 92.8 92.8
             500 10   93.2 93.2 93.2   93.2 93.2 93.2   92.8 92.9 92.6
             600 10   93.1 93.2 93.0   93.3 93.2 93.4   92.6 92.7 92.6
             100 5    91.2 91.2 91.2   91.9 91.7 92.1   90.6 90.5 90.6
             200 5    92.4 92.5 92.4   92.7 92.6 92.7   92.2 92.5 91.9
             300 5    92.9 93.0 92.9   93.1 93.1 93.2   92.3 92.3 92.3
             400 5    93.0 93.1 92.9   93.3 93.0 93.5   92.5 92.4 92.5
             500 5    93.0 93.1 93.0   93.2 93.1 93.3   92.6 92.6 92.7
             600 5    93.3 93.4 93.1   93.2 93.1 93.4   92.6 92.6 92.7
Subjectivity 100 10   86.7 85.9 87.5   86.2 85.2 87.2   85.6 84.4 86.9
             200 10   87.4 86.9 87.9   86.9 86.1 87.7   86.2 85.2 87.2
             300 10   87.5 87.1 87.9   87.3 86.7 87.9   86.4 85.6 87.3
             400 10   87.7 87.1 88.2   87.7 87.2 88.4   86.5 85.6 87.5
             500 10   87.6 87.0 88.2   87.7 87.3 88.1   86.4 85.6 87.3
             600 10   87.6 87.1 88.2   87.8 87.2 88.4   86.7 85.8 87.6
             100 5    86.5 85.6 87.4   86.3 85.4 87.2   85.4 84.0 86.9
             200 5    87.4 86.9 88.0   86.9 86.4 87.4   86.2 85.2 87.2
             300 5    87.5 87.1 87.9   87.6 87.1 88.1   86.1 85.2 87.1
             400 5    87.7 87.1 88.3   87.5 87.0 88.1   86.5 85.5 87.6
             500 5    87.7 87.1 88.3   87.6 87.3 88.0   86.4 85.4 87.4
             600 5    87.7 87.3 88.2   87.7 87.4 88.1   86.6 85.7 87.4

3.3.2 Categorization
In order to extract opinions about the product price, quality and availability, another set of classifications is applied. The main purpose is to assign zero or more categories to each statement. Thus, three supervised binary classifications are applied. Each category classification task has a different judged dataset. In each dataset, a sample is labeled ‘1’ or ‘0’ based on the judges' agreement on assigning a sentiment for that category or not. As before, the different embeddings are used in the feature extraction phase to choose the most appropriate for each classification.

Table 7 shows the F1-score, Precision and Recall. As shown, increasing the dimension enhances the performance, but only a limited increase is observed beyond dimension 400. For the window sizes, no big difference is noticed in any model. Overall, the GloVe model underperforms the two other models. For dimension 600, both the Skip-Gram and CBOW models achieve almost the best F1-score, with a superiority of the Skip-Gram model with window 10 in the price classification.

Table 7: Accuracies of categorization tasks
                        CBOW              Skip-Gram         GloVe
Task         Dim Win  F1%  Pr%  Rec%   F1%  Pr%  Rec%   F1%  Pr%  Rec%
Price        100 10   64.5 82.1 53.2   63.4 82.0 51.8   60.6 82.0 48.2
             200 10   69.4 83.9 59.3   69.0 84.3 58.4   65.6 83.4 54.2
             300 10   70.3 83.2 61.0   71.3 84.1 62.0   67.1 83.8 56.1
             400 10   72.0 84.0 63.1   72.6 84.0 64.1   68.5 84.2 57.8
             500 10   72.7 83.9 64.3   73.5 84.5 65.2   69.2 83.3 59.3
             600 10   73.7 84.1 65.8   74.2 84.1 66.4   70.0 83.2 60.4
             100 5    63.6 81.9 52.1   63.6 82.6 51.8   59.7 81.0 47.4
             200 5    68.3 82.8 58.2   69.4 84.3 59.1   64.2 82.5 52.7
             300 5    70.4 83.8 60.8   71.8 84.3 62.7   66.6 83.3 55.7
             400 5    72.1 83.9 63.3   73.3 84.3 65.0   67.5 83.0 57.1
             500 5    72.3 83.1 64.1   73.3 84.1 65.1   69.5 83.4 59.8
             600 5    73.0 83.3 65.0   74.0 85.0 65.6   69.7 83.2 60.1
Quality      100 10   83.6 86.1 81.3   83.0 85.8 80.3   81.3 84.4 78.5
             200 10   85.4 87.6 83.3   84.9 87.3 82.6   83.1 86.1 80.2
             300 10   86.1 88.2 84.1   85.6 87.9 83.4   83.8 86.6 81.2
             400 10   85.9 87.9 84.1   85.7 87.8 83.8   84.2 86.9 81.7
             500 10   86.1 88.0 84.3   86.0 87.7 84.3   83.9 86.3 81.6
             600 10   86.3 88.1 84.5   86.2 87.7 84.7   84.4 86.6 82.3
             100 5    84.0 86.4 81.7   82.9 85.4 80.5   81.0 84.4 77.8
             200 5    85.3 87.4 83.3   85.2 87.7 82.8   82.6 85.8 79.7
             300 5    85.6 87.8 83.6   85.9 87.9 84.1   82.7 85.7 80.0
             400 5    86.2 88.4 84.1   86.4 88.2 84.7   83.8 86.5 81.2
             500 5    86.3 88.2 84.5   86.3 87.9 84.8   83.7 86.4 81.2
             600 5    86.3 88.0 84.6   86.3 87.7 84.9   84.2 86.6 82.0
Availability 100 10   75.5 84.3 68.4   75.0 84.1 67.9   74.0 85.5 65.4
             200 10   77.0 84.9 70.6   76.4 84.4 69.9   75.3 84.9 67.8
             300 10   77.3 84.4 71.4   76.9 83.9 71.2   76.2 84.5 69.4
             400 10   77.4 83.7 72.1   77.3 83.3 72.2   75.6 83.3 69.3
             500 10   78.1 84.0 73.1   77.8 83.3 73.0   76.2 83.1 70.4
             600 10   77.9 83.6 73.0   77.8 82.3 73.8   77.1 83.8 71.5
             100 5    75.6 85.1 68.2   74.6 83.6 67.5   73.8 84.5 65.6
             200 5    76.5 84.4 70.0   76.7 84.6 70.2   75.2 84.8 67.7
             300 5    77.4 84.2 71.7   76.9 83.7 71.1   75.7 84.3 68.7
             400 5    78.0 84.2 72.8   77.4 83.4 72.3   76.0 84.1 69.4
             500 5    77.6 83.6 72.5   77.4 82.9 72.7   76.6 84.4 70.2
             600 5    78.0 83.6 73.2   77.5 82.7 73.0   77.0 84.0 71.1

The whole set of previous experiments has been repeated using 50 different runs for the SM, SM+ALTW, SM+ARAC and SM+WK Skip-Gram embeddings with 600 dimensions. As shown in Table 8, enlarging the social media corpus with the other MSA corpuses did not enhance the F1-score of most of the classifications; the exception is the Quality classification, where the SM+WK embedding shows a 0.2% better F1 accuracy than the SM embedding. From the above results, it has been concluded to use the Skip-Gram model with dimension 600 and window 10, trained from the purely social media corpus.

3 https://hpc.bibalex.org/
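The F1, Precision and Recall figures reported throughout these tables follow the standard definitions for a binary classifier, which can be computed as below. This is a generic sketch of the metrics, not the authors' evaluation harness.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard binary-classification metrics.
    Precision = TP / (TP + FP); Recall = TP / (TP + FN);
    F1 is their harmonic mean."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    pr = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rec / (pr + rec) if pr + rec else 0.0
    return pr, rec, f1

# 3 true positives, 1 false positive, 1 false negative:
print(precision_recall_f1([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))  # -> (0.75, 0.75, 0.75)
```

Note how the categorization tables show precision well above recall: the classifiers rarely assign a category wrongly but miss many category mentions, which the F1 harmonic mean penalizes.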
     Date     Event
A    7/2016   Increase in fuel prices
B    9/2016   TV show ‘Made in Egypt’, episode 1
C    11/2016  Floating the Egyptian pound
D    1/2017   TV show ‘Made in Egypt’, episode 2
E    1-3/2017 Egyptian decision to stop importing
F    7/2017   Increase in fuel prices

Figure 3 shows the overall sentiments of the whole dataset during the two-year period. As shown, positive and negative sentiments follow the same pattern across time, with positivity almost double negativity (79% neutral, 15% positive and 6% negative). As depicted in Figure 4, people are equally concerned with the three aspects. Quality was a hot topic after the first media show. Availability was of interest when the pound floated. Prices became the main interest during the importing ban.

Figure 3: Number of feeds/comments per sentiment during 2016 and 2017

Figure 5: Price sentiments    Figure 6: Quality sentiments    Figure 7: Availability sentiments

5. Conclusion

Sentiment analysis in Arabic is still a fertile field of research, especially for social media. In this paper, a platform has been established for sentiment analysis of Arabic social media. The supervised binary classification method is adopted, using neural word embeddings for feature extraction. A case study has been applied, analyzing sentiments towards ‘Products Made in Egypt’ in terms of their price, quality and availability. From the intrinsic and extrinsic embedding evaluations, it was revealed that enlarging the social media corpus that trains the embedding with another generic MSA corpus does not guarantee enhancing the accuracy of the classifier. In addition, different embedding models were trained with different parameters to be used in the feature extraction phase of the different supervised classifications. The results showed that GloVe underperforms Skip-Gram and CBOW in all classifications. Furthermore, it was revealed that using dimension 600 and window size 5 or 10 with either the CBOW or the Skip-Gram model leads to the best accuracy in all classifications. Across the different classifications, the F1-score ranges from 74% to 93%, with a precision ranging from 82% to 93%.

In the future, we hope to enlarge the corpus to include Twitter, Instagram and other social media. In addition, we hope to apply other case studies of interest to the platform.