PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON DATA SCIENCE

ICDATA’18

Editors
Robert Stahlbock, Gary M. Weiss, Mahmoud Abou-Nasr

Associate Editor
Hamid R. Arabnia

Publication of the 2018 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE’18)
July 30 - August 02, 2018 | Las Vegas, Nevada, USA
https://americancse.org/events/csce2018

Copyright © 2018 CSREA Press


This volume contains papers presented at the 2018 International Conference on Data Science. Their inclusion in
this publication does not necessarily constitute endorsements by editors or by the publisher.

Copyright and Reprint Permission

Copying without a fee is permitted provided that the copies are not made or distributed for direct
commercial advantage, and credit to source is given. Abstracting is permitted with credit to the source.
Please contact the publisher for other copying, reprint, or republication permission.

American Council on Science and Education (ACSE)



ISBN: 1-60132-481-2
Printed in the United States of America
https://americancse.org/events/csce2018/proceedings
Foreword
It gives us great pleasure to introduce this collection of papers to be presented at the
14th International Conference on Data Science 2018, ICDATA’18 (http://icdata.org), July 30 –
August 2, 2018, at Luxor Hotel (a property of MGM Resorts International), Las Vegas, USA.

Data mining, or machine learning, is critically important if we want to learn effectively from the tremendous amounts of data that are routinely being generated in science, engineering, medicine, business, and other areas, in order to gain insight into processes and transactions, extract knowledge, make better decisions, and deliver value to users and organizations. This is even more important and challenging in an era in which scientists and practitioners are faced with numerous challenges caused by the exponential expansion of digital data and its diversity and complexity. The scale and growth of data considerably outpace the technological capacities of organizations to process and manage it. During the last decade, we have all observed new, more glorious and promising concepts or labels emerging and slowly but steadily displacing ‘data mining’ from the agendas of CTOs. It was, and still is, the time of data science, big data, and advanced-/business-/customer-/data-/predictive-/prescriptive-/…/risk-analytics, to name only a few terms that dominate websites, trade journals, and the general press – although there is even a rebirth of terms such as artificial intelligence and (machine) learning (e.g., deep learning) in academia, companies, and even on the agenda of political decision makers. All these concepts aim at leveraging data for a better understanding of, and insight into, complex real-world phenomena. They all pursue this objective using formal, often algorithmic, procedures, at least to some extent. This is what data miners have been doing for decades. The idea underlying all those similar or identical concepts with different labels, namely to treat massive, omnipresent amounts of data as strategic assets and to capitalize on those assets by means of analytic procedures, is indeed more relevant and topical than ever before. Although there have been very helpful advances in hardware and software, many challenges must still be tackled in order to deliver on the promises of data analytics. Obviously, technological change is never ending and appears to be accelerating. Right now the world seems especially focused on machine learning and data mining (not contradictory but similar or even equivalent to data science), as these disciplines are making an ever increasing impact on our society. Large multinational corporations are expanding their efforts in these areas, and students are flocking to computer science and related disciplines in order to learn about these fields and take advantage of the many lucrative job opportunities.

The growth in all these areas has been dramatic enough to require changes in nomenclature. Most of these ‘hot’ technologies and methods are increasingly considered part of the broad field of data science, and there are benefits to viewing this field as a unified whole rather than a collection of disparate sub-disciplines. For this reason, the International Conference on Data Mining (DMIN), which had been held for 13 consecutive years, has been renamed the International Conference on Data Science (ICDATA). In a further effort to unify the field, the former International Conference on Advances in Big Data Analytics (ABDA) is being merged into ICDATA. But ICDATA is still much broader than just data mining and ‘big data.’ It includes the following main topics: all aspects of data mining and machine learning (tasks, algorithms, tools, applications, etc.); all aspects of big data (algorithms, tools, infrastructure, and applications); data privacy issues; and data management. The conference is designed to be of equal interest to researchers and practitioners; academics and members of industry; and computer scientists, physical and social scientists, and business analysts.

An important mission of the World Congress in Computer Science, Computer Engineering, and Applied Computing, CSCE (the federated congress with which this conference is affiliated) includes "Providing a unique platform for a diverse community of constituents composed of scholars, researchers, developers, educators, and practitioners. The Congress makes concerted effort to reach out to participants affiliated with diverse entities (such as: universities, institutions, corporations, government agencies, and research centers/labs) from all over the world. The congress also attempts to connect participants from institutions that have teaching as their main mission with those who are affiliated with institutions that have research as their main mission. The congress uses a quota system to achieve its institution and geography diversity objectives." By any definition of diversity, this congress is among the most diverse scientific meetings in the USA. We are proud to report that this federated congress has authors and participants from 67 different nations, representing a variety of personal and scientific experiences that arise from differences in culture and values. As can be seen below, the program committee of this conference, like the program committees of all other tracks of the federated congress, is as diverse as its authors and participants.

Data science attracts innovative and influential contributions to both research and practice, across a wide range of academic disciplines and application domains. Our conference seeks to acknowledge and facilitate excellence in research and applications in the area of data science. Our conference is held annually within CSCE. CSCE'18 assembles a spectrum of 20 affiliated research conferences, workshops, and symposiums into a coordinated research meeting. Each conference has its own program committee, referees, and proceedings. Attendees have full access to all 20 conferences' sessions, tracks, and tutorials. ICDATA seeks to reflect the multi- and interdisciplinary nature of data mining and to facilitate the exchange and development of novel ideas, open communication, and networking amongst researchers and practitioners in different research domains. As in previous years, we hope that the 2018 International Conference on Data Science will provide a forum to present your research in a professional environment, exchange ideas, and network and interact across research areas. ICDATA’18 provides an international and multicultural experience, with contributions from 22 different countries. We consider the resulting diversity in attendees and the mixture of established and early-career researchers a particular advantage of an engaging conference format.

ICDATA’18 attracted submissions of theoretical research papers as well as industrial reports, application case studies, and, in a second phase, late-breaking papers, position papers, and abstract/poster papers. The program committee would like to thank all those who submitted papers for consideration. We strove to establish a review process of high quality. To ensure a fair, objective, and transparent review process, all review criteria were published on the website. Papers were evaluated regarding their relevance to ICDATA, originality, significance, information content, clarity, and soundness on an international level. Each aspect was evaluated objectively, with alternative criteria considered for application papers. Each paper was refereed by at least two researchers in the topical area, taking the reviewers’ expertise and confidence into consideration, with most of the papers receiving three reviews. The review process was competitive: the overall acceptance rate was 54% (including regular research papers, short research papers, posters, position papers, and late-breaking papers).

We are very grateful to the many colleagues who helped in organizing the conference. In
particular, we would like to thank the members of the program committee of ICDATA’18 and the
members of the congress steering committee. The continuing support of the ICDATA program
committee has been essential to further improve the quality of accepted submissions and the
resulting success of the conference. The ICDATA’18 program committee members are (in
alphabetical order): Mahmoud Abou-Nasr, Kai Brüssau, Paulo Cortez, Richard de Groof, Diego
Galar, Peter Geczy, Zahid Halim, Tzung-Pei Hong, Wei-Chiang Hong, Ulf Johansson, Madjid
Khalilian, Terje Kristensen, Zhang Sen, Robert Stahlbock, Chamont Wang, Simon Wang, Gary
M. Weiss, Zijiang Yang, and Wen Zhang. Together with additional reviewers, they all did a great
job in reviewing a lot of submissions in short time.
We would also like to thank our publicity co-chair Ashu M. G. Solo (Fellow of British Computer Society, Principal/R&D Engineer, Maverick Technologies America Inc.) for circulating information on the conference, as well as www.KDnuggets.com, a platform for analytics, data mining, and data science resources, for listing ICDATA’18.

Considering the increasing efforts of all towards the quality of the review process, the
conference sessions and the social program of ICDATA’18, we are confident that you will find
the conference stimulating and rewarding. It is a particular pleasure to provide data mining
oriented invited talks and tutorials presented by the following esteemed members of the data
mining community: Richard Dunks (Datapolitan, USA), Diego Galar (Luleå University of
Technology, Sweden), Peter Geczy (AIST, Japan), Dawud Gordon (TwoSense, USA), Andrew H.
Johnston (Mandiant, USA), and Ulf Johansson (Jönköping University, Sweden).

As sponsors-at-large, partners, and/or organizers, each of the following (separated by semicolons) provided help for at least one track of the Congress: Computer Science Research, Education, and Applications Press (CSREA); US Chapter of World Academy of Science; American Council on Science & Education & Federated Research Council (http://www.americancse.org/). In addition, a number of university faculty members and their
staff (names appear on the cover of the set of proceedings), several publishers of computer
science and computer engineering books and journals, chapters and/or task forces of computer
science associations/organizations from 3 regions, and developers of high-performance machines
and systems provided significant help in organizing the conference as well as providing some
resources. We are grateful to them all. We are also grateful for support by the Institute of
Information Systems at Hamburg University, Germany.

We express our gratitude to keynote, invited, and individual conference/track and tutorial speakers - the list of speakers appears on the conference web site. We would also like to thank the following: UCMSS (Universal Conference Management Systems & Support, California, USA) for managing all aspects of the conference; Dr. Tim Field of APC for coordinating and managing the printing of the proceedings; and the staff of the Luxor Hotel (Convention department) in Las Vegas for the professional service they provided.

Last but not least, we wish to express again our sincere gratitude and respect towards our colleague and friend Prof. Hamid R. Arabnia (Professor, Department of Computer Science, University of Georgia, USA; Editor-in-Chief, Journal of Supercomputing/Springer), General Chair and Coordinator of the federated congress, and also Associate Editor of ICDATA’18, for his excellent and tireless support, organization, and coordination of all affiliated events. His exemplary and professional effort in 2018, as in the many years before on the steering committee of the congress, makes these events possible. We are grateful to continue our data science conference as ICDATA’18 under the umbrella of the CSCE congress.

Thank you all for your contribution to ICDATA’18! We hope that you will experience a
stimulating conference with many opportunities for future contacts, research and applications.

We present the proceedings of ICDATA’18.

Robert Stahlbock
ICDATA’18 General Conference Chair

Steering Committee ICDATA’18


http://icdata.org
Contents
SESSION: REAL-WORLD DATA MINING APPLICATIONS, CHALLENGES, AND
PERSPECTIVES + MACHINE LEARNING

Mining Significant Terminologies in Online Social Media Using Parallelized LDA for the Promotion of Cultural Products  3
Richard de Groof, Haiping Xu, Jurui Zhang, Raymond Liu

Multi-label Classification of Single and Clustered Cervical Cells Using Deep Convolutional Networks  10
Melanie Kwon, Mohammed Kuko, Vanessa Martin, Tae Hun Kim, Sue Ellen Martin, Mohammad Pourhomayoun

Credit Default Mining using Combined Machine Learning and Heuristic Approach  16
Sheikh Rabiul Islam, William Eberle, Sheikh Khaled Ghafoor

Customer Level Predictive Modeling for Accounts Receivable to Reduce Intervention Actions  23
Michelle LF Cheong, Wen Shi

A Platform For Sentiment Analysis on Arabic Social Media  30
Sara Elshobaky, Noha Adly

Traffic Jam Direction Field Clustering  37
Alejandro Molina-Villegas, Louis Breton-Tenorio, Adriana Perez-Espinosa, Jose Luis Quiroz-Fabian, Araceli Liliana Reyes-Cabello, Emilio Bravo-Grajales

Message Classification for Generalized Disease Incidence Detection with Topologically Derived Concept Embeddings  42
Mark Abraham Magumba, Peter Nabende, Ernest Mwebaze

Data-Driven Exploration of Factors Affecting Federal Student Loan Repayment  49
Bin Luo, Qi Zhang, Somya Mohanty

Analysis of Tornado Environments using Convolutional Neural Networks  56
Michael P. McGuire, Todd W. Moore

On Monitoring Heat-pumps with a Group-based Conformal Anomaly Detection Approach  63
Shiraz Farouq, Stefan Byttner, Mohamed-Rafik Bouguelia

ML2ESC: A Source Code Generator to Embed Machine Learning Models in Production Environments  70
Oscar Castro-Lopez, Ines F. Vega-Lopez

Effective Machine Learning Approach to Detect Groups of Fake Reviewers  74
Jayesh Soni, Nagarajan Prabakar

Sentiment Probing of Social Media Data using Various Supervised Learners  79
Sourangshu Das, S. Kami Makki

Evaluating Self-Organizing Map Quality Measures as Convergence Criteria  86
Gregory Breard, Lutz Hamel

SwiftVis2: Plotting with Spark using Scala  93
Mark C. Lewis, Lisa L. Lacher

Organizations' Investment in Cloud Computing: Designing a Decision Support Platform  100
Jacques Bou Abdo, Najma Saidani, Ahmed Bounfour

RNN as a Multivariate Arrival Process Model: Modeling and Predicting Taxi Trips  105
Xian Lai, Gary Weiss

Implementing Data Mining Methods to Predict Diamond Prices  112
Jose M. Pena Marmolejos

Analyzing Inventory Data Using K-Means Clustering  117
Michael Pigman, Hung Le, Urvish Bhagat, Maxfield Thompson, Qingguo Wang

Assessment of Minorities Access to Finance  123
Emi N. Harry, Gary Weiss

Open-Source Neural Network and Wavelet Transform Tools for Server Log Analysis  130
Chunyu Liu, Tong Yu

Recursion Identify Algorithm for Gender Prediction with Chinese Names  137
Hua Zhao, Fairouz Kamareddine

Maximizing the Processing Rate for Streaming Applications in Apache Storm  143
Ali Al-Sinayyid, Michelle Zhu

Machine Learning Models for Predicting Fracture Strength of Porous Ceramics and Glasses  147
Saishruthi Swaminathan, Tejash Shah, Birsen Sirkeci-Mergen, Ozgur Keles

Big Data Security Analytics: Key Challenges  151
Ripon Patgiri, Umakanta Majhi

SESSION: CLUSTERING, ASSOCIATION + WEB / TEXT / MULTIMEDIA MINING + SOFTWARE

Improvement of the Firefly-based K-means Clustering Algorithm  157
Lubo Zhou, Longzhuang Li

Anomaly Detection of Elderly Patient Activities in Smart Homes using a Graph-Based Approach  163
Ramesh Paudel, William Eberle, Lawrence B. Holder

Finding a Balance Between Interestingness and Diversity in Sequential Pattern Mining  170
Rina Singh, Jeffrey A. Graves, Lemuel R. Waitman, Douglas A. Talbert

Modeling Topics on Open Source Apache Spark Repositories  177
Xiaoran Wang, Leonardo Jimenez Rodriguez, Jilong Kuang

Effective Grouping of Unlabelled Texts using A New Similarity Measure for Spectral Clustering  181
Arnab Roy, Tanmay Basu

SESSION: REGRESSION AND CLASSIFICATION

High-Performance Support Vector Machines and Its Applications  187
Taiping He, Tao Wang, Ralph Abbey, Joshua Griffin

Dyn2Vec: Exploiting Dynamic Behaviour using Difference Networks-based Node Embeddings for Classification  194
Sandra Mitrovic, Jochen De Weerdt

A Generalized Method for Fault Detection and Diagnosis in SCADA Sensor Data via Classification with Uncertain Labels  201
Md Ridwan Al Iqbal, Rui Zhao, Qiang Ji, Kristin P. Bennett

Tuning the Layers of Neural Networks for Robust Generalization  208
Chun Pang Chiu, Kwok Yee Michael Wong

FAWCA: A Flexible-greedy Approach to find Well-tuned CNN Architecture for Image Recognition Problem  214
Md Mosharaf Hossain, Douglas A. Talbert, Sheikh Khaled Ghafoor, Ramakrishnan Kannan

Venn Predictors Using Lazy Learners  220
Ulf Johansson, Tuwe Lofstrom, Hakan Sundell

A Multivariate Linear Regression Analysis of In Vitro Testing Conditions and Brain Biomechanical Response under Shear Loads  227
Folly Crawford, Jennifer Fisher, Osama Abuomar, Raj Prabhu

An Effective Nearest Neighbor Classification Technique Using Medoid Based Weighting Scheme  231
Avideep Mukherjee, Tanmay Basu
SESSION: POSTER PAPERS AND EXTENDED ABSTRACTS

Yet Another Weighting Scheme for Collaborative Filtering Towards Effective Movie Recommendation  237
Anurag Banerjee, Tanmay Basu

Towards Developing Effective Machine Learning Frameworks to Identify Toxic Conversations Over Social Media  239
Ritam Majumder, Tanmay Basu

An Interactive Data Analytics System for Employment Acceptance Recommendation  241
Opeyemi Fadairo, David Olatunbosun, Premsagar Ravula, Manpreet Kaur

SESSION: LATE BREAKING PAPERS

Analysis of EHR Free-text Data with Supervised Deep Neural Networks  245
Duncan Wallace, Tahar Kechadi

Predicting Epileptic Seizures using Dimensionality Reduction Techniques  252
Luis A. Martinez-Lopez, Daniel A. Martinez-Perez

Predictive Hybrid Machine Learning Model for Network Intrusion Detection  258
Ebrahim Alareqi, Khalid Abed

Modelling Radon Variability in Qatar: Challenges & Opportunities  263
Kassim Mwitondi, Ibrahim Al Sadig, Rifaat Hassona, Charles Taylor, Adil Yousif

Acceleration of Python Artificial Neural Network in a High Performance Computing Cluster Environment  268
Christopher Rosser, Ebrahim Alareqi, Anthony Wright, Khalid Abed

Stay Point Analysis in Automatic Identification System Trajectory Data  273
Yihan Cai, Menghan Tian, Weidong Yang, Yi Zhang

To Read or to Do? That is the Task  279
Zaid Alibadi, Jose Vidal
Int'l Conf. Data Science | ICDATA'18 | 1

SESSION
REAL-WORLD DATA MINING APPLICATIONS,
CHALLENGES, AND PERSPECTIVES + MACHINE
LEARNING

Chair(s)
Dr. Mahmoud Abou-Nasr
Dr. Robert Stahlbock
Dr. Gary M. Weiss

ISBN: 1-60132-481-2, CSREA Press ©



Mining Significant Terminologies in Online Social Media Using Parallelized LDA for the Promotion of Cultural Products

Richard de Groof, Haiping Xu, Jurui Zhang, and Raymond Liu

Computer and Information Science Department, University of Massachusetts Dartmouth, Dartmouth, MA, USA
Department of Marketing, University of Massachusetts Boston, Boston, MA, USA

Abstract - Despite the growing popularity of online social media, there are very few research efforts to use online social media to study market strategies for the promotion of cultural products. With online content being largely unregulated, Latent Dirichlet Allocation (LDA) provides a useful mechanism for organizing textual data and deriving conclusions about the subject matter. In this paper, we introduce a parallelized LDA, called pLDA, to analyze clustered textual data in online social media. We use pLDA to infer the posterior of latent topics over documents and words, and identify significant terminologies that describe the vast number of posts. Making use of sentiment analysis, we are able to further make suggestions about the relevant topics for promoting cultural products. Finally, we use a case study of the music industry to demonstrate how the most relevant aspects to artist popularity can be derived.

Keywords: Cultural products, online social media, text mining, latent Dirichlet allocation, topic modeling

1. Introduction

The proliferation of online social media technologies has resulted in a tremendous amount of information becoming readily available. Social media websites such as Twitter and Facebook provide the users an opportunity to speak their mind on pages hosted by everyone from celebrities to artists, from young kids to pop shop owners. Due to its popularity, the potential utility of online social media in promoting cultural products is being increasingly recognized. The social media sites typically require artists such as musicians to establish an online presence by which they may disseminate their brand names. Much of the feedback posted online could be useful to identify trends and understand what is important in cultural products, such as those produced by musicians. However, mining social media has been a difficult task because so much of the information is in the form of unstructured free text. Previous work has focused on the application of text mining techniques that require manual interpretation. Since there is too much information to provide manual labels, there is a pressing need to develop tools to automatically categorize text and provide meaningful interpretation. To achieve this, it is necessary to identify the subject matter involved. One of the most popular topic modeling techniques is called Latent Dirichlet Allocation (LDA), which is a powerful method for learning topic distributions in text [...]. LDA is an unsupervised topic modeling technique for mining text data and deriving latent topic distributions. Like other unsupervised techniques, it does not require a labeled training set for its operation. This makes it very useful in the context of a large amount of uncategorized data, such as musicians' Facebook pages. However, to make meaningful classifications, a user must inspect the results to determine the associations between hidden topics and documents. In addition, the computational complexity of this methodology may render its use limited for massive documents.

In this paper, we introduce an LDA-based approach to mining significant terminologies in online social media for the promotion of cultural products. Our unique process uses a post-processing step to LDA, which allows us to broadly categorize the posts with less manual intervention. Once we know the subject matter of the posts, we may use Stanford CoreNLP to provide sentiment analysis, deriving the negative or positive orientation of each. Using meaningful indicators, such as the Facebook numbers of followers, and solving a system of equations characterizing each artist, we may derive the relationship between these topics and artist popularity. Generally, a linear system of equations may indicate relevance of one or more common attributes (calculated sentiment scores of commonly occurring subject matter) attributed to each artist. Based on various factors that are important to culture products, artist popularity can be considered as the dependent factor, and brand niche and audience consensus, amongst others, as independent factors using such equations [...]. We focus on the text content of the online social media and use sentiment analysis to determine the orientation, and thereby relevance, of each independent factor to artist popularity. To improve the efficiency of our approach, we introduce a parallelized LDA procedure, called pLDA, to mine text-based online social media. In a case study of mining online social media for the music industry, we show that our approach can not only effectively identify the most relevant aspects to artist popularity according to the sentiments expressed by the listening audience, but also run faster than its sequential version of the LDA mechanism.

* This material is based upon work supported by the President's Creative Economy (CE) Fund, University of Massachusetts.




 5HODWHG:RUN ZHUHDJJUHJDWHGLQDVHSDUDWHVWHSEHIRUHFRQWLQXLQJ6XFK
UHVXOWV DUH FRQVLGHUHG DQ DSSUR[LPDWLRQ WR RXWSXWV RI WKH
 5HVHDUFKHUV KDYH LQFUHDVLQJO\ UHFRJQL]HG WKDW RQOLQH RULJLQDO /'$ DOJRULWKP 6LQFH WKH VDPH ZRUG PD\ RFFXU
VRFLDO PHGLD RIIHUV D EUHDGWK RI LQIRUPDWLRQ UHODWHG WR WKH DFURVV PXOWLSOH GRFXPHQWV WKH WUXH UHVXOWV PXVW FRQVLGHU
SURPRWLRQRIFXOWXUDOSURGXFWV*RKet al.VWXGLHGWKHHIIHFW WKH WRSLF DVVLJQPHQW WR HDFK LQ RUGHU WR DWWDLQ WKH WUXH
RI XVHUJHQHUDWHG DQG PDUNHWHUJHQHUDWHG FRQWHQW LQ VRFLDO GLVWULEXWLRQ 'LIIHUHQW IURP WKH DERYH DSSURDFKHV RXU
PHGLDRQFRQVXPHUV¶UHSHDWHGSXUFKDVHEHKDYLRUV>@7KH\ PHWKRG XVHV D UHDOWLPH GDWD V\QFKURQL]DWLRQ PHFKDQLVP
XVHG FRPPHUFLDO WH[W PLQLQJ DSSOLFDWLRQV WR DQDO\]H WH[W IRU EHWWHU DFFXUDF\ QRW RQO\ PDNLQJ WKH UHVXOWV PRUH
JDWKHUHG IURP )DFHERRN DQG IRXQG WKDW XVHUJHQHUDWHG FRPSDUDEOH WR D VHTXHQWLDO LPSOHPHQWDWLRQ EXW DOVR
FRQWHQWKDGDPRUHVLJQLILFDQWLPSDFW6LPLODUO\.LPet al. SURYLGLQJDVSHHGXSGXHWRSDUDOOHOL]DWLRQ
DQDO\]HGWH[WUHYLHZVRIKRWHOVRQ7ULS$GYLVRUWRGLVFRYHU  2WKHU UHODWHG ZRUN KDV IRFXVHG RQ WKH DSSOLFDWLRQ RI
VDWLVILHUV DQG GLVVDWLVILHUV DV ZHOO DV WKH UHDVRQV ZK\ LQIRUPDWLRQ GHULYHG IURP VRFLDO PHGLD WRZDUGV WKH PXVLF
SHRSOH OHDYH SRVLWLYH RU QHJDWLYH UHYLHZV >@ 7KH\ DOVR LQGXVWU\)RUH[DPSOH0X6H1HWDQHWZRUNRIPXVLFDUWLVWV
OHYHUDJHGDIHDWXUHRQ7ULS$GYLVRUZKLFKDOORZVWKHXVHU DURXQG WKH ZRUOG OLQNHG E\ SURIHVVLRQDO UHODWLRQVKLSV ZDV
WR OHDYH D FDWHJRULFDO UDWLQJ +H et al. VWXGLHG WKH SL]]D GHYHORSHGDVDQH[DPLQDWLRQRIWKHVRFLDOQHWZRUNDVSHFWRI
LQGXVWU\ XVLQJ WH[W SRVWV IURP )DFHERRN DQG 7ZLWWHU LQ DQ VLWHV OLNH )DFHERRN XVHG LQ WKH PXVLF DUWLVW LQGXVWU\ >@
HIIRUW WR JDLQ PDUNHWLQJ LQVLJKW >@ 8VLQJ H[LVWLQJ WH[W 5HODWLRQVKLSV DPRQJVW VXEVFULEHUV PD\ EH DVVHPEOHG LQ D
PLQLQJ WRROV WKH\ LGHQWLILHG WKHPHV LQ WKH GDWD WKDW WKH\ JUDSK WR XQGHUVWDQG XQGHUO\LQJ SKHQRPHQD $V VXFK
XVHG WR FDWHJRULFDOO\ FRPSDUH WKUHH PDMRU SL]]D FKDLQV DSSURDFKHV SURYLGH XVHIXO LQVLJKWV DERXW RQOLQH VRFLDO
8QOLNHWKHDERYHZRUNZHLQWURGXFHRXUQRYHOSDUDOOHOL]HG PHGLDGDWDWKH\DUHFRPSOHPHQWDU\WRRXUUHVHDUFKHIIRUWV
WH[WPLQLQJDSSURDFKDQGRXUXQLTXHSURFHGXUHDOORZVWKH RQDQDO\]LQJWKHWH[WEDVHGVRFLDOPHGLDIRUWKHSURPRWLRQ
LGHQWLILFDWLRQRIWRSLFVWREHODUJHO\DXWRPDWLF RIFXOWXUDOSURGXFWV
 7KHUHDUHDOVRDIHZSUHYLRXVUHVHDUFKHIIRUWVIRFXVHG
on the use of LDA to analyze online social media. Qiang et al. incorporated features such as "geo-tracking" to aid in the identification of geographical topics from social media [6]. Their method is based on the LDA model using generation probabilities, which generates each keyword from either a local or a global topic distribution. LDA has also been applied to musical recommendation systems. Kinoshita et al. proposed a system to describe musical preferences by considering different tags associated with artist, genre, and user preferences [7]. They used Collaborative Filtering (CF) based similar user selection to recommend music products to users with similar tastes to a target artist. Such text mining methodologies have been used in a variety of contexts, but often in their original formulation, applying the returned topic distributions directly [8, 9]. The returned matrices are typically interpreted as clusters which represent various combinations of underlying topics. In contrast, our approach derives significant terminologies conditioned on documents, which are highly reflective of the underlying topics. In our approach, we apply the probability of word given document to the identification of latent topics and consider freely formed text rather than information derived from tags.

Recent work has discussed the parallelization of the LDA process. Newman et al. introduced a parallel process for LDA which essentially divides the document set into sections and distributes it across computation units [10]. Similarly, Wang et al. introduced PLDA, which operates in the same fashion but incorporates Hadoop functions [11]. Liu et al. introduced PLDA+, which works with a pipeline system to perform the same task [12]. A critical aspect of these parallel algorithms is synchronization. In previous work, inference techniques like Collapsed Gibbs Sampling (CGS) were used on partitions of the data sets [10-12]. After each processor had performed a single iteration, the results were synchronized across the computation units.

Significant Terminology Identification

Figure 1 shows the procedure by which text data from an online social medium, such as Facebook pages, are processed. Popular and less-known local artists were identified, and review text is extracted using browser scripting mechanisms. Using an online tagging service such as Last.fm, popular tags are extracted for each artist. Assembling the tags into vectors for each artist, k-means clustering can be used to separate the artists into similar categories, i.e., artists with similar proportions of identical tags. We do this for two reasons: to reduce the data size for running pLDA, and also to support our conclusion that different types of artists receive comments reflecting different subject matter and useful information. The outputs of the pLDA process are two matrices, namely θ, representing the topic distribution for each document input into pLDA, and φ, representing the distribution of words over topics. For pLDA data processing, each post represents a document, which usually discusses a single subject. For those instances in which one subject is discussed, clustering by topic distributions associates comments discussing the same topic together. Also, comments addressing the same multiple topics are clustered together according to their unique topic distributions.

Once the posts and comments have been extracted and clustered according to artist tags, they are processed using pLDA. Adapted from [14], the matrices θ and φ can be calculated using CGS as in Eqs. (1) and (2):

θ_{m,k} = n_{m,k}^{−(m,n)} + α_k    (1)

φ_{k,n} = (n_{k,n}^{−(m,n)} + β_n) / Σ_{r=1}^{N} (n_{k,r}^{−(m,n)} + β_r)    (2)
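Read literally, Eq. (1) adds the Dirichlet smoothing term α_k to the document-topic count with the current token excluded, while Eq. (2) smooths and normalizes the topic-word count over the vocabulary; their product is proportional to the CGS conditional derived later. A minimal sketch in Python (the count variables are hypothetical stand-ins for the current Gibbs state):

```python
def theta_entry(n_mk, alpha_k):
    # Eq. (1): document-topic weight for document m and topic k,
    # with the current token (m, n) already excluded from n_mk.
    return n_mk + alpha_k

def phi_entry(n_k_row, beta, n):
    # Eq. (2): probability of word n under topic k. n_k_row[r] is the
    # count of topic k assigned to word r (token (m, n) excluded);
    # beta is the Dirichlet hyperparameter vector over the vocabulary.
    num = n_k_row[n] + beta[n]
    den = sum(c + b for c, b in zip(n_k_row, beta))
    return num / den

def cgs_weight(n_mk, alpha_k, n_k_row, beta, n):
    # Product of the two entries: proportional to the CGS conditional
    # used to draw a new topic for token (m, n).
    return theta_entry(n_mk, alpha_k) * phi_entry(n_k_row, beta, n)
```

Because Eq. (2) is normalized over the vocabulary, the φ entries for a single topic sum to 1.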

ISBN: 1-60132-481-2, CSREA Press ©



Int'l Conf. Data Science | ICDATA'18 | 5

Derivation of Significant Terminologies

Preprocessing of Text Data

Before the online media posts are processed with pLDA, they are clustered according to artist genre tags. As an example shown in Fig. 2, popular tags for each artist were collected from Last.fm, with the artists and the artist clusters displayed inside the boxes. Each tag has a count attributed to it, which represents the number of Last.fm users who have applied that tag. Counts of commonly occurring tags are assembled into a vector for each artist, where each vector is normalized with the summation of its elements to 1. These vectors are then clustered with k-means to produce artist clusters composed of similarly classified musicians according to artist genre tags.
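The preprocessing just described, assembling tag counts into vectors, normalizing each to sum to 1, and clustering with k-means, can be sketched as follows. The tag vocabulary and artists are hypothetical, and a plain k-means with fixed initial centroids keeps the example deterministic:

```python
def tag_vector(tag_counts, vocab):
    # Assemble counts of the shared vocabulary tags and normalize the
    # vector so that its elements sum to 1, as described above.
    v = [float(tag_counts.get(t, 0)) for t in vocab]
    s = sum(v)
    return [x / s for x in v] if s else v

def kmeans(vectors, centroids, iters=10):
    # Plain k-means on the normalized tag vectors; initial centroids
    # are passed explicitly to keep the example reproducible.
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            groups[d.index(min(d))].append(v)
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups
```

With a toy vocabulary such as ["folk", "blues", "indie"], two folk-heavy and two indie-heavy artists separate into two clusters of two, mirroring the artist-cluster boxes in Fig. 2.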

[Figure 1: Trends identification in cultural product popularity]

[Figure 2: An example of clustering artist comments]

The basic idea of Eqs. (1) and (2) is that LDA defines a probabilistic model which can be derived from word counts and the hyperparameter vectors α and β. In the equations, n represents the counts of topics assigned to words and documents. The superscript k represents the topic index, and −(m, n) represents the exclusion of the topic for document m and word n, as required by CGS. Matrices θ and φ are calculated using document-topic pairs and topic-word pairs, respectively. Each resulting θ from the pLDA process on artist clusters is then input to the k-means algorithm, aggregating into clusters that represent similar distributions of the underlying topics. This step separates the documents into groups representing similar topics. As such, the comments are classified according to content, aiding in the identification of subject matter. These clusters are then multiplied with φ, yielding the significant terminologies, which are the basis for classification of comments by topics (the detailed procedure is described in the Significant Terminologies subsection). The corresponding sentences constituting the documents, labeled by topics, were then classified according to sentiment scores. The results are inputs of popularity equations, which represent the total sentiment scores by topic, accumulating to artist popularity. Assuming that the dependent factor, artist popularity, relies on the sentiment expressed by all topics, the popularity equations can be solved to derive coefficients representing the topic significance for artist popularity. We then present the results of these simultaneous equations in a case study (described below) to draw conclusions about which topics are highly relevant in the music industry.

Parallelized LDA

Algorithm 1 shows the pLDA routines, where partitions are made of the data set across documents. Since the documents themselves are disjoint, the only required synchronization among pLDAWorker processes is the assignment of topics to the shared words in the worker processes. In each iteration, the topic assignments are recorded in matrix prevAssign, whose values are used in the next iteration. The shared matrices γ and η represent the counts of topic assignments to words and the counts of topic assignments to documents, respectively, while sync is a semaphore that controls access to γ.

Assume there are L processes of pLDAWorker that perform the CGS computations in parallel. The algorithm begins by initializing the shared variables θ and φ and creating the worker processes. Note that, to make the algorithm easy to read, we assume the number of documents M is divisible by L. The matrix prevAssign, representing the current assignment of topic to document and word, is initialized with random topic assignments. The counts in γ and η are then updated with the topic assignments in prevAssign. Barrier synchronization is used to assure that all processes start the sampling only after the variables have been initialized. The matrix prevAssign is required by CGS because the current topic is sampled considering the conditional probability of all other




assignments, except for the current assignment. The algorithm proceeds in a loop across documents and words, over iterations, until convergence: the point at which modifications to the topic counts stabilize and the results are approaching the true distribution.

Algorithm 1: pLDA Main Process
Shared Variables: γ is a K×N matrix representing the counts of topic k assigned to word w, where K is the number of topics and N is the number of words; η is an M×K matrix representing the counts of topic k assigned to document m, where M is the number of documents; sync is a semaphore to synchronize writes to γ.
Input: Ψ is an M×N matrix representing the counts of each word in each document.
Output: θ and φ are M×K and K×N matrices that represent the distribution of topics over documents and the distribution of words over topics, respectively.

1:  Initialize θ and φ to 0
2:  process pLDAWorker[p = 1 to L]   // create L worker processes
3:    Let prevAssign be an M×N matrix representing the topic for document m and word w
4:    Initialize prevAssign with random numbers in [1, K]
5:    Increment topic assignments from prevAssign in γ and η
6:    barrier synchronization   // the lock step
7:    repeat until convergence
8:      first ← (p − 1) · M/L + 1; last ← p · M/L
9:      for m = first to last
10:       for w = 1 to N
11:         preTopic ← prevAssign[m][w]
12:         if Ψ[m][w] > 0
13:           η[m][preTopic] ← η[m][preTopic] − 1
14:           P(sync); γ[preTopic][w] ← γ[preTopic][w] − 1; V(sync)
15:           newTopic ← topic assignment to (m, w) using CGS
16:           η[m][newTopic] ← η[m][newTopic] + 1
17:           P(sync); γ[newTopic][w] ← γ[newTopic][w] + 1; V(sync)
18:           prevAssign[m][w] ← newTopic
19:    barrier synchronization   // the lock step
20: end process
21: Calculate θ and φ as in Eqs. (1) and (2) using γ and η

Note that the algorithm selects the next topic assignment using the CGS sampling process (line 15) and increments the counts of the topic assigned to both documents and words (lines 16 and 17). At the end of each iteration, the pLDAWorker processes are synchronized by the barrier synchronization mechanism again to allow all pending updates to occur (line 19). In this way, updates to the common counts of topic assignments to words are continuous, and the results will be more accurate than those from previous implementations. Lastly, θ and φ are calculated as in Eqs. (1) and (2) by summing across topic counts in documents and topic counts in words, using η and γ, respectively.

Significant Terminologies

The complete joint probability of the LDA model can be calculated as in Eq. (3) [1]:

P(W, Z, θ, φ; α, β) = ∏_{i=1}^{K} P(φ_i; β) · ∏_{j=1}^{M} [ P(θ_j; α) ∏_{t=1}^{N} [ P(Z_{j,t} | θ_j) P(W_{j,t} | φ_{Z_{j,t}}) ] ]    (3)

where W is a set of word assignments to documents, Z is a set of topic assignments to documents and words, M is the number of documents, N is the number of words, and K is the number of topics. The vectors α and β are Dirichlet parameters, and the semicolon indicates that φ and θ are dependent upon them for the calculation.

Due to the complexity of the model, it is not feasible to directly calculate the exact distributions; instead, we first integrate out θ and φ and use properties of the Dirichlet and multinomial distributions to reduce Eq. (3) to Eq. (4):

P(Z_{m,n} = k | Z^{−(m,n)}, W; α, β) ∝ (n_{m,k}^{−(m,n)} + α_k) · (n_{k,n}^{−(m,n)} + β_n) / Σ_{r=1}^{N} (n_{k,r}^{−(m,n)} + β_r)    (4)

Eq. (4) serves as the starting point for Gibbs sampling, where n represents counts of topics assigned to documents and words. Gibbs sampling can be used to estimate a distribution using samples from that distribution [14]. The sampling process is repeated until it converges, when the samples represent the underlying distribution. In CGS, this is performed by removing the measurement of the current value being sampled and making a random draw based on all other stationary values. The notation −(m, n) indicates that the sample for document m and word n has been subtracted from the current sample set.

The joint probability represents that of the current topic and the set of word assignments to documents. With Eq. (4), derived by marginalizing out θ and φ from Eq. (3), the joint probability can be represented as in Eq. (5) [1]:

P(z|d) P(w|z) = [P(d|z) P(z) / P(d)] · P(w|z)    (5)

An assumption used in deriving the LDA formulation is that of conditional independence between documents and words over topics; thus we rewrite Eq. (5) as in (6):

P(z|d) P(w|z) = P(w,d|z) P(z) / P(d)    (6)

By the definition of conditional independence, we further rewrite Eq. (6) into (7):

P(z|d) P(w|z) = [P(w,d,z) / P(z)] · [P(z) / P(d)] = P(w,d,z) / P(d) = P(w,z|d)    (7)

Thus, the relationship described in Eq. (8) must hold:

P(Z_{m,n} = k | Z^{−(m,n)}, W) = P(w,z|d) = P(z|d) P(w|z)    (8)

Definition 1: A significant terminology of a corpus is a keyword that represents a major topic of the documents contained in the corpus. A significant terminology, which represents a new metric for the probability of major topics, can be derived from the multiplication of matrices θ and φ.

Explanations: The outputs of the LDA process using CGS, i.e., the matrices θ and φ, can be calculated in a single iteration using Eqs. (1) and (2) across documents and topics, or topics and words. Therefore, the multiplication of θ and φ can be represented by P(z|d) P(w|z). By Bayes' rule, θ can be represented equivalently as in Eq. (9):


P(z|d) = P(d|z) P(z) / P(d)    (9)

By performing standard matrix multiplication of θ and φ, each entry is multiplied and summed across topics, i.e., the matching inner dimension between the matrices. So the operation proceeds as in Eq. (10):

Σ_{k=1}^{K} [P(d|z_k) P(z_k) / P(d)] · P(w|z_k) = Σ_{k=1}^{K} P(d,w|z_k) P(z_k) / P(d)    (10)

where k is the topic index. Eq. (10) holds because of the conditional independence of words and documents over topics. By the definition of conditional independence and the iterations over k, we can derive the result as in Eq. (11):

Σ_{k=1}^{K} P(d,w,z_k) P(z_k) / [P(z_k) P(d)] = Σ_{k=1}^{K} P(d,w,z_k) / P(d) = P(d,w) / P(d) = P(w|d)    (11)

We interpret the conditional probability P(w|d) as the probability of word significance given documents. It is the probability of words according to hidden topics which are prominent in their respective documents.

System of Popularity Equations

The significant terminology clusters are defined as in Eq. (12). According to Definition 1, the multiplication of θ and φ yields the significant terminologies. Once the results of pLDA have been clustered using k-means, we have groups of documents representing similar underlying topic distributions. Each theta cluster is multiplied with φ from that batch of pLDA using standard matrix multiplication. This operation iterates over topics, consistent with the procedure outlined in Eq. (10):

θ_i × φ = [ θ_{1,1} … θ_{1,K} ; ⋮ ; θ_{M_i,1} … θ_{M_i,K} ] × [ φ_{1,1} … φ_{1,V} ; ⋮ ; φ_{K,1} … φ_{K,V} ],  i = 1, …, B    (12)

where θ_i is Theta Cluster i of the B clusters, the inner dimension K runs over topics, and V is the vocabulary size, mapping topics to words.

Once the significant terminologies have been calculated and the documents have been clustered, we can derive patterns in the text. For example, by reviewing the posts for musicians, we may find that they tend to fall broadly into three categories: descriptions of live shows, descriptions and recommendations of new albums, and discussion of streaming services on which the artists appear. The clusters are readily identified by keywords occurring throughout the clustered documents according to word significances. The probabilities of word significances for the various words in a cluster are summed across the set of documents to determine the greatest probabilities. The top occurring words, as measured by word significance, are the significant terminologies for the cluster.

Figure 3 shows an example of methods for assigning topics to comments by significant terminologies in the domain of the music industry. As shown in the figure, a significant terminology that appears consistently for live shows is "show". The word "album" appears in the album context. The words "spotify", "video", and "bandcamp" appear in the streaming category.

[Figure 3: Classification by occurrence of significant terminologies. A comment containing "record", "ep", "album", or "lp" is labeled ALB (album); otherwise, one containing "show" is labeled SHO (show); otherwise, one containing "spotify", "video", "bandcamp", or "iTunes" is labeled STR (streaming); otherwise NUL (null).]

Using this approach, the clusters can be identified readily by the occurrence or non-occurrence of the significant terminologies. As such, it greatly simplifies the classification process, and hundreds of comments may fall into one category or another quickly and accurately by the inclusion or exclusion of the significant terminologies.

Furthermore, Stanford CoreNLP can be used to classify sentiment on a positive-negative spectrum ranging between 0 and 4, with 0 being the most negative and 4 being the most positive. The identified comments for artists are classified by sentiment, and the results are summed across musicians, cluster, and the identified topic, as in Eq. (13):

streaming_i = Σ_j SentScore(Comment_{i,j}^{streaming})
album_i     = Σ_j SentScore(Comment_{i,j}^{album})        (13)
shows_i     = Σ_j SentScore(Comment_{i,j}^{shows})

where the subscript i indicates that this is the i-th artist. The superscript over the comment indicates that the comment has been labeled as that topic. The subscripts i and j under the comment indicate that this is the j-th comment belonging to the i-th artist, labeled by the superscript. The method SentScore is a Stanford CoreNLP process that returns the corresponding sentiment score value. Once the sentiment scores are summed, they are averaged by the number of comments attributed to that topic. This is to provide a relative value, as a large number of comments may artificially inflate the value, while an average better represents the overall score.

As part of the data collection process, the number of followers of each artist posted on their social media pages is retrieved and, with the sentiment scores, assembled in a system of linear equations as shown in Eq. (14):




ArtistPopularity_i = α_k · streaming_i + α_l · album_i + α_m · shows_i
        ⋮                                                               (14)
ArtistPopularity_j = α_k · streaming_j + α_l · album_j + α_m · shows_j

This is a linear system with an equation for each artist i through j. The artist popularity is defined as the number of social media followers (e.g., Facebook followers) in our case study. The variables in Eq. (14), such as streaming_i, album_i, and shows_i, are the independent variables. We use a process like least squares to solve for the coefficient vector α, with one coefficient for each independent variable. Weighted least squares, with the weights being the number of comments for each artist, may also be used to solve this system and could be a better option. Note that regular least squares is not a suitable choice, because the error occurring between musicians in each category may vary according to the number of comments, as LDA tends to be more accurate with a larger amount of data.

Case Study

In order to draw conclusions about the music industry, we collected comments from artists featured on WUMB, a radio station of the University of Massachusetts Boston that broadcasts an Americana/Blues/Roots/Folk mix, as well as from a list of Boston, Massachusetts local artists featured on thedelimagazine.com, a daily updated website covering North American music scenes through dedicated separate blogs. Following the procedure outlined above, we classified the comments, produced the significant terminologies, and derived the topic significance with regard to artist popularity. Amongst the different types of music, different factors appear to contribute to artist success. Some bands are known more for live performances and tend to promote and discuss these on their Facebook pages. Other bands are usually promoting a new album when they post to online social media.

Artist Cluster Level Analysis

Figure 4 shows artist clusters from both the WUMB and thedelimagazine.com lists. Different types of artists have higher correlations with one topic or another. The WUMB artist cluster representing blues and folk artists has a higher correlation with the streaming services and less correlation with album. This indicates that these artists are not using online media to promote their albums sold in record stores to the same degree that they are promoting streaming services, which provide online access to their music. Similarly, the thedelimagazine.com artist group classified as hard rock, indie follows the same pattern. The WUMB artist cluster folk, singer-songwriter, however, does not have any correlation with streaming. These artists mostly promote new albums in the conventional fashion. In addition, the thedelimagazine.com artist cluster rock, indie mostly promotes shows but has no mention of streaming.

Moving forward to analyze a particular type of music, in Fig. 5 we show two artist clusters with topic significance which are both described as folk by Last.fm and featured on the WUMB and thedelimagazine.com lists. The local folk scene found on thedelimagazine.com cites online sites more frequently and has a higher streaming coefficient. They are more likely to mention singles published on iTunes, their channel on Spotify, or a feature on bandcamp, a site which promotes musical artists and has been gaining popularity in recent years. To promote popularity, it is worthwhile for artists at WUMB to fall in this category, taking advantage of the streaming services via online social media. This is evidenced by the high coefficients for the WUMB blues and folk cluster shown in Fig. 4, but with an extremely low streaming coefficient for folk artists from WUMB, as shown in Fig. 5.

[Figure 4: Artist cluster level topic significance by popularity]

[Figure 5: Artist clusters with topic significance for folk music]

Artist Level Analysis

Table 1 shows three artists drawn from three different artist clusters. Artist X belongs to the WUMB folk cluster, as shown in Fig. 5, and artists Y and Z belong to the WUMB artist cluster "blues, folk" and the thedelimagazine.com artist cluster "hard rock, indie", respectively, both shown in Fig. 4. With a comparable number of comments, however, artist X is less than half as popular as artist Y. Similarly, artist Z is more than twice as popular as artist Y. This is largely due to their streaming scores, which are the cumulative sentiment scores for artists in the streaming category. For example, in the case of artists Y and Z, the streaming scores increase at a rate greater than double for artist Z, despite there being fewer than double the comments between artists Y and Z.

Table 1: Comparison of artists from different clusters
Artist ID | #Followers | Streaming Score | #Comments
Artist X  |            |                 |
Artist Y  |            |                 |
Artist Z  |            |                 |
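The system of Eq. (14) can be solved in the weighted least-squares sense via the normal equations XᵀWXα = XᵀWy. A self-contained sketch in pure Python; the design matrix X (rows of streaming, album, and shows scores), the follower counts y, and the comment-count weights w below are hypothetical stand-ins for the quantities described above:

```python
def solve_linear(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small square system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[col][col]:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def weighted_least_squares(X, y, w):
    # Normal equations X^T W X a = X^T W y with W = diag(w); the weights
    # here would be each artist's comment count, as suggested in the text.
    p = len(X[0])
    XtWX = [[sum(w[i] * X[i][a] * X[i][b] for i in range(len(X)))
             for b in range(p)] for a in range(p)]
    XtWy = [sum(w[i] * X[i][a] * y[i] for i in range(len(X)))
            for a in range(p)]
    return solve_linear(XtWX, XtWy)
```

With rows [streaming_i, album_i, shows_i] and y the follower counts, the returned vector is the coefficient vector α of Eq. (14), interpreted as topic significance for artist popularity.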


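Algorithm 1's worker loop, described above, can be sketched with Python threads: a Barrier provides the lock-step synchronization and a semaphore guards the shared topic-word counts γ (here gamma). This is a simplified, hypothetical rendering, using threads instead of processes, a fixed iteration count instead of a convergence test, and an inline CGS draw based on Eq. (4):

```python
import random
import threading

K, ALPHA, BETA = 2, 0.1, 0.1   # topic count and Dirichlet hyperparameters

def plda(docs, n_words, n_workers=2, iters=20, seed=0):
    # docs[m][w] = count of word w in document m (the matrix Psi).
    M = len(docs)
    rng = random.Random(seed)
    prev = [[rng.randrange(K) for _ in range(n_words)] for _ in range(M)]
    gamma = [[0] * n_words for _ in range(K)]   # topic-word counts (shared)
    eta = [[0] * K for _ in range(M)]           # document-topic counts
    for m in range(M):
        for w in range(n_words):
            if docs[m][w] > 0:
                gamma[prev[m][w]][w] += 1
                eta[m][prev[m][w]] += 1
    sync = threading.Semaphore(1)           # guards writes to gamma
    barrier = threading.Barrier(n_workers)  # the lock step

    def worker(first, last):
        wrng = random.Random(seed + 1 + first)
        for _ in range(iters):
            for m in range(first, last):
                for w in range(n_words):
                    if docs[m][w] == 0:
                        continue
                    old = prev[m][w]
                    eta[m][old] -= 1
                    with sync:
                        gamma[old][w] -= 1
                    # Eq. (4): unnormalized conditional for each topic
                    weights = [(eta[m][k] + ALPHA) * (gamma[k][w] + BETA)
                               / (sum(gamma[k]) + BETA * n_words)
                               for k in range(K)]
                    new = wrng.choices(range(K), weights=weights)[0]
                    eta[m][new] += 1
                    with sync:
                        gamma[new][w] += 1
                    prev[m][w] = new
            barrier.wait()   # end-of-iteration synchronization

    span = M // n_workers    # assume M divisible by the worker count
    threads = [threading.Thread(target=worker,
                                args=(p * span, (p + 1) * span))
               for p in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return eta, gamma
```

Each token with a nonzero count always holds exactly one topic assignment, so the totals in eta and gamma are conserved across iterations regardless of thread interleaving, the property the semaphore protects.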
Efficiency Analysis

Figure 6 shows the running time of both the sequential version of the LDA, implemented via CGS, and the pLDA, implemented via CGS using parallel threads, with increasing numbers of iterations. Compared with the sequential CGS, not only does the pLDA consistently run faster, but its running time does not increase linearly; it increases at a decreasing rate. This is because the overhead time for initialization, partitioning the dataset, and synchronization is constant. Thus, as more and more iterations go on, the impact of the overhead time becomes less and less significant.

[Figure 6: The running time of parallel and sequential LDA]

Conclusions and future work

Our methodology of deriving significant terminologies allows LDA to be applied to the task of topic categorization quickly and accurately. Our approach is more automated than traditional usages of the LDA approach, which require manual interpretation of the document clusters, as opposed to the use of significant terminologies, which tend to be more reflective of the underlying topics. As a potential additional step to our approach, the clusters of significant terminologies can be well identified using a supervised approach like decision trees. The inclusion of significant terminologies lends itself well to a tree structure and may make the process of identifying clusters more accurate.

Using our approach, we have found that in the folk music scene and elsewhere, online streaming services are highly relevant to artist popularity. These artists are turning to such new venues instead of the conventional approach of promoting albums through record labels. The results are particularly relevant to the WUMB radio station, as their artists are mostly categorized as folk music, where we observed how they use online sources for promotion.

For future work on the promotion of cultural products, we would like to develop a more deterministic approach to LDA which boasts the same efficiency as stochastic approaches like CGS. We also plan to explore social networks using graph theory to better understand how demographics may impact artist popularity.

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, Vol. 3, January 2003, pp. 993-1022.
[2] Z. Jurui and R. Liu, "Popularity of Digital Products in Online Social Tagging Systems", Journal of Brand Management.
[3] K. Goh, C. Heng, and Z. Lin, "Social Media Brand Community and Consumer Behavior: Quantifying the Relative Impact of User- and Marketer-Generated Content", Information Systems Research.
[4] B. Kim, S. Kim, and C. Y. Heo, "Analysis of Satisfiers and Dissatisfiers in Online Hotel Reviews on Social Media", International Journal of Contemporary Hospitality Management.
[5] W. He, S. Zha, and L. Li, "Social Media Competitive Analysis and Text Mining: a Case Study in the Pizza Industry", International Journal of Information Management.
[6] S. Qiang, Y. Wang, and Y. Jin, "A Local-Global LDA Model for Discovering Geographical Topics from Social Media", Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, Beijing, China, July 2017.
[7] S. Kinoshita, T. Ogawa, and M. Haseyama, "LDA-Based Music Recommendation with CF-Based Similar User Selection", Proceedings of the IEEE 4th Global Conference on Consumer Electronics, Osaka City, Japan, October 2015.
[8] Y. Xu and Z. Xianli, "A LDA Model-Based Text Mining Method to Recommend Reviewer for Proposal of Research Project Selection", Proceedings of the IEEE 13th International Conference on Service Systems and Service Management (ICSSSM), Piscataway, NJ, USA.
[9] M. Wu, Z. C. Dong, L. Weiyao, and W. Q. Qiang, "Text Topic Mining Based on LDA and Co-Occurrence Theory", Proceedings of the 2012 7th International Conference on Computer Science & Education, Melbourne, Australia, July 2012.
[10] D. Newman, A. Asuncion, P. Smyth, and M. Welling, "Distributed Inference for Latent Dirichlet Allocation", Advances in Neural Information Processing Systems, 2007.
[11] Y. Wang, H. Bai, M. Stanton, W. Chen, and E. Y. Chang, "PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications", Proceedings of the International Conference on Algorithmic Applications in Management (ICAAM), Berlin, Heidelberg, June 2009.
[12] Z. Liu, Y. Zhang, and E. Y. Chang, "PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing", ACM Transactions on Intelligent Systems and Technology (TIST), Vol. 2, No. 3, April 2011.
[13] B. Gabriel, A. Topriceanu, and M. Udrescu, "MuSeNet: Natural Patterns in the Music Artists Industry", Proceedings of the IEEE 9th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, Romania, May 2014.
[14] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling, "Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation", Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, August 2008.
[15] R. de Groof and H. Xu, "Automatic Topic Discovery of Online Hospital Reviews Using an Improved LDA with Variational Gibbs Sampling", Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), Boston, MA, USA, December 2017.


Multi-label Classification of Single and Clustered


Cervical Cells Using Deep Convolutional Networks
Melanie Kwon Mohammed Kuko Vanessa Martin, D.O. Tae Hun Kim, M.D.
Computer Science Department Computer Science Department Department of Pathology Department of Pathology
California State University, Los California State University, Los Los Angeles County + University Los Angeles County + University
Angeles Angeles of Southern California Medical of Southern California Medical
Los Angeles, California, USA Los Angeles, California, USA Center Center
mkwon6@calstatela.edu mkuko@calstatela.edu Los Angeles, California Los Angeles, California
vmartin2@dhs.lacounty.gov tkim4@dhs.lacounty.gov

Sue Ellen Martin, M.D. Ph.D. Mohammad Pourhomayoun,


Department of Pathology Ph.D. Member, IEEE
Los Angeles County + University Computer Science Department
of Southern California Medical California State University, Los
Center Angeles
Los Angeles, California Los Angeles, California, USA
sue.martin@med.usc.edu mpourho@calstatela.edu

Abstract— Cytology-based screening through the Papanicolaou test has dramatically decreased the incidence of cervical cancer. Convolutional neural networks (CNN) have been utilized to classify cancerous cervical cytology cells but primarily focused on pre-processed nuclear details for a single cell and binary classification of normal versus abnormal cells. In this study, we developed a novel system for multiple label classification with a focus on both nucleus and cytoplasm of single cells and cell clusters. In this retrospective study, we digitalized cervical cytology slides from 104 patients. Based upon the Bethesda system, the established criteria for diagnosing cervical cytology, cells of interest were categorized. With 10-fold cross validation, our CNN algorithm demonstrated 84.5% overall accuracy, 79.1% sensitivity, and 89.5% specificity for normal versus abnormal. For 3-level classification of normal, low-grade, and high-grade, the CNN demonstrated 76.1% overall accuracy. Results show promise on the utility of CNNs to learn cervical cytology.

Keywords—cervical cytology; cervical cancer; pap smear; Bethesda system; deep learning; convolutional neural network; digital pathology; whole-slide imaging

I. INTRODUCTION

Cervical cancer was one of the most common causes of cancer-related deaths for women in the United States [1]. With the introduction of the Papanicolaou test (Pap test or smear) as a screening method for pre-cancer and cervical cancer, the incidence of cervical cancer has dramatically decreased [1]. The Pap test involves taking cells from the cervix and examining them under the microscope.

The Pap test or liquid-based cytology preparations remain effective, widely used methods for early detection of pre-cancer and cervical cancer. However, the screening process can be time-consuming due to the visual examination of hundreds to thousands of cells by a trained cytotechnologist and pathologist. Also, the FDA enforces workload limits to restrict the number of slides that can be screened (100 slides per individual per 24 hours). Despite the Bethesda System, an established set of criteria and guidelines on detecting cell abnormalities, considerable subjectivity and inter-observer variability still exist [2, 3].

To address these difficulties, research has been directed towards developing computer-aided reading systems, i.e., convolutional neural networks (CNN) [4-6]. By designing an automated method for detecting cancerous cells, medical professionals would be able to increase workload throughput while also reducing screening subjectivity. With the recent FDA approval of digital pathology for primary surgical pathology diagnostics [7], widespread use of automated systems in the cytopathology workflow is a possibility in the future.

II. RELATED WORKS

A. Non-Deep Learning Methods
Prior to the popularization of deep learning, studies for automated analysis of pap smears were focused around cell segmentation – splitting the cytoplasm and nucleus – and extraction of morphological and texture features from cells [8, 9]. Although progress has been made on cell segmentation of individual cells, such methods are limited by the wide variability in cell shape and appearance [9]. The existence of cell clusters or overlapping cells hinders accurate segmentation.

B. Deep Learning Methods
In recent years, various studies have demonstrated promising results on visual recognition tasks relating to pathology – automated grading of gliomas [10], detection of invasive ductal


carcinoma [11], detecting cancer metastases [12], mitosis III. DATA COLLECTION
detection in breast cancer [13], etc. Instead of using handcrafted Our study data involves the retrospective review of cytology
features to represent the complexity of learning histology or specimens from 104 LAC-USC Medical Center patients that
cytology, feature engineering is done automatically by the CNN underwent the Pap test. This data was collected in collaboration
to learn multiple hierarchical representations of the data. with the Department of Pathology at LAC-USC Medical Center.
Motivated by the successes of applying deep learning methods In this USC IRB-approved study, the cells of interest were
on other pathology medical images, several researchers captured via whole-slide scanning equipment without patient
conducted studies on using a deep learning approach to classify information or identifiers.
abnormal/dysplastic cells labeled according to the Bethesda
system. The specimens were prepared using ThinPrep® liquid-based
cytology preparation method and digitally captured at 40x
Bora et al. trained a CNN from scratch on AlexNet followed magnification objective, using both the Leica SCN400F and the
by a unsupervised technique for feature selection/reduction and Olympus VS120 whole slide scanners. The Leica scanner was
developed both a binary normal/abnormal classifier and 3-class set to output .scn format and the Olympus scanner was set to
classifier consisting of negative for intraepithelial lesion or output .tif format. Configuring the image output setting was
malignancy (NILM), low-grade intraepithelial lesion (LSIL), and high-grade intraepithelial lesion (HSIL) [4]. Hyeon et al. used a pre-trained CNN (VGGNet-16 trained on ImageNet) as a feature extractor and developed a binary normal/abnormal classifier [5]. Zhang et al. similarly developed a binary normal/abnormal classifier by transferring features from a model pretrained on ImageNet and fine-tuning it on a pap smear dataset [6]. All of these prior studies focused primarily on discrete, individual cells and utilized the Herlev dataset, a publicly available cytology image collection [14]. However, the Herlev dataset consists of many images that lack cytoplasmic borders (Figure 1). Thus, assessment of nuclear size in relation to the amount of cytoplasm (the nuclear to cytoplasmic ratio) cannot be performed. The nuclear to cytoplasmic ratio is important to distinguish low-grade versus high-grade intraepithelial lesions.

Fig. 1. HSIL examples from the publicly available Herlev dataset. Images are focused on nuclear detail without a clear external cytoplasm boundary.

Acquiring quality labeled training data from medical images is difficult for many reasons. Besides adhering to patient information privacy laws, data collection is a manually intensive process. It requires capturing slide specimens with a specialized camera or whole-slide scanning equipment, along with the dedicated attention of highly trained medical professionals to identify and annotate regions of interest. Compared to the field of radiology, where the majority of image data is already digitized, most pathology laboratories do not digitize specimens as part of their routine workflow.

As part of our data collection process, we captured a broader and more representative image dataset with a focus on acquiring the nuclear to cytoplasmic ratio for all images, clear chromatin texture, cell clusters, and single cells.

It was important to ensure read compatibility with OpenSlide (an open-source library for reading proprietary digital pathology image formats and annotation software) [15]. At the time of this study, the native Olympus .vsi output format was not supported by OpenSlide.

Fig. 2. Whole slide scan thumbnail image of Leica .scn, read with OpenSlide.

After capturing the whole-slide images, areas of interest were digitally annotated by two pathologists (THK, VM) according to the Bethesda system [16]. We used the annotation interface of the open-source digital pathology software QuPath to navigate the whole slide images and draw labeled bounding boxes around cells of interest [17].

A. The Bethesda System

Fig. 3. Acquired images of cervical cells by Bethesda system, not to scale, from left to right. (First row: NILM-Squamous, NILM-Endocervical Cell, ASCUS. Second row: LSIL, ASC-H, HSIL)

Cells of interest were labeled according to the Bethesda system, a consensus guideline for reporting cervical diagnoses. The Bethesda system utilizes a two-tier system of low-grade and high-grade, with additional categories for cases that are atypical but not sufficient for a low-grade or high-grade intraepithelial lesion. The prior classification scheme of cervical intraepithelial neoplasia (CIN) initially utilized a three-tier system based upon the nuclear to cytoplasmic ratio but was subsequently changed into a two-tier system like the Bethesda system.
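The nuclear to cytoplasmic (N/C) ratio referenced throughout is, at its core, a simple area ratio between a nucleus segmentation and the surrounding cytoplasm. A minimal sketch, assuming hypothetical binary masks (the masks and pixel counts below are illustrative only, not from the paper's pipeline):

```python
# Toy nuclear-to-cytoplasmic (N/C) ratio from 2-D binary masks.
# A high N/C ratio (nucleus occupying most of the cell) suggests high-grade
# dysplasia; a low ratio suggests low-grade or normal squamous cells.

def nc_ratio(nucleus_mask, cell_mask):
    """Return nuclear area divided by cytoplasmic area."""
    nuclear_area = sum(sum(row) for row in nucleus_mask)
    cell_area = sum(sum(row) for row in cell_mask)
    cytoplasm_area = cell_area - nuclear_area
    if cytoplasm_area <= 0:
        raise ValueError("cell mask must be larger than nucleus mask")
    return nuclear_area / cytoplasm_area

# Hypothetical 6x6 masks: the whole cell vs. its nucleus.
cell = [[1] * 6 for _ in range(6)]          # 36 cell pixels
nucleus = [[0] * 6 for _ in range(6)]
for r in range(2, 4):
    for c in range(2, 4):
        nucleus[r][c] = 1                   # 4 nucleus pixels

print(round(nc_ratio(nucleus, cell), 3))    # 4 / (36 - 4) = 0.125
```

This is exactly the quantity the Herlev images cannot provide when the cytoplasmic border is missing, since `cell_mask` is then undefined.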

ISBN: 1-60132-481-2, CSREA Press ©


12 Int'l Conf. Data Science | ICDATA'18 |

The equivalences in medical terminology are shown in Table I. Additional categories of atypical squamous cells of undetermined significance (ASCUS) and atypical squamous cells cannot exclude a HSIL (ASC-H) were also included in the dataset.

TABLE I. CERVICAL CYTOLOGY TERMINOLOGY

Bethesda System | Dysplasia system | Original CIN system | Modified CIN system
NILM (Negative for intraepithelial lesion or malignancy) | Normal | Normal | Normal
ASCUS (Atypical squamous cells of undetermined significance) | Atypia suggestive but not sufficient for LSIL | - | -
LSIL (Low-grade squamous intraepithelial lesion) | Mild dysplasia | CIN I | Low-grade CIN
ASC-H (Atypical squamous cells cannot exclude a HSIL) | Atypia suggestive but not sufficient for HSIL | - | -
HSIL (High-grade squamous intraepithelial lesion) | Moderate dysplasia / Severe dysplasia | CIN II / CIN III | High-grade CIN

B. Dataset
Table II shows the distribution of the study's collected annotated cytology regions according to Bethesda category.

TABLE II. ORIGINAL DATASET DISTRIBUTION

Count | Bethesda System | Merged 2 Category | Merged 3 Category
11010 | NILM - Squamous | Normal | NILM
314 | NILM - Endocervical Cell | Normal | NILM
5 | NILM - Endometrial | excluded | excluded
559 | ASCUS | Abnormal | Low Grade
719 | LSIL | Abnormal | Low Grade
282 | ASC-H | Abnormal | High Grade
788 | HSIL | Abnormal | High Grade

Normal or non-dysplastic categories (squamous and endocervical) were grouped together. It is important to note that even though these cells exist under the category of NILM, they are visually distinct from each other (Fig. 3). NILM-Squamous cases were downsampled by random selection to equalize the distribution counts. Due to the relatively rare numbers of endometrial cells, these were excluded from the final dataset. Low-grade consisted of ASCUS and LSIL cases, and high-grade consisted of ASC-H and HSIL cases. The final distributions with the respective 2 and 3 classifications are shown in Table III.

TABLE III. MERGED DATASET DISTRIBUTION

Count | 2 Classifications
2298 | Normal
2298 | Abnormal

Count | 3 Classifications
1332 | NILM
1279 | Low Grade (ASCUS, LSIL)
1066 | High Grade (ASC-H, HSIL)

In regard to computer vision tasks for new categories, it is advised to utilize either feature extraction or fine-tuning from a pre-trained network (usually trained on natural images such as ImageNet). Because the new dataset is relatively small and the nature of cytology image data is very different from ImageNet, it is advised to train an SVM classifier from activations earlier in the network, where lower-level learned features exist (detecting edges, blobs of color, etc.) [18]. However, even for small datasets it is possible to train a smaller model from scratch by leveraging data augmentation to generate additional training data and reduce overfitting [19].

IV. METHODS
A. Data Preprocessing and Augmentation
To prepare image files in an appropriate format for the network, we decode the RGB pixel values into floating-point tensors and rescale these values (originally in the range 0 to 255) to the range [0, 1] by simply dividing each channel by 255. For image preprocessing, we utilized the ImageDataGenerator class of the Keras deep learning library to read files from a directory and create batches of pre-processed tensors. Additionally, for data augmentation we applied random rotation, width/height shifting, and horizontal/vertical flipping. Because nucleus size and shape relative to cytoplasm are important in distinguishing low versus high grade, we avoided any shearing or zooming transformations to prevent distortion.

B. Model Architectures and Hardware
Table IV provides an overview of our model architecture, which consists of an alternating series of convolutional and pooling layers.
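The preprocessing and augmentation policy of Section IV.A (rescale by 1/255; random rotations and flips; no shear or zoom, to preserve the nuclear to cytoplasmic ratio) can be sketched in plain Python on a toy single-channel image. The paper itself used Keras's ImageDataGenerator; this standalone version is only illustrative, and restricts rotation to 90-degree steps for simplicity:

```python
import random

def rescale(img):
    """Map 0-255 pixel values to [0, 1], as done before training."""
    return [[px / 255 for px in row] for row in img]

def rot90(img):
    """Rotate a 2-D image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def augment(img, rng):
    """Random flip/rotation only: shearing and zooming are deliberately
    avoided so nucleus size relative to cytoplasm is not distorted."""
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]   # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1]                    # vertical flip
    for _ in range(rng.randrange(4)):      # rotation in 90-degree steps
        img = rot90(img)
    return img

rng = random.Random(0)
image = [[0, 64], [128, 255]]              # hypothetical 2x2 image
batch = [augment(rescale(image), rng) for _ in range(4)]
print(batch[0])
```

Every augmented sample stays within [0, 1] and keeps its original spatial extent, which is the property the authors rely on when comparing nuclear and cytoplasmic areas.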


TABLE IV. CONVNET ARCHITECTURE FOR CLASSIFICATION

Layer Type | Filter size | Stride | Padding | Output Shape
input | - | - | - | (150, 150, 3)
Conv1 | 3x3 | 1 | valid | (148, 148, 32)
Pool1 | 2x2 | 2 | valid | (74, 74, 32)
Conv2 | 3x3 | 1 | valid | (72, 72, 64)
Pool2 | 2x2 | 2 | valid | (36, 36, 64)
Conv3 | 3x3 | 1 | valid | (34, 34, 128)
Pool3 | 2x2 | 2 | valid | (17, 17, 128)
Conv4 | 3x3 | 1 | valid | (15, 15, 128)
Pool4 | 2x2 | 2 | valid | (7, 7, 128)
Dropout | - | - | - | 6272
FC5 | - | - | - | 512
FC6 | - | - | - | 1 or 3*

* 1 for binary classification (Sigmoid), 3 for three-class classification (Softmax)

Fig. 4. CNN-A training/validation accuracy and loss for Fold 1.
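The output shapes in Table IV follow from standard valid-padding arithmetic: a 3x3 valid convolution shrinks each spatial side by 2, and each 2x2 pooling halves it, rounding down. A quick sanity check of the table:

```python
def conv_valid(size, k=3):
    """Spatial size after a k x k 'valid' convolution with stride 1."""
    return size - k + 1

def pool(size, p=2):
    """Spatial size after p x p max pooling."""
    return size // p

size, shapes = 150, []
for filters in (32, 64, 128, 128):         # Conv1-4 / Pool1-4 of Table IV
    size = conv_valid(size)
    shapes.append((size, size, filters))   # Conv output
    size = pool(size)
    shapes.append((size, size, filters))   # Pool output

print(shapes)        # (148,148,32), (74,74,32), ..., (7,7,128)
print(7 * 7 * 128)   # 6272 flattened units feeding FC5
```

The final (7, 7, 128) feature map flattens to the 6272 units listed for the Dropout row.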

Weight parameters from Conv1 to Pool4 were initialized with uniform random initialization. Hidden layers were configured to use the ReLU activation function. After Pool4, dropout at a rate of 0.5 is applied as a regularization technique. The last fully connected layer, FC6, is configured to use the Sigmoid activation function for binary normal/abnormal classification and the Softmax activation function for the classification task that determines the 3 categories: NILM, Low Grade, and High Grade.

We trained this architecture with the Keras library using the TensorFlow backend on an Nvidia GeForce GTX 1060 6GB GPU [20].

C. Training and Testing
After applying pre-processing/data augmentation to each training image using ImageDataGenerator and resizing to a (150, 150, 3) tensor shape input, we trained the model using binary cross-entropy as the loss function for binary classification. For the 3-level classification we trained the model using categorical cross-entropy as the loss function. For both binary and 3-level classification we used the RMSProp optimization algorithm with a learning rate of 0.0001. We trained for up to 130 epochs, saving weight checkpoints every 5 epochs. At the end of training, the model with the lowest validation loss was chosen to evaluate the test images. These hyperparameters were chosen as an initial configuration for training the CNN, and were not necessarily tested as the optimal choices for learning.

D. Evaluation
We employed 10-fold cross-validation to train the model and evaluated the average accuracy. Table V demonstrates results measured by the averaged accuracy, sensitivity, and specificity.

Accuracy = (True positives [TP] + True negatives [TN]) / (TP + False positives [FP] + False negatives [FN] + TN)
Sensitivity or true positive rate (TPR) = TP / (TP + FN)
Specificity or true negative rate (TNR) = TN / (FP + TN)

V. RESULTS
The binary classification of normal and abnormal demonstrated 84.5% overall accuracy, 79.1% sensitivity, and 89.5% specificity averaged over 10-fold cross-validation. The 3-level classification of NILM, low-grade, and high-grade demonstrated 76.1% overall accuracy (Table V).

TABLE V. RESULTS OF BINARY AND 3-LEVEL CLASSIFICATION

# of Classes | Cross Validation | Accuracy | Sensitivity | Specificity
2 | 10-fold | 84.5% | 79.1% | 89.5%
3 | 10-fold | 76.1% | - | -
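The evaluation metrics above reduce to simple ratios over the confusion counts. A self-contained sketch (the confusion counts below are hypothetical, chosen only to illustrate the formulas, and are not the paper's per-fold counts):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (TPR), and specificity (TNR) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (fp + tn)   # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical fold: 100 abnormal and 100 normal test cells.
acc, tpr, tnr = metrics(tp=79, fp=11, fn=21, tn=89)
print(acc, tpr, tnr)   # 0.84 0.79 0.89
```

In a 10-fold setup these three values would be computed once per held-out fold and then averaged, which is how the reported 84.5%/79.1%/89.5% figures were obtained.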


The normalized confusion matrix in Figure 5 demonstrates the averaged category accuracies and the relationships between network misclassifications.

Fig. 5. Normalized confusion matrix for 3-level classification, representing the rounded average of each fold's confusion matrix.

VI. DISCUSSION
Our methods demonstrated an overall accuracy of 84.5% for binary abnormal/normal classification and an accuracy of 76.1% for 3-label classification. It is important to note that prior work has focused primarily on nuclear characteristics and single cells, while we have included additional parameters such as the nuclear to cytoplasmic ratio and cell clusters, along with the more subjective Bethesda categories of ASCUS and ASC-H. These additional parameters provide a more accurate representation of the cells visualized by cytotechnologists and/or pathologists. Although nuclear size and characteristics are important for distinguishing normal versus abnormal cells, the nuclear to cytoplasmic ratio is critical to distinguish low-grade versus high-grade intraepithelial lesions. Low-grade dysplastic cells have a low nuclear to cytoplasmic ratio (i.e., abundant cytoplasm relative to nuclear size), while high-grade dysplastic cells have a high nuclear to cytoplasmic ratio (i.e., moderate to minimal amounts of cytoplasm compared to nuclear size).

The characteristics of a cell are best visualized singly. However, cells of interest are often clustered or overlapping, as they are normally cohesive in their native state. Thus, consideration of both single and clustered cells is required.

Fig. 6. Endocervical cell cluster (left) compared to HSIL (right).

The confusion matrix of Fig. 5 provides insight into what the CNN misclassifies. False positives mostly consisted of normal cells predicted to be either low-grade or high-grade intraepithelial lesions. A component of NILM consists of endocervical cells, which demonstrate a high nuclear to cytoplasmic ratio and are a known mimicker of high-grade intraepithelial lesions (Figure 6). Ideally, a screening method should have high sensitivity to detect all or most abnormalities. If implemented as a screening method, the CNN's bias towards false positives over false negatives is preferable, as a cytotechnologist and/or pathologist can verify the classification.

A. Limitations
As with any application of deep learning, the requirement for vast quantities of quality labeled training data hinders the feasibility of developing a robust system. Unlike natural images scraped from the internet, video, or other easily mined sources, pathology images cannot easily be collected and annotated without concerted coordination and guidance from medical professionals. Pathology images also differ in that, under the microscope, a cytotechnologist/pathologist can interactively adjust the focus to view multiple focal planes or z-stack levels when viewing cell clusters. As part of future applications, incorporating 2-3 different z-stack images of the same cells would provide another avenue of data augmentation.

For the three-level classification, ASCUS was grouped with LSIL for low-grade, and ASC-H was grouped with HSIL for high-grade, because they share similar morphologic features. Although cases of ASCUS and ASC-H may represent true dysplasia, a subset of these categories also includes non-dysplastic conditions like atrophic changes or reactive/inflammatory changes. To accurately portray the Bethesda System with a CNN, additional data and network optimization are required to classify based upon 5 labels (NILM, ASCUS, LSIL, ASC-H, and HSIL).

Additionally, the final diagnosis requires assimilating multiple data points that include not only the degree of cellular morphologic abnormality (dysplasia/atypia) but also the quantity of those changes. This component is not accounted for by our CNN trained on isolated cells and cell clusters. To address this issue, future efforts may include developing a method that assesses isolated cells and cell clusters in the context of the entire slide.

VII. CONCLUSION
In this study, we developed a system based on Convolutional Neural Networks (CNN) for multiple-label classification with a focus on both the nucleus and cytoplasm of single cells and cell clusters. Considering the effectiveness of deep learning methods applied to visual recognition tasks and recent studies' successes in applying them to pathology, our study examines the application of CNNs in the field of cervical cytology. Future directions may include further data collection, incorporation of z-stack layering for data augmentation, and comparing performance between custom models trained from scratch and transfer learning from pre-trained networks.


REFERENCES
[1] National Institutes of Health. Cervical Cancer. NIH Consensus Statement. 1996;14(1):1–38.
[2] A. E. Smith, M. E. Sherman, D. R. Scott, S. O. Tabbara, L. Dworkin, J. Olson, J. Thompson, C. Faser, J. Snell, M. Schiffman. Review of the Bethesda System atlas does not improve reproducibility or accuracy in the classification of atypical squamous cells of undetermined significance smears. Cancer. 2000 Aug 25;90(4):201–206.
[3] N. A. Young, S. Naryshkin, B. F. Atkinson, H. Ehya, P. K. Gupta, T. S. Kline, R. D. Luff. Interobserver variability of cervical smears with squamous-cell abnormalities: a Philadelphia study. Diagn Cytopathol. 1994 Dec;11(4):352–357.
[4] K. Bora, M. Chowdhury, L. B. Mahanta, M. K. Kundu, A. K. Das. Pap smear image classification using convolutional neural network. In Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2016), pages 55:1–55:8, New York, NY, USA, 2016. ACM.
[5] J. Hyeon, H. J. Choi, B. D. Lee, K. N. Lee. Diagnosing cervical cell images using pre-trained convolutional neural network as feature extractor. In 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 390–393, Feb 2017.
[6] L. Zhang, L. Lu, I. Nogues, R. Summers, S. Liu, J. Yao. DeepPap: Deep convolutional networks for cervical cell classification. IEEE Journal of Biomedical and Health Informatics, (99):1–1, 2017.
[7] S. Mukhopadhyay et al. Whole Slide Imaging Versus Microscopy for Primary Diagnosis in Surgical Pathology: A Multicenter Blinded Randomized Noninferiority Study of 1992 Cases (Pivotal Study). Am J Surg Pathol. 2018 Jan;42(1):39–52.
[8] M. E. Plissiti, C. Nikou. Combining shape, texture and intensity features for cell nuclei extraction in Pap smear images. Pattern Recognition Letters, 32(6):838–853, 2011.
[9] A. Genctav, S. Aksoy, S. Onder. Unsupervised segmentation and classification of cervical cell images. Pattern Recognit. 2012;45(12):4151–4168.
[10] M. G. Ertosun, D. L. Rubin. Automated grading of gliomas using deep learning in digital pathology images: a modular approach with ensemble of convolutional neural networks. AMIA Annual Symposium. 2015;1899–1908.
[11] A. Cruz-Roa, A. Basavanhally, F. Gonzalez, H. Gilmore, M. Feldman, S. Ganesan, N. Shih, J. Tomaszewski, A. Madabhushi. Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. SPIE Medical Imaging. 2014;(9041):904103.
[12] Y. Liu et al. Detecting cancer metastases on gigapixel pathology images. 2017; arXiv:1703.02442.
[13] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. Med Image Comput Comput Assist Interv. 2013;(8150) of Lect Notes Comput Sci. pp. 411–418.
[14] J. Jantzen, J. Norup, G. Dounias, B. Bjerregaard. Pap-smear benchmark data for pattern classification. Nature inspired Smart Information Systems. 2005; pp. 1–9.
[15] M. Satyanarayanan, A. Goode, B. Gilbert, J. Harkes, D. Jukic. OpenSlide: A vendor-neutral software foundation for digital pathology. J. Pathol. Inform. 2013;4:27.
[16] D. Solomon, R. Nayar. The Bethesda System for reporting cervical cytology: definitions, criteria, and explanatory notes. Springer Science & Business Media, 2004.
[17] P. Bankhead et al. QuPath: Open source software for digital pathology image analysis. Scientific Reports. 2017;7(1):16878.
[18] A. Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc., 2017.
[19] F. Chollet. Deep Learning with Python (1st ed.). Manning Publications Co., Greenwich, CT, USA, 2017.
[20] F. Chollet. Keras: Deep learning library for Theano and TensorFlow. https://github.com/fchollet/keras, 2015.


Credit Default Mining Using Combined Machine Learning and Heuristic Approach

Sheikh Rabiul Islam, Computer Science, Tennessee Technological University, Cookeville, U.S.A., sislam42@students.tntech.edu
William Eberle, Computer Science, Tennessee Technological University, Cookeville, U.S.A., weberle@tntech.edu
Sheikh Khaled Ghafoor, Computer Science, Tennessee Technological University, Cookeville, U.S.A., sghafoor@tntech.edu

Abstract—Predicting potential credit default accounts in advance is challenging. Traditional statistical techniques typically cannot handle large amounts of data and the dynamic nature of fraud and humans. To tackle this problem, recent research has focused on artificial and computational intelligence based approaches. In this work, we present and validate a heuristic approach to mine potential default accounts in advance, where a risk probability is precomputed from all previous data and the risk probability for recent transactions is computed as soon as they happen. Besides our heuristic approach, we also apply a recently proposed machine learning approach that has not previously been applied to our targeted dataset [15]. As a result, we find that these applied approaches outperform existing state-of-the-art approaches.

Keywords—offline, online, default, bankruptcy

I. INTRODUCTION
In general, we can refer to a customer's inability to pay, their default on a payment, or personal bankruptcy all as potential issues of non-payment. However, each of these scenarios is a result of different circumstances. Sometimes it is due to a sudden change in a person's income source caused by job loss, health issues, or an inability to work. Sometimes it is deliberate, for instance, when the customer knows that he/she is not solvent enough to use a credit card anymore, but still uses it until the card is stopped by the bank. In the latter case, it is a type of fraud, which is very difficult to predict and a big issue for creditors.

To address this issue, credit card companies try to predict potential default, or assess the risk probability, on a payment in advance. From the creditor's side, the earlier the potential default accounts are detected, the lower the losses [5]. For this reason, an effective approach for predicting a potential default account in advance is crucial for creditors if they want to take preventive actions. In addition, they could also investigate and help the customer by providing suggestions to avoid bankruptcy and minimize the loss.

Analyzing millions of transactions and making predictions based on them is time consuming, resource intensive, and sometimes error prone due to dynamic variables (e.g., balance limit, income, credit score, economic conditions, etc.). Thus, there is a need for optimal approaches that can deal with the above constraints. In our previous work [3], we proposed an approach that precomputes all previous data (offline data) and calculates a score. Subsequently, it waits for a new transaction (online data) to occur and calculates another score as soon as the transaction occurs. Finally, all scores are combined to make a decision. We used the term OLAP data for offline data and OLTP data for online data in our previous work [3]. The main limitations of the previous work were the use of a synthetic dataset and a lack of validation of the proposed model using a publicly available, real-world dataset. Online Analytical Processing (OLAP) systems typically use archived historical data over several years from a data warehouse to gather business intelligence for decision-making and forecasting. On the other hand, Online Transaction Processing (OLTP) systems only analyze records within a short window of recent activities, enough to successfully meet the requirement of current transactions within a reasonable response time [19][3].

Currently, a variety of machine learning approaches are used to detect fraud and predict payment defaults. Some of the more common techniques include K Nearest Neighbor, Support Vector Machine, Random Forest, Artificial Immune System, Meta-Learning, Ontology Graph, Genetic Algorithms, and Ensemble approaches. However, a potential approach that has not been used frequently in this area is Extremely Random Trees, or Extremely Randomized Trees (ET) [20]. This approach, introduced in 2006, is a tree-based ensemble method for supervised classification and regression problems. In ET, randomness goes further than the randomness in Random Forest. In Random Forest, the splitting attribute is determined by some criterion identifying the best attribute to split on at that level, whereas in ET the splitting attribute is chosen in an extremely random manner in terms of both variable index and splitting value. In the extreme case, the algorithm randomly picks a single attribute and cut-point at each node, which leads to totally randomized trees whose structures are independent of the target variable values in the learning sample [20]. Moreover, in ET the whole training set is used to train each tree, instead of using bagging to produce the training set as in Random Forest. As a result, ET gives better results than Random Forest for a particular set of problems. Besides accuracy, the main strengths of the ET algorithm are its computational efficiency and robustness [20]. While ET does reduce the variance at the expense of an increase in bias, we will use this algorithm as the foundation for our proposed approach.

The following sections discuss related research, followed by our proposed approach. We then present the data that will be used, and our experimental results. We then conclude with some observations and future work.


II. LITERATURE REVIEW
The research in [1][2][3][4][5] all concerns personal bankruptcy or credit card default-on-payment prediction and detection. In the work of [4], the authors worked on finding financial distress from four different summarized credit datasets. Bankruptcy prediction and credit scoring were the primary indicators of financial distress prediction. According to the authors, a single classifier is not good enough for a classification problem of this type, so they present an ensemble approach in which multiple classifiers are used on the same problem and the results from all classifiers are combined to obtain the final result and reduce Type I/II errors, which are crucial in the financial sector. For their classification ensemble, they use four approaches: a) majority voting, b) bagging, c) boosting, and d) stacking. They also introduced a new approach called Unanimous Voting (UV), where if any of the classifiers says "yes" then the final prediction is "yes", whereas in Majority Voting (MV) at least (n+1)/2 classifiers need to say "yes" to make the final prediction yes. In the end, they were able to reduce the Type II error but decreased the overall accuracy.

In the work of [5], the authors present a system to predict personal bankruptcy by mining credit card data. In their application, each original attribute is transformed either as: i) a binary [good behavior and bad behavior] categorical attribute, or ii) a multivalued ordinal [good behavior and graded bad behavior] attribute. Consequently, they obtain two types of sequences, i.e., binary sequences and ordinal sequences. They then use a clustering technique to discover useful patterns that can help them identify bad accounts from good accounts. Their system performs well; however, they only use a single source of data, whereas the bankruptcy prediction systems of credit bureaus use multiple data sources related to creditworthiness.

In the work of [1], the authors compared the accuracy of different data mining techniques for predicting credit card defaulters. The dataset used in this research is from the UCI machine learning repository and is based on Taiwan's credit card clients' default cases [15]. This dataset has 30,000 instances, of which 6,626 (22.1%) are default cases. There are 23 features in this dataset, including credit limit, gender, marital status, last 6 months' bills, last 6 months' payments, etc. The data are labeled with 0 (non-default) or 1 (default). From the experiment, based on the area ratio in the lift chart on the validation data, they ranked the algorithms as follows: artificial neural network, classification trees, naïve Bayesian classifiers, K-nearest neighbor classifiers, logistic regression, and discriminant analysis. In terms of accuracy, K-nearest neighbor demonstrated the best performance, with an accuracy of 82% on the training data and 84% on the validation or test data. To get an actual probability of "default" (rather than just a discrete binary result), they proposed a novel approach called the Sorting Smoothing Method (SSM).

In the work of [2], the authors use the same Taiwan dataset [15] as [1]. However, they applied a different set of algorithms and approaches. In this research, they proposed an application of online learning for a credit card default detection system that achieves real-time model tuning with minimal effort for computations. They mentioned that most of the available techniques in this area are based on offline machine learning techniques. Their work is the first in this area that is capable of updating a model based on new data in real time. On the other hand, traditional algorithms require retraining the model even if there is only some new data, and the size of the data affects the computation time, storage, and processing. For the purpose of real-time model updating, they use Online Sequential Extreme Learning Machine (OS-ELM) and Online Adaptive Boosting (Online AdaBoost) methods in their experiment. They compared the results from these two algorithms with basic ELM and AdaBoost in terms of training efficiency and testing accuracy. In Online AdaBoost, the weight for each weak learner and the weight for the new data are updated based on the error rate found in each iteration. The OS-ELM is based on basic ELM, which is formed from a single-layer feedforward network. Along with these algorithms, they also applied some other classic algorithms such as KNN, SVM, RF, and NB. Although KNN, SVM, and RF showed higher accuracy, their training time was more than 100 times that of the other algorithms. They found that RF exhibits great performance in terms of efficiency and accuracy (81.96%). In the end, both online ELM and Online AdaBoost maintain the accuracy level of the offline algorithms while significantly reducing the training time, with an improvement of 99 percent. They conclude that Online AdaBoost has the best computational efficiency and that offline or classic RF has the best predictive accuracy. In other words, Online AdaBoost balances computational accuracy and computational speed relatively better than offline or classic RF. They mentioned two future directions for this research: a) incorporating concept drift to deal with the change of new data distributions over time, which may affect the effectiveness of the online learning model, and b) sustaining the robustness of online learning for a dataset with missing records or noise. They also mention that other online learning techniques like Adaptive Bagging could be applied and compared in terms of speed, accuracy, stability, and robustness.

Besides credit card default prediction and detection, there is a great deal of work on different types of credit card fraud detection, including [6], [7], [8], [9], [10], [11], [12], [13], and [14], where credit card transaction fraud detection is emphasized and surveyed. Most transaction fraud is the direct result of stolen credit card information. Some of the techniques used for credit card transaction fraud detection are: Artificial Immune System, Meta-Learning, Ontology Graph, Genetic Algorithms, etc.

So, despite the plethora of research being done in the area of credit default/fraud detection, little has been reported that resolves the issue of detecting default/fraud early in the process. In this work, we will focus on detecting default accounts at a very early stage of credit analysis towards the discovery of a potential default on payment or even bankruptcy.

III. METHODOLOGY
We applied two different approaches to the dataset: one for comparing results with previous standard machine learning approaches, and the other for validating our proposed approach. The first approach is the application of different machine learning algorithms on the dataset. We call this standard


approach the Machine Learning Approach. The second approach is based on our previous work [3], where two tests are performed, what we call a standard test and a customer specific test, to mine potential defaulting accounts. We will call this second approach our Heuristic Approach.

The Heuristic Approach predicts the credit default risk in two steps. In the first step, we compute a credit default probability score from archived transaction history using appropriate machine learning algorithms. This score is stored in the database and continuously updated as new transactions occur. In the second step, as real-time transactions occur, we apply a heuristic (applying a standard test and a customer specific test, explained in detail in section V) to compute a risk score. This score is combined with the archived score using equations (1) and (2) to compute the overall risk probability.

For the Machine Learning Approach, we experimented with various supervised machine learning algorithms to determine the best algorithm. Then, for the Heuristic Approach, we take the best algorithm found from the Machine Learning Approach and apply it only to the offline data to calculate the offline risk probability Roffline. Whenever a new transaction occurs, we run the two tests (Standard Test, Customer Specific Test) on the online data to calculate the online risk probability Ronline.

Our proposed algorithm RISK shows the steps involved in calculating Ronline. The parameters for the algorithm are standard rules (SR), customer specific rules (CSR), Feature Scores (FS), and a batch of online transactions (T). Details of the rules, rule mappings, risk calculation steps, and flowcharts are described in detail in our previous work [3]. Each and every transaction is passed through the StandardTest function, which returns the violated standard rules (if any), which are then passed to the CustomerSpecificTest function to find the valid causes for the broken standard rules. There is a mapping between causes and the standard rules. Different causes also carry different weights based on the following criteria: 1) the mapping between causes and standard rules, 2) the mapping between standard rules and features, and 3) the ranking of associated features. We use the terms impact coefficient and weight interchangeably. To calculate the risk probability from online data we use the following formula:

Σ ImpactCoefficient(X) ...

1) ... standard rule given X is equal to null, and Ronline becomes maximum; 2) there are n related causes and all causes are valid given both X and Y are equal, and Ronline becomes zero; and 3) there are m valid causes among n related causes, given which Ronline is a value between maximum and minimum. Thus, our proposed RISK algorithm returns the online risk probability Ronline for a transaction.

___________________________________________________

RISK (SR, CSR, FS, T):
1. for each online transaction t of T
2.     ViolatedRules ← StandardTest(t)
3.     if count of ViolatedRules is greater than 0
4.         Ronline ← CustomerSpecificTest(ViolatedRules)
5.     else
6.         Ronline ← 0
7. return Ronline
___________________________________________________

Finally, the risk probabilities from both online and offline data are combined using a weighted method to see whether the account is going to default in the near future.

IV. DATA
In this work, we have used the "Taiwan" dataset [15] of Taiwan's credit card clients' default cases, which has 23 features and 30,000 instances, of which 6,626 (22.1%) are default cases. The same dataset has also been used in other research work [1][2]. Some of the features of this dataset are credit limit, gender, marital status, last 6 months' bills, last 6 months' payments, and last 6 months' re-payment status. Records are labeled as either 0 (non-default) or 1 (default). Fig. 1 shows a snapshot of 5 random records from the dataset.

As indicated earlier, the Heuristic Approach processes two different datasets related to credit card transactions: the offline data and the online data. However, one of the issues with research in this area is that both offline and online transactional data are not publicly available. Specifically, there are some public datasets that contain customer summarized profile information and credit information, but not individual credit transactions. (i.e., no single publicly available dataset that
R Online = [ 1 – ] × 100 contains both for the same set of customers). In order to tackle
ஊ୍୫୮ୟୡ୲େ୭ୣ୤୤୧ୡ୧ୣ୬୲ሺ௒ሻ
this issue, and provide a relevant data source for future work in
this area (something that we will make publicly available after
Here, X is the set of valid causes for breaking the standard publication), we will decompose the Taiwan dataset into both
rules, and Y is the set of relevant causes (valid or invalid) for offline and online datasets as shown with the examples in Table
breaking the standard rules. Some of the use cases of the 1 and Table 2.
formula are as follows: 1) there is no valid cause for breaking a

Fig. 1. Taiwan dataset

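The R Online formula above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation; the cause names and impact-coefficient (weight) values are hypothetical.

```python
# Sketch of R_Online = [1 - (sum of ImpactCoefficient over X) /
#                           (sum of ImpactCoefficient over Y)] * 100,
# where X = valid causes and Y = all relevant causes for the broken rules.
# Cause names and weights below are hypothetical, for illustration only.

def r_online(valid_causes, relevant_causes, impact_coefficient):
    sum_x = sum(impact_coefficient[c] for c in valid_causes)
    sum_y = sum(impact_coefficient[c] for c in relevant_causes)
    if sum_y == 0:  # no relevant causes: the RISK algorithm returns 0 instead
        return 0.0
    return (1 - sum_x / sum_y) * 100

ic = {"job_change": 0.5, "address_change": 0.3, "foreign_national": 0.2}

print(r_online([], ic, ic))                    # case 1: X empty -> maximum (100.0)
print(r_online(list(ic), list(ic), ic))        # case 2: X == Y -> 0.0
print(r_online(["job_change"], list(ic), ic))  # case 3: in between (50.0)
```

Consistent with the use cases in the text, an empty X yields the maximum risk, X equal to Y yields zero, and a partial overlap yields a value in between.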
ISBN: 1-60132-481-2, CSREA Press ©
Int'l Conf. Data Science | ICDATA'18 | 19
TABLE 1. OFFLINE DATASET CREATED FROM TAIWAN DATASET

account | balance_limit | sex | education | marriage | age | total_bill | total_payment | repayment | default
4663 | 50000 | 2 | 3 | 2 | 23 | 28718 | 1028 | 0 | 0
13181 | 100000 | 2 | 3 | 2 | 49 | 17211 | 2000 | 0 | 0
21600 | 50000 | 2 | 2 | 2 | 22 | 28739 | 800 | 0 | 0
1589 | 450000 | 2 | 2 | 2 | 36 | 201 | 3 | -1 | 0
28731 | 70000 | 2 | 3 | 1 | 39 | 133413 | 4859 | 2 | 0

TABLE 2. ONLINE DATASET CREATED FROM TAIWAN DATASET

tid | account | amount | date | type
53665 | 23665 | 660 | 2015-05-29 | pay
9328 | 9328 | 46963 | 2015-05-14 | exp
37597 | 7597 | 3000 | 2015-05-29 | pay
9495 | 9495 | 75007 | 2015-05-14 | exp
34113 | 4113 | 5216 | 2015-05-29 | pay

Initially, from each record (customer) in the Taiwan dataset, we created 5 online transactions of type "pay" (payment) from PAY_AMT1 to PAY_AMT5 and 5 online transactions of type "exp" (expenditure) from BILL_AMT1 to BILL_AMT5. Since BILL_AMT is the sum of all individual bills or transactions, we divided this BILL_AMT into individual transactions by following the data distribution of a real credit card transactions dataset. As shown in Table 3, BILL_AMT1 is the total bill and PAY_AMT1 is the payment amount for the month of September 2005, BILL_AMT2 is the total bill and PAY_AMT2 is the payment amount for the month of August 2005, and so on, up to the oldest month, which in this case is April 2005 (BILL_AMT6 and PAY_AMT6). So, initially, PAY_AMT6 and BILL_AMT6 go into the total_payment and total_bill for the offline data (Table 1). At the end of each month, the total_payment and total_bill are updated with that month's total bill (BILL_AMT) and total payments (PAY_AMT).

TABLE 3. MONTH VS FEATURE MAPPING IN TAIWAN DATASET

Month | Feature (BILL AMOUNT) | Feature (PAYMENT AMOUNT)
April | BILL_AMT6 | PAY_AMT6
May | BILL_AMT5 | PAY_AMT5
June | BILL_AMT4 | PAY_AMT4
July | BILL_AMT3 | PAY_AMT3
August | BILL_AMT2 | PAY_AMT2
September | BILL_AMT1 | PAY_AMT1

As stated previously, BILL_AMT is the summarized information of an entire month's transactions. We then break down this BILL_AMT into individual transactions by following the real credit card transaction data distribution of the "Spain" dataset [16] used in the work [18]. We then scaled those datasets up/down as needed to convert them into the same currency scale using the formula below:

V2 = ( (Max2 - Min2) × (V1 - Min1) ) / (Max1 - Min1) + Min2

where V2 = the converted value, Max2 = the ceiling of the new range, Min2 = the floor of the new range, Max1 = the ceiling of the current range, Min1 = the floor of the current range, and V1 = the value that needs to be converted.

To ensure that a corresponding transaction distribution can be followed in the "Spain" dataset for a BILL_AMT in the "Taiwan" dataset, we used equal frequency binning to determine the ranges under which a monthly bill amount (BILL_AMT) must fall. Equal frequency binning uses an inverse cumulative distribution function (ICDF) to calculate the upper and lower ranges. As a result, we came up with on average 359,583 online transactions per month for the same 30,000 accounts or records in the original dataset.

It should also be noted that another significant result of this work is the creation of a dataset for other researchers. As mentioned earlier, public access to credit card summary data and credit card transactional data for the same set of customers is rare. While it was necessary to create this dataset for our specific research purposes, we realize the benefit of making this dataset public to the general research community.

V. EXPERIMENT

For our experiments, we will use the Python scikit-learn library. The following sections describe the experimental setup for each of the two approaches that we discussed earlier.

A. Machine Learning Approach

We will run different machine learning algorithms on the "Taiwan" dataset. The purpose of this test is to evaluate an improved approach in terms of the following performance evaluation metrics: accuracy, recall, F-score, and precision. We chose these metrics for two reasons: 1) these are the metrics frequently used in related research, and 2) to compare the results with previous research using this Taiwan dataset.

Some of the algorithms we tried include K-Nearest Neighbor, Random Forest, Naïve Bayes, Gradient Boosting, Extremely Random Trees (Extra Trees), etc. We also used the k-fold (k = 10) cross-validation technique for the testing/training set split and to calculate performance metrics. Default parameters for all algorithms (in scikit-learn) were used unless otherwise mentioned.

B. Heuristic Approach

This approach originated from our previous preliminary work using a synthetic dataset [3]. However, in this work, we validate our approach by using the publicly available "Taiwan" dataset and dividing it into offline and online datasets. Besides addressing the limitations (e.g., the lack of validation of the proposed model using a known and real dataset), we also found a better base algorithm (Extremely Random Trees) than before, which contributes to the calculation of the offline risk probability Roffline in our Heuristic Approach. We briefly reiterate the two tests, discussed in detail in [3]:

1) Standard Test: The purpose of this test is to identify transactions that deviate from the normal behavior and pass them to the next test, named the Customer Specific Test. Here the normal behavior refers to the common set of standard rules that
every good transaction is bound to follow. Some of the standard rules that we applied to the "Taiwan" dataset include whether the minimum due was paid, whether the paid amount was less than the bill amount, whether the monthly total bill was greater than the balance limit, etc.

2) Customer Specific Test: This test is more customer-centric than the Standard Test, whose standard rules are applicable to every account in the same way. It takes customer specific measures like foreign national, job change, address change, promotion, salary increase, etc. into consideration. The purpose of this test is to recognize possible causes for which a transaction is unable to satisfy a standard rule in the Standard Test. In the experiment with the "Taiwan" dataset, this test was not completely in effect due to the lack of necessary information that can be extracted from the dataset.

As a consequence of the above tests, an online risk probability Ronline is returned from the RISK algorithm explained earlier in section III. Details of this procedure are described in our previous work [3]. The total risk probability for a transaction comes from both online and offline data. So, the equation of total risk probability is as follows:

R Total = R Online + R Offline   (1)

Here,
R Total = overall risk probability from both online and offline data
R Online = risk probability from online data
R Offline = risk probability from offline data

Initially, we get the risk probability from offline data (R Offline) for the corresponding accounts from the risk probability distribution of the classification results on the offline data. So, for the first transaction of an account, R Offline is the probability of the account defaulting, taken from the probability distribution of the classification outcome. Then, for a subsequent transaction N, Roffline is the total risk probability of the previous transaction:

R Offline of transaction N = R Total of transaction N-1

Thus, Roffline is updated in two situations: a) at the end of a transaction that has a positive Ronline; and b) at the end of the month, to synchronize with possible profile changes (e.g., a credit limit increase). That is the reason why we created 5 batches from 5 months of data and ran them chronologically to accommodate the profile change at the end of each month, which also leads to better results (Table 4 and Fig. 4).

Furthermore, the risk probabilities from the online data and offline data may carry different weights. For example, giving half of the weight (i.e., 50%) to offline data and the remaining half (i.e., 50%) to online data might provide better mining results for a particular company or dataset. On the other hand, for another company or dataset, a different combination of offline vs. online risk probability weights might be better. So, the modified version of (1) for the total risk probability calculation is:

R Total = λ R Online + (1 - λ) R Offline   (2)

where λ is the risk factor.

For our experiments, we have found that between 45% and 50% for the online data weight, with the remaining percentage for the offline data weight, provides the best results. In other words, if λ = .45 or .5, then 1 - λ = .55 or .5 accordingly. As a result, we have used λ = .5 for our experiments.

VI. RESULTS

Running the different algorithms on the "Taiwan" dataset, we discover that Extremely Random Trees outperforms all the standard machine learning algorithms and notable previous works [1][2] on this dataset in terms of accuracy, precision, recall, and F-score. Detailed scores are shown in Fig. 2. The performance gain is mainly due to the fact that the tree-based approach works very well for problems where the number of features is moderate, the data is properly labeled, and there are few missing values. To the best of our knowledge, the Extremely Random Trees algorithm has not been used on this dataset before.

Fig. 2 Performance by algorithms using Machine Learning Approach

Fig. 3 shows the comparison of performance using the Machine Learning Approach (i.e., applying only the best classifier, Extremely Random Trees, on the dataset without any other test like the Standard Test or Customer Specific Test), the Heuristic Approach, and the state of the art. So far, we have seen a maximum accuracy of 84% (82% on training data) and a maximum recall of 65.54% among all previous research work on this "Taiwan" dataset, while the Heuristic Approach has an accuracy of 93.14% and the Machine Learning Approach has an accuracy of 95.84%. We also realize a better recall percentage.
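The weighted combination of equation (2) and the carry-forward rule for Roffline can be illustrated with a short Python sketch. This is not the authors' code; the initial classifier probability and the per-transaction online risk values are hypothetical.

```python
# Sketch of the total-risk bookkeeping: equation (2) mixes online and
# offline risk with risk factor lambda, and the offline risk of
# transaction N is the total risk of transaction N-1 (carried forward
# when R_Online is positive). All numeric values here are hypothetical.

LAM = 0.5  # risk factor lambda; the paper's experiments use 0.5

def r_total(r_on, r_off, lam=LAM):
    # Equation (2): R_Total = lambda * R_Online + (1 - lambda) * R_Offline
    return lam * r_on + (1 - lam) * r_off

# Initial R_Offline: default probability from the offline classifier (here 40%).
r_off = 40.0
totals = []
for r_on in [0.0, 60.0, 20.0]:  # hypothetical online risks per transaction
    t = r_total(r_on, r_off)
    totals.append(t)
    if r_on > 0:  # update situation (a): carry R_Total forward as next R_Offline
        r_off = t

print(totals)  # [20.0, 50.0, 35.0]
```

Note that only transactions with a positive online risk update the carried-forward offline risk, matching update situation (a) above; the end-of-month profile synchronization (situation b) is omitted from this sketch.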
In fraud or risk detection, recall is very important because we don't want to miss fraud or risks. However, maximizing recall introduces an increase in false positives, which is expected in risk analytics.

Fig. 3 Performance comparison of different approaches

Recall that we divided the dataset into offline and online datasets, as mentioned in the Data section (Section IV), consisting of bill and payment data for 6 months. Data from the first month was included in the summarized fields (i.e., total_bill and total_payment), and for the remaining 5 months, we made 5 batches of offline data and 5 batches of online data. We then ran offline batch 1 and online batch 1 serially, followed by offline batch 2 and online batch 2, and so on, up to batch 5. This leads to comparing the results from the newly created offline (Table 1) and online (Table 2) datasets from the "Taiwan" dataset with the results of the Machine Learning Approach, as shown in Fig. 3.

TABLE 4. BATCH WISE PERFORMANCE METRICS

Batch | Accuracy | Precision | Recall | F-score | Computation Time offline (s) | Computation Time online (s)
1 | 0.94 | 0.92 | 0.82 | 0.87 | 12.23 | 152.54
2 | 0.94 | 0.88 | 0.86 | 0.87 | 9.14 | 109.35
3 | 0.94 | 0.84 | 0.89 | 0.87 | 13.04 | 86.90
4 | 0.94 | 0.82 | 0.91 | 0.86 | 11.59 | 89.74
5 | 0.93 | 0.80 | 0.92 | 0.86 | 10.21 | 133.58

Fig. 4. Batch wise performance metrics

From Table 4 and Fig. 4 we can see that recall increases as the number of batches increases. This implies that the percentage of targeted (i.e., defaulting) accounts increases (linearly) with the number of batches (i.e., as more information about the customer is known).

The computation time for both the Machine Learning Approach and calculating Roffline using Extremely Random Trees for 30,000 accounts was on average 11.24 seconds using a commodity laptop with an Intel Core i7 processor and 12 GB of RAM. Though Naïve Bayes is a bit faster than Extremely Random Trees, its performance in terms of accuracy, precision, recall, and F-score is not. For the online data computation, it took on average 114.42 seconds for a batch size of, on average, 359,583 transactions. For our interpretation of results, we created only one batch per month. However, there is nothing in our proposed approach that requires batches of this size; any number of transactions per month for online data could be used, which could lead to batches with a much smaller number of transactions and less computation time. To verify this, we tried batches of different sizes (reducing the batch size by half each time) and found that the computation time for the online data reduces almost linearly with the reduction in the number of transactions per batch. From Fig. 5, we can see that the trendline (dotted line) is almost in line with the actual line. This demonstrates how fast this approach can process online transactions and give a decision in near real-time.

Fig. 5 Batch size vs computation time

Another mentionable contribution of this approach is early detection. While there is a ~10% improvement in recall from the first month (batch 1) to the fifth month (batch 5), from a recall of 81.96% to 92.15%, it is clear that we can achieve a good recall very early in the process, enabling a real-time system to detect potential credit card default.

VII. CONCLUSION

In this research, we have used two approaches, Machine Learning and Heuristic, for mining default accounts from a well-known dataset. The Heuristic Approach came from our previous work [3], which we validated with actual data in this work. The main idea of the Heuristic Approach is to calculate the risk factor from the recent transactional data (online) and combine the results with pre-computed risk factors from historical (offline) data in an efficient way. To make the process efficient, we only
have to process a transaction when it initially occurs, and then the combined risk factor is carried forward for future transactions. We showed this approach can predict a default account significantly in advance, which is very cost efficient for the funding organization. In addition, we demonstrated that the performance of both approaches outperforms reported approaches using the same data set [1][2]. Our future plan is to improve the Heuristic Approach so that it outperforms the Machine Learning Approach in terms of all performance metrics, and to validate that with multiple datasets. Other plans include: testing and validating the model with multiple real datasets, standardizing the online vs. offline risk weight ratio (the value of λ) with multiple datasets of credit defaults, as well as handling concept drift to deal with changes in the distribution of the online data over time, which may affect the effectiveness of the approach.

REFERENCES

[1] Yeh, I-Cheng, and Che-hui Lien. "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients." Expert Systems with Applications 36.2 (2009): 2473-2480.
[2] Lu, Hongya, Haifeng Wang, and Sang Won Yoon. "Real Time Credit Card Default Classification Using Adaptive Boosting-Based Online Learning Algorithm." IIE Annual Conference. Proceedings. Institute of Industrial and Systems Engineers (IISE), 2017.
[3] Islam, Sheikh Rabiul, William Eberle, and Sheikh Khaled Ghafoor. "Mining Bad Credit Card Accounts from OLAP and OLTP." Proceedings of the International Conference on Compute and Data Analysis. ACM, 2017.
[4] Liang, Deron, et al. "A novel classifier ensemble approach for financial distress prediction." Knowledge and Information Systems 54.2 (2018): 437-462.
[5] Xiong, Tengke, et al. "Personal bankruptcy prediction by mining credit card data." Expert Systems with Applications 40.2 (2013): 665-676.
[6] Ramaki, Ali Ahmadian, Reza Asgari, and Reza Ebrahimi Atani. "Credit card fraud detection based on ontology graph." International Journal of Security, Privacy and Trust Management (IJSPTM) 1.5 (2012): 1-12.
[7] Yoon, J. W., and C. C. Lee. "A data mining approach using transaction patterns for card fraud detection." Jun. 2013. [Online]. Available: arxiv.org/abs/1306.5547.
[8] RamaKalyani, K., and D. UmaDevi. "Fraud detection of credit card payment system by genetic algorithm." International Journal of Scientific & Engineering Research 3.7 (2012): 1-6.
[9] Delamaire, Linda, Hussein Abdou, and John Pointon. "Credit card fraud and detection techniques: a review." Banks and Bank Systems 4.2 (2009): 57-68.
[10] Al-Khatib, Adnan M. "Electronic Payment Fraud Detection Techniques." World of Computer Science and Information Technology Journal (WCSIT), ISSN: 2221-0741, Vol. 2, No. 4 (2012): 137-141.
[11] Pun, Joseph King-Fung. Improving Credit Card Fraud Detection using a Meta-Learning Strategy. Diss. 2011.
[12] West, Jarrod, and Maumita Bhattacharya. "Some Experimental Issues in Financial Fraud Mining." Procedia Computer Science 80 (2016): 1734-1744.
[13] Bhattacharyya, Siddhartha, et al. "Data mining for credit card fraud: A comparative study." Decision Support Systems 50.3 (2011): 602-613.
[14] Gadi, Manoel Fernando Alonso, Xidi Wang, and Alair Pereira do Lago. "Credit card fraud detection with artificial immune system." International Conference on Artificial Immune Systems. Springer, Berlin, Heidelberg, 2008.
[15] "UCI Machine Learning Repository: Default of credit card clients Data Set". [Online]. Available: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#. Accessed: December 23, 2017.
[16] "Synthetic data from a financial payment system | Kaggle". [Online]. Available: https://www.kaggle.com/ntnu-testimon/banksim1/data. Accessed: February 9, 2018.
[17] "UCI Machine Learning Repository: Data Set". [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data. Accessed: Feb. 20, 2016.
[18] Lopez-Rojas, Edgar Alonso, and Stefan Axelsson. "BANKSIM: A bank payments simulator for fraud detection research." 26th European Modeling and Simulation Symposium, EMSS 2014.
[19] "Introduction to data warehousing concepts," 2014. [Online]. Available: https://docs.oracle.com/database/121/DWHSG/concept.htm#DWHSG9289. Accessed: Mar. 28, 2016.
[20] Geurts, Pierre, Damien Ernst, and Louis Wehenkel. "Extremely randomized trees." Machine Learning 63.1 (2006): 3-42.
Customer Level Predictive Modeling for Accounts Receivable to Reduce Intervention Actions

Michelle LF CHEONG, and Wen SHI
School of Information Systems
Singapore Management University
80 Stamford Road, Singapore 178902
Abstract - One of the main costs associated with Accounts Receivable (AR) collection is related to the intervention actions taken to remind customers to pay their outstanding invoices. Apart from the cost, intervention actions may lead to poor customer satisfaction, which is undesirable in a competitive industry. In this paper, we studied the payment behavior of invoices for customers of a logistics company, and used predictive modeling to predict if a customer will pay the outstanding invoices with high probability, in an attempt to reduce the intervention actions taken, thus reducing cost and improving customer relationships. We defined a pureness measure to classify customers into two groups, those who paid all their invoices on time (pureness = 1) versus those who did not pay their invoices (pureness = 0), and then used their attributes to train predictive models, to predict, for customers who partially paid their invoices on time (0 < pureness < 1), those who will pay with high probability. Our results show that a Neural Network model was able to predict with high accuracy, and we further concluded that for a 0.1 unit increase in the pureness measure, the customer is 1.132 times more likely to pay on time.

Keywords: accounts receivable; predictive modeling; intervention actions; customer level

1 Introduction

A typical Order-to-Cash process for businesses involves accounts receivable collection after an invoice is issued to the customer. A usual payment term of 30, 45 or 60 days is often provided for the customer to make full payment of the invoiced amount. However, many businesses face a similar problem of having customers who do not pay on time, and will have to take intervention actions to remind their customers to pay their outstanding invoices. Such intervention actions cost money and time, and may even lead to poor customer satisfaction.

An international logistics company serves many customers in delivering their goods in boxes or envelopes, to locations all over the world. A standard 30-day payment term is provided for all invoices and up to 45 days of allowance is permitted before intervention actions start to kick in. After 45 days, soft intervention actions like reminder phone calls will be made to the customers, and hard intervention actions like a letter of demand will be sent from the 61st day onwards. After 180 days, the payment will be deemed as bad debt and no future orders will be accepted from such bad customers.

Through interactions with the customers, the logistics company realized that delays in payments could be due to non-apparent reasons, such as the billing cycle being misaligned with the due dates of the invoices. Some of the customers will eventually make payments even without intervention actions being served upon them. Thus, in this project, we studied the payment behaviors of invoices for the good customers (those who paid all their invoices within 45 days) and bad customers (those who did not pay their invoices after 180 days), and used predictive modeling to predict if a customer will pay the outstanding invoices with high probability, in an attempt to reduce the intervention actions taken on them, thus reducing cost and improving customer relationships.

2 Literature Review

A patent was granted in 2007 for an invention to use an automated system and method to predict the likelihood of collecting a delinquent debt [1]. In this patent, the authors proposed several embodiments, including one with a predictive model that uses historical data of delinquent debt accounts, the collection methods used, and the success of the collection. In another embodiment, a predictive model uses profiles of delinquent debt accounts to summarize the patterns of events and the success of collection. A third embodiment described a predictive model that includes a vector representation of the collector's written notes to encode contextual similarity and map to a word space, in order to calculate the net present value of a delinquent debt, the preferred collection action, or the most appropriate collection agent. While this patent did not present the actual models and results, it contained possible statistical methods that could be used to enhance debt collection.

The first piece of published work [2] used supervised learning to build predictive models to predict if an invoice will be paid on time or delayed, which could be 1 to 30 days late, 31 to 60 days late, 61 to 90 days late, or more than 90 days late. The authors analyzed invoice records from four
firms, focusing on returning customers (customers with more than one invoice), and concluded that by using historical late payment behaviors of customers in addition to invoice features, there is a significant improvement in prediction accuracy. They extended their analysis to include customer features such as organization profile, and were able to further improve the prediction accuracy. By predicting that an invoice payment will be late, for example more than 90 days late, the company can take early preemptive actions to prevent such invoices from becoming bad debt. Finally, they compared the accuracy of a unified model (applicable to all four firms) versus firm-specific models, and showed that the firm-specific model yielded higher prediction accuracy.

Following [2], [3] and [4] are two master's theses from MIT which focused on similar work, using similar invoice and payment behavior features, and drew the same conclusion that the Random Forest model had the highest prediction accuracy for predicting if an invoice payment will be on time or delayed, and how long the delay will be. The author in [3] added some work analyzing the characteristics of delayed invoices and problematic customers, and concluded that there was no obvious correlation between invoiced amount and invoice delay. The author in [4] also performed further analysis and concluded that customers with fewer invoices are less likely to have late payments, and vice versa, and thus different models should be built for different customer groups. He also showed that prediction accuracy increases as the number of invoices per customer increases.

Apart from managing overdue invoices from the creditor's perspective, managing overdue invoices from the debtor's perspective was considered in [5]. The authors proposed a generic methodological framework for invoice payment processing which can take care of stochastic processing lead time, and built a cohort Markov chain model to simulate the process and identify bottlenecks for improvement, leading to a reduction in the payment of overdue invoices.

Our work is different from the past works in [2], [3] and [4] in a few areas. Firstly, we do not predict if a particular invoice payment will be on time or late; instead we focus on the customer as a whole. A customer can have multiple outstanding invoices, and we predict if he would likely pay on time or not pay at all. Secondly, we define a new pureness measure to determine if a customer is good (pureness = 1) or bad (pureness = 0) to train our predictive models, by using features related to past on-time payment behavior and organization profile, rather than features related to past late payments. We then use our model to predict for those customers who have pureness between 0 and 1 (partially paid on time), and identify those who are likely to pay on time with high probability, hoping to reduce the overall intervention actions taken. Thirdly, the past works were focused on increasing intervention actions on invoices which are likely to be late, while we are focused on reducing intervention actions on customers who are likely to pay on time. Finally, using our pureness measure, we can determine the relationship between computed pureness and the predicted probability to pay on time.

This paper is organized as follows. Section 3 defines our proposed pureness measure and how the measure is used to group the customers into three different groups. Section 4 describes the data preparation tasks, where seven database tables were provided and the derived attributes were computed, resulting in different numbers of customers in each group. In Section 5, we explore the data to gain some intuitive as well as interesting insights. From the interesting insights, we can support the conjecture that different billing cycles and different numbers of business years could be correlated to the number of intervention actions due to late invoice payment. In Section 6, we discuss the modeling process and the prediction results obtained. Section 7 concludes the paper.

3 Pureness Measure Definition

In [2], they created 14 aggregated features related to late payment behavior, including the ratio of paid invoices that were late, the ratio of the sum of paid base amount that was late, the average days late of paid invoices that were late, the ratio of outstanding invoices that were late, the ratio of the sum of outstanding base amount that was late, and the average days late of outstanding invoices that were late. Both [3] and [4] also included aggregated features related to late payment behavior, such as the number of delayed invoices, total amount of delayed invoices, average amount of delayed invoices, average delayed days, and their respective ratios. All these features were useful to predict if an invoice will be paid on time or delayed.

As the objective of our work is to predict if a customer would be likely to pay on time, we are more interested in measuring the percentage of invoices, both in terms of number and value, which were paid on time, rather than late. Thus, we have defined a new pureness measure as follows:

Pureness = W1 * (number of invoices paid on time / total number of invoices) + W2 * (sum of value of invoices paid on time / sum of value of all invoices)

where W1 and W2 are weights which can be adjusted according to how much emphasis the company wants to place on each factor.

By using this pureness measure, we can compute it for all the customers in the data set we have and group the customers into three groups:

• Group 1 – those who paid all their invoices within 45 days (Pureness = 1)
• Group 2 – those who did not pay their invoices after 180 days (Pureness = 0)
- Group 3 – those who partially paid their invoices within 45 days (0 < Pureness < 1)

The intention is to use Group 1 and Group 2 to train the predictive model, and then predict which customers in Group 3 are most likely to pay their invoices on time, like those in Group 1. For customers identified as having a high probability of being like those in Group 1, the number of intervention actions can be reduced.

4 Data Preparation

We were provided with a snapshot of the data in the month of January 2015 from the logistics company, which includes seven different database tables as described in Table 1. We joined all seven tables together using the customer ID, and removed records (rows) and attributes (columns) which had too many missing data. We were left with 10,562 unique customer records. For all the customers, we computed their individual pureness measure and used it as the target variable for the predictive model. The number of customers in each group is given in Table 2.

Table 1. Description of the Seven Database Tables (DBT) Provided

DBT 1 - Invoice
- This table contains the invoice information for all the customers located in a specific country. One invoice record is one row in the table, which contains the invoice number, customer ID, invoice date, invoice closed date, etc.
- Using the invoice date and invoice closed date, we can compute the number of days taken to make payment. Invoices paid within 45 days were marked as on time, and those not closed after 180 days were marked as bad debt. We can then compute the total number of invoices paid on time, and the total number of invoices in the data, for each customer ID.

DBT 2 - Intervention Actions
- This table contains all the past intervention actions taken for each customer. One intervention action record is one row in the table, which contains the intervention action ID, customer ID, invoice number, intervention action description, etc.
- Note that since we only have a snapshot of the data, the intervention actions consist of all historical intervention actions, which can be for invoices not found in DBT 1.

DBT 3 - Revenue
- This table contains the revenue associated with each invoice. One invoice record is one row in the table, which contains the invoice number, customer ID, revenue, etc.
- Using the revenue for each invoice, we can compute the total revenue of invoices paid on time, and the total revenue of all invoices in the data, for each customer ID.

DBT 4 - Payment Type
- This table contains the payment method for each invoice. One invoice record is one row in the table, which contains the invoice number, customer ID, payment type (paid by sender, paid by recipient, or paid by others), etc.
- Using the payment type, we can compute the percentage of invoices which are paid by sender, receiver or others.

DBT 5 - Customer
- This table contains information related to the customers. One customer record is one row in the table, which contains the customer ID, total number of employees, year in which the company was started, industry code, etc.

DBT 6 - Air Bill
- This table contains information for each air bill. One air bill is one row in the table, which contains the air bill ID, invoice number, weight, volume, amount, whether the shipment is an envelope or a box, etc.
- An air bill (or air waybill) is a document that accompanies shipped goods to provide detailed information about the shipment and allow it to be tracked. One invoice can consist of more than one air bill. Using the weight, volume and amount of each air bill, we can compute the average weight, average volume, average amount and percentage of envelope shipments, over all the air bills for all the invoices related to a customer ID.

DBT 7 - Billing Cycle
- This table contains the billing cycle and payment mode for each customer. One customer record is one row in the table, which contains the customer ID, payment by cash or not, billing cycle (daily, weekly, bi-weekly, or monthly), etc.

Based on the computed pureness measure for each customer, we have 5373 customers in Group 1 (pureness = 1), 2747 in Group 2 (pureness = 0), and the remaining 2442 in Group 3 (0 < pureness < 1). Among the customers in Group 1, 2004 had only one invoice with the company. We removed such customers, as we were unable to confidently determine their good payment behavior based on a single invoice. Finally, 3369 customers in Group 1 and 2747 customers in Group 2 were used for training the predictive model.
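The pureness computation, invoice labeling and grouping rules above can be sketched in Python. This is a minimal illustration, not the authors' code; the function names, the equal default weights W1 = W2 = 0.5, and the "late" catch-all label are our assumptions:

```python
def pureness(invoices, w1=0.5, w2=0.5):
    """Weighted pureness of one customer.

    invoices: list of (value, paid_on_time) pairs.
    Returns W1 * (on-time count share) + W2 * (on-time value share).
    """
    n_on_time = sum(1 for _, on_time in invoices if on_time)
    v_on_time = sum(value for value, on_time in invoices if on_time)
    total_value = sum(value for value, _ in invoices)
    return w1 * n_on_time / len(invoices) + w2 * v_on_time / total_value

def payment_status(days_open, closed):
    """Label an invoice from the snapshot: on time if closed within 45 days,
    bad debt if still open after 180 days, otherwise late/undetermined."""
    if closed and days_open <= 45:
        return "on_time"
    if not closed and days_open > 180:
        return "bad_debt"
    return "late"

def assign_group(p):
    """Group 1: all invoices on time; Group 2: none on time; Group 3: mixed."""
    if p == 1.0:
        return 1
    if p == 0.0:
        return 2
    return 3
```

With equal weights, a customer who paid one of two equal-valued invoices on time gets a pureness of 0.5 and falls into Group 3.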
x Using the revenue for each invoice, we can


Table 2. Number of Customers in Each Group

Group | Pureness Measure | Number of customers | Number of customers used for predictive model
1     | 1                | 5373                | 3369
2     | 0                | 2747                | 2747
3     | Between 0 and 1  | 2442                | 2442
TOTAL |                  | 10562               | 8558

5 Data Exploration

With the 17 attributes given in Table 3, including the unique customer ID and the target variable (the pureness measure), we explored the data to uncover insights. The more intuitive observations include:

- Customers who do not pay by cash tend to be late in payment and thus received more intervention actions.
- Customers with more air bills tend to be late in payment and thus received more intervention actions.
- Customers from smaller companies, which tend to have immature business processes, tend to be late in payment, leading to more intervention actions.

The more interesting and insightful observations include:

- Figure 1 shows that customers which are 11 to 20 years in operation (level 4 for business year) tend to have late payments and thus received more intervention actions. This observation was consistent for customers with pureness = 0 and pureness = 1. Note that our data set consisted of all intervention actions, which could be for invoices not within our data set; thus, those with pureness = 1 could also have had intervention actions taken on them in the past.
- Figure 2 shows that customers with a weekly (WL) billing cycle tend to have late payments and thus received more intervention actions, followed by companies with a bi-weekly billing cycle (BWL). This seems to suggest that daily (DL) and monthly (ML) billing cycles are preferred.

These observations show that billing cycle and number of business years are correlated with the number of intervention actions due to late invoice payment. Thus, there is merit in predicting payment behavior at the customer level rather than the individual invoice level.

Figure 1. Distribution of total number of intervention actions by business year levels for pureness = 0 and pureness = 1

Figure 2. Distribution of total number of intervention actions by billing cycle for pureness = 0 and pureness = 1

Table 3. Attributes of the Data Set Used

Attribute | Description
1 Serial_Nos_ | Unique customer ID
2 Billing_cycle | 4 types: DL (daily), WL (weekly), BWL (biweekly), ML (monthly)
3 Cash_or_not | 1: customer pays by cash; 0: customer does not pay by cash
4 Airbill_count_total | Total number of air bills
5 Bill_count | Number of air bills for this month


6 C_percent | Percentage of all transactions paid by the recipient
7 O_percent | Percentage of all transactions paid by others
8 S_percent | Percentage of all transactions paid by the sender
9 Per_of_envelope | Percentage of all transactions delivered by envelope
10 Avg_amount | The average amount of all the air bills
11 Avg_volume | The average volume of all the air bills
12 Avg_weight | The average weight of all the air bills
13 Intervention_count_new | The total number of all the intervention actions taken
14 Total_revenue_new | Total revenue
15 Employee_level | 4 levels: 1 (1-50 employees), 2 (51-300), 3 (301-1000), 4 (more than 1000)
16 Business_year | 8 levels: 1 (0-2 years), 2 (3-5), 3 (6-10), 4 (11-20), 5 (21-30), 6 (31-40), 7 (41-50), 8 (more than 50)
17 Pureness | Calculated pureness measure

6 Modeling and Prediction

Using SAS Enterprise Guide, seven models were built: Probability Tree, Misclassification Tree, Regression, Polynomial Regression, Neural Network, Regression with Neural Network (2), and Ensemble Model, as shown in Figure 3. Several data handling steps were included in the process to partition the data, transform the variables and impute missing data. Both the training (66%) and validation (33%) data sets followed the same process flow. The final chosen model was used to perform the prediction for the prediction data set (Group 3).

We used the Receiver Operating Characteristic (ROC) chart and fit statistics to select the best predictive model. A ROC chart is a two-dimensional plot of Sensitivity (y-axis) against 1-Specificity (x-axis) at different threshold settings. Sensitivity, also known as the true positive rate, measures the proportion of positives that are correctly identified among all the positives, while 1-Specificity, also known as the false positive rate, measures the probability of a false alarm. We want a model with the highest Sensitivity and lowest 1-Specificity, which corresponds to the top-left corner of the chart at (0, 1). Thus, the model closest to (0, 1) is the best model. From Figure 4, we can see that the Neural Network model is the closest to the (0, 1) point, and is thus the best model.

For fit statistics, several statistical measures help determine whether a model is well fitted. Measures including the sum of squared errors, average squared error, root average squared error, misclassification rate, and maximum absolute error were provided by the SAS output file. Misclassification rate is more suitable for categorical prediction, while the others can be used for numerical prediction. To obtain a tight measure, the sum of squared errors, average squared error and root average squared error are suitable, as they report similar results. The fit statistics given in Table 4 also show that the Neural Network model had the lowest sum of squared errors, average squared error and root average squared error among all the models, and is thus the best model.

Table 4. Fit Statistics for All Models

Model | Sum of Squared Errors | Average Squared Error | Root Average Squared Error
Neural Network | 797.8585 | 0.097897 | 0.312885
Ensemble Model | 811.9171 | 0.099622 | 0.315629
Neural Network (2) | 816.4224 | 0.100175 | 0.316504
Probability Tree | 840.4616 | 0.103124 | 0.321129
Misclassification Tree | 898.7805 | 0.110280 | 0.332084
Polynomial Regression | 913.5018 | 0.112086 | 0.334793
Regression | 1158.587 | 0.142158 | 0.377038
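The three error measures used for model selection can be reproduced directly from paired actual and predicted pureness values. This is a sketch of the standard definitions, not the SAS implementation:

```python
def fit_statistics(actual, predicted):
    """Sum of squared errors (SSE), average squared error (ASE) and
    root average squared error (RASE) over paired observations."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ase = sse / len(actual)
    return sse, ase, ase ** 0.5
```

Since RASE is a monotonic function of ASE, and ASE of SSE (for a fixed sample size), the three measures rank models identically, which is consistent with Table 4.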


Figure 3. Modeling Process Flow using SAS Enterprise Guide

Figure 4. ROC Chart for All Models

Using the selected Neural Network model, we predicted the outcome for the Group 3 customers; 41.89% of them were predicted to have pureness = 1 with high predicted probabilities, as shown in Table 5. With these results, the logistics company can sort the customers by predicted probability in descending order, and choose to reduce the intervention actions for customers with high probabilities. This will reduce the intervention actions taken and the associated costs.

Table 5. Predicted Pureness for Group 3 Customers

Pureness Prediction | Percentage of Customers
1 | 41.89%
0 | 58.11%
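The triage step above amounts to a simple sort and cut-off. A minimal sketch, where the 0.9 threshold is a hypothetical example rather than a value from the paper:

```python
def customers_to_relax(predictions, threshold=0.9):
    """Given {customer_id: predicted probability of pureness = 1}, return the
    IDs at or above the threshold, sorted by probability in descending order."""
    hits = [(prob, cid) for cid, prob in predictions.items() if prob >= threshold]
    return [cid for prob, cid in sorted(hits, reverse=True)]
```

The company would then reduce intervention actions only for the returned customers, leaving the rest under the existing collection process.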


With so many customers to handle for a large logistics company, we attempted to obtain a general relationship to help the company quickly evaluate whether a customer will likely pay on time, simply based on the customer's pureness measure computed from past on-time payment behavior. For the Group 3 customers, with predicted pureness measure (between 0 and 1) and their respective predicted probabilities, we plotted their probabilities against their actual pureness measure. It was found that a linear relationship exists where, for a 0.1 unit increase in pureness measure, the customer is 1.132 times more likely to pay on time, as shown in Figure 5. Such a relationship can be useful as a general rule of thumb: the logistics company can quickly compute a customer's pureness measure from past on-time payment behavior, determine the corresponding probability of paying on time, and decide whether fewer intervention actions should be taken, should it decide not to run the predictive model.

Figure 5. Linear Relationship Between Computed Pureness and Predicted Probability

7 Conclusions

We have successfully used a predictive model to predict whether a customer will pay outstanding invoices on time with high probability, in an attempt to reduce the intervention actions taken on them, thus reducing cost and improving customer relationships. We focused on the customer level rather than individual invoices, by including features such as the customer's billing cycle, employee level and business year. From our results, it was found that a linear relationship exists where, for a 0.1 unit increase in pureness measure, the customer is 1.132 times more likely to pay on time. This relationship is useful as a general rule of thumb to determine whether interventions should be taken for a customer, based simply on the computed pureness measure, without using the predictive model.

8 References

[1] M. Shao, S. Zoldi, G. Cameron, R. Martin, R. Drossu, J.G. Zhang, and D. Shoham, "Enhancing delinquent debt collection using statistical models of debt historical information and account events". US Patent 7,191,150, March 13, 2007.

[2] S. Zeng, P. Melville, C.A. Lang, I. Boier-Martin, and C. Murphy, "Using predictive analysis to improve invoice-to-cash collection", in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 1043-1050.

[3] P. Hu, "Predicting and improving invoice-to-cash collection through machine learning". Master Thesis, Master of Science in Transportation, Massachusetts Institute of Technology, 2015.

[4] W. Hu, "Overdue invoice forecasting and data mining". Master Thesis, Master of Science in Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2016.

[5] B. Younes, A. Bouferguene, M. Al-Hussein, and H. Yu, "Overdue invoice management: Markov chain approach". Journal of Construction Engineering and Management, 2015, 141(1): 04014062.
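The reported "1.132 times per 0.1 unit increase" relationship compounds multiplicatively over larger pureness differences. The following sketch encodes that reading of the rule of thumb; it is our interpretation, not code or a formula from the paper:

```python
def likelihood_multiplier(p_from, p_to, factor_per_step=1.132, step=0.1):
    """How much more likely a customer with pureness p_to is to pay on time
    than one with pureness p_from, under the reported rule of thumb."""
    return factor_per_step ** ((p_to - p_from) / step)
```

For example, moving from pureness 0.3 to 0.8 implies a factor of 1.132**5, roughly 1.86.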


A Platform For Sentiment Analysis on Arabic Social Media

Sara Elshobaky¹ and Noha Adly¹,²
¹ Bibliotheca Alexandrina, Alexandria, Egypt
² Faculty of Engineering, Alexandria University, Alexandria, Egypt
{sara.elshobaky, noha.adly}@bibalex.org

Abstract—A platform is built for applying sentiment analysis to social media in colloquial Arabic. The applied classifications are based on machine-learning techniques using neural word embeddings for feature extraction. The system was built and tuned for a case study addressing the extraction of opinions, from different aspects, toward 'Products Made in Egypt'. Intrinsic and extrinsic evaluations of the embeddings showed that a colloquial corpus of the products domain yields good performance. Further, we studied different embedding models and tuned their parameters for the different classifications in order to reach the best accuracy. For the different classifications, the F1-score ranges from 74% to 93%, with a precision ranging from 82% to 93%.

Keywords—sentiment analysis, word embedding, machine-learning, classification, opinion mining

1. Introduction

Sentiment analysis refers to discovering people's opinions and feelings about a topic, be it a product, a service, etc. This is achieved by classifying the polarity of a given text as positive, negative or neutral. Recently, people have been using social media extensively; for example, in 2017, every minute 510,000 comments were posted on Facebook and 350,000 tweets were sent on Twitter. Therefore, analyzing social media content can reflect public opinion.

Recent studies focused on applying sentiment analysis on social media for politics and for marketing products. Many studies applied sentiment analysis to product reviews, such as comparing different pizza companies [1] or phone brands [2]. Others used sentiment analysis to monitor brand pages [3] [4] or to study the impact of social media on stock performance [5]. It has been noticed, though, that most studies addressed the English language; very little work has addressed Arabic, and more specifically colloquial Arabic. Colloquial Arabic has many variations that require special handling, as opposed to Modern Standard Arabic (MSA). For example, the word 'look' in MSA is 'انظر', while in colloquial Arabic it is 'بص' or 'شوف'. Other words are transliterated from English, like the word 'bravo' written as 'برافو'. Some words have different sentiments depending on the domain; for example, the word 'خطير', which means 'dangerous', has a negative sentiment in MSA but a positive sentiment when used in colloquial Arabic. This is in addition to the different ways of writing and the common spelling mistakes on social media. For example, the word 'حلو' could be written as 'حلوو', 'وحلو' or 'حلواوى'. Hence, traditional handcrafted features such as stemming and POS tagging are not suitable in this case.

Recently, deep learning and especially word embedding approaches have been introduced to sentiment analysis and have shown very good results [6]. The main idea is to train a neural word embedding from a large corpus to map words from the vocabulary to vectors of real numbers. The vectors created by the word embedding preserve word similarities, so words that regularly occur nearby in text will also be close in vector space. In this work, this approach is used to represent features for the sentiment classifier.

In the last two years, an initiative has been raised to support 'Products Made in Egypt' in order to assist the Egyptian economy in a very tough period. The initiative created a lot of traffic on social media, where users shared their experience with those products. Our platform is used to analyze those discussions in order to discover users' opinions about the products in general, as well as about prices, quality and availability specifically. In addition, changes in opinion are measured during major events in the last two years.

In this work, machine learning is applied with word embeddings to classify users' opinions toward 'Products Made in Egypt'. The main contributions are:

- Building a platform for applying sentiment analysis and categorization on social media.
- Applying this platform to a case study analyzing users' opinions, from different aspects, towards 'Products Made in Egypt'.
- Tuning this platform for the colloquial Arabic language and the case study domain.

In the next section, the sentiment analysis approaches used in other research are described briefly. Section 3 describes in detail the methodology used to build our platform. Section 4 presents results for the analyzed case study. Finally, Section 5 concludes the paper.

2. Related Work

There are two main approaches to sentiment analysis, unsupervised and supervised classification; the accuracy of either is measured by how well it agrees with human judgments.

In the unsupervised technique, sentiment is calculated based on a lexicon of polarized terms. Such a lexicon can be ready-made or extended from some initial seeds. Ready-made English lexicons such as


SentiWordNet [7] suffer from not being domain-specific and are not suitable for colloquial language. Lexicons extended from seeds can be dictionary-based or corpus-based. Dictionary-based lexicons are mostly domain-independent: relations such as synonymy and antonymy are used to expand a small list of seed words. In corpus-based lexicons, polarity is decided by co-occurrence in a corpus; the type of co-occurrence could be a conjunction of adjectives by "and/or" as in [8], appearance in the same window as in [9], or generation from an annotated corpus [10]. There are no well-known available Arabic sentiment lexicons, so many researchers have tried to build their own. El-Beltagy [11] extended 370 seed words to construct an Egyptian dialect sentiment lexicon of 4,392 terms; however, her lexicon was oriented towards politics. Abdulla [12] expanded 300 seed words to 3,479 terms using synonyms of each word; his lexicon was generic and in MSA form.

The supervised technique is mainly based on machine learning. In polarity classification, a set of labeled data is collected, then a set of features is extracted to train the classifier. The extracted features could be unigrams (either stemmed or lemmatized), part-of-speech tags, emoticons or any other handcrafted features. Both the labels and the extracted features are fed to the machine learning algorithm to build the classifier model. In turn, the classifier model assigns a label to the tested input text based on its extracted features. The accuracy of a well-trained machine learning approach can outperform the lexicon-based approach given good feature selection techniques. Pang [13] was among the first to use this approach to classify reviews into positive and negative using different classification algorithms. Subsequently, a large body of research tackled the supervised classification problem by engineering sets of different features [4]. SAMAR [14] [15] is an Arabic supervised sentiment analysis system that investigates how to treat Arabic dialects. The authors suggested individual solutions for each domain and task, but found that lemmatization is the best feature in all approaches. However, in our colloquial language case, lemmatization cannot handle spelling mistakes and concatenated words.

Recent research has tackled sentiment analysis using word embeddings. In unsupervised classification, the generated word embedding can be used in a label propagation framework to induce sentiment lexicons using small sets of seed words [16]. In supervised classification, the word embedding has been used to represent features for the sentiment classifier [17]. Tang [18] proposed a Sentiment-Specific Word Embedding (SSWE). His method encodes sentiment information in the word embedding by training the embedding from a large set of labeled tweets collected by positive and negative emotions. He reported better performance when training the embedding with a relatively small context window size. Labutov [19] suggested re-embedding an existing embedding using some labeled data. His method showed an improvement in the sentiment classification task, but he observed that the approach is most useful when the training data is small.

In the Arabic language, Altowayan [20] addressed the sentiment analysis problem using a generic word embedding. He achieved a slightly better accuracy than the top handcrafted methods in both Standard Arabic and Dialectal Arabic. However, his corpus was small and generic, and he used only one embedding model with default parameters.

3. Methodology

Building our platform required three phases. The first phase is collecting a large corpus and then using it to train the word embedding. The second phase is collecting the training dataset in order to train the classifier. The third phase is tuning the embedding parameters to achieve the best accuracy for the case study. In this section, we describe each phase in detail.

3.1 Building Corpus and Training Embeddings

3.1.1 Crawling Social Media

Initially, a Facebook group called 'Proudly made in Egypt' was crawled. This group aims to discuss Egyptian products and compare them with imported ones. By the end of 2017, the group had reached more than 610,000 members and served 12,170 feeds with 663,981 comments. This data was crawled from Facebook with its publicly available Graph API tool, and then all Facebook page URLs were extracted to be used as new seeds to crawl. After pruning, more than 1,000 new Facebook pages were obtained. Then, Arabic Facebook pages related to products were crawled, even if the products were not made in Egypt. In addition, hashtags were extracted from the crawled content and, after pruning, used as seeds to crawl Twitter. Finally, around 717,530 tweets were crawled. The final Social Media (SM) corpus contains around 52 million feeds, comments and tweets with 477 million tokens.

For the purpose of comparison with a larger corpus, another three corpora were generated by combining the above SM corpus with three different MSA corpora:

- The Altowayan [20] corpus (ALTW), with 19 million tokens of Arabic news, reviews and the Quran text.
- Aracorpus¹ (ARAC), with 67 million tokens of raw text from Arabic daily newspapers collected over a year between 2004 and 2005.
- The Arabic Wikipedia dump (WK), with 89 million tokens.

These three versions of the corpus help evaluate the effect of associating an MSA corpus with the colloquial one.

3.1.2 Preprocessing

The preprocessing performed on the text has a strong impact on the generated embedding. Social media users can mistakenly use similar letters interchangeably, and such spelling mistakes increase the number of unique tokens, which in turn increases the computational complexity of training the embedding. Hence, some filtrations, unifications and normalizations are applied as follows:

- Filter the corpus for repeated characters, punctuation, diacritics and elongation.

¹ http://aracorpus.e3rab.com/argistestsrv.nmsu.edu/AraCorpus/


- Standardize characters such as (أ، إ، ا), (ى، ي) and (ة، ه).
- Unify common terms such as question marks, hyperlinks, name tags (i.e. @Name), following terms (i.e. 'م', 'f' and 'up') and numbers.
- Unify emoticons and abbreviations based on their sentiment meaning.

3.1.3 Training the Word Embedding

In this paper, three different word embedding models were compared: CBOW [21], Skip-Gram [21] and GloVe [22]. These models have gained popularity in the last few years as they outperform traditional models. The first two are predictive models based on neural networks. In CBOW, Mikolov [21] uses the n words before and after the target word to predict it, while Skip-Gram uses the center word to predict the surrounding words. GloVe instead initializes vectors with co-occurrence statistics, by training on global word-word co-occurrence counts.

Initially, three embeddings of the above models were trained from the SM corpus. The same was performed with the SM+ALTW, SM+WK and SM+ARAC corpora. In order to compare these embeddings with others, we downloaded other publicly available Arabic embeddings: the Zahran [23] embeddings, trained from a large amount of collected MSA corpora, and the Aravec [24] embedding models, built on top of completely different Arabic corpus domains from Twitter (TW) and the World Wide Web (WWW). Table 1 lists the details of these embeddings.

Table 1: Evaluated embeddings

Corpus | Number of Tokens | Parameters
SM | 477 million | (Dim 300, Win 10)
SM+ALTW [20] | 496 million | (Dim 300, Win 10)
SM+ARAC | 544 million | (Dim 300, Win 10)
SM+WK | 567 million | (Dim 300, Win 10)
TW [24] | 1,090 million | (Dim 300, Win 3)
WWW [24] | 2,225.3 million | (Dim 300, Win 5)
Zahran [23] | 5.8 billion | Skip-Gram (Dim 300, Win 10), CBOW (Dim 300, Win 5)

To evaluate the embeddings, intrinsic and extrinsic evaluations are used [25]. For the intrinsic evaluation, we used an analogy test and a colloquial categorization test. To conduct the analogy test, the manual translation of Mikolov's [21] test from English to Arabic provided by Zahran [23] was used. Given that the translation was in MSA, which differs from colloquial Arabic, the test was applied by considering an answer correct if one of the top 10 predicted words matched it, as opposed to the top 5 used by [23]. The coverage (cov) and accuracy (acc) for each embedding are computed.

In addition, we generated a categorization test inspired by the Battig test set introduced by Baroni et al. [26]. Twenty colloquial terms related to products from five different categories were selected, as listed in Table 2. The term vectors generated from the embedding are clustered using the k-means algorithm, and the purity of the returned results is computed. The same is applied to cluster another 14 colloquial sentiment terms into positive and negative, as listed in Table 3. The coverage and accuracy of the analogy test, and the purity results of the clustering tests, are reported in Table 4.

Table 2: Product terms

Clothes: فستان، شوز، بلوزه، بنطلون
Fruits: بطيخ، فراوله، موز، تفاح
Home Appliances: تلاجه، مكنسه، خلاط، تلفزيون
Cosmetics: كحل، ماسكرا، بلسم، برفان
Furniture: سرير، دولاب، ترابيزه، كرسي

Table 3: Sentiment terms

Positive: يجنن، شيك، تحفه، جميل، شابو، مناسب، راقي
Negative: غالي، معفن، ردئ، وحش، سئ، مقرف، زفت

It is noticed that the accuracy and coverage of the analogy test for Zahran's embeddings are higher than for the social media embeddings. This is due to the use of MSA terms in the analogy questions, while the social media corpora are mostly colloquial. However, the results of the social media corpora are still better than those of the TW and WWW embeddings. In addition, the CBOW (CB) and Skip-Gram (SG) models with the social media corpora outperform the GloVe model in the analogy test.

For product clustering, the four social media corpora correctly clustered the product terms, except the SM+ARAC embedding with the Skip-Gram model. For sentiment clustering, the Skip-Gram model of the social media embeddings correctly clusters the sentiment words, except for the SM+WK embedding. The GloVe model shows better results when using the purely social media corpus. The CBOW social media corpora reported the same high purity of 92.85%.

This simple intrinsic evaluation led us to use one of the social media embeddings in the classifications. This makes sense, since colloquial terms may have different meanings in MSA. For example, the word 'غالي' is a negative term in the products domain, as it means 'expensive', while in other domains it could be a positive term meaning 'precious'. In addition, spelling mistakes and word variations are better captured when using a social media corpus. In order to visualize the product term and sentiment term vectors from the SM embedding, the TensorBoard² tool is used. This tool includes an Embedding Projector that visualizes embeddings by rendering them in two or three dimensions. Figure 1 and Figure 2 show the product term and sentiment term vectors in 2D from the Skip-Gram SM embedding. As shown in the two figures, the term vectors are correctly clustered.

Figure 1: Product term vectors clustering. Figure 2: Sentiment term vectors clustering.

² https://github.com/tensorflow/tensorboard
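The purity score used in the categorization tests credits each k-means cluster with its majority true label. A small sketch of this standard measure (our implementation, not the authors' code):

```python
from collections import Counter

def purity(cluster_ids, true_labels):
    """Fraction of items falling in a cluster whose majority label matches
    their own. cluster_ids and true_labels are parallel sequences."""
    majority_total = 0
    for cluster in set(cluster_ids):
        members = [lab for cid, lab in zip(cluster_ids, true_labels) if cid == cluster]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(true_labels)
```

For the 14 sentiment terms, one misplaced term out of 14 yields a purity of 13/14, roughly the 92.85% reported in Table 4.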


Table 4: Intrinsic evaluation results of the embeddings

Analogy Test (Cov. / Acc.):
Corpus | CB | SG | GloVe
SM | 19.2% / 68.80% | 19.2% / 61.34% | 19.2% / 52.77%
SM+ALTW [20] | 20.7% / 69.33% | 20.7% / 61.09% | 20.7% / 55.33%
SM+ARAC | 22% / 68.85% | 22% / 61.32% | 22% / 53.23%
SM+WK | 23.5% / 71.89% | 23.5% / 66.97% | 23.5% / 64.49%
TW [24] | 21.8% / 54.86% | 23% / 51.99% | n/a
WWW [24] | 25.5% / 57.51% | 25.3% / 40.13% | n/a
Zahran [23] | 25.3% / 75.68% | 25.3% / 72.65% | n/a

Product Clustering (Purity):
Corpus | CB | SG | GloVe
SM | 100% | 100% | 100%
SM+ALTW [20] | 100% | 100% | 100%
SM+ARAC | 100% | 95% | 100%
SM+WK | 100% | 100% | 100%
TW [24] | 60% | 75% | n/a
WWW [24] | 80% | 70% | n/a
Zahran [23] | 95% | 90% | n/a

Sentiment Clustering (Purity):
Corpus | CB | SG | GloVe
SM | 92.85% | 100% | 92.85%
SM+ALTW [20] | 92.85% | 100% | 85.71%
SM+ARAC | 92.85% | 100% | 78.57%
SM+WK | 92.85% | 92.85% | 85.71%
TW [24] | 71.42% | 57.14% | n/a
WWW [24] | 78.57% | 78.57% | n/a
Zahran [23] | 85.71% | 92.85% | n/a

The decision of which corpus to use to train the embedding (SM, SM+ALT, SM+ARAC or SM+WK) and which model to use (CBOW, Skip-Gram or GloVe), with which parameters, to generate the best classification accuracy is yet to be taken. As per Schnabel [25], different tasks favor different embeddings. As one of his extrinsic evaluations, he suggested measuring the contribution of the word embedding to sentiment classification. This is the approach adopted to decide on the corpus and the model to be used, and it is presented in Section 3.3.

3.2 Collecting Training dataset

3.2.1 Selection of data
To train the classifier, a dataset selected from the crawled Facebook content is judged by humans. More than 400,000 comments that are neither follows nor tags and that contain Arabic text are filtered. To limit the influence of spam, comments are divided into groups based on the last two digits of the comment id, following the recommendation of [4].

In order to select a balanced set of positive, negative and neutral sentences, an unsupervised classification method is used to preliminarily classify those comments. The well-known Sentistrength application [27] was deployed for this purpose. It uses sentiment lexicons consisting of positive, negative, negation and intensifier terms to generate a sentiment score. Sentistrength was originally designed for English and was not adapted to Arabic [28]. In order to use it, the Sentistrength lexicons have been replaced with colloquial Arabic terms. We manually extracted a few positive and negative seeds from the most frequent terms occurring in the corpus. Then, we extended them by extracting similar words from the SM embedding. The same process is applied for negation and intensifier terms. Sentistrength is then fed those extended lexicons to calculate the sentiment score of the selected comments. A training dataset of 15,000 comments was selected with balanced positive, negative and neutral comments.

3.2.2 Judging of data
To train the classifier, three human judges manually labeled the 15,000 selected training comments as either positive, negative, neutral or containing mixed sentiments. In addition, judges labeled each comment based on three categories, namely price, quality and availability. For example, 'high quality' is labeled as positive quality, while 'expensive' is labeled as negative price. The selected training data has around 5% mixed sentiments across different categories (e.g. '‫ﺣﺎﺟﺘﮫ ﻏﺎﻟﯿﮫ ﺑﺲ ﻛﻮﯾﺴﮫ‬', which means 'high quality but expensive'). Using separate training data for each category classification allows easily classifying those mixed sentiments. Finally, comments that at least two judges agreed upon are selected. The final judged training dataset is around 13,884 comments, as reported in Table 5.

Table 5: Judges Results

                           All     Positive  Negative  Neutral
Overall training dataset   13,884  5,127     3,453     5,304
Price                      1,749   469       744       536
Quality                    6,407   4,238     2,018     151
Availability               1,741   1,082     476       183

3.3 Classifying Data

A machine learning technique has been used as a supervised binary classification task. The classification algorithm used is the Support Vector Machine (SVM). The SVM algorithm is chosen since it shows very good performance and higher accuracy in many studies directed towards sentiment analysis in many languages [13]. The feature representation used in the algorithm is a fixed-size embedding vector. The feature extracted for each sample is the average of the embedding vectors of its words [21]. Based on the quality of the embedding, similar samples will generate similar feature vectors.

To choose the best embedding for the different classifications, different embeddings are trained from the social media corpus using three models (CBOW, Skip-Gram and GloVe) with different parameters. The main training model parameters are the dimension (dim) of the learned word vectors, the window (win) size to be included as the context of a target word, and the number of training iterations (itr) over the corpus. Usually, the most used vector dimension is 300, with window size 5 for CBOW and 10 for Skip-Gram. In our experiments, we varied the vector dimension from 100 to 600 and the window sizes between 5 and 10 on either side of the target word, to capture the sense of the common statement length of Facebook comments. All results shown are for 15 iterations, as it was found that increasing or decreasing this number has no great impact on the resulting accuracy. At each classification, the classifier is trained on 90% of the judged results. The remaining 10% of the final judged data are used for testing to measure the classification accuracy. The accuracy is measured by the F1-score, Precision and Recall ('F1', 'Pr.' and 'Rec.' respectively).
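The feature extraction of Section 3.3, averaging the embedding vectors of a comment's words into one fixed-size vector, can be sketched as follows. How out-of-vocabulary words are handled (skipped, with a zero-vector fallback) is our assumption; the paper does not specify it:

```python
def sentence_vector(comment, emb, dim):
    """Fixed-size SVM feature: the average of the embedding vectors of the
    comment's words. Words missing from the vocabulary are skipped."""
    vecs = [emb[w] for w in comment.split() if w in emb]
    if not vecs:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

Because the average lives in the embedding space, comments using similar vocabulary land near each other, which is what lets a linear SVM separate them.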


3.3.1 Sentiment Analysis
The sentiment classification has been applied in two separate stages. The first stage is the subjectivity classification, aiming to separate the subjective statements containing positive or negative sentiments from objective statements that are neutral, with no sentiments. The second stage is the polarity classification, separating the resulting subjective statements into positive and negative. Table 6 reports the accuracy of each classification using the different embeddings. Each classification result represents the average of 50 runs with the same random training and testing datasets. All those runs, the corpus processing and the embedding trainings are performed on the High Performance Computer (HPC)3 cluster of Bibliotheca Alexandrina.

The results show that the F1-score of the GloVe model underperforms Skip-Gram and CBOW in both the subjectivity and polarity classifications. For dimensions less than 300, it is noticed that the F1-score of the polarity classification is slightly better with the Skip-Gram model, while the F1-score of the subjectivity classification is better with the CBOW model. For dimensions larger than 300, both models have almost the same accuracy, with the best results achieved at dimension 600.

Table 6: Accuracies of polarity and subjectivity tasks

                          CBOW                  Skip-Gram             GloVe
              Dim  Win    F1%   Pr%   Rec%     F1%   Pr%   Rec%     F1%   Pr%   Rec%
Polarity      100  10     91.1  91.0  91.3     91.6  91.7  91.6     91.3  91.4  91.1
              200  10     92.4  92.4  92.4     92.4  92.3  92.5     92.1  92.1  92.1
              300  10     92.8  92.9  92.8     92.9  92.8  93.0     92.6  92.5  92.7
              400  10     93.0  92.9  93.1     93.1  93.2  93.1     92.7  92.8  92.8
              500  10     93.2  93.2  93.2     93.2  93.2  93.2     92.8  92.9  92.6
              600  10     93.1  93.2  93.0     93.3  93.2  93.4     92.6  92.7  92.6
              100  5      91.2  91.2  91.2     91.9  91.7  92.1     90.6  90.5  90.6
              200  5      92.4  92.5  92.4     92.7  92.6  92.7     92.2  92.5  91.9
              300  5      92.9  93.0  92.9     93.1  93.1  93.2     92.3  92.3  92.3
              400  5      93.0  93.1  92.9     93.3  93.0  93.5     92.5  92.4  92.5
              500  5      93.0  93.1  93.0     93.2  93.1  93.3     92.6  92.6  92.7
              600  5      93.3  93.4  93.1     93.2  93.1  93.4     92.6  92.6  92.7
Subjectivity  100  10     86.7  85.9  87.5     86.2  85.2  87.2     85.6  84.4  86.9
              200  10     87.4  86.9  87.9     86.9  86.1  87.7     86.2  85.2  87.2
              300  10     87.5  87.1  87.9     87.3  86.7  87.9     86.4  85.6  87.3
              400  10     87.7  87.1  88.2     87.7  87.2  88.4     86.5  85.6  87.5
              500  10     87.6  87.0  88.2     87.7  87.3  88.1     86.4  85.6  87.3
              600  10     87.6  87.1  88.2     87.8  87.2  88.4     86.7  85.8  87.6
              100  5      86.5  85.6  87.4     86.3  85.4  87.2     85.4  84.0  86.9
              200  5      87.4  86.9  88.0     86.9  86.4  87.4     86.2  85.2  87.2
              300  5      87.5  87.1  87.9     87.6  87.1  88.1     86.1  85.2  87.1
              400  5      87.7  87.1  88.3     87.5  87.0  88.1     86.5  85.5  87.6
              500  5      87.7  87.1  88.3     87.6  87.3  88.0     86.4  85.4  87.4
              600  5      87.7  87.3  88.2     87.7  87.4  88.1     86.6  85.7  87.4

3.3.2 Categorization
In order to extract opinions about the product price, quality and availability, another set of classifications is applied. The main purpose is to assign zero or more categories to each statement. Thus, three supervised binary classifications are applied. Each category classification task has a different judged dataset. In each dataset, a sample is labeled '1' or '0' based on the judges' agreement on assigning a sentiment for that category or not. As before, the different embeddings are used in the feature extraction phase to choose the most appropriate one for each classification.

Table 7 shows the F1-score, Precision and Recall. As shown, increasing the dimension enhances the performance, but only limited improvement is observed beyond dimension 400. For the window sizes, no big difference is noticed in any of the models. Overall, the GloVe model underperforms the two other models. At dimension 600, both the Skip-Gram and CBOW models achieve almost the best F1-score, with a superiority of the Skip-Gram model with window 10 in the price classification.

Table 7: Accuracies of categorization tasks

                          CBOW                  Skip-Gram             GloVe
              Dim  Win    F1%   Pr%   Rec%     F1%   Pr%   Rec%     F1%   Pr%   Rec%
Price         100  10     64.5  82.1  53.2     63.4  82.0  51.8     60.6  82.0  48.2
              200  10     69.4  83.9  59.3     69.0  84.3  58.4     65.6  83.4  54.2
              300  10     70.3  83.2  61.0     71.3  84.1  62.0     67.1  83.8  56.1
              400  10     72.0  84.0  63.1     72.6  84.0  64.1     68.5  84.2  57.8
              500  10     72.7  83.9  64.3     73.5  84.5  65.2     69.2  83.3  59.3
              600  10     73.7  84.1  65.8     74.2  84.1  66.4     70.0  83.2  60.4
              100  5      63.6  81.9  52.1     63.6  82.6  51.8     59.7  81.0  47.4
              200  5      68.3  82.8  58.2     69.4  84.3  59.1     64.2  82.5  52.7
              300  5      70.4  83.8  60.8     71.8  84.3  62.7     66.6  83.3  55.7
              400  5      72.1  83.9  63.3     73.3  84.3  65.0     67.5  83.0  57.1
              500  5      72.3  83.1  64.1     73.3  84.1  65.1     69.5  83.4  59.8
              600  5      73.0  83.3  65.0     74.0  85.0  65.6     69.7  83.2  60.1
Quality       100  10     83.6  86.1  81.3     83.0  85.8  80.3     81.3  84.4  78.5
              200  10     85.4  87.6  83.3     84.9  87.3  82.6     83.1  86.1  80.2
              300  10     86.1  88.2  84.1     85.6  87.9  83.4     83.8  86.6  81.2
              400  10     85.9  87.9  84.1     85.7  87.8  83.8     84.2  86.9  81.7
              500  10     86.1  88.0  84.3     86.0  87.7  84.3     83.9  86.3  81.6
              600  10     86.3  88.1  84.5     86.2  87.7  84.7     84.4  86.6  82.3
              100  5      84.0  86.4  81.7     82.9  85.4  80.5     81.0  84.4  77.8
              200  5      85.3  87.4  83.3     85.2  87.7  82.8     82.6  85.8  79.7
              300  5      85.6  87.8  83.6     85.9  87.9  84.1     82.7  85.7  80.0
              400  5      86.2  88.4  84.1     86.4  88.2  84.7     83.8  86.5  81.2
              500  5      86.3  88.2  84.5     86.3  87.9  84.8     83.7  86.4  81.2
              600  5      86.3  88.0  84.6     86.3  87.7  84.9     84.2  86.6  82.0
Availability  100  10     75.5  84.3  68.4     75.0  84.1  67.9     74.0  85.5  65.4
              200  10     77.0  84.9  70.6     76.4  84.4  69.9     75.3  84.9  67.8
              300  10     77.3  84.4  71.4     76.9  83.9  71.2     76.2  84.5  69.4
              400  10     77.4  83.7  72.1     77.3  83.3  72.2     75.6  83.3  69.3
              500  10     78.1  84.0  73.1     77.8  83.3  73.0     76.2  83.1  70.4
              600  10     77.9  83.6  73.0     77.8  82.3  73.8     77.1  83.8  71.5
              100  5      75.6  85.1  68.2     74.6  83.6  67.5     73.8  84.5  65.6
              200  5      76.5  84.4  70.0     76.7  84.6  70.2     75.2  84.8  67.7
              300  5      77.4  84.2  71.7     76.9  83.7  71.1     75.7  84.3  68.7
              400  5      78.0  84.2  72.8     77.4  83.4  72.3     76.0  84.1  69.4
              500  5      77.6  83.6  72.5     77.4  82.9  72.7     76.6  84.4  70.2
              600  5      78.0  83.6  73.2     77.5  82.7  73.0     77.0  84.0  71.1

The whole set of previous experiments has been repeated, using a different 50 runs, for the SM, SM+ALTW, SM+ARAC and SM+WK Skip-Gram embeddings with 600 dimensions. As shown in Table 8, enlarging the social media corpus with the other MSA corpora did not enhance the F1-score of most of the classifications, except for the Quality classification, where the SM+WK embedding shows 0.2% better F1 accuracy over the SM embedding. From the above results, it has been concluded to use the Skip-Gram model with dimension 600 and window 10, trained from the purely social media corpus.

3 https://hpc.bibalex.org/
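The F1-score, precision and recall reported in Tables 6-8 can be computed from binary predictions as in this minimal sketch. We assume the usual positive-class definitions; the paper does not spell out the averaging:

```python
def f1_pr_rec(gold, pred):
    """Precision, recall and F1 for the positive class of a binary task."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    pr = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rec / (pr + rec) if pr + rec else 0.0
    return f1, pr, rec
```

The large precision/recall gaps in Table 7 (e.g. Price at 84% Pr but 66% Rec) are exactly what this decomposition exposes: the classifier rarely mislabels a comment as price-related, but misses many that are.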


Table 8: Accuracies using social media embeddings

        SM                   SM+ALT               SM+ARAC              SM+WK
        F1%   Pr%   Rec%     F1%   Pr%   Rec%     F1%   Pr%   Rec%     F1%   Pr%   Rec%
Pol.    93.3  93.6  93.0     93.3  93.6  93.0     93.0  93.4  92.7     93.2  93.6  92.9
Subj.   87.6  86.8  88.3     87.4  86.9  87.9     87.6  87.1  88.1     87.5  86.9  88.0
Price   74.0  84.1  66.1     74.0  84.5  65.9     73.7  84.3  65.6     73.7  84.3  65.5
Qual.   86.0  87.7  84.4     86.0  87.7  84.4     85.4  87.1  83.7     86.2  88.0  84.6
Avail.  77.8  82.6  73.6     77.7  82.7  73.3     77.0  82.9  71.9     77.5  82.6  73.1
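The evaluation protocol behind Tables 6-8 averages 50 runs over random 90%/10% train/test splits of the judged data. A minimal harness, with the classifier training and scoring abstracted into a callback (`run_once` is a placeholder for the SVM pipeline, not the paper's code):

```python
import random

def split_90_10(samples, rng):
    """Shuffle a copy of the judged data and split it 90% train / 10% test."""
    items = samples[:]
    rng.shuffle(items)
    cut = int(0.9 * len(items))
    return items[:cut], items[cut:]

def average_over_runs(samples, run_once, runs=50, seed=0):
    """Call run_once(train, test) on a fresh random split per run and
    average the returned score, as in the 50-run protocol."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = split_90_10(samples, rng)
        scores.append(run_once(train, test))
    return sum(scores) / len(scores)
```

Seeding the generator once lets every embedding be compared on the same sequence of 50 splits, which matches the paper's note that results use the same random training and testing datasets.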

4. Case Study Results

The dataset used to generate the results consists of 645 Facebook pages and groups of 'Products Made in Egypt' during the years 2016 and 2017. Those pages have 5,161,247 feeds and comments, of which only 3,409,461 contain Arabic content. Those two years spanned important events in Egypt, as listed in Table 9. The fuel price increases caused an indirect price increase in Egypt. A famous Egyptian TV show called 'Sahebat El-Saada' dedicated two episodes to promoting the initiative of 'Products Made in Egypt' by highlighting specific Egyptian products. The Egyptian central bank's decision to float the Egyptian pound reduced its value by almost 50%. Finally, there was the political decision to stop importing foreign products for three months.

Table 9: Events list

   Date      Event
A  7/2016    Increase in fuel prices
B  9/2016    TV show 'Made in Egypt' episode 1
C  11/2016   Floating the Egyptian pound
D  1/2017    TV show 'Made in Egypt' episode 2
E  1-3/2017  Egyptian decision to stop importing
F  7/2017    Increase in fuel prices

Figure 3 shows the overall sentiments of the whole dataset during the two-year period. As shown, positive and negative sentiments follow the same pattern across time, with positivity almost double negativity (79% neutral, 15% positive and 6% negative). As depicted in Figure 4, people are equally concerned with the three aspects. Quality was a hot topic after the first media show. Availability was of interest when the pound floated. Prices became the main interest during the importing ban.

Figure 3: Number of feeds/comments per sentiment during 2016 and 2017
Figure 4: Number of feeds/comments per category during 2016 and 2017

As per Figure 5, 80% of people are discussing prices without expressing an opinion, while people are more negative about prices than positive. Figure 6 shows the sentiments on quality: 77% of people tend to express positive sentiments towards the quality of Egyptian products. As for availability, as shown in Figure 7, more than 50% of the people feel positively about the availability of Egyptian products. Around 36% of the public discuss availability without expressing an opinion, with only 12% negativity.

Figure 5: Price sentiments (80% neutral, 9% positive, 11% negative)
Figure 6: Quality sentiments (77% positive, 16% negative, 7% neutral)
Figure 7: Availability sentiments (52% positive, 36% neutral, 12% negative)

5. Conclusion

Sentiment analysis in Arabic is still a fertile field of research, especially for social media. In this paper, a platform has been established for sentiment analysis of Arabic social media. The supervised binary classification method is adopted, using neural word embeddings for feature extraction. A case study has been applied, analyzing sentiments towards 'Products Made in Egypt' in terms of their price, quality and availability. From the intrinsic and extrinsic embedding evaluations, it was revealed that enlarging the social media corpus that trains the embedding with another generic MSA corpus does not guarantee enhancing the accuracy of the classifier. In addition, different embedding models are trained with different parameters to be used in the feature extraction phase of the different supervised classifications. The results showed that GloVe underperforms Skip-Gram and CBOW in all classifications. Furthermore, it was revealed that using dimension 600 and window size 5 or 10 with either the CBOW or the Skip-Gram model leads to the best accuracy in all classifications. For the different classifications, the F1-score ranges from 74% to 93%, with a precision ranging from 82% to 93%. In the future, we hope to enlarge the corpus to include Twitter, Instagram and other social media. In addition, we hope to apply other case studies of interest to the platform.

ISBN: 1-60132-481-2, CSREA Press ©


Another random document with
no related content on Scribd:
“Spare no money, no time, no labor,” he said, “but let the criminal
be found. Sir Ronald is too ill, too overwhelmed, to give any orders at
present; but you know what should be done. Do it promptly.”
And Captain Johnstone had at once taken every necessary step.
There was something ghastly in the pretty town of Leeholme, for
there on the walls was the placard, worded:
“MURDER!
“Two hundred pounds will be given to any one bringing
certain information as to a murder committed on Tuesday
morning, June 19th, in the Holme Woods. Apply to
Captain Johnstone, Police Station, Leeholme.”
Gaping rustics read it, and while they felt heartily sorry for the
unhappy lady they longed to know something about it for the sake of
the reward.
But no one called on Captain Johnstone—no one had a word
either of certainty or surmise. The police officers, headed by
intelligent men, made diligent search in the neighborhood of the
pool; but nothing was found. There was no mark of any struggle; the
soft, thick grass gave no sign of heavy footsteps. No weapon could
be found, no trace of blood-stained fingers. It was all a mystery dark
as night, without one gleam of light.
The pool had always been a favorite place with the hapless lady;
and, knowing that, Sir Ronald had ordered a pretty, quaint golden
chair to be placed there for her; and on the very morning when the
event happened Lady Clarice Alden had taken her book and had
gone to the fatal spot to enjoy the beauty of the morning, the
brightness of the sun and the odor of the flowers. The book she had
been reading lay on the ground, where it had evidently fallen from
her hands. But there was no sign of anything wrong; the bluebells
had not even been trampled under foot.
After twenty-four hours’ search the police relinquished the matter.
Captain Johnstone instituted vigorous inquiries as to all the beggars
and tramps who had been in the neighborhood—nothing suspicious
came to light. One man, a traveling hawker, a gaunt, fierce-looking
man, with a forbidding face, had been passing through Holme
Woods, and the police tracked him; but when he was examined he
was so evidently unconscious and ignorant of the whole matter it
would have been folly to detain him.
In the stately mansion of Aldenmere a coroner’s inquest had
been held. Mrs. Glynn declared that it was enough to make the
family portraits turn on the wall—enough to bring the dead to life.
Such a desecration as that had never occurred before. But the
coroner was very grave. Such a murder, he said, was a terrible thing;
the youth, beauty and position of the lady made it doubly horrible. He
showed the jury how intentional the murder must have been—it was
no deed done in hot haste. Whoever had crept with stealthy steps to
the lady’s side, whoever had placed his hand underneath the white
lace mantle which she wore, and with desperate, steady aim stabbed
her to the heart, had done it purposely and had meditated over it.
The jury saw that the white lace mantle must either have been raised
or a hand stealthily crept beneath it, for the cut that pierced the
bodice of the dress was not in the mantle.
He saw the red puncture on the white skin. One of the jury was a
man who had traveled far and wide.
“It was with no English weapon this was done,” he said. “I
remember a case very similar when I was staying in Sicily; a man
there was killed, and there was no other wound on his body save a
small red circle like this; afterward I saw the very weapon that he had
been slain with.”
“What was it like?” asked the coroner eagerly.
“A long, thin, very sharp instrument, a species of Sicilian dagger. I
heard that years ago ladies used to wear them suspended from the
waist as a kind of ornament. I should not like to be too certain, but it
seems to me this wound has been caused by the same kind of
weapon.”
By the coroner’s advice the suggestion was not made public.
The verdict returned was one the public had anticipated: “Willful
murder against some person or persons unknown.”
Then the inquest was over, and nothing remained but to bury
Lady Clarice Alden. Dr. Mayne, however, had not come to the end of
his resources yet.
“The local police have failed,” he said to Sir Ronald; “we will send
to Scotland Yard at once.”
And Sir Ronald bade him do whatever in the interests of justice
he considered best.
In answer to his application came Sergeant Hewson, who was
generally considered the shrewdest and cleverest man in England.
“If Sergeant Hewson gives a thing up, no one else can succeed,”
was a remark of general use in the profession. He seemed to have
an instinctive method of finding out that which completely baffled
others.
“The mystery will soon be solved now,” said Dr. Mayne;
“Sergeant Hewson will not be long in suspense.”
The sergeant made his home at Aldenmere; he wished to be
always on the spot.
“The murder must have been done either by some one in the
house or some one out of it,” he said; “let us try the inside first.”
So he watched and waited; he talked to the servants, who
considered him “a most affable gent;” he listened to them; he
examined everything belonging to them—in vain.
Lady Clarice Alden had been beloved and admired by her
servants.
“She was very high, poor thing!—high and proud, but as
generous and kind a lady as ever lived. So beautiful, too, with a
queer sort of way with her! She never spoke an unkind word to any
of us in her life.”
He heard nothing but praises of her. Decidedly, in all that large
household Lady Clarice had no enemy. He inquired all about her
friends, and he left no stone unturned; but, for once in his life,
Sergeant Hewson was baffled, and the fact did not please him.
CHAPTER IV.
KENELM EYRLE.

It was the night before the funeral, and Sir Ronald sat in his study
alone. His servants spoke of him in lowered voices, for since the
terrible day of the murder the master of Aldenmere had hardly tasted
food. More than once he had rung the bell, and, when it was
answered, with white lips and stone-cold face, he had asked for a
tumbler of brandy.
It was past ten o’clock now, and the silent gloom seemed to
gather in intensity, when suddenly there came a fierce ring at the hall
door, so fierce, so imperative, so vehement that one and all the
frightened servants sprang up, and the old housekeeper, with folded
hands, prayed, “Lord have mercy on us!”
Two of the men went, wondering who it was, and what was
wanted.
“Not a very decent way to ring, with one lying dead in the house,”
said one to the other; but, even before they reached the hall door, it
was repeated more imperatively than before.
They opened it quickly. There stood a gentleman who had
evidently ridden hard, for his horse was covered with foam; he had
dismounted in order to ring.
“Is this horrible, accursed story true?” he asked, in a loud, ringing
voice. “Is Lady Alden dead?”
“It is quite true, sir,” replied one of the men, quick to recognize the
true aristocrat.
“Where is Sir Ronald?” he asked, quickly.
“He cannot see any one.”
“Nonsense!” interrupted the stranger, “he must see me; I insist
upon seeing him. Take my card and tell him I am waiting. You send a
groom to attend to my horse; I have ridden hard.”
Both obeyed him, and the gentleman sat down in the entrance
hall while the card was taken to Sir Ronald. The servant rapped
many times, but no answer came; at length he opened the door.
There sat Sir Ronald, just as he had done the night before—his head
bent, his eyes closed, his face bearing most terrible marks of
suffering.
The man went up to him gently.
“Sir Ronald,” he asked, “will you pardon me? The gentleman who
brought this card insists upon seeing you, and will not leave the
house until he has done so. I would not have intruded, Sir Ronald,
but we thought perhaps it might be important.”
Sir Ronald took the card and looked at the name. As he did so a
red flush covered his pale face, and his lips trembled.
“I will see him,” he said, in a faint, hoarse voice.
“May I bring you some wine or brandy, Sir Ronald?” asked the
man.
“No, nothing. Ask Mr. Eyrle to come here.”
He stood quite still until the stranger entered the room; then he
raised his haggard face, and the two men looked at each other.
“You have suffered,” said Kenelm Eyrle; “I can see that. I never
thought to meet you thus, Sir Ronald.”
“No,” said the faint voice.
“We both loved her. You won her, and she sent me away. But, by
heaven! if she had been mine, I would have taken better care of her
than you have done.”
“I did not fail in care or kindness,” was the meek reply.
“Perhaps I am harsh,” he said, more gently. “You look very ill, Sir
Ronald; forgive me if I am abrupt; my heart is broken with this terrible
story.”
“Do you think it is less terrible for me?” said Sir Ronald, with a
sick shudder. “Do you understand how awful even the word murder
is?”
“Yes; it is because I understand so well that I am here. Ronald,”
he added, “there has been ill feeling between us since you won the
prize I would have died for. We were like brothers when we were
boys; even now, if you were prosperous and happy, as I have seen
you in my dreams, I would shun, avoid and hate you, if I could.”
His voice grew sweet and musical with the deep feelings stirred
in his heart.
“Now that you are in trouble that few men know; now that the
bitterest blow the hand of fate can give has fallen on you, let me be
your true friend, comrade and brother again.”
He held out his hand and clasped the cold, unyielding one of his
friend.
“I will help you as far as one man can help another, Ronald. We
will bury the old feud and forget everything except that we have a
wrong to avenge, a crime to punish, a murderer to bring to justice!”
“You are very good to me, Kenelm,” said the broken voice; “you
see that I have hardly any strength or energy.”
“I have plenty,” said Kenelm Eyrle, “and it shall be used for one
purpose. Ronald, will you let me see her? She is to be buried to-
morrow—the fairest face the sun ever shone on will be taken away
forever. Let me see her; do not refuse me. For the memory of the
boy’s love so strong between us once—for the memory of the man’s
love and the man’s sorrow that has laid my life bare and waste, let
me see her, Ronald?”
“I will go with you,” said Sir Ronald Alden; and, for the first time
since the tragedy in its full horror had been known to him, Sir Ronald
left the library and went to the room where his dead wife lay.
CHAPTER V.
WHICH LOVED HER BEST?

They went through the silent house without another word,


through the long corridors so lately gay with the sound of laughing
voices and the lustre of perfumed silken gowns. The gloom seemed
to deepen, the very lights that should have lessened it looked
ghastly.
They came to the door of my lady’s room, and there for one-half
minute Sir Ronald paused. It was as though he feared to open it.
Then he made an effort. Kenelm saw him straighten his tall figure
and raise his head as though to defy fear. With reverent touch he
turned the handle and they entered the room together. Loving hands
had been busy there; it was hung round with black velvet and lighted
with innumerable wax tapers. She had loved flowers so well in life
that in death they had gathered them round her. Vases of great,
luscious white roses; clusters of the sad passion flower; masses of
carnations—all mixed with green leaves and hawthorn branches.
In the midst of the room stood the stately bedstead, with its black
velvet hangings. Death lost its gloom there, for the quiet figure
stretched upon it was as beautiful as though sculptured from purest
marble; it was the very beauty and majesty of death without its
horror.
The white hands were folded and laid on the heart that was never
more to suffer either pleasure or pain. Fragrant roses were laid on
her breast, lilies and myrtle at her feet.
But Kenelm noted none of these details—he went up to her
hurriedly, as though she had been living, and knelt down by her side.
He was strong and proud, undemonstrative as are most English
gentlemen, but all this deserted him now. He laid his head down on
the folded hands and wept aloud.
“My darling! my lost, dear love, so young to die! If I could but
have given my life for you!” His hot tears fell on the marble breast.
Sir Ronald stood with folded arms, watching him, thinking to himself:
“He loved her best of all—he loved her best!”
For some minutes the deep silence was unbroken save by the
deep-drawn, bitter sobs of the unhappy man kneeling there. When
the violence of his weeping was exhausted he rose and bent over
her.
“She is beautiful in death as she was in life,” he said. “Oh,
Clarice, my darling! If I were but lying there in your place. Do you
know, Ronald, how and where I saw her last?”
The haggard, silent face was raised in its despairing quiet to him.
“It was three weeks before her wedding day, and I was mad with
wounded love and sorrow. I went over to Mount Severn—not to talk
to her, Ronald, not to try to induce her to break her faith—only to
look at her and bear away with me the memory of her sweet face
forever and forever. It is only two years last June. I walked through
the grounds, and she was sitting in the center of a group of young
girls, her bridesmaids who were to be, her fair hair catching the
sunbeams, her lovely face brighter than the morning, the love-light in
her eyes; and she was talking of you, Ronald, every word full of
music, yet every word pierced my heart with hot pain. I did not go to
speak to her, but I stood for an hour watching her face, impressing
its glorious young beauty on my mind. I said to myself that I bade her
farewell, and the thought came to my mind, ‘How will she look when I
see her again?’”
Then he seemed to forget Sir Ronald was present, and he bent
again over the beautiful face.
“If you could only look at me once, only unclose those white lips
and speak to me, who loves you as I do, my lost darling.”
He took one of the roses from the folded hands and kissed it
passionately as he had kissed her lips.
“You cannot hear me, Clarice,” at last he murmured, “at least with
mortal ears; you cannot see me; but listen, my darling, I loved you
better than I loved my life; I kiss your dead lips, sweet, and I swear
that I will never kiss another woman. You are gone now where all
secrets are known; you know now how I loved you; and when I go to
the eternal land you will meet me. No love shall replace you. I will be
true to you, dead, as I was while you were living. Do you hear me,
Clarice?”
All the time he poured out this passionate torrent of words Sir
Ronald stood with bowed head and folded arms.
“I kiss those white lips again, love, and on them I swear to know
no rest, no pleasure, no repose until I have brought the man who
murdered you to answer for his crime; I swear to devote all the talent
and wealth God has given me to that purpose; I will give my days
and nights—my thoughts, time, energies—all for it; and when I have
avenged you I will come and kneel down by your grave and tell you
so.”
Then he looked up at Sir Ronald.
“What are you going to do?” he asked. “What steps shall you
take?”
“Everything possible has been done. I know no more that I can
do.”
Kenelm Eyrle looked up at him.
“Do you mean to sleep, to eat, to rest, while the man who did that
dastardly deed lives?”
His eyes flashed fire.
“I shall do my best,” Sir Ronald said, with a heavy groan. “God
help us all. It has been a dreadful mistake, Kenelm. You loved her
best.”
“She did not think so then, but she knows now. I will live to
avenge her. I ask from Heaven no greater favor than that I may bring
the murderer to justice. I shall do it, Ronald; a certain instinct tells me
so. When I do, I shall show him no mercy; he showed none to her. If
the mother who bore him knelt at my feet and asked me to have pity
on him, I would not. If the child who calls him father clung round my
neck and prayed me with tears and asked for mercy, I would show
none.”
“Nor would I,” said Sir Ronald. Then Kenelm Eyrle bent down
over the dead body.
“Good-by, my love,” he said, “until eternity; good-by.”
With reverent hands he drew the white lace round her, and left
her to the deep, dreamless repose that was never more to be
broken.
He went downstairs with Sir Ronald, but he did not enter the
library again.
“I am going home,” he said. “I shall not intrude any longer,
Ronald.”
“You will come to-morrow?” said Sir Ronald, as Kenelm stood at
the hall door.
“Yes, I will pay her that mark of respect,” he said, “and I will live to
avenge her.”
So they parted, and Sir Ronald, going back to the old seat in the
library, remained there until morning dawned.
CHAPTER VI.
KENELM EYRLE’S VOW.

In the picturesque and beautiful country of Loamshire they still tell
of the funeral, the extraordinary crowd of people assembled to pay
the last mark of homage to Lady Clarice Alden.
Perhaps most pity of all was given to the hapless lady’s mother,
Mrs. Severn, a handsome, stately, white-haired old lady, little
accustomed to demonstration of any kind. She had apologized for
her excessive grief by saying to every one:
“She was my only child, you know, and I loved her so dearly—my
only one.”
The long ceremony was over at last and the mourners returned to
Aldenmere.
The morning afterward the blinds were drawn up. Once more the
blessed sunlight filled the rooms with light and warmth; once more
the servants spoke in their natural voices and the younger ones
became more anxious as to whether their new mourning was
becoming or not; but the master of the house was not sensible to
anything—the terrible tragedy had done its worst; Sir Ronald Alden
of Aldenmere lay in the clutches of fierce fever, battling for life.
The sympathy of the whole neighborhood was aroused. The
murder had been bad enough; but that it should also cause Sir
Ronald’s death was too terrible to contemplate.
Mrs. Severn remained to nurse her son-in-law; but after a time
his illness became too dangerous, and the doctors sent for two
trained nurses who could give the needful care to the sick man.
It was a close and terrible fight. Sir Ronald had naturally a strong
and magnificent constitution; it seemed as though he fought inch by
inch for his life. He was delirious, but it hardly seemed like the
ordinary delirium of fever; it was one long, incessant muttering, no
one could tell what, and just when the doctors were beginning to
despair and the nurses to grow weary of what seemed an almost
helpless task, Kenelm Eyrle came to the rescue. He took up his
abode at Aldenmere and devoted himself to Sir Ronald. His strength
and patience were both great; he was possessed of such intense
vitality himself, and such power of will, that he soon established a
marvelous influence over the patient.
For some days the contest seemed even—life and death were
equally balanced—Sir Ronald was weak as a feeble infant, but the
terrible brain fever was conquered, and the doctors gave a slight
hope of his recovery. Then it was that Kenelm’s help was invaluable;
his strong arm guided the feeble steps, his cheerful words roused
him, his strong will influenced him, and that Sir Ronald did recover,
after God, was owing to his friend.
When he was well enough to think of moving about, the doctors
strongly advised him to go away from the scene of the fatal tragedy.
“Take your friend to some cheerful place, Mr. Eyrle,” they said,
“where he can forget that his beautiful young wife was cruelly
murdered; whether he mentions the matter or not, it is now always in
his thoughts, his mind dwells on it constantly; take him anywhere
where it will cease to haunt him.”
Kenelm was quite willing.
“I must defer the great business of my life,” he said, “until Ronald
is himself again; then if the murderer be still on earth I will find him.
Thou hearest me, oh, my God—justice shall be done!”
Though outwardly he was cheerful and bright, seemingly
devoting all his energies to his friend, yet the one idea was fixed in
his mind as are the stars in heaven.
He had already spoken many times to Sergeant Hewson on the
subject, he had told him that he never intended to rest from his
labors until he found out who had done the deed.
“You will never rest, then, sir, while you live,” said the sergeant,
bluntly; “for I do not believe that it will ever be found out. I have had
to do with many queer cases in my life, but this, I am willing to own,
beats them all. I can see no light in it.”
“It will come to light sometime,” said Kenelm.
“Then it will be the work of God, Mr. Eyrle, and not of man,” was
the quiet rejoinder.
“What makes you despair about it?” asked Kenelm.
“There are features in this case different to any other. In most
crimes, especially of murder, there is a motive; I can see none in
this. There is revenge, greed, gain, robbery, baffled love, there is
always a ground for the crime.”
“There is none here?” interrupted Kenelm.
“No, sir, none; the poor lady was not robbed, therefore the motive
of greed, gain or dishonesty is not present. No one living gains
anything by her death, therefore no one could have any interest in
bringing it about. She is the only daughter of a mother who will never
get over her loss; the wife of a husband who is even now at death’s
door for her sake. Who could possibly desire her death? She never
appears to have made an enemy; her servants and dependents all
say of her that she was proud, but generous and lavish as a queen.”
“It is true,” said Kenelm Eyrle.
“I have known strange cases in my life,” continued Sergeant
Hewson, warming with his subject. “Strange and terrible. I have
known murder committed by ladies whom the world considers good
as they are fair——”
“Ladies!” interrupted Kenelm. “Ah! do not tell me that. Surely the
gentle hand of woman was never red in a crime so deep as that.”
Sergeant Hewson smiled as one who knows the secret of many
hearts.
“A woman, sir, when she is bad, is far worse than a man; when
they are good they are something akin to the angels; but there is no
woman in this case. I have looked far ahead. I am sure of it; there
was no rival with hot hate in her heart, no woman deceived and
abandoned for this lady’s sake, to take foul vengeance. I confess
myself baffled, for I can find no motive.”
Kenelm Eyrle looked perplexed.
“Nor, to tell you the truth, can I.”
“Do you think it possible that any tramp or beggar going through
the wood did it, and was disturbed before he had time to rob her?”
“No, I do not. However her death came to her, it was suddenly,
for she died, you know, with a smile on her lips. I have examined the
locality well, and in my opinion Lady Alden sat reading, never
thinking of coming harm, and the murderer stole up behind her and
did his deadly work before she ever knew that any one was near.
There was no horror of fright for her.”
“You heard what was said at the time of the inquest about the
weapon?”
“Yes; that is the clue. If ever the secret comes to light we shall
hear of that weapon again.”
“Then do you intend to give up the search?” asked Kenelm.
“I think so—if there was the least chance of success I should go
on with it—as it is, it is hopeless. I am simply living here in idleness,
taking Sir Ronald’s money and doing nothing for it. I have other and
more important work in hand.”
“Well,” said Mr. Eyrle, “if all the world gives it up I never shall.
What have you done toward it?”
“I have mastered every detail of the lady’s life. I know all her
friends. I have visited wherever she visited. I have exerted all the
capability and energy that I am possessed of, yet I have not
discovered one single circumstance that throws the least light on her
death.”
So Mr. Eyrle was forced to see the cleverest detective in England
leave the place without having been able to give the least
assistance.
“I will unravel it,” he said; “even were the mystery twenty times as
great. I will fathom it. But first I will devote myself to Ronald.”
It was August when they left Aldenmere. Sir Ronald would not go
abroad.
“I could not bear the sound of voices or the sight of faces,” he
said, appealingly. “If I am to have change, let us go to some quiet
Scotch village, where no one has ever heard my ill-fated name. If
recovery be possible it must be away from all these inquiries and
constant annoyance of visitors.”
Mr. Eyrle understood the frame of mind that made his friend
shrink from all observation.
“I must manage by degrees,” he thought. “First of all, he shall
have solitude and isolation, then cheerful society until he is himself
again—all for your sake, my lost love, my dear, dead darling—all
because he is the man you loved, and to whom you gave your
loving, innocent heart.”
When Kenelm Eyrle left Aldenmere, at the bottom of his traveling
trunk there was a small box containing the white rose he had taken
from Lady Alden’s dead hand.
CHAPTER VII.
THE RIVAL BEAUTIES.

The neighborhood of Leeholme was essentially an aristocratic
one; in fact, Leeholme calls itself a patrician country, and prides itself
on its freedom from all manufacturing towns. It is essentially devoted to
agriculture, and has rich pasture lands, fertile meadows and luxuriant
gardens.
The Aldens of Aldenmere were, perhaps, the oldest family of any.
Aldenmere was a magnificent estate; the grounds were more
extensive and beautiful than any other in the country. Nature had
done her utmost for them; art had not been neglected. The name
was derived from a large sheet of water formed by the river Lee—a
clear, broad, deep mere, always cool, shaded by large trees, with
water lilies lying on its bosom. The great beauty of the place was the
mere.
Holme Woods belonged to the estate; they bordered on the
pretty, picturesque village of Holme—the whole of which belonged to
the lords of Alden—quaint homesteads, fertile farms and broad
meadows, well-watered, surrounded the village. Not more than five
miles away was the stately and picturesque mansion of Mount
Severn, built on the summit of a green, sloping hill. Its late owner,
Charles Severn, Esq., had been one of the most eminent statesmen
who of late years had left a mark upon the times. He had served his
country well and faithfully; he had left a name honored by all who
knew it; he had done good in his generation, and when he died all
Europe lamented a truly great and famous man.
He had left only one daughter, Clarice Severn, afterward Lady
Alden, whose tragical death filled the whole country with gloom. His
widow, Mrs. Severn, had been a lady of great energy and activity;
but her life had been a very arduous one. She had shared in all her
husband’s political enterprises. She had shared his pains and his
joys. She had labored with her whole soul; and now that he was
dead she suffered from the reaction. Her only wish and desire was
for quiet and repose; the whole love of her life was centered on her
beautiful daughter.
Clarice Severn was but sixteen when her father died. His estate
was entailed, and at his widow’s death was to pass into possession
of his heir-at-law. But the gifted statesman had not neglected his only
child. He had saved a large fortune for her, and Clarice Severn was
known as a wealthy heiress.
She was also the belle and beauty par excellence of the country.
At all balls and fêtes she was queen. Her brilliant face, lighted by
smiles, her winning, haughty grace, drew all eyes and attracted all
attention. Wherever she was she reigned paramount. Other women,
even if more beautiful, paled into insignificance by her side.
She was very generous, giving with open, lavish hands. Proud in
so far as she had a very just appreciation of her own beauty, wealth
and importance. She was at times haughty to her equals, but to her
inferiors she was ever gentle and considerate, a quality which
afterward, when she came to reign at Aldenmere, made her beloved
and worshiped by all her servants.
She had faults, but the nature of the woman was essentially
noble. What those faults were and what they did for her will be seen
during the course of our story.
Mount Severn, even after the death of its accomplished master,
was a favorite place of resort. Mrs. Severn did not enjoy much of the
quiet she longed for. She would look at her daughter sometimes with
a smile, and say:
“It will always be the same until you are married, Clarice; then
people will visit you instead of me.”
Little, when she dreamed of the brilliant future awaiting that
beautiful and beloved child, did she dream of the tragedy that was to
cut that young life so terribly short.
Leeholme Park was the family seat of the Earl of Lorriston, a
quiet, easy, happy, prosperous gentleman, who had never known a
trouble or shadow of care in his whole life.
“People talk of trouble,” he was accustomed to say; “but I really
think half of it is their own making; of course there must be sickness
and death, but the world is a bright place in spite of that.”
He was married to the woman he loved; he had a son to succeed
him; his estates were large; his fortune vast; he had a young
daughter, who made the sunshine and light of his home. What had
he to trouble him? He had never known any kind of want, privation,
care or trouble; he had never suffered pain or heartache. No wonder
he looked around on those nearest and dearest, on his elegant
home, his attached friends, and wondered with a smile how people
could think the world dull or life dreary. Yet on this kindly, simple,
happy man a terrible blow was to fall.
I do not know who could properly describe Lady Hermione
Lorriston, the real heroine of our story. It seems to me easier to paint
the golden dawn of a summer morning, the transparent beauty of a
dewdrop, to put to music the song of the wind or the carol of a bird,
or the deep, solemn anthem of the waves, as to describe a character
that was full of light and shade, tender as a loving woman, playful as
a child, spiritual, poetical, romantic, a perfect queen of the fairies,
whose soul was steeped in poetry as flowers are in dew.
By no means a perfect woman, though endowed with woman’s
sweetest virtues; she was inclined to be willful, with a delicious grace
that no one could resist. She liked to have her own way, and
generally managed it in the end. She delighted rather too much in
this will of her own. She owned to herself, with meek, pretty
contrition, that she was often inclined to be passionate, that she was
impatient of control, too much inclined to speak her mind with a
certain freedom that was not always prudent.
Yet the worst of Lady Hermione’s faults was that they compelled
you to love her, and even to love them, they were so full of charms.
When she was quite a little child Lord Lorriston was accustomed to
say that the prettiest sight in all the world was Hermione in a
passion.
She was completely spoiled by her father, but, fortunately, Lady
Lorriston was gifted with some degree of common sense, and
exerted a wholesome control over the pet of the household.
The earl’s son and heir, Clement Dane Lorriston, was at college,
and Lady Hermione, having no sister of her own, was warmly
attached to Clarice Severn.
There were several other families—the Thrings of Thurston, the
Gordons of Leyton, and, as may be imagined, with so many young
people, there was no inconsiderable amount of love-making and
marriage.
Sir Ronald Alden was, without exception, the most popular man
in the neighborhood. The late Lord of Aldenmere had never married;
to save himself all trouble he adopted his nephew, Ronald, and
brought him up as his heir; so that when his time came to reign he
was among those with whom he had lived all his life.
He was very handsome, this young lord of Alden. The Alden
faces were all very much the same; they had a certain weary, half-
contemptuous look; but when they softened with tenderness or
brightened with smiles, they were simply beautiful and irresistible.
They were of the high-bred, patrician type—the style of face that
has come down to us from the cavaliers and crusaders of old. The
only way in which Sir Ronald differed from his ancestors was that he
had a mouth like one of the old Greek gods—it would of itself have
made a woman almost divinely lovely—it made him irresistible. Very
seldom does one see anything like it in real life. A smile from it would
have melted the coldest heart—a harsh word have pierced the heart
of one who loved him.
He had something of the spirit that distinguished the crusaders;
he was brave even to recklessness—he never studied danger; he
was proud, stubborn, passionate. A family failing of the Aldens was a
sudden impulse of anger that often led them to words they repented
of.
So that he was by no means perfect, this young lord of Alden; but
it is to be imagined that many people liked him all the better for that.
