
COMMUNICATIONS OF THE ACM
CACM.ACM.ORG | 07/2015 | VOL. 58 | NO. 07

Exascale Computing and Big Data

The New Smart Cities

Respecting People and Respecting Privacy

Unifying Logic and Probability

Password-Based Authentication

Association for Computing Machinery
KDD 2015
21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
10–13 August 2015, Hilton, Sydney
http://www.kdd.org/kdd2015/

KDD 2015 is a premier conference that brings together researchers and practitioners from data mining, knowledge discovery, data analytics, and big data. KDD 2015 will be the first Australian edition of KDD and the conference's second time in the Asia Pacific region.

4 Keynotes
8 Invited Industry Talks
14 Workshops
12 Tutorials
Industry and Government Track
Research Track
Big Data Summit (collocated)
Regional Forums
Mentoring Program
27 Exhibition Booths

For enquiries, contact Prof. Longbing Cao at general2015@kdd.org.


ACM's Career & Job Center

Are you looking for your next IT job? Do you need career advice?

The ACM Career & Job Center offers ACM members a host of career-enhancing benefits:

- A highly targeted focus on job opportunities in the computing industry
- Access to hundreds of industry job postings
- Resume posting keeping you connected to the employment market while letting you maintain full control over your confidential information
- Job Alert system that notifies you of new opportunities matching your criteria
- Career coaching and guidance available from trained experts dedicated to your success
- Free access to a content library of the best career articles compiled from hundreds of sources, and much more!

Visit ACM's Career & Job Center at: http://jobs.acm.org

The ACM Career & Job Center is the perfect place to begin searching for your next employment opportunity!

Visit today at http://jobs.acm.org


COMMUNICATIONS OF THE ACM

Departments

4 From the ACM President
A New Chief Executive Officer and Executive Director of ACM
By Alexander L. Wolf

7 Cerf's Up
Milestones
By Vinton G. Cerf

8 Letters to the Editor
Quality vs. Quantity in Faculty Publications

12 BLOG@CACM
The Dangers of Military Robots, the Risks of Online Voting
John Arquilla considers the evolution of defense drones, and why Duncan A. Buell thinks we are not ready for e-voting.

37 Calendar

109 Careers

Last Byte

112 Future Tense
Toy Box Earth
What a young AI learned following Alice through the looking glass…
By Brian Clegg

News

14 Growing Pains for Deep Learning
Neural networks, which support online image search and speech recognition, eventually will drive more advanced services.
By Chris Edwards

17 Bringing Big Data to the Big Tent
Open source tools assist data science.
By Gregory Goth

20 The New Smart Cities
How urban information systems are slowly revamping the modern metropolis.
By Gregory Mone

22 ACM Announces 2014 Award Recipients
Recognizing excellence in technical and professional achievements and contributions in computer science and information technology.
By Lawrence M. Fisher

Viewpoints

24 Legally Speaking
Anti-Circumvention Rules Limit Reverse Engineering
Considering some of the requested exceptions to technical protection mechanisms.
By Pamela Samuelson

27 Computing Ethics
Respecting People and Respecting Privacy
Minimizing data collection to protect user privacy and increase security.
By L. Jean Camp

29 Historical Reflections
Preserving the Digital Record of Computing History
Reflecting on the complexities associated with maintaining rapidly changing information technology.
By David Anderson

32 The Business of Software
An Updated Software Almanac
Research into what makes software projects succeed.
By Phillip G. Armour

35 Broadening Participation
African Americans in the U.S. Computing Sciences Workforce
An exploration of the education-to-work pipeline.
By Juan E. Gilbert, Jerlando F.L. Jackson, Edward C. Dillon, Jr., and LaVar J. Charleston

39 Viewpoint
The Future of Computer Science and Engineering Is in Your Hands
How government service can profoundly influence computer science research and education.
By Vijay Kumar and Thomas A. Kalil

Association for Computing Machinery
Advancing Computing as a Science & Profession

2 COMMUNICATIONS OF THE ACM | JULY 2015 | VOL. 58 | NO. 7


07/2015 | VOL. 58 | NO. 07

Practice

42 Low-Latency Distributed Applications in Finance
The finance industry has unique demands for low-latency distributed systems.
By Andrew Brook

51 Using Free and Open Source Tools to Manage Software Quality
An agile process implementation.
By Phelim Dowling and Kevin McGrath

Articles' development led by queue.acm.org

Contributed Articles

56 Exascale Computing and Big Data
Scientific discovery and engineering innovation requires unifying traditionally separated high-performance computing and big data analytics.
By Daniel A. Reed and Jack Dongarra
Watch the authors discuss their work in this exclusive Communications video: http://cacm.acm.org/videos/exascale-computing-and-big-data

69 Using Rhetorical Structure in Sentiment Analysis
A deep, fine-grain analysis of rhetorical structure highlights crucial sentiment-carrying text segments.
By Alexander Hogenboom, Flavius Frasincar, Franciska de Jong, and Uzay Kaymak

78 Passwords and the Evolution of Imperfect Authentication
Theory on passwords has lagged practice, where large providers use back-end smarts to survive with imperfect technology.
By Joseph Bonneau, Cormac Herley, Paul C. van Oorschot, and Frank Stajano

Review Articles

88 Unifying Logic and Probability
Open-universe probability models show merit in unifying efforts.
By Stuart Russell
Watch the author discuss their work in this exclusive Communications video: http://cacm.acm.org/videos/unifying-logic-and-probability

Research Highlights

100 Technical Perspective
The Simplicity of Cache Efficient Functional Algorithms
By William D. Clinger

101 Cache Efficient Functional Algorithms
By Guy E. Blelloch and Robert Harper

About the Cover: Daniel Reed and Jack Dongarra explore the myriad of great opportunities and greater challenges that face the future of exascale computing and big data analysis (p. 56). Cover illustration by Peter Bollinger.


from the acm president

DOI:10.1145/2788395 Alexander L. Wolf

A New Chief Executive Officer and Executive Director of ACM

I SOMETIMES REFLECT on what makes ACM so special, so unique as a professional society. I can easily come up with several good reasons, including the extent to which responsibility, authority, and budget control are devolved to our 37 special interest groups or the strong support we provide to our fellow volunteers in building a broad range of community service activities (the "good works" of ACM). But alongside or, perhaps, underlying these very visible special and unique features is the largely invisible special and unique relationship that exists between volunteer leaders and ACM headquarters staff. This is a relationship characterized by mutual respect and trust, and a deep understanding of each other's motivations, aspirations, and priorities. Not that there are no conflicts and sometimes even some tensions, but as in any healthy relationship, it is the manner in which you approach and overcome problems and challenges that matters, not the fact they may exist.

What is the secret to ACM's incredible success in building its volunteer/staff relationship? From where did it come and how do we ensure it continues? This is something that has been literally decades in the making (starting well before my time as an ACM volunteer) and something that as President I feel a great responsibility to preserve and nurture.

From my point of view, a watershed moment in the building of that relationship occurred in November 1998, when John White—holder of a Ph.D. in computer science, a former university professor, an industrial research lab director, and a past president of ACM—transitioned from his role as a long-standing volunteer to that of chief executive officer (the first person to hold that title) and executive director (the traditional title of someone who heads the staff of a non-profit organization). For the first time, the most senior member of staff would be someone from our technical community. What John did with that opportunity is, simply put, legendary. While previous ACM EDs contributed fundamentally and substantially to our success, John brought a different kind of experience and perspective to the position, one that in the context of computing's explosive growth in importance and ubiquity (John took up his position in the same year that Microsoft Windows 98 was released and Google was founded) could not have been more timely for our profession. Together with ACM COO Pat Ryan and the rest of the headquarters staff, John worked tirelessly with ACM volunteers to guide our association on a path toward its standing today as the world's leading international professional society of computing engineers, scientists, educators, and students.

With John's impending retirement after nearly 17 years, it fell to me (much to the relief of some recent past presidents!) to lead the process of finding a CEO who could ably sustain and even grow ACM's success, but more fundamentally perhaps, continue the excellent working relationship between volunteers and staff. From the outset we aimed, once again, to attract someone from our community: someone with the right combination of experience leading a major organization, a history of deep knowledge of ACM, and an acknowledged technical standing in computing. I called upon a diverse set of highly respected computing professionals to serve on a search committee: Vicki Hanson (chair), Muffy Calder, Vint Cerf, Stu Feldman, Mary Jane Irwin, Matthias Kaiserswerth, and Peter Lee. I asked them to come back with a recommendation to the ACM Executive Committee, the leadership group charged in our bylaws with making the appointment decision. After receiving a remarkable number of truly outstanding applications and carrying out a thorough review and interview process, the committee forwarded a name that was unanimously accepted by the ACM EC, with the unanimous concurrence of the full ACM Council.

I am extremely pleased to welcome Robert (Bobby) Schnabel as the new CEO and ED of ACM. Bobby will take up his position on November 1, and, in the interim, I have named Pat Ryan acting ED. Bobby is without question an ideal person to serve in this role going forward. He comes to us from Indiana University, where he is Professor and Dean in the School of Informatics and Computing, leading the school through a remarkable period of growth in size, stature, and funding. Prior to this he spent 30 years at the University of Colorado at Boulder after receiving a Ph.D. in computer science from Cornell University. Among many other senior duties and positions in Boulder, Bobby founded and directed the Alliance for Technology, Learning and Society (ATLAS) Institute and served for nine years as campus CIO. Bobby's technical background is in numerical computation, having published over 100 scholarly articles in the area, earning him advancement to Fellow of both SIAM and the ACM.

Bobby's contributions to the computing community in general and ACM in particular have been nothing short of stellar. Early on he served as chair of an ACM special interest group (SIGNUM) and later editor-in-chief of the highly respected SIAM Review. He will step down as founding chair of the ACM Education Policy Committee, which under his leadership has played a pivotal role in convincing policymakers of the importance of an early and substantial computing education. Bobby was a co-founder of the National Center for Women & Information Technology (NCWIT) and was a board member of the Computing Research Association (CRA). Bobby is currently a board member of code.org, serves as chair of the advisory board of the Alliance for Hispanic Serving Institutions, and is a member of the U.S. National Science Foundation Computing and Information Science and Engineering advisory committee. These are but some of the many important activities that have seen him take a leadership role in our community, building a broad and secure foundation for his challenging new duties at ACM.

Although I am sad to see John leave, I am extremely excited to have Bobby take up the role at ACM. Bobby brings with him a special passion and bright new ideas that will help continue the great tradition of leadership, innovation, openness, and growth that have marked the history of ACM.

Alexander L. Wolf is president of ACM and a professor in the Department of Computing at Imperial College London, U.K.

Copyright held by author.

Bobby Schnabel assumes the role of ACM's CEO and Executive Director on Nov. 1, 2015. Photograph by David Katzive.


COMMUNICATIONS OF THE ACM
Trusted insights for computing’s leading professionals.

Communications of the ACM is the leading monthly print and online magazine for the computing and information technology fields.
Communications is recognized as the most trusted and knowledgeable source of industry information for today’s computing professional.
Communications brings its readership in-depth coverage of emerging areas of computer science, new trends in information technology,
and practical applications. Industry leaders use Communications as a platform to present and debate various technology implications,
public policies, engineering challenges, and market trends. The prestige and unmatched reputation that Communications of the ACM
enjoys today is built upon a 50-year commitment to high-quality editorial content and a steadfast dedication to advancing the arts,
sciences, and applications of information technology.

ACM, the world's largest educational and scientific computing society, delivers resources that advance computing as a science and profession. ACM provides the computing field's premier Digital Library and serves its members and the computing profession with leading-edge publications, conferences, and career resources.

Executive Director and CEO: John White
Deputy Executive Director and COO: Patricia Ryan
Director, Office of Information Systems: Wayne Graves
Director, Office of Financial Services: Darren Ramdin
Director, Office of SIG Services: Donna Cappo
Director, Office of Publications: Bernard Rous
Director, Office of Group Publishing: Scott E. Delman

ACM COUNCIL
President: Alexander L. Wolf
Vice-President: Vicki L. Hanson
Secretary/Treasurer: Erik Altman
Past President: Vinton G. Cerf
Chair, SGB Board: Patrick Madden
Co-Chairs, Publications Board: Jack Davidson and Joseph Konstan
Members-at-Large: Eric Allman; Ricardo Baeza-Yates; Cherri Pancake; Radia Perlman; Mary Lou Soffa; Eugene Spafford; Per Stenström
SGB Council Representatives: Paul Beame; Barbara Boucher Owens; Andrew Sears

BOARD CHAIRS
Education Board: Mehran Sahami and Jane Chu Prey
Practitioners Board: George Neville-Neil

REGIONAL COUNCIL CHAIRS
ACM Europe Council: Fabrizio Gagliardi
ACM India Council: Srinivas Padmanabhuni
ACM China Council: Jiaguang Sun

PUBLICATIONS BOARD
Co-Chairs: Jack Davidson and Joseph Konstan
Board Members: Ronald F. Boisvert; Nikil Dutt; Roch Guerrin; Carol Hutchins; Yannis Ioannidis; Catherine McGeoch; M. Tamer Ozsu; Mary Lou Soffa

ACM U.S. Public Policy Office
Renee Dopplick, Director
1828 L Street, N.W., Suite 800, Washington, DC 20036 USA
T (202) 659-9711; F (202) 667-1066

Computer Science Teachers Association
Lissa Clayborn, Acting Executive Director

STAFF
Director of Group Publishing: Scott E. Delman, cacm-publisher@cacm.acm.org
Executive Editor: Diane Crawford
Managing Editor: Thomas E. Lambert
Senior Editor: Andrew Rosenbloom
Senior Editor/News: Larry Fisher
Web Editor: David Roman
Rights and Permissions: Deborah Cotton
Art Director: Andrij Borys
Associate Art Director: Margaret Gray
Assistant Art Director: Mia Angelica Balaquiot
Designer: Iwona Usakiewicz
Production Manager: Lynn D'Addesio
Director of Media Sales: Jennifer Ruzicka
Public Relations Coordinator: Virginia Gold
Publications Assistant: Juliet Chance

Columnists: David Anderson; Phillip G. Armour; Michael Cusumano; Peter J. Denning; Mark Guzdial; Thomas Haigh; Leah Hoffmann; Mari Sako; Pamela Samuelson; Marshall Van Alstyne

CONTACT POINTS
Copyright permission: permissions@cacm.acm.org
Calendar items: calendar@cacm.acm.org
Change of address: acmhelp@acm.org
Letters to the Editor: letters@cacm.acm.org

WEBSITE
http://cacm.acm.org

AUTHOR GUIDELINES
http://cacm.acm.org/

ACM ADVERTISING DEPARTMENT
2 Penn Plaza, Suite 701, New York, NY 10121-0701
T (212) 626-0686; F (212) 869-0481
Director of Media Sales: Jennifer Ruzicka, jen.ruzicka@hq.acm.org
Media Kit: acmmediasales@acm.org

Association for Computing Machinery (ACM)
2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA
T (212) 869-7440; F (212) 869-0481

EDITORIAL BOARD
Editor-in-Chief: Moshe Y. Vardi, eic@cacm.acm.org

NEWS
Co-Chairs: William Pulleyblank and Marc Snir
Board Members: Mei Kobayashi; Kurt Mehlhorn; Michael Mitzenmacher; Rajeev Rastogi

VIEWPOINTS
Co-Chairs: Tim Finin; Susanne E. Hambrusch; John Leslie King
Board Members: William Aspray; Stefan Bechtold; Michael L. Best; Judith Bishop; Stuart I. Feldman; Peter Freeman; Mark Guzdial; Rachelle Hollander; Richard Ladner; Carl Landwehr; Carlos Jose Pereira de Lucena; Beng Chin Ooi; Loren Terveen; Marshall Van Alstyne; Jeannette Wing

PRACTICE
Co-Chairs: Stephen Bourne
Board Members: Eric Allman; Charles Beeler; Bryan Cantrill; Terry Coatta; Stuart Feldman; Benjamin Fried; Pat Hanrahan; Tom Limoncelli; Kate Matsudaira; Marshall Kirk McKusick; Erik Meijer; George Neville-Neil; Theo Schlossnagle; Jim Waldo
The Practice section of the CACM Editorial Board also serves as the Editorial Board of ACM Queue.

CONTRIBUTED ARTICLES
Co-Chairs: Al Aho and Andrew Chien
Board Members: William Aiello; Robert Austin; Elisa Bertino; Gilles Brassard; Kim Bruce; Alan Bundy; Peter Buneman; Peter Druschel; Carlo Ghezzi; Carl Gutwin; Gal A. Kaminka; James Larus; Igor Markov; Gail C. Murphy; Bernhard Nebel; Lionel M. Ni; Kenton O'Hara; Sriram Rajamani; Marie-Christine Rousset; Avi Rubin; Krishan Sabnani; Ron Shamir; Yoav Shoham; Larry Snyder; Michael Vitale; Wolfgang Wahlster; Hannes Werthner; Reinhard Wilhelm

RESEARCH HIGHLIGHTS
Co-Chairs: Azer Bestovros and Gregory Morrisett
Board Members: Martin Abadi; Amr El Abbadi; Sanjeev Arora; Dan Boneh; Andrei Broder; Doug Burger; Stuart K. Card; Jeff Chase; Jon Crowcroft; Sandhya Dwaekadas; Matt Dwyer; Alon Halevy; Maurice Herlihy; Norm Jouppi; Andrew B. Kahng; Henry Kautz; Xavier Leroy; Kobbi Nissim; Mendel Rosenblum; David Salesin; Steve Seitz; Guy Steele, Jr.; David Wagner; Margaret H. Wright

WEB
Chair: James Landay
Board Members: Marti Hearst; Jason I. Hong; Jeff Johnson; Wendy E. MacKay

ACM COPYRIGHT NOTICE
Copyright © 2015 by Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

For other copying of articles that carry a code at the bottom of the first or last page or screen display, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center; www.copyright.com.

SUBSCRIPTIONS
An annual subscription cost is included in ACM member dues of $99 ($40 of which is allocated to a subscription to Communications); for students, cost is included in $42 dues ($20 of which is allocated to a Communications subscription). A nonmember annual subscription is $100.

ACM MEDIA ADVERTISING POLICY
Communications of the ACM and other ACM Media publications accept advertising in both print and electronic formats. All advertising in ACM Media publications is at the discretion of ACM and is intended to provide financial support for the various activities and services for ACM members. Current advertising rates can be found by visiting http://www.acm-media.org or by contacting ACM Media Sales at (212) 626-0686.

SINGLE COPIES
Single copies of Communications of the ACM are available for purchase. Please contact acmhelp@acm.org.

COMMUNICATIONS OF THE ACM (ISSN 0001-0782) is published monthly by ACM Media, 2 Penn Plaza, Suite 701, New York, NY 10121-0701. Periodicals postage paid at New York, NY 10001, and other mailing offices.

POSTMASTER
Please send address changes to Communications of the ACM, 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA

Printed in the U.S.A.


cerf’s up

DOI:10.1145/2786922 Vinton G. Cerf

Milestones
Last month I wrote about how much we owe to John White, not only for his 17 years of service to ACM as CEO, but also his earlier services in various ACM volunteer positions, including president. Now I want to draw attention to the fact that ACM will be 70 years old in 2017. It is not too early to begin thinking about how we might usefully recognize this milestone. We had a remarkable celebration we called ACM 1 in 1997 as we reached the 50th year of our existence, and while 70 is not the same as, say, 100, it is a significant moment and worthy of some serious thought. Such moments encourage all of us to think about what we want to achieve in the future, even as we look back on the many years of our existence and accomplishments.

One immediate thought is to try to compose a list of, say, 25 problems in computer science that remain unsolved—rather like David Hilbert's 23 problems in mathematics published in 1900.a What might be on our list? I am not sure how to formulate this, but one thing on my list is to ask whether we can ever get to the point where we can write bug-free code in a reliable fashion! There are other questions having to do with computability and computer design. Will new machine architectures permit some form of computation that goes beyond the reach of the Turing Machine? Are there limits to quantum computation? Scott Aaronson tackles that question in a 2008 Scientific American article.b What do we not know about the fundamentals of computation and programming, in their largest sense, that we should seek to understand? What other questions would you put on such a list?

Another thing we might consider is how we think computing (and communications) will change over the next 70 years. The Internet, the World Wide Web, and the Internet of Things are already a part of daily life for about 40% of the world's population, and smartphones for an even larger fraction. Can we try to imagine what the next major changes might be? Will nano devices, programmable and autonomous robots, machine learning, and artificial intelligence have as large or even greater impact than what has gone before? Will programs learn to evolve themselves to pose and solve new problems? Do we face an optimistic future in which software artifacts are seen as partners, or are the darker predictions of the replacement of humanity with smarter robotic software more credible?

Should we be concerned about the increasingly pervasive role of software in our lives? Should we be learning about and teaching some form of "responsible programming?" Should there be incentives for software makers and providers to assure that credible measures have been taken to maximize safety and minimize risk to those using or dependent on software? What might those incentives look like? What should we do about preserving digital content, including the software needed to correctly interpret it? What are the implications for intellectual property protection and for the interests of the general public? What is the future of ACM's Digital Library and the concept of open access to publications, data, and software?

The notion of "stock taking" seems very applicable here. What will be the state of affairs for computing in 2017 and how should we help to shape its future? What is the role of ACM and its members? What about the millions of professionals (and amateurs) with an interest in computing who may not be members but are affected by ACM's work? It is not too early to begin thinking about 2017 and, as we wish John White a fond farewell and his successor a warm welcome, let us begin the process of formulating a meaningful and substantive recognition of that milestone year. I await your suggestions with great anticipation!

a http://en.wikipedia.org/wiki/Hilbert%27s_problems

b Scott Aaronson, The Limits of Quantum Computers, Scientific American 298 (2008), 62–69; doi:10.1038/scientificamerican0308-62.

Vinton G. Cerf is vice president and Chief Internet Evangelist at Google. He served as ACM president from 2012–2014.

Copyright held by author.


letters to the editor

DOI:10.1145/2785962

Quality vs. Quantity in Faculty Publications

AS A FORMER computer science department chair, I have long felt the system for evaluating and promoting faculty is far too dependent on quantity and fails to account adequately for quality. So I applaud the Computing Research Association best practices memo Moshe Y. Vardi mentioned in his Editor's Letter "Incentivizing Quality and Impact in Computing Research" (May 2015). However, more specific guidelines are needed for this initiative to be effective. I would urge all computing departments to inform candidates for assistant-professor positions their submitted CVs should list no more than three publications. Candidates must also be the primary author of at least two of them, meaning they were responsible for at least 50% of the intellectual content and at least 50% of the writing. On any other publication, the candidate would be considered a secondary author; yes, some papers may have no primary author.

I would go further. Faculty being considered for tenure should be permitted to list at most 15 publications and be a secondary author on at most five of them. Full-professor candidates should be permitted to list at most 25 and be a secondary author on at most 10. Also, in either case, candidates should be required to list three recent papers for which they are primary authors and on which they are to be explicitly evaluated.

These thresholds are meant as examples, not firm proposals. What is important is departments and research organizations send a clear message to candidates that CV-padding will not help their case for promotion and tenure.

To make evaluations of papers meaningful, they should be a written part of the hiring/promotion case. That is, for each paper to be evaluated, some expert in the candidate's field should be asked to study the subject paper and write a substantive review. Many departments will be able to handle these reviews internally. Those lacking faculty with the requisite expertise could commission reviews from outside experts, possibly in exchange for a nontrivial honorarium.
Jonathan Turner, Sarasota, FL

Author's Response:
The Editor's Letter in question indeed suggested leading computing-research organizations sign a statement committing to adopt the CRA memo as the basis for their own hiring and promotion practices. Such a statement can be more detailed than the CRA memo.
Moshe Y. Vardi, Editor-in-Chief

Python for Beginners Here to Stay
It was refreshing to see Python discussed as a primary teaching language by Esther Shein in her news article "Python for Beginners" (Mar. 2015). The Python experience for introductory students at my college—Washington and Lee University—has been nothing but positive. We knew when we switched from Java to Python a decade ago that the language would offer the résumé value of widespread use in industry, as well as the pedagogical value of stepwise, stress-free introduction of programming constructs, from simple algorithmic code to the use and definition of functions, objects, and classes. Minimal syntactic "pixie dust" and support for safe semantics are the most critical features of Python for novices. The language is well suited to the kinds of short programs assigned in beginning courses while appealing to students with widely different backgrounds.

Excellent APIs are available for creating easy GUI-based Python programs in introductory courses (http://home.wlu.edu/~lambertk/breezypythongui/index.html) and for teaching data structures and object-oriented programming in a systematic way (http://home.wlu.edu/~lambertk/classes/112/). When students learn the basic principles of programming with Python in these courses, they are ready to move on to the wider world of software development involving more challenging languages like Java, C, and Erlang. But for beginners, Python is here to stay.
Kenneth Lambert, Lexington, VA

Esther Shein's news article (Mar. 2015) downplayed the robust nature of Python, perhaps in the interests of journalistic "balance." But the word count dedicated to critique seemed misplaced in an article that might otherwise have presented more positive examples of use. For example, the repeated reference to Python as a "scripting language" was dismissive. Python is not only useful for scripting applications or OS functions but for industrial applications with tens of thousands of LOC. I once led a development team that wrote a robust Web transaction B2B infrastructure in Python that could handle individual database files with millions of records. It is still running two decades later.

Python is strongly dynamically typed, enabling it to handle flexible runtime situations efficiently. If a programmer wants the advantages of static typing, it can be bolted on using a preprocessor like "mypy," an experimental optional static type checker for Python. A programmer could also use Python to implement automated testing, structured documentation, and other safe-coding practices. But doing so would be useful no matter what the language, and in any case goes beyond the target audience of "beginners."

Shriram Krishnamurthi, the Brown University professor quoted in the article, complained Python has limited support for data structures. What it does have is built-in tuple, list, set, array, and dictionary types, and queues, b-trees, ordered dictionaries, named tuples; other functions are part of the standard library. Moreover, third-party libraries exist for every other imaginable situation. For custom-mixed data structures, Python lets programmers use a class.

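Two of the points in this exchange—that a plain class serves as a custom record, and that static typing can be "bolted on" with a checker like mypy—can be combined in a small sketch. The sketch is illustrative only, not the letter writer's code: it restates the Album record discussed nearby using `typing.NamedTuple`, one of the standard-library options the letter alludes to.

```python
from typing import NamedTuple

class Album(NamedTuple):
    # A typed record: the annotations are ignored by the interpreter
    # at runtime, but a static checker such as mypy uses them to
    # verify that callers pass fields of the right types.
    name: str
    artist: str
    year: int

favourite = Album(name='Revolver', artist='The Beatles', year=1966)
print(favourite.year)  # prints 1966

# mypy would reject Album(name='Revolver', artist='The Beatles',
# year='1966') as an argument-type error, while plain Python would
# happily run it and fail only later, if at all.
```

Run under mypy (for example, `mypy album.py`, assuming that file name), the mistyped call in the comment would be reported before the program ever executes; the well-typed lines behave identically with or without the checker.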

letters to the editor

use a class. Krishnamurthi bemoaned their "onerous syntax and tricky semantics," which might be a reasonable criticism when a programmer goes beyond trivial cases, but for simple structures a programmer needs only the "pass" command, as in the following example:

class Album:
    pass
favourite = Album()
favourite.name = 'Revolver'
favourite.artist = 'The Beatles'
favourite.year = 1966

A novice can understand this code once the instantiation at line 3 is explained. It is hardly "onerous," as Krishnamurthi claimed. From here, put and get methods, type checking, validation, and other functions can be added in an incremental way, an approach eminently suitable for beginners.

The problem of underestimating Python may involve bringing biases from one language into another. The lack of "struct," or static types, does not make Python any less capable. It could do the opposite. There are certainly domains where some other programming language would be preferable. But Python is the language for the 99%.

Robin Parmar, Limerick, Ireland

Comprehensive Diligence for Information Security

Security specialists must address information security as precisely and completely as possible because clever adversaries use attack methods and vulnerabilities defenders have overlooked. Security must be comprehensive, as it is no stronger than its weakest link. Building security of information should start with formal models that address all known material weaknesses and security controls and practices. Though Johannes Sametinger et al.'s review article "Security Challenges for Medical Devices" (Apr. 2015) was excellent describing many potentially mortal threats and calling for formal methods and models to achieve security, it was only one more article demonstrating the weak and deficient fundamentals of information security. It said, "Security is about protecting information and information systems from unauthorized access and use." Authorized but malicious access and use must also be addressed. Acts not mentioned include unintended representation, taking, interference, endangerment, failure to use, and false data entry. Privacy is indeed not a fundamental element of security but a human right achieved, in part, through protection of confidentiality, possession, integrity, authenticity, availability, and utility.

Sametinger et al. also said, "Confidentiality, integrity, and availability (CIA) of information are core design and operational goals." Possession and control, authenticity, and utility are also needed to achieve security. They said, incorrectly, confidentiality is threatened by disclosure, integrity by altering data, and availability of devices by inoperability. Confidentiality is threatened first by taking or observing. Disclosure means somebody in possession reveals information to another person. Integrity, meaning whole and sound, is not necessarily threatened by altering data. Altering data threatens information authenticity. Moreover, the article switched from information to devices threatened by loss of availability. What about when information is unavailable or unusable? Sametinger et al. are not to blame for such oversights since these conundrums are abundant throughout the information-security literature and even in laws and regulations.

The figure here is a formal diligence model I devised for reviewing the information security of more than 200 enterprise clients (http://www.cbi.umn.edu/) that comes closer than any article I know to describing the basic elements of information security. I challenge Sametinger et al., as well as all Communications readers, to identify any significant detail I overlooked in this model at the same level of abstraction and within normal dictionary meanings of its words.

Donn B. Parker, Los Altos, CA

Figure. Information security fundamentals.

Information security preserves
  confidentiality
  possession (control)
  integrity
  authenticity
  availability
  utility
of information from unintended acts of adversaries and forces causing
  destruction or copying
  interference or use of false data
  misuse or failure to use
  change or replacement
  finding or taking
  disclosure or observation
  misrepresentation or endangerment
by application of
  avoidance and deterrence
  prevention and detection
  mitigation and investigation
  transference and audit
  sanctions and rewards
  recovery and correction
  motivation and education
to achieve
  desired protection
  generally accepted practices
  avoidance of negligence
  compliance with laws
  meeting regulations and standards
  ethical practices
  protection of rights

Authors' Response:
Parker mentions the "… weak and deficient fundamentals of information security." It is not the fundamentals that are weak and deficient but their execution. Devices could be much more secure. Parker is right there are aspects of security like authorized but malicious access we have not considered. For now, though, we would be happy to avoid unauthorized access and trust our cardiologist. We agree privacy is a human right, but some security measures are needed to avoid disclosure.
Johannes Sametinger, Jerzy Rozenblit, Roman Lysecky, and Peter Ott

Communications welcomes your opinion. To submit a Letter to the Editor, please limit yourself to 500 words or less, and send to letters@cacm.acm.org.

© 2015 ACM 0001-0782/15/07 $15.00
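The incremental additions the Parmar letter above gestures at — put and get methods, type checking, validation — can be sketched with a Python property. Everything beyond the letter's Album class and its 'Revolver' example is a hypothetical illustration, not part of the letter:

```python
# Hypothetical next step for the letter's Album class: the same
# incremental style, with type checking and validation added later
# via a property. The year-range rule is invented for illustration.
class Album:
    @property
    def year(self):
        return self._year

    @year.setter
    def year(self, value):
        if not isinstance(value, int):        # type checking
            raise TypeError('year must be an int')
        if not 1877 <= value <= 2100:         # simple validation
            raise ValueError('year out of range')
        self._year = value

favourite = Album()
favourite.name = 'Revolver'       # plain attributes still work
favourite.artist = 'The Beatles'
favourite.year = 1966             # now routed through the setter
```

Existing call sites keep the plain attribute syntax; only invalid assignments now fail, which is what makes the incremental approach workable for beginners.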


ACM
ON A MISSION TO SOLVE TOMORROW.

Dear Colleague,

Computing professionals like you are driving innovations and


transforming technology across continents, changing the way
we live and work. We applaud your success.

We believe in constantly redefining what computing can and


should do, as online social networks actively reshape relationships
among community stakeholders. We keep inventing to push
computing technology forward in this rapidly evolving environment.

For over 50 years, ACM has helped computing professionals to be their most creative,
connect to peers, and see what’s next. We are creating a climate in which fresh ideas are
generated and put into play.

Enhance your professional career with these exclusive ACM Member benefits:

• Subscription to ACM’s flagship publication Communications of the ACM


• Online books, courses, and webinars through the ACM Learning Center
• Local Chapters, Special Interest Groups, and conferences all over the world
• Savings on peer-driven specialty magazines and research journals
• The opportunity to subscribe to the ACM Digital Library, the world’s
largest and most respected computing resource

We’re more than computational theorists, database engineers, UX mavens, coders and
developers. Be a part of the dynamic changes that are transforming our world. Join
ACM and dare to be the best computing professional you can be. Help us shape the
future of computing.

Sincerely,

Alexander Wolf
President
Association for Computing Machinery

Advancing Computing as a Science & Profession


The Communications Web site, http://cacm.acm.org,
features more than a dozen bloggers in the BLOG@CACM
community. In each issue of Communications, we’ll publish
selected posts or excerpts.

Follow us on Twitter at http://twitter.com/blogCACM

DOI:10.1145/2771281 http://cacm.acm.org/blogs/blog-cacm

The Dangers of Military Robots, the Risks of Online Voting

John Arquilla considers the evolution of defense drones, and why Duncan A. Buell thinks we are not ready for e-voting.

John Arquilla
"Meet A.I. Joe"
http://bit.ly/1HYARMi
May 1, 2015

Isaac Asimov's Three Laws of Robotics, first codified in his 1942 short story "Runaround" (http://bit.ly/1AAkKhW), sought to steer use of artificial intelligence (AI) in peaceful directions. Robots were never to harm humans, or by inaction allow them to come to harm. Within months of the elucidation of these laws, however, an extremely primitive robot, the Norden bombsight, was being put to lethal use by the U.S. Army Air Corps. The bombsight combined a computer that calculated various factors affecting an aircraft's arrival over a target with an autopilot. Touted for its accuracy from high altitude, the Norden nevertheless tended to miss the aim point by an average of a quarter-mile. Yet pilots routinely turned over control to the machine for the final run to target.

In the 70 years since the end of World War II, AI has advanced enormously, and the U.S. military has continued to show a steady appetite for acquiring lethal robots. Tomahawk cruise missiles, the great-grandchildren of the Norden, once launched find their own way to far-distant targets over even the most complicated terrain; yet the enemy sometimes eludes the Tomahawks, which end up killing the wrong people. The Phalanx ship-defense system is another important military robot—many missiles move too fast for human reflexes—but Phalanx, limited to close-in, Gatling-gun-like defensive fire, is unlikely to cause collateral damage to noncombatants.

For all the advances in AI, there remains a significant reluctance to embrace the notion of fully autonomous action by machines. Hence the popularity today of remotely controlled aircraft and ground combat systems. Indeed, the growth in the number of drone pilots has been explosive, and soldiers on the ground have become very attached to their machine buddies; there have been instances when drones have been given medals, and have been ceremonially buried when "killed in action." It is fascinating to see the great emotional appeal of the machines juxtaposed with the intellectual fear of what robots might do in the future, when they become more able to act independently.

That fear is more than just the residue of the dark tropes of the Terminator and Matrix film franchises—not to mention Ultron—or of the "Battlestar Galactica" TV series reboot; or, for that matter, of the worries about robots most recently expressed by luminaries like Stephen Hawking. There are real and practical concerns about autonomy and flawed machine judgment. How is a robot to determine the difference between enemy soldiers and noncombatants? This is a difficult-enough problem for human soldiers in the irregular wars of our time. Other questions are: When should a robot keep fighting or stop shooting if the foe is wounded, or is trying to surrender? In the case of a robot-piloted attack aircraft, can it discriminate between military and civilian targets adequately?

Even worse, can robots be hacked? The Iranians claim to have hacked an American drone and brought it down safely on their territory back in 2011. However it happened, they have it, and refused to return it when President Obama somewhat cheekily asked for it back. This incident should prompt us to consider the question: What if robots could be taken over and turned on their masters? A coup of this sort would require a high level of technological sophistication and an absolute mastery of the radio-electromagnetic spectrum, but the consequences of one side being able to do this would be catastrophic for the side whose robots were taken over.

Strategic concerns aside, on the political front, ethical and practical concerns about robots have been raised at the United Nations, which in an April 2013 report called for an immediate moratorium on the development and deployment of "lethal autonomous robots." The report is part of a valiant effort to stave off the onset of an AI arms race, but indicators from many places are that the race is already on. In Britain, for example, the Taranis combat aircraft (named for the Celtic god of thunder) has autonomous capabilities, unlike the Predators and other types of drones we have become familiar with over the past decade. The British are also moving toward fielding robot soldiers; their humanoid Porton Man reflects exceptional sophistication of design.

The Russians are not far behind, although their Strelok sharpshooter and Metalliste machine gun and grenade launcher systems, while apparently having some autonomous capability, seem to be under fairly close human control. But don't count on it staying that way. The Chinese military is making swift advances in both remote-controlled and autonomous systems, on land, at sea, and in the air. Their advances in naval mine warfare include weapons that can sense the type of ship coming along and move stealthily to attack it. There is also evidence of Chinese sea mines with a capability to detect and then attack helicopters flying above them. Against these threats, the U.S. Navy is developing autonomous mine-clearing technologies, giving the undersea fight an increasingly robot vs. robot flavor.

The same is true in cyberspace, where the sheer speed and complexity of offensive and defensive operations are driving a shift to reliance on robots for waging cyberwars. This is an area about which little can be said openly save that here is yet more evidence that the arms race the United Nations seeks to prevent is well under way. That suggests it is high time for an international conference, and an accompanying discourse, on the prospects for crafting robotic arms control agreements. In this way, we can keep alive the ideal of peaceful robotics that Asimov introduced so long ago. There may be no way, ultimately, to stop the spread of killer robots, but at the very least they should be obligated to observe the laws of armed conflict, like their human counterparts.

Duncan A. Buell
"Computer Security and the Risks of Online Voting"
http://bit.ly/1CeFQ4A
April 2, 2015

One hears from the science community that scientists need to stand up and explain their science to the public. The usual topics—climate change, evolutionary biology, stem cell research—are outside the normal scope of most ACM members, but we in the computer sciences are the experts on one matter that has enormous impact.

In the wake of the 2000 U.S. presidential election, Congress passed the Help America Vote Act (HAVA). The nation largely turned away from older election technology and moved to computer-based systems, in some states relying entirely on the software and sometimes-arcane procedures that provided no secondary method against which to compare the tally produced by the software.

Now, as the HAVA machines age, many election officials around the country and around the world seem enchanted with the marketing hype of Internet voting software vendors and are buying in to the notion that we could—and should—vote online now and in the very near future.

Never mind the almost-daily reports of data breaches of financial organizations with deep pockets to spend on securing their computers. Never mind that governments, with shallower pockets, are routinely hacked, or that former FBI director Robert Mueller went on the record that the only two kinds of computers are those that have already been hacked and those that no one has yet bothered to hack. Election officials seem in awe of ill-defined vendor terms like "military-grade encryption."

In a small corner of the ACM world, scientists wonder why officials are not hearing the message of computer security experts. It is time for that small corner to expand and for the membership to find a voice in hopes that election officials can be made to listen.

ACM members are moderately aware that Alex Halderman of the University of Michigan and his students (legitimately) hacked the test Internet voting software of the District of Columbia. They are probably less aware that Halderman and Vanessa Teague of the University of Melbourne recently demonstrated that vendor software being used for an Australian statewide election was subject to a standard security flaw. Teague and Halderman responded in the best manner of professionals recognizing a flaw: they alerted the Australian CERT, which alerted the New South Wales Election Commission (NSWEC), which patched the software. Then Halderman and Teague went public, but the software patch to prevent votes from being intercepted and changed in flight to election central did not happen until one-fourth of the total online votes were cast.

The response from the NSWEC was both dismissive and frightening. On the one hand, it was admitted the NSWEC had factored the likelihood of corrupted votes into their analysis. On the other hand, they targeted Teague with a formal complaint to her university, accusing her of a breach of ethics and suggesting further that a DoS attack was coming from the University of Michigan. The NSWEC then had the audacity to publish with the vendor a puff piece (http://bit.ly/1GVp4Nr) on online voting without mentioning the security flaw.

The Australian election is not an anomaly. Many U.S. states are toying with the notion of online voting, contracting their elections to private companies whose code has never been given a public vetting. As scientists, we would all probably rather be doing science than trying to find ways to convince the public and election officials that security online today is not up to the task of voting online today. Yet the need is critical for disseminating the hard facts about computer security and the huge risks of online voting. We must take this on as a professional responsibility. The nation and the world deserve no less than our full involvement.

John Arquilla is professor and chair of defense analysis at the United States Naval Postgraduate School. Duncan A. Buell is professor of computer science and engineering at the University of South Carolina.

© 2015 ACM 0001-0782/15/07 $15.00

News

Science | DOI:10.1145/2771283 Chris Edwards

Growing Pains
for Deep Learning
Neural networks, which support online image search and speech
recognition, eventually will drive more advanced services.

ADVANCES IN THEORY and computer hardware have allowed neural networks to become a core part of online services such as Microsoft's Bing, driving their image-search and speech-recognition systems. The companies offering such capabilities are looking to the technology to drive more advanced services in the future, as they scale up the neural networks to deal with more sophisticated problems.

It has taken time for neural networks, initially conceived 50 years ago, to become accepted parts of information technology applications. After a flurry of interest in the 1990s, supported in part by the development of highly specialized integrated circuits designed to overcome their poor performance on conventional computers, neural networks were outperformed by other algorithms, such as support vector machines in image processing and Gaussian models in speech recognition.

Older simple neural networks use only up to three layers, split into an input layer, a middle 'hidden' layer, and an output layer. The neurons are highly interconnected across layers. Each neuron feeds its output to each of the neurons in the following layer. The networks are trained by iteratively adjusting the weights that each neuron applies to its input data to try to minimize the error between the output of the entire network and the desired result.

Although neuroscience suggested the human brain has a deeper architecture involving a number of hidden layers, the results from early experiments on these types of systems were worse than for shallow networks. In 2006, work on deep architectures received a significant boost from work by



Geoffrey Hinton and Ruslan Salakhutdinov at the University of Toronto. They developed training techniques that were more effective for training networks with multiple hidden layers. One of the techniques was 'pre-training' to adjust the output of each layer independently before moving on to trying to optimize the network's output as a whole. The approach made it possible for the upper layers to extract high-level features that could be used more efficiently to classify data by the lower, hidden layers.

Even with improvements in training, scale presents a problem for deep learning. The need to fully interconnect neurons, particularly in the upper layers, requires immense compute power. The first layer for an image-processing application may need to analyze a million pixels. The number of connections in the multiple layers of a deep network will be orders of magnitude greater. "There are billions and even hundreds of billions of connections that have to be processed for every image," says Dan Cireşan, researcher at the Manno, Switzerland-based Dalle Molle Institute for Artificial Intelligence Research (IDSIA). Training such a large network requires quadrillions of floating-point operations, he adds.

Researchers such as Cireşan found it was possible to use alternative computer architectures to massively speed up processing. Graphics processing units (GPUs) made by companies such as AMD and nVidia provide the ability to perform hundreds of floating-point operations in parallel. Previous attempts to speed up neural-network training revolved around clusters of workstations that are slower, but which were easier to program. In one experiment in which a deep neural network was trained to look for characteristic visual features of biological cell division, Cireşan says the training phase could have taken five months on a conventional CPU; "it took three days on a GPU."

Yann LeCun, director of artificial intelligence research at Facebook and founding director of New York University's Center for Data Science, says, "Before, neural networks were not breaking records for recognizing continuous speech; they were not big enough. When people replaced Gaussian models with deep neural nets, the error rates went way down."

Deep neural nets showed an improvement of more than a third, cutting error rates on speech recognition with little background noise from 35% to less than 25%, with optimizations allowing further improvements since their introduction.

There are limitations to this form of learning. London-based DeepMind—which was bought by Google in early 2014 for $400 million—used computer games to evaluate the performance of deep neural networks on different types of problems. Google researcher Volodymyr Mnih says the system cannot deal with situations such as traversing a maze, where the rewards only come after successfully completing a number of stages. In these cases, the network has very little to learn from when it tries various random initial maneuvers but fails. The deep neural network fares much better at games such as Breakout and Virtual Pinball, where success may be delayed, but it can learn from random responses.

When it comes to deploying deep networks in commercial applications, teams have turned to custom computer designs using field-programmable gate arrays (FPGAs). These implement custom electronic circuits using a combination of programmable logic lookup tables, hard-wired arithmetic logic units optimized for digital signal processing, and a matrix of memory cells to define how all of these elements are connected.

Chinese search-engine and web-services company Baidu, which uses deep neural networks to provide speech recognition, image searches, and to serve contextual advertisements, decided to use FPGAs rather than GPUs in production servers. According to Jian Ouyang, senior architect at Baidu, although individual GPUs provide peak floating-point performance, in the deep neural network applications used by Baidu, the FPGA consumes less power for the same level of performance and could be mounted on a server blade, powered solely from the PCI Express bus connections available on the motherboard. A key advantage of the FPGA is that because the results from one calculation can be fed directly to the next

ACM Member News

ELLIOTT CONSIDERS ONE MORE BIG PROJECT

Chip Elliott has a triple-threat career: he is chief scientist at Raytheon BBN Technologies, adjunct professor of computer science at Dartmouth College, and Futures Director at the National Science Foundation's Global Environment for Network Innovations (GENI). Elliott is also an active inventor, with over 90 issued patents.

Elliott was born in Connecticut, raised in Kansas, and "escaped back to New England" to earn a B.A. in mathematics from Dartmouth College in New Hampshire. He began his computer science career as a programmer, teaching and founding the startup True Basic. His passion is research "seven to 10 years beyond current economic drivers, to explore [my vision]." One such project involved collaborating with quantum physicists and photonics experts to build the world's first quantum network, linking Raytheon BBN with Harvard and Boston universities.

His research at GENI centers on large-scale cloud infrastructures. GENI provides 3,000 researchers at 50 U.S.-based universities with a virtual laboratory to promote innovation. "Most clouds are pre-baked; we bake our own," Elliott says, adding, "Our researchers are programming new types of clouds that are better at processing big data."

Among the initiatives he is keen to pursue at BBN are "futuristic cellular Internets"; fully homomorphic encryption—"done correctly, you can perform computation on encrypted data without unencrypting it"; synthetic biology, "to make cells into engineered systems, performing advanced tasks like producing diesel fuel inside plant cells," and producing machines that "sense interactions with chemicals, light, and pressure, to incorporate in the human bloodstream" for the potential medical benefits.
—Laura DiDio
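Cireşan's figures quoted above — "billions and even hundreds of billions of connections" per image and "quadrillions of floating-point operations" for training — can be sanity-checked with a few lines of Python. The layer sizes and training-set size below are illustrative assumptions, not numbers from the article:

```python
# Back-of-envelope check on the scale quoted above. The layer sizes
# and training-set size are illustrative assumptions, not figures
# from the article.
layers = [1_000_000, 4_000, 4_000, 1_000]   # pixels -> hidden -> hidden -> output

# Fully connecting each layer to the next:
connections = sum(a * b for a, b in zip(layers, layers[1:]))

# Roughly one multiply and one add per connection, per image:
ops_per_image = 2 * connections

# One pass over a hypothetical million-image training set:
ops_per_epoch = ops_per_image * 1_000_000

print(f"{connections:,} connections")        # billions
print(f"{ops_per_epoch:,} ops per epoch")    # quadrillions
```

Even with these modest assumptions, a single epoch already lands in the quadrillions of operations, which is why the article's sources turned to GPUs and FPGAs.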
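The training procedure described earlier in the article — iteratively adjusting the weights each neuron applies to its inputs to shrink the error between the network's output and the desired result — can be sketched in plain Python. The network shape, learning rate, and XOR task are illustrative choices, not from the article:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

N_IN, N_HID = 2, 3   # two inputs, three hidden neurons, one output

# Small random starting weights, plus a bias for every neuron.
w_hid = [[random.uniform(-1, 1) for _ in range(N_HID)] for _ in range(N_IN)]
b_hid = [0.0] * N_HID
w_out = [random.uniform(-1, 1) for _ in range(N_HID)]
b_out = 0.0

def forward(x):
    # Each hidden neuron weights every input; the output neuron
    # weights every hidden activation.
    h = [sigmoid(b_hid[j] + sum(x[i] * w_hid[i][j] for i in range(N_IN)))
         for j in range(N_HID)]
    y = sigmoid(b_out + sum(h[j] * w_out[j] for j in range(N_HID)))
    return h, y

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def mean_squared_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

loss_before = mean_squared_error()

lr = 0.5
for _ in range(3000):
    for x, t in data:
        h, y = forward(x)
        # Error signal at the output, then nudge every weight a
        # little way down the error gradient (backpropagation).
        d_y = (y - t) * y * (1 - y)
        for j in range(N_HID):
            d_h = d_y * w_out[j] * h[j] * (1 - h[j])
            w_out[j] -= lr * d_y * h[j]
            b_hid[j] -= lr * d_h
            for i in range(N_IN):
                w_hid[i][j] -= lr * d_h * x[i]
        b_out -= lr * d_y

loss_after = mean_squared_error()
```

Running it should show the mean squared error falling as the weights are adjusted example by example; scaling this loop to millions of inputs and multiple hidden layers is exactly what drives the compute demands the article describes.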

without needing to be held temporarily in main memory, the memory bandwidth requirement is far lower than with GPU or CPU implementations.

"With the FPGA, we don't have to modify the server design and environment, so it is easy to deploy on a large scale. We need many functions to be supported that are impossible to deploy at the same time in FPGA. But we can use their reconfigurability to move functions in and out of the FPGA as needed. The reconfiguration time is less than 10µs," says Ouyang.

The Baidu team made further space savings by using a simplified floating-point engine. "Standard floating-point implementations provided by processors can handle all possible exceptions. But in our situation we don't need to handle all of the exceptions of the IEEE [754] standard."

As well as finding ways to use more effective processors, researchers are trying to use distributed processing to build more extensive deep-learning networks that can cope with much larger datasets. The latency of transfers over a network badly affects the speed of training. However, rearranging the training algorithms together with a shift from Ethernet networking to Infiniband, which offers lower latency, allowed a team from Stanford University in 2013 to achieve almost linear speedups for multiple parallel GPUs. In more recent work using clusters of CPUs rather than GPUs, Microsoft developed a way to relax the synchronization requirements of training to allow execution across thousands of machines.

More scalable networks have made it possible for Baidu to implement an "end to end" speech recognition system called Deep Speech. The system does not rely on the output of traditional speech-processing algorithms, such as the use of hidden Markov models to boost its performance on noisy inputs. It reduced errors on word recognition to just over 19% on a noise-prone dataset, compared to 30.5% for the best commercial systems available at the end of 2014.

However, pre-processing data and combining results from multiple smaller networks can be more effective than relying purely on neural networks. Cireşan has used a combination of image distortions and "committees" of smaller networks to reduce error rates compared to larger single deep-learning networks. In one test of traffic-sign recognition, the combination of techniques resulted in better performance than human observers.

Deciding on the distortions to use for a given class of patterns takes human intervention. Cireşan says it would be very difficult to have networks self-learn the best combination of distortions, but that it is typically an easy decision for humans to make when setting up the system.

One potential issue with conventional deep learning is access to data, says Neil Lawrence, a professor of machine learning in the computer science department of the University of Sheffield. He says deep models tend to perform well in situations where the datasets are well characterized and can be trained on a large amount of appropriately labeled data. "However, one of the domains that inspires me is clinical data, where this isn't the case. In clinical data, most people haven't had most clinical tests applied to them most of the time. Also, clinical tests evolve, as do the diseases that affect patients. This is an example of 'massively missing data.'"

Lawrence and others have suggested the use of layers of Gaussian processes, which use probability theory, in place of neural networks, to provide effective learning on smaller datasets, and for applications in which the neural networks do not perform well, such as data that is interconnected across many different databases, which is the case in healthcare. Because data may not be present in certain databases for a given candidate, a probabilistic model can deal with the situation better than traditional machine-learning techniques. The work lags behind that on neural networks, but researchers have started work on effective training techniques, as well as scaling up processing to work on platforms such as multi-GPU machines.

"We carry an additional algorithmic burden, that of propagating the uncertainty around the network," Lawrence says. "This is where the algorithmic problems begin, but is also where we've had most of the breakthroughs."

According to Lawrence, deep-learning systems based on Gaussian processes are likely to demand greater compute performance, but the systems are able to automatically determine how many layers are needed within the network, which is not currently possible with systems based on neural networks. "This type of structural learning is very exciting, and was one of the original motivations for considering these models."

In currently more widespread neural-network systems, Cireşan says work is in progress to remove further limitations to building larger, more effective models, "But I would say that what we would like mostly is to have a better understanding of why deep learning works."

Further Reading

Hinton, G.E., and Salakhutdinov, R.R.
Reducing the dimensionality of data with neural networks, Science (2006), Vol. 313, p. 504.

Schmidhuber, J.
Deep learning in neural networks: an overview, Neural Networks (2015), Vol. 61, pp. 85-117. (ArXiv preprint: http://arxiv.org/pdf/1404.7828.pdf)

Mnih, V., et al.
Human-level control through deep reinforcement learning, Nature (2015), 518, pp. 529-533.

Damianou, A.C., and Lawrence, N.D.
Deep Gaussian processes, Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013. (ArXiv preprint: http://arxiv.org/pdf/1211.0358.pdf)

Chris Edwards is a Surrey, U.K.-based writer who reports on electronics, IT, and synthetic biology.

© 2015 ACM 0001-0782/15/07 $15.00

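The distortion-based augmentation Cireşan describes can be sketched in miniature. The following is a hypothetical pure-Python illustration of expanding a labeled image set with random shifts and flips; the specific distortions and the tiny 3×3 "digit" are placeholders, not the ones used in Cireşan's systems:

```python
import random

def distort(image, rng):
    """Return a copy of a 2-D grid of pixel values with a random
    one-pixel horizontal shift and, half the time, a mirror flip."""
    h, w = len(image), len(image[0])
    dx = rng.choice([-1, 0, 1])  # shift amount; vacated cells become 0
    out = [[image[r][c - dx] if 0 <= c - dx < w else 0 for c in range(w)]
           for r in range(h)]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    return out

def augment(dataset, copies, seed=0):
    """Expand a list of (image, label) pairs with distorted variants."""
    rng = random.Random(seed)
    expanded = list(dataset)
    for image, label in dataset:
        for _ in range(copies):
            expanded.append((distort(image, rng), label))
    return expanded

digits = [([[0, 1, 0],
            [0, 1, 0],
            [0, 1, 0]], 1)]
print(len(augment(digits, copies=4)))  # 1 original + 4 variants = 5
```

As the article notes, choosing which distortions are plausible for a given class of patterns (a shifted digit is still the same digit) is exactly the part that takes human judgment when setting up the system.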
16 COM MUNICATIO NS O F TH E AC M | J U LY 201 5 | VO L . 5 8 | NO. 7


news

Technology | DOI:10.1145/2771299 Gregory Goth

Bringing Big Data to the Big Tent
Open source tools assist data science.

Some experienced hands in the field of data analysis feel the differences between investigational data scientists, who work on the leading edge of concepts using statistical tools such as the R programming environment, and operational data scientists, who have traditionally used general-purpose programming languages like C++ and Java to scale analytics to real-time enterprise-level computational assets, need to become less relevant.

That sentiment is growing, especially as tools emerge that enable individual scientists to analyze ever-larger amounts of data. Though many, if not most, of these data practitioners are not considered software developers in the traditional sense, the data their analysis creates becomes itself an increasingly valuable resource for many others.

The Scientific Software Network Map allows interactive exploration of data showing software tools used, usage trends, which packages were used together to produce a result, and publications resulting from that use.

"It is increasingly important to bridge that gap, for a few reasons," says Michael Franklin, director of the AMPLab, an open source innovation lab at the University of California at Berkeley. Franklin's reasons for closing the gap include:
• Having a unified system for both speeds up discovery by "closing the loop" between exploration and operation.
• Having different systems introduces the potential for error—a model created in the "investigational" tool set may be translated incorrectly to the (different) operational system—sometimes these errors can be quite subtle and go unnoticed.
• Increasingly, data scientists are responsible for both exploration and production.

The tools around many of the most innovative and promising data science initiatives are often open source resources that range from the R statistical programming language, widely used by individual researchers in the life sciences, physical sciences, and social sciences alike, to the enterprise-friendly Apache Spark cluster computing platform and its companion machine learning library developed at the AMPLab.

The attraction of open source tools might be ascribed to the academic culture in which much cutting-edge data science is being done, according to Stanford University marine biologist Luke Miller. Miller started using R when he began working at an institution that did not have a free site license for Matlab; his observations are shared by Scott Chamberlain, co-founder of rOpenSci, which creates open source scientific R packages, collections of R functions, data, and compiled code that help reduce redundant coding.

"Academics have this culture of not wanting to have to pay for things," Chamberlain says. "They want to use things that either the university has a license for, or they want it to be free. That's the culture; love it or hate it, that's the way it is."

Maturing the R Ecosystem
While the free culture around R has allowed it to spread to many disciplines—it has narrowly outpaced Python in the past two O'Reilly data science tool surveys—it has also become rather unwieldy, according to Jim Herbsleb, a professor of computer science at Carnegie Mellon University.

"There's a lot of duplication of effort out there, a lot of missed opportunities, where one scientist has developed a tool for him or herself, and with a few tweaks, or if it conformed to a particular standard or used a particular data format, it could be useful to a much wider community," he says. "R is all over the world, on everybody's laptops, so it's really hard to get a sense of what's happening out in the field. That's what we're trying to address."

Herbsleb and post-doctoral researcher Chris Bogart have created the Scientific Software Network Map (http://scisoft-net-map.isri.cmu.edu/), essentially a meta-tool that allows interactive exploration of data showing which software tools have been used, how use is trending, which packages are used together to produce a result, and the publications that resulted from that use.

Herbsleb views the newly launched project, funded by a National Science


Foundation grant, more as a reference point for R package developers who wish to see how end users are utilizing their work than for those seeking tools. "If developers know what other people need or want from a tool they are working on, they are often willing to do some extra work to make sure it works for everybody," he says. "That's another thing one could hope to get from the map: a better understanding of what community needs are by seeing how the tools are actually being used. Another use case is for communities of scientists who either already do, or would like to, start managing their software assets as a community."

However, Larry Pace, author of several books on R, thinks Herbsleb's network map could be exactly the reference neophyte R users, as well as developers, would find very useful.

"I can see the various R packages I use on this map; it's a really, really cool way to show this," Pace says. "There are, even in R right now, tools that do what psychologists do, psychometric analyses and things like that. They're just not well publicized, so people don't know about them."

For experienced end users like Miller, there appears to be a critical mass of mailing list exchanges, an index of domain-specific "task views" on the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org/), and dialogue within scientific communities that facilitate reuse of code and overall expansion of bodies of knowledge codified in R; what one may term the "operational" side of the worldwide availability of reproducible data.

"There are enough R users who have written very useful scripts that I can piggy-back onto and open these same data files and plunk through them," he says. "Python in some ways may be a better way to do that, certainly in terms of time invested; a good Python programmer probably would have gotten it done faster than I did in R. We're not doing wildly fancy programming here; we're just opening data files and parsing them. But in terms of integrating into the rest of my data visualization and analysis workflow, there's very little reason to step outside of R for that sort of task."

User-Friendly ML Emerging
Developers of machine learning tools are also embracing non-traditional communities of data scientists, touting speed and ease of use rather than programming environment as selling points, though R interfaces are increasingly viewed as important features.

For some of these projects, survival seems assured, through ample funding and acceptance by the open source infrastructure. Spark, for example, which boasts in-memory processing capabilities that can be 100 times faster than Hadoop MapReduce, was initially developed as part of a $10-million National Science Foundation award to the Berkeley AMPLab and granted top-level status by the Apache Software Foundation in 2014. The developers of Spark (which includes the MLbase machine learning stack) cite rapid growth in the number of people trained on it; AMPLab co-director Ion Stoica estimates 2,000 received training in 2014 and estimates 5,000 will receive training in 2015. The first Spark Summit, held in December 2013, had 450 attendees and a second, in June 2014, had 1,100.

"Today, Spark is the most active big data open source project in terms of number of contributors, in number of

Milestones

Oxford Celebrates Lovelace's 200th Birthday

This year, the University of Oxford will celebrate the 200th anniversary of the birth of computer visionary Ada Lovelace.

The centerpiece of the celebration will be a display at the University of Oxford's Bodleian Library (to run Oct. 29 through Dec. 18) and a symposium on Dec. 9–10 presenting Lovelace's life and work, and contemporary thinking on computing and artificial intelligence.

Ada, Countess of Lovelace (1815–1852), is best known for an article she wrote about Charles Babbage's unbuilt computer, the Analytical Engine, that presented the first documented computer program, designed to calculate Bernoulli numbers, and explained the ideas underlying Babbage's machine—which similarly underlie every computer and computer program in use today. Lovelace also wrote about computers' creative possibilities and limits; her contribution was highlighted in one of Alan Turing's most famous papers, "Computing Machinery and Intelligence," as "can a machine think?"

The Symposium will be aimed at a broad audience interested in the history and culture of mathematics and computer science, and will present current scholarship on Lovelace's life and work, and link her ideas to contemporary thinking about computing, artificial intelligence, and the brain. Confirmed speakers include Lovelace biographer Betty Toole, computer historian Doron Swade, mathematician Marcus du Sautoy, and graphic novelist Sydney Padua.

Other activities will include a workshop for early career researchers, a "Music and Machines" event, and a dinner at Oxford's Balliol College on Dec. 9, the eve of Lovelace's 200th birthday.

Oxford's celebration will be led by the Bodleian Libraries and the University of Oxford's Department of Computer Science, working with colleagues in the Mathematics Institute, Oxford e-Research Centre, Somerville College, the Department of English, and TORCH.

For more information or to register your interest, please visit http://blogs.bodleian.ox.ac.uk/adalovelace/.

FURTHER CELEBRATION
On Oct. 16, Oxford's Somerville College will host the Ada Lovelace Bicentenary: Celebrating Women in Computer Science, a celebration of the life of the mathematician and scientific visionary, and of women in science. The event will take place in the Flora Anderson Hall of Somerville College. Speakers will include:
• Cecilia Mascolo, professor of Mobile Systems at University of Cambridge
• Jennifer Widom, Fletcher Jones Professor and chair of the Computer Science Department, Stanford University
• Ulrike Sattler, professor of Computer Science, University of Manchester
• Ursula Martin CBE, professor of Computer Science, University of Oxford
• James Essinger, author of Ada Lovelace, A Female Genius: How Ada Lovelace started the Computer Age (2013)

Somerville College also will exhibit materials from Ada Lovelace's literary life. Somerville College's Ada Lovelace Bicentenary Celebration will be chaired by Mason Porter, tutor in Applied Mathematics at Somerville College.

For more information about attending Somerville College's Ada Lovelace Bicentenary Celebration, please contact Barbara.raleigh@some.ox.ac.uk.
—Lawrence M. Fisher


commits, number of lines of code—in whatever metric you are using, it's several times more active than other open source projects, including Hadoop," Stoica says.

To help attract data scientists who have not traditionally been users of machine learning at scale, the Spark team has developed features including:
• Spark's MLlib machine learning library already ships with application programming interfaces (APIs) such as a pipelines API, which streamlines the sequence of data pre-processing, feature extraction, model fitting, and validation stages by a sequence of dataset transformations. Each transformation takes an input dataset and outputs the transformed dataset, which becomes the input to the next stage.
• Spark's developers have also been perfecting its R interface. Introduced in January 2014, early versions of SparkR supported a distributed list-like API that maps to Spark's Resilient Distributed Dataset (RDD) API. More recently, says SparkR developer Shivaram Venkataraman, the team is close to releasing a DataFrame API that will allow R users to use a familiar data frame concept, but now on large amounts of data. Venkataraman says the developers also have some work in progress in SparkR that will allow R users to call Spark's machine learning algorithms by using the pipelines. At a high level, Venkataraman says, the interface will allow R users to use specific columns of the DataFrame for machine learning and parameter tuning.
• Franklin also says the MLbase project, still under development, is going one step further to democratize machine learning over big data by allowing users to specify what they want to predict and then having the underlying system determine the best way to accomplish that using MLlib and the rest of the scalable Spark infrastructure.

A much smaller project than Spark, called mlpack, also targets those not experienced with machine learning. The project's "primary maintainer," Ryan Curtin, a Georgia Institute of Technology doctoral student, says the project, which launched its 1.0 version in December 2011, now includes five or six consistent developers, 10 to 12 occasional contributors, and perhaps 20 who may send in one contribution and then move on.

Curtin says mlpack appeals to non-experts through a consistent API that provides default parameters that can be left unspecified. Additionally, a user could move from one method to another while expecting to interact with the new method in the same way. Expert users may use its native C++ environment to customize their work.

Curtin also says that, though mlpack and Spark's target audiences may differ, the tool can be very useful for operational tasks in areas such as the burgeoning field of data-driven population health management. A data analyst for a public health department or healthcare delivery system, for example, could use one of mlpack's clustering algorithms on a set of numerical values that include fields such as a person's name and location (both mapped to a granular numerical format) and vital health data such as blood glucose and blood pressure, to predict where clinicians will have to deliver more intense care.

"There are a couple of clustering algorithms in mlpack you could use out of the box, like k-means and Gaussian Mixture Models, and there are enough flexible tools inside mlpack that if you really want to dig deep into it and write some C++, you can really fine-tune the clustering algorithm or write a new or modified one."

While mlpack research shows benchmarks significantly faster than those of other open source platforms such as Shogun (which originally focused on large-scale kernel methods and bioinformatics) and Weka (a popular general-purpose machine learning platform with a graphical user interface that may appeal to those uncomfortable with command line interfaces) in a test of k-nearest neighbors, Curtin says ultimately, the prevailing vibe among the community is cooperative more than competitive.

"We all sort of have the same goal," he says, "but there's an understanding the Shogun guys built their own thing. Shogun focuses on kernel machines and support vector machines and classifiers of that ilk. mlpack focuses more on problems where you're comparing points in some space, like nearest neighbor."

That sense of community may prove paramount for small projects such as mlpack. While Curtin says the growth curve of mlpack's downloads is becoming exponential, he has limited time to develop features that might spur growth even further among non-expert users, such as designing a GUI and automatic bindings for other languages.

Those things, in fact, may be for somebody else to do, says Curtin. "My interest is less in seeing mlpack as itself grow, although that's nice," he says. "I want to see the code get used more than anything else. That's what brought me to open source in the first place; building things that are useful for people."

Further Reading

Matloff, N.
The Art of R Programming: A Tour of Statistical Software Design, No Starch Press, San Francisco, CA, 2011.

Pace, L.
Beginning R: An Introduction to Statistical Programming, Apress, New York, NY, 2012.

Wickham, H.
The Split-Apply-Combine Strategy for Data Analysis, Journal of Statistical Software 40 (1), April 2011.

Sparks, E., Talwalkar, A., Smith, V., Kottalam, J., Pan, X., Gonzalez, J., Franklin, M., Jordan, I., and Kraska, T.
MLI: An API for Distributed Machine Learning, International Conference on Data Mining, Dallas, TX, Dec. 2013.

Curtin, R., Cline, J., Slagle, N.P., March, W., Parikshit, R., Mehta, N., and Gray, A.
MLPACK: A Scalable C++ Machine Learning Library, Journal of Machine Learning Research 14, March 2013.

Gregory Goth is an Oakville, CT-based writer who specializes in science and technology.

© 2015 ACM 0001-0782/15/07 $15.00

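The pipelines API described above chains stages so that each transformation consumes the dataset produced by the previous one. A minimal language-agnostic sketch of that pattern in plain Python (illustrative only; this is not the actual MLlib API, and the stage names are invented):

```python
def make_pipeline(*stages):
    """Chain transformation stages: each stage takes a dataset and
    returns a new dataset, which becomes the input to the next stage."""
    def run(dataset):
        for stage in stages:
            dataset = stage(dataset)
        return dataset
    return run

# Toy stages mirroring pre-processing -> feature extraction -> model fitting.
def drop_missing(rows):      # pre-processing: discard incomplete rows
    return [r for r in rows if None not in r]

def scale(rows):             # feature extraction: rescale values
    return [[x / 10.0 for x in r] for r in rows]

def fit_means(rows):         # "model fitting": per-column means
    return [sum(col) / len(col) for col in zip(*rows)]

pipeline = make_pipeline(drop_missing, scale, fit_means)
print(pipeline([[10, 20], [None, 5], [30, 40]]))  # [2.0, 3.0]
```

The appeal of the pattern is that validation and parameter tuning can treat the whole chain as one unit, rather than hand-wiring the output of each stage into the next.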

Society | DOI:10.1145/2771297 Gregory Mone

The New Smart Cities
How urban information systems are slowly revamping the modern metropolis.

More than half of the world's population currently lives in or around a city. By the year 2050, the United Nations projects another 2.5 billion people could be moving to metropolises. As urban populations increase, the number of data-generating sensors and Internet-connected devices will grow even faster. Experts say cities that capitalize on all the new urban data could become more efficient and more enjoyable places to live. The big question now is how to make that happen.

Until recently, this seemed to be as simple as choosing the right out-of-the-box smart city solution from a big multinational corporation. "The overhyped promises from the big technology companies have kind of blown away," says Anthony Townsend, Senior Research Scientist at New York University's Rudin Center for Transportation Policy and Management. "Now you have city governments regrouping and developing comprehensive long-range visions of the role information technology will play in making their city better."

Chicago will be home to the Array of Things, a planned citywide installation of at least 500 sensor-packed boxes.

The leading urban centers are not placing their technological futures in the hands of a company or a single university research group. Instead, they are relying on a combination of academics, civic leaders, businesses, and individual citizens working together to create urban information systems that could benefit all these groups. The challenges are diverse and demanding, but a handful of new projects around the world are providing a glimpse of what a truly smart city could offer.

The Sensing City
In Chicago, Argonne National Laboratory computer scientist Charlie Catlett is leading a collaboration called the Array of Things, a planned citywide installation of at least 500 sensor-packed boxes. Catlett, who is also director of the Urban Center for Computation and Data, says the project began as an effort to better understand air quality downtown, but after conversations with city officials and other scientists, it expanded in scope.

The boxes will be approximately the height and width of a parking sign and attached to poles or other fixtures. Each unit will have 17 or 18 sensors, and although the final implementation has not been determined, the boxes will likely measure temperature, precipitation, humidity, air quality, and pedestrian flow. Microphones, for instance, could detect noise pollution or trucks idling in one spot for too long.

The data captured will be valuable to a variety of groups. Atmospheric scientists hoping to understand urban climate will have access to detailed information on air quality, temperature, and precipitation. Social scientists will be able to study pedestrian counts and flows to analyze how people move through urban spaces. City planners will be able to make more informed decisions about where to place new bus stops or how much road salt to apply to certain areas after a heavy snow. There is also the potential to use the data to fight gridlock. "The traffic controllers would like these on every corner," Catlett says.

Average citizens will benefit as well. Today, a runner equipped with a smartphone or a wearable fitness device can easily track her route, mileage, average heart rate, and total steps; now, imagine that gadget talking to the city itself. The Array of Things units could exchange data each time a wearable device communicating via a Bluetooth or Wi-Fi signal comes within range. "Our phones can keep track of so much," says Catlett, "but I'm not going to find out my exposure to particulate matter or carbon monoxide or ozone. This new device will tell me my exposure to various emissions; then, maybe I can use that to change my route or manage my exposure."

The city of Glasgow, as part of a project called Future City Glasgow,


is testing a similar technology: an installation of intelligent street lamps that will measure air quality, traffic, and pedestrian flow. Yet Glasgow is also depending on a broader range of sources to drive its transformation into a smarter metropolis. Recently, Glasgow announced the launch of its City Data Hub, a cloud-based, publicly accessible information store drawn from more than 400 datasets. The Hub provides data on traffic, pedestrian footfall in retail areas, parking spots at public garages, and more. Eventually, it could tap into other sources as well, such as smartcard readers at subway turnstiles. "We're creating an ecosystem of data providers throughout the city," says Colin Birchenall, the technology architect for Future City Glasgow.

Citizens First
In all these projects, organizers hope this ecosystem will include the citizens themselves. A city can only install so many smartlights or lamppost-mounted units, but almost every resident will be carrying a sensor-packed smartphone. Tapping into that data will provide a whole new layer of information. For citizens to volunteer this information, though, there has to be a clear value proposition.

As an example, Birchenall cites Glasgow's new cycling app, which helps riders plan routes, easily locate bicycle racks, and more; the app requests the right to gather data on a user's trips. There is no immediate benefit, but Birchenall says cyclists have opted in because of the potential long-term benefits. The idea is that if users do volunteer their data, then city planners will be better able to analyze cycling traffic, and potentially improve urban infrastructure by adding racks or widening lanes.

This approach is also behind certain aspects of the Quantified Community project, a collaboration between real estate developers, technology and design firms, and researchers at New York University's Center for Urban Science and Progress (CUSP). As the developers revamp and rebuild several derelict city blocks in New York, CUSP researchers are working with them to install a range of smart systems. There will be sensors that operate on the large scale, tracking waste disposal and recycling, but the researchers also hope residents will play an active role as willing sources of data.

CUSP deputy director Constantine Kontokosta points to the installation of sensors that track energy and water consumption down to individual outlets and faucets. The data will be anonymized, and while residents will still be able to opt out, Kontokosta suspects this information will prove appealing. "The value is clear," he says. "We're saying that if you let us collect information about how much energy and water you're consuming, then we will analyze the data, and we'll tell you about others so you can gauge how you're doing." The residents might be inspired to change their behavior as a result, and perhaps use less water or energy.

The Quantified Community and the Array of Things projects hope to track pedestrian movement as well, but in both cases, the researchers have designed the systems to preserve privacy. If an Array of Things unit was measuring pedestrian flow, and you were to walk past with a Bluetooth-enabled smartphone, the device might register that signal, but it would not store the unique signature from your phone; instead, it would simply interpret that signal as the presence of a person and delete the private data, so there would be no record that you, specifically, were there. "Our plan is to count the number of unique addresses that we can see, then throw those addresses away and report only the count," Catlett explains. "This is meant to be a public data utility, so we have to think hard about how to protect people's privacy."

Open Data, Endless Possibilities
This public utility approach is one of the common threads linking the different smart city projects. The Quantified Community project will publicly release all of its data. Similarly, the Array of Things units are designed to dispatch information to Argonne National Laboratory for quality control and calibration, then on to the City of Chicago for mass dissemination through its open data portal.

In Glasgow, the City Data Hub makes all the information gathered easily accessible to local residents. The city is counting on them to make use of it, too, by working closely with urban informatics groups at local universities, encouraging businesses to use the data on pedestrian density to determine where to establish new shops, and calling for enterprising citizens to come up with new solutions on their own. "It's an opportunity to stimulate innovation within the city," says Birchenall. "You don't have to rely on the government. If the data is open, anyone could try to solve the city's problems through technology."

The flood of information could also create new challenges, according to Pete Beckman, director of the Exascale Technology and Computing Institute at Argonne and a leader in the Array of Things project. "We'll have so much data about our smart cities, but will we have the operating system?" asks Beckman.

Overall, the changes to these smart cities will be incremental, whether they are driven by researchers like Beckman and Catlett, urban planners, private innovators, or some combination. "You're not going to wake up one day and live in the city of the future," says Townsend. "Cities are systems of systems. It takes time to bring about large-scale change, but in the future I think we will be living in cities that fundamentally operate differently."

Further Reading

Townsend, A.
Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia, W.W. Norton, 2013.

Li, T., Keahey, K., Sankaran, R., Beckman, P., and Raicu, I.
A Cloud-based Interactive Data Infrastructure for Sensor Networks, IEEE/ACM Supercomputing/SC, 2014.

Shin, D.H.
Ubiquitous City: Urban Technologies, Urban Infrastructure, and Urban Informatics, The Journal of Information Science, 2009.

Carvalho, L.
Smart Cities from Scratch? A Socio-Technical Perspective, The Cambridge Journal of Regions, Economy and Society, 2014.

Future City Glasgow: A Day in the Life, http://bit.ly/1GMuHwP

Gregory Mone is a Boston, MA-based science writer and the author of the children's novel Fish.

© 2015 ACM 0001-0782/15/07 $15.00

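Catlett's count-then-discard policy for pedestrian sensing can be sketched in a few lines. This is a hypothetical illustration, not Array of Things code, and the short address strings are invented stand-ins for Bluetooth/Wi-Fi device identifiers:

```python
def count_unique(sightings):
    """Count distinct device addresses observed in one interval, then
    discard them so only the aggregate count leaves the sensor."""
    seen = set(sightings)    # de-duplicate repeated sightings of one device
    count = len(seen)
    seen.clear()             # throw the addresses away; keep only the count
    return count

# One interval of repeated sightings at a single sensor node.
interval = ["aa:01", "aa:01", "bb:02", "cc:03", "bb:02"]
print(count_unique(interval))  # 3
```

The privacy property comes from what is persisted, not from the counting itself: the identifiers never leave the interval-local scope, so the only record of your walk past the unit is an anonymous increment.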

Awards | DOI:10.1145/2773558 Lawrence M. Fisher

ACM Announces 2014 Award Recipients
Recognizing excellence in technical and professional achievements and contributions in computer science and information technology.

ACM recently announced the recipients of its 2014 Software System Award, Distinguished Service Award, Outstanding Contribution to ACM Award, and Doctoral Dissertation Awards.

Software System Award
Mach, a pioneering operating system used as the basis for later operating systems, received the 2014 ACM Software System Award. Lead developers Rick Rashid of Microsoft and Avadis (Avie) Tevanian created an operating system whose innovative approaches to virtual memory management and microkernel architecture established a foundation for later operating systems on personal computers, tablets, and mobile phones.

The Mach operating system, a research project funded by the U.S. Defense Advanced Research Projects Agency at Carnegie Mellon University from 1983 through 1992, was based on innovative approaches to virtual memory management and microkernel architecture. Mach's influence reflects both significant commercial acceptance and substantial contributions to the concept of operating systems.

The Mach kernel forms the heart of the Apple iOS and OS X systems. Mach was at the core of NeXT's operating system, which Apple acquired and subsequently used as the basis of OS X and iOS. Mach's influence can also be traced to operating systems such as GNU Hurd, and UNIX systems OSF/1, Digital Unix, and Tru64 Unix.

Rashid, who graduated from Stanford University with degrees in mathematics and comparative literature, and who earned master's and doctoral degrees in computer science from the University of Rochester, founded Microsoft Research in 1991. Today, he is vice president and chief research officer of Microsoft's Applications and Services Group.

Tevanian, who holds a B.A. in mathematics from the University of Rochester and earned M.S. and Ph.D. degrees in computer science from Carnegie Mellon University, spent nearly 10 years at Apple and was a member of the company's leadership team, before becoming managing director of Elevation Partners, a private equity firm.

The ACM Software System Award honors an institution or individual for developing a software system that has had a lasting influence, reflected in contributions to concepts, in commercial acceptance, or both. The award is accompanied by a prize of $35,000. Financial support for the award is provided by IBM.

Distinguished Service, Outstanding Contribution to ACM
Jeannette Wing of Microsoft Research has been named recipient of the 2014 ACM Distinguished Service Award, and Dame Wendy Hall of the University of Southampton has been selected to receive the 2014 Outstanding Contribution to ACM Award.

Wing was cited for her advocacy of "computational thinking," which can be used to algorithmically solve complicated problems of scale; her leadership of the Computer and Information Science and Engineering (CISE) Directorate of the U.S. National Science Foundation; and her contributions to the field of computer science.

As corporate vice president of Microsoft Research, Wing oversees its core research laboratories around the world. Previously, she led the CISE Directorate of the NSF, and was the President's Professor of Computer Science at Carnegie Mellon University.

Wing earned S.B. and S.M. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), where she also received a Ph.D. in computer science.

The Distinguished Service Award is presented based on the value and degree of services to the computing community, including consideration of activities outside ACM and emphasizing contributions to the computing community at large.

The first ACM president from outside North America, Hall was cited "for guiding ACM to become a truly international organization, helping improve diversity within ACM, and working to increase ACM's visibility in scientific venues worldwide."

Hall initiated the establishment of ACM Councils in Europe, India, and China, and has focused on the education of upcoming computer science generations, promoting gender diversity and nurturing talent in computing from all corners of the world. A professor of computer science at the University of Southampton, U.K., Hall was a founding director of the Web Science Research Initiative to promote the discipline of Web Science and foster research collaboration between the University of Southampton and MIT.

A native of London, Hall received
erature and earned M.S. and Ph.D. Science Foundation (NSF), “and for B.Sc. and Ph.D. degrees in mathemat-
degrees in computer science from the drawing new and diverse audiences to ics from the University of Southamp-

22 COM MUNICATIO NS O F TH E ACM | J U LY 201 5 | VO L . 5 8 | NO. 7


news

ton, and a Master of Science degree in form computations on large clusters Milestones
computing from City College London. in a fault-tolerant manner. Zaharia
She has served as president of the Brit-
ish Computer Society, and since 2014,
implements RDDs in the open source
Apache Spark system, which matches
NAS
has served as a commissioner for the
Global Commission on Internet Gov-
or exceeds the performance of spe-
cialized systems in many application
Elects 5
ernance. In 2009, she was appointed
Dame Commander of the Order of the
domains, achieving speeds up to 100
times faster for certain applications. Computer
British Empire.
Outstanding Contribution to ACM
Award recipients are selected based
It also offers stronger fault tolerance
guarantees and allows these work-
loads to be combined.
Scientists
on the value and degree of their ser- An assistant professor in the Mas- Five computer scientists were
vice to ACM. sachusetts Institute of Technology among those recently elected
to the National Academy of
“The computing field has benefited Computer Science and Artificial Intel- Sciences (NAS) in recognition
immensely from the passion and ener- ligence Laboratory, Zaharia complet- of distinguished, continuing
gy of these two world-leading comput- ed his dissertation at the University of achievements in original
er scientists,” said ACM President Alex- California, Berkeley, which nominat- research.
The new members are:
ander L. Wolf. “Wing recognized early ed him. Zaharia received a Bachelor of ! Robert E. Kahn, president/
on that the ever-increasing impact of Mathematics degree from the Univer- CEO of the Corporation for
computing in modern society must be sity of Waterloo, where he won a gold National Research Initiatives,
Reston, VA;
accompanied by a language in which medal at the ACM International Col- ! Jitendra Malik, Arthur J.
we can discuss computing’s intellectu- legiate Programming Contest in 2005. Chick Professor of Electrical
al underpinnings with those not versed He is a co-founder and chief technol- Engineering and Computer
in its technical complexities. ogy officer of Databricks, the company Sciences at the University of
California, Berkeley; and
“Hall provided leadership and in- commercializing Apache Spark. ! Moshe Y. Vardi, Karen Os-
spiration at a time when the comput- ACM’s annual Doctoral Disser- trum George Distinguished Ser-
ing discipline exploded onto the inter- tation Award comes with a $20,000 vice Professor in Computational
Engineering at Rice University,
national scene, promoting ACM as the prize. Financial sponsorship of the
Houston, TX, and editor-in-chief
foremost association of computing award is provided by Google Inc. of Communications.
professionals worldwide.” Honorable Mention for the 2014 Joining the ranks of NAS
ACM Doctoral Dissertation Award foreign associates were:
! Manindra Agrawal, N. Rama
Doctoral Dissertation went to John Criswell of the Univer- Rao Chair Professor in the depart-
Matei Zaharia has been sity of Rochester, and John C. Duchi ment of computer science and
PHOTOS COURTESY OF M ASSACHU SET TS INSTIT UT E OF TECH NOLOGY ( MATEI ZA H ARIA) , STA NFORD UNIV E RSI T Y (JOHN D UCH), JOHN CRI S W E LL

selected to receive of Stanford University, who share a engineering of the Indian Institute
ACM’s 2014 Doctoral $10,000 prize, with financial sponsor- of Technology, Kanpur, India; and
! Kurt Mehlhorn, director of
Dissertation Award for ship provided by Google Inc. the Max Planck Institute for Infor-
his innovative solution Criswell’s disserta- matics at Saarbrücken, Germany.
to tackling the surge in tion, “Secure Virtual Ar-
data processing work- chitecture: Security for IBM’S GENTRY NAMED
MACARTHUR FELLOW
loads and accommodating the speed Commodity Software Among the newest Fellows
and sophistication of complex multi- Systems” (http://bit. named by the MacArthur
stage applications and more interac- ly/1IjTPxn), describes Foundation is Craig Gentry,
a research scientist in the
tive ad hoc queries. His work proposed a compiler-based in- Cryptography Research Group at
a new architecture for cluster comput- frastructure designed to address the IBM Thomas J. Watson Research
ing systems, achieving best-in-class challenges of securing systems that Center in Yorktown Heights,
performance in a variety of workloads use commodity operating systems like NY, recognized for “fueling a
revolution in cryptography and
while providing a simple program- UNIX or Linux. theoretical computer science
ming model that lets users easily and Duchi’s disserta- through his elegant solutions
efficiently combine them. tion, “Multiple Op- to some of the discipline’s most
challenging open problems.”
To address the limited processing timality Guarantees
Gentry published a plausible
capabilities of single machines in in Statistical Learn- candidate construction of a
an age of growing data volumes and ing” (http://bit. fully homomorphic encryption
stalling process speeds, Zaharia de- ly/1DI4L2C), explores scheme, making it possible to
perform arbitrary computations
veloped Resilient Distributed Datas- trade-offs that occur on data while encrypted and
ets (RDDs). As described in his disser- in modern statistical and machine without needing a secret key.
tation “An Architecture for Fast and learning applications. In addition, Gentry and his
General Data Processing on Large colleagues published a plausible
Lawrence M. Fisher is Senior Editor/News for ACM
candidate multilinear map, and
Clusters” (http://bit.ly/1ENKA9e), used those findings to present
Magazines.
RDDs are a distributed memory ab- the first example of cryptographic
straction that lets programmers per- © 2015 ACM 0001-0782/15/07 $15.00 software obfuscation.

JU LY 2 0 1 5 | VO L. 58 | N O. 7 | C OM M U N IC AT ION S OF T HE ACM 23
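The fault-tolerance idea behind the Resilient Distributed Datasets described above is lineage: rather than replicating intermediate data, an RDD records how each partition was derived from its parent, so a lost partition can simply be recomputed. The following is a toy, single-process sketch of that idea in plain Python; the `ToyRDD` class and its methods are invented for illustration and are not Apache Spark's API.

```python
# Toy illustration of the lineage idea behind Resilient Distributed
# Datasets (RDDs). Each dataset remembers how it was derived from its
# parent, so a lost partition is recomputed on demand instead of being
# restored from a replica. Single-process sketch, not Apache Spark.

class ToyRDD:
    def __init__(self, parent=None, transform=None, partitions=None):
        self.parent = parent        # lineage: the dataset this one derives from
        self.transform = transform  # function applied to each parent partition
        self._cache = partitions    # materialized partitions (entries may be lost)

    @classmethod
    def from_data(cls, data, num_partitions):
        size = -(-len(data) // num_partitions)  # ceiling division
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        return cls(partitions=parts)

    def map(self, fn):
        # Record the transformation in the lineage graph; compute lazily.
        return ToyRDD(parent=self, transform=lambda part: [fn(x) for x in part])

    def num_partitions(self):
        rdd = self
        while rdd._cache is None:
            rdd = rdd.parent
        return len(rdd._cache)

    def get_partition(self, i):
        # Use the cached partition if present; otherwise rebuild it from
        # the parent via the recorded transformation (fault recovery).
        if self._cache is not None and self._cache[i] is not None:
            return self._cache[i]
        return self.transform(self.parent.get_partition(i))

    def collect(self):
        return [x for i in range(self.num_partitions())
                for x in self.get_partition(i)]


base = ToyRDD.from_data(list(range(10)), num_partitions=2)
squares = base.map(lambda x: x * x)

# Materialize the derived partitions, as a cluster node might cache them.
squares._cache = [squares.get_partition(i) for i in range(2)]

# Simulate a failure that loses one cached partition...
squares._cache[1] = None

# ...the next read transparently recomputes it from lineage.
assert squares.collect() == [x * x for x in range(10)]
```

In Spark itself the partitions live on different cluster nodes and the lineage graph covers many transformation types, but the recovery step is the same recompute-from-lineage logic shown in `get_partition`.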
viewpoints

DOI:10.1145/2770890 Pamela Samuelson

Legally Speaking
Anti-Circumvention Rules Limit Reverse Engineering
Considering some of the requested exceptions to technical protection mechanisms.

Until the U.S. Congress passed the Digital Millennium Copyright Act (DMCA) in 1998, reverse engineering of computer programs and other digital works was widely regarded as lawful in the U.S. The DMCA changed the law because the entertainment industry feared clever hackers could and would bypass technical protection measures (TPMs) that the industry planned to use to protect their copyrighted works from unauthorized copying and dissemination. The industry persuaded Congress to make it illegal to circumvent TPMs and to make or offer circumvention tools to the public.

Circumvention of TPMs is, of course, a form of reverse engineering. This activity is now illegal not only in the U.S., but also in most of the rest of the world unless there is a special exception that permits circumvention-reverse engineering for specific purposes under specific conditions.

The DMCA rules, for instance, include exceptions for law enforcement, intelligence, and national security purposes, for making software interoperable, and for encryption and computer security research under certain conditions.

In response to expressions of concern that the anti-circumvention rules might have detrimental effects on the ability to make fair and otherwise lawful uses of technically protected digital content, Congress created a triennial rulemaking process that enables affected persons to request special exceptions to the anti-circumvention rules to engage in specified legitimate activities that TPMs are thwarting.

In November 2014, the Copyright Office received more than 40 proposals for special exceptions to the DMCA anti-circumvention rules. In February 2015, the Office received detailed comments explaining the rationales for the proposed exceptions. In March, opponents had the opportunity to express their objections to the proposed exceptions. In late May, the Office held hearings to allow proponents to offer further arguments in support of proposed exemptions and opponents to rebut those arguments. Thereafter the Copyright Office will review the record, hold some hearings, and ultimately issue rules that will either grant or deny the requested exceptions. This column provides an overview of the requested exceptions and delves into some proposals that may be of interest to computing professionals.

Overview of Submissions
Approximately half of the proposed exceptions aim to enable interoperability with devices or software that the anti-circumvention rules arguably make illegal. Some submissions argue for exceptions to allow bypassing TPMs for purposes of repair and modification of software in vehicles. A few ask for broader exceptions for computer security research purposes.

Several proposed exceptions aim to overcome impediments the anti-circumvention rules pose for creating multimedia e-books, other educational materials, documentary films, and remixes of technically protected works. Two submissions request exceptions for bypassing TPMs to provide assistive technologies for print-disabled persons so they can, for example, have access to digital books in alternative formats.

One submission asks for an exception to enable consumers to continue to use videogames they have purchased after the games' makers have stopped providing support for the games. Another submission seeks to enable space-shifting of DVD movies. Two others want to make broader personal uses of technically protected works. All submissions for this year's triennial review can be found at http://copyright.gov/1201.

Missing from the triennial review in 2014–2015 is a proposed exception to allow bypassing of TPMs to "unlock" cellphones so their owners can access alternative wireless networks. Even though the Copyright Office denied a requested exception to enable this activity in the last triennial review, Congress passed a special law in 2014 that granted an exception for this legitimate activity, a sensible result given that cellphone unlocking poses no threat of copyright infringement.

Interoperability
Because the DMCA rules have an interoperability exception, it may seem puzzling that so many of the proposed exceptions to the anti-circumvention rules address interoperability issues. The existing exception permits reverse engineering of technically protected software "for the sole purpose of identifying and analyzing those elements of the program that are necessary to achieve interoperability of an independently created computer program with other programs." The information obtained thereby can only be used or disseminated to others for interoperability purposes.

Does that exception permit circumvention for purposes of enabling consumers to use computer tablets or wearable computing devices to access alternative wireless networks or to access mobile hotspots? The Rural Wireless Association fears it does not, so it is seeking exemptions for these kinds of activities. Unfortunately, the cellphone unlocking exception passed by Congress does not extend to these devices. Yet these uses would seem to pose no threat of copyright infringement to justify outlawing this type of circumvention of TPMs.

Another interoperability exception being sought is Public Knowledge's effort to enable bypassing of TPMs that makers of 3D-printing devices have embedded in their software to stop unauthorized firms from competing in the supply of feedstock to owners of their 3D printers. Competition policy would seem to support the grant of this exception, which also poses no threat of copyright infringement.

Two other interoperability exceptions focus on bypassing TPMs to enable consumers to have more choices on the applications that can run on their devices. One submission asks for an exception so owners of Linux operating system computers can watch lawfully purchased DVD movies. Another submission requests an exception so owners of videogame consoles can bypass TPMs that limit the applications that can run on those consoles.

Computer Security
Computer researchers Steve Bellovin, Matt Blaze, Ed Felten, Alex Halderman, and Nadia Heninger submitted a request for a computer security testing exception that would permit bypassing TPMs to access computer software and databases embodied in various technologies to test for vulnerabilities, malfunctions, and flaws.

Among the types of software systems in devices these researchers envision testing are: insulin pumps, pacemakers, car components (including braking and acceleration systems), controls for nuclear power plants, smart grids, and transit systems, as well as smart technologies for the home. These researchers argue that such systems are very important for the health and safety of their users and of the public at large. Malfunctions, flaws, and vulnerabilities may cause considerable harms to individuals and to the public, so good faith testing is a public good. It too poses no threat of infringement, which was the principal justification for adoption of the anti-circumvention rules in the first place.

There is an existing computer security exception in the DMCA, but it requires advance permission of the owner of the computing system being tested and seems to limit the dissemination of results of security testing to the owner of that computing system.

The Bellovin submission wants computer security researchers to be able to test vulnerabilities without getting advance permission. The researchers also want to be able to disseminate their research results in responsible ways, such as by presentation of research results at conferences and in journal publications. The DMCA rules now contemplate that a copyright owner in technically protected software could enjoin dissemination of research results. This has had a chilling effect on the research that can be done to test the security of a wide variety of computing systems.

What Will the Office Do?
If the past is any predictor of the future, chances are quite high the Copyright Office will eventually deny the overwhelming majority of the requested anti-circumvention exceptions, no matter how harmless they might seem.

Some proposals will likely be rejected because the Office believes proponents failed to prove that TPMs are actually an impediment to lawful uses of copyrighted works; it is not enough to assert that TPMs might impede legitimate activities.

Some proposals may be denied because the Office perceives the requested exception will enable infringing uses. Eldridge Alexander, for instance, is unlikely to get an exception so he can bypass CSS to create a software library of his DVD movies because bypassing CSS would also enable infringing uses of the movies.

The Office may dismiss some requested exceptions as unnecessary because it perceives there are other ways to achieve the stated objective (for example, video capture of images from movies for educational or critical uses rather than bypassing the TPMs).

Even those exceptions the Office grants may be more restrictive as granted than as requested. During the last triennial review, for example, the Office was willing to grant an exception for film studies professors to bypass CSS to show clips from movies to illustrate filmmaking techniques. However, the Office did not recognize that many other types of instructors could benefit from an exception that enabled them to make fair use clips of movies to illustrate other types of lessons.

During the current triennial review, the Authors Alliance (of which I am a co-founder) has proposed an exception for multimedia e-books that would, for instance, enable me to show clips from various James Bond movies so that my students could consider whether James Bond is an "idea" or an "expression" under copyright law, an issue that has been litigated in some U.S. cases.

Will the Office recognize the validity of the interoperability and computer security testing exceptions being sought? One can certainly hope so. However, without a team of technologists to analyze the submissions and advise the Office about the exception proposals, there is reason to worry the Office will regard these exceptions skeptically, especially if entertainment industry groups oppose them as they have in the past.

Conclusion
Congress should have adopted narrower anti-circumvention rules in the first place. Only circumventions that facilitate copyright infringement should be illegal. This would obviate the need for a triennial review process, and make reverse engineering of digital works far less risky than it is today. Over time, the anti-circumvention rules may perhaps be amended so that computer security and interoperability interests are better protected than they are now. Yet until that day comes, we should be grateful the triennial review process exists to provide a mechanism by which computing professionals, among others, can make the case for reverse engineering as a legitimate activity that serves the public interests in competition, ongoing innovation, and public health and safety.

Pamela Samuelson (pam@law.berkeley.edu) is the Richard M. Sherman Distinguished Professor of Law and Information at the University of California, Berkeley.

Copyright held by author.


DOI:10.1145/2770892 L. Jean Camp

Computing Ethics
Respecting People and Respecting Privacy
Minimizing data collection to protect user privacy and increase security.

People conflate privacy and security. For many there are no trade-offs, choices, or decisions: guarantee security and you guarantee privacy. But computer designers know it is more complicated than that. This column argues that starting with respect for people who desire privacy will help guide good security design. For example, to help mitigate the security threat of identity theft one wants to consider the loss of private information.

Good design practice is a responsibility. The ACM Code of Ethics requires that designers "respect the privacy of others" and provides two paragraphs of best practice. Many users are both busy and insufficiently proficient technically to watch out for themselves. They understand neither technical minutiae nor the basics of privacy. A recent survey asked about Internet public key infrastructure (PKI) certificates. Most people said they did not know what such certificates are, or thought that PKI certificates provide more protection than they do. Many thought PKI certificates ensure privacy, prevent tracking, provide legal accountability, or certify that a site is protected from "hackers."2 They confuse security goals and privacy, and seldom understand related risks well.

The survey included a range of respondents: shoppers at the Bloomington Farmers Market; people who attend Indiana University's "Mini University" (participants are active retirees who can be vulnerable and need to protect financial assets); attendees at the Dashcon convention for Tumblr enthusiasts (predominantly young people familiar with the Internet); and college students from psychology and computer science who are typically "digital natives." Privacy and security practices were not well understood by intelligent, educated, and cognitively flexible people, although those with significant expertise came closer to understanding the purpose of certificates.

Mobile device permissions can be even more confusing.3,4 Few people understand the routine level of geographic tracking (pinpointing the device's location). Intensive tracking might betray a pathological desire for data collection, but it might also be a result of the easiest default to set. Some designers have difficulty mastering permissions, and extensive data collection can be the consequence. I argue for data minimization, meaning collecting the least data needed to help ensure privacy and security. Failure to minimize data collection can be an ethical issue, but many developers fail to grasp its importance. Here, I offer three reasons to seek data minimization to protect privacy, and thereby security.

First, collecting and transmitting data exacerbates the risk of data exfiltration, a leading risk in security attacks. Attackers can subvert a computer or mobile phone to acquire credentials. An outward-facing mobile phone camera can allow reconstruction of sensitive office space. Attackers can examine an account before deciding how to exploit it.1 Traditional email attacks (for example, the stranded travelera) can be used on phone-based clients, and URLs for malware diffusion can be sent by short text messages (SMS), email, and other phone-based mechanisms. SMS is already used to obtain funds directly through per-charge messages. Reducing the amount of data that can be exfiltrated reduces both privacy and security risks.

Second, data minimization can reduce business risk. Flashlight apps that exfiltrate phone information to support advertisers can exfiltrate more data than needed. Too much exfiltration led to the removal of flashlight apps as iPhones and Android included free integrated flashlights. Exfiltration can lead to short-term profit but long-term losses to the companies involved. Apps can destroy markets when they weaken privacy controls. The app Girls Around Me used Foursquare data to display Facebook profiles of women near a location, including pictures. The app was making money but Foursquare blocked it because violating privacy lost users for Foursquare. An immensely popular and valuable application became a liability. Superfish on Lenovo computers hurt Lenovo; what Lenovo considered advertising others considered malware.

Data minimization can also reduce legal problems. Assistant U.S. Attorney General Leslie Caldwell of the Justice Department's Criminal Division said, "Selling spyware is not just reprehensible, it's a crime." Superfish on Lenovo computers is being investigated as wiretapping. The successful app StealthGenie has been considered a kind of spyware, and its CEO is now under indictment. Thinking in advance about privacy can help both designers and users.

Why do people accept privacy-violating products in the first place? Perhaps they do not care much about privacy or do not understand the implications. Or they care but accept the convenience in exchange for the risk of privacy violation. Perhaps they cannot use the tools intended to provide protection, so designers think users are protected but in fact they are not. There are good economic reasons to support those people who care about privacy. A market with perfect information enables price discrimination in which each person pays the amount he or she is willing to pay, no less and no more. Some customers will pay money for privacy, while some will pay in possible loss of privacy to avoid spending money. But the presence or lack of protections for privacy should be clear.

When people do not understand the privacy risks of their own choices, there is not only a business process failure, but also an ethical one. Usability analysis should indicate whether tools are effective for people trying to protect themselves. People can be bad at risk analysis, so care must be taken to help them understand risks. People might choose to take risks online for the same reasons they choose risks offline. Of course, designs cannot force people to make careful risk decisions. But designs often stop far short of that point. Lack of information about privacy risk can lead to consumer regret and unwillingness to reinvest. Or users might refuse products that carry too much risk to privacy, resulting in an untapped market.

Users expect computer designers to follow the code of ethics laid down about privacy. To the extent that participation in the network is a form of enfranchisement, poor design for privacy is a kind of disenfranchisement.

Privacy risks and loss should not be invisible. If someone trades privacy for pricing, that trade should be abundantly clear. People do not expect their televisions to listen to every word in the house. They want transparency and risk communication. Some users will ignore a manufacturer that includes hidden surveillance capability while others will be furious. Transparency is expected by the U.S. Federal Trade Commission and is common in many markets. The market for computing devices, from computers to mobile phones, should be one such market. This does not mean everyone has the same expectations of privacy. It merely means that customers should know about privacy risks, and be able to handle those risks as they see fit. Choices about privacy and security are important. The choices that designers provide should help surgeons, air traffic controllers, and other highly skilled individuals who have responsibilities for the security and privacy of others, as well as those less highly skilled (and less empowered), make sensible decisions about their privacy and security needs. When security and privacy technologies fail, those whose knowledge, role, and skills put them in the best position to prevent the failures bear much of the responsibility.

a The "stranded traveler" scam uses a subverted account to send a message to all account contacts asking for emergency help. The attacker pretends to be the account owner, claims to be desperately stranded, and requests money be sent immediately to a specific (attacker-owned) bank account.

References
1. Bursztein, E. et al. Handcrafted fraud and extortion: Manual account hijacking in the wild. In Proceedings of the 2014 Conference on Internet Measurement Conference, ACM, 2014.
2. Camp, L.J., Kelley, T., and Rajivan, P. Instrument for Measuring Computing and Security Expertise–TR715. Indiana University, Department of Computer Science Technical Report Series; http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR715.
3. Felt, A.P. et al. Android permissions: User attention, comprehension, and behavior. In Proceedings of the Eighth Symposium on Usable Privacy and Security, ACM, 2012.
4. Kelley, P.G. et al. A conundrum of permissions: Installing applications on an Android smartphone. Financial Cryptography and Data Security. Springer Berlin Heidelberg, 2012, 68–79.

L. Jean Camp (ljeanc@gmail.com) is a professor of informatics at Indiana University Bloomington.

Copyright held by author.


V
viewpoints

DOI:10.1145/2770894 David Anderson

Historical Reflections
Preserving the
Digital Record of
Computing History
Reflecting on the complexities associated with
maintaining rapidly changing information technology.

I
N CRE A S I N G LY, T H E SOU RCE ma-
terials on which historians
of computing work are either
born digital or have been digi-
tized. How these digital trea-
sures are preserved, and how continu-
ing access to them may be secured over
the long term are therefore matters of
concern to us. One of the key deter-
minants of whether this material will
remain accessible is the file format in
which the source materials are stored.
Choosing the “wrong” format will have
significant implications for the extent
to which files might be supported by
systems, automated tools, or workflow
associated with the digital content life
cycle processes in libraries, archives,
and other digital repositories. For this
reason, digital preservationists have
always taken a keen interest in file for-
mats, and are understandably motivat-
ed to maintain up-to-date information
on the characteristics of different for-
mats, and the effect these have on their
preservability over time.
A considerable amount of digital the iPRES conference, the most recent the zeitgeist of the field, and last year
material is now subject to mandatory of which was held in Melbourne, Aus- it was difficult not to be struck by the
deposit to national libraries, meaning tralia last year; the next conference amount of attention being paid to the
even data creators who have no inter- will be held this November in Chapel difficulties posed by file format issues,
est in preservation or long-term access Hill, N.C.a The iPRES conference al- and the extent to which a viable format
IMAGE BY WIKTO RIA PAW L A K

need to develop some understanding ways provides an opportunity to gauge registry could be created, which both
of the formats preferred, or deemed ac- responds to the needs of the preserva-
ceptable by repositories and why. a See http://www.digitalmeetsculture.net/article/
tion community and involves practi-
The principal annual gathering of international-conference-on-digital-preser- tioners meaningfully going forward.
digital preservationists takes place at vation-ipres-2015/. Format was prominent both in the pa-

JU LY 2 0 1 5 | VO L. 58 | N O. 7 | C OM M U N IC AT ION S OF T HE ACM 29
viewpoints

pers presented, and in the break-time conversations, some of which were quite animated. The problems caused by format obsolescence and interoperability affect all of us, whether we are scientists wanting to return to data generated by a specialized scientific instrument that no longer sits in the lab, or animators working on Hollywood blockbusters whose work must be done from scratch many times in a single movie because the studio wants to make use of features in the latest piece of bleeding-edge software. At a more mundane level, files created in version 1.0 of our favorite word processor are seldom fully compatible when we upgrade to later versions. Format obsolescence has the capacity to cut us off from the fruits of research results, whether publicly funded or developed by companies. Files, even when preserved perfectly, are useless if we do not have software that can make sense of them. The pace at which technology drives forward, and the lure of the new, means it is ever more difficult to ensure legacy files are kept fully accessible.

One of the complaints that came up was that “format,” despite being a term in common use, is, in fact, not well understood or agreed upon even within the “format” community. Specific problems arising as the result of particular “formats” include:
• Not having publicly available specifications;
• Having implementations that do not comply with published specifications;
• Having format specifications, or “usual” implementations, changed over time, and between different vendors, users, and others; and
• Producing a comprehensive mapping of possible formats for files (of any reasonable length) so as to be able to identify files of unknown format.

These, and similar, considerations give rise to a certain degree of consternation, and a general lack of confidence concerning the likelihood of succeeding in ‘imposing’ much by way of a single ‘understanding’ of format in general, or even a given individual format.

A common approach to identifying files, the format of which is unknown, is to examine the individual bytes (the bitstream) looking for characteristic patterns or ‘signatures’ associated with known formats. Tools such as DROID (Digital Record and Object Identification) have been developed to make this process easier, and form an important part of the digital forensics toolkit. Perhaps, therefore, we might avoid confusion over “formats” by concentrating our attention on the identifying characteristics of bitstreams. Indeed, there appeared to be no dissent from the view that file formats can be thought of as (usually nested) interpretations (or encoding schemes) of bitstreams. However tempting such linguistic legerdemain may be, there are at least two grounds for thinking this is not the right approach.

Abstraction is the process of establishing the level of complexity on which people interact with systems. For example, when we are writing code involving numerical operations we normally ignore the way in which numbers are represented in the underlying computer hardware (for example, 16-bit or 32-bit), concentrating entirely on “number.” On other occasions, these implementation (or lower-level) details are critical and we pay close attention to each “bit.”

Changing our level of abstraction down (for example, from “format” to “bitstream”) is a recognized way of getting around talking about difficult, disputed, or otherwise problematic concepts; similarly, changing a level of abstraction up is often very helpful in introducing illuminating concepts or organizational principles that not only are not apparent at a lower level but simply do not apply. The restricted applicability of concepts (they are tied to the level of abstraction—and in some sense define it) is often easier for us to notice in one direction of travel than the other.

For example, when speaking of “color depth” we are talking about how finely levels of color can be expressed in a given “format” (encoding scheme). It is not possible to talk meaningfully about color depth at the level of individual bits, or at the bitstream level. Color depth is a “higher”-level concept than it is possible to express at the level of bits; it requires an abstraction to a level where encoding schemes exist. This does not preclude us talking at the higher level about bits, of course, and indeed, that is how color depth is normally discussed—that is, how many bits a given encoding scheme devotes to representing the color of a particular pixel. “Color,” of course, is a higher-level concept again. There are no red bits, or green bitstreams. Talk of “color” does not belong there. It is important to have a clear sense of what concepts apply at each level of description if we are to avoid making “category mistakes.”b

b http://en.wikipedia.org/wiki/Category_mistake

It is difficult to see how, if we were to restrict ourselves to discussing bitstreams alone, we would continue to have at our disposal the set of concepts, such as “color depth,” that we require routinely.

A second set of difficulties flows from the disturbingly large problem space that results from understanding formats to be identifiable ways of encoding computer files. For example, a file of bit length L, where each bit may take 1 of V values, may potentially be encoded in V^L ways. So, a binary file, 8 bits long, is capable of being represented in 256 distinguishable ways and could, therefore, give rise to 256 different formats. A 64-bit file can support 18,446,744,073,709,600,000 different representations. Even if we exclude the notion of files that are all format and no “payload,” the problem space is not significantly reduced. The smallest “payload” the system I have just described can support is a single “payload” bit, the remainder of the file being taken up with encoding the format. Thus, a binary file 8 bits long (with a 1-bit “payload”) can be represented in 128 distinguishable ways and could, therefore, give rise to 128 different formats, and there could be exactly two different “payloads” representable in this scheme for each of the possible formats. Our 64-bit file (with a



1-bit “payload”) can similarly support 9,223,372,036,854,780,000 different representations/formats, any of which may represent one of two payloads. Suffice it to say the numbers involved are very large; so large, in fact, that if we insist on trying to wrestle with the whole theoretical problem space, expressed purely in terms of bitstreams, we are not likely to make much progress.

Some other limitations of bitstreams are also apparent. Files having exactly similar bitmap sequences (syntax) need not have the same format (semantics). Formats are encodings, and different vendors can impose different “meaning” on the same patterns. Syntax is not the same as semantics. It is simply not possible, in the abstract, to distinguish between a given 6-bit pattern followed by 2 bits of “payload” and a 7-bit pattern (coincidentally sharing the same initial 6 bits) followed by a 1-bit “payload.” How we interpret (encode) these 8 bits is a matter of choice, which is apt to change over time, place, and circumstance.

It is clear that if, for example, we aim to have a comprehensive mapping of possible formats for files (of any reasonable length), basing our work on bitstreams is not going to succeed. Abandoning talk of formats in favor of bitstream language, for all its intellectual/mathematical attractiveness, will not solve the problems with which we are faced or with which we should be attempting to grapple. I am therefore disinclined to go down the bitstream route.

Some of the problems we need to be addressing include:
• Categorization of formats;
• Modeling format; and
• Identifying the format of files whose format is currently unknown.

There is every reason to be hopeful that substantial progress can be made against each of these problems. In the biological sciences (which represent a much more complex and multifaceted domain than computer file formats), broadly similar kinds of problems arise:
• Categorization of life-forms;
• Modeling life-forms; and
• Identifying the genus and species (and other characteristics) of life-forms whose identity is currently unknown.

These problems have been substantially solved, and a movement now exists within the biodiversity informatics community to provide globally unique identifiers in the form of Life Science Identifiers (LSID) for all biological names. Three large nomenclatural databases have already begun this process: Index Fungorum, International Plant Names Index (IPNI), and ZooBank. Other databases, which publish taxonomic rather than nomenclatural data, have also started using LSIDs to identify taxa.

The solution did not, in the biological sciences, involve abandoning talk of genus or species in favor of DNA sequences (for example) but was built on a broadly Linnaean taxonomic approach. This is the sort of approach that might be employed in our more modest and easily tractable domain. Having decided which attributes of files are of interest (and of course there will be many, and the list will change over time) we can begin to group these into different categories. This is broadly analogous to the Life, Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species taxonomy that is familiar in biology. The biological approach has been perfectly able to withstand heated scientific dispute over the correct classification of individual life-forms, and it is by no means uncommon to see, for example, an insect being classified differently over time, as scientific understanding and technique have changed or developed. It is not necessary for us to get classification schemes right the first time, or for them to be frozen for all time, in order for them to gain widespread acceptance, and to confer significant benefits on the community.

By declining to speak of formats, in favor of speaking of bitstreams, we will not only fail to improve matters but may actually make intractable the format problems we need to tackle. Insofar as we do want to employ multiple levels of abstraction in our discussions of the overall problem space, we must be careful which concepts we deploy where, or we are likely to make category mistakes.

David Anderson (cdpa@btinternet.com) is the CiTECH Research Centre Director at the School of Creative Technologies, University of Portsmouth, U.K.

Copyright held by author.


DOI:10.1145/2770896 Phillip G. Armour

The Business of Software
An Updated Software Almanac
Research into what makes software projects succeed.

Software projects can be so complicated and so different from each other that
predicting whether they
will succeed or fail can be
as difficult as forecasting the weather
or picking winning stocks. Will the
project entirely fulfill its goals? Will it
deliver some value at a higher cost or
later than desired? Or will it just crash
and burn leaving the exhausted survi-
vors to lick their wounds, bury the dead
bodies, and shred the evidence?
Courageous efforts are being made
to collect and codify the data that is
available, to try to spot what trends
are occurring in the industry, and to
provide some useful guidelines for
managing the business of software.
The recently published QSM Software
Almanac, dubbed the “2014 Research
Edition,” is an example of this.
QSM Software Almanac
Quantitative Software Management (QSM) published the IT Metrics Edition of its Software Almanac in 2006.1 This highly detailed analysis of thousands of project data points was interesting reading and reached a few quite jaw-dropping conclusions. The 2014 version confirms much of the 2006 analysis and provides further insights. It begins by suggesting how software projects and organizations should be measured and what we can infer from these measurements.

Five Core Metrics
In software projects and organizations, the list of things-that-could-be-measured is daunting. The intricacy of projects, the differences in the types of systems being built, in project life cycles, in toolsets, in expertise and development environments all add to the variables operating in our business. But some simple core metrics, available from all projects and enterprises, can be as important as the measurement of temperature and blood pressure are to medical diagnosis. They may not diagnose all ailments, but they can provide insights into the health of projects and organizations.

These metrics2 are:
Schedule—the elapsed calendar time from a project start until completion of its mandate;
Effort or cost—which is usually driven by the number of staff employed during the project duration;
Functionality delivered or “system size”—this is the value component of the project, what is obtained for the cost and schedule expended;
Quality or defect level—the “anti-



“size” or that portion of the system that does not work properly on delivery; and
“Productivity”—the rate at which the resources of time, effort, and staff are turned into delivered functionality (minus defects).

Figure 1. Typical engineering project duration. [Chart: “Engineering Systems: Project Effort,” Requirements thru Initial Release; Effort (Staff Months) vs. Delivered System Size (LOC Equivalents).]

Figure 2. Typical engineering project effort. [Chart: “Engineering Systems: Project Duration,” Requirements thru Initial Release; Calendar Time (M) vs. Delivered System Size (LOC Equivalents).]

Figure 3. Typical engineering project average staff. [Chart: “Engineering Systems: Project Average Staff,” Requirements thru Initial Release; Average Staff vs. Delivered System Size (LOC Equivalents).]

The QSM Software Almanac uses these relatively simple metrics, backed up with more sophisticated measurements, to make some important points about modern systems development:

Projects seldom measure performance. Most projects in QSM’s database measure and report schedule, effort, and size, but only one-third of the projects actually use this data to assess their performance. This is a missed opportunity for improvement.

Sacrifice schedule first. Projects are willing to overrun schedules more than budget and are more willing to overrun budget than deliver less functionality to customers.

Big teams are bad. In this study we see repeatedly that larger teams cost a lot more, deliver lower-quality systems, but seldom save time. In one quoted example, cutting project staff from 100 people to 52 people saved $12 million but resulted in only a modest increase in schedule. Worst-in-class projects, as measured by poor performance and low quality, are strongly correlated with large team size, and best-in-class projects are strongly correlated with small team sizes. In addition, large-staffed projects tend to try to fix any problems they encounter by adding even more staff, whereas small-staffed projects usually try other, more effective, approaches. QSM’s data also showed a wide range of staffing for similarly sized systems; this may be driven by the common practice of throwing people at projects in trouble. Smaller-staffed projects showed between three and eight times greater productivity than larger-staffed projects delivering equivalent value(!).

Quality is getting better over time. This is especially true for engineering systems, which are getting larger, while IT systems are getting smaller and taking longer. This was shown for best-in-class projects; worst-in-class projects seldom collect quality data, but probably have worse quality.

Reuse does not usually work. “Minor enhancement” projects with high levels of expected reuse typically cost twice that of equivalent new development.

Agile advantage tops out around 30K line-of-code equivalents. The data shows that, at their present stage of evolution, Agile methods are best suited to smaller projects. This finding seems to support a general industry perception of Agile. But QSM also found that large Agile projects have

much more variable results, so it is likely the benefits of larger Agile projects depend a lot on how they are implemented. Also, larger Agile projects that experience problems are better off adding time than dropping functionality or adding staff.

Projects grow. Average functionality-to-be-delivered increases 15% over the life of a project, typically creating schedule overruns of 8% and cost overruns of 16%.

Some Industry Trends Over Time
The Almanac shows:
• The median project duration contraction that occurred during the 1990s and 2000s has leveled off to 10–11 months.
• Project effort is declining along with the delivered size of systems, but delivered size is varying more as time goes by.
• Median project team size is relatively stable.
• There is a strong trend toward more financial systems development in IT.
• Mixed-language development is the norm, but Java has become the most common development language.

Measurement Ideas
Some of the articles in the Almanac describe useful approaches for the measurement of projects. One article describes the use of Shewhart control charts (modified for Rayleigh mathematics) to calibrate defect discovery trends and use them for predictive purposes. Another covers four models for data mining that can be used for process improvement. A third article describes an empirical study to see what project and environmental factors most correlate with schedule performance on projects. For business applications these factors are overall technical system complexity and team communication complexity—an important finding for distributed or offshore teams.

Performance Benchmark Tables
The Almanac’s final set of tables can be used to determine if your project fits close to typical measured norms for business, engineering, or real-time systems development. Given a reasonable assessment of delivered system size, these tables would allow a project manager to determine typical values for project duration, effort/cost, and average staff in a minute or two, and avoid serious overcommitments in time, money, staff, or delivered functions (see figures 1–3 for typical engineering system ranges).

Lessons Learned
This study indicates some simple lessons:
• Collect the basic and practical core metrics on projects and use them to figure out what you can do, what you are doing, and how well you are doing it.
• Commit projects at levels around measured industry benchmarks in duration, effort, and staff.
• Avoid using large teams—they will probably be ineffective.
• Do not add people to a project that hits snags—it will likely make things worse; instead consider giving the project more time.
• Do not assume reuse will save you anything in time or effort.
• Manage larger Agile projects carefully so the benefits are not nullified by size and complexity.

It’s Free
The QSM Software Almanac 2014 Research Edition contains a great many more insights and detailed analyses of the copious amount of data collected. It is free for download from QSM at: http://bit.ly/1y5MahV.

References
1. Armour, P.G. Software, hard data. Commun. ACM 49, 9 (Sept. 2006), 15–17.
2. Putnam, L.H. and Myers, W. Five Core Metrics. Dorset House Publishing Co., 2003.

Phillip G. Armour (armour@corvusintl.com) is a vice president at Applied Pathways LLC, in Schaumburg, IL, and a senior consultant at Corvus International Inc., in Deer Park, IL.

Copyright held by author.

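The Almanac's growth figures lend themselves to a quick planning check. The following sketch is illustrative only; the function and the treatment of the reported averages as contingency multipliers are this editor's assumptions, not QSM's models. It applies the reported 15% functionality growth and 8%/16% schedule/cost overruns to a nominal plan:

```python
# Illustrative sketch (not a QSM model): apply the Almanac's reported
# averages -- 15% functionality growth, 8% schedule overrun, 16% cost
# overrun -- as contingency margins on a nominal project plan.

def with_almanac_margins(size_loc, schedule_months, effort_staff_months):
    """Return plan figures adjusted by the Almanac's average overruns."""
    return {
        "expected_size_loc": round(size_loc * 1.15),                       # +15% growth
        "expected_schedule_months": round(schedule_months * 1.08, 1),      # +8% schedule
        "expected_effort_staff_months": round(effort_staff_months * 1.16, 1),  # +16% cost
    }

plan = with_almanac_margins(size_loc=50_000, schedule_months=12, effort_staff_months=200)
print(plan["expected_size_loc"], plan["expected_schedule_months"], plan["expected_effort_staff_months"])
# 57500 13.0 232.0
```

A plan committed without such margins is, on the Almanac's own averages, already optimistic; the multipliers here simply make that bias explicit.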



DOI:10.1145/2770929 Juan E. Gilbert, Jerlando F.L. Jackson, Edward C. Dillon, Jr., and LaVar J. Charleston

Broadening Participation
African Americans in the U.S. Computing Sciences Workforce
An exploration of the education-to-work pipeline.

Broadening participation in computing has received a great deal of media coverage recently on diversity challenges in Silicon Valley.2,8 Major Silicon Valley technology companies, including Dell and Intel, have released employment data, and the lack of diversity has caused many to question their commitment.2 For example, Intel has committed $300 million over the next five years to improve the company’s workforce diversity.8 As employment data is released for the technology workforce in Silicon Valley and other technology hubs, an important question emerges: How are Ph.D.-granting computing departments doing regarding the representation of African Americans? In this column, we examine efforts to increase Ph.D. and faculty production for African Americans in computing sciences through the work of a new project funded by the National Science Foundation—Institute for African-American Mentoring in Computing Sciences (iAAMCS, pronounced “i am c s”). The data presented in this column is from the Computing Research Association (CRA) Taulbee Survey (see http://www.cra.org/resources/taulbee/)a and the NSF Survey of Earned Doctorates (SED) Tabulation Engine (http://ncses.norc.org).

a The CRA Taulbee Survey contains information on the enrollment, production, and employment of Ph.D.’s in computer science and computer engineering (CS and CE) and demographic data for faculty in CS and CE in North America. It also includes data on gender and ethnicity breakdowns. Known limitations of Taulbee Survey data include: reporting universities can vary from year to year and reporting universities are only a subset of all Ph.D.-granting institutions, but generally cover all the top producing institutions.

African American Ph.D. and Faculty Production Landscape
Table 1 contains data from 2003 to 2013 on computer sciences (CS) Ph.D. production for African Americans from the CRA Taulbee Survey and the NSF Survey of Earned Doctorates that uses data from the National Center for Science and Engineering Statistics (NCSES). Taylor and Ladner7 first reported differences in the CRA Taulbee Survey data and the WebCASPAR data. Therefore, this column will use data from both sources with respect to Ph.D. production to show the contrast. Both datasets show an increase in the raw number of African American CS Ph.D.’s produced; however, the total percentage distribution among Ph.D.’s has not

[Photo: A panel discussion at the 2015 International Consumer Electronics Show on the importance of accelerating diversity in the technology industry. Image courtesy of Intel.]


changed much. There is at least a 50% increase in the raw number of Ph.D.’s produced, but the total percentage distribution remains relatively flat. Clearly, the overall CS Ph.D. production has increased at a rate that limits increases in the percentage. However, does this increase in overall CS Ph.D. production among African Americans result in an increase in African American faculty?

Table 1. African American Ph.D. production 2003–2013 CRA Taulbee Survey and 2006–2012 NSF SED.

        Taulbee Survey              NSF SED NCSES
Year    Reported  Percent     Year    Reported  Percent
2003    10        1.30%
2004    12        1.50%
2005     9        0.50%
2006    18        1.40%       2006    20        1.38%
2007    19        1.20%       2007    37        2.24%
2008    22        1.50%       2008    36        2.01%
2009    17        1.30%       2009    33        2.05%
2010    17        1.30%       2010    37        2.22%
2011    16        1.20%       2011    39        2.28%
2012    27        1.80%       2012    40        2.17%
2013    22        1.50%

Table 2 captures the African American CS tenure-track faculty from 2003 to 2013 from the CRA Taulbee Survey. Table 2 only counts faculty in Ph.D.-granting departments that participate in the Taulbee survey. Broader surveys of African American faculty and their ranks do not seem to exist. This data shows more improvement compared to the Ph.D. production data. At all ranks, there is more than a 100% increase for African Americans in CS tenure-track faculty positions from 2003 to 2013.

Table 2. African American CS faculty 2003–2013 CRA Taulbee Survey.

Year    Full    Percent    Associate    Percent    Assistant    Percent
2003     6      0.40%      10           0.90%      16           1.40%
2004     9      0.60%       8           0.70%      24           2.00%
2005     7      0.40%      12           1.10%      23           1.90%
2006     8      0.50%      11           0.90%      26           2.30%
2007     8      0.50%      11           0.90%      21           2.10%
2008    14      0.70%      20           1.40%      21           2.00%
2009    10      0.50%      16           1.20%      22           2.50%
2010    11      0.60%      17           1.20%      24           2.90%
2011    12      0.60%      21           1.40%      23           3.00%
2012    16      0.80%      25           1.60%      26           3.30%
2013    16      0.80%      25           1.80%      25           3.50%

These numbers are promising, but there is still much work to be done. The highest concentration levels for assistant, associate, and full professors have been, respectively, only 3.50%, 1.80%, and 0.80%, which is very low. There is opportunity for improvement within the academy. So, why are these numbers so low with respect to African American CS tenure-track faculty and Ph.D. production? How can these numbers be improved?

Gibbs et al.1 conducted a study of 1,500 recent American BMS (biomedical sciences) Ph.D. graduates (including 276 underrepresented minorities) that examined career preferences over the course of their graduate training experiences. Their findings showed a disproportionately low interest among underrepresented minorities and women in pursuing an academic career at a research university upon completing graduate school compared to their White and Asian male counterparts. The article also stated that scientists from underrepresented backgrounds, which included American Indian/Alaska Native, Black/African American, Hispanic/Latino, or Native Hawaiian/Pacific Islander, earn 10% of life science Ph.D.’s and that number had remained unchanged since 1980, which is better than CS Ph.D. production for African Americans and all underrepresented groups combined.

Considering these numbers, it would appear there are many African Americans and other underrepresented minorities pursuing science and engineering Ph.D.’s; however, so few chose CS. Data from many decades ago suggests African Americans do not pursue science, technology, engineering, and mathematics (STEM) degrees because of cultural stigmas (including sentiments such as it’s not cool to be a scientist or scientists are nerds). Hager and Elton3 surveyed college freshmen and Sewell and Martin6 surveyed high school juniors. In these two studies, they found African American men expressed a greater interest in social service fields compared to White men, who prefer STEM disciplines. In general, African American college students are highly represented in disciplines such as education, humanities, and social sciences.5 Hall and Post-Kammer4 reported African Americans choose these disciplines because they have a cultural orientation and expectation to help others. Historically, STEM disciplines are generally not seen as disciplines that can be used to help others. Do these cultural stigmas still apply and are they the reason for the low representation of African Americans in CS Ph.D. programs? If so, one possible solution could be to show examples of how computing research can have an impact on society. That is, demonstrate how computing can change lives and improve the human condition, especially for African Americans. Connecting computing to solving human conditions that will likely impact communities of color is one of the primary goals of the iAAMCS.

Emerging National Resource for Diversifying Computing Sciences
The iAAMCS is an NSF Broadening Par-




ticipation in Computing (BPC) Alliance (see http://www.iaamcs.org). The mission of iAAMCS is to: increase the number of African Americans receiving Ph.D. degrees in computing sciences; promote and engage students in teaching and training opportunities; and add more diverse researchers into the advanced technology workforce. iAAMCS provides various activities that serve as interventions for increasing interest in computing sciences among underrepresented minorities with a particular focus on African Americans, as described here.

Faculty and student training. This activity consists of face-to-face workshops and webinars for African American (and other underrepresented) students and their advisors to provide a shared context for the completion of the research experience. The students received training in time management, managing expectations, and research processes (for example, literature review, source control, and so forth). The advisor’s training is based on documented differences between mentoring by effective and non-effective teachers of African American (and other underrepresented) students.

Academic year undergraduate research (AYUR). The purpose of AYUR is to increase the African American Ph.D. pipeline by introducing second-semester freshmen and sophomore African American students in computing to research. Specifically, the objectives of AYUR are to: develop undergraduate research competence in a computing area such as robotics, game programming, mobile applications, or other computing research area; increase critical inquiry and critical thinking associated with conducting research; increase active engagement through a research experience (presentations, competitions, and so forth); develop a repository of models of effective undergraduate research mentoring; and ultimately, increase the number of African American computing students with an interest in pursuing graduate degrees in CS. AYUR also encourages students to extend their research beyond this activity and pursue other research opportuni-

…to AYUR, DREU is an activity that provides research opportunities to African American (and other underrepresented) students in preparation for graduate school. This activity accepts applications from students and mentors who are then matched based on interests and backgrounds. During DREU, students complete a 10-week research experience that consists of several checkpoints in the process to ensure uniform expectations and outcomes. The DREU program is a collaboration between the CRA-W (Computing Research Association’s Committee on the Status of Women in Computing Research) and the CDC (Coalition to Diversify Computing).

Technical webinars and distinguished lecture series (DLS). DLS is an activity that provides targeted presentation interventions for African American students (and other underrepresented groups) in computing sciences. Topics include: computing research, academic faculty employment, research scientist positions, and other topics related to the benefits of getting a Ph.D. in computing sciences. Each presentation is an hour long with a period dedicated for the mentor leading the session to answer questions from attendees. These lectures are held on-site at the host institution that places the request for the lecture.

Distinguished fellows writing workshop (DFWW). DFWW serves as a platform for African American undergraduate and graduate students (and other underrepresented groups) to learn the process of writing a competitive application for summer internships, graduate school, and/or external funding. The targeted audience for this activity

Calendar of Events

July 4–8
Innovation and Technology in Computer Science Education Conference 2015,
Vilnius, Lithuania
Sponsored: ACM/SIG
Contact: Valentina Dagiene
Email: valentina.dagiene@mii.vu.lt

July 6–9
ISSAC ’15: International Symposium on Symbolic and Algebraic Computation,
Bath, UK
Sponsored: ACM/SIG
Contact: Stephen Alexander Linton
Email: steve.linton@st-andrews.ac.uk

July 6–July 10
LICS ’15: 2015 ACM/IEEE Symposium on Logic in Computer Science,
Kyoto, Japan
Co-Sponsored: Other Societies
Contact: Masahito Hasegawa
Email: hassei@kurims.kyoto-u.ac.jp

July 14–17
e-Energy ’15: The Sixth International Conference on Future Energy Systems,
Bangalore, India
Sponsored: ACM/SIG
Contact: Deva P. Seetharam
Email: deva.seetharam@gmail.com

July 16–17
SIGDOC ’15: The 33rd ACM International Conference on the Design of Communication,
Limerick, Ireland
Sponsored: ACM/SIG
Contact: Kathie Gossett
Email: kgossett@iastate.edu

July 20–24
ISSTA ’15: International Symposium on Software Testing and Analysis,
Baltimore, MD
Sponsored: ACM/SIG
Contact: Michal T. Young
Email: michal@cs.uoregon.edu

July 21–23
PODC ’15: ACM Symposium on Principles of Distributed Computing,
Donostia-San Sebastián, Spain
Co-Sponsored: ACM/SIG,
is junior and senior-level undergradu-
Contact: Chryssis Georgiou,
ties (for example, REUs). ates and first- and second-year stu- Email: chryssis@cs.ucy.ac.cy
Distributed research experiences dents as well as faculty that advise or
for undergraduates (DREU). Similar mentor these students.

JU LY 2 0 1 5 | VO L. 58 | N O. 7 | C OM M U N IC AT ION S OF T HE ACM 37
viewpoints

Troy Hill, from Winston Salem State University, won first place in the Robotics Simulation Competition at the ARTSI Robotics Competition at the 2014 Tapia Conference.

K–12 outreach. This activity provides opportunities for African American undergraduate and graduate students (and other underrepresented groups) to work with middle and high school students through computing awareness and exposure events, after-school programs, and summer camps. The goal of these programs is to provide opportunities for low-income and middle-class African American, Latino, and female students to explore computer science and develop programming and computational thinking skills. Through graduate and undergraduate participation in these programs, we aim to provide middle school and high school students with role models with similar racial and gender backgrounds to help them develop identities as computer scientists. For the graduate and undergraduate facilitators we aim to provide opportunities for them to give back to communities with similar demographics as the ones they came from and to help bolster their skills and identities as computer scientists.

ARTSI robotics competition. iAAMCS is the sponsor of (and provides some funding for) the ARTSI robotics competition. The objective of this competition is to develop an active robotics community for African American (and other underrepresented) students and recruit them to pursue graduate training and careers in research. This competition is held annually at the ACM Richard Tapia Celebration of Diversity in Computing Conference.

Tapia Celebration of Diversity in Computing Conference. iAAMCS sponsors students to attend this event. The premise of this activity is that there are professionals across the U.S. who are interested in supporting African American (and other diverse) students. Members of iAAMCS are both highly visible and deeply aware of who many of those individuals are. Underrepresented minority and female students are much less aware of those individuals. So the vision is that identifying and engaging this group of individuals could serve as a strong resource, both as role models and mentors for these students. When a broad group of diverse students interact with each other, there are benefits given their common bond of being underrepresented.

Conclusion
Currently, iAAMCS is in its second year as an organization. It is developing a model to engage, support, and sustain students via a national network. Building upon the prior relationships and networks developed by related organizations such as ELA and XSEDE, iAAMCS participants are connected to a larger network of peers, faculty, and computing professionals to meet the student's individual interests and needs. iAAMCS participants also join a network of individuals across the country, building a virtual (and sometimes in-person) network. This allows for any individual student, faculty member, or computing professional across the U.S. to become connected to others as a part of iAAMCS, including (in fact targeting) those that are not a member of any partnering institution. If iAAMCS meets its objectives, there should be shifts in the numbers of African Americans pursuing and receiving Ph.D.'s in CS. There should also be a continued increase in their representation in CS tenure-track faculty positions at all ranks. The iAAMCS projects are designed to provide mentoring, support, and a different perspective of computing for all underrepresented minorities with a focus on African Americans.

Juan E. Gilbert (juan@juangilbert.com) holds the Andrew Banks Family Preeminence Endowed Chair and is the associate chair of research in the Computer and Information Science and Engineering Department at the University of Florida where he leads the Human-Experience Research Lab.

Jerlando F.L. Jackson (jjackson@education.wisc.edu) is the Vilas Distinguished Professor of Higher Education and the director and chief research scientist of Wisconsin's Equity and Inclusion Laboratory (Wei LAB) at the University of Wisconsin-Madison.

Edward C. Dillon, Jr. (ecdillon@cise.ufl.edu) is the iAAMCS project manager and a postdoctoral associate in the Computer and Information Science and Engineering Department at the University of Florida.

LaVar J. Charleston (charleston@wisc.edu) is the assistant director and a senior research associate at Wisconsin's Equity and Inclusion Laboratory (Wei LAB) within the Wisconsin Center for Education Research at the University of Wisconsin-Madison.

This material is based in part upon work supported by the National Science Foundation under Grant Number CNS-1457855. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Copyright held by authors.


DOI:10.1145/2668022 Vijay Kumar and Thomas A. Kalil

Viewpoint

The Future of Computer Science and Engineering Is in Your Hands

How government service can profoundly influence computer science research and education.

A conceptualized software system inspired by a DARPA project investigating software systems that will remain functional in excess of 100 years.

Many researchers are concerned about flat or declining research budgets in computer science and engineering. Although the research community is right to be concerned about the shortage of funds, they should also be concerned about the shortage of scientists and engineers that are willing to serve in governments to address this challenge. Computer scientists and engineers can have a huge impact on the future of the field and the future of the U.S. By serving in the government, they can design and launch new research initiatives, inform IT-related policy decisions, and serve as a catalyst for public-private partnerships involving government, industry, and academia.

In the U.S., one way of advancing computer science and shaping public policy in computer science research and education is by serving in federal agencies that oversee investments in education and research in science and technology. Program and division directors at the National Science Foundation (NSF), the National Institutes of Health (NIH), National Aeronautics and Space Administration (NASA), the Defense Advanced Research Projects Agency (DARPA), the National Institute for Standards and Technology (NIST), the United States Department of Agriculture (USDA), the Department of Transportation (DOT), the Department of Homeland Security (DHS), and many other federal agencies help create national research and development initiatives that can create and/or transform disciplines.

The White House Office of Science and Technology Policy (OSTP) is an
agency that works with all other federal agencies. OSTP's overarching mission^a is to "ensure that federal investments in science and technology are making the greatest possible contribution to economic prosperity, public health, environmental quality, and national security." OSTP also works to formulate policies that can advance science and technology and to make sure in-depth scientific knowledge and expertise in technology are used to inform policymaking. Over the last three years, OSTP has championed a number of new research initiatives that support computer science and engineering, or that harness information technology to advance national priorities and accelerate the pace of discovery in science and engineering.

In recent years, the U.S. government has created two national education and research initiatives that are particularly relevant to computer science and engineering. The National Robotics Initiative (NRI) is a multi-agency initiative designed by OSTP to support research and development in robotics science and technology and promote applications of robotics in manufacturing, healthcare, agriculture, space exploration, national defense, homeland security, civil infrastructure, and education. In June 2011, President Obama unveiled the NRI, with an initial focus on robots that can work with humans to extend and augment human skills. NSF, NIH, NASA, and USDA issued a joint NRI solicitation^b and committed over $50 million to develop the science and technology for robots that can safely coexist and operate in close proximity to humans. The NRI now, in its third year, includes a team of over 50 program managers from nine different funding agencies and many of the best U.S. laboratories including the Army Research Laboratory, the Naval Research Laboratory, and the NASA Johnson Space Center. The NRI is broadening in scope to include a wider range of applications and will help strengthen U.S. leadership in robotics science and technology.

OSTP has also designed a closely related interagency initiative in cyber physical systems (CPS) with active participation from NSF, NIST, DOT, and DHS. CPS are smart networked systems with embedded sensors, processors, and actuators that are designed to sense and interact with the physical world (including human users). CPS technology is pervasive and will transform entire industrial sectors, including transportation, healthcare, energy, manufacturing, and agriculture. This initiative addresses the urgent need for research and development for the design and verification of software to provide guaranteed performance, the around-the-clock reliability needed for safety-critical applications; education and training required for CPS applications; and development of standards and reference architectures for CPS in different industry sectors.

NSF has already invested in CPS by committing over $140M for CPS research and education over the last four years.^c In addition, NSF has created a CPS Virtual Organization (VO),^d a network of federal agencies, private industry, and academic research labs actively engaged in CPS research, development, and education. Two White House Presidential Innovation Fellows supported by the National Institute for Standards and Technology (NIST) are

a Office of Science and Technology Policy; http://www.whitehouse.gov/ostp
b National Robotics Initiative; http://www.nsf.gov/pubs/2011/nsf11553/nsf11553.htm
c Cyber-Physical Systems; http://www.nsf.gov/pubs/2013/nsf13502/nsf13502.htm
d Cyber-Physical Systems Virtual Organization; http://cps-vo.org/


addressing sector-specific open testbeds for research and development, and for creating standards and reference architectures for CPS. Industry leaders such as GE, AT&T, Intel, IBM, and Cisco are exploring mechanisms for collaborating with each other and with government agencies and university researchers.

It is important to note that leaders from academia and industry have played a very significant role in shaping both the NRI and CPS initiatives. In 2008, the community pressed to create a National Robotics Week. In May 2009 and then again in March 2013, roadmaps^e for robotics research, development and education, were presented to the Congressional Robotics Caucus.^f OSTP, NSF, NIST, and other agencies held many workshops to help refine the vision for the future of robotics and CPS. In both these initiatives, individuals who served as rotators in the positions of program or division director at NSF, NASA, USDA, and NIH invested resources under their control to create new programs.

The last two years have been particularly exciting for robotics, CPS, and the Internet of Things. For example, on December 2, 2013, Amazon CEO Jeff Bezos announced plans for Amazon Prime Air, which will deliver small packages weighing less than five pounds (which constitute 86% of products shipped by Amazon) using aerial robots in the next five years. Google has acquired robotics companies with expertise in legged robots, low-cost arms, perception, virtual reality, and omnidirectional wheels. Recent reports on the Internet of Things predict economic impact in the tens of trillions of dollars^1 and over 75% of business leaders surveyed predicted a direct impact of this technology on their business.^2

While there are a number of technology demonstrations in robotics and CPS that suggest that these fields are becoming mature, many of these solutions only work under tightly constrained conditions. The recent DARPA Robotics Challenge serves to highlight many of the open problems in robotics in addition to underscoring the tremendous potential of this field. This is why it is important for all of us to be collectively and continuously engaged in science and technology policy. We need experts that can help identify the most important challenges in robotics and CPS, and that can design high-impact research initiatives to address these challenges.

It is also important to realize that behind the recent avalanche of success stories involving start-ups and investments by major companies is the "snowball" of decades of federal funding in these fields. With targeted investments we can create the scientific and technology breakthroughs that will serve as a catalyst for the industries and jobs of the future. For example, the U.S. federal government should make investments that promote the infrastructure needed for R&D; reduce the "design, build, test" cycle; and attract the next generation of scientists, engineers, and entrepreneurs to advance robotics and CPS, and their applications in domains such as manufacturing, transportation, healthcare, and energy.

Given the importance of these and many other opportunities, it is time for more members of the computer science community to spend time in Washington, D.C. Whether you are in industry or in academia, a student or a professional, we urge you to reach out and connect with federal government agencies whose missions are aligned with your interests. OSTP offers semester-long internships for students of all majors. The Presidential Innovation Fellows (PIF) program^g offers six- or 12-month appointments for innovators in the private sector who want to work with the government on developing high-impact technological solutions that can save lives, create new jobs, or make government more efficient. At OSTP, appointments of a year or more are possible through the Intergovernmental Personnel Act (IPA) and this can offer a great opportunity for those who have a sabbatical leave. There are also critical opportunities in other agencies such as NSF and DARPA. Indeed, the creation of the NRI and CPS initiatives was in large part because of individuals that took advantage of these opportunities at NSF.

Agencies like these clearly play a critical role in funding computer science and engineering research and education. However, federal agencies benefit from leadership in computer science and information technology. For example, NIH recently recruited a chief data scientist and many agencies have recruited chief technology officers. In the U.S., the Office of Science and Technology Policy and many other federal agencies are committed to investing in computer science research and harnessing information technology to address some of the nation's most important challenges. In other countries, the agencies responsible for R&D investment and their charters are obviously different. However, as in any country, we cannot do this without the active participation and engagement of creative members of the research community. Democracy is not a spectator sport, and government is only as effective as the people that chose to embrace public service.

References
1. Disruptive technologies: Advances that will transform life, business, and the global economy. McKinsey Global Institute Report, May 2013; http://www.mckinsey.com/insights/business_technology/disruptive_technologies.
2. The Internet of Things Business Index. A report from The Economist Intelligence Unit, 2013; http://www.economistinsights.com/analysis/internet-things-business-index.

e Robotics Virtual Organization; http://www.robotics-vo.us/
f Robotics Caucus; http://www.roboticscaucus.org/
g Presidential Innovation Fellows; http://www.whitehouse.gov/innovationfellows/

Vijay Kumar (kumar@seas.upenn.edu) is UPS Foundation Professor at the University of Pennsylvania and served as the assistant director of Robotics and Cyber Physical Systems at the White House Office of Science and Technology Policy from 2012–2014.

Thomas A. Kalil (Thomas_A._Kalil@ostp.eop.gov) is Deputy Director for Technology and Innovation at the White House Office of Science and Technology Policy.

Copyright held by authors.
practice

DOI:10.1145/2747303

Article development led by queue.acm.org

Low-Latency Distributed Applications in Finance

The finance industry has unique demands for low-latency distributed systems.

BY ANDREW BROOK

Virtually all systems have some requirements for latency, defined here as the time required for a system to respond to input. (Non-halting computations exist, but they have few practical applications.) Latency requirements appear in problem domains as diverse as aircraft flight controls,^a voice communications,^b multiplayer gaming,^c online advertising,^d and scientific experiments.^e Distributed systems—in which computation occurs on multiple networked computers that communicate and coordinate their actions by passing messages—present special latency considerations. In recent years the automation of financial trading has driven requirements for distributed systems with challenging latency requirements (often measured in microseconds or even nanoseconds; see Table 1) and global geographic distribution. Automated trading provides a window into the engineering challenges of ever-shrinking latency requirements and therefore may be useful to software engineers in other fields.

This article focuses on applications where latency (as opposed to throughput, efficiency, or some other metric) is one of the primary design considerations. Phrased differently, "low-latency systems" are those for which latency is the main measure of success and is usually the toughest constraint to design around. The article presents examples of low-latency systems that illustrate the external factors that drive latency and then discusses some practical engineering approaches to building systems that operate at low latency.

Why Is Everyone in Such a Hurry?
To understand the impact of latency on an application, it is important first to understand the external, real-world factors that drive the requirement. The following examples from the finance industry illustrate the impact of some real-world factors.

Request for quote trading. In 2003 I worked at a large bank that had just deployed a new Web-based institutional foreign-currency trading system. The quote and trade engine, a J2EE (Java 2 Platform, Enterprise Edition) application running in a WebLogic server on top of an Oracle database, had response times that were reliably under two seconds—fast enough to ensure good user experience.

Around the same time the bank's website went live, a multibank online trading platform was launched. On this new platform, a client would submit an RFQ (request for quote) that would be forwarded to multiple participating banks. Each bank would respond with a quote, and the client would choose which one to accept.

My bank initiated a project to connect to the new multibank platform. The reasoning was that since a two-second response time was good enough for a user on the website, it should be good enough for the new platform, and so the same quote and trade engine could be reused.
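That reasoning rested on a single "reliably under two seconds" figure. On a platform where the fastest of several quotes wins, the shape of the latency distribution matters as much as its worst case, so per-request summary statistics such as percentiles are more informative than one headline number. A minimal sketch of such a summary (an illustration only; the class and method names are mine, not part of the bank's system described in the article):

```java
import java.util.Arrays;

public class LatencyStats {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    static double mean(long[] samples) {
        long sum = 0;
        for (long s : samples) sum += s;
        return (double) sum / samples.length;
    }

    public static void main(String[] args) {
        // 98 responses at 100ms plus two 5-second stalls: the mean stays
        // under 200ms, while the 99th percentile exposes the slow tail.
        long[] millis = new long[100];
        Arrays.fill(millis, 100);
        millis[98] = 5_000;
        millis[99] = 5_000;
        System.out.printf("mean=%.0fms p50=%dms p99=%dms%n",
                mean(millis), percentile(millis, 50), percentile(millis, 99));
    }
}
```

A mean computed over such samples looks acceptable even when a meaningful fraction of quotes arrive far too late to win, which is why the percentile view is the one that predicts lost business.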
Within weeks of going live, however, the bank was winning a surprisingly small percentage of RFQs. The root cause was latency. When two banks responded with the same price (which happened quite often), the first response was displayed at the top of the list. Most clients waited to see a few different quotes and then clicked on the one at the top of the list. The result was the fastest bank often won the client's business—and my bank was not the fastest one.

The slowest part of the quote-generation process occurred in the database queries loading customer-pricing parameters. Adding a cache to the quote engine and optimizing a few other "hot spots" in the code brought quote latency down to the range of approximately 100 milliseconds. With a faster engine, the bank was able to capture significant market share on the competitive quotation platform—but the market continued to evolve.

Streaming quotes. By 2006 a new style of currency trading was becoming popular. Instead of a customer sending a specific request and the bank responding with a quote, customers wanted the banks to send a continuous stream of quotes. This streaming-quotes style of trading was especially popular with certain hedge funds that were developing automated trading strategies—applications that would receive streams of quotes from multiple banks and automatically decide when to trade. In many cases, humans were now out of the loop on both sides of the trade.

To understand this new competitive dynamic, it is important to know how banks compute the rates they charge their clients for foreign-exchange transactions. The largest banks trade currencies with each other in the so-called interbank market. The exchange rates set in that market are the most competitive and form the basis for the rates (plus some markup) that are offered to clients. Every time the interbank rate changes, each bank recomputes and republishes the corresponding client rate quotes. If a client accepts a quote (that is, requests to trade against a quoted exchange rate), then the bank can immediately execute an offsetting trade with the interbank market, minimizing risk and locking in a small profit.

Table 1. Units of time.

Millisecond = 10⁻³ seconds; 1,000 milliseconds = 1 second
Microsecond = 10⁻⁶ seconds; 1,000 microseconds = 1 millisecond
Nanosecond = 10⁻⁹ seconds; 1,000 nanoseconds = 1 microsecond

Table 2. Sequence of events (elapsed time in milliseconds).

100  Banks A and B receive messages indicating that the interbank rate increased to 1.3563/1.3566. Both start the process of computing updated quotes.
150  Bank A publishes an updated quote of 1.3562/1.3567.
200  Client C receives the updated quote from bank A but still has the old quote from bank B of 1.3557/1.3561.
210  Client C recognizes an arbitrage opportunity and sends a request to sell 1 million euros to bank A (receiving $1.3562 million) and at the same time sends a request to buy 1 million euros from bank B (paying $1.3561 million).
260  Bank A receives the request from C in which A receives 1 million euros from C and pays $1.3562 million in return.
260  Bank B receives the request from C in which B pays 1 million euros to C and receives $1.3561 million.
270  Since the quote is valid, A sends back an acknowledgment to C and at the same time sends a request to the interbank market to sell 1 million euros.
270  B likewise finds the request to be for a valid quote (it has not yet sent out the update that was started at 100ms) so acknowledges to C and sends a request to buy 1 million euros from the interbank market.
320  C receives acknowledgments from A and B. Its net profit is $100 (it received $1.3562 million from A and paid $1.3561 million to B).
350  Bank B publishes its updated quote of 1.3562/1.3567. This does not change the processing of the order that was already received from client C at the old rate.
380  Bank A receives acknowledgment from the interbank market. Bank A makes a net profit of $100 (it paid $1.3562 million to C and received $1.3563 million from the interbank market).
380  Bank B receives acknowledgment from the interbank market. Unfortunately, the current interbank rate is 1.3563/1.3566, so its order is filled at 1.3566, which means it pays $1.3566 million to buy the 1 million euros needed to offset the amount sold to C. Bank B thus suffers a net loss of $500.

Table 3. Examples of approximate latencies (order of magnitude, in seconds).

0 (1 second): Acceptable response time to Web page loads or other query-/response-style interactions.
-1 (100 milliseconds): Human perception of delay in interactive media (for example, voice communications). Real-time advertising auctions. Network latency across continents or oceans.
-2 (10 milliseconds): Hard-disk seek times and related I/O operations on spinning disks when the data is not cached.
-3 (1 millisecond): Real-time analytics in competitive environments (for example, natural language processing, topic classification).
-4 (100 microseconds): Limits of metropolitan-area communications: microwaves travel about 30km in air; light travels about 20km in fiber.
-5 (10 microseconds): Trade execution logic for relatively simple strategies running on CPU.
-6 (1 microsecond): Sending a packet from one server to another over a single-hop local-area network.
-7 (100 nanoseconds): Main memory access on modern Intel microprocessors. Thread context switch (best case; it can be much worse). Switching latency in a 10Gbps cut-through switch.
-8 (10 nanoseconds): Accuracy of good commercial GPS-based time sources.
-9 (1 nanosecond): Time to execute a single instruction on a modern CPU, assuming that operands are in registers or L1 cache.
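The profit-and-loss arithmetic in Table 2 can be replayed in a few lines. The rates and the 1-million-euro trade size are taken from the table; the class and method names are my own illustration, not code from the article:

```java
public class LatencyArbitrage {
    static final double EUROS = 1_000_000.0;  // trade size from Table 2

    // Client C sells EUR to bank A at 1.3562 and buys from bank B at 1.3561.
    static double clientProfit() { return EUROS * (1.3562 - 1.3561); }

    // Bank A buys EUR from C at 1.3562 and sells interbank at the 1.3563 bid.
    static double bankAProfit() { return EUROS * (1.3563 - 1.3562); }

    // Bank B sells EUR to C at 1.3561 but, by the time its hedge order
    // arrives, must buy back at the 1.3566 interbank ask.
    static double bankBProfit() { return EUROS * (1.3561 - 1.3566); }

    public static void main(String[] args) {
        // Roughly +$100, +$100, and -$500, matching the table.
        System.out.printf("C: %+.2f  A: %+.2f  B: %+.2f%n",
                clientProfit(), bankAProfit(), bankBProfit());
    }
}
```

Bank B's loss comes entirely from the 250 milliseconds it spent repricing: the client's order was accepted against a stale quote, and the offsetting interbank trade was filled only after the market had already moved.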


There are, however, risks to banks that are slow to update their quotes. A simple example can illustrate:

Imagine the interbank spot market for EUR/USD has rates of 1.3558/1.3560. (The term spot means the agreed-upon currencies are to be exchanged within two business days. Currencies can be traded for delivery at any mutually agreed-upon date in the future, but the spot market is the most active in terms of number of trades.) Two rates are quoted: one for buying (the bid rate), and one for selling (the offered or ask rate). In this case, a participant in the interbank market could sell one euro and receive 1.3558 U.S. dollars in return. Conversely, one could buy one euro for a price of 1.3560 U.S. dollars.

Say that two banks, A and B, are participants in the interbank market and are publishing quotes to the same hedge-fund client, C. Both banks add a margin of 0.0001 to the exchange rates they quote to their clients—so both publish quotes of 1.3557/1.3561 to client C. Bank A, however, is faster at updating its quotes than bank B, taking about 50 milliseconds while bank B takes about 250 milliseconds. There are approximately 50 milliseconds of network latency between banks A and B and their mutual client C. Both banks A and B take about 10 milliseconds to acknowledge an order, while the hedge fund C takes about 10 milliseconds to evaluate new quotes and submit orders. Table 2 breaks down the sequence of events.

The net effect of this new streaming-quote style of trading was that any bank that was significantly slower than its rivals was likely to suffer losses when market prices changed and its quotes [...]

[...] quote engine in C++. The small delays added by each hop in the network from the interbank market to the bank and onward to its clients were now significant, so the bank upgraded firewalls and procured dedicated telecom circuits. Network upgrades combined with the faster quote engine brought end-to-end quote latency down below 10 milliseconds for clients who were physically located close to our facilities in New York, London, or Hong Kong. Trading performance and profits rose accordingly—but, of course, the market kept evolving.

Engineering Systems for Low Latency
The latency requirements of a given application can be addressed in many ways, and each problem requires a different solution. There are some common themes, though. First, it is usually necessary to measure latency before it can be improved. Second, optimization often requires looking below abstraction layers and adapting to the reality of the physical infrastructure. Finally, it is sometimes possible to restructure the algorithms (or even the problem definition itself) to achieve low latency.

Lies, damn lies, and statistics. The first step to solving most optimization problems (not just those that involve software) is to measure the current system's performance. Start from the highest level and measure the end-to-end latency. Then measure the latency of each component or processing stage. If any stage is taking an unusually large portion of the latency, then break it down further and measure the latency of its substages. The goal is to find the parts of the system that contribute the most to the total latency and focus optimization efforts there. This is not always straightforward in practice, however.

For example, imagine an application that responds to customer quote requests received over a network. The client sends 100 quote requests in quick succession (the next request is sent as soon as the prior response is received) and reports total elapsed time

Example quote engine and test harness.

(In the client application)

for (int i = 0; i < 100; i++){
    RequestMessage requestMessage = new RequestMessage(quoteRequest);
    long sentTime = getSystemTime();
    sendMessage(requestMessage);
    ResponseMessage responseMessage = receiveMessage();
    long quoteLatency = getSystemTime() - sentTime;
    logStats(quoteLatency);
}

(In the quote server application)

while (true){
    RequestMessage requestMessage = receive();
    long receivedTime = getSystemTime();
    QuoteRequest quoteRequest = parseRequest(requestMessage);
were not updated quickly enough. At long parseTime = getSystemTime();
the same time, those banks that could long parseLatency = parseTime - receivedTime;
ClientProfile profile = lookupClientProfile(quoteRequest.client);
update their quotes fastest made sig-
long profileTime = getSystemTime();
nificant profits. Latency was no longer long profileLatency = profileTime - parseTime;
just a factor in operational efficiency Quote quote = computeQuote(profile);
or market share—it directly impacted long computeTime = getSystemTime();
long computeLatency = computeTime - profileTime;
the profit and loss of the trading desk. logQuote(quote);
As the volume and speed of trading long logTime = getSystemTime();
increased throughout the mid-2000s, long logLatency = logTime - computeTime;
these profits and losses grew to be QuoteMessage quoteMessage = new QuoteMessage(quote);
long serializeTime = getSystemTime();
quite large. (How low can you go? Table long serializationLatency = serializeTime - logTime;
3 shows some examples of approxi- send(quoteMessage);
mate latencies of systems and applica- long sentTime = getSystemTime();
long sendLatency = sentTime - serializeTime;
tions across nine orders of magnitude.) logStats(parseLatency, profileLatency, computeLatency,
To improve its latency, my bank logLatency, serializationLatency, sendLatency);
split its quote and trading engine into }
distinct applications and rewrote the

JULY 2015 | VOL. 58 | NO. 7 | COMMUNICATIONS OF THE ACM
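The staged measurement in the example quote engine above can be sketched in runnable form. This is a hedged illustration only: the Python stage functions, their return values, and the quote construction below are invented placeholders standing in for parseRequest, lookupClientProfile, and computeQuote, not the bank's actual code.

```python
import time

# Placeholder stages; the real implementations are not part of this sketch.
def parse_request(raw):
    return {"client": raw.strip()}

def lookup_client_profile(client):
    return {"client": client, "margin": 0.0001}

def compute_quote(profile):
    # Quote around the interbank rates used earlier (1.3558/1.3560),
    # widened by the client's margin.
    return (1.3558 - profile["margin"], 1.3560 + profile["margin"])

def handle_request(raw):
    """Serve one quote request, recording per-stage latency in nanoseconds."""
    stage_ns = {}
    t0 = time.perf_counter_ns()
    request = parse_request(raw)
    t1 = time.perf_counter_ns()
    stage_ns["parse"] = t1 - t0
    profile = lookup_client_profile(request["client"])
    t2 = time.perf_counter_ns()
    stage_ns["profile"] = t2 - t1
    quote = compute_quote(profile)
    t3 = time.perf_counter_ns()
    stage_ns["compute"] = t3 - t2
    return quote, stage_ns

quote, stage_ns = handle_request("hedge-fund-C")
```

Comparing the sum of such per-stage values against an end-to-end measurement taken at the client is what exposes where the time is actually going.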
practice

of 360 milliseconds—or 3.6 milliseconds on average to service a request. The internals of the application are broken down and measured using the same 100-quote test set:
• Read input message from network and parse: 5 microseconds
• Look up client profile: 3.2 milliseconds (3,200 microseconds)
• Compute client quote: 15 microseconds
• Log quote: 20 microseconds
• Serialize quote to a response message: 5 microseconds
• Write to network: 5 microseconds

As clearly shown in this example, significantly reducing latency means addressing the time it takes to look up the client's profile. A quick inspection shows the client profile is loaded from a database and cached locally. Further testing shows that when the profile is in the local cache (a simple hash table), response time is usually under a microsecond, but when the cache is missed it takes several hundred milliseconds to load the profile. The average of 3.2 milliseconds was almost entirely the result of one very slow response (of about 320 milliseconds) caused by a cache miss. Likewise, the client's reported 3.6-millisecond average response time turns out to be a single very slow response (350 milliseconds) and 99 fast responses that took around 100 microseconds each.

Means and outliers. Most systems exhibit some variance in latency from one event to the next. In some cases the variance (and especially the highest-latency outliers) drives the design much more so than the average case. It is important to understand which statistical measure of latency is appropriate to the specific problem. For example, if you are building a trading system that earns small profits when the latency is below some threshold but incurs massive losses when latency exceeds that threshold, then you should be measuring the peak latency (or, alternatively, the percentage of requests that exceed the threshold) rather than the mean. On the other hand, if the value of the system is more or less inversely proportional to the latency, then measuring (and optimizing) the average latency makes more sense even if it means there are some large outliers.

What are you measuring? Astute readers may have noticed the latency measured inside the quote server application does not quite add up to the latency reported by the client application. That is most likely because they are not actually measuring the same thing. Consider the simplified pseudocode illustrated in the accompanying figure.

Note the elapsed time measured by the client application includes the time to transmit the request over the network, as well as the time for the response to be transmitted back. The quote server, on the other hand, measures the time elapsed only from the arrival of the request to when the response is sent (or more precisely, when the send method returns). The network could account for the 350-microsecond discrepancy between the average response time measured by the client and the equivalent measurement by the quote server, but it might also be the result of delays within the client or server. Moreover, depending on the programming language and operating system, checking the system clock and logging the latency statistics may introduce material delays.

This approach is simplistic, but when combined with code-profiling tools to find the most commonly executed code and resource contention, it is usually good enough to identify the first (and often easiest) targets for latency optimization. It is important to keep this limitation in mind, though.

Measuring distributed systems latency via network traffic capture. Distributed systems pose some additional challenges to latency measurement—as well as some opportunities. In cases where the system is distributed across multiple servers it can be difficult to correlate timestamps of related events. The network itself can be a significant contributor to the latency of the system. Messaging middleware and the networking stacks of operating systems can be complex sources of latency.

At the same time, the decomposition of the overall system into separate processes running on independent servers can make it easier to measure certain interactions between components of the system accurately over the network. Many network devices (such as switches and routers) provide mechanisms for making time-stamped copies of the data that traverse the device with minimal impact on the performance of the device. Most operating systems provide similar capabilities in software, albeit with a somewhat higher risk of delaying the actual traffic. Time-stamped network-traffic captures (often called packet captures) can be a useful tool to measure more precisely when a message was exchanged between two parts of the system. These measurements can be obtained without modifying the application itself and generally with very little impact on the performance of the system as a whole.f

Clock synchronization. One of the challenges of measuring performance at short time scales across distributed systems is clock synchronization. In general, to measure the time elapsed from when an application on server A transmits a message to a second application on server B, it is necessary to check the time on A's clock when the message is sent and on B's clock when the message arrives, and then subtract those two timestamps to determine the latency. If the clocks on A and B are not in sync, then the computed latency will actually be the real latency plus the clock skew between A and B.

When is this a problem in the real world? Real-world drift rates for the quartz oscillators used in most commodity server motherboards are on the order of 10⁻⁵, which means the oscillator may be expected to drift by 10 microseconds each second. If uncorrected, it may gain or lose as much as a second over the course of a day. For systems operating at time scales of milliseconds or less, clock skew may render the measured latency meaningless. Oscillators with significantly lower drift rates are available, but without some form of synchronization, they will eventually drift apart. Some mechanism is needed to bring each server's local clock into alignment with some common reference time.

Developers of distributed systems should understand Network Time Protocol (NTP) at a minimum and are encouraged to learn about Precision Time Protocol (PTP) and the use of external signals such as GPS to obtain high-accuracy time synchronization in practice. Those who need time accuracy at the sub-microsecond scale will want to become familiar with hardware implementations of PTP (especially at

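The means-and-outliers pitfall is easy to reproduce numerically. A small sketch using the numbers from the quote-server example (99 responses of roughly 100 microseconds plus one 350-millisecond cache miss; values in milliseconds):

```python
# 99 fast responses (~0.1 ms) and one slow outlier (350 ms).
samples_ms = [0.1] * 99 + [350.0]

mean_ms = sum(samples_ms) / len(samples_ms)
median_ms = sorted(samples_ms)[len(samples_ms) // 2]
worst_ms = max(samples_ms)

# The mean (~3.6 ms) suggests a modestly slow system; the median shows the
# typical request is fast, and the max shows how bad the outlier really is.
print(round(mean_ms, 3), median_ms, worst_ms)
```

Which of these three numbers matters depends, as the text argues, on whether the cost of latency is proportional to its size or dominated by threshold-crossing outliers.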



the network interface) as well as tools for extracting time information from each core's local clock.g

Abstraction vs. reality. Modern software engineering is built upon abstractions that allow programmers to manage the complexity of ever-larger systems. Abstractions do this by simplifying or generalizing some aspect of the underlying system. This does not come for free, though—simplification is an inherently lossy process and some of the lost details may be important. Moreover, abstractions are often defined in terms of function rather than performance.

Somewhere deep below an application are electrical currents flowing through semiconductors and pulses of light traveling down fibers. Programmers rarely need to think of their systems in these terms, but if their conceptualized view drifts too far from reality they are likely to experience unpleasant surprises.

Four examples illustrate this point:
• TCP provides a useful abstraction over IP (Internet Protocol) in terms of delivery of a sequence of bytes. TCP ensures bytes will be delivered in the order they were sent even if some of the underlying IP packets are lost. The transmission latency of each byte (the time from when it is written to a TCP socket in the sending application until it is read from the corresponding receiving application's socket) is not guaranteed, however. In certain cases (specifically when an intervening packet is lost) the data contained in a given packet may be delayed significantly from delivery to the application, while the missed data ahead of it is recovered.
• Cloud hosting provides virtual servers that can be created on demand without precise control over the location of the hardware. An application or administrator can create a new virtual server "on the cloud" in less than a minute—an impossible feat when assembling and installing physical hardware in a data center. Unlike the physical server, however, the location of the cloud server or its location in the network topology may not be precisely known. If a distributed application depends on the rapid exchange of messages between servers, then the physical proximity of those servers may have a significant impact on the overall application performance.
• Threads allow developers to decompose a problem into separate sequences of instructions that are allowed to run concurrently, subject to certain ordering constraints, and that can operate on shared resources (such as memory). This allows developers to take advantage of multicore processors without needing to deal directly with issues of scheduling and core assignment. In some cases, however, the overhead of context switches and passing data between cores can outweigh the advantages gained by concurrency.
• Hierarchical storage and cache-coherency protocols allow programmers to write applications that use large amounts of virtual memory (on the order of terabytes in modern commodity servers), while experiencing latencies measured in nanoseconds when requests can be serviced by the closest caches. The abstraction hides the fact that the fastest memory is very limited in capacity (for example, register files on the order of a few kilobytes), while memory that has been swapped out to disk may incur latencies in the tens of milliseconds.

Each of these abstractions is extremely useful but can have unanticipated consequences for low-latency applications. There are some practical steps to take to identify and mitigate latency issues resulting from these abstractions.

Message and network protocols. The near ubiquity of IP-based networks means that regardless of which messaging product is in use, under the covers the data is being transmitted over the network as a series of discrete packets. The performance characteristics of the network and the needs of an application can vary dramatically—so one size almost certainly does not fit all when it comes to messaging middleware for latency-sensitive distributed systems.

There is no substitute for getting under the hood here. For example, if an application runs on a private network (you control the hardware), communications follow a publisher/subscriber model, and the application can tolerate a certain rate of data loss, then raw multicast may offer significant performance gains over any middleware based on TCP. If an application is distributed across very long distances and data order is not important, then a UDP-based protocol may offer advantages in terms of not stalling

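One concrete example of an out-of-the-box TCP setting that favors throughput over latency is Nagle's algorithm, which coalesces small writes into fewer packets. A minimal sketch of disabling it per socket in Python (the socket is never connected here; this only demonstrates the option):

```python
import socket

# Create a TCP socket and disable Nagle's algorithm so small writes are
# sent immediately instead of being batched.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```

Whether this helps depends on the traffic pattern: for many small, latency-critical messages it usually does; for bulk transfer it can increase packet overhead.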

to resend a missed packet. If TCP-based messaging is being used, then it is worth keeping in mind that many of its parameters (especially buffer sizes, slow start, and Nagle's algorithm) are configurable and the "out-of-the-box" settings are usually optimized for throughput rather than latency.h

Location. The physical constraint that information cannot propagate faster than the speed of light is a very real consideration when dealing with short time scales and/or long distances. The two largest stock exchanges, NASDAQ and NYSE, run their matching engines in data centers in Carteret and Mahwah, New Jersey, respectively. A ray of light takes 185 microseconds to travel the 55.4km distance between these two locations. Light in a glass fiber with a refractive index of 1.6 and following a slightly longer path (roughly 65km) takes almost 350 microseconds to make the same one-way trip. Given that the computations involved in trading decisions can now be made on time scales of 10 microseconds or less, signal propagation latency cannot be ignored.

Threading. Decomposing a problem into a number of threads that can be executed concurrently can greatly increase performance, especially in multicore systems, but in some cases it may actually be slower than a single-threaded solution.

Specifically, multithreaded code incurs overhead in the following three ways:
• When multiple threads operate on the same data, controls are required to ensure the data remains consistent. This may include acquisition of locks or implementations of read or write barriers. In multicore systems, these concurrency controls require that thread execution is suspended while messages are passed between cores. If a lock is already held by one thread, then other threads seeking that lock will need to wait until the first one is finished. If several threads are frequently accessing the same data, then there may be significant contention for locks.
• Similarly, when multiple threads operate on the same data, the data itself must be passed between cores. If several threads access the same data but each performs only a few computations on it, then the time required to move the data between cores may exceed the time spent operating on it.
• Finally, if there are more threads than cores, then the operating system must periodically perform a context switch in which the thread running on a given core is halted, its state is saved, and another thread is allowed to run. The cost of a context switch can be significant. If the number of threads far exceeds the number of cores, then context switching can be a significant source of delay.

In general, application design should use threads in a way that represents the inherent concurrency of the underlying problem. If the problem contains significant computation that can be performed in isolation, then a larger number of threads is called for. On the other hand, if there is a high degree of interdependency between computations or (worst case) if the problem is inherently serial, then a single-threaded solution may make more sense. In both cases, profiling tools should be used to identify excessive lock contention or context switching. Lock-free data structures (now available for several programming languages) are another alternative to consider.i

It is also worth noting that the physical arrangement of cores, memory, and I/O may not be uniform. For example, on modern Intel microprocessors certain cores can interact with external I/O (for example, network interfaces) with much lower latency than others, and exchanging data between certain cores is faster than others. As a result, it may be advantageous to pin specific threads explicitly to specific cores.j

Hierarchical storage and cache misses. All modern computing systems use hierarchical data storage—a small amount of fast memory combined with multiple levels of larger (but slower) memory. Recently accessed data is cached so that subsequent access is faster. Since most applications exhibit a tendency to access the same memory multiple times in a short period, this can greatly increase performance. To obtain maximum benefit, however, the following three factors should be incorporated into application design:
• Using less memory overall (or at least in the parts of the application that are latency sensitive) increases the probability that needed data will be available in one of the caches. In particular, for especially latency-sensitive

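The propagation delays quoted in the Location discussion can be checked with a back-of-the-envelope calculation. A sketch (the distances and the 1.6 refractive index are the figures cited in the text):

```python
C_VACUUM_M_PER_S = 299_792_458.0  # speed of light in vacuum

def one_way_delay_us(distance_m, refractive_index=1.0):
    """One-way propagation delay in microseconds at c / refractive_index."""
    return distance_m * refractive_index / C_VACUUM_M_PER_S * 1e6

# Light in vacuum over the 55.4 km between Carteret and Mahwah.
vacuum_us = one_way_delay_us(55_400)        # about 185 microseconds
# Light in glass fiber (index 1.6) over a roughly 65 km path.
fiber_us = one_way_delay_us(65_000, 1.6)    # almost 350 microseconds
```

At decision time scales of 10 microseconds, a one-way fiber delay some 35 times larger plainly dominates the budget.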



applications, designing the app so that frequently accessed data fits within the CPU's caches can significantly improve performance. Specifications vary, but Intel's Haswell microprocessors, for example, provide 32KB per core of L1 data cache and up to 40MB of shared L3 cache for the entire CPU.
• Repeated allocation and release of memory should be avoided if reuse is possible. An object or data structure that is allocated once and reused has a much greater chance of being present in a cache than one that is repeatedly allocated anew. This is especially true when developing in environments where memory is managed automatically, as the overhead caused by garbage collection of memory that is released can be significant.
• The layout of data structures in memory can have a significant impact on performance because of the architecture of caches in modern processors. While the details vary by platform and are outside the scope of this article, it is generally a good idea to prefer arrays as data structures over linked lists and trees and to prefer algorithms that access memory sequentially, since these allow the hardware prefetcher (which attempts to load data preemptively from main memory into cache before it is requested by the application) to operate most efficiently. Note also that data that will be operated on concurrently by different cores should be structured so it is unlikely to fall in the same cache line (the latest Intel CPUs use 64-byte cache lines) to avoid cache-coherency contention.

A note on premature optimization. The optimizations just presented should be considered part of a broader design process that takes into account other important objectives, including functional correctness and maintainability. Keep in mind Knuth's quote about premature optimization being the root of all evil; even in the most performance-sensitive environments it is rare that a programmer should be concerned with determining the correct number of threads or the optimal data structure until empirical measurements indicate a specific part of the application is a hot spot. Focus instead should be on ensuring performance requirements are understood early in the design process and the system architecture is sufficiently decomposable to allow detailed measurement of latency as and when optimization becomes necessary. Moreover (and as discussed in the next section), the most useful optimizations may not be in the application code at all.

Changes in design. The optimizations presented so far have been limited to improving the performance of a system for a given set of functional requirements. There may also be opportunities to change the broader design of the system or even to change the functional requirements of the system in a way that still meets the overall objectives but significantly improves performance. Latency optimization is no exception. In particular, there are often opportunities to trade reduced efficiency for improved latency.

Three real-world examples of design trade-offs between efficiency and latency are presented here, followed by an example where the requirements themselves present the best opportunity for redesign.

Speculative precomputation. In certain cases trading efficiency for latency may be possible, especially in systems that operate well below their peak capacity. In particular, it may be advantageous to compute possible outputs in advance, especially when the system is idle most of the time but must react quickly when an input arrives.

A real-world example can be found in the systems used by some firms to trade stocks based on news such as earnings announcements. Imagine the market expects Apple to earn between $9.45 and $12.51 per share. The goal of the trading system, upon receiving Apple's actual earnings, would be to sell some number of shares of Apple stock if the earnings were below $9.45, buy some number of shares if the earnings were above $12.51, and do nothing if the earnings fall within the expected range. The act of buying or selling stocks begins with submitting an order to the exchange. The order consists of (among other things) an indicator of whether the client wishes to buy or sell, the identifier of the stock to buy or sell, the number of shares desired, and the price at which the client wishes to buy or sell. Throughout the afternoon leading up to Apple's announcement, the client would receive a steady stream of market-data messages that indicate the current price at which Apple's stock is trading.

A conventional implementation of this trading system would cache the market-price data and, upon receipt of the earnings data, decide whether to buy or sell (or neither), construct an order, and serialize that order to an array of bytes to be placed into the payload of a message and sent to the exchange.

An alternative implementation performs most of the same steps but does so on every market-data update rather than only upon receipt of the earnings data. Specifically, when each market-data update message is received, the application constructs two new orders (one to buy, one to sell) at the current prices and serializes each order into a message. The messages are cached but not sent. When the next market-data update arrives, the old order messages are discarded and new ones are created. When the earnings data arrives, the application simply decides which (if either) of the order messages to send.

The first implementation is clearly more efficient (it has a lot less wasted computation), but at the moment when latency matters most (that is, when the earnings data has been received), the second algorithm is able to send out the appropriate order message sooner. Note that this example presents application-level precomputation; there is an analogous process of branch prediction that takes place in pipelined processors that can also be optimized (via guided profiling) but is outside the scope of this article.

Keeping the system warm. In some low-latency systems long delays may occur between inputs. During these idle periods, the system may grow "cold." Critical instructions and data may be evicted from caches (costing hundreds of nanoseconds to reload), threads that would process the latency-sensitive input are context-switched out (costing tens of microseconds to resume), or CPUs may switch into power-saving states (costing a few milliseconds to exit). Each of these steps makes sense from an efficiency standpoint (why run a CPU at full power when nothing is happening?), but all of them impose latency penalties when the input data arrives.

In cases where the system may go for hours or days between input events there is a potential operational issue as well: configuration or environmental changes may have "broken" the system

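The speculative-precomputation pattern described above can be sketched in a few lines. This is a hedged illustration only: the class and method names, the JSON serialization, and the fixed 100-share size are invented for the sketch, not taken from any real trading system.

```python
import json

LOW, HIGH = 9.45, 12.51  # expected earnings range from the example

class Precomputer:
    def __init__(self):
        self.buy_msg = None
        self.sell_msg = None

    def on_market_data(self, price):
        # On every tick, rebuild (but do not send) both candidate orders,
        # discarding the ones built at the previous price.
        self.buy_msg = json.dumps(
            {"side": "buy", "symbol": "AAPL", "qty": 100, "price": price})
        self.sell_msg = json.dumps(
            {"side": "sell", "symbol": "AAPL", "qty": 100, "price": price})

    def on_earnings(self, eps):
        # Only the cheap decision remains on the latency-critical path.
        if eps < LOW:
            return self.sell_msg
        if eps > HIGH:
            return self.buy_msg
        return None

p = Precomputer()
p.on_market_data(101.25)
msg = p.on_earnings(13.02)  # above the range, so the prepared buy order is chosen
```

The wasted work (two serialized messages per tick, almost all of them discarded) is exactly the efficiency being traded away for a shorter critical path.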

rest of the document had arrived), led to


a redesign of the system.
a
http://copter.ardupilot.com/
b
http://queue.acm.org/detail.cfm?id=1028895/
The FTP server was rewritten to for-
c
http://queue.acm.org/detail.cfm?id=971591/ ward portions of the document to the
d
http://acuityads.com/real-time-bidding/ parser as they arrived rather than wait
e
http://home.web.cern.ch/about/accelerators/cern-neutrinos-gran-sasso/ for the entire document. Likewise, the
f
See https://tools.ietf.org/html/rfc1305/, https://tools.ietf.org/html/rfc5905/;
http://www.nist.gov/el/isd/ieee/ieee1588.cfm/; and http://queue.acm.org/detail. parser was rewritten to operate on a
cfm?id=2354406/ stream of incoming data rather than
g
See http://wireshark.org and http://tcpdump.org/ on a single document. The result was
h
http://queue.acm.org/detail.cfm?id=2539132/
i
http://queue.acm.org/detail.cfm?id=2492433/
that in many cases the earnings data
j
http://queue.acm.org/detail.cfm?id=2513149/ could be extracted within just a few
milliseconds of the start of the arrival
of the document. This reduced overall
in some important way, but will not ies, provides both the benefit of less- latency (as observed by the client) by
be discovered until the event occurs— frequent outages and the avoidance of several tens of milliseconds without
when it is too late to fix. some of the larger delays. the internal implementation of the
A common solution to both prob- Streaming processing and short cir- parsing algorithm being any faster.
lems is to generate a continuous stream cuits. Consider a news analytics system
of dummy input data to keep the sys- whose requirements are understood to Conclusion
tem “warm.” The dummy data needs be “build an application that can extract While latency requirements are com-
to be as realistic as possible to ensure corporate earnings data from a press re- mon to a wide array of software appli-
it keeps the right data in the caches lease document as quickly as possible.” cations, the financial trading industry
and that breaking changes to the en- Separately, it was specified the press re- and the segment of the news media
vironment are detected. The dummy leases would be pushed to the system that supplies it with data have an espe-
data needs to be reliably distinguish- via FTP. The system was thus designed cially competitive ecosystem that pro-
able from legitimate data, though, to as two applications: one that received duces challenging demands for low-
prevent downstream systems or clients the document via FTP; and a second that latency distributed systems.
from being confused. parsed the document and extracted the As with most engineering prob-
Redundant processing. It is com- earnings data. In the first version of this lems, building effective low-latency
mon in many systems to process the system, an open source FTP server was distributed systems starts with hav-
same data through multiple indepen- used as the first application, and the sec- ing a clear understanding of the prob-
dent instances of the system in parallel, ond application (the parser) assumed it lem. The next step is measuring actual
primarily for the improved resiliency would receive a fully formed document performance and then, where neces-
that is conferred. If some component fails, the user will still receive the result needed. Low-latency systems gain the same resiliency benefits of parallel, redundant processing but can also use this approach to reduce certain kinds of variable latency.

All real-world computational processes of nontrivial complexity have some variance in latency even when the input data is the same. These variations can be caused by minute differences in thread scheduling, explicitly randomized behaviors such as Ethernet's exponential back-off algorithm, or other unpredictable factors. Some of these variations can be quite large: page faults, garbage collections, and network congestion can all cause occasional delays that are several orders of magnitude larger than the typical processing latency for the same input. Running multiple, independent instances of the system, combined with a protocol that allows the end recipient to accept the first result produced and discard subsequent redundant copies

as input, so it would not start parsing the document until it had fully arrived. Measuring the performance of the system showed that while parsing was typically completed in just a few milliseconds, receiving the document via FTP could take tens of milliseconds from the arrival of the first packet to the arrival of the last packet. Moreover, the earnings data was often present in the first paragraph of the document.

In a multistep process it may be possible for subsequent stages to start processing before prior stages have finished, sometimes referred to as stream-oriented or pipelined processing. This can be especially useful if the output can be computed from a partial input. Taking this into account, the developers reconceived their overall objective as "build a system that can deliver earnings data as quickly as possible to the client." This broader objective, combined with understanding the press release would arrive via FTP and it was possible to extract the earnings data from the first part of the document (that is, before the

sary, making improvements. In this domain, improvements often require some combination of digging below the surface of common software abstractions and/or trading some degree of efficiency for improved latency.

Related articles
on queue.acm.org

There's Still Some Life Left in Ada
Alexander Wolfe
http://queue.acm.org/detail.cfm?id=1035608

Principles of Robust Timing over the Internet
Julien Ridoux and Darryl Veitch
http://queue.acm.org/detail.cfm?id=1773943

Online Algorithms in High-Frequency Trading
Jacob Loveless, Sasha Stoikov, and Rolf Waeber
http://queue.acm.org/detail.cfm?id=2534976

Andrew Brook is the CTO of Selerity, a provider of real-time news, data, and content analytics. Previously he led development of electronic currency-trading systems at two large investment banks and launched a pre-dot-com startup to deliver AI-powered scheduling software to agile manufacturers.

Copyright held by author.
Publication rights licensed to ACM. $15.00.
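The "accept the first result, discard the rest" protocol described above can be sketched with Python's standard concurrent.futures module. The worker functions here are hypothetical stand-ins for redundant, independent instances of the same computation with different variable latencies:

```python
import concurrent.futures
import time

def first_result(workers, payload):
    """Run the same payload through several independent workers and
    return whichever result completes first; later copies are discarded."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(w, payload) for w in workers]
        done, not_done = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # best effort; already-running copies finish and are ignored
        return next(iter(done)).result()

# Hypothetical redundant instances with different (simulated) latencies.
def make_worker(delay):
    def run(payload):
        time.sleep(delay)  # stand-in for a GC pause, page fault, or congestion
        return payload.upper()
    return run

result = first_result([make_worker(0.2), make_worker(0.01)], "earnings data")
print(result)  # EARNINGS DATA
```

The trade-off is the one noted in the article: the redundant copies consume extra capacity, but in exchange the tail of the latency distribution shrinks to that of the fastest instance.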

50 COMMUNICATIONS OF THE ACM | JULY 2015 | VOL. 58 | NO. 7


DOI:10.1145/2755503

Article development led by queue.acm.org

An agile process implementation.

BY PHELIM DOWLING AND KEVIN MCGRATH

Using Free and Open Source Tools to Manage Software Quality

THE PRINCIPLES OF agile software development place more emphasis on individuals and interactions than on processes and tools. They steer us away from heavy documentation requirements and guide us along a path of reacting efficiently to change rather than sticking rigidly to a predefined plan. To support this flexible

method of operation, it is important to have suitable applications to manage the team's activities. It is also essential to implement effective frameworks to ensure quality is being built into the product early and at all levels. With these concerns in mind and coming from a budget-conscious perspective, this article will explore the free and open source applications and tools used by one organization in its quest to build process and quality around its projects and products.

How a software development team implements process and handles its application life cycle management can have a huge bearing on the success of the product it is developing.2 A lean and efficient mode of operation is the key, especially for early-stage start-ups. Many commercial-grade products and frameworks are available for integrating all aspects of your development process: requirements management, project planning, defect management, test case creation and execution, automated testing, load testing, and so on. But these tools can be expensive and may be too much of a burden for some organizations' idea of agility.

Other solutions exist that can support your agile philosophy and not make any dent at all in your hard-earned bottom line. An abundance of free and open source tools has been available for quite a while now. Many companies and organizations are employing open source tools to meet their business needs.1 These applications often have large communities of active and engaged contributors documenting the features of the product and releasing enhancements on a regular basis. We recommend proper due diligence in examining the suitability of the tools for your needs and determining their track records of performance,

practice

maintenance, and growth. If they pass that check, open source tools can more than meet the needs of small project outfits, funded research and development teams, early start-ups, and even larger-scale enterprises.

Project Management and Issue Tracking
The Telecommunications Software & Systems Group (TSSG)5 previously used tools such as XPlanner for project planning and Bugzilla for defect management. While individually these tools were quite functional, the lack of integration in terms of gathering requirements, planning, logging defects, and so on became an obstacle to our desire for a more streamlined and agile way of working. Expecting developers to look at one system for bugs to be fixed and another to determine what features to work on was inefficient. The Verification and Validation (VnV) team made efforts to link defects with features but this was manual and laborious. Planning of daily and weekly activities was harder because of the standalone nature of the tools.

This difficulty led to the creation of a subteam of tech leads and process champions to review available tools that might be better suited to our needs. Their remit was to find a tool that would allow us to manage our activities across multiple projects efficiently, was adaptable to our fluid agile ethos, and had zero cost.

This eventually led to the rollout of Redmine as the organization's primary tool for project management and issue tracking. Redmine is an open source application, released under the GNU General Public License v2 and written using the Ruby on Rails Web application framework; we have installed our implementation on an Ubuntu VM. Utilizing the open source Apache Web server, TSSG staff access the tool via any common browser. The application's MySQL database is backed up regularly because of the widespread use and critical stature of this tool within the organization. The integral nature of Redmine within the TSSG agile process, along with the other tools we use, is depicted in the accompanying figure.

At the requirements gathering stage all project requirements are documented in Redmine as user stories. These are initially stored in the project backlog before being allocated to specific engineering releases and biweekly iterations. The user stories are then broken down into subtasks and subfeatures so that distinct and measurable elements of work are planned and executed to achieve the overall goal. These pieces of work are allocated to individuals on the team to be completed within a specific time period.

The Redmine tool facilitates many of the other agile methods used by our engineering teams, including time management, defect tracking, email integration, task boards, sprint cards, and so on.

While there is an active community of developers working on Redmine and creating plugins of value, this is somewhat of a double-edged sword. We have found in the past that upgrading the core product can be problematic due to these plugins. The plugins are not always compatible with newer releases, leading to headaches during the up-

Open source and free tools mapped to the TSSG agile process.

[Figure: Redmine holds the master project plan, requirements, and user stories for each project, feeding a release plan of engineering releases (ER, roughly three months each) divided into two-week iterations (IR). In the development environment, code is committed to an SVN/Git repository, built with Maven/Ant, and continuously integrated with Hudson/Jenkins, with JUnit unit tests and Cobertura code coverage building quality in; Redmine also provides defect tracking. In the test environment, Klaros manages test cases, JMeter runs load tests, Selenium runs automated tests, and NetSparker performs security testing, with a full system test of the product release at the end of each ER.]
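As a side note on the Redmine hub shown in the figure: the workflow need not be driven entirely through the browser, since Redmine also exposes a REST API through which a user story can be filed from a script. The following is a sketch only; the instance URL and API key are placeholders, and the payload layout should be checked against your Redmine version's API documentation:

```python
import json
import urllib.request

REDMINE_URL = "https://redmine.example.org"  # placeholder instance URL
API_KEY = "your-api-key-here"                # per-user key from Redmine

def story_payload(project_id, subject, description):
    """JSON body for creating an issue (user story) via POST /issues.json."""
    return json.dumps({"issue": {
        "project_id": project_id,
        "subject": subject,
        "description": description,
    }}).encode("utf-8")

def create_story(project_id, subject, description):
    """POST the new issue to Redmine; expects HTTP 201 Created on success."""
    req = urllib.request.Request(
        REDMINE_URL + "/issues.json",
        data=story_payload(project_id, subject, description),
        headers={"Content-Type": "application/json",
                 "X-Redmine-API-Key": API_KEY},
        method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Scripted creation like this is handy for bulk-importing a backlog at the requirements gathering stage rather than typing each story in by hand.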




grade and broken functionality for features that teams were previously using.

Code Repository and Continuous Integration
One of the first tasks for a project team is to set up a source code control framework. It is critical to maintain a central area for the team to check in code and have it backed up regularly. Tracking versions of your code will allow you to roll back to a previous revision if required. Spawning branches as the project grows and tagging builds to allow hot fixes are likely requirements. In TSSG, we have used locally hosted versions of both the Apache Subversion system and the Git repository, and both tools have met the needs of our projects.

Another handy integration point we have set up for projects is between Git and Redmine. This allows us to view the Git repository, commits, and stats directly from our project management system.

A concept we are currently exploring is a peer review of code before commit. This would especially apply to junior developers paired with more experienced colleagues. Tools such as Gerrit and Review Board are being examined for suitability in this regard. While it may bring extra overhead before commit, we feel it can add extra quality to our products in the long run. Code reviews and peer programming are philosophies that TSSG promotes, but they are too often overlooked due to time constraints. If these tools can facilitate code review and peer programming in a relatively seamless manner, we anticipate trialing their use on a subset of projects before pushing them out for general use.

A key element of our agile process is to build our code regularly. Continuous integration will execute all the unit and acceptance tests at build time, and makes regular builds available for verification on the test environment. In the TSSG we have used Hudson and, more frequently, Jenkins for this. Jenkins is easily installed and configured, and its builds can be triggered by code check-ins (to the repositories mentioned earlier) or at regular intervals. A user-friendly Web-based interface makes it easy to view a list of builds including the latest. You can view the changes made in any build, and a comprehensive overview of the test results for each build is also presented. Jenkins can also be configured to send email messages to team members on build pass and/or fail. Of course, under-the-hood open source tools such as Ant and Maven are actually generating the builds.

In the TSSG we promote the concept of test-driven development and, while the choice of tool for unit testing will vary depending on the technologies being used, JUnit is the most common. On top of this, we measure the percentage of our code that is covered by tests using tools such as Cobertura. A Cobertura plugin for Jenkins allows the code coverage to be analyzed from the Jenkins build Web page. This element of our process often generates the most heated discussions between the VnV team and the engineering teams. Our goal is a 90% level of code coverage, but there is always debate about the difficulty of achieving this target with certain technologies. Also, the time and resources required to reach the last few percent are often questioned in terms of return on investment. However, the team mentality promoted in our agile philosophy means that consensus is always reached.

Another aspect of the TSSG's agile process methodology is an independent code review team. This team examines the project's implementation of a build framework using the tools described in this section, with a view to building quality into our code from the ground up. They may also advise the developers to implement code analysis using open source tools such as Checkstyle, Gmetrics, or CodeNarc.

Automated Testing
The majority of the projects in the TSSG start as funded research. However, as the project evolves and the solution stabilizes and matures, the product will generally take on a more commercial nature. As this stage approaches we have found it beneficial to implement Selenium automated test suites on the components developed to date.

Our backend implementation is Selenium 2.0 WebDriver running on a Windows server. The Selenium IDE is installed as a Firefox browser add-on for recording the test cases. To allow the results of a test suite to be emailed to the relevant team members we have installed two other pieces of freeware: the MDaemon Free mail server and SendEmail, a lightweight command-line email client.

The tests are executed in Firefox and recorded via the Selenium IDE, then saved on the server side. A single file listing all the individual test cases to be executed as part of a test suite is created. A batch file is developed to execute the entire suite and scheduled to run at desired intervals. The batch file calls the Selenium tool, tells it which suite to run (and on which browser) and where to save the results. The final step is to email the results to the distribution list.

Quite often you will find that other users have encountered roadblocks in automating elements of their tests. To overcome this they may have coded a 'user extension' for Selenium and made it openly available. Contained in a flat file configured in your local environment, these coded solutions provided by individual contributors can prove to be very useful. One of the earliest extensions we used was the beatnicClick function. This handy extension acted as a workaround for an issue with timeout failures on certain native click activities, allowing the test to execute the desired functionality seamlessly. Another example of a user extension we have implemented in our framework is for flow-control logic. The ability to easily include goto statements and while loops has helped improve the robustness and coverage provided by our automated test suites.

The Selenium framework implemented in TSSG allows us to execute functions in a browser session, record those steps, and then replay them automatically. The test steps can then be extended to include checks for specific text or elements on the site. You can capture and store variables for reuse within the test suite. Pauses and waits can be added to replicate real user behavior. Screenshots can be captured. Data-driven testing is supported; that is, external files with lists of parameters can be fed into the tests. Other more advanced features include the use of JavaScript snippets, for example to manipulate data or strings.

Although the recording of the tests must be carried out in Firefox, the test execution supported by the Selenium WebDriver allows us to run the tests


against Firefox, IE, Chrome, Safari, and other common browsers.

On certain projects we encountered issues with Selenium not being able to interact with Windows-specific pop-up dialogues. While Selenium provides a feature to handle browser pop-ups (via getWindowHandles and switchTo.window), the framework cannot handle non-browser components. To deal with this we used another open source automation tool, iMacros. Using iMacros to handle the interaction with the external dialogue, the iMacro script was saved as a bookmark in the browser and called from Selenium, allowing seamless automation of the full test.

Selenium works well for our VnV test team, who utilize its record/playback authoring of tests with some extensions as described previously. But it can also be a powerful tool for developers who need to create tests in Java, Ruby, Python, and other languages using the Selenium API.

The most significant implementation of Selenium automated tests by TSSG was on the FeedHenry project. Working within an aggressive biweekly release cycle,3 the comprehensive automated test suite maintained by the VnV team gave the fledgling company the confidence and space to release features for its enterprise mobile application platform every two weeks. RedHat's €63.5 million acquisition4 of FeedHenry (www.feedhenry.com), which started as a small research project, is a vindication of both the TSSG's research and innovation model and the successful implementation of our agile process and tools.

Load Testing
Performance monitoring is an important part of the software life cycle at TSSG as many of our products will go to live trial and full commercialization. We conduct load testing in two ways. Endurance testing involves evaluating a system's ability to handle a moderate workload for a longer period of time. Volume testing, on the other hand, subjects a system to a heavy load for a limited period of time. Both approaches have identified bottlenecks, bugs, and component limitations in certain projects.

Apache JMeter is an open source load-testing tool designed to load test functional behavior and measure performance. It was originally designed for testing Web applications but has since expanded to other tests. JMeter provides the control needed on the tests and uses multiple threads to emulate multiple users. While JMeter is not a browser, and does not function like one, it can emulate browser behavior in a number of ways: caching, cookie management, and header management.

JMeter includes an HTTP proxy which, when enabled, can be used to record scripts when testing websites. The idea is to point the browser to the JMeter proxy so that JMeter can sniff the requests made through the browser and store them under the specified controller in JMeter. The information derived can be used to create a test plan. JMeter will then play back the HTTP requests to the server, but we can simulate any number of users doing this. We also use Fiddler and Firebug to dig deeper into Web traffic when required.

Unfortunately, JMeter does not execute the JavaScript found in HTML pages. Nor does it render the HTML pages as a browser does (it is possible to view the response as HTML). So, depending on the Web application, it may be necessary to incorporate Selenium tests in a separate test run if you are testing the client-side business logic and use JMeter for the backend components. JMeter also has an extensive set of third-party plugins that extend its functionality with an array of graphs, new load delivery controllers, and other functions that act as add-ons to the original JMeter package.

Recently we have had a requirement to perform load tests across an Openfire IM server instance to test rules from a policy engine we are creating as part of a new project. In this scenario we used Mockaroo to generate a large number of test users with realistic characteristics. We then used Tsung test scripts to fire load at the test environment. Tsung's Erlang foundation means it uses fewer threads than JMeter, allowing larger-scale testing, but it is trumped by JMeter's superior GUI, which allows easy configuration of tests and analysis of results.

Test Case Management
The TSSG has chosen the Klaros test management system for creating and executing manual test cases. A com-



munity edition that is free for unlimited use provides all the functionality required to maintain manual test cases across multiple projects.

Windows and Linux versions are available, and we have installed our instance on a Windows server. The downloadable executable will install an Apache Tomcat application server and a file-based Apache Derby database (although this can be changed to more robust solutions such as MySQL or PostgreSQL). Users can interact with the test case system via any common Web browser.

The free edition of Klaros supports multiple projects being maintained simultaneously. Within these projects the test case manager can build up comprehensive test steps, test cases, and test suites. Varying test environments can be maintained for each project, for example, for different OSes or application servers. Also, versioning of the product being tested is supported via the concept of tracking the system under test (SUT).

At execution time, the tester can decide to complete a single test or a full test suite. The test run can be suspended and resumed as needed. The result of each test (pass or fail) and any errors observed are captured at every step. After the test suite has been completed, an easy-to-read report is generated in various formats (PDF, HTML, CSV). This report will display the date, test environment, system under test, a summary table of the test case results, a pie chart of the results, and a more detailed listing of the individual test cases with details such as execution times.

Also available in the community edition is integration with other open source testing tools. Redmine, described earlier in this article, can be configured as an issue management tool so that tickets can be created directly in the chosen bug tracking and project management tool. The continuous integration servers Hudson and Jenkins (described earlier) can also be integrated with Klaros. By installing the Klaros plugin in Hudson or Jenkins, the results of test cases generated at build time can be imported into Klaros via a REST interface. The import is tagged with the appropriate project, test environment, and SUT IDs, and the results can then be reported and analyzed as required. Similarly, test results from test automation tools such as Selenium can also be imported and added to the manual test results.

Challenges
Supporting mobile usage of the apps and features produced by our projects is an essential element of our quality offering. The matrix of test scenarios can grow quite large when you want to cover phones and tablets across iOS, Android, and other platforms. Add in the spectrum of manufacturers with their own variations, and comprehensively validating app quality can be a challenge.

On certain projects we have dabbled in the pool of commercial offerings that provide a wide range of devices along with automated test functionality, Device Anywhere and Perfecto Mobile being the most popular. However, long-term use of these products was not financially sustainable on projects with tight budgets. We have made some attempts to automate and script tests against Android devices using ADB and monkeyrunner. MonkeyTalk also looks like a reasonable prospect for both iOS and Android. However, the resources required to build and maintain automated mobile test suites have not always proved justifiable. Mobile projects with agile requirements and aggressive deadlines have not yet proved suitable for large-scale automated testing, while some lack of confidence in emulator testing was also a concern. On the flip side, the advent of wearable technology such as glasses and watches broadens the horizon even further and may warrant a more strategic approach.

Another aspect we have identified for improvement in our process is security testing. Many of our projects are in the areas of e-health, finance, and personal digital data and have obvious requirements in terms of protecting user data. Protecting our own intellectual property and hosted services are also ongoing concerns. We have used NetSparker's community edition along with other free tools on certain projects to try to identify vulnerabilities to cross-site scripting, injection attacks, and so on. The code review team, mentioned previously, examines projects' use of static analysis tools such as Checkstyle and FindBugs to find flaws at that level. However, we feel leveraging existing TSSG expertise in this area can help provide a framework and process for a more bulletproof offering to the overall VnV process.

Conclusion
The TSSG operates without baseline funding and must compete nationally and internationally to finance projects. With a commitment to product excellence epitomized by the VnV teams' role on all projects, the use of free and open source tools to build layers of quality into our applications is a necessity. The many integration points between the tools described are an added benefit in pursuing a zero-cost solution to application life cycle management. The tools described in this article are only some of those used within the TSSG process. Monitoring the evolution of existing and new zero-cost tools is an ongoing task in our freeware and open source strategy.

Related articles
on queue.acm.org

Adopting DevOps Practices in Quality Assurance
James Roche
http://queue.acm.org/detail.cfm?id=2540984

Building Collaboration into IDEs
Li-Te Cheng, et al.
http://queue.acm.org/detail.cfm?id=966803

Breaking the Major Release Habit
Damon Poole
http://queue.acm.org/detail.cfm?id=1165768

References
1. Baldwin, H. Reasons companies say yes to open source. (Jan. 6, 2014); http://bit.ly/1uJ7uYK
2. Clarke, P. and O'Connor, R. The influence of SPI on business success in software SMEs: An empirical study. J. Systems and Software 85, 10 (2012), 2356–2367.
3. Dowling, P. Successfully transitioning a research project to a commercial spin-out using an Agile software process. J. Software: Evolution and Process 26 (May 2014), 468–475; doi: 10.1002/smr.1611
4. Kepes, B. Red Hat acquires FeedHenry to move into the mobile space. Forbes (Sept. 18, 2014); http://onforb.es/1CXd4WF
5. TSSG; www.tssg.org/.

Phelim Dowling has 18 years of experience in the IT industry. He joined the TSSG Verification and Validation team in 2006 and has facilitated on projects such as Muzu, FeedHenry, Kodacall, and more recently the EU FP7-funded OpenLab initiative.

Kevin McGrath joined the commercialization team within Waterford Institute of Technology's TSSG in 2006 as a verification and validation engineer. He has worked on FeedHenry, SWAGGER, COIN, SSGS, EGP, and Billing4Rent, an EU-funded project.

Copyright held by authors.
Publication rights licensed to ACM. $15.00.

contributed articles
DOI:10.1145/2699414
Scientific discovery and engineering innovation requires unifying traditionally separated high-performance computing and big data analytics.

BY DANIEL A. REED AND JACK DONGARRA

Exascale Computing and Big Data

NEARLY TWO CENTURIES ago, the English chemist Humphrey Davy wrote "Nothing tends so much to the advancement of knowledge as the application of a new instrument. The native intellectual powers of men in different times are not so much the causes of the different success of their labors, as the peculiar nature of the means and artificial resources in their possession." Davy's observation that advantage accrues to those who have the most powerful scientific tools is no less true today. In 2013, Martin Karplus, Michael Levitt, and Arieh Warshel received the Nobel Prize in chemistry for their work in computational modeling. The Nobel committee said, "Computer models mirroring real life have become crucial for most advances made in chemistry today,"17 and "Computers unveil chemical processes, such as a catalyst's purification of exhaust fumes or the photosynthesis in green leaves."

Whether describing the advantages of high-energy particle accelerators (such as the Large Hadron Collider and the 2013 discovery of the Higgs boson), powerful astronomy instruments (such as the Hubble Space Telescope, which yielded insights into the universe's expansion and dark energy), or high-throughput DNA sequencers and exploration of metagenomics ecology, ever-more powerful scientific instruments continually advance knowledge. Each such scientific instrument, as well as a host of others, is critically dependent on computing for sensor control, data processing, international collaboration, and access.

However, computing is much more than an augmenter of science. Unlike other tools, which are limited to particular scientific domains, computational modeling and data analytics are applicable to all areas of science and engineering, as they breathe life into the underlying mathematics of scientific models. They enable researchers to understand nuanced predictions, as well as shape experiments more efficiently. They also help capture and analyze the torrent of experimental data being produced by a new generation of scientific instruments made possible by advances in computing and microelectronics.

Computational modeling can illuminate the subtleties of complex mathematical models and advance science and engineering where time, cost, or safety precludes experimental assessment alone. Computational models of astrophysical phenomena, on temporal and spatial scales as di-

key insights
■ The tools and cultures of high-performance computing and big data analytics have diverged, to the detriment of both; unification is essential to address a spectrum of major research domains.
■ The challenges of scale tax our ability to transmit data, compute complicated functions on that data, or store a substantial part of it; new approaches are required to meet these challenges.
■ The international nature of science demands further development of advanced computer architectures and global standards for processing data, even as international competition complicates the openness of the scientific process.

ILLUSTRATION BY PETER BOLLINGER


Figure 1. Data analytics and computing ecosystem compared.

[Figure: two software/hardware stacks shown side by side. The data analytics ecosystem layers applications (Mahout, R) over Hive, Pig, Sqoop, and Flume; MapReduce and Storm; Hbase/BigTable (key-value store) with AVRO and Zookeeper (coordination); the Hadoop File System (HDFS); virtual machines and cloud services such as AWS (optional); a Linux OS variant; and commodity x86 racks with Ethernet switches and local node storage. The computational science ecosystem layers applications and community codes over FORTRAN, C, C++, and IDEs; domain-specific libraries; MPI/OpenMP plus accelerator tools, numerical libraries, and performance and debugging tools (such as PAPI); the Lustre parallel file system, batch schedulers (such as SLURM), and monitoring tools; a Linux OS variant; and x86 racks with GPUs or accelerators, InfiniBand, and SAN plus local node storage.]
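The MapReduce layer at the heart of Figure 1's data-analytics stack can be illustrated in miniature. The sketch below is plain Python, not Hadoop; it shows only the map-shuffle-reduce structure that Hadoop distributes across a cluster, using the canonical word-count example:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's list of values into a final result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big compute", "big big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # 4
```

In Hadoop, the map and reduce functions run on many nodes in parallel, with the shuffle performed over the network and HDFS providing the distributed storage underneath.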

verse as planetary system formation, stellar dynamics, black hole behavior, galactic formation, and the interplay of baryonic and putative dark matter, have provided new insights into theories and complemented experimental data. Sophisticated climate models that capture the effects of greenhouse gases, deforestation, and other planetary changes have been key to understanding the effects of human behavior on the weather and climate change.

Computational science and engineering also enable multidisciplinary design and optimization, reducing prototyping time and costs. Advanced simulation has enabled Cummins to build better diesel engines faster and less expensively, Goodyear to design safer tires much more quickly, Boeing to build more fuel-efficient aircraft, and Procter & Gamble to create better materials for home products.

Similarly, "big data," machine learning, and predictive data analytics have been hailed as the fourth paradigm of science,12 allowing researchers to extract insights from both scientific instruments and computational simulations. Machine learning has yielded new insights into health risks and the spread of disease via analysis of social networks, Web-search queries, and hospital data. It is also key to event identification and correlation in domains as diverse as high-energy physics and molecular biology.

As with successive generations of other large-scale scientific instruments, each new generation of advanced computing brings new capabilities, along with technical design challenges and economic trade-offs. Broadly speaking, data-generation capabilities in most science domains are growing more rapidly than compute capabilities, causing these domains to become data-intensive.23 High-performance computers and big-data systems are tied inextricably to the broader computing ecosystem and its designs and markets. They also support national-security needs and economic competitiveness in ways that distinguish them from most other scientific instruments.

This "dual use" model, together with the rising cost of ever-larger computing and data-analysis systems, along with a host of new design challenges at massive scale, is raising new questions about advanced computing research investment priorities, design, and procurement models, as well as global collaboration and competition. This article examines some of these technical challenges, the interdependence of computational modeling and data analytics, and the global ecosystem and competition for leadership in advanced computing. We begin with a primer on the history of advanced computing.

Advanced Computing Ecosystems
By definition, an advanced computing system embodies the hardware, software, and algorithms needed to deliver the very highest capability at any given time. As in Figure 1, the computing and data analytics ecosystems share some attributes, notably reliance on open source software and the x86 hardware ecosystem. However, they differ markedly in their foci and technical approaches. As scientific research increasingly depends on both high-speed

58 COMMUNICATIONS OF THE ACM | JULY 2015 | VOL. 58 | NO. 7


contributed articles

computing and data analytics, the potential interoperability and scaling convergence of these two ecosystems is crucial to the future.

Scientific computing. In the 1980s, vector supercomputing dominated high-performance computing, as embodied in the eponymously named systems designed by the late Seymour Cray. The 1990s saw the rise of massively parallel processors (MPPs) and shared memory multiprocessors (SMPs) built by Thinking Machines, Silicon Graphics, and others. In turn, clusters of commodity (Intel/AMD x86) and purpose-built processors (such as IBM's BlueGene) dominated the previous decade.

Today, these clusters are augmented with computational accelerators in the form of coprocessors from Intel and graphical processing units (GPUs) from Nvidia; they also include high-speed, low-latency interconnects (such as Infiniband). Storage area networks (SANs) are used for persistent data storage, with local disks on each node used only for temporary files. This hardware ecosystem is optimized for performance first, rather than for minimal cost.

Atop the cluster hardware, Linux provides system services, augmented with parallel file systems (such as Lustre) and batch schedulers (such as PBS and SLURM) for parallel job management. MPI and OpenMP are used for internode and intranode parallelism, augmented with libraries and tools (such as CUDA and OpenCL) for coprocessor use. Numerical libraries (such as LAPACK and PETSc) and domain-specific libraries complete the software stack. Applications are typically developed in FORTRAN, C, or C++.

Data analytics. Just a few years ago, the very largest data storage systems contained only a few terabytes of secondary disk storage, backed by automated tape libraries. Today, commercial and research cloud-computing systems each contain many petabytes of secondary storage, and individual research laboratories routinely process terabytes of data produced by their own scientific instruments.

As with high-performance computing, a rich ecosystem of hardware and software has emerged for big-data analytics. Unlike scientific-computing clusters, data-analytics clusters are typically based on commodity Ethernet networks and local storage, with cost and capacity the primary optimization criteria. However, industry is now turning to FPGAs and improved network designs to optimize performance.

Atop this hardware, the Apache Hadoop25 system implements a MapReduce model for data analytics. Hadoop includes a distributed file system (HDFS) for managing large numbers of large files, distributed (with block replication) across the local storage of the cluster. HDFS and HBase, an open-source implementation of Google's BigTable key-value store,3 are the big-data analogs of Lustre for computational science, albeit optimized for different hardware and access patterns.

Atop the Hadoop storage system, tools (such as Pig18) provide a high-level programming model for the two-phase MapReduce model. Coupled with streaming data (Storm and Flume), graph (Giraph), and relational data (Sqoop) support, the Hadoop ecosystem is designed for data analysis. Moreover, tools (such as Mahout) enable classification, recommendation, and prediction via supervised and unsupervised learning. Unlike scientific computing, application development for data analytics often relies on Java and Web services tools (such as Ruby on Rails).

Scaling Challenges

Given the rapid pace of technological change, leading-edge capability is a moving target. Today's smartphone computes as fast as yesterday's supercomputer, and today's personal music collection is as large as yesterday's enterprise-scale storage.

Lest this seem an exaggeration, the measured performance of an Apple iPhone 6 or Samsung Galaxy S5 on standard linear algebra benchmarks now substantially exceeds that of a Cray-1, which is widely viewed as the first successful supercomputer. That same smartphone has storage capacity rivaling the text-based content of a major research library.

Just a few years ago, teraflops (10^12 floating point operations/second) and terabytes (10^12 bytes of secondary storage) defined state-of-the-art advanced computing. Today, those same values represent a desk-side PC with an Nvidia or Intel Xeon Phi accelerator and local storage. Advanced computing is now defined by multiple petaflops (10^15 floating point operations/second) supercomputing systems and cloud data centers with many petabytes of secondary storage.

Figure 2. Advanced computing performance measured by the HPL benchmark. [Chart: flop/s on a log scale from 10^0 to 10^17 versus year, 1950–2014, marking machines from EDSAC 1 and UNIVAC 1 through the IBM 7090 and 360/195, CDC 6600 and 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2 and CM-5, Cray T3D, ASCI Red, ASCI White, Earth Simulator, Blue Gene/L, Road Runner, Tianhe-1A, K, Titan, and Tianhe-2.]

Figure 2 outlines this exponential increase in advanced computing capability, based on the widely used High-Performance LINPACK (HPL) benchmark6 and the Top500 list of the world's fastest computers.16 Although solution of dense linear systems of equations is no longer the best measure of delivered performance on complex scientific and engineering applications, this historical data illustrates how rapidly high-performance computing has evolved. Though high-performance computing has benefited from the same semiconductor advances as commodity com-


puting, sustained system performance has improved even more rapidly due to increasing system size and parallelism.

The growth of personal, business, government, and scientific data has been even more dramatic and well documented. Commercial cloud providers are building worldwide networks of data centers, each costing hundreds of millions of dollars, to support Web search engines, social networks, and cloud services. Concurrently, the volume of scientific data produced annually now challenges the budgets of national research agencies.

As an example, Figure 3 outlines the exponential growth in the number of objects stored in Amazon's Simple Storage Service (S3). Atop such low-level services, companies (such as Netflix) implement advanced recommender systems to suggest movies to subscribers and then stream selections. Scientific researchers also increasingly explore these same cloud services and machine-learning techniques for extracting insight from scientific images, graphs, and text data. There are natural technical and economic synergies among the challenges facing data-intensive science and exascale computing, and advances in both are necessary for future scientific breakthroughs. Data-intensive science relies on the collection, analysis, and management of massive volumes of data, whether obtained from scientific simulations or experimental facilities. In each case, national and international investment in "extreme scale" systems will be necessary to analyze the massive volumes of data that are now commonplace in science and engineering.

Figure 3. Growth of Amazon S3 objects. [Bar chart: objects stored (billions), Q4 2006 through Q4 2013: 2.9, 14, 40, 102, 262, 566, 1,000, 2,000.]

Race to the future. For scientific and engineering computing, exascale (10^18 operations per second) is the next proxy in the long trajectory of exponential performance increases that has continued for more than half a century. Likewise, large-scale data preservation and sustainability within and across disciplines, metadata creation and multidisciplinary fusion, and digital privacy and security define the frontiers of big data. This multifaceted definition of advanced computing encompasses more than simply quantitative measures of sustained arithmetic operation rates or storage capacity and analysis rates; it is also a relative term encompassing qualitative improvements in the usable capabilities of advanced computing systems at all scales. As such, it is intended to suggest a new frontier of practical, delivered capability to scientific and engineering researchers across all disciplines.

However, there are many challenges on the road to ever more advanced computing, including, but not limited to, system power consumption and environmentally friendly cooling; massive parallelism and component failures; data and transaction consistency; metadata and ontology management; precision and recall at scale; and multidisciplinary data fusion and preservation. Above all, advanced computing systems must not become so arcane and complex that they and their services are unusable by all but a handful of experts. Open source toolkits (such as Hadoop, Mahout, and Giraph), along with a growing set of domain-specific tools and languages, have allowed many research groups to apply machine learning to large-scale scientific data without deep knowledge of machine-learning algorithms. The same is true of community codes for computational science modeling.

Hardware, software, data, and politics. Historically, high-performance computing advances have been largely dependent on concurrent advances in algorithms, software, architecture, and hardware that enable higher levels of floating-point performance for computational models. Advances today are also shaped by data-analysis pipelines, data architectures, and machine learning tools that manage large volumes of scientific and engineering data.

However, just as changes in scientific instrumentation scale bring new opportunities, they also bring new challenges, some technical but others organizational, cultural, and economic, and they are not self-similar across scales. Today, exascale computing systems cannot be produced in an affordable and reliable way or be subject to realistic engineering constraints on capital and operating costs, usability, and reliability. As the costs of advanced computing and data-analysis systems, whether commercial or scientific, have moved from millions to billions of dollars, design and decision processes have necessarily become more complex and fraught with controversy. This is a familiar lesson to those in high-energy physics and astronomy, where particle accelerators and telescopes have become planetary-scale instrumentation and the province of international consortia and global politics. Advanced computing is no exception.

The research-and-development costs to create an exascale computing system have been estimated by many experts to exceed one billion U.S. dollars, with an annual operating cost of tens of millions of dollars. Concurrently, there is growing recognition that governments and research agencies have substantially underinvested in data retention and management, as evinced by multibillion-dollar private-sector investments in big data and cloud computing. The largest commercial cloud data centers each cost more than $500 million to construct, and Google, Amazon, Microsoft, Facebook, and other companies operate global networks of such centers.
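The exponential trend in Figure 3 can be quantified in a couple of lines. This is an illustrative sketch only, using the per-quarter object counts (in billions) read from the chart:

```python
# Amazon S3 object counts in billions, Q4 2006 through Q4 2013 (from Figure 3).
objects_billions = [2.9, 14, 40, 102, 262, 566, 1000, 2000]

# Geometric-mean year-over-year growth factor across the seven intervals.
growth = (objects_billions[-1] / objects_billions[0]) ** (1 / 7)
print(round(growth, 2))  # roughly 2.5x per year
```

That is, over this period S3's object count grew by roughly a factor of 2.5 every year, a faster doubling rate than the semiconductor-driven performance curves discussed earlier.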


Against this backdrop, U.S. support for basic research is at a decadal low, when adjusted for inflation,2 and both the U.S. and the European Union continue to experience weak recoveries from the economic downturn of 2008. Further exacerbating the challenges, the global race for advanced computing hegemony is convolved with national-security desires, economic competitiveness, and the future of the mainstream computing ecosystem.

The shift from personal computers to mobile devices has also further raised competition between the U.S.-dominated x86 architectural ecosystem and the globally licensed ARM ecosystem. Concurrently, concerns about national sovereignty, data security, and Internet governance have triggered new competition and political concerns around data services and cloud-computing operations.

Despite these challenges, there is reason for cautious optimism. Every advance in computing technology has driven industry innovation and economic growth, spanning the entire spectrum of computing, from the emerging Internet of Things to ubiquitous mobile devices to the world's most powerful computing systems and largest data archives. These advances have also spurred basic and applied research in every domain of science.

Solving the myriad technical, political, and economic challenges will be neither easy nor even possible by tackling them in isolation. Rather, it will require coordinated planning across government, industry, and academia; commitment to data sharing and sustainability; collaborative research and development; and recognition that both competition and collaboration will be necessary for success. The future of big data and analytics should not be pitted against exascale computing; both are critical to the future of advanced computing and scientific discovery.

[Pull quote: "Computing technology is poised at important inflection points, at the very largest scale, or leading-edge high-performance computing, and the very smallest scale, or semiconductor processes."]

Scientific and Engineering Opportunities

Researchers in the physical sciences and engineering have long been major users of advanced computing and computational models. The more recent adoption by the biological, environmental, and social sciences has been driven in part by the rise of big-data analytics. In addition, advanced computing is now widely used in engineering and advanced manufacturing. From understanding the subtleties of airflow in turbomachinery to chemical molecular dynamics for consumer products to biomass feedstock modeling for fuel cells, advanced computing has become synonymous with multidisciplinary design and optimization and advanced manufacturing.

Looking forward, only a few examples are needed to illustrate the deep and diverse scientific and engineering benefits from advanced computing:

Biology and biomedicine. Biology and biomedicine have been transformed through access to large volumes of genetic data. Inexpensive, high-throughput genetic sequencers have enabled capture of organism DNA sequences and made possible genome-wide association studies for human disease and human microbiome investigations, as well as metagenomics environmental studies. More generally, biological and biomedical challenges span sequence annotation and comparison; protein-structure prediction; molecular simulations and protein machines; metabolic pathways and regulatory networks; whole-cell models and organs; and organisms, environments, and ecologies;

High-energy physics. High-energy physics is both computational- and data-intensive. First-principles computational models of quantum chromodynamics provide numerical estimates and validations of the Standard Model. Similarly, particle detectors require the measurement of probabilities of "interesting" events in large numbers of observations (such as in 10^16 or more particle collisions observed in a year). The Large Hadron Collider and its experiments necessitated creating a worldwide computing grid for data sharing and reduction, driving deployment of advanced networks and protocols, as well as a hierarchy of data repositories. All were necessary to identify the long-sought Higgs boson;

Climate science. Climate science is also critically dependent on the availability of a reliable infrastructure for managing and accessing large heterogeneous quantities of data on a global scale. It is inherently a collaborative and multidisciplinary effort requiring sophisticated modeling of the physical processes and exchange mechanisms


among multiple Earth realms—atmosphere, land, ocean, and sea ice—and comparing and validating these simulations with observational data from various sources, all collected over long periods. To encourage exploration, NASA has made climate and Earth science satellite data available through Amazon Web Services;

Cosmology and astrophysics. Cosmology and astrophysics are now critically dependent on advanced computational models to understand stellar structure, planetary formation, galactic evolution, and other interactions. These models combine fluid processes, radiation transfer, Newtonian gravity, nuclear physics, and general relativity (among other processes). Underlying them is a rich set of computational techniques based on adaptive mesh refinement and particle-in-cell, multipole algorithms, Monte Carlo methods, and smoothed-particle hydrodynamics;

Astronomy. Complementing computation, whole-sky surveys and a new generation of automated telescopes are providing new insights. Rather than capture observational data to answer a known question, astronomers now frequently query extant datasets to discover previously unknown patterns and trends. Big-data reduction and unsupervised learning systems are an essential part of this exploratory image analysis;

Cancer treatment. Effective cancer treatment depends on early detection and targeted treatments via surgery, radiation, and chemotherapy. In turn, tumor identification and treatment planning are dependent on image enhancement, feature extraction and classification, segmentation, registration, 3D reconstruction, and quantification. These and other machine-learning techniques provide not only diagnostic validation, they are increasingly used to conduct comparative and longitudinal analysis of treatment regimes;

Experimental and computational materials science. Experimental and computational materials science is key to understanding materials properties and engineering options; for example, neutron scattering allows researchers to understand the structure and properties of materials, macromolecular and biological systems, and the fundamental physics of the neutron by providing data on the internal structure of materials from the atomic scale (atomic positions and excitations) up to the mesoscale (such as the effects of stress);

Steel production. Steel production via continuous casting accounts for an important fraction of global energy consumption and greenhouse-gas production. Even small improvements to this process would have profound societal benefits and save hundreds of millions of dollars. High-performance computers are used to improve understanding of this complex process via comprehensive computational models, as well as to apply those models to find operating conditions to improve the process; and

Text and data mining. The explosive growth of research publications has made finding and tracking relevant research increasingly difficult. Beyond the volume of text, principles have different or similar names across domains. Text classification, semantic graph visualization tools, and recommender systems are increasingly being used to identify relevant topics and suggest relevant papers for study.

[Pull quote: "It is important for all of computer science to design algorithms that communicate as little as possible, ideally attaining lower bounds on the amount of communication required."]

There are two common themes across these science and engineering challenges. The first is an extremely wide range of temporal and spatial scales and complex, nonlinear interactions across multiple biological and physical processes. These are the most demanding of computational simulations, requiring collaborative research teams, along with the very largest and most capable computing systems. In each case, the goal is predictive simulation, or gleaning insight that tests theories, identifies subtle interactions, and guides new research.

The second theme is the enormous scale and diversity of scientific data and the unprecedented opportunities for data assimilation, multidisciplinary correlation, and statistical analysis. Whether in biological or physical sciences, engineering or business, big data is creating new research needs and opportunities.

Technical Challenges in Advanced Computing

The scientific and engineering opportunities made possible through advanced computing and data analytics


are deep, but the technical challenges in designing, constructing, and operating advanced computing and data-analysis systems of unprecedented scale are just as daunting. Although cloud-computing centers and exascale computational platforms are seemingly quite different, as discussed earlier, the underlying technical challenges of scale are similar, and many of the same companies and researchers are exploring dual-use technologies applicable to both.

In a series of studies over the past five years, the U.S. Department of Energy identified 10 research challenges10,15,24 in developing a new generation of advanced computing systems, including the following, augmented with our own comparisons with cloud computing:

Energy-efficient circuit, power, and cooling technologies. With current semiconductor technologies, all proposed exascale designs would consume hundreds of megawatts of power. New designs and technologies are needed to reduce this energy requirement to a more manageable and economically feasible level (such as 20MW–40MW) comparable to that used by commercial cloud data centers;

High-performance interconnect technologies. In the exascale-computing regime, the energy cost to move a datum will exceed the cost of a floating-point operation, necessitating very energy efficient, low-latency, high-bandwidth interconnects for fine-grain data exchanges among hundreds of thousands of processors. Even with such designs, locality-aware algorithms and software will be needed to maximize computation performance and reduce energy needs;

Driven by cost necessity, commercial cloud computing systems have been built with commodity Ethernet interconnects and adopted a bulk synchronous parallel computation model. Although this approach has proven effective, as evidenced by widespread adoption of MapReduce toolkits (such as Hadoop), a lower-cost, convergence interconnect would benefit both computation and data-intensive platforms and open new possibilities for fine-grain data analysis.

Advanced memory technologies to improve capacity. Minimizing data movement and minimizing energy use are also dependent on new memory technologies, including processor-in-memory, stacked memory (Micron's HMC is an early example), and nonvolatile memory approaches. Although the particulars differ for computation and data analysis, algorithmic determinants of memory capacity will be a significant driver of overall system cost, as the memory per core for very large systems will necessarily be smaller than in current designs;

Scalable system software that is power and failure aware. Traditional high-performance computing software has been predicated on the assumption that failures are infrequent; as we approach exascale levels, systemic resilience in the face of regular component failures will be essential. Similarly, dynamic, adaptive energy management must become an integral part of system software, for both economic and technical reasons.

Cloud services for data analytics, given their commercial quality-of-service agreements, embody large numbers of resilience techniques, including geo-distribution, automatic restart and failover, failure injection, and introspective monitoring; the Netflix "Simian Army"a is illustrative of these techniques.

Data management software that can handle the volume, velocity, and diversity of data. Whether computationally generated or captured from scientific instruments, efficient in situ data analysis requires restructuring of scientific workflows and applications, building on lessons gleaned from commercial data-analysis pipelines, as well as new techniques for data coordinating, learning, and mining. Without them, I/O bottlenecks will limit system utility and applicability;

Programming models to express massive parallelism, data locality, and resilience. The widely used communicating sequential process model, or MPI programming, places the burden of locality and parallelization on application developers. Exascale computing systems will have billion-way parallelism and frequent faults. Needed are more expressive programming models able to deal with this behavior and simplify the developer's efforts while supporting dynamic, fine-grain parallelism. Much can be learned from Web and cloud services where abstraction layers and domain-specific toolkits allow developers to deploy custom execution environments (virtual machines) and leverage high-level services for reduction of complex data. The scientific computing challenge is retaining expressivity and productivity while also delivering high performance.

Reformulation of science problems and refactoring solution algorithms. Many thousands of person-years have been invested in current scientific and engineering codes and in data mining and learning software. Adapting scientific codes to billion-way parallelism will require redesigning, or even reinventing, the algorithms and potentially reformulating the science problems. Integrating data-analytics software and tools with computation is equally daunting; programming languages and models differ, as do the communities and cultures. Understanding how to do these things efficiently and effectively will be key to solving mission-critical science problems;

Ensuring correctness in the face of faults, reproducibility, and algorithm verification. With frequent transient and permanent faults, lack of reproducibility in collective communication, and new mathematical algorithms with limited verification, computation validation and correctness assurance will be much more important for the next generation of massively parallel systems, whether optimized for scientific computing, data analysis, or both;

Mathematical optimization and uncertainty quantification for discovery, design, and decision. Large-scale computations are themselves experiments that probe the sample space of numerical models. Understanding the sensitivity of computational predictions to model inputs and assumptions, particularly when they involve complex, multidisciplinary applications, requires new tools and techniques for application validation and assessment. The equally important analogs in large-scale data analytics and machine learning are precision (the fraction of retrieved data that is relevant) and recall (the fraction of relevant data retrieved); and

a http://techblog.netflix.com/2011/07/netflix-simian-army.html


Software engineering and supporting structures to enable productivity. Although programming tools, compilers, debuggers, and performance-enhancement tools shape research productivity for all computing systems, at scale, application design and management for reliable, efficient, and correct computation is especially daunting. Unless researcher productivity increases, the time to solution may be dominated by application development, not computation.

Similar hardware and software studies1,14 chartered by the U.S. Defense Advanced Research Projects Agency identified the following challenges, most similar to those cited by the Department of Energy studies:

Energy-efficient operation. Energy-efficient operation to achieve desired computation rates subject to overall power dissipation;

Memory capacity. Primary and secondary memory capacity and access rates, subject to power constraints;

Concurrency and locality. Concurrency and locality to meet performance targets while allowing some threads to stall during long-latency operations;

Resilience. Resilience, given large component counts, shrinking silicon feature sizes, low-power operation, and transient and permanent component failures;

Application scaling. Application scaling subject to memory capacity and communication latency constraints;

Managing parallelism. Expressing and managing parallelism and locality in system software and portable programming models, including runtime systems, schedulers, and libraries; and

Software tools. Software tools for performance tuning, correctness assessment, and energy management.

Moreover, a 2011 study by the U.S. National Academy of Sciences (NAS)9 suggested that, barring a breakthrough, the exponential increases in performance derived from shrinking semiconductor feature size and architectural innovation are nearing an

mains divided on possible approaches, with strong believers that technical obstacles limiting extension of current approaches will be overcome and others who believe more radical technology and design approaches (such as quantum and superconducting devices) may be required.

Hardware and architecture challenges. Although a complete description of the hardware and software technical challenges just outlined is beyond the scope of this article, review of a selected subset is useful to illuminate the depth and breadth of the problems and their implications for the future of both advanced computing and the broader deployment of next-generation consumer- and business-computing technologies.

Post-Dennard scaling. For decades, Moore's "law" has held true due to the hard work and creativity of a great many people, as well as many billions of dollars of investment in process technology and silicon foundries. It has also rested on the principle of Dennard scaling,5,13 providing a recipe for shrinking transistors and yielding smaller circuits with the same power density. Decreasing a transistor's linear size by a factor of two thus reduced power by a factor of four, with both voltage and current halving.

Although transistor size continues to shrink, with 22-nanometer feature size now common, transistor power consumption no longer decreases accordingly. This has led to limits on chip clock rates and power consumption, along with design of multicore chips and the rise of dark silicon—chips with more transistors than can be active simultaneously due to thermal and power constraints.8

These semiconductor challenges have stimulated a rethinking of chip design, where the potential performance advantage of architectural tricks—superpipelining, scoreboarding, vectorization, and parallelization—must be balanced against their energy consumption. Simpler designs and func-

Chip power limits have stimulated great interest in the ARM processor ecosystem. Because ARM designs were optimized for embedded and mobile devices, where limited power consumption has long been a design driver, they have simpler pipelines and instruction decoders than x86 designs.

In this new world, hardware/software co-design becomes de rigueur, with devices and software systems interdependent. The implications are far fewer general-purpose performance increases, more hardware diversity, elevation of multivariate optimization (such as power, performance, and reliability) in programming models, and new system-software resource-management challenges.

Resilience and energy efficiency at scale. As advanced computing and data analysis systems grow ever larger, the assumption of fully reliable operation becomes much less credible. Although the mean time before failure for individual components continues to increase incrementally, the large overall component count for these systems means the systems themselves will fail more frequently. To date, experience has shown failures can be managed but only with improved techniques for detecting and understanding component failures.

Data from commercial cloud data centers suggests some long-held assumptions about component failures and lifetimes are incorrect.11,20–22 A 2009 Google study22 showed DRAM error rates were orders-of-magnitude higher than previously reported, with over 8% of DIMMs affected by errors in a year. Equally surprising, these were hard errors, rather than soft, correctable (via error-correcting code) errors.

In addition to resilience, scale also brings new challenges in energy management and thermal dissipation. Today's advanced computing and data-analysis systems consume megawatts of power, and cooling capability and peak power loads limit where many
end. This study, along with others, sug- tion-specific accelerators often yield a systems can be placed geographically.
gests computing technology is poised better balance of power consumption As commercial cloud operators have
at important inflection points, at the and performance. This architectural learned, energy infrastructure and pow-
very largest scale, or leading-edge high- shift will be especially true if the bal- er are a substantial fraction of total sys-
performance computing, and the very ance of integer, branch, and floating- tem cost at scale, necessitating new in-
smallest scale, or semiconductor pro- point operations shifts to support in frastructure approaches and operating
cesses. The computing community re- situ data analysis and computing. models, including low-power designs,

64 COMMUNICATIONS OF THE ACM | JULY 2015 | VOL. 58 | NO. 7
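The constant-field scaling recipe in the "Post-Dennard scaling" discussion above can be made concrete with a small numerical sketch. This is a simplified illustrative model, not material from the article; the function name and unit values are assumptions of the sketch:

```python
# Simplified constant-field (Dennard) scaling sketch.
# Per-transistor power is modeled as P = V * I. Under idealized
# Dennard scaling, shrinking linear size by a factor k scales
# voltage and current each by 1/k, so per-transistor power falls
# by 1/k^2, while transistor density rises by k^2: the power
# density of the chip stays constant.

def dennard_scale(voltage, current, k):
    """Return (power_before, power_after, density_gain) for a
    linear shrink factor k under idealized constant-field scaling."""
    power_before = voltage * current
    power_after = (voltage / k) * (current / k)
    density_gain = k ** 2  # k^2 more transistors per unit area
    return power_before, power_after, density_gain

p0, p1, gain = dennard_scale(voltage=1.0, current=1.0, k=2.0)
print(p1 / p0)    # per-transistor power drops to one-quarter
print(gain * p1)  # power per unit area is unchanged
```

With k = 2 this reproduces the factor-of-four power reduction quoted in the text; the breakdown of this recipe (voltage no longer scaling down with feature size) is exactly what the article calls the end of Dennard scaling.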


contributed articles

cooling approaches, energy accountability, and operational efficiencies.

Software and algorithmic challenges. Many of the software and algorithmic challenges for advanced computing and big-data analytics are themselves consequences of extreme system scale. As noted earlier, advanced scientific computing shares many of the scaling problems of Web and cloud services but differs in its price-performance optimization balance, emphasizing high levels of performance, whether for computation or for data analysis. This distinction is central to the design choices and optimization criteria.

Given the scale and expected error rates of exascale computing systems, design and implementation of algorithms must be rethought from first principles, including exploration of global synchronization-free (or at least minimal) algorithms, fault-oblivious and error-tolerant algorithms, architecture-aware algorithms suitable for heterogeneous and hierarchically organized hardware, support for mixed-precision arithmetic, and software for energy-efficient computing.

Locality and scale. As noted earlier, putative designs for extreme-scale computing systems are projected to require billion-way computational concurrency, with aggressive parallelism at all system levels. Maintaining load balance on all levels of a hierarchy of algorithms and platforms will be the key to efficient execution. This will likely require dynamic, adaptive runtime mechanisms4 and self-aware resource allocation to tolerate not only algorithmic imbalances but also variability in hardware performance and reliability.

In turn, the energy costs and latencies for communication will place an even greater premium on computation locality than today. Inverting long-held models, arithmetic operations will be far less energy intensive and more efficient than communication. Algorithmic complexity is usually expressed in terms of number of operations performed rather than quantity of data movement to memory. This is directly opposed to the expected costs of computation at large scale, where memory movement will be very expensive and operations will be nearly free, an issue of importance to both floating-point-intensive and data-analysis algorithms.

The temporal cost of data movement will challenge traditional algorithmic design approaches and comparative optimizations, making redundant computation sometimes preferable to data sharing and elevating communication complexity to parity with computation. It is therefore important for all of computer science to design algorithms that communicate as little as possible, ideally attaining lower bounds on the amount of communication required. It will also require models and methods to minimize and tolerate (hide) latency, optimize data motion, and remove global synchronization.

Adaptive system software. Resource management for today's high-performance computing systems remains rooted in a deus ex machina model, with coordinated scheduling and tightly synchronized communication. However, extreme scale, hardware heterogeneity, system power, and heat-dissipation constraints and increased component failure rates influence not only the design and implementation of applications, they also influence the design of system software in areas as diverse as energy management and I/O. Similarly, as the volume of scientific data grows, it is unclear if the traditional file abstractions and parallel file systems used by technical computing will scale to trans-petascale data analysis.

Instead, new system-software and operating-system designs will need to support management of heterogeneous resources and non-cache-coherent memory hierarchies, provide applications and runtime with more control of task scheduling policies, and manage global namespaces. They must also expose mechanisms for finer measurement, prediction, and control of power management, allowing schedulers to map computations to function-specific accelerators and manage thermal envelopes and application energy profiles. Commercial cloud providers have already faced many of these problems, and their experience in large-scale resource management has much to offer the scientific computing ecosystem.

Parallel programming support. As the diversity, complexity, and scale of advanced computing hardware have increased, the complexity and difficulty of


developing applications has increased as well, with many operating functions now subsumed by applications. Application complexity has been further exacerbated by the increasingly multidisciplinary nature of applications that combine algorithms and models spanning a range of spatiotemporal scales and algorithmic approaches.

Consider the typical single-program multiple-data parallel-programming or bulk-synchronous parallel model, where application data is partitioned and distributed across the individual memories or disks of the computation nodes, and the nodes share data via network message passing. In turn, the application code on each node manages the local, multilevel computation hierarchy—typically multiple, multithreaded, possibly heterogeneous cores, and (often) a GPU accelerator—and coordinates I/O, manages application checkpointing, and oversees power budgets and thermal dissipation. This daunting level of complexity and detailed configuration and tuning makes developing robust applications an arcane art accessible to only a dedicated and capable few.

Ideally, future software design, development, and deployment will raise the abstraction level, with performance and correctness in mind at the outset rather than ex situ. Beyond more performance-aware design and development of applications based on integrated performance and correctness models, these tools must be integrated with compilers and runtime systems, provide more support for heterogeneous hardware and mixed programming models, and provide more sophisticated data processing and analysis.

Programming models and tools are perhaps the biggest point of divergence between the scientific-computing and big-data ecosystems. The latter emphasizes simple abstractions (such as key-value stores and MapReduce), along with semantics-rich data formats and high-level specifications. This has allowed many developers to create complex machine-learning applications with little knowledge of the underlying hardware or system software. In contrast, scientific computing has continued to rely largely on traditional languages and libraries.

New programming languages and models, beyond C and FORTRAN, will help. Given the applications software already in place for technical computing, a radical departure is not realistic. Programming features found in new languages (such as Chapel and X10) have already had an indirect effect on existing program models. Existing programming models (such as OpenMP) have already benefited through recent extensions for, say, task parallelism, accelerators, and thread affinity.

Domain-specific languages (DSLs) are languages that specialize to a particular application domain, representing a means of extending the existing base language by hosting DSL extensions. Embedded DSLs are a pragmatic way to exploit the sophisticated analysis and transformation capabilities of the compilers for standard languages. The developer writes the application using high-level primitives a compiler will transform into efficient low-level code to optimize the performance on the underlying platform.

Algorithmic and mathematics challenges. Exascale computing will put greater demand on algorithms in at least two areas: the need for increasing amounts of data locality to perform computations efficiently and the need to obtain much higher levels of fine-grain parallelism, as high-end systems support increasing numbers of compute threads. As a consequence, parallel algorithms must adapt to this environment, and new algorithms and implementations must be developed to exploit the computational capabilities of the new hardware.

Significant model development, algorithm redesign, and science-application reimplementation, supported by (an) exascale-appropriate programming model(s), will be required to exploit the power of exascale architectures. The transition from current sub-petascale and petascale computing to exascale computing will be at least as disruptive as the transition from vector to parallel computing in the 1990s.

ILLUSTRATION BY PETER BOLLINGER

Economic and Political Challenges
The technical challenges of advanced computing and big-data analytics are shaped by other elements of the broader computing landscape. In particular, powerful smartphones and cloud computing services are rapidly displacing the PC and local servers as the computing standard. This shift has also triggered international competition for industrial and business advantage, with countries and regions investing in new technologies and system deployments.

Computing ecosystem shifts. The Internet and Web-services revolution is global, and U.S. influence, though


substantial, is waning. Notwithstanding Apple's phenomenal success, most smartphones and tablets are now designed, built, and purchased globally, and the annual sales volume of smartphones and tablets exceeds that of PCs and servers.

This ongoing shift in consumer preferences and markets is accompanied by another technology shift. Smartphones and tablets are based on energy-efficient microprocessors—a key component of proposed exascale computing designs—and systems-on-a-chip (SoCs) using the ARM architecture. Unlike Intel and AMD, which design and manufacture the x86 chips found in today's PCs and most leading-edge servers and HPC systems, ARM does not manufacture its own chips. Rather, it licenses its designs to others, who incorporate the ARM architecture into custom SoCs that are manufactured by global semiconductor foundries like Taiwan's TSMC.

International exascale projects. The international competition surrounding advanced computing mixes concern about economic competitiveness, shifting technology ecosystems (such as ARM and x86), business and technical computing (such as cloud computing services and data centers), and scientific and engineering research. The European Union, Japan, China, and U.S. have all launched exascale computing projects, each with differing emphasis on hardware technologies, system software, algorithms, and applications.

European Union. The European Union (EU) announced the start of its exascale research program in October 2011 with €25 million in funding for three complementary research projects in its Framework 7 effort. The Collaborative Research into Exascale Systemware, Tools and Applications (CRESTA), Dynamical Exascale Entry Platform (DEEP), and Mont-Blanc projects will each investigate different exascale challenges using a co-design model spanning hardware, system software, and software applications. This initiative represents Europe's first sustained investment in exascale research.

CRESTA brings together four European high-performance computing centers: Edinburgh Parallel Computing Centre (project lead), the High Performance Computing Center Stuttgart, Finland's IT Center for Science Ltd., and Partner Development Center Sweden, as well as the Dresden University of Technology, which will lend expertise in performance optimization. In addition, the CRESTA team also includes application professionals from European science and industry, as well as HPC vendors, including HPC tool developer Allinea and HPC vendor Cray. CRESTA focuses on the use of applications as co-design drivers for software development environments, algorithms and libraries, user tools, and underpinning and crosscutting technologies.

The Mont-Blanc project, led by the Barcelona Supercomputing Center, brings together European technology providers ARM, Bull, Gnodal, and major supercomputing organizations involved with the Partnership for Advanced Computing in Europe (PRACE) project, including Juelich, Leibniz-Rechenzentrum, or LRZ, GENCI, and CINECA. The project intends to deploy a first-generation HPC system built from energy-efficient embedded technologies and conduct the research necessary to achieve exascale performance with energy-efficient designs.

DEEP, led by Forschungszentrum Juelich, seeks to develop an exascale-enabling platform and optimization of a set of grand-challenge codes. The system is based on a commodity cluster and accelerator design—Cluster Booster Architecture—as a proof-of-concept for a 100 petaflop/s PRACE production system. In addition to the lead partner, Juelich, project partners include Intel, ParTec, LRZ, Universität Heidelberg, German Research School for Simulation Sciences, Eurotech, Barcelona Supercomputing Center, Mellanox, École Polytechnique Fédérale de Lausanne, Katholieke Universiteit Leuven, Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, the Cyprus Institute, Universität Regensburg, CINECA, a consortium of 70 universities in Italy, and Compagnie Générale de Géophysique-Veritas.

Japan. In December 2013, the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) selected RIKEN to develop and deploy an exascale system by 2020. Selection was based on its experience developing and operating the K computer, which, at 10 petaflop/s, was ranked the fastest supercomputer in the world in 2011. Estimated to cost ¥140 billion ($1.38 billion), the exascale system design will be based on a combination of general-purpose processors and accelerators and involve three key Japanese computer vendors—Fujitsu, Hitachi, and NEC—as well as technical support from the University of Tokyo, University of Tsukuba, Tokyo Institute of Technology, Tohoku University, and RIKEN.

China. China's Tianhe-2 system is the world's fastest supercomputer today. It contains 16,000 nodes, each with two Intel Xeon processors and three Intel Xeon Phi coprocessors. It also contains a proprietary high-speed interconnect, called TH Express-2, designed by the National University for Defense Technology (NUDT). NUDT conducts research on processors, compilers, parallel algorithms, and systems. Based on this work, China is expected to produce a 100-petaflop/s system in 2016 built entirely from Chinese-made chips, specifically the ShenWei processor, and interconnects. Tianhe-2 was to be upgraded from a peak of 55 petaflop/s to 100 petaflop/s in 2015, but the U.S. Department of Commerce has restricted exports of Intel processors to NUDT, the National Supercomputing Center in Changsha, National Supercomputing Center in Guangzhou, and the National Supercomputing Center in Tianjin due to national-security concerns.

U.S. Historically, the U.S. Networking and Information Technology Research and Development program has spanned several research missions and agencies, with primary leadership by the Department of Energy (DOE), Department of Defense (DoD), and National Science Foundation (NSF). DOE is today the most active deployer of high-performance computing systems and developer of plans for exascale computing. In contrast, NSF and DoD have focused more on broad cyberinfrastructure and enabling-technologies research, including research cloud services and big-data analytics. Although planning continues, the U.S. has not yet mounted an advanced computing initiative similar to those under way in Europe and Japan.
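The Tianhe-2 node counts above are consistent with its quoted 55-petaflop/s peak. A back-of-envelope sketch follows; the per-device peak figures are approximate public numbers assumed here for illustration, not values taken from the article:

```python
# Rough peak-performance estimate for Tianhe-2 from its node count:
# 16,000 nodes, each with two Intel Xeon CPUs and three Xeon Phi
# coprocessors. Assumed approximate per-device double-precision
# peaks, in Tflop/s (assumptions of this sketch):
XEON_TFLOPS = 0.211   # 12-core Xeon E5-2692 v2 at 2.2GHz (approx.)
PHI_TFLOPS = 1.003    # Xeon Phi 31S1P coprocessor (approx.)
NODES = 16_000

node_tflops = 2 * XEON_TFLOPS + 3 * PHI_TFLOPS  # two CPUs + three coprocessors
peak_pflops = NODES * node_tflops / 1000.0      # Tflop/s -> Pflop/s
print(round(peak_pflops, 1))  # about 54.9, matching the quoted ~55 Pflop/s peak
```

The coprocessors dominate: under these assumptions roughly 88% of the machine's peak comes from the Xeon Phi cards, which is why the export restrictions on Intel processors mentioned above were significant for the planned upgrade.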


International collaboration. Although global competition for advanced computing and data-analytics leadership continues, there is active international collaboration. The International Exascale Software Project (IESP) is one such example in advanced computing. With seed funding from governments in Japan, the E.U., and the U.S., as well as supplemental contributions from industry stakeholders, IESP was formed to empower ultra-high-resolution and data-intensive science and engineering research through 2020.

In a series of meetings, the international IESP team developed a plan for a common, high-quality computational environment for petascale/exascale systems. The associated roadmap for software development would take the community from its current position to exascale computing.7

Conclusion
Computing is at a profound inflection point, economically and technically. The end of Dennard scaling and its implications for continuing semiconductor-design advances, the shift to mobile and cloud computing, the explosive growth of scientific, business, government, and consumer data and opportunities for data analytics and machine learning, and the continuing need for more-powerful computing systems to advance science and engineering are the context for the debate over the future of exascale computing and big data analysis. However, certain things are clear:

Big data and exascale. High-end data analytics (big data) and high-end computing (exascale) are both essential elements of an integrated computing research-and-development agenda; neither should be sacrificed or minimized to advance the other;

Algorithms, software, applications. Research and development of next-generation algorithms, software, and applications is as crucial as investment in semiconductor devices and hardware; historically the research community has underinvested in these areas;

Information technology ecosystem. The global information technology ecosystem is in flux, with the transition to a new generation of low-power mobile devices, cloud services, and rich data analytics; and

Private and global research. Private-sector competition and global-research collaboration are both necessary to address design, test, and deploy exascale-class computing and data-analysis capabilities.

There are great opportunities and great challenges in advanced computing, in both computation and data analysis. Scientific discovery via computational science and data analytics is truly the "endless frontier" about which Vannevar Bush spoke so eloquently in 1945. The challenges are for all of computer science to sustain the research, development, and deployment of the high-performance computing infrastructure needed to enable those discoveries.

Acknowledgments
We are grateful for insights and perspectives we received from the DARPA and DOE exascale hardware, software, and application study groups. We also acknowledge the insightful comments and suggestions from the reviewers of earlier drafts of this article. Daniel A. Reed acknowledges support from the National Science Foundation under NSF grant ACI-1349521. Jack Dongarra acknowledges support from the National Science Foundation under NSF grant ACI-1339822 and by the Department of Energy under DOE grant DE-FG02-13ER26151.

References
1. Amarasinghe, S. et al. Exascale Software Study: Software Challenges in Extreme-Scale Systems. Defense Advanced Research Projects Agency, Arlington, VA, 2009; http://www.cs.rice.edu/~vs3/PDF/Sarkar-ACS-July-2011-v2.pdf
2. American Association for the Advancement of Science. Guide to R&D Funding - Historical Data. AAAS, Washington, D.C., 2015; http://www.aaas.org/page/historical-trends-federal-rd
3. Chang, F. et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 26, 2 (June 2008), 4:1–4:26.
4. Datta, K. et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Austin, TX, Nov. 15–21). IEEE Press, Piscataway, NJ, 2008, 1–12.
5. Dennard, R.H., Gaensslen, F.H., Yu, H.-n., Rideout, V.L., Bassous, E., and LeBlanc, A.R. Design of ion-implanted MOSFETs with very small physical dimensions. IEEE Journal of Solid State Circuits 9, 5 (Jan. 1974), 256–268.
6. Dongarra, J.J. The LINPACK benchmark: An explanation. In Proceedings of the First International Conference on Supercomputing (Athens, Greece, June 8–12). Springer-Verlag, New York, 1988, 456–474.
7. Dongarra, J.J. et al. The international exascale software project roadmap. International Journal of High Performance Computing Applications 25, 1 (Feb. 2011), 3–60.
8. Esmaeilzadeh, H., Blem, E., Amant, R.S., Sankaralingam, K., and Burger, D. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (San Jose, CA, June 4–8). ACM, New York, 2011, 365–376.
9. Fuller, S.H. and Millett, L.I. Computing performance: Game over or next level? Computer 44, 1 (Jan. 2011), 31–38.
10. Geist, A. and Lucas, R. Major computer science challenges at exascale. International Journal of High Performance Applications 23, 4 (Nov. 2009), 427–436.
11. Gill, P., Jain, N., and Nagappan, N. Understanding network failures in data centers: Measurement, analysis, and implications. Proceedings of ACM SIGCOMM 41, 4 (Aug. 2011), 350–361.
12. Hey, T., Tansley, S., and Tolle, K. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA, 2009; http://research.microsoft.com/en-us/UM/redmond/about/collaboration/fourthparadigm/4th_PARADIGM_BOOK_complete_HR.pdf
13. Kamil, S., Shalf, J., and Strohmaier, E. Power efficiency in high-performance computing. In Proceedings of the Fourth Workshop on High-Performance, Power-Aware Computing (Miami, FL, Apr.). IEEE Press, 2008.
14. Kogge, P., Bergman, K., Borkar, S. et al. Exascale Computing Study: Technology Challenges in Achieving Exascale Systems. U.S. Defense Advanced Research Projects Agency, Arlington, VA, 2008; http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf
15. Lucas, R., Ang, J., Bergman, K., Borkar, S. et al. Top Ten Exascale Research Challenges. Office of Science, U.S. Department of Energy, Washington, D.C., Feb. 2014; http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf
16. Meuer, H., Strohmaier, E., Dongarra, J., and Simon, H. Top 500 Supercomputer Sites, 2015; http://www.top500.org
17. Nobelprize.org. Nobel Prize in Chemistry 2013; http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2013/press.html
18. Olston, C., Reed, B., Strivastava, U., Kumar, R., and Tomkins, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (Vancouver, BC, Canada, June 9–12). ACM Press, New York, 2008, 1099–1110.
19. Partnership for Advanced Computing in Europe (PRACE). 2014; http://www.prace-ri.eu/
20. Pinheiro, E., Weber, W.-D., and Barroso, L.A. Failure trends in a large disk drive population. In Proceedings of the Fifth USENIX Conference on File and Storage Technologies (San Jose, CA, Feb. 13–16). USENIX Association, Berkeley, CA, 2007.
21. Schroeder, B. and Gibson, G.A. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? ACM Transactions on Storage 3, 3 (Oct. 2007), 8.
22. Schroeder, B., Pinheiro, E., and Weber, W.-D. DRAM errors in the wild: A large-scale field study. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems (Seattle, WA, June). ACM Press, New York, 2009, 193–204.
23. U.S. Department of Energy. Synergistic Challenges in Data-Intensive Science and Exascale Computing. Report of the Advanced Scientific Computing Advisory Committee Subcommittee, Mar. 30, 2013; http://science.energy.gov/~/media/ascr/ascac/pdf/reports/2013/ASCAC_Data_Intensive_Computing_report_final.pdf
24. U.S. Department of Energy. The Opportunities and Challenges of Exascale Computing. Office of Science, Washington, D.C., 2010; http://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf
25. White, T. Hadoop: The Definitive Guide. O'Reilly Media, May 2012.

Daniel A. Reed (dan-reed@uiowa.edu) is Vice President for Research and Economic Development, and Professor of Computer Science, Electrical, and Computer Engineering and Medicine, at the University of Iowa.

Jack Dongarra (dongarra@icl.utk.edu) holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester.

© 2015 ACM 0001-0782/15/07 $15.00

Watch the authors discuss their work in this exclusive Communications video. http://cacm.acm.org/videos/exascale-computing-and-big-data



DOI:10.1145/2699418

A deep, fine-grain analysis of rhetorical structure highlights crucial sentiment-carrying text segments.

BY ALEXANDER HOGENBOOM, FLAVIUS FRASINCAR, FRANCISKA DE JONG, AND UZAY KAYMAK

Using Rhetorical Structure in Sentiment Analysis
POPULAR WEB SITES like Twitter, Blogger, and Epinions
let users vent their opinions on just about anything
through an ever-increasing amount of short messages,
blog posts, and reviews. Automated sentiment-analysis
techniques can extract traces of people’s sentiment, or
attitude toward certain topics, from such texts.6
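The word-matching approach that many such tools start from can be illustrated with a minimal lexicon-based polarity scorer. The tiny lexicon and its scores below are invented for illustration; real systems use large curated lexicons:

```python
# Minimal lexicon-based sentiment scorer: sum the scores of
# sentiment-carrying words and classify by the sign of the total.
# The lexicon is a toy example, not a real resource.
LEXICON = {
    "enjoyed": 2.0, "great": 2.0, "love": 2.5,
    "hates": -2.5, "bitterly": -1.5, "awful": -2.0,
}

def polarity(text):
    """Classify a document as 'positive' or 'negative'."""
    words = text.lower().split()
    score = sum(LEXICON.get(w.strip(".,!?"), 0.0) for w in words)
    return "positive" if score >= 0 else "negative"

print(polarity("I enjoyed this movie, it was great!"))  # positive
print(polarity("He hates this type of movies."))        # negative

# Word counting alone mislabels the sentence analyzed in Figure 1,
# where the rhetorically dominant clause is positive:
print(polarity("While always complaining that he hates this type of "
               "movies, John bitterly confessed that he enjoyed this "
               "movie."))  # negative, though the sentence is positive
```

The last example is exactly the failure mode this article addresses: the overall score is dragged down by sentiment-carrying words in subordinate segments, which a rhetorical-structure analysis would down-weight.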
key insights
• Traditional automated sentiment analysis techniques rely mostly on word frequencies and thus fail to capture crucial nuances in text.
• A better understanding of a text's sentiment can be obtained by guiding the analysis by the text's rhetorical structure.
• A deep, fine-grain analysis of the rhetorical structure of smaller units of text can highlight crucial sentiment-carrying text segments and help correctly interpret these segments.

This analysis can yield a competitive advantage for businesses,2 as one-fifth of all tweets10 and one-third of all blog posts16 discuss products or brands. Identifying people's opinions on these topics,11 and identifying the pros and cons of specific products,12 are therefore promising applications for sentiment-analysis tools.

Many commercial sentiment-analysis systems rely mostly on the occurrence of sentiment-carrying words, although more sophistication can be introduced in several ways. Among open research issues identified by Feldman6 is the role of textual structure. Structural aspects may contain valuable information;19,24 sentiment-carrying words in a conclusion may


contribute more to a text's overall sentiment than sentiment-carrying words in, say, background information.

Existing work typically uses structural aspects of text to distinguish important text segments from less-important segments in terms of their contribution to a text's overall sentiment, subsequently weighting a segment's conveyed sentiment in accordance with its importance.

[Figure 1. Interpretations of a positive RST-structured sentence consisting of nuclei (marked with vertical lines) and satellites: "While always complaining that he hates this type of movies, John bitterly confessed that he enjoyed this movie." Negative words are in red in italics, and positive words are in green and underlined in italics. Sentiment-carrying words with relatively high intensity are brighter. Horizontal lines indicate the spans of the RST elements at each level of the hierarchical rhetorical structure. Arrows and their (capitalized) captions represent the relations of satellite elements to nucleus elements. Text segments and RST elements assigned a relatively low weight in the analysis of the conveyed sentiment are more transparent than those with higher weights. (a) Accounting for sentiment-carrying words only. (b) Accounting for rhetorical roles as identified by the RST tree's top-level split. (c) Accounting for the rhetorical roles of the leaf nodes of the RST tree. (d) Accounting for the full, hierarchical rhetorical structure.]

While the benefit of exploiting isolated rhetorical roles for sentiment analysis has been demonstrated in existing work, the full rhetorical structure in which these roles are defined has thus far been ignored. However, a text segment with a particular rhetorical role can consist of smaller subordinate segments that are rhetorically related to one another, thus forming a […]

[…] sentiment analysis through a deep analysis of a text's rhetorical structure can yield better understanding of conveyed sentiment with respect to an entity of interest.

Our contribution is threefold. First, as an alternative to existing shallow RST-guided sentiment analysis approaches that typically focus on rhetorical relations in top-level splits of RST trees, we propose to focus on the leaf nodes of RST trees or, alternatively, to account for the full RST trees. Second, we propose a novel RST-based weighting scheme that is more refined than existing weighting schemes.8,24 Third, rhetorical relations across sentences and paragraphs can provide useful context for the rhetorical structure of their subordinate text segments. Whereas existing work mostly guides sentiment analysis through sentence-level analyses of rhetorical structure (if at all), we therefore additionally incorporate paragraph-level and document-level analyses of rhetorical structure into the process. We thus account for rhetorical relations across sentences and paragraphs.

Sentiment Analysis and Rhetorical Relations
Automated sentiment analysis is related to natural language processing, computational linguistics, and text mining. Typical tasks include distinguishing subjective text segments from objective ones and determining the polarity of words, sentences, text segments, and documents.18 We address the binary document-level polarity classification task, dealing with classifying documents as positive or negative.

The state of the art in sentiment analysis has been reviewed extensively.6,18 Existing methods range from machine learning methods that exploit patterns in vector representations of text to lexicon-based methods that account for semantic orientation in individual words by matching the words with a sentiment lexicon, listing words and their associated sentiment. Lexicon-based methods are typically more robust across domains and texts,24 allow-
A segment’s importance is often related hierarchical rhetorical tree structure. ing for intuitive incorporation of deep
to its Position in a text,19 yet recently Important segments may contain linguistic analyses.
developed methods make coarse-grain less-important parts. Since account- Deep linguistic analysis is a key
distinctions between segments based ing for such structural aspects of a success factor for sentiment-analysis
on their rhetorical roles5,8,24 by applying text enables better understanding of systems, as it helps deal with “compo-
Rhetorical Structure Theory (RST).13 the text,15 we hypothesize that guiding sitionality,”21 or the way the semantic

70 COMMUNICATIONS OF THE ACM | JULY 2015 | VOL. 58 | NO. 7


contributed articles

orientation of text is determined by the combined semantic orientations of its constituent phrases. This compositionality can be captured by accounting for the grammatical20 or the discursive4,5,8,22,24 structure of text.

RST13 is a popular discourse-analysis framework, splitting text into rhetorically related segments that may in turn be split, thus yielding a hierarchical rhetorical structure. Each segment is either a nucleus or a satellite. Nuclei constitute a text's core and are supported by satellites that are considered less important. In total, 23 types of relations between RST elements have been proposed;13 for example, a satellite may form an elaboration on or a contrast with a nucleus.

Consider the positive sentence "While always complaining that he hates this type of movies, John bitterly confessed that he enjoyed this movie," which contains mostly negative words. RST can split the sentence into a hierarchical rhetorical structure of text segments (see Figure 1). The top-level nucleus contains the core message "John bitterly confessed he enjoyed this movie," with a satellite providing background information. This background satellite consists of a nucleus ("he hates this type of movies") and an attributing satellite ("While always complaining that"). Similarly, the core consists of a nucleus ("he enjoyed this movie") and an attributing satellite ("John bitterly confessed that"). The sentence conveys a positive overall sentiment toward the movie, due to the way the sentiment-carrying words are used in the sentence; the actual sentiment is conveyed by the nucleus "he enjoyed this movie."

Several methods of using rhetorical relations for sentiment analysis are available, with some, including our own methods, relying more strongly on RST5,8,24 than others.4,22 In order to identify rhetorical relations in text, the most successful methods use the Sentence-level PArsing of DiscoursE, or SPADE, tool,23 which creates an RST tree for each sentence. Another parser is the HIgh-Level Discourse Analyzer, or HILDA,9 which parses discourse structure at the document level by means of a greedy bottom-up tree-building method that uses machine-learning classifiers to iteratively assign RST relation labels to those (compound) segments of a document most likely to be rhetorically related. SPADE and HILDA take as input free text (a priori divided into paragraphs and sentences) and produce LISP-like representations of the text and its rhetorical structure.

Document-level polarity classification has been guided successfully by analyses of the most relevant text segments, as identified by differentiating between top-level nuclei and satellites in sentence-level RST trees.24 Another, more elaborate, method of utilizing RST in sentiment analysis accounts for the different types of relations between nuclei and satellites,8 yielding improvements over methods that only distinguish nuclei from satellites.5,8

This analysis can yield a competitive advantage for businesses, as one-fifth of all tweets and one-third of all blog posts discuss products or brands.

Polarity Classification Guided by Rhetorical Structure
Existing methods do not use RST to its full extent yet typically focus on top-level splits of sentence-level RST trees and thus employ a rather shallow, coarse-grain analysis. However, in RST, rhetorical relations are defined within a hierarchical structure that could be accounted for. Important nuclei may contain less-important satellites that should be treated accordingly. We therefore propose to guide polarity classification through a deep analysis of a text's hierarchical rhetorical structure rather than its isolated rhetorical relations. We account for rhetorical relations within and across sentences by allowing for not only sentence-level but also paragraph-level and document-level analyses of RST trees.

Fine-grain analysis. Figure 1 outlines the potential of RST-guided polarity classification. Based on the sentiment-carrying words alone, our example sentence is best classified as negative, as in Figure 1a. Accounting for the rhetorical roles of text segments as identified by the RST tree's top-level split enables a more elaborate but still coarse-grain analysis of the overall sentiment, as in Figure 1b. The top-level nucleus contains as many positive as negative words and may thus be classified as either positive or negative. The negative words in the top-level satellite trigger a negative classification of this segment, which is a potentially irrelevant seg-


ment that should be assigned a lower weight in the analysis.

However, such a coarse-grain analysis does not capture the nuances of lower-level splits of an RST tree; for instance, the top-level nucleus in our example consists of two segments, one of which is the sentence's actual core ("he enjoyed this movie"), whereas the other is less relevant and should therefore be assigned a lower weight in the analysis. Accounting for the rhetorical roles of the leaf nodes of an RST tree rather than the top-level splits can thus enable a more accurate sentiment analysis, as in Figure 1c.

However, an exclusive focus on leaf nodes of RST trees does not account for the text segments' rhetorical roles being defined within the context of the rhetorical roles of the segments that embed them; for instance, the second leaf node in our example RST tree is a nucleus of a possibly irrelevant background satellite, whereas the fourth leaf node is the nucleus of the nucleus and constitutes the actual core. The full rhetorical structure must be considered in the analysis in order to account for this circumstance, as in Figure 1d.

We propose a lexicon-based sentiment-analysis framework that can perform an analysis of the rhetorical structure of a piece of text at various levels of granularity and that can subsequently use this information for classifying the text's overall polarity. Our framework (see Figure 2) takes several steps in order to classify the polarity of a document.

Figure 2. A schematic overview of our sentiment-analysis framework; solid arrows indicate information flow, and dashed arrows indicate a used-by relationship. (Components: Documents, Paragraph Splitter, Sentence Splitter, Word Tokenizer, POS Tagger, Lemmatizer, Word Sense Disambiguator, Semantic Lexicon, Sentiment Lexicon, Word Scorer, Weighting Schemes, Rhetorical Structure Processor, Document Classifier, Classified Documents.)

Word-level sentiment scoring. Our method first splits a document into paragraphs, sentences, and words. Then, for each sentence, it automatically determines the part-of-speech (POS) and lemma of each word. It then disambiguates the word sense of each word using an unsupervised algorithm that iteratively selects the word sense with the highest semantic similarity to the word's context.8 It then retrieves the sentiment of each word, associated with its particular combination of POS, lemma, and word sense, from a sentiment lexicon like SentiWordNet.1

Rhetorical structure processing. In order to guide the polarity classification of a document d by its rhetorical structure, sentiment scores are computed for each segment s_i. Our framework supports several methods of computing such scores: a baseline method plus several RST-based methods that can be applied to sentence-level, paragraph-level, and document-level RST trees.

Baseline. As a baseline, we consider text segments to be the sentences s_d of document d, with their associated baseline sentiment score \zeta_{s_i}^B being the weighted sum of the score \zeta_{t_j} of each word t_j and its weight w_{t_j}, or

  \zeta_{s_i}^B = \sum_{t_j \in s_i} w_{t_j} \zeta_{t_j}    (1)

Top-level rhetorical structure. Our framework additionally supports the top-level RST-based method applied in existing work, an approach we refer to as T. The sentiment score \zeta_{s_i}^T of a top-level RST segment s_i is defined as the sum of the sentiment \zeta_{t_j} associated with each word t_j in segment s_i, weighted with a weight w_{s_i} associated with the segment's rhetorical role, or

  \zeta_{s_i}^T = w_{s_i} \sum_{t_j \in s_i} \zeta_{t_j}, \quad s_i \in T_d    (2)

with T_d representing all top-level RST nodes in the RST trees for document d.

Leaf-level rhetorical structure. Another method is our leaf-level RST-based analysis L. The sentiment score \zeta_{s_i}^L of an RST segment s_i from the leaf nodes L_d of an RST tree for document d is computed as the summed sentiment score of its words, weighted for the segment's rhetorical role, or

  \zeta_{s_i}^L = w_{s_i} \sum_{t_j \in s_i} \zeta_{t_j}, \quad s_i \in L_d    (3)

Hierarchical rhetorical structure. The last supported approach is our method of accounting for the full path from an RST tree root to a leaf node, such that the sentiment conveyed by the latter can be weighted while accounting for its rhetorical context. In our hierarchy-based sentiment-scoring method H, we model the sentiment score \zeta_{s_i}^H of a leaf-level RST segment s_i as a function of the sentiment scores of its words and the weights w_{r_n} associated with the rhetorical role of each node r_n from the nodes P_{s_i} on the path from the root to the leaf, or

  \zeta_{s_i}^H = \left( \sum_{r_n \in P_{s_i}} \frac{w_{r_n}}{\delta^{level(r_n)}} \right) \sum_{t_j \in s_i} \zeta_{t_j}    (4)

where \delta represents a diminishing factor and level(r_n) signals the level of node r_n in the RST tree, with the level of the root node being 1. For \delta > 1, each subsequent level contributes less than its parent to the segment's RST-based weight, thus preserving the hierarchy of the relations in the path.

Classifying document polarity. The segment-level sentiment scores can be aggregated in order to determine the overall polarity of a document. The baseline, top-level, leaf-level, and hierarchy-based sentiment scores \zeta_d^B, \zeta_d^T, \zeta_d^L, and \zeta_d^H for document d are defined as

  \zeta_d^B = \sum_{s_i \in s_d} \zeta_{s_i}^B    (5)

  \zeta_d^T = \sum_{s_i \in T_d} \zeta_{s_i}^T    (6)

  \zeta_d^L = \sum_{s_i \in L_d} \zeta_{s_i}^L    (7)


  \zeta_d^H = \sum_{s_i \in L_d} \zeta_{s_i}^H    (8)

The resulting document-level sentiment scores can be used to classify document d's polarity c_d as negative (–1) or positive (1), following

  c_d = \begin{cases} 1, & \zeta_d \geq \varepsilon \\ -1, & \text{otherwise} \end{cases}    (9)

Here, \varepsilon represents an offset that corrects a possible bias in the sentiment scores caused by people's tendency to write negative reviews with rather positive words,24 or

  \varepsilon = \frac{1}{2} \left( \frac{1}{|\Phi|} \sum_{d \in \Phi} \zeta_d + \frac{1}{|N|} \sum_{d \in N} \zeta_d \right)    (10)

with \Phi and N denoting the respective subsets of positive and negative documents in a training set.

Weighting schemes. We consider six different weighting schemes. Two serve as baselines and are applicable to the baseline sentiment scoring approach as defined in Equations (1) and (5). The others apply to our three RST-based approaches as defined in (2) and (6), in (3) and (7), and in (4) and (8), respectively.

Our Baseline scheme serves as an absolute baseline and assigns each word a weight of 1; structural aspects are not accounted for. A second baseline is a position-based scheme. In this Position scheme, word weights are uniformly distributed, ranging from 0 for the first word to 1 for the last word of a text, as an author's views are more likely to be summarized near the end of a text.19

The first RST-based scheme (I) assigns a weight of 1 to nuclei and a weight of 0 to satellites;24 the second RST-based scheme (II) matches the second set of weights for nuclei and satellites used by Taboada et al.,24 or 1.5 and 0.5, respectively. In both schemes I and II, we set the diminishing factor \delta for the H method to 2, such that each level in a tree is at least as important as all its subsequent levels combined, thus enforcing a strict hierarchy.

Another RST-specific scheme is the extended scheme X in which we differentiate satellite weights by their RST relation type.8 Additionally, we propose a novel extension of X: the full weighting scheme F, in which we not only differentiate satellite weights but also nucleus weights by their RST relation type.

Table 1. Characteristics of our considered polarity-classification approaches, or our baselines, our SPADE-based sentence-level RST-guided methods, and our HILDA-based sentence-level, paragraph-level, and document-level RST-guided methods.

Method       Unit      RST           Weights
Baseline     Document  None          Baseline
Position     Document  None          Position
SPADE.S.T.I  Sentence  Top-level     I
SPADE.S.T.II Sentence  Top-level     II
SPADE.S.T.X  Sentence  Top-level     X
SPADE.S.T.F  Sentence  Top-level     F
SPADE.S.L.I  Sentence  Leaf-level    I
SPADE.S.L.II Sentence  Leaf-level    II
SPADE.S.L.X  Sentence  Leaf-level    X
SPADE.S.L.F  Sentence  Leaf-level    F
SPADE.S.H.I  Sentence  Hierarchical  I
SPADE.S.H.II Sentence  Hierarchical  II
SPADE.S.H.X  Sentence  Hierarchical  X
SPADE.S.H.F  Sentence  Hierarchical  F
HILDA.S.T.I  Sentence  Top-level     I
HILDA.S.T.II Sentence  Top-level     II
HILDA.S.T.X  Sentence  Top-level     X
HILDA.S.T.F  Sentence  Top-level     F
HILDA.S.L.I  Sentence  Leaf-level    I
HILDA.S.L.II Sentence  Leaf-level    II
HILDA.S.L.X  Sentence  Leaf-level    X
HILDA.S.L.F  Sentence  Leaf-level    F
HILDA.S.H.I  Sentence  Hierarchical  I
HILDA.S.H.II Sentence  Hierarchical  II
HILDA.S.H.X  Sentence  Hierarchical  X
HILDA.S.H.F  Sentence  Hierarchical  F
HILDA.P.T.I  Paragraph Top-level     I
HILDA.P.T.II Paragraph Top-level     II
HILDA.P.T.X  Paragraph Top-level     X
HILDA.P.T.F  Paragraph Top-level     F
HILDA.P.L.I  Paragraph Leaf-level    I
HILDA.P.L.II Paragraph Leaf-level    II
HILDA.P.L.X  Paragraph Leaf-level    X
HILDA.P.L.F  Paragraph Leaf-level    F
HILDA.P.H.I  Paragraph Hierarchical  I
HILDA.P.H.II Paragraph Hierarchical  II
HILDA.P.H.X  Paragraph Hierarchical  X
HILDA.P.H.F  Paragraph Hierarchical  F
HILDA.D.T.I  Document  Top-level     I
HILDA.D.T.II Document  Top-level     II
HILDA.D.T.X  Document  Top-level     X
HILDA.D.T.F  Document  Top-level     F
HILDA.D.L.I  Document  Leaf-level    I
HILDA.D.L.II Document  Leaf-level    II
HILDA.D.L.X  Document  Leaf-level    X
HILDA.D.L.F  Document  Leaf-level    F
HILDA.D.H.I  Document  Hierarchical  I
HILDA.D.H.II Document  Hierarchical  II
HILDA.D.H.X  Document  Hierarchical  X
HILDA.D.H.F  Document  Hierarchical  F
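The hierarchy-based scoring of Equation (4) combined with a role-based weighting scheme can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' Java framework: the tree, the nucleus/satellite weights of scheme I (nucleus = 1, satellite = 0), the diminishing factor δ = 2, and the assumption that every node on the root-to-leaf path, including the root at level 1, contributes weight(role)/δ^level are all taken from the prose above; the helper names are our own.

```python
# Sketch (assumed re-implementation, not the authors' framework) of the
# hierarchy-based path weighting on the Figure 1 example sentence.

DELTA = 2.0                                     # diminishing factor delta
WEIGHTS = {"nucleus": 1.0, "satellite": 0.0}    # weighting scheme I

def leaf_path_weights(node, path=(), delta=DELTA):
    """Yield (leaf_text, path_weight) pairs: each node at level n on the
    root-to-leaf path (root = level 1) contributes weight(role)/delta**n."""
    path = path + (node["role"],)
    if "children" in node:
        for child in node["children"]:
            yield from leaf_path_weights(child, path, delta)
    else:
        weight = sum(WEIGHTS[role] / delta ** level
                     for level, role in enumerate(path, start=1))
        yield node["text"], weight

# Figure 1's sentence as a nucleus/satellite tree.
tree = {"role": "nucleus", "children": [
    {"role": "satellite", "children": [          # BACKGROUND satellite
        {"role": "satellite",                    # ATTRIBUTION
         "text": "While always complaining that"},
        {"role": "nucleus",
         "text": "he hates this type of movies,"},
    ]},
    {"role": "nucleus", "children": [            # core of the sentence
        {"role": "satellite",                    # ATTRIBUTION
         "text": "John bitterly confessed that"},
        {"role": "nucleus",
         "text": "he enjoyed this movie."},
    ]},
]}

weights = dict(leaf_path_weights(tree))
```

Under scheme I the actual core ("he enjoyed this movie.", an all-nucleus path) receives the largest weight, 1/2 + 1/4 + 1/8 = 0.875, while the leaves embedded in the background satellite are discounted, which is exactly the behavior Figure 1d illustrates.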


For both X and F, the weights and the diminishing factor can be optimized.

Table 2. Performance measures of our considered baselines, our SPADE-based sentence-level RST-guided methods, and our HILDA-based sentence-level, paragraph-level, and document-level RST-guided methods, based on tenfold cross-validation on the movie-review dataset. The best performance in each group of methods is in bold for each performance measure.

             Positive                  Negative                  Overall
Method       Precision Recall F1      Precision Recall F1      Accuracy F1
Baseline     0.632 0.689 0.659        0.658 0.599 0.627        0.644 0.643
Position     0.637 0.713 0.673        0.674 0.593 0.631        0.653 0.652
SPADE.S.T.I  0.638 0.675 0.656        0.655 0.617 0.635        0.646 0.646
SPADE.S.T.II 0.640 0.688 0.663        0.663 0.613 0.637        0.651 0.650
SPADE.S.T.X  0.693 0.725 0.709        0.712 0.679 0.695        0.702 0.702
SPADE.S.T.F  0.703 0.726 0.715        0.717 0.694 0.705        0.710 0.710
SPADE.S.L.I  0.636 0.702 0.667        0.667 0.598 0.631        0.650 0.649
SPADE.S.L.II 0.640 0.700 0.669        0.669 0.607 0.637        0.654 0.653
SPADE.S.L.X  0.699 0.715 0.707        0.708 0.692 0.700        0.704 0.703
SPADE.S.L.F  0.705 0.731 0.718        0.721 0.694 0.707        0.713 0.712
SPADE.S.H.I  0.647 0.678 0.662        0.662 0.630 0.645        0.654 0.654
SPADE.S.H.II 0.642 0.696 0.668        0.668 0.612 0.639        0.654 0.653
SPADE.S.H.X  0.707 0.723 0.715        0.716 0.700 0.708        0.712 0.711
SPADE.S.H.F  0.710 0.738 0.724        0.727 0.699 0.713        0.719 0.718
HILDA.S.T.I  0.633 0.676 0.654        0.652 0.608 0.629        0.642 0.642
HILDA.S.T.II 0.636 0.686 0.660        0.659 0.607 0.632        0.647 0.646
HILDA.S.T.X  0.692 0.709 0.701        0.702 0.685 0.693        0.697 0.697
HILDA.S.T.F  0.697 0.745 0.720        0.726 0.676 0.700        0.711 0.710
HILDA.S.L.I  0.629 0.685 0.656        0.654 0.596 0.624        0.641 0.640
HILDA.S.L.II 0.636 0.685 0.660        0.659 0.608 0.632        0.647 0.646
HILDA.S.L.X  0.698 0.711 0.705        0.706 0.693 0.699        0.702 0.702
HILDA.S.L.F  0.705 0.732 0.718        0.721 0.693 0.707        0.713 0.712
HILDA.S.H.I  0.634 0.675 0.654        0.653 0.611 0.631        0.643 0.643
HILDA.S.H.II 0.638 0.688 0.662        0.661 0.609 0.634        0.649 0.648
HILDA.S.H.X  0.699 0.693 0.696        0.695 0.701 0.698        0.697 0.697
HILDA.S.H.F  0.699 0.740 0.719        0.724 0.682 0.702        0.711 0.711
HILDA.P.T.I  0.618 0.638 0.628        0.626 0.605 0.615        0.622 0.621
HILDA.P.T.II 0.628 0.674 0.650        0.648 0.600 0.623        0.637 0.637
HILDA.P.T.X  0.681 0.697 0.689        0.690 0.674 0.682        0.686 0.685
HILDA.P.T.F  0.703 0.702 0.702        0.702 0.703 0.703        0.703 0.702
HILDA.P.L.I  0.632 0.684 0.657        0.656 0.602 0.628        0.643 0.642
HILDA.P.L.II 0.633 0.685 0.658        0.657 0.603 0.629        0.644 0.643
HILDA.P.L.X  0.690 0.705 0.697        0.698 0.683 0.691        0.694 0.694
HILDA.P.L.F  0.701 0.720 0.710        0.712 0.693 0.702        0.707 0.706
HILDA.P.H.I  0.583 0.609 0.596        0.591 0.565 0.578        0.587 0.587
HILDA.P.H.II 0.629 0.682 0.655        0.653 0.598 0.624        0.640 0.639
HILDA.P.H.X  0.706 0.683 0.694        0.693 0.716 0.704        0.700 0.699
HILDA.P.H.F  0.713 0.692 0.703        0.701 0.722 0.711        0.707 0.707
HILDA.D.T.I  0.627 0.616 0.621        0.622 0.633 0.628        0.625 0.624
HILDA.D.T.II 0.627 0.650 0.639        0.637 0.614 0.625        0.632 0.632
HILDA.D.T.X  0.682 0.689 0.685        0.686 0.678 0.682        0.684 0.683
HILDA.D.T.F  0.684 0.696 0.690        0.691 0.679 0.685        0.688 0.687
HILDA.D.L.I  0.627 0.679 0.652        0.650 0.596 0.622        0.638 0.637
HILDA.D.L.II 0.631 0.687 0.658        0.656 0.598 0.626        0.643 0.642
HILDA.D.L.X  0.689 0.719 0.704        0.706 0.675 0.690        0.697 0.697
HILDA.D.L.F  0.701 0.727 0.714        0.717 0.690 0.703        0.709 0.708
HILDA.D.H.I  0.580 0.516 0.546        0.564 0.627 0.594        0.572 0.570
HILDA.D.H.II 0.630 0.663 0.646        0.645 0.611 0.627        0.637 0.637
HILDA.D.H.X  0.706 0.696 0.701        0.700 0.710 0.705        0.703 0.703
HILDA.D.H.F  0.707 0.708 0.707        0.707 0.706 0.707        0.707 0.707

Evaluation
By means of a set of experiments, we evaluate the variants of our polarity-classification approach guided by structural aspects of text. We focus on a widely used collection of 1,000 positive and 1,000 negative English movie reviews.17

Experimental setup. In the Java implementation of our framework, we detect paragraphs using the <P> and </P> tags in the HTML files. For sentence splitting, we rely on the preprocessed reviews.17 The framework uses the Stanford Tokenizer14 for word segmentation. For POS tagging and lemmatization, our framework uses the OpenNLP3 POS tagger and the JWNL API,25 respectively. We link word senses to WordNet,7 thus enabling retrieval of their sentiment scores from SentiWordNet1 by subtracting the associated negativity scores from the positivity scores. This retrieval method yields real numbers ranging from –1 (negative) to 1 (positive). In the final aggregation of word scores, our framework assigns a weight to each word by means of one of our methods (see Table 1), most of which are RST-based.

We evaluate our methods on the accuracy, precision, recall, and F1-score on positive and negative documents, as well as on the overall accuracy and macro-level F1-score. We assess the statistical significance of performance differences through paired, one-sided t-tests, comparing methods against one another in terms of their mean performance measures over all 10 folds, under the null hypothesis that the mean performance of a method is less than or equal to the mean performance of another method.

In order to evaluate our methods, we apply tenfold cross-validation. For each fold, we optimize the offsets, weights, and diminishing factor for weighting schemes X and F using particle-swarm optimization,5 with particles searching a solution space, where their coordinates correspond with the weights and offsets (between –2 and 2) and the diminishing factor \delta (between 1 and 2). The fitness of a particle is its macro-level F1-score on a training set.
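The evaluation measures and the significance test just described can be sketched as follows. The per-class F1-score is the harmonic mean of precision and recall, and the macro-level F1 averages the positive and negative F1; the example numbers are the Baseline row of Table 2. The paired t statistic is the standard one computed on per-fold differences; the ten fold scores and the comparison against 1.833 (the one-sided 5% critical value at 9 degrees of freedom) are our own illustration, not figures reported by the authors.

```python
# Sketch of the evaluation measures and the paired, one-sided t-test
# described above. Fold scores are invented for illustration.
import math

def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def macro_f1(f1_pos, f1_neg):
    # Macro-level F1: unweighted mean over the two classes.
    return (f1_pos + f1_neg) / 2

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the paired test of H0: mean(a) <= mean(b),
    computed on the per-fold differences (df = n - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Baseline row of Table 2: positive P/R = 0.632/0.689, negative P/R = 0.658/0.599.
baseline_macro = macro_f1(f1(0.632, 0.689), f1(0.658, 0.599))

# Hypothetical macro-F1 scores of two methods over the 10 folds.
a = [0.72, 0.71, 0.73, 0.70, 0.72, 0.71, 0.74, 0.72, 0.71, 0.73]
b = [0.70, 0.70, 0.71, 0.69, 0.70, 0.70, 0.72, 0.71, 0.70, 0.71]
t = paired_t_statistic(a, b)  # reject H0 at the 5% level if t > 1.833 (df = 9)
```

Rounding `baseline_macro` to three decimals reproduces the 0.643 macro-level F1 reported for the Baseline method in Table 2.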


Experimental results. Our experimental analysis consists of four steps. First, we compare the performance of our considered methods. We then analyze the optimized weights and diminishing factors. Subsequently, we demonstrate how documents are typically perceived by distinct methods. Last, we discuss several caveats for our findings.

Performance. Table 2 presents our methods' performance, whereas Figure 3 shows the p-values for the macro-level F1-scores for all combinations of methods, ordered from top to bottom and from left to right in accordance with an initial ranking from worst- to best-performing methods. We sort the methods based on their weighting scheme (Baseline, Position, I, II, X, and F), analysis level (HILDA.D, HILDA.P, HILDA.S, and SPADE.S), and RST analysis method (T, L, and H), respectively. Darker colors indicate lower p-values, with darker rows and columns signaling weak and competitive approaches, respectively. Given a correct ordering, darker colors should lie to the right of the diagonal, toward the upper-right corner of Figure 3.

Figure 3. The p-values for the paired, one-sided t-test assessing the null hypothesis of the mean macro-level F1-scores of the methods in the columns being lower than or equal to the mean macro-level F1-scores of the methods in the rows.

Three trends can be observed in Table 2 and Figure 3. First, weighting schemes X and F significantly outperform the I, II, Baseline, and Position schemes, with F significantly outperforming X in most cases. Conversely, schemes I, II, Baseline, and Position do not exhibit clearly significant performance differences with respect to one another.

A second trend is that methods guided by document-level RST trees are typically outperformed by comparable methods utilizing paragraph-level RST trees. Sentence-level RST trees yield the best performance. An explanation lies in misclassifications of rhetorical relations being potentially more harmful in larger RST trees; a misclassified relation in one of the top levels of a document-level RST tree can cause a misinterpretation of a large part of the document.

The third trend is that methods applying the hierarchy-based RST analysis method H typically slightly outperform comparable approaches that use the leaf-based analysis L instead, which in turn significantly outperform comparable approaches using the top-level RST analysis method T. The deeper analyses L and H clearly yield a significant advantage over the rather shallow analysis method T.

Some methods stand out in particular. First, HILDA.D.T and HILDA.P.T are relatively weak, especially when using weighting schemes I and II. The top-level RST analysis method T is typically too coarse-grain for larger RST trees, as, for instance, documents are segmented in only two parts when using document-level RST trees. Other weak methods are HILDA.D.H.I and HILDA.P.H.I. The combination of naïve weighting scheme I with deep, hierarchy-based analysis of the rhetorical structure of a document or its paragraphs results in a narrow focus on very specific segments.

Approaches that stand out positively are those applying hierarchy-based RST analysis H to sentence-level RST trees, with weighting schemes X and F, or HILDA.S.H.X, HILDA.S.H.F, SPADE.S.H.X, and SPADE.S.H.F. These approaches perform comparably well because they involve detailed analysis of rhetorical structure; the analysis is performed on the smallest considered units (sentences), the hierarchy of RST trees is accounted for, and the weights are differentiated per type of rhetorical relation. These results confirm our hypothesis that the sentiment conveyed by a text can be captured more adequately by incorporating a deep analysis of the text's rhetorical structure.

Optimized weights. The optimized weights for distinct types of nuclei and satellites, as defined in RST,13 exhibit several patterns. First, nucleus elements are generally rather important in weighting scheme X, with most weights ranging between 0.5 and 1. The sentiment expressed in elaborating satellites is typically assigned a similar importance in weighting scheme X. Furthermore, contrasting satellites mostly receive weights around or below 0. Background satellites are typically assigned relatively low weights as well.

In weighting scheme F, nuclei and satellites in an attributing relation are typically both assigned weights around 0.5. Conversely, for background and contrasting relations, satellites are


more clearly distinct from nuclei. Background satellites are typically assigned less importance than their associated nuclei, with respective weights of 0 and 1. For contrasting relations, nuclei are mostly assigned weights between 0.5 and 1, whereas optimized satellite weights are typically negative.

The optimized values for the diminishing factor are typically around 1.5 and 1.25 for weighting schemes X and F, respectively. These values result in the first 15 (for X) or 30 (for F) levels of RST trees being accounted for. Interestingly, with some (document-level) RST trees being more than 100 levels deep in our corpus, the optimized diminishing factors effectively disregard the lower, most fine-grain parts of RST trees, thus realizing a balance between the level of detail and the potential noise in the analysis.

Processing a document. The observed differences in performance of our polarity-classification methods originate in how these methods perceive a document in the sentiment-analysis process. Figure 4 visualizes how the interpretations of a movie review differ across various methods.

Figure 4. Movie review cv817_3675, as processed by various sentiment-analysis methods. Negative words are in red in italics, and positive words are in green and underlined in italics. Sentiment-carrying words with high intensity are brighter, and words carrying less sentiment are darker. Text segments assigned a relatively low weight in the analysis of the conveyed sentiment are more transparent than text segments with higher weights. (a) BASELINE. (b) SPADE.S.T.X. (c) SPADE.S.H.F.

Our software assigns many words in our example positive sentiment, whereas fewer words are identified as carrying negative sentiment. When assigning each part of the review an equal weight (see Figure 4a), this relative abundance of positive words suggests the review is rather positive, whereas it is in fact negative.

Our best-performing RST-based baseline, SPADE.S.T.X, considers only the top-level splits of sentence-level RST trees and yields a rather coarse-grain segmentation of the sentences, leaving little room for more subtle nuance (see Figure 4b). Nevertheless, SPADE.S.T.X successfully highlights the conclusion and the arguments supporting this conclusion. The polarity of some positive words is even inverted, as the reviewer uses them in a rather negative way.

SPADE.S.H.F, our best-performing approach, yields a very detailed analysis in which subtle distinctions are made between small text segments (see Figure 4c). Such nuance helps bring out the parts of the review that are most relevant with respect to its overall sentiment. SPADE.S.H.F ignores most of the irrelevant background information in the first paragraph and highlights the reviewer's main concerns in the second and third paragraphs. Moreover, sentiment related to the movie's good aspects is often inverted and mostly ignored by SPADE.S.H.F. The overall recommendation is emphasized in the last paragraph. All in all, it is the incorporation of the detailed analysis of the review's rhetorical structure into the sentiment-analysis process that makes it possible for our polarity classifier to better understand the review.

Caveats. Our failure analysis has revealed that some misclassifications of authors' sentiment are due to misinterpreted sarcasm and an occasional misinterpretation of proverbs. Additionally, not all sentiment-carrying words are identified as such by SentiWordNet, as their word sense cannot be successfully disambiguated or the word cannot be found in our sentiment lexicon. Other challenges are related to the information content of documents; for instance, in our corpus, some reviewers tend to mix their opinionated statements with plot details containing irrelevant sentiment-carrying words. Additionally, some reviewers evaluate a movie by comparing it with other movies. Such statements require a distinction between entities and their associated sentiment, as well as real-world knowledge to be incorporated into the analysis.

Even though our RST-based polarity-classification methods cannot cope particularly well with the specific phe-

nomena mentioned earlier, they significantly outperform our non-RST baselines. However, these improvements come at a cost of processing times increasing by almost a factor of 10. The bottleneck here is formed by the RST parsers, rather than by the application of our weighting schemes.

Conclusion
We have demonstrated that sentiment analysis can benefit from deep analysis of a text's rhetorical structure, enabling the distinction to be made between important text segments and less-important ones in terms of their

References
1. Baccianella, S., Esuli, A., and Sebastiani, F. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (Valletta, Malta, May 19–21). European Language Resources Association, Paris, France, 2010, 2200–2204.
2. Bal, D., Bal, M., van Bunningen, A., Hogenboom, A., Hogenboom, F., and Frasincar, F. Sentiment analysis with a multilingual pipeline. In Proceedings of the 12th International Conference on Web Information System Engineering (Sydney, Australia, Oct. 12–14). Volume 6997 of Lecture Notes in Computer Science. Springer, Berlin, Germany, 2011, 129–142.
3. Baldridge, J. and Morton, T. OpenNLP, 2004; http://opennlp.sourceforge.net/
4. Chardon, B., Benamara, F., Mathieu, Y., Popescu, V., and Asher, N. Measuring the effect of discourse structure on sentiment analysis. In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (University of the Aegean, Samos, Greece, Mar. 24–30). Volume 7817 of Lecture Notes in Computer Science. Springer,
19. … Linguistics, Stroudsburg, PA, 2002, 79–86.
20. Sayeed, A., Boyd-Graber, J., Rusk, B., and Weinberg, A. Grammatical structures for word-level sentiment detection. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Montreal, Canada, July 3–8). Association for Computational Linguistics, Stroudsburg, PA, 2012, 667–676.
21. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (Seattle, WA, Oct. 18–21). Association for Computational Linguistics, Stroudsburg, PA, 2013, 1631–1642.
22. Somasundaran, S., Namata, G., Wiebe, J., and Getoor, L. Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (Singapore, Aug. 6–9). Association for Computational Linguistics, Stroudsburg, PA, 2009, 170–179.
23. Soricut, R. and Marcu, D. Sentence-level discourse parsing using syntactic and lexical information. In
7817 of Lecture Notes in Computer Science. Springer, parsing using syntactic and lexical information. In
contribution to a text’s overall senti- Berlin, Germany, 2013, 25–37. Proceedings of the Human Language Technology
ment. This is a significant step forward 5. Chenlo, J., Hogenboom, A., and Losada, D. Rhetorical and North American Association for Computational
structure theory for polarity estimation: An Linguistics Conference (Edmonton, Canada, May
with respect to existing work, which is experimental study. Data and Knowledge Engineering 27–June 1). Association for Computational Linguistics,
limited to guiding sentiment analysis 94, B (Nov. 2014), 135–147. Stroudsburg, PA, 2003, 149–156.
6. Feldman, R. Techniques and applications for sentiment 24. Taboada, M., Voll, K., and Brooke, J. Extracting
by shallow analyses of rhetorical rela- analysis. Commun. ACM 56, 4 (Apr. 2013), 82–89. Sentiment as a Function of Discourse Structure
tions in (mostly sentence-level) rhetori- 7. Fellbaum, C. WordNet: An Electronic Lexical Database. and Topicality Technical Report 20. Simon Fraser
MIT Press, Cambridge, MA, 1998. University, Burnaby, B.C. Canada, 2008; http://www.
cal structure trees. 8. Heerschop, B., Goossen, F., Hogenboom, A., Frasincar, cs.sfu.ca/research/publications/techreports/#2008
Our contribution is threefold. F., Kaymak, U., and de Jong, F. Polarity analysis of 25. Walenz, B. and Didion, J. OpenNLP, 2008; http://
texts using discourse structure. In Proceedings of the jwordnet.sourceforge.net/
First, our novel polarity-classification 20th ACM Conference on Information and Knowledge
Management (Glasgow, U.K., Oct. 24–28). ACM Press,
methods guided by deep leaf-level or New York, 2011, 1061–1070. Alexander Hogenboom (hogenboom@ese.eur.nl) is
hierarchy-based RST analyses signifi- 9. Hernault, H., Prendinger, H., duVerle, D., and Ishizuka, a Ph.D. student at Erasmus University Rotterdam, the
M. HILDA: A discourse parser using support vector Netherlands, where he is affiliated with the Econometric
cantly outperform existing approach- machine classification. Dialogue and Discourse 1, 3 Institute, the Erasmus Centre for Business Intelligence at
es, which are guided by shallow RST (Dec. 2010), 1–33. the Erasmus Research Institute of Management, and the
10. Jansen, B., Zhang, M., Sobel, K., and Chowdury, A. Erasmus Studio.
analyses or by no RST-based analyses Twitter power: Tweets as electronic word of mouth. Flavius Frasincar (frasincar@ese.eur.nl) is an assistant
at all. Second, the novel RST-based Journal of the American Society for Information professor of information systems at the Econometric
Science and Technology 60, 11 (Nov. 2009), 2169–2188. Institute and founding member of the Erasmus Centre for
weighting scheme in which we dif- 11. Kim, S. and Hovy, E. Determining the sentiment of Business Intelligence at the Erasmus Research Institute
ferentiate the weights of nuclei and opinions. In Proceedings of the 20th International of Management of Erasmus University Rotterdam, the
Conference on Computational Linguistics (University Netherlands.
satellites by their RST relation type of Geneva, Switzerland, Aug. 23–27). Association for
significantly outperforms existing Computational Linguistics, Stroudsburg, PA, 2004, Franciska de Jong (f.m.g.dejong@utwente.nl) is a full
1367–1373. professor of language technology at the University of
schemes. And third, we have com- 12. Kim, S. and Hovy, E. Automatic identification of pro Twente, the Netherlands, and a professor of e-research
for the social sciences and the humanities at Erasmus
pared the performance of polarity- and con reasons in online reviews. In Proceedings of
University Rotterdam, the Netherlands, where she is also
the 21st International Conference on Computational
classification approaches guided by Linguistics and 44th Annual Meeting of the Association director of the Erasmus Studio.
sentence-level, paragraph-level, and for Computational Linguistics (Sydney, Australia, July Uzay Kaymak (u.kaymak@ieee.org) is a full professor
17–21). Association for Computational Linguistics, of information systems in health care in the Department
document-level RST trees, thus dem- Stroudsburg, PA, 2006, 483–490. of Industrial Engineering & Innovation Sciences at the
onstrating that RST-based polarity 13. Mann, W. and Thompson, S. Rhetorical structure Eindhoven University of Technology, the Netherlands.
theory: Toward a functional theory of text
classification works best when focus- organization. Text 8, 3 (Jan. 1988), 243–281.
ing on RST trees of smaller units of a 14. Manning, C., Grow, T., Grenager, T., Finkel, J., and
Bauer, J. Stanford Tokenizer, 2010; http://nlp.stanford.
text (such as sentences). edu/software/tokenizer.shtml
In future work, we aim to investi- 15. Marcu, D. The rhetorical parsing of unrestricted texts:
A surface-based approach. Computational Linguistics
gate the applicability of other, possibly 26, 3 (Sept. 2000), 395–448.
more scalable methods for exploiting 16. Melville, P., Sindhwani, V., and Lawrence, R. Social
media analytics: Channeling the power of the
(discourse) structure of text in senti- blogosphere for marketing insight. In Proceedings of
ment analysis. We also plan to validate the First Workshop on Information in Networks (New
York, Sept. 25–26, 2009); http://www.prem-melville.
our findings on other corpora covering com/publications/
other domains and other types of text. 17. Pang, B. and Lee, L. A sentimental education:
Sentiment analysis using subjectivity summarization
based on minimum cuts. In Proceedings of
Acknowledgments the 42nd Annual Meeting of the Association for
Computational Linguistics (Barcelona, Spain, July
Special thanks go to Bas Heerschop 21–26). Association for Computational Linguistics,
Stroudsburg, PA, 2004, 271–280.
and Frank Goossen for their con- 18. Pang, B. and Lee, L. Opinion mining and sentiment
tributions in the early stages of this analysis. Foundations and Trends in Information
Retrieval 2, 1 (Jan. 2008), 1–135.
work. We are partially supported by 19. Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up?
the Dutch national research program Sentiment classification using machine learning
techniques. In Proceedings of the 2002 Conference
COMMIT (http://www.commit-nl.nl/ on Empirical Methods in Natural Language Processing Copyright held by authors.
about-commit). (Honolulu, HI, Oct. 25–27). Association for Computational Publication rights licensed to ACM. $15.00
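The nucleus/satellite weighting scheme summarized in the conclusion above can be illustrated with a minimal sketch. The function, weights, and segment scores below are hypothetical stand-ins, not the tuned values or actual implementation from the study:

```python
# Sketch: document polarity as a weighted sum of RST leaf segments.
# Nuclei (core segments) receive a higher weight than satellites
# (supporting segments); all weights here are illustrative only.

def document_polarity(segments, w_nucleus=1.0, w_satellite=0.5):
    """segments: list of (sentiment_score, role), role is 'nucleus' or 'satellite'."""
    total = 0.0
    for score, role in segments:
        weight = w_nucleus if role == "nucleus" else w_satellite
        total += weight * score
    return "positive" if total >= 0 else "negative"

# A review whose core claim is negative but whose asides are positive:
review = [(-0.8, "nucleus"), (0.6, "satellite"), (0.4, "satellite")]
```

With uniform weights (w_satellite=1.0) this toy review comes out positive; emphasizing nuclei flips it to negative, mirroring the observation that discourse structure can change the classification outcome.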

contributed articles
DOI:10.1145/2699390

Theory on passwords has lagged practice, where large providers use back-end smarts to survive with imperfect technology.

BY JOSEPH BONNEAU, CORMAC HERLEY, PAUL C. VAN OORSCHOT, AND FRANK STAJANO

Passwords and the Evolution of Imperfect Authentication

PASSWORDS HAVE DOMINATED human-computer authentication for 50 years despite consensus among researchers that we need something more secure and deserve something more user friendly. Much published research has focused on specific aspects of the problem that can be easily formalized but do not actually have a major influence on real-world design goals, which are never authentication per se, but rather protection of user accounts and sensitive data. As an example of this disconnect, academic research often recommends strict password-composition policies (such as length requirements and mandating digits and nonalphabetic characters) despite the lack of evidence they actually reduce harm.

We argue that critically revisiting authentication as a whole and passwords' role therein is required to understand today's situation and provide a meaningful look ahead. Passwords were originally deployed in the 1960s for access to time-shared mainframe computers, an environment unrecognizable by today's Web users. Many practices have survived with few changes even if no longer appropriate.9,19 While partly attributable to inertia, this also represents a failure of the academic literature to provide approaches that are convincingly better than current practices.

We identify as outdated two models that still underlie much of the current password literature. First is the model of a random user who draws passwords uniformly and independently from some set of possible passwords. It has resulted in overestimates of security against guessing and encouraged ineffectual policies aimed at strengthening users' password choices. The second is that of an offline attack against the password file. This model has inflated the importance of unthrottled password guessing relative to other threats (such as client malware, phishing, channel eavesdropping, and plaintext leaks from the back-end) that are more difficult to analyze but significantly more important in practice. Together, these models have inspired an awkward jumble of contradictory advice that is impossible for humans to follow.1,19,30,36

The focus of published research on clean, well-defined problems has caused the neglect of the messy complications of real-world Web authentication. This misplaced focus continues to hinder the applicability of password research to practice. Failure to recognize

key insights
■ Simplistic models of user and attacker behaviors have led the research community to emphasize the wrong threats.
■ Authentication is a classification problem amenable to machine learning, with many signals in addition to the password available to large Web services.
■ Passwords will continue as a useful signal for the foreseeable future, where the goal is not impregnable security but reducing harm at acceptable cost.

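The framing in the key insights above, of authentication as a classification problem over many signals, can be sketched as a toy risk score in which the password check is only one input. All signal names, weights, and thresholds here are invented for illustration, not any provider's actual model:

```python
# Toy risk classifier for a login attempt: the password check is one
# signal among several. Real providers learn weights from data; the
# values below are made up for illustration.

def login_risk(password_correct, known_device, usual_country, failed_attempts):
    risk = 0.0
    if not password_correct:
        risk += 0.6
    if not known_device:
        risk += 0.2
    if not usual_country:
        risk += 0.2
    risk += min(failed_attempts, 10) * 0.05  # cap the contribution of retries
    return risk

def decide(risk, allow_below=0.3, challenge_below=0.7):
    # Low risk: allow; medium risk: step up (e.g., SMS code); high risk: deny.
    if risk < allow_below:
        return "allow"
    if risk < challenge_below:
        return "challenge"
    return "deny"
```

Note that a correct password from an unfamiliar device and country still triggers a step-up challenge, while a correct password alone no longer guarantees access.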

the broad range of usability, deployability, and security challenges in Web authentication has produced a long list of mutually incompatible password requirements for users and countless attempts by researchers to find a magic-bullet solution despite drastically different requirements in different applications. No single technology is likely to "solve" authentication perfectly for all cases; a synergistic combination is required. Industry has already moved in this direction. Many leading providers have bolstered, not replaced, passwords with multiple parallel complementary authentication mechanisms. These are combined, often using machine learning, so as to minimize cost and annoyance while providing enough security for e-commerce and online social interaction to prosper. We expect authentication to gradually become more secure and less obtrusive, even if perhaps technically inelegant under the hood.

This trend is not without downsides. It strongly favors the largest providers with extensive knowledge of their users' habits. It makes authentication more privacy invasive and increasingly difficult to comprehend for users and researchers alike. We encourage researchers to acknowledge this trend and focus on addressing related security, privacy, and usability challenges.

[Image by Iwona Usakiewicz/Andrij Borys Associates, based on idea by Joseph Bonneau and Frank Stajano]

Lessons from the Past
From the beginning, passwords have been a security band-aid. During development of the first time-sharing operating systems in the 1960s, passwords were added to protect against practical jokes and researchers using more resources than authorized. The 1961 Compatible Time-Sharing System at MIT was likely the first to deploy passwords in this manner. Security issues arose immediately. Multiple cases were reported of users guessing one another's passwords and also at least one leak of the master password file that was stored in unencrypted form. Yet these issues were easily addressed administratively, as all users were part of the same academic organization.

With development of access control in MULTICS and Unix in the 1970s, passwords were adapted to protect sensitive data and computational resources. MULTICS protected passwords by storing them in hashed form, a practice invented by Roger Needham and Mike Guy at the University of Cambridge in the 1960s. Robert Morris's and Ken Thompson's seminal 1979 treatment of password security25 described the evolution toward dedicated password hashing and salting via the crypt() function, along with the first analysis of dictionary attacks and brute-force guessing.

A decade later, the 1988 Morris Internet worm demonstrated the vulnerability of many systems to password guessing. Administrators adapted by storing password hashes in more heavily protected shadow password files24 and, sometimes, proactively checking user passwords for guessability.35

With the mid-1990s advent of the World Wide Web and e-commerce, early attempts were made to replace passwords with public-key cryptography via secure sockets layer (SSL) client certificates or the competing secure electronic transaction (SET) protocol. Ultimately, managing certificates and private keys client-side proved too burdensome and a market never developed. Instead, secure connections on the Web almost universally rely on one-way authenticated SSL. Servers are authenticated by a certificate at the SSL layer, while users are left to prove their identity later with no explicit protocol support. Text-based passwords entered in HTML forms in exchange for HTTP cookies have become the dominant, albeit never formally specified, protocol for user authentication.

As Web-based services proliferated, usability problems arose that had not existed for system passwords. Resetting forgotten passwords, previously a manual task for IT support staff, was automated through email, creating a common central-point-of-failure for most users. The increased number of accounts held by individual users scuttled the assumption of dedicated passwords per account, and password reuse became commonplace. Phishing grew into a major concern, but anti-phishing proposals requiring protocol or user-interface changes failed to gain adoption. Instead, the primary countermeasure involves blacklists of known phishing sites and machine-learning classifiers to recognize new phishing sites as they arise.16

Attempts to make a dedicated business out of authentication in the consumer space have failed consistently. While there has long been interest in deploying hardware tokens as a second factor, standalone tokens (such as RSA SecurID) have seen limited deployment outside enterprise environments, likely due to the cost of tokens relative to the value of free online accounts. Microsoft Passport and OpenID, among many attempts to offer single sign-on for the Web, have failed to gain mass adoption.

The widespread availability of smartphones may be changing the equation, however, as in the early 2010s a number of online services, including Facebook, Google, and Twitter, deployed free smartphone applications to act as a second factor based on the emerging time-based one-time password (TOTP) standard.27 Other services send codes via short message service (SMS) as a backup authentication mechanism. A few services have offered dedicated tokens as a second factor, typically in environments at greater risk of fraud (such as eBay and World of Warcraft).

Random models for user behavior. In addition to being regarded as the weak link in password systems, users are also typically the most difficult component to model. Ideally, they would choose passwords composed of random characters. But even researchers who acknowledge this is an idealized model have usually underestimated the gap between modeled behavior and reality. Security policies and research models have been slow to adjust to the magnitude of the inaccuracies revealed by new data sources (such as leaked password datasets and large-scale measurement studies).5

One of the best-known early sources on password policies is the U.S. Defense Department's circa 1985 Green Book,38 which specified detailed policies for mitigating the risk of password guessing, including rate limiting, hashing passwords at rest, and limiting the lifetime of passwords. The Green Book avoided the complexity of user behavior altogether, putting forth as one of three main principles: "Since many user-created passwords are particularly easy to guess, all passwords should be machine-generated."

The same year, NIST published its Password Usage guidelines in the Federal Information Processing Standards (FIPS) series39 that were heavily derived from the Green Book. In addition to the recommended machine-chosen passwords, the FIPS guidelines allowed user-chosen passwords with the caveat that users "shall be instructed to use a password selected from all acceptable passwords at random, if possible, or to select one that is not related to their personal identity, history, or environment." Today, nearly all nonmilitary applications allow user-chosen passwords due to the perceived difficulty of remembering machine-chosen passwords. Yet the FIPS guidelines retained most other recommendations from the Green Book unchanged, including calculations of password security based on allowed characters and length requirements, limits on password lifetime, and forced updates. This encouraged the unrealistically optimistic assumption that users choose passwords similarly to random password generators that has persisted to this day.

Estimating password strength via "entropy." The guessing resistance of user-chosen passwords is often estimated by modeling passwords as random choices from a uniform distribution. This enables straightforward calculations of expected guessing times in the tradition of the 1985 Green Book. To attempt to capture the fact many users choose from a relatively small number

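A minimal modern analogue of the salted, dedicated password hashing that the article traces back to crypt() can be sketched with Python's standard library. PBKDF2 stands in for crypt() here, and the iteration count is illustrative, not a recommendation:

```python
import hashlib
import hmac
import os

# Store a random per-user salt plus a slow, salted hash -- the practice
# descended from Unix crypt(). Salting defeats precomputed tables;
# the slow key-derivation function raises per-guess cost offline.

ITERATIONS = 100_000  # illustrative work factor

def hash_password(password, salt=None):
    """Return (salt, digest) for storage; a fresh salt is drawn if none given."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison
```

Because each user gets a fresh random salt, identical passwords produce different stored digests, so a leaked file cannot be attacked with one shared precomputed table.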

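The time-based one-time password (TOTP) standard mentioned above is specified in RFC 6238, building on HOTP's dynamic truncation from RFC 4226: an HMAC is computed over the number of 30-second steps since the Unix epoch and truncated to a short decimal code. A minimal sketch:

```python
import hashlib
import hmac
import struct
import time

# Minimal TOTP (RFC 6238): HMAC-SHA1 over the 30-second time-step
# counter, dynamically truncated (RFC 4226) to a short decimal code.

def totp(secret, at_time=None, step=30, digits=8):
    counter = int((time.time() if at_time is None else at_time) // step)
    msg = struct.pack(">Q", counter)                  # 8-byte big-endian counter
    mac = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                           # dynamic truncation offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

With the RFC 6238 test secret b"12345678901234567890", this reproduces the published eight-digit SHA-1 test vectors (for example, 94287082 at time 59). Both sides can compute the code, so it acts as a short-lived shared-secret proof rather than a replayable static password.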
of common passwords (despite the very large theoretical space from which to choose text passwords), researchers often choose a relatively small uniform distribution. The logarithm of the size of this uniform distribution in this model is often called "entropy," in reference to Claude Shannon's famous measure H1 of the minimum average number of bits needed to encode symbols drawn from a distribution. Unfortunately, Shannon entropy models the number of guesses needed by an attacker who can check in constant time if an unknown password is a member of an arbitrarily sized set of possibilities. This does not correspond to any real guessing attack.

A more direct metric is "guesswork," G,25 the expected number of queries by an adversary guessing individual passwords sequentially until all are found. It can be shown that H1 provides a lower bound on G.25 However, G is also problematic, as it can be highly skewed by rare but difficult-to-guess passwords.5 To address this bias, "partial" (or "marginal") guessing metrics have been developed.5,32 One formulation is partial guesswork (Gα), which models an attacker making only enough guesses to have a probability α of succeeding.5 This encapsulates the traditional G when α = 1, with lower values of α modeling more realistic attackers. Such metrics have been proven not to be lower-bounded in general by Shannon entropy.5,32 While partial-guessing metrics provide an appropriate mathematical model of password-guessing difficulty, they require a very large sample to be estimated accurately, typically millions of passwords,5 and have thus found limited practical use.

Heuristic measures of password strength are often needed for smaller datasets. NIST's Electronic Authentication Guidelines8 (in many ways an update to the Green Book) acknowledged the mathematical unsoundness of Shannon entropy for guessing but still introduced a heuristic method for estimating "entropy" (NIST entropy) of password distributions under various composition policies. This model has since been used in many academic studies, though it has been found to produce relatively inaccurate estimates in practice.23,40 The preferred empirical approach, albeit dependent on the configuration of a particular tool, is to simply run a popular open-source password-cracking library against a set of passwords and evaluate the average number of guesses needed to find a given proportion of them.23,42 This approach can be applied to even a single password to evaluate its relative strength, though this clearly overestimates security relative to any real adversaries who use a more favorable cracking library.

Improving password strength. Simple measures like Shannon and NIST entropy make increases in password strength seem tantalizingly close. Composition policies that increase the minimum length or expand the classes of characters a password must contain seem to cause reliable increases in these measures if passwords are random; for example, the NIST guidelines suggest requiring at least one uppercase and non-alphabetic character. While acknowledging users may insert them in predictable places, the guidelines still estimate an increase in guessing difficulty of a password by six bits (or a factor of 64) compared to a policy of allowing any password. However, experiments have shown this is likely an overestimate by an order of magnitude.40

Such password policies persist despite imposing a high usability cost, discussed later, though tellingly, their use is far less common at sites facing greater competition (such as webmail providers15) than at sites with little competition (such as universities and government services). Instead, research suggests the most effective policy is simply using a large blacklist34 to limit the frequency of the most common passwords, bounding online guessing attacks to a predictable level, and conceding that many users will choose passwords vulnerable to offline guessing.

A related goal has been to nudge users toward better passwords through feedback (such as graphical meters that indicate estimated strength of their password, as they choose it). In an experimental setting, very aggressive strength meters can make guessing user-chosen passwords dramatically more difficult.37 However, in studies using meters typical of those found in practice, with users who were not prompted to consider their password, the impact of meters was negligible;

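For a toy distribution, the metrics discussed above can be computed directly. Note that work_factor below is the simpler "α-work-factor" flavor of partial guessing (minimum guesses to succeed with probability α), not the full partial guesswork Gα from the literature:

```python
import math

# Toy password distribution: probabilities in descending order, as an
# optimal guesser would try them.

def shannon_entropy(p):
    """H1: average bits to encode a draw -- NOT a count of guesses."""
    return -sum(q * math.log2(q) for q in p)

def guesswork(p):
    """G: expected number of sequential guesses until the password is found."""
    return sum(i * q for i, q in enumerate(p, start=1))

def work_factor(p, alpha):
    """Minimum guesses needed to succeed with probability at least alpha."""
    cumulative = 0.0
    for i, q in enumerate(p, start=1):
        cumulative += q
        if cumulative >= alpha:
            return i
    return len(p)

probs = [0.5, 0.25, 0.25]
```

For this distribution H1 = 1.5 bits while G = 1.75 guesses, illustrating that entropy and expected guessing effort are distinct quantities; a realistic attacker satisfied with a 50% success rate needs only one guess.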
many users failed to notice them at all.12 An empirical data point comes from Yahoo!, where adding a password-strength meter did improve password security but only marginally.5

Independence when choosing multiple passwords. A random user model often further assumes every password will be independently chosen. In practice, this is rarely true on the Web, as users cope with the large number of accounts by password reuse, sometimes with slight modification; for example, a 2007 telemetry study estimated the median user has 25 password-protected accounts but only six unique passwords.14 This has direct security implications, as leaks at one website can compromise security at another. Even if a user has not exactly reused the password, attackers can guess small variations that may double their chances of success in an online-guessing scenario.10 Related password choices similarly undermine the security goals of forced password updates, as an attacker with knowledge of a user's previous sequence of passwords can often easily guess the next password.43

Offline vs. online threats. The security literature distinguishes between online attackers who must interact with a legitimate party to authenticate and offline attackers who are limited only in terms of their computational resources. Superficially, offline attackers are far more powerful, as they typically can make an unbounded number of guesses and compare them against a known hash of the password. Yet many additional avenues of attack are available to the online attacker: stealing the password using client-side malware, phishing the password using a spoofed site, eavesdropping the password as it is transmitted, stealing the password from the authentication server, stealing the password from a second authentication server where the user has reused it, and subverting the automated password reset process.

A critical observation is that strong passwords do not help against any of these other attacks. Even the strongest passwords are still static secrets that can be replayed and are equally vulnerable to phishing, theft, and eavesdropping. Mandating stronger passwords does nothing to increase security against such attacks.

Offline guessing (cracking). Much attention has been devoted to devising strategies for picking passwords complex enough to resist offline cracking. Yet this countermeasure may stop real-world damage in at most a narrow set of circumstances.13 For an attacker without access to the password file, any guessing must be done online, which can be rate-limited. If passwords in a leaked file are unhashed, they are exposed in plaintext regardless of complexity; if hashed but unsalted, then large "rainbow tables"31 allow brute-force look-up up to some length.a Only if the attacker has obtained a password file that had been properly hashed and salted do password-cracking efficiency and password strength make a real difference. And yet, while hashing and salting have long been considered best practice by security professionals, they are far from universal. Empirical estimates suggest over 40% of sites store passwords unhashed;7 recent large-scale password file leaks revealed many were plaintext (such as RockYou and Tianya), hashed but unsalted (such as LinkedIn), improperly hashed (such as Gawker), or reversibly encrypted (such as Adobe).

Finally, offline attackers may be interrupted if their breach is detected and administrators can force affected users to reset their passwords. Password resets are often not instituted at breached websites due to fear of losing users; they are even less commonly mandated for users whose password may have been leaked from a compromised third-party website where it may have been reused.

Online guessing. Online attackers can verify whether any given password guess is correct only by submitting it to the authentication server. The number of guesses that can be sent is limited. A crude "three strikes" model is an obvious way of throttling attacks, but relatively few sites implement such a deterministic policy,7 probably to avoid denial of service.

Nonetheless, online guessing attacks are in some ways much more

a Freely available tables quickly reverse the MD5 hash of any alphanumeric password of up to 10 characters. Using the passwords from RockYou (a site that had a major password leak in 2009), we can estimate such a table would cover 99.7% of users for low-value online accounts.

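The load comparison that follows in the text is a back-of-the-envelope calculation that can be reproduced directly (365-day years; the article's "three years" and "60 years" are rounded figures):

```python
# Back-of-the-envelope load of an online guessing attack, following the
# figures in the text: one million users, one login per user per day,
# and 1,000 guesses per account.
users = 1_000_000
logins_per_day = users * 1              # legitimate login volume
guesses = 1000 * users                  # one billion guesses in total

days_of_traffic = guesses / logins_per_day           # 1,000 days of traffic
years_of_traffic = days_of_traffic / 365             # ~2.7 years ("three years")

legit_fail_rate = 0.05                               # typos, forgetting
legit_fails_per_day = logins_per_day * legit_fail_rate
days_of_fail_events = guesses / legit_fails_per_day  # attacker guesses ~all fail
years_of_fail_events = days_of_fail_events / 365     # ~54.8 years ("60 years")
```

The point of the arithmetic is that an online attacker's failed attempts dwarf legitimate traffic, which is exactly what makes online guessing detectable and throttleable.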

costly to mount than offline attacks, on a per-guess basis. Whereas offline, an attacker might check a billion guesses on a single host, online an attacker might need thousands of hosts. First, if we assume IP addresses that send millions of failed attempts will be blocked, the load must be distributed. Also, the load could exceed legitimate traffic; in a service with one million users where the average user logs in once per day, a total of one billion guesses (one thousand guesses per account) is as many login requests as the legitimate population generates in three years. If legitimate users fail 5% or so of the time (due to, say, typos or forgetting) the attacker will generate as many fail events as the legitimate population generates in 60 years.

Choosing a password to withstand an offline attack is thus much more difficult than choosing one to withstand an online attack. Yet the additional effort pays off only in the very restricted circumstances in which offline attacks can occur.13 It makes little sense to focus on this risk when offline attacks are dwarfed by other vectors (such as malware).

Today's "Overconstrained" World
Passwords offer compelling economic advantages over the alternatives, with lowest start-up and incremental costs per user. Due largely to their status as the incumbent solution, they also have clear "deployability" advantages (such as backward compatibility, interoperability, and no migration costs). But it is not these factors alone that are responsible for their longevity; the "password replacement prob-…

…single solution can be expected to address all requirements, ranging from financial to privacy protection. The list of usability, deployability, and security requirements is simply too long (and rarely documented explicitly).

An in-depth review of 35 proposed password alternatives using a framework of 25 comparison criteria found no proposal beats passwords on all fronts.6 Passwords appear to be a Pareto equilibrium, requiring some desirable property X be given up to gain any new benefit Y, making passwords very difficult to replace.

Reviewing how categories of these password alternatives compare to regular passwords yields insight. Password managers—software that can remember and automatically type passwords for users—may improve security and usability in the common case but are challenging for users to configure across all user agents. This problem also affects some graphical password schemes,4 while others offer insufficient security gains to overcome change-resisting inertia. Biometric schemes, besides their significant deployment hurdles, appear poorly suited to the unsupervised environment of Web authentication; fraudsters can just replay digital representations of fingerprints or iris patterns. Schemes using hardware tokens or mobile phones to generate one-time access codes may be promising, with significant security advantages, but ubiquitous adoption remains elusive due to a combination of usability issues and cost.

Federated authentication, or "single sign-on" protocols, in which users are …

…words entirely. A better choice is to prioritize competing requirements depending on organizational priorities and usage scenarios and aim for gradual adoption. Given their universal support as a base user-authentication mechanism, passwords are sensibly implemented first, offering the cheapest way to get things up and running when an idea is not yet proven and security is not yet critical, with no learning curve or interoperability hurdles. Low adoption costs also apply to users of new sites, who need low barriers when exploring new sites they are not sure they will return to. Financial websites are the rare exception, with offline capital and users whose accounts are all clearly valuable.

The list of challenges to would-be alternatives goes on. Improving security despite any decline in usability may mean losing potential new users (and sometimes existing users) to competitors. Some alternatives require server or client software modifications by various independent parties and are often a showstopper; some others expect large numbers of users to change their existing habits or be trained to use new mechanisms. However, some are only partial solutions or address only a subset of security threats; some are even less user friendly, though in new and different ways. Moreover, as mentioned earlier, some are more costly and bring other deployment challenges (such as interoperability, compatibility with existing infrastructure, and painful migration).

Advice to users. Users face a plethora of advice on passwords: use a differ-
lem” is both underspecified and over- authenticated by delegation to central ent one for each account; change them
constrained.6,20 identity providers, could significantly often; use a mix of letters, punctuation,
It is underspecified in that there is reduce several problems with pass- symbols, and digits; make them at least
no universally agreed set of concrete words without completely eliminating eight characters long; avoid personal
requirements covering diverse envi- them. Yet, besides introducing serious information (such as names and birth-
ronments, technology platforms, cul- privacy issues, they have been unable days); and do not write them down.
tures, and applications; for example, to offer a business model sufficiently These suggestions collectively pose an
many authentication proposals be- appealing to relying sites. The most unrealistic burden and are sometimes
come utterly unworkable on mobile successful deployment to date, Face- mutually incompatible; a person can-
devices with touchscreens, many Asian book Connect (a version of OAuth), not be expected to memorize a differ-
languages are now typed with con- incentivizes relying parties with user ent complex password for each of, say,
stant graphical feedback that must be data, mandating a central role for Face- 50 accounts, let alone change all of
disabled during password entry, and book as sole identity provider, which them on a rolling basis. Popular wis-
many large websites must support both does little for privacy. dom has summarized the password
low-value forum accounts and impor- With no clear winner satisfying all advice of the security experts as “Pick
tant e-commerce or webmail accounts criteria, inertia becomes a substantial something you cannot remember, and
through a single system. It is simul- hurdle and the deck is stacked against do not write it down.”
taneously overconstrained in that no technologies hoping to replace pass- Each bit of advice may be useful
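The load arithmetic for online guessing given earlier can be checked directly. This short sketch just redoes the article's illustrative figures (one million users, one login per user per day, 1,000 online guesses per account, a 5% legitimate failure rate); the numbers are the article's examples, not measurements:

```python
# Illustrative figures from the text, not measurements.
users = 1_000_000
legit_logins_per_day = users * 1          # one login per user per day
attacker_guesses = users * 1_000          # one billion guesses in total

# Attacker guesses expressed as days (and years) of legitimate login traffic.
days_of_traffic = attacker_guesses / legit_logins_per_day
print(round(days_of_traffic / 365, 1))    # ~2.7 years; "three years" in round numbers

# If 5% of legitimate logins fail, compare attacker fail events
# (essentially every guess fails) with legitimate fail events.
legit_fails_per_day = 0.05 * legit_logins_per_day
years_of_fail_events = attacker_guesses / legit_fails_per_day / 365
print(round(years_of_fail_events))        # ~55 years; "60 years" in round numbers
```

The text's "three years" and "60 years" are these ratios in round numbers; the point is the order of magnitude, not the exact figures.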

JULY 2015 | VOL. 58 | NO. 7 | COMMUNICATIONS OF THE ACM 83
contributed articles
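The one-time access codes generated by hardware tokens and phone apps, mentioned earlier, commonly implement the TOTP algorithm of RFC 6238 (reference 27), which applies HOTP (RFC 4226) to a 30-second time window. A minimal sketch using only the Python standard library; the key shown is the RFC test key, not a real secret:

```python
import hmac, hashlib, struct, time

def hotp(key, counter, digits=6):
    """RFC 4226 HOTP: HMAC-SHA1 over the 8-byte counter, dynamically truncated."""
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                      # low nibble of last byte picks the window
    code = int.from_bytes(mac[offset:offset + 4], "big") & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(key, at=None, step=30):
    """RFC 6238 TOTP: HOTP applied to the current 30-second time window."""
    t = int(time.time() if at is None else at) // step
    return hotp(key, t)

# RFC test key "12345678901234567890"; at Unix time 59 the window counter is 1.
print(totp(b"12345678901234567890", at=59))
```

The security advantage comes from the code changing every window, so a replayed code quickly becomes useless; the usability and cost hurdles the text describes are in provisioning the shared key and carrying the token.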

Each bit of advice may be useful against a specific threat, motivating security professionals to offer them in an attempt to cover their bases and avoid blame for any potential security breaches regardless of the inconvenience imposed on users. This approach is predicted to lead to failure by the “compliance budget” model3 in which the willingness of each user to comply with annoying security requirements is a finite, exhaustible resource that should be managed as carefully as any other budget. Indeed, websites (such as online stores), whose users are free to vote with their wallets, are much more careful about not exceeding their customers’ compliance budget than sites (such as universities) whose users are “captive.”15

Useful security advice requires a mature risk-management perspective and rough quantification of the risks and costs associated with each countermeasure. It also requires acknowledging that, with passwords as deployed today, users have little control over the most important countermeasures. In particular, running a personal computer free of malware may be the most important step, though it is challenging and often ignored in favor of password advice, which is simpler to conceptualize but far less important. Likewise, good rate limiting and compromise detection at the server-side are critical, as discussed earlier, but users have no agency other than to patronize better-implemented sites.

Choosing extremely strong passwords, as is often advised, is of far more limited benefit; evidence it reduces harm in practice is elusive. As noted earlier, password cracking is rarely a critical step in attacks. Hence making passwords strong enough to resist dedicated cracking attacks seems an effort poorly rewarded for all but the most critical Web accounts. For important accounts, password-selection advice should encourage passwords not easily guessed by acquaintances and sufficient for withstanding a reasonable amount of online guessing, perhaps one million guesses. About half of users are already above this bar,5 but discouraging simple dictionary passwords via advice, strength meters, and blacklists remains advisable to help out the others.

Advice to avoid reusing passwords is also common. While it is a good defense against cross-site password compromise, it is, for most users, incompatible with remembering passwords. Better advice is probably to avoid reusing passwords for important accounts and not to worry about the large number of accounts of little value to an attacker (or their owner).

Moreover, we consider the advice against writing passwords to be outmoded for the Web. Stories involving “Post-it notes on the monitor” usually refer to corporate passwords where users feel no personal stake in their security. Most users understand written-down passwords should be kept in a wallet or other safe location generally not available to others, even acquaintances. With this caveat, written passwords are a worthwhile trade-off if they encourage users to avoid very weak passwords. Password managers can be an even better trade-off, improving usability (no remembering, no typing) and allowing a different strong password for each account. However, they introduce a single point of failure, albeit perhaps no more vulnerable than Web accounts already due to the prevalence of email-based password reset.

Multidimensional Future
We appear stuck between the intractable difficulty of replacing passwords and their ever-increasing insecurity and burden on users. Many researchers have predicted the dam will burst soon and the industry will simply have to pay the necessary costs to replace passwords. However, these predictions have been made for over a decade.

The key to understanding how large service providers manage, using what appears to be a “broken” technology, is that websites do not need perfection. The problem of compromised accounts is just one of many forms of abuse, along with spam, phishing, click fraud, bots, fake accounts, scams, and payment fraud. None of them has been completely defeated technologically, but all are managed effectively enough to keep things running.

In nearly every case, techniques that “solve” the problem technically have lost out to ones that manage them statistically; for example, despite many proposals to end spam, including cryptographic protocols to prevent domain spoofing and microcharges for each email message sent, most email providers have settled for approaches that classify mail based on known patterns of attacker behavior. These defenses are not free or easy to implement, with large Web operators often devoting significant resources toward keeping pace with abuse as it evolves. Yet, ultimately, this cost is typically far less than any approach requiring users to change behavior.

In the case of authentication, banks provide a ready example of living with imperfect technology. Even though credit-card numbers are essentially static secrets, which users make no attempt to conceal from merchants, fraud is kept to acceptable levels by back-end classifiers. Technologies like “chip and PIN” have not been a magic bullet where deployed.2 Cards are still stolen, PINs can be guessed or observed, signature transactions still exist as a fallback, and online payments without a PIN, or “card not present” transactions, are still widespread.

Yet banks survive with a non-binary authentication model where all available information is considered for each transaction on a best-effort basis. Web authentication is converging on a similar model, with passwords persisting as an imperfect signal supplemented by many others.

Web authentication as classification. Behind the scenes, many large websites have already transitioned to a risk-based model for user authentication. This approach emerged by the early 2000s at online financial sites.41 While an incorrect password means access should be denied, a correct password is just one signal or feature that can be used by a classifier to determine whether or not the authentication attempt involves the genuine account owner.

The classifier can take advantage of many signals besides the password, including the user’s IP address; geolocation; browser information, including cookies; the time of login; how the password is typed; and what resources are being requested. Unlike passwords, these implicit signals are available with no extra effort by the user.21
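The risk-based model just described can be caricatured in a few lines. The signal names, weights, and thresholds below are invented for illustration and are not any site's actual model; real systems train classifiers on far richer data:

```python
# Hypothetical risk scorer: a correct password is one signal among many.
# All names, weights, and thresholds are invented for illustration.
SIGNAL_WEIGHTS = {
    "password_correct":     0.55,
    "known_device_cookie":  0.20,
    "usual_geolocation":    0.15,
    "usual_login_hour":     0.05,
    "typing_pattern_match": 0.05,
}

def genuine_score(signals):
    """Confidence in [0, 1] that the attempt comes from the account owner."""
    return sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))

def decide(signals, allow=0.8, step_up=0.5):
    """Discretize the real-valued score: full access, extra challenge, or deny."""
    s = genuine_score(signals)
    if s >= allow:
        return "allow"
    if s >= step_up:
        return "challenge"   # e.g., ask for a one-time code over SMS
    return "deny"
```

With these invented weights, a correct password from an unrecognized device scores 0.55 and earns a challenge rather than full access, while a correct password plus a known cookie and usual geolocation scores 0.90 and is allowed through; either discretization can produce false accepts and false rejects.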


“Only if the attacker has obtained a password file that had been properly hashed and salted do password-cracking efficiency and password strength make a real difference.”

Mobile devices introduce many new signals from sensors that measure user interaction.11 While none of these signals is unforgeable, each is useful; for example, geolocation can be faked by a determined adversary,28 and browser fingerprinting techniques appear to be an endless arms race.29 Nonetheless, both may be valuable in multidimensional solutions, as the difficulty of forging all signals can be significant in practice; for example, by combining 120 such signals, Google reported a 99.7% reduction in 2013 in the rate of accounts compromised by spammers.18

Unlike traditional password authentication, the outcome is not binary but a real-valued estimate of the likelihood the attempt is genuine. Generally these results will be discretized, as users must be given access to some resource or not, and any classifier will inevitably make false accept and false reject errors. Sites will continue to develop their machine-learning techniques to reduce these errors and may deploy new technology (such as two-factor authentication and origin-bound certificates17) to increase the number (and quality) of signals available.

Web authentication is by no means an easy domain to address for machine learning. The trade-off between false accepts and false rejects is difficult to get right. For financial sites, false accepts translate to fraud but can usually be recovered from by reversing any fraudulent payments. However, for sites where false accepts result in disclosure of sensitive user data, the confidentiality violations can never be undone, making them potentially very costly. Meanwhile, false rejects annoy customers, who may switch to competitors.

Obtaining a large sample of ground truth to train the classifier is another challenge, as it is difficult to find examples of attacks administrators do not yet know about. Financially motivated attackers are again likely the easiest to deal with, as their attacks typically must be scalable, leading to a large volume of attacks and hence training data. Nonfinancially motivated attackers (such as ex-partners) may be more difficult to detect algorithmically, but users are far better positioned to deal with them in real life. Targeted attacks, including “advanced persistent threats,” which are technically sophisticated and aimed at a single user account, are the most difficult challenge, as attackers can tailor techniques to victims and leave relatively little signal available for classifiers.

New modes of operation. Authentication by classification enables fundamentally new methods of operation. Authentication can be a more flexible process, with additional information demanded as needed if the classifier’s confidence is low or the user attempts a particularly sensitive operation, a process called “progressive authentication”;33 for example, a site may ask users to confirm their identity by SMS or phone call if it notices suspicious concurrent activity on their account from geographically distant locations. “Multi-level authentication” becomes possible, with users given limited access when the classifier’s confidence is relatively low. In the U.K., some banks offer users a read-only view of their account with just a password but require a security token to actually transfer money out.

Sites may also ask for less information, including not requiring a password to be entered, when they have reasonable confidence, from secondary signals, the correct user is present. A form of this is already in place—where persistent session cookies once allowed password-less login for a predetermined duration, the decision of when to recheck the password is now made dynamically by a risk classifier instead. A stronger version is opportunistic two-factor authentication, ensuring correct authentication when the second factor is present but enabling fallback security if the password is still correct and enough additional signals are presented.17

The limit of this evolution is “continual authentication.” Instead of simply checking passwords at the entrance gate, the classifier can monitor the actions of users after letting them in and refining its decision based on these additional signals. Ultimately, continual authentication may mean the authentication process grows directly intertwined with other abuse-detection systems.
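The "progressive" and "multi-level" modes just described amount to mapping classifier confidence to tiers of access rather than to a yes/no decision. A toy sketch of the idea; the thresholds and tier names are invented for illustration, not any bank's policy:

```python
# Sketch of multi-level authentication: access is granted in tiers.
# Thresholds and tier names are invented for illustration.
def access_level(confidence, second_factor_ok=False):
    """Map classifier confidence (0..1) to an access tier.

    Mirrors the U.K. banking example in the text: a password alone may
    earn a read-only view, while transferring money requires a token.
    """
    if second_factor_ok and confidence >= 0.5:
        return "full"        # e.g., may transfer money out
    if confidence >= 0.7:
        return "read-only"   # view balances, no sensitive operations
    if confidence >= 0.4:
        return "challenge"   # progressive authentication: ask for more proof
    return "denied"
```

A continual-authentication system would re-run this mapping as new signals arrive during the session, downgrading the tier if confidence drops.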


Changes to the user experience. As sites aim to make correct authentication decisions “magically” on the back-end through machine learning, most changes to the user experience should be positive. Common cases will be increasingly streamlined; users will be asked to type passwords (or take other explicit action) as rarely as possible.

However, users also face potential downsides, as systems grow increasingly opaque and difficult to understand. First, users may see more requests for second factors (such as a one-time code over SMS) when the classifier’s confidence is low. Users may also face more cases (such as when traveling or switching to a new machine) where they are unable to access their own account despite correctly remembering their password, akin to unexpected credit-card rejections while abroad. Increased rejections may increase the burden on “fallback” authentication, to which we still lack a satisfactory solution.

As authentication systems grow in complexity, their automated decisions may cause users increased confusion and distress. Users are less likely to buy into any system that presents them with inconveniences they do not understand. Training users to respond with their credentials to asynchronous security challenges on alternative channels may also pave the way for novel phishing attacks. Even with careful user interface design, users may end up confused as to what the genuine authentication ceremony22 should be.

Another challenge is that better classifiers may break some access-control practices on top of passwords users have grown accustomed to; for example, users who share passwords with their spouses or their assistants may face difficulty if classifiers are able to (correctly) determine another human is using their password, even though this is what the user intended. Finally, typing passwords less often could in fact decrease their usability, as users are more likely to forget them if they go long periods between needing to type them.

Advantages of scale. Authentication may represent a classic example of the winner-take-all characteristics that appear elsewhere on the Web, since it offers benefits to scale in two different ways: First, large services are more likely to be accepted by relying parties as an identification service. Being accepted by more relying parties in turn encourages users to register accounts, further enhancing the attractiveness of these identity providers to relying parties. The second, “two-sided market,” or positive feedback loop, is for user data. Large services with more user data can provide more accurate authentication. This attracts users to interact with the services more frequently, providing even more data for authentication. Developing and maintaining a complex machine-learning-based approach to authentication requires a relatively large fixed technical cost and low marginal costs per user, further advantaging the largest identity providers.

One consequence of this consolidation is that, lacking access to the volumes of real-world data collected by larger service providers, independent researchers may be limited in their ability to contribute to several important research topics for which the limits of artificial datasets and mental extrapolation make empirically grounded research essential. Other areas of Web research (such as networking, which requires massive packet capture, or search-engine research requiring huge numbers of user queries) have likewise become difficult for researchers with access to only public data sources.

There are also troubling privacy implications if relying parties require users to sign up with a large service that, in turn, requires a significant amount of personal information to perform authentication well. This information may be inherently sensitive (such as time and location of login activity) or difficult to change if leaked (such as behavioral biometrics like typing patterns). Many users already trust highly sensitive information to large online services, but authentication may be a motivating factor to collect more data, store it longer, and share it with more parties.

Conclusion
Passwords offer plenty of examples of divergence between theory and practice; estimates of strength, models of user behavior, and password-composition policies that work well in theory generally remain unsupported by


evidence of reduced harm in practice and have in some cases been directly contradicted by empirical observation. Yet large Web services appear to cope with insecure passwords largely because shortcomings can be covered up with technological smarts in the back-end. This is a crucial, if unheralded, evolution, driven largely by industry, which is well experienced in data-driven engineering. Researchers who adapt their models and assumptions to reflect this trend will be in a stronger position to deliver relevant results. This evolution is still in its early stages, and there are many important and interesting questions about the long-term results that have received little or no study to this point. There is also scope for even more radical rethinking of user authentication on the Web; clean-slate approaches may seem too risky for large companies but can be explored by more agile academic researchers. Tackling these novel challenges is important for ensuring published research is ahead of industry practice, rather than the other way around.

Acknowledgments
Joseph Bonneau is funded by a Secure Usability Fellowship from Simply Secure and the Open Technology Fund. Paul van Oorschot is funded by a Natural Sciences and Engineering Research Council of Canada Chair in Authentication and Computer Security and a Discovery Grant. Frank Stajano is partly supported by European Research Council grant 307224.

References
1. Adams, A. and Sasse, M. Users are not the enemy. Commun. ACM 42, 12 (Dec. 1999), 41–46.
2. Anderson, R., Bond, M., and Murdoch, S.J. Chip and spin. Computer Security Journal 22, 2 (2006), 1–6.
3. Beautement, A., Sasse, M.A., and Wonham, M. The compliance budget: Managing security behaviour in organisations. In Proceedings of the New Security Paradigms Workshop (Lake Tahoe, CA, 2008).
4. Biddle, R., Chiasson, S., and van Oorschot, P.C. Graphical passwords: Learning from the first 12 years. ACM Computing Surveys 44, 4 (Aug. 2012), article 19:1–41.
5. Bonneau, J. The science of guessing: Analyzing an anonymized corpus of 70 million passwords. In Proceedings of the IEEE Symposium on Security and Privacy (San Francisco, CA, May 20–23). IEEE Press, 2012.
6. Bonneau, J., Herley, C., van Oorschot, P.C., and Stajano, F. The quest to replace passwords: A framework for comparative evaluation of Web authentication schemes. In Proceedings of the IEEE Symposium on Security and Privacy (San Francisco, CA, May 20–23). IEEE Press, 2012.
7. Bonneau, J. and Preibusch, S. The password thicket: Technical and market failures in human authentication on the Web. In Proceedings of the Workshop on the Economics of Information Security (Arlington, VA, June 14–15, 2010).
8. Burr, W.E., Dodson, D.F., Newton, E.M., Perlner, R.A., Polk, W.T., Gupta, S., and Nabbus, E.A. Electronic Authentication Guideline. National Institute of Standards and Technology S.P. 800-63-2, Gaithersburg, MD, 2013.
9. Cheswick, W.R. Rethinking passwords. Commun. ACM 56, 2 (Feb. 2013), 40–44.
10. Das, A., Bonneau, J., Caesar, M., Borisov, N., and Wang, X. The tangled web of password reuse. In Proceedings of the Network and Distributed System Security Symposium (San Diego, CA, Feb. 23–26). Internet Society, Reston, VA, 2014.
11. De Luca, A., Hang, A., Brudy, F., Lindner, C., and Hussmann, H. Touch me once and I know it’s you!: Implicit authentication based on touchscreen patterns. In Proceedings of the ACM CHI Conference (Austin, TX, May 5–10). ACM Press, New York, 2012.
12. Egelman, A., Sotirakopoulos, A., Muslukhov, I., Beznosov, K., and Herley, C. Does my password go up to eleven?: The impact of password meters on password selection. In Proceedings of the ACM CHI Conference (Paris, France, Apr. 27–May 2). ACM Press, New York, 2013.
13. Florêncio, D., Herley, C., and van Oorschot, P.C. An administrator’s guide to Internet password research. In Proceedings of the 28th Large Installation System Administration Conference (Seattle, WA, Nov. 9–14). USENIX Association, Berkeley, CA, 2014.
14. Florêncio, D. and Herley, C. A large-scale study of Web password habits. In Proceedings of the 16th International Conference on the World Wide Web (Banff, Alberta, Canada, May 8–12, 2007).
15. Florêncio, D. and Herley, C. Where do security policies come from? In Proceedings of the ACM Symposium On Usable Privacy and Security (Redmond, WA, July 14–16). ACM Press, New York, 2010.
16. Garera, S., Provos, N., Chew, M., and Rubin, A.D. A framework for detection and measurement of phishing attacks. In Proceedings of the Fifth ACM Workshop on Recurring Malcode (Alexandria, VA). ACM Press, New York, 2007.
17. Grosse, E. and Upadhyay, M. Authentication at scale. IEEE Security & Privacy Magazine 11 (2013), 15–22.
18. Hearn, M. An update on our war against account hijackers. Google Security Team blog, Feb. 2013.
19. Herley, C. So long, and no thanks for the externalities: The rational rejection of security advice by users. In Proceedings of the ACM New Security Paradigms Workshop (Oxford, U.K., Sept. 8–11). ACM Press, New York, 2009.
20. Herley, C. and van Oorschot, P.C. A research agenda acknowledging the persistence of passwords. IEEE Security & Privacy 10, 1 (2012), 28–36.
21. Jakobsson, M., Shi, E., Golle, P., and Chow, R. Implicit authentication for mobile devices. In Proceedings of the Fourth USENIX Workshop on Hot Topics in Security (Montreal, Canada, Aug. 11). USENIX Association, Berkeley, CA, 2009.
22. Karlof, C., Tygar, J.D., and Wagner, D. Conditioned-safe ceremonies and a user study of an application to Web authentication. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (San Diego, CA, Feb. 8–11). Internet Society, Reston, VA, 2009.
23. Kelley, P.G., Komanduri, S., Mazurek, M.L., Shay, R., Vidas, T., Bauer, L., Christin, N., Cranor, L.F., and Lopez, J. Guess again (and again and again): Measuring password strength by simulating password-cracking algorithms. In Proceedings of the IEEE Symposium on Security and Privacy (San Francisco, CA, May 20–23). IEEE Press, 2012.
24. Klein, D. Foiling the cracker: A survey of, and improvements to, password security. In Proceedings of the Second USENIX Security Workshop. USENIX Association, Berkeley, CA, 1990.
25. Massey, J.L. Guessing and entropy. In Proceedings of the 1994 IEEE International Symposium on Information Theory (June 27–July 1). IEEE Press, 1994, 204.
26. Morris, R. and Thompson, K. Password security: A case history. Commun. ACM 22, 11 (1979), 594–597.
27. M’Raihi, D., Machani, S., Pei, M., and Rydell, J. TOTP: Time-Based One-Time Password Algorithm, RFC 6238. The Internet Engineering Task Force, Fremont, CA, May 2011.
28. Muir, J.A. and van Oorschot, P.C. Internet geolocation: Evasion and counterevasion. ACM Computing Surveys 42, 1 (Dec. 2009), 4:1–13.
29. Nikiforakis, N., Kapravelos, A., Joosen, W., Kruegel, C., Piessens, F., and Vigna, G. Cookieless monster: Exploring the ecosystem of Web-based device fingerprinting. In Proceedings of the IEEE Symposium on Security and Privacy (Berkeley, CA, May 19–22). IEEE Press, 2013.
30. Norman, D.A. The way I see it: When security gets in the way. Interactions 16, 6 (Nov. 2009), 60–63.
31. Oechslin, P. Making a faster cryptanalytic time-memory trade-off. In Proceedings of the 23rd Annual International Advances in Cryptology (Santa Barbara, CA, Aug. 17–21). Springer, 2003.
32. Pliam, J.O. On the incomparability of entropy and marginal guesswork in brute-force attacks. In Proceedings of INDOCRYPT: The First International Conference on Cryptology in India (Calcutta, India, Dec. 10–13). Springer, 2000.
33. Riva, O., Qin, C., Strauss, K., and Lymberopoulos, D. Progressive authentication: Deciding when to authenticate on mobile phones. In Proceedings of the 21st USENIX Security Symposium (Bellevue, WA, Aug. 8–10). USENIX Society, Berkeley, CA, 2012.
34. Schechter, S., Herley, C., and Mitzenmacher, M. Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks. In Proceedings of the Fifth USENIX Workshop on Hot Topics in Security (Washington, D.C., Aug. 10). USENIX Society, Berkeley, CA, 2010.
35. Spafford, E. Observations on reusable password choices. In Proceedings of the USENIX Security Workshop. USENIX Society, Berkeley, CA, 1992.
36. Stajano, F. Pico: No more passwords! In Proceedings of the 19th International Security Protocols Workshop, Lecture Notes in Computer Science 7114 (Cambridge, U.K., Mar. 28–30). Springer, 2011.
37. Ur, B., Kelley, P.G., Komanduri, S., Lee, J., Maass, M., Mazurek, M.L., Passaro, T., Shay, R., Vidas, T., Bauer, L., Christin, N., and Cranor, L.F. How does your password measure up? The effect of strength meters on password creation. In Proceedings of the USENIX Security Symposium (Bellevue, WA, Aug. 8–10). USENIX Society, Berkeley, CA, 2012.
38. U.S. Department of Defense. Password Management Guideline. Technical Report CSC-STD-002-85. Washington, D.C., 1985.
39. U.S. National Institute of Standards and Technology. Password Usage. Federal Information Processing Standards Publication 112. Gaithersburg, MD, May 1985.
40. Weir, M., Aggarwal, S., Collins, M., and Stern, H. Testing metrics for password creation policies by attacking large sets of revealed passwords. In Proceedings of the 17th ACM Conference on Computer and Communications Security (Chicago, IL, Oct. 4–8). ACM Press, New York, 2010.
41. Williamson, G.D. Enhanced authentication in online banking. Journal of Economic Crime Management 4, 2 (2006).
42. Yan, J., Blackwell, A., Anderson, R., and Grant, A. Password memorability and security: Empirical results. IEEE Security & Privacy 2, 5 (2004), 25–31.
43. Zhang, Y., Monrose, F., and Reiter, M. The security of modern password expiration: An algorithmic framework and empirical analysis. In Proceedings of the 17th ACM Conference on Computer and Communications Security (Chicago, IL, Oct. 4–8). ACM Press, New York, 2010.

Joseph Bonneau (jbonneau@cs.stanford.edu) is a postdoctoral research fellow at Stanford University, Stanford, CA, and a technology fellow at the Electronic Frontier Foundation, San Francisco, CA.

Cormac Herley (cormac@microsoft.com) is a principal researcher at Microsoft Research, Redmond, WA.

Paul C. van Oorschot (paulv@scs.carleton.ca) is a professor of computer science and Canada Research Chair in Authentication and Computer Security in the School of Computer Science at Carleton University, Ottawa, Ontario, Canada.

Frank Stajano (frank.stajano@cl.cam.ac.uk) is a Reader in Security and Privacy at the University of Cambridge, Cambridge, U.K., where he is also the head of the Academic Centre of Excellence in Cyber Security Research, and a Fellow of Trinity College.

Copyright held by authors.
Publication rights licensed to ACM. $15.00

review articles
DOI:10.1145/2699411

Open-universe probability models show merit in unifying efforts.

BY STUART RUSSELL

Unifying Logic and Probability

key insights
- First-order logic and probability theory have addressed complementary aspects of knowledge representation and reasoning: the ability to describe complex domains concisely in terms of objects and relations and the ability to handle uncertain information. Their unification holds enormous promise for AI.
- New languages for defining open-universe probability models appear to provide the desired unification in a natural way. As a bonus, they support probabilistic reasoning about the existence and identity of objects, which is important for any system trying to understand the world through perceptual or textual inputs.

PERHAPS THE MOST enduring idea from the early days of AI is that of a declarative system reasoning over explicitly represented knowledge with a general inference engine. Such systems require a formal language to describe the real world; and the real world has things in it. For this reason, classical AI adopted first-order logic—the mathematics of objects and relations—as its foundation.

The key benefit of first-order logic is its expressive power, which leads to concise—and hence learnable—models. For example, the rules of chess occupy 100 pages in first-order logic, 10^5 pages in propositional logic, and 10^38 pages in the language of finite automata. The power comes from separating predicates from their arguments and quantifying over those arguments: so one can write rules about On(p, c, x, y, t) (piece p of color c is on square x, y at move t) without filling in each specific value for c, p, x, y, and t.

Modern AI research has addressed another important property of the real world—pervasive uncertainty about both its state and its dynamics—using probability theory. A key step was Pearl’s development of Bayesian networks, which provided the beginnings of a formal language for probability models and enabled rapid progress in reasoning, learning, vision, and language understanding. The expressive power of Bayes nets is, however, limited. They assume a fixed set of variables, each taking a value from a fixed range; thus, they are a propositional formalism, like Boolean circuits. The rules of chess and of many other domains are beyond them.

What happened next, of course, is that classical AI researchers noticed the pervasive uncertainty, while modern AI researchers noticed, or remembered, that the world has things in it. Both traditions arrived at the same place: the world is uncertain and it has things in it. To deal with this, we have to unify logic and probability.

But how? Even the meaning of such a goal is unclear. Early attempts by Leibniz, Bernoulli, De Morgan, Boole, Peirce, Keynes, and Carnap (surveyed by Hailperin12 and Howson14) involved attaching probabilities to logical sentences. This line of work influenced AI

88 COMMUNICATIO NS O F TH E ACM | J U LY 201 5 | VO L . 5 8 | NO. 7


research but has serious shortcomings as a vehicle for representing knowledge. An alternative approach, arising from both branches of AI and from statistics, combines the syntactic and semantic devices of logic (composable function symbols, logical variables, quantifiers) with the compositional semantics of Bayes nets. The resulting languages enable the construction of very large probability models and have changed the way in which real-world data are analyzed.
Despite their successes, these approaches miss an important consequence of uncertainty in a world of things: uncertainty about what things are in the world. Real objects seldom wear unique identifiers or preannounce their existence like the cast of a play. In areas such as vision, language understanding, Web mining, and computer security, the existence of objects must be inferred from raw data (pixels, strings, and so on) that contain no explicit object references.
The difference between knowing all the objects in advance and inferring their existence from observation corresponds to the distinction between closed-universe languages such as SQL and logic programs and open-universe languages such as full first-order logic. This article focuses in particular on open-universe probability models. The section on Open-Universe Models describes a formal language, Bayesian logic or BLOG, for writing such models.21

It gives several examples, including (in simplified form) a global seismic monitoring system for the Comprehensive Nuclear Test-Ban Treaty.

Logic and Probability
This section explains the core concepts of logic and probability, beginning with possible worlds.a A possible world is a formal object (think "data structure") with respect to which the truth of any assertion can be evaluated.
For the language of propositional logic, in which sentences are composed from proposition symbols X1, . . . , Xn joined by logical connectives (∧, ∨, ¬, ⇒, ⇔), the possible worlds ω ∈ Ω are all possible assignments of true and false to the symbols. First-order logic adds the notion of terms, that is, expressions referring to objects; a term is a constant symbol, a logical variable, or a k-ary function applied to k terms as arguments. Proposition symbols are replaced by atomic sentences, consisting of either predicate symbols applied to terms or equality between terms. Thus, Parent(Bill, William) and Father(William) = Bill are atomic sentences. The quantifiers ∀ and ∃ make assertions across all objects, for example,

∀ p, c (Parent(p, c) ∧ Male(p)) ⇔ Father(c) = p.

For first-order logic, a possible world specifies (1) a set of domain elements (or objects) o1, o2, . . . and (2) mappings from the constant symbols to the domain elements and from the function and predicate symbols to functions and relations on the domain elements. Figure 1a shows a simple example with two constants and one binary predicate. Notice that first-order logic is an open-universe language: even though there are two constant symbols, the possible worlds allow for 1, 2, or indeed arbitrarily many objects.
A closed-universe language enforces additional assumptions:
– The unique names assumption requires that distinct terms must refer to distinct objects.
– The domain closure assumption requires that there are no objects other than those named by terms.
These two assumptions force every world to contain the same objects, which are in one-to-one correspondence with the ground terms of the language (see Figure 1b).b Obviously, the set of worlds under open-universe semantics is larger and more heterogeneous, which makes the task of defining open-universe probability models more challenging.
The formal semantics of a logical language define the truth value of a sentence in a possible world. For example, the first-order sentence A = B is true in world ω iff A and B refer to the same object in ω; thus, it is true in the first three worlds of Figure 1a and false in the fourth. (It is always false under closed-universe semantics.) Let T(α) be the set of worlds where the sentence α is true; then one sentence α entails another sentence β, written α ⊨ β, if T(α) ⊆ T(β). Logical inference algorithms generally determine whether a query sentence is entailed by the known sentences.
In probability theory, a probability model P for a countable space Ω of possible worlds assigns a probability P(ω) to each world, such that 0 ≤ P(ω) ≤ 1 and Σω∈Ω P(ω) = 1. Given a probability model, the probability of a logical sentence α is the total probability of all worlds in which α is true:

a In logic, a possible world may be called a model or structure; in probability theory, a sample point. To avoid confusion, this paper uses "model" to refer only to probability models.

P(α) = Σω∈T(α) P(ω).    (1)

The conditional probability of one sentence given another is P(α|β) = P(α ∧ β)/P(β), provided P(β) > 0. A random variable is a function from possible worlds to a fixed range of values; for example, one might define the Boolean random variable VA=B to have the value true in the first three worlds of Figure 1a and false in the fourth. The distribution of a random variable is the set of probabilities associated with each of its possible values. For example, suppose the variable Coin has values 0 (heads) and 1 (tails). Then the assertion Coin ∼ Bernoulli(0.6) says that Coin has a Bernoulli distribution with parameter 0.6, that is, a probability 0.6 for the value 1 and 0.4 for the value 0.

Figure 1. (a) Some of the infinitely many possible worlds for a first-order, open-universe language with two constant symbols, A and B, and one binary predicate R(x, y). Gray arrows indicate the interpretations of A and B and black arrows connect pairs of objects satisfying R. (b) The analogous figure under closed-universe semantics; here, there are exactly four possible x, y-pairs and hence 2^4 = 16 worlds. [World diagrams omitted.]

Figure 2. Left: a Bayes net with three Boolean variables, showing the conditional probability of true for each variable given its parents. Right: the joint distribution defined by Equation (2).

    P(B) = 0.003        P(E) = 0.002

    B E | P(A | B,E)        B E A | P(B, E, A)
    t t | 0.9               t t t | 0.0000054
    t f | 0.8               t t f | 0.0000006
    f t | 0.4               t f t | 0.0023952
    f f | 0.01              t f f | 0.0005988
                            f t t | 0.0007976
                            f t f | 0.0011964
                            f f t | 0.00995006
                            f f f | 0.98505594

b The open/closed distinction can also be illustrated with a common-sense example. Suppose a system knows that Father(William) = Bill and Father(Junior) = Bill. How many children does Bill have? Under closed-universe semantics—for example, in a database—he has exactly two; under open-universe semantics, between 1 and ∞.
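Possible worlds, Equation (1), and entailment can all be made concrete in a few lines of Python over the eight worlds of Figure 2 (an illustrative sketch, not part of the article's systems; the encoding of worlds as Boolean triples is my own choice):

```python
from itertools import product

# Joint distribution P(B, E, A) from Figure 2, keyed by
# (Burglary, Earthquake, Alarm).
P = {
    (True,  True,  True):  0.0000054,  (True,  True,  False): 0.0000006,
    (True,  False, True):  0.0023952,  (True,  False, False): 0.0005988,
    (False, True,  True):  0.0007976,  (False, True,  False): 0.0011964,
    (False, False, True):  0.00995006, (False, False, False): 0.98505594,
}
worlds = list(product([False, True], repeat=3))

def T(alpha):
    """T(alpha): the set of worlds in which sentence alpha is true."""
    return {w for w in worlds if alpha(*w)}

def entails(alpha, beta):
    """alpha entails beta iff T(alpha) is a subset of T(beta)."""
    return T(alpha) <= T(beta)

def prob(alpha):
    """Equation (1): P(alpha) = sum of P(w) over worlds in T(alpha)."""
    return sum(P[w] for w in T(alpha))

impl = lambda b, e, a: (not b) or e        # Burglary => Earthquake
conj = lambda b, e, a: b and e             # Burglary and Earthquake

print(entails(conj, impl))                 # True
print(round(prob(impl), 6))                # 0.997006, matching assertion (3)
```

Entailment here is purely set-theoretic, exactly as in the definition above; the probability of a sentence is just the mass of its truth set.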


Unlike logic, probability theory lacks broad agreement on syntactic forms for expressing nontrivial assertions. For communication among statisticians, a combination of English and LaTeX usually suffices; but precise language definitions are needed as the "input format" for general probabilistic reasoning systems and as the "output format" for general learning systems. As noted, Bayesian networks27 provided a (partial) syntax and semantics for the propositional case. The syntax of a Bayes net for random variables X1, . . . , Xn consists of a directed, acyclic graph whose nodes correspond to the random variables, together with associated local conditional distributions.c The semantics specifies a joint distribution over the variables as follows:

P(ω) = P(x1, . . . , xn) = ∏i P(xi | parents(Xi)).    (2)

These definitions have the desirable property that every well-formed Bayes net corresponds to a proper probability model on the associated Cartesian product space. Moreover, a sparse graph—reflecting a sparse causal structure in the underlying domain—leads to a representation that is exponentially smaller than the corresponding complete enumeration.
The Bayes net example in Figure 2 (due to Pearl) shows two independent causes, Earthquake and Burglary, that influence whether or not an Alarm sounds in Professor Pearl's house. According to Equation (2), the joint probability P(Burglary, Earthquake, Alarm) is given by

P(Burglary) P(Earthquake) P(Alarm | Burglary, Earthquake).

The results of this calculation are shown in the figure. Notice the eight possible worlds are the same ones that would exist for a propositional logic theory with the same symbols.
A Bayes net is more than just a specification for a distribution over worlds; it is also a stochastic "machine" for generating worlds. By sampling the variables in topological order (that is, parents before children), one generates a world exactly according to the distribution defined in Equation (2). This generative view will be helpful in extending Bayes nets to the first-order case.

Adding Probabilities to Logic
Early attempts to unify logic and probability attached probabilities directly to logical sentences. The first rigorous treatment, Gaifman's propositional probability logic,9 was augmented with algorithmic analysis by Hailperin12 and Nilsson.23 In such a logic, one can assert, for example,

P(Burglary ⇒ Earthquake) = 0.997006,    (3)

a claim that is implicit in the model of Figure 2. The sentence Burglary ⇒ Earthquake is true in six of the eight possible worlds; so, by Equation (1), assertion (3) is equivalent to

P(t t t) + P(t t f) + P(f t t) + P(f t f) + P(f f t) + P(f f f) = 0.997006.

Because any particular probability model m assigns a probability to every possible world, such a constraint will be either true or false in m; thus, m, a distribution over possible propositional worlds, acts as a single possible world, with respect to which the truth of any probability assertion can be evaluated. Entailment between probability assertions is then defined in exactly the same way as in ordinary logic; thus, assertion (3) entails the assertion

P(Burglary ∧ Earthquake) ≤ 0.997006,

because the latter is true in every probability model in which assertion (3) holds. Satisfiability of sets of such assertions can be determined by linear programming.12 Hence, we have a "probability logic" in the same sense as "temporal logic"—that is, a deductive logical system specialized for reasoning with probabilistic assertions.
To apply probability logic to tasks such as proving the theorems of probability theory, a more expressive language was needed. Gaifman8 proposed a first-order probability logic, with possible worlds being first-order model structures and with probabilities attached to sentences of (function-free) first-order logic.
Within AI, the most direct descendant of these ideas appears in Lukasiewicz's probabilistic logic programs, in which a probability range is attached to each first-order Horn clause and inference is performed by solving linear programs, as suggested by Hailperin. Within the subfield of probabilistic databases one also finds logical sentences labeled with probabilities6—but in this case probabilities are attached directly to the tuples of the database. (In AI and statistics, probability is attached to general relationships whereas observations are viewed as incontrovertible evidence.) Although probabilistic databases can model complex dependencies, in practice one often finds such systems using global independence assumptions across tuples.
Halpern13 and Bacchus3 adopted and extended Gaifman's technical approach, adding probability expressions to the logic. Thus, one can write

∀ h1, h2 Burglary(h1) ∧ ¬Burglary(h2) ⇒ P(Alarm(h1)) > P(Alarm(h2)),

where now Burglary and Alarm are predicates applying to individual houses. The new language is more expressive but does not resolve the difficulty that Gaifman faced—how to define complete and consistent probability models. Each inequality constrains the underlying probability model to lie in a half-space in the high-dimensional space of probability models. Conjoining assertions corresponds to intersecting the constraints. Ensuring that the intersection yields a single point is not easy. In fact, Gaifman's principal result8 is that specifying a single probability model requires (1) a probability for every possible ground sentence and (2) probability constraints for infinitely many existentially quantified sentences.
Researchers have explored two solutions to this problem. The first involves writing a partial theory and then "completing" it by picking out one canonical model in the allowed set. Nilsson23 proposed choosing the maximum entropy model consistent with the specified constraints. Paskin24 developed a "maximum-entropy probabilistic logic" with constraints expressed as weights (relative probabilities) attached to first-order clauses. Such models are often called Markov logic networks or MLNs30 and have become a popular technique for applications involving relational data. There is, however, a semantic difficulty with such models: weights trained in one scenario do not generalize to scenarios with different numbers of objects.

c Texts on Bayes nets typically do not define a syntax for local conditional distributions other than tables, although Bayes net software packages do.
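The factorization in Equation (2) is easy to verify numerically: multiplying each variable's conditional probability given its parents reproduces the joint table of Figure 2 (a minimal Python check; the variable names are my own):

```python
# CPTs for the Figure 2 network: P(B), P(E), and P(A = true | B, E).
P_B = 0.003
P_E = 0.002
P_A = {(True, True): 0.9, (True, False): 0.8,
       (False, True): 0.4, (False, False): 0.01}

def joint(b, e, a):
    """Equation (2): the product of each variable's probability
    given the values of its parents."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return pb * pe * pa

# Reproduces the Figure 2 entries, e.g. P(t, f, t) = 0.0023952.
print(round(joint(True, False, True), 7))  # 0.0023952
```

Sampling the three variables in topological order against these same CPTs generates worlds with exactly these probabilities, which is the "stochastic machine" view described above.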


Moreover, by adding irrelevant objects to a scenario, one can change the model's predictions for a given query to an arbitrary extent.16,20
A second approach, which happens to avoid the problems just noted,d builds on the fact that every well-formed Bayes net necessarily defines a unique probability distribution—a complete theory in the terminology of probability logics—over the variables it contains. The next section describes how this property can be combined with the expressive power of first-order logical notation.

d Briefly, the problems are avoided by separating the generative models for object existence and object properties and relations and by allowing for unobserved objects.

Bayes Nets with Quantifiers
Soon after their introduction, researchers developing Bayes nets for applications came up against the limitations of a propositional language. For example, suppose that in Figure 2 there are many houses in the same general area as Professor Pearl's: each one needs an Alarm variable and a Burglary variable with the same CPTs and connected to the Earthquake variable in the same way. In a propositional language, this repeated structure has to be built manually, one variable at a time. The same problem arises with models for sequential data such as text and time series, which contain sequences of identical submodels, and in models for Bayesian parameter learning, where every instance variable is influenced by the parameter variables in the same way.
At first, researchers simply wrote programs to build the networks, using ordinary loops to handle repeated structure. Figure 3 shows pseudocode to build an alarm network for R geological fault regions, each with H(r) houses.
The pictorial notation of plates was developed to denote repeated structure, and software tools such as BUGS10 and Microsoft's infer.net have facilitated a rapid expansion in applications of probabilistic methods. In all these tools, the model structure is built by a fixed program, so every possible world has the same random variables connected in the same way. Moreover, the code for constructing the models is not viewed as something that could be the output of a learning algorithm.
Breese4 proposed a more declarative approach reminiscent of Horn clauses. Other declarative languages included Poole's Independent Choice Logic, Sato's PRISM, Koller and Pfeffer's probabilistic relational models, and de Raedt's Bayesian Logic Programs. In all these cases, the head of each clause or dependency statement corresponds to a parameterized set of child random variables, with the parent variables being the corresponding ground instances of the literals in the body of the clause. For example, Equation (4) shows the dependency statements equivalent to the code fragment in Figure 3:

Burglary(h) ∼ Bernoulli(0.003)
Earthquake(r) ∼ Bernoulli(0.002)
Alarm(h) ∼ CPT[. . .](Earthquake(FaultRegion(h)), Burglary(h))    (4)

where CPT denotes a suitable conditional probability table indexed by the corresponding arguments. Here, h and r are logical variables ranging over houses and regions; they are implicitly universally quantified. FaultRegion is a function symbol connecting a house to its geological region. Together with a relational skeleton that enumerates the objects of each type and specifies the values of each function and relation, a set of dependency statements such as Equation (4) corresponds to an ordinary, albeit potentially very large, Bayes net. For example, if there are two houses in region A and three in region B, the corresponding Bayes net is the one in Figure 4.

Open-Universe Models
As noted earlier, closed-universe languages disallow uncertainty about what things are in the world; the existence and identity of all objects must be known in advance.
In contrast, an open-universe probability model (OUPM) defines a probability distribution over possible worlds that vary in the objects they contain and in the mapping from symbols to objects. Thus, OUPMs can handle data from sources (text, video, radar, intelligence reports, among others) that violate the closed-universe assumption. Given evidence, OUPMs learn about the objects the world contains.
Looking at Figure 1a, the first problem is how to ensure that the model specifies a


proper distribution over a heterogeneous, unbounded set of possible worlds. The key is to extend the generative view of Bayes nets from the propositional to the first-order, open-universe case:
– Bayes nets generate propositional worlds one event at a time; each event fixes the value of a variable.
– First-order, closed-universe models such as Equation (4) define generation steps for entire classes of events.
– First-order, open-universe models include generative steps that add objects to the world rather than just fixing their properties and relations.
Consider, for example, an open-universe version of the alarm model in Equation (4). If one suspects the existence of up to three geological fault regions, with equal probability, this can be expressed as a number statement:

#Region ∼ UniformInt(1, 3).    (5)

For the sake of illustration, let us assume that the number of houses in a region r is drawn uniformly between 0 and 4:

#House(FaultRegion = r) ∼ UniformInt(0, 4).    (6)

Here, FaultRegion is called an origin function, since it connects a house to the region from which it originates.
Together, the dependency statements (4) and the two number statements (5 and 6), along with the necessary type signatures, specify a complete distribution over all the possible worlds definable with this vocabulary. There are infinitely many such worlds, but, because the number statements are bounded, only finitely many—317,680,374 to be precise—have non-zero probability. Figure 5 shows an example of a particular world constructed from this model.
The BLOG language21 provides a precise syntax, semantics, and inference capability for open-universe probability models composed of dependency and number statements. BLOG models can be arbitrarily complex, but they inherit the key declarative property of Bayes nets: every well-formed BLOG model specifies a well-defined probability distribution over possible worlds.
To make such a claim precise, one must define exactly what these worlds are and how the model assigns a probability to each. The definitions (given in full in Brian Milch's PhD thesis20) begin with the objects each world contains. In the standard semantics of typed first-order logic, objects are just numbered tokens with types. In BLOG, each object also has an origin, indicating how it was generated. (The reason for this slightly baroque construction will become clear shortly.) For number statements with no origin functions—for example, Equation (5)—the objects have an empty origin; for example, 〈Region, , 2〉 refers to the second region generated from that statement. For number statements with origin functions—for example, Equation (6)—each object records its origin; for example, 〈House, 〈FaultRegion, 〈Region, , 2〉〉, 3〉 is the third house in the second region.
The number variables of a BLOG model specify how many objects there are of each type with each possible origin; thus #House〈FaultRegion,〈Region, ,2〉〉(ω) = 4 means that in world ω there are 4 houses in region 2. The basic variables determine the values of predicates and functions for all tuples of objects; thus, Earthquake〈Region, ,2〉(ω) = true means that in world ω there is an earthquake in region 2. A possible world is defined by the values of all the number variables and basic variables. A world may be generated from the model by sampling in topological order; for an example, see Figure 5.

Figure 3. Illustrative pseudocode for building a version of the Bayes net in Figure 2 that handles R different regions with H(r) houses in each.

loop for r from 1 to R do
  add node Earthquake_r with no parents, prior 0.002
  loop for h from 1 to H(r) do
    add node Burglary_{r,h} with no parents, prior 0.003
    add node Alarm_{r,h} with parents Earthquake_r, Burglary_{r,h}

Figure 4. The Bayes net corresponding to Equation (4), given two houses in region A and three in B. [Network diagram omitted: nodes Burglary_{A,1}, Burglary_{A,2}, Earthquake_A, Alarm_{A,1}, Alarm_{A,2}, Burglary_{B,1}, Burglary_{B,2}, Burglary_{B,3}, Earthquake_B, Alarm_{B,1}, Alarm_{B,2}, Alarm_{B,3}.]

Figure 5. Construction of one possible world by sampling from the burglary/earthquake model. Each row shows the variable being sampled, the value it receives, and the probability of that value conditioned on the preceding assignments.

Variable                                                Value   Prob.
#Region                                                 2       0.3333
Earthquake〈Region, ,1〉                                  false   0.998
Earthquake〈Region, ,2〉                                  false   0.998
#House〈FaultRegion,〈Region, ,1〉〉                        1       0.2
#House〈FaultRegion,〈Region, ,2〉〉                        1       0.2
Burglary〈House,〈FaultRegion,〈Region, ,1〉〉,1〉            false   0.997
Burglary〈House,〈FaultRegion,〈Region, ,2〉〉,1〉            true    0.003
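The figure of 317,680,374 worlds with non-zero probability can be checked by direct counting: a world chooses 1 to 3 regions, each region independently chooses an Earthquake value (2 ways) and 0 to 4 houses, and each house chooses Burglary and Alarm values (2 × 2 ways). A short Python check (my own arithmetic, following statements (4)-(6)):

```python
# Worlds per region: 2 Earthquake values, times the sum over house
# counts h = 0..4 of 4^h assignments (Burglary x Alarm per house).
worlds_per_region = 2 * sum(4 ** h for h in range(5))     # 2 * 341 = 682

# Total over the possible numbers of regions, 1 to 3.
total = sum(worlds_per_region ** n for n in range(1, 4))
print(total)  # 317680374
```

This matches the count stated in the text, confirming that only the alarm-model worlds consistent with the bounded number statements carry probability mass.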


The probability of a world so constructed is the product of the probabilities for all the sampled values; in this case, 0.00003972063952. Now it becomes clear why each object contains its origin: this property ensures that every world can be constructed by exactly one sampling sequence. If this were not the case, the probability of a world would be the sum over all possible sampling sequences that create it.
Open-universe models may have infinitely many random variables, so the full theory involves nontrivial measure-theoretic considerations. For example, the number statement #Region ∼ Poisson(μ) assigns probability e^(−μ) μ^k/k! to each nonnegative integer k. Moreover, the language allows recursion and infinite types (integers, strings, and so on). Finally, well-formedness disallows cyclic dependencies and infinitely receding ancestor chains; these conditions are undecidable in general, but certain syntactic sufficient conditions can be checked easily.

Examples
The standard "use case" for BLOG has three elements: the model, the evidence (the known facts in a given scenario), and the query, which may be any expression, possibly with free logical variables. The answer is a posterior joint probability for each possible set of substitutions for the free variables, given the evidence, according to the model.e Every model includes type declarations, type signatures for the predicates and functions, one or more number statements for each type, and one dependency statement for each predicate and function. (In the examples here, declarations and signatures are omitted where the meaning is clear.) Dependency statements use an if-then-else syntax to handle so-called context-specific dependencies, whereby one variable may or may not be a parent of another, depending on the value of a third variable.
Citation matching. Systems such as CiteSeer and Google Scholar extract a database-like representation, relating papers and researchers by authorship and citation links, from raw ASCII citation strings. These strings contain no object identifiers and include errors of syntax, spelling, punctuation, and content, which lead in turn to errors in the extracted databases. For example, in 2002, CiteSeer reported over 120 distinct books written by Russell and Norvig.
A generative model for this domain (Figure 6) connects an underlying, unobserved world to the observed strings: there are researchers, who have names; researchers write papers, which have titles; people cite the papers, combining the authors' names and the paper's title (with errors) into the text of the citation according to some grammar. Given citation strings as evidence, a multi-author version of this model, trained in an unsupervised fashion, had an error rate two to three times lower than CiteSeer's on four standard test sets.25 The inference process in such a vertically integrated model also exhibits a form of collective, knowledge-driven disambiguation: the more citations there are for a given paper, the more accurately each of them is parsed, because the parses have to agree on the facts about the paper.
Multitarget tracking. Given a set of unlabeled data points generated by some unknown, time-varying set of objects, the goal is to detect and track the underlying objects. In radar systems, for example, each rotation of the radar dish produces a set of blips. New objects may appear, existing objects may disappear, and false alarms and detection failures are possible. The standard model (Figure 7) assumes independent, linear–Gaussian dynamics and measurements. Exact inference is provably intractable, but MCMC typically works well in practice. Perhaps more importantly, elaborations of the scenario (formation flying, objects heading for unknown destinations, objects taking off or landing) can be handled by small changes to the model without resorting to new mathematical derivations and complex programming.

e As with Prolog, there may be infinitely many sets of substitutions of unbounded size; designing exploratory interfaces for such answers is an interesting HCI challenge.

Figure 6. BLOG model for citation information extraction. For simplicity the model assumes one author per paper and omits details of the grammar and error models. OM(a, b) is a discrete log-normal, base 10, that is, the order of magnitude is 10^(a±b).

type Researcher, Paper, Citation;
random String Name(Researcher);
random String Title(Paper);
random Paper PubCited(Citation);
random String Text(Citation);
random Boolean Prof(Researcher);
origin Researcher Author(Paper);
#Researcher ∼ OM(3,1);
Name(r) ∼ CensusDB_NamePrior();
Prof(r) ∼ Boolean(0.2);
#Paper(Author=r)
  if Prof(r) then ∼ OM(1.5,0.5) else ∼ OM(1,0.5);
Title(p) ∼ CSPaperDB_TitlePrior();
PubCited(c) ∼ UniformChoice(Paper p);
Text(c) ∼ HMMGrammar(Name(Author(PubCited(c))), Title(PubCited(c)));

Figure 7. BLOG model for radar tracking of multiple targets. X(a, t) is the state of aircraft a at time t, while Z(b) is the observed position of blip b.

#Aircraft(EntryTime = t) ∼ Poisson(λa);
Exits(a,t) if InFlight(a,t) then ∼ Boolean(αe);
InFlight(a,t) =
  (t == EntryTime(a)) | (InFlight(a,t-1) & !Exits(a,t-1));
X(a,t) if t = EntryTime(a) then ∼ InitState()
  else if InFlight(a,t) then ∼ Normal(F*X(a,t-1),Σx);
#Blip(Source=a, Time=t)
  if InFlight(a,t) then ∼ Bernoulli(DetectionProb(X(a,t)));
#Blip(Time=t) ∼ Poisson(λf);
Z(b) if Source(b)=null then ∼ UniformInRegion(R);
  else ∼ Normal(H*X(Source(b),Time(b)),Σb);
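The generative process of Figure 7 can be sketched as a forward simulation (an illustrative Python sketch with made-up parameter values standing in for λa, αe, DetectionProb, and λf, a 1D state instead of full state vectors, and simple Gaussian noise standing in for F, H, and the Σ matrices):

```python
import math
import random

def poisson(rng, lam):
    """Sample from Poisson(lam) by Knuth's multiplication method
    (adequate for the small rates used here)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate(steps, rng, lam_new=0.3, p_exit=0.1, p_detect=0.9, lam_false=0.5):
    """Forward-sample aircraft states and radar blips, as in Figure 7."""
    aircraft = {}      # aircraft id -> 1D position (state)
    next_id = 0
    blips = []         # one list of observed positions per timestep
    for t in range(steps):
        # Existing aircraft may exit; survivors drift (linear-Gaussian dynamics).
        for a in list(aircraft):
            if rng.random() < p_exit:
                del aircraft[a]
            else:
                aircraft[a] += rng.gauss(0.0, 1.0)
        # New aircraft enter according to a Poisson distribution.
        for _ in range(poisson(rng, lam_new)):
            aircraft[next_id] = rng.gauss(0.0, 10.0)
            next_id += 1
        # Each in-flight aircraft is detected with probability p_detect
        # (noisy measurement); false alarms arrive at rate lam_false,
        # uniformly over the surveillance region.
        obs = [x + rng.gauss(0.0, 0.5) for x in aircraft.values()
               if rng.random() < p_detect]
        obs += [rng.uniform(-50.0, 50.0) for _ in range(poisson(rng, lam_false))]
        blips.append(obs)
    return blips

blips = simulate(20, random.Random(42))
```

The tracking problem is the inverse of this simulation: given only `blips`, infer how many aircraft existed and which blips each one generated.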


Nuclear treaty monitoring. Verifying the Comprehensive Nuclear-Test-Ban Treaty requires finding all seismic events on Earth above a minimum magnitude. The UN CTBTO maintains a network of sensors, the International Monitoring System (IMS); its automated processing software, based on 100 years of seismology research, has a detection failure rate of about 30%. The NET-VISA system,2 based on an OUPM, significantly reduces detection failures.

The NET-VISA model (Figure 8) expresses the relevant geophysics directly. It describes distributions over the number of events in a given time interval (most of which are naturally occurring) as well as over their time, magnitude, depth, and location. The locations of natural events are distributed according to a spatial prior trained (like other parts of the model) from historical data; man-made events are, by the treaty rules, assumed to occur uniformly. At every station s, each phase (seismic wave type) p from an event e produces either 0 or 1 detections (above-threshold signals); the detection probability depends on the event magnitude and depth and its distance from the station. ("False alarm" detections also occur according to a station-specific rate parameter.) The measured arrival time, amplitude, and other properties of a detection d depend on the properties of the originating event and its distance from the station.

Figure 8. The NET-VISA model.

#SeismicEvents ∼ Poisson(T∗λe);
Natural(e) ∼ Boolean(0.999);
Time(e) ∼ UniformReal(0,T);
Magnitude(e) ∼ Exponential(log(10));
Depth(e) if Natural(e) then ∼ UniformReal(0,700) else = 0;
Location(e) if Natural(e) then ∼ SpatialPrior()
            else ∼ UniformEarthDistribution();
#Detections(event=e, phase=p, station=s)
    if IsDetected(e,p,s) then = 1 else = 0;
IsDetected(e,p,s) ∼
    Logistic(weights(s,p), Magnitude(e), Depth(e), Distance(e,s));
#Detections(site = s) ∼ Poisson(T∗λf(s));
OnsetTime(d,s)
    if (event(d) = null) then ∼ UniformReal(0,T)
    else = Time(event(d)) + Laplace(µt(s),σt(s)) +
           GeoTT(Distance(event(d),s),Depth(event(d)),phase(d));
Amplitude(d,s)
    if (event(d) = null) then ∼ NoiseAmplitudeDistribution(s)
    else ∼ AmplitudeModel(Magnitude(event(d)),
           Distance(event(d),s),Depth(event(d)),phase(d));
... and similar clauses for azimuth, slowness, and phase type.

Once trained, the model runs continuously. The evidence consists of detections (90% of which are false alarms) extracted from raw IMS waveform data and the query typically asks for the most likely event history, or bulletin, given the data. For reasons previously explained, NET-VISA uses a special-purpose inference algorithm. Results so far are encouraging: for example, in 2009 the UN's SEL3 automated bulletin missed 27.4% of the 27,294 events in the magnitude range 3–4 while NET-VISA missed 11.1%. Moreover, comparisons with dense regional networks show that NET-VISA finds up to 50% more real events than the final bulletins produced by the UN's expert seismic analysts. NET-VISA also tends to associate more detections with a given event, leading to more accurate location estimates (see Figure 9). The CTBTO has announced its intention to deploy NET-VISA as soon as possible.

Figure 9. Location estimates for the DPRK nuclear test of February 12, 2013: UN CTBTO Late Event Bulletin (green triangle); NET-VISA (blue square). The tunnel entrance (black cross) is 0.75 km from NET-VISA's estimate. Contours show NET-VISA's posterior location distribution.

Remarks on the examples. Despite superficial differences, the three examples are structurally similar: there are unknown objects (papers, aircraft, earthquakes) that generate percepts according to some physical process (citation, radar detection, seismic propagation). The same structure and reasoning patterns hold for areas such as database deduplication and natural language understanding. In some cases, inferring an object's existence involves grouping percepts together—a process that resembles the clustering task in machine learning. In other cases, an object may generate no percepts at all and still have its existence inferred—as happened, for example, when observations of Uranus led to the discovery of Neptune. Allowing object trajectories to be nonindependent in Figure 7 enables such inferences to take place.

Inference

By Equation (1), the probability P(α|e) for a closed query sentence α given evidence e is proportional to the sum of probabilities for all worlds in which α and e are satisfied, with the probability for each world being a product of model parameters as explained earlier. Many algorithms exist to calculate or approximate this sum of products for Bayes nets. Hence, it is natural to consider grounding a first-order probability model by instantiating the


logical variables with "all possible" ground terms in order to generate a Bayes net, as illustrated in Figure 4; then, existing inference algorithms can be applied.

Because of the large size of these ground networks, exact inference is usually infeasible. The most general method for approximate inference is Markov chain Monte Carlo. MCMC methods execute a random walk among possible worlds, guided by the relative probabilities of the worlds visited and aggregating query results from each world. MCMC algorithms vary in the choice of neighborhood structure for the random walk and in the proposal distribution from which the next state is sampled; subject to an ergodicity condition on these choices, samples will converge to the true posterior in the limit. MCMC scales well with network size, but its mixing time (the time needed before samples reflect the true posterior) is sensitive to the quantitative structure of the conditional distributions; indeed, hard constraints may prevent convergence.

BUGS10 and MLNs30 apply MCMC to a preconstructed ground network, which requires a bound on the size of possible worlds and enforces a propositional "bit-vector" representation for them. An alternative approach is to apply MCMC directly on the space of first-order possible worlds26,31; this provides much more freedom to use, for example, a sparse graph or even a relational database19 to represent each world. Moreover, in this view it is easy to see that MCMC moves can not only alter relations and functions but also add or subtract objects and change the interpretations of constant symbols; thus, MCMC can move among the open-universe worlds shown in Figure 1a.

As noted, a typical BLOG model has infinitely many possible worlds, each of potentially infinite size. As an example, consider the multitarget tracking model in Figure 7: the function X(a, t), denoting the state of aircraft a at time t, corresponds to an infinite sequence of variables for an unbounded number of aircraft at each step. Nonetheless, BLOG's MCMC inference algorithm converges to the correct answer, under the usual ergodicity conditions, for any well-formed BLOG model.20

The algorithm achieves this by sampling not completely specified possible worlds but partial worlds, each corresponding to a disjoint set of complete worlds. A partial world is a minimal self-supporting instantiation^f of a subset of the relevant variables, that is, ancestors of the evidence and query variables. For example, variables X(a, t) for values of t greater than the last observation time (or the query time, whichever is greater) are irrelevant so the algorithm can consider just a finite prefix of the infinite sequence. Moreover, the algorithm eliminates isomorphisms under object renumberings by computing the required combinatorial ratios for MCMC transitions between partial worlds of different sizes.22

MCMC algorithms for first-order languages also benefit from locality of computation27: the probability ratio between neighboring worlds depends on a subgraph of constant size around the variables whose values are changed. Moreover, a logical query can be evaluated incrementally in each world visited, usually in constant time per world, rather than recomputing it from scratch.19,31

Despite these optimizations, generic inference for BLOG and other first-order languages remains too slow. Most real-world applications require a special-purpose proposal distribution to reduce the mixing time. A number of avenues are being pursued to resolve this issue:

– Compiler techniques can generate inference code that is specific to the model, query, and/or evidence. Microsoft's infer.net uses such methods to handle millions of variables. Experiments with BLOG show speedups of over 100×.18
– Special-purpose probabilistic hardware—such as Analog Devices' GP5 chip—offers further constant-factor speedups.
– Special-purpose samplers jointly sample groups of variables that are tightly constrained. A library of such samplers may render most user programs efficiently solvable.
– Static analysis can transform programs for efficiency5,15 and identify exactly solvable submodels.7

Finally, the rapidly advancing techniques of lifted inference29 aim to unify probability and logic at the inferential level, borrowing and generalizing ideas from logical theorem proving. Such methods, surveyed by Van den Broeck,32 may avoid grounding and take advantage of symmetries by manipulating symbolic distributions over large sets of objects.

^f An instantiation of a set of variables is self-supporting if the parents of every variable in the set are also in the set.

Figure 10. A CHURCH program that expresses the burglary/earthquake model in Equations (4)−(6).

(define num-regions (mem (lambda () (uniform 1 3))))
(define-record-type region (fields index))
(define regions (map (lambda (i) (make-region i))
                     (iota (num-regions))))
(define num-houses (mem (lambda (r) (uniform 0 4))))
(define-record-type house (fields fault-region index))
(define houses
  (map (lambda (r)
         (map (lambda (i) (make-house r i))
              (iota (num-houses r))))
       regions))
(define earthquake (mem (lambda (r) (flip 0.002))))
(define burglary (mem (lambda (h) (flip 0.003))))
(define alarm
  (mem (lambda (h)
         (if (burglary h)
             (if (earthquake (house-fault-region h))
                 (flip 0.9) (flip 0.8))
             (if (earthquake (house-fault-region h))
                 (flip 0.4) (flip 0.01))))))

Learning

Generative languages such as BLOG and BUGS naturally support Bayesian parameter learning with no modification: parameters are defined as random variables with priors and ordinary inference yields posterior parameter distributions given the evidence. For example, in Figure 8 we could add a prior λe ∼ Gamma(3, 0.1) for the seismic rate parameter λe instead of fixing its value in advance; learning proceeds as data arrive, even in the unsupervised case where no ground-truth events are supplied. A "trained




model" with fixed parameter values can be obtained by, for example, choosing maximum a posteriori values. In this way many standard machine learning methods can be implemented using just a few lines of modeling code. Maximum-likelihood parameter estimation could be added by stating that certain parameters are learnable, omitting their priors, and interleaving maximization steps with MCMC inference steps to obtain a stochastic online EM algorithm.

Structure learning—generating new dependencies and new predicates and functions—is more difficult. The standard idea of trading off degree of fit and model complexity can be applied, and some model search methods from the field of inductive logic programming can be generalized to the probabilistic case, but as yet little is known about how to make such methods computationally feasible.

Probability and Programs

Probabilistic programming languages or PPLs17 represent a distinct but closely related approach for defining expressive probability models. The basic idea is that a randomized algorithm, written in an ordinary programming language, can be viewed not as a program to be executed, but as a probability model: a distribution over the possible execution traces of the program, given inputs. Evidence is asserted by fixing any aspect of the trace and a query is any predicate on traces. For example, one can write a simple Java program that rolls three six-sided dice; fix the sum to be 13; and ask for the probability that the second die is even. The answer is a sum over probabilities of complete traces; the probability of a trace is the product of the probabilities of the random choices made in that trace.

The first significant PPL was Pfeffer's IBAL,28 a functional language equipped with an effective inference engine. Church,11 a PPL built on Scheme, generated interest in the cognitive science community as a way to model complex forms of learning; it also led to interesting connections to computability theory1 and programming language research. Figure 10 shows the burglary/earthquake example in Church; notice that the PPL code builds a possible-world data structure explicitly. Inference in Church uses MCMC, where each move resamples one of the stochastic primitives involved in producing the current trace.

Execution traces of a randomized program may vary in the new objects they generate; thus, PPLs have an open-universe flavor. One can view BLOG as a declarative, relational PPL, but there is a significant semantic difference: in BLOG, in any given possible world, every ground term has a single value; thus, expressions such as f(1) = f(1) are true by definition. In a PPL, on the other hand, f(1) = f(1) may be false if f is a stochastic function, because each instance of f(1) corresponds to a distinct piece of the execution trace. Memoizing every stochastic function (via mem in Figure 10) restores the standard semantics.

Prospects

These are early days in the process of unifying logic and probability. Experience in developing models for a wide range of applications will uncover new modeling idioms and lead to new kinds of programming constructs. And of course, inference and learning remain the major bottlenecks.

Historically, AI has suffered from insularity and fragmentation. Until the 1990s, it remained isolated from fields such as statistics and operations research, while its subfields—especially vision and robotics—went their own separate ways. The primary cause was mathematical incompatibility: what could a statistician of the 1960s, well-versed in linear regression and mixtures of Gaussians, offer an AI researcher building a robot to do the grocery shopping? Bayes nets have begun to reconnect AI to statistics, vision, and language research; first-order probabilistic languages, which have both Bayes nets and first-order logic as special cases, will extend and broaden this process.

Acknowledgments

My former students Brian Milch, Hanna Pasula, Nimar Arora, Erik Sudderth, Bhaskara Marthi, David Sontag, Daniel Ong, and Andrey Kolobov contributed to this research, as did NSF, DARPA, the Chaire Blaise Pascal, and ANR.

References
1. Ackerman, N., Freer, C., Roy, D. On the computability of conditional probability. arXiv 1005.3014, 2013.
2. Arora, N.S., Russell, S., Sudderth, E. NET-VISA: Network processing vertically integrated seismic analysis. Bull. Seism. Soc. Am. 103 (2013).
3. Bacchus, F. Representing and Reasoning with Probabilistic Knowledge. MIT Press, 1990.
4. Breese, J.S. Construction of belief and decision networks. Comput. Intell. 8 (1992), 624–647.
5. Claret, G., Rajamani, S.K., Nori, A.V., Gordon, A.D., Borgström, J. Bayesian inference using data flow analysis. In FSE-13 (2013).
6. Dalvi, N.N., Ré, C., Suciu, D. Probabilistic databases. CACM 52, 7 (2009), 86–94.
7. Fischer, B., Schumann, J. AutoBayes: A system for generating data analysis programs from statistical models. J. Funct. Program 13 (2003).
8. Gaifman, H. Concerning measures in first order calculi. Israel J. Math. 2 (1964), 1–18.
9. Gaifman, H. Concerning measures on Boolean algebras. Pacific J. Math. 14 (1964), 61–73.
10. Gilks, W.R., Thomas, A., Spiegelhalter, D.J. A language and program for complex Bayesian modelling. The Statistician 43 (1994), 169–178.
11. Goodman, N.D., Mansinghka, V.K., Roy, D., Bonawitz, K., Tenenbaum, J.B. Church: A language for generative models. In UAI-08 (2008).
12. Hailperin, T. Probability logic. Notre Dame J. Formal Logic 25, 3 (1984), 198–212.
13. Halpern, J.Y. An analysis of first-order logics of probability. AIJ 46, 3 (1990), 311–350.
14. Howson, C. Probability and logic. J. Appl. Logic 1, 3–4 (2003), 151–165.
15. Hur, C.-K., Nori, A.V., Rajamani, S.K., Samuel, S. Slicing probabilistic programs. In PLDI-14 (2014).
16. Jain, D., Kirchlechner, B., Beetz, M. Extending Markov logic to model probability distributions in relational domains. In KI-07 (2007).
17. Koller, D., McAllester, D.A., Pfeffer, A. Effective Bayesian inference for stochastic programs. In AAAI-97 (1997).
18. Li, L., Wu, Y., Russell, S. SWIFT: Compiled inference for probabilistic programs. Tech. Report EECS-2015-12, UC Berkeley, 2015.
19. McCallum, A., Schultz, K., Singh, S. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In NIPS 22 (2010).
20. Milch, B. Probabilistic models with unknown objects. PhD thesis, UC Berkeley, 2006.
21. Milch, B., Marthi, B., Sontag, D., Russell, S.J., Ong, D., Kolobov, A. BLOG: Probabilistic models with unknown objects. In IJCAI-05 (2005).
22. Milch, B., Russell, S.J. General-purpose MCMC inference over relational structures. In UAI-06 (2006).
23. Nilsson, N.J. Probabilistic logic. AIJ 28 (1986), 71–87.
24. Paskin, M. Maximum entropy probabilistic logic. Tech. Report UCB/CSD-01-1161, UC Berkeley, 2002.
25. Pasula, H., Marthi, B., Milch, B., Russell, S.J., Shpitser, I. Identity uncertainty and citation matching. In NIPS 15 (2003).
26. Pasula, H., Russell, S.J. Approximate inference for first-order probabilistic languages. In IJCAI-01 (2001).
27. Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
28. Pfeffer, A. IBAL: A probabilistic rational programming language. In IJCAI-01 (2001).
29. Poole, D. First-order probabilistic inference. In IJCAI-03 (2003).
30. Richardson, M., Domingos, P. Markov logic networks. Machine Learning 62, 1–2 (2006), 107–136.
31. Russell, S.J. Expressive probability models in science. In Discovery Science (Tokyo, 1999).
32. Van den Broeck, G. Lifted Inference and Learning in Statistical Relational Models. PhD thesis, Katholieke Universiteit Leuven, 2013.

Stuart Russell (russell@cs.berkeley.edu) is a professor of computer science and Smith-Zadeh Professor of Engineering at the University of California, Berkeley.

Copyright held by author. Publication rights licensed to ACM. $15.00.

Watch the author discuss their work in this exclusive Communications video: http://cacm.acm.org/videos/unifying-logic-and-probability

research highlights

P. 100  Technical Perspective: The Simplicity of Cache Efficient Functional Algorithms
        By William D. Clinger

P. 101  Cache Efficient Functional Algorithms
        By Guy E. Blelloch and Robert Harper

DOI:10.1145/2776827

Technical Perspective
The Simplicity of Cache Efficient Functional Algorithms
By William D. Clinger

To view the accompanying paper, visit doi.acm.org/10.1145/2776825

SCIENTIFIC MODELS ARE simplified descriptions of messy reality.

Depending on our purposes, some models may be better than others. Good models are abstract enough to be tractable, yet accurate enough to draw conclusions matching the aspects of reality we hope to understand.

Even good models have limits, so we must take care to use them only within their domain of applicability. Galileo's law for adding velocities is simple and accurate enough for most purposes, but a more complicated law is needed when adding relativistic velocities. Einstein's special theory of relativity tells us how to reason about space and time in the absence of acceleration or gravity, but we must turn to Einstein's general theory when gravitational effects are significant. That general theory has passed all experimental tests so far, but we know its equations break down at black holes and other singularities.

In general, scientific models seek compromise between tractability and accuracy. That compromise is evident within the cost models we use to analyze algorithms. In the following paper, Blelloch and Harper propose and demonstrate a more effective model for analyzing and describing the efficiency of functional algorithms.

Fifty years ago, modeling a computer system's main storage as random-access memory (RAM) was both simple and accurate. That RAM cost model remains simple, but it is no longer accurate.

Cache-aware estimates of real-world performance can be more accurate but are also more complicated. Even the ideal cache model, which assumes an unachievable perfect replacement policy, must account for both spatial and temporal locality.

Sometimes, however, we can improve both tractability and accuracy, as when we use asymptotic notation to describe costs. It is easier to come up with an asymptotic estimate for the number of steps in a computation than it is to count the exact number of steps. Counting the exact number of steps is more precise, but exact counts will be inaccurate models of the real world because compiler optimizations and hardware peculiarities break the estimate. Asymptotic estimates are more robustly accurate because they are less precise: They express costs at a higher (and more useful) level of abstraction.

Cache-oblivious algorithms provide another example of using abstraction to combine accurate cost estimates with tractability. The cache size M and cache line (block) size B are abstracted parameters that may show up in the asymptotic complexity of a cache-oblivious algorithm, but do not appear within the algorithm itself.

To prove that an algorithm is cache-oblivious, we must consider its spatial and temporal locality. Heretofore, most of those proofs have dealt with imperative algorithms over arrays. To extend the technique to functional algorithms, we need to make assumptions about the locality of object allocation and garbage collection.

Blelloch and Harper suggest we analyze the costs of functional algorithms by assuming objects are allocated sequentially in cache memory, with each new object adjacent to the previously allocated object, and garbage collection preserves this correspondence between allocation order and memory order. Their key abstraction defines a compact data structure as one for which there exists a fixed constant k, independent of the size of the structure, such that the data structure can be traversed with at most k cache misses per object. Using that abstraction and those two assumptions, the authors show how efficient cache-oblivious functional algorithms over lists and trees can be expressed and analyzed as easily as imperative algorithms over arrays.

Not all storage allocators and garbage collectors satisfy the critical assumptions of this paper, but some do. In time, as cache-oblivious functional algorithms become more common, we can expect even more implementations to satisfy those assumptions (or to come close enough).

After all, we computer scientists have a big advantage over the physicists: We can improve both the simplicity and the accuracy of our models by improving the reality we are modeling.

William D. Clinger (will@ccs.neu.edu) is an associate professor in the College of Computer and Information Science at Northeastern University, Boston, MA.

Copyright held by author.
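The "at most k cache misses per object" idea behind compact data structures can be seen in a toy count of block loads. The sketch below is mine, not from the paper: it assumes one-word nodes and counts only first-touch (cold) block loads, ignoring cache capacity and replacement.

```python
# Count cold (first-touch) block loads when visiting a list of nodes at
# the given word addresses, where a block holds B consecutive words.
def cold_misses(addresses, B):
    return len({addr // B for addr in addresses})

n, B = 1024, 16
sequential = list(range(n))            # allocation-order layout
scattered = [i * B for i in range(n)]  # every node in its own block
```

Traversing the sequentially allocated nodes touches n/B = 64 blocks (amortized 1/B misses per object), while the scattered layout touches a block per node, 1024 in all. This is the gap between the allocation-order assumption the perspective describes and an arbitrary layout.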



DOI:10.1145/2776825

Cache Efficient Functional Algorithms
By Guy E. Blelloch and Robert Harper

Abstract

The widely studied I/O and ideal-cache models were developed to account for the large difference in costs to access memory at different levels of the memory hierarchy. Both models are based on a two level memory hierarchy with a fixed size fast memory (cache) of size M, and an unbounded slow memory organized in blocks of size B. The cost measure is based purely on the number of block transfers between the primary and secondary memory. All other operations are free. Many algorithms have been analyzed in these models and indeed these models predict the relative performance of algorithms much more accurately than the standard Random Access Machine (RAM) model. The models, however, require specifying algorithms at a very low level, requiring the user to carefully lay out their data in arrays in memory and manage their own memory allocation.

We present a cost model for analyzing the memory efficiency of algorithms expressed in a simple functional language. We show how some algorithms written in standard forms using just lists and trees (no arrays) and requiring no explicit memory layout or memory management are efficient in the model. We then describe an implementation of the language and show provable bounds for mapping the cost in our model to the cost in the ideal-cache model. These bounds imply that purely functional programs based on lists and trees with no special attention to any details of memory layout can be asymptotically as efficient as the carefully designed imperative I/O efficient algorithms. For example we describe an O((n/B) log_{M/B}(n/B)) cost sorting algorithm, which is optimal in the ideal cache and I/O models.

1. INTRODUCTION

Today's computers exhibit a vast difference in cost for accessing different levels of the memory hierarchy, whether it be registers, one of many levels of cache, the main memory, or a disk. On current processors, for example, there is over a factor of a hundred between the time to access a register and main memory, and another factor of a hundred or so between main memory and disk, even a solid state drive (SSD). This variance in costs is contrary to the standard Random Access Machine (RAM) model, which assumes that the cost of accessing memory is uniform. To account for non-uniformity, several cost models have been developed that assign different costs to different levels of the memory hierarchy.

Figure 1 describes the widely used I/O2 and ideal-cache11 machine models designed for this purpose. The models are based on two parameters, the memory size M and the block size B, which are considered variables for the sake of analysis and therefore show up in asymptotic bounds. To design efficient algorithms for these models, it is important to consider both temporal locality, so as to take advantage of the limited size fast memory, and spatial locality, because memory is transferred in contiguous blocks. In this paper, we will use terminology from the ideal-cache machine model (ICMM), including referring to the fast memory as cache, the slow memory as main memory, block transfers as cache misses, and the cost as the cache cost.

Algorithms that do well in these models are often referred to as cache or I/O efficient. The theory of cache efficient algorithms is now well developed (see, for example, the surveys3,6,12,17,19,23). These models do indeed express more accurately the cost of algorithms on real machines than does the standard RAM model. For example, the models properly indicate that a blocked or hierarchical matrix–matrix multiply has cost

    Θ(n³/(B√M))    (1)

This is much more efficient than the naïve triply nested loop, which has cost Θ(n³/B). Furthermore, although the models only consider two levels of a memory hierarchy, cache efficient algorithms in the ICMM that do not use the parameters B or M in the code are simultaneously efficient across all levels of the memory hierarchy. Algorithms designed in this way are called cache oblivious.11

As an example of an algorithm analyzed in the models consider recursive mergeSort (see Figure 2). If the inputs of size n and output are in arrays, then the merge can take advantage of spatial locality and has cache cost O(n/B) (see Figure 3). The split can be done within the same, or better bounds. For the recursive mergeSort once the recursion reaches a level where the computation fits into the cache, all subcalls are free, taking advantage of temporal locality. Figure 4 illustrates the merge sort recursion tree and analyzes the total cost, which for an input of size n is:

    O((n/B) log2(n/M))    (2)

The original version of this paper is entitled "Cache and I/O Efficient Functional Algorithms" and was published in the Proceedings of the 40th Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL '13), 39–50.
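The level-by-level analysis pictured in Figure 4 can be checked mechanically. The sketch below is not from the paper: it evaluates the recurrence Q(n) = 2Q(n/2) + n/B with Q(n) = 0 once 2n ≤ M (taking the constant k = 1) and compares it with the closed form (n/B) log2(2n/M) for power-of-two inputs.

```python
import math

# Cache-cost recurrence for mergeSort in the ideal-cache model (k = 1):
# each level of the recursion transfers about n/B blocks, and recursion
# becomes free once the subproblem plus its output fit in fast memory.
def merge_sort_cost(n, M, B):
    if 2 * n <= M:
        return 0                      # fits in cache: remaining work is free
    return n // B + 2 * merge_sort_cost(n // 2, M, B)

def closed_form(n, M, B):
    return (n // B) * int(math.log2(2 * n // M))

n, M, B = 2 ** 20, 2 ** 10, 2 ** 5
```

With these power-of-two values the two agree exactly: 11 paying levels of cost n/B = 32768 each, for 360448 block transfers in total.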




Figure 1. The I/O model2 assumes a memory hierarchy comprising Figure 4. The cost of mergeSort in the I/O or ideal-cache model. At
a fast memory of size M and slow memory of unbounded size. Both the ith level from the top there are 2i calls to mergeSort, each of size
memories are partitioned into blocks of a fixed size B of consecutive ni = n/2i. Each call has cost kni /B for some constant k. The total cost
memory locations. The CPU and fast memory are treated as a of each level is therefore about 2ik B = kn/B . When a call fits
standard Random Access Machine (RAM)—the CPU accesses into the fast memory, then the rest of the sort is cost-free. Because
individual words, not blocks. Additional instructions allow one to a merge requires about 2n space (for the input and the output), this
move any block between the fast and slow memory. The cost of an will occur when M ª 2 . Therefore, there are log2(2n/M) levels that
algorithm is analyzed in terms of the number of block transfers—the have non-zero cost, each with cost kn/B.
cost of operations within the main memory is ignored. The ideal-
Cache cost
cache machine model (ICMM)11 is similar except that the program
only addresses the slow memory, and an “ideal” cache decides which mergeSort kn/B
blocks to cache in the fast memory using an optimal replacement
strategy. Accessing the fast memory (cache) is again regarded as
mergeSort mergeSort kn/B
cost-free and transferring between slow memory and fast memory log2 (2n/M)
has unit cost. The costs in the two models are equivalent within a
constant factor, and in both models the two levels of memory can
either represent main memory and disk, or cache and main memory. MS MS kn/B

Slow memory size = M, just fits in cache Free


Fast memory

Total = (kn/B) log2 (2n/M)


CPU = O(n/B log2 (n/M))
M/B
Block

B
B This is a reasonable cost and much better than assuming
Cost = 0 Cost = 1 every memory access requires a cache miss. However, this is
not optimal for the model. Later, we will outline a multiway
merge sort that is optimal. It has cost
O((n/B) log_{M/B}(n/B)).  (3)

This optimal cost can be significantly less than the cost
for mergeSort. For example if n ≤ Mᵏ and M ≥ B², which are
reasonable assumptions in many situations, then the cost
reduces to O(nk/B). All the fastest disk sorts indeed use some
variant of a multiway merge sort or another optimal cache
efficient algorithm based on sample sorting.21

Figure 2. Recursive mergeSort. If the input sequence is empty or a
singleton the algorithm returns the same value, otherwise it splits
the input in two approximately equal sized sequences, L and H,
recursively sorts each sequence, and merges the result.

1 fun mergeSort([]) = []
2   | mergeSort([a]) = [a]
3   | mergeSort(A) =
4       let
5         val (L, H) = split(A)
6       in
7         merge(mergeSort(L), mergeSort(H))
8       end

1.1. Careful layout and careful allocation
Although the analysis for mergeSort given above seems
relatively straightforward, we have picked a simple example
and left out several important details. With regards to memory
layout, ordering a sorted sequence contiguously in memory is
pretty straightforward, but this is not the case for many other
data structures. Should a matrix, for example, be laid out in
row-major order, in column-major order, or perhaps in some
hierarchical order such as z-ordering? These different orders
can make a big difference. What about pointer-based structures
such as linked-lists or trees? As Figure 5 indicates, the cost of
traversing a linked list will depend on how the lists are laid
out in memory.

More subtle than memory layout is memory allocation.
Touching unused memory after allocation will cause a cache
miss. Managing the pool of free memory properly is therefore
very important. Consider, again, the mergeSort example. In our
discussion, we assumed that once 2nᵢ ≤ M the problem fits in
fast memory and therefore is free. However, if we used a memory
allocator to allocate the temporary array needed for the merge
the locations would likely be

Figure 3. Merging two sorted arrays. The fingers are moved from left to
right along the two inputs and the output. At each point the lesser value
at the input fingers is copied to the output, and that finger, along with
the output finger, are incremented. During the merge, each block of the
inputs only has to be loaded into the fast memory once, and each block
of the output only needs to be written to slow memory once. Therefore,
for two arrays of size n the total cache cost of the merge is O(n/B).

[Figure content, blocks of size B = 4:
X:   1 3 8 11 | 14 15 18 25 | 27 32 35 40
Y:   2 5 6 7  | 9 12 19 20  | 21 26 28 29
Out: 1 2 3 5  | 6 7 8]
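The layout question above (row-major, column-major, or hierarchical z-order) can be made concrete. The sketch below is illustrative Python of our own, not code from the paper: it computes the flat address that each layout assigns to element (i, j) of an n × n matrix.

```python
# Illustration (ours, not from the paper) of three layouts for an
# n x n matrix: the flat address assigned to element (i, j).
def row_major(i, j, n):
    return i * n + j

def col_major(i, j, n):
    return j * n + i

def z_order(i, j, n):
    # z-order (Morton) index: interleave the bits of i and j
    addr = 0
    for b in range(n.bit_length()):
        addr |= ((i >> b) & 1) << (2 * b + 1)
        addr |= ((j >> b) & 1) << (2 * b)
    return addr
```

Under z-order every aligned power-of-two submatrix occupies a contiguous address range (for example the top-left 2 × 2 block of a 4 × 4 matrix gets addresses 0 through 3), which is what makes hierarchical layouts friendly to recursive blocked algorithms.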

102 COMMUNICATIONS OF THE ACM | JULY 2015 | VOL. 58 | NO. 7


fresh and accessing them would cause a cache miss. The

Figure 5. Two lists in memory. The cost of traversing the lists is very
different depending on how the lists are laid out. In the figure, the top
one, which is randomly laid out, will require O(n) steps to traverse,
whereas the bottom only requires O(n/B).

[Figure content: the same list values 3, 11, 14, 15, 18, 25, 27, 32, 35 stored two ways. Top: cells scattered across blocks (27 15 25 | 3 14 18 | 35 11 32), so following the links jumps between blocks. Bottom: cells laid out in list order (3 11 14 | 15 18 25 | 27 32 35), so traversal touches each block only once.]

Figure 6. Code for merge of two lists. If either list is empty the
code returns the other; otherwise it compares the two heads of the
lists and recurs on the tail of the list with the smaller element,
and on the whole list with the larger element. When the recursive
call finishes a new list element is created with the smaller element
as the head and the result of the recursive call as the tail. The !
is discussed in Section 3.

 1 datatype List = Cons of (int * List)
 2               | Nil
 3
 4 fun merge(A, B) =
 5   case (A, B) of
 6     (Nil, B) ⇒ B
 7   | (A, Nil) ⇒ A
 8   | (Cons(a, At), Cons(b, Bt)) ⇒
 9       if (a < b)
10       then Cons(!a, merge(At, B))
11       else Cons(!b, merge(A, Bt))
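To make Figure 6 runnable, here is a hypothetical Python transliteration of our own (the paper's SML is the reference): a Cons cell becomes a (head, tail) pair, Nil becomes None, and the !a copy is a no-op because Python integers are immutable values.

```python
# Hypothetical Python transliteration of the merge in Figure 6 (our
# rendering, not the paper's SML): a Cons cell is a (head, tail) pair,
# Nil is None. The !a copy is a no-op here, since Python integers are
# immutable values.
def merge(A, B):
    if A is None:                    # (Nil, B) => B
        return B
    if B is None:                    # (A, Nil) => A
        return A
    (a, At), (b, Bt) = A, B
    if a < b:
        return (a, merge(At, B))     # Cons(!a, merge(At, B))
    else:
        return (b, merge(A, Bt))     # Cons(!b, merge(A, Bt))

def from_list(xs):                   # helper: Python list -> Cons list
    r = None
    for x in reversed(xs):
        r = (x, r)
    return r

def to_list(xs):                     # helper: Cons list -> Python list
    out = []
    while xs is not None:
        out.append(xs[0])
        xs = xs[1]
    return out
```

Note that, exactly as in the SML original, every output cell is created fresh on the way back up the recursion, which is the property the cost analysis below exploits.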
cost would therefore not be free, and in fact because levels
near the leaves of the recursion tree do many small allocations,
the cost could be significantly larger near the leaves. To avoid
this, programmers need to manage their own memory. They could,
for example, preallocate a temporary array and then pass it down
to the recursive calls as an extra argument, therefore making
sure that all temporary space is being reused. For more involved
algorithms, allocation can be much more complicated.

In summary, designing and programming cache efficient
algorithms for these models requires a careful layout of data
in the flat memory and careful management of space in that
memory.

1.2. Our goals
The goal of our work is to be able to take advantage of the
ideal-cache model, and hence machines that match the model,
without requiring careful layout of data in memory and without
requiring managing memory allocation by hand. In particular, we
want to work with recursive data structures (pointer structures)
such as lists or trees. Furthermore, we want to work with a
fully automated memory manager with garbage collection and no
(explicit) specifications of where data is placed in memory.
These goals might seem impossible, but we show that they can
be achieved, at least for certain cases. We show this in the
context of a purely functional (side-effect free) language.
The formulation given here is slightly simplified compared
to the associated conference paper5 for the sake of brevity
and clarity; we refer the reader to that paper for a fuller
explanation.

As an example of what we are trying to achieve consider
the merge algorithm shown in Figure 6. Some properties to
notice about the algorithm are that it is based on linked-lists
instead of arrays. Furthermore, because we are working in the
functional setting, every list element is created fresh (in the
Cons operations in Lines 10 and 11). Finally, there is no
specification of where things are allocated and no explicit
memory management. Given these properties it might seem that
this code could be terrible for cache efficiency because
accessing each element could be very expensive, as indicated
in Figure 5. Furthermore, each fresh allocation could cause
a cache miss. Also there is an implied stack in the recursion
and it is not clear how this affects the cost.

We show, however, that with an appropriate implementation
the code is indeed cache efficient, and when used in a
mergeSort gives the same bound as given in Equation (2). We
also describe an algorithm that matches the optimal sorting
bounds given in Equation (3). The approach is not limited to
lists, it also works with trees. For example, Figure 11 defines
a representation of matrices recursively divided into a tree
and a matrix multiplication based on it. The cost bounds for
this multiplication match the bounds for matrix multiplication
for the ideal-cache model (Equation 1) using arrays and careful
programmer layout.

1.3. Cost models and provable implementation bounds
One way to achieve our goals would be to analyze each algorithm
written in high-level code with respect to a particular
compilation strategy and specific runtime memory allocation
scheme. This would be extremely tedious, time consuming, not
portable, and error prone. It would not lead to a practical way
to analyze algorithms.

Instead we use a cost semantics that directly defines costs
in the programming language so that algorithms can be analyzed
without concern of implementation details (such as how the
garbage collector works). We refer to our language, equipped
with our cost semantics, as ICFP, for Ideal-Cache Functional
Programming language. The goal is then to relate the costs in
ICFP to costs in the ICMM. We do this by describing a simulation
of ICFP on the ICMM and proving bounds on the relative costs
based on this simulation. In particular, we show that a
computation running with cost Q, with parameters M and B, in
ICFP can be simulated on the ICMM with M′ = O(M) locations of
fast memory, blocksize B, and cost O(Q). The framework is
illustrated in Figure 7.

1.4. Allocation order and spatial locality
Because functional languages abstract away from data layout
decisions, we use temporal locality (proximity of allocation
time) as a proxy for spatial locality (proximity of allocation



research highlights

Figure 7. Cost semantics for ICFP and its mapping to the ideal cache
model (ICMM). The values k and k′ are constants.

[Figure content: a program analyzed in ICFP with parameters (B, M) has cost Q; the implementation translation maps it to the ICM with parameters (B, M′), where M′ = 4M + kB, at cost Q′ = k′Q.]

in memory). To do this, the memory model for ICFP consists of
two fast memories: a nursery (for new allocations) and a read
cache. The nursery is time ordered to ensure that items are laid
out contiguously when written to main (slow) memory. For merging,
for example, the output list elements will be placed one after
the other in main memory. To formalize this idea we introduce
the notion of a data structure being compact. Roughly speaking
a data structure of size n is compact if it can be traversed in
the model in O(n/B) cache cost. The output of merge is compact
by construction. Furthermore, if both input lists to merge are
compact, we show that merge has cache cost O(n/B). This notion
of compactness generalizes to other data structures, such as
trees.

Another feature of the allocation cache of ICFP is that it
only contains live data. This is important because in some
algorithms, such as matrix multiply, the rate at which temporary
memory is generated is asymptotically larger than the rate at
which output data is generated. Therefore, if all counted it
could fill up the cache, whereas we want to make sure that
temporary data slots are reused. Also, we want to ensure that
temporary memory is not transferred to the slow memory because
such transfer would cause extra traffic. Moreover, the temporary
data could end up between output data elements, causing them to
be fragmented across blocks. In our provable implementation we
do not require a precise tracking of live data, but instead use
a generational collector with a nursery of size 2M. The
simulation then keeps all the 2M locations in the fast memory so
that all allocation and collection (within the new generation)
is fast.

1.5. Related work
The general idea of using high-level cost models based on a cost
semantics along with a provably efficient implementation has
previously been used in the context of parallel cost
models.4,13,15,22 Although there has been a large amount of
experimental work on showing how good garbage collection can
lead to efficient use of caches and disks (Chilimbi and Larus7,
Courts10, Grunwald et al.14, Wilson et al.24 and many references
in Jones and Lins16), we know of none that try to prove bounds
for algorithms for functional programs when manipulating
recursive data types such as lists or trees. Abello et al.1 show
how a functional style can be used to design cache efficient
graph algorithms. However, they assume that data structures are
in arrays (called lists), and that primitives for operations
such as sorting, map, filter and reductions are supplied and
implemented with optimal asymptotic cost (presumably at a lower
level using imperative code). Their goal is therefore to design
graph algorithms by composing these high-level operations on
collections. They do not explain how to deal with garbage
collection or memory management.

2. COST SEMANTICS
In general, a cost semantics for a language defines both how to
execute a program and the cost of its execution. In functional
languages execution consists of a deterministic strategy for
calculating the value of an expression by a process of
simplification, or reduction, similar to what one learns in
elementary algebra. But rather than working with the real
numbers, programs in a functional language calculate with
integers, with inductive data structures, such as lists and
trees, and with functions that act on such values, including
other functions. The theoretical foundation for functional
languages is Church's λ-calculus.8,9

The cost of a computation in a functional language may be
defined in various ways, including familiar measures such as
time complexity, defined as the number of primitive reduction
steps required to evaluate an expression, and space complexity,
defined as the number of basic data objects allocated during
evaluation. Such measures are defined in terms of the constructs
of the language itself, rather than in terms of its
implementation on a machine model such as a RAM or Turing
machine. Algorithm analysis takes place entirely at the language
level. The role of the implementation is to ensure that the
abstract costs can be realized on a concrete machine with stated
bounds.

2.1. Cache cost
The cost measure of interest in the present paper is the cache
cost, which is derived from considering a combination of the
time and space behavior of a program. The cache cost is derived
by specifying when data objects are allocated and retrieved, and
as with the ICMM is parameterized by two constants, B and M,
governing the cache behavior. To be more precise, expressions
are evaluated relative to an abstract store, σ, which comprises
a main memory, μ, a read cache, ρ, and a nursery, or allocation
area, ν. The structure of the store is depicted in Figure 8. The
main memory maps abstract locations to atomic data objects, and
is of unbounded capacity. Atomic data objects are considered to
occupy one unit of storage. Objects in main memory are
aggregated into blocks of size B in a manner to be described
shortly. The blocking of data objects does not change during
execution; once an object is assigned to a block, it remains
part of that block for the rest of the computation.

The read cache is a fixed-size mapping from locations to data
objects containing at most M objects. As usual, the read cache
represents a partial view of main memory. Objects are


Figure 8. The memory model for the ICFP cost semantics. It is
similar to the ideal cache model, but has two fast memories of
size M: a nursery for newly allocated data, and a read cache for
data that has migrated to main memory and was read again by
the program. The main memory and read cache are organized into
blocks, as in the ideal-cache model, but the nursery is not blocked.
Rather, it consists of (live) data objects linearly ordered by the time
at which they were allocated. The semantics defines when a piece
of data is live, but conceptually it is just when it is reachable from
the executing program. If an allocation overflows the nursery, the
oldest B live objects are flushed from the cache into main memory
as a block to make space for the new object. If one of these words
is later read, the entire block is moved into the read cache, possibly
displacing another block in the process. Displaced blocks can be
discarded, because the only writes in a functional language occur at
allocation, and never during subsequent execution.

[Figure content: allocations (writes) go into the nursery ν, of size M, not organized in blocks, sorted by allocation time, holding live data only; overflow flushes the oldest B objects to main memory μ as a block; reads bring size-B blocks from μ into the read cache ρ, which holds M/B blocks and may toss a block to make room. Nursery and read cache accesses have cost 0; block transfers have cost 1.]

loaded into the read cache in blocks of size B as determined by
the aggregation of objects in main memory. Objects are evicted
from the read cache by simply over-writing them; in a functional
language data objects cannot be mutated, and hence need never be
written back to main memory. Blocks are evicted according to the
(uncomputable) Ideal Cache Model (ICMM).

The allocation area, or nursery, is a fixed-size mapping from
locations to data objects also containing at most M objects.
Locations in the nursery are not blocked, but they are linearly
ordered according to the time at which they are allocated. We
say that one data object is older than another if the location
of the one is ordered prior to that of the other. A location in
the nursery is live if it occurs within the program being
evaluated.18 All data objects are allocated in the nursery. When
its capacity of M objects is exceeded, the oldest B objects are
formed into a block that is migrated to main memory, determining
once and for all the block to which these objects belong. This
policy ensures that temporal locality implies spatial locality,
which means that objects that are allocated nearby in time will
be situated nearby in space (i.e., occupy the same block). In
particular, when an object is loaded into the read cache, its
neighbors in its block will be loaded along with it. These will
be the objects that were allocated at around the same time.

The role of a cost semantics is to enable reasoning about the
cache cost of algorithms without dropping down to the
implementation level. To support proof, the semantics must be
specified in a mathematically rigorous manner, which we now
illustrate for ICFP, a simple eager variant of Plotkin's
language PCF20 for higher-order computation over natural
numbers. (It is straightforward to extend ICFP to account for a
richer variety of data structures, such as lists or trees, and
for hardware-oriented concepts such as words and floating point
numbers, which we assume in our examples.)

2.2. Cost semantics
The abstract syntax of ICFP is defined as follows:

e ::= x | z | s(e) | ifz(e; e₀; x.e₁) | fun(x, y.e) | app(e₁; e₂)

Here x stands for a variable, which will always stand for a data
object allocated in memory. The expressions z and s( ) represent
0 and successor, respectively, and the conditional tests whether
e is z or not, branching to e₀ if so, and branching to e₁ if
not, substituting (the location of) the predecessor for x in
doing so. Functions, written fun(x, y.e), bind a variable x
standing for the function itself, and a variable y standing for
its argument. (The variable x is used to effect recursive
calls.) Function application is written app(e₁; e₂), which
applies the function given by the expression e₁ to the argument
value given by the expression e₂. The typing rules for ICFP are
standard (see Chapter 10 of Harper15), and hence are omitted for
the sake of concision.

Were the cost of a computation in ICFP just the time
complexity, the semantics would be defined by an evaluation
relation e ⇓ⁿ v stating that the expression e evaluates to the
value v using n reduction steps according to a specified
outermost strategy. Accounting for the cache cost of a
computation is a bit more complicated, because we must account
for the memory traffic induced by execution. To do so, we
consider an evaluation relation of the form

σ @ e ⇓ⁿ σ′ @ ℓ

in which the evaluation of an expression e is performed relative
to a store σ, and returns a value represented by a location in a
modified store σ′. The modifications to the store reflect the
allocation of new objects, their migration to main memory, and
their loading into the read cache. The evaluation relation also
specifies the cache cost n, which is determined entirely by the
movements of blocks to and from main memory. All other
operations are assigned zero cost.

The evaluation relation for ICFP is defined by derivation
rules that specify closure conditions on it. Evaluation is
defined to be the strongest relation satisfying these
conditions. To give a flavor of how this is done, we present in
Figure 9 a simplified form of the rule for evaluating a function
application, app(e₁; e₂). The rule has four premises, and one
conclusion, separated by the horizontal line, which may be read
as implication stating that if the premises are all derivable,
then so is the conclusion.

The other constructs of ICFP are defined by similar rules,
which altogether specify the behavior and cost of the evaluation
of any ICFP program. The abstract cost




Figure 9. Simplified semantics of function application. This rule
specifies the cost and evaluation of a function application
app(e₁; e₂) of a function e₁ to an argument e₂. Similar rules
govern the other language constructs. Bear in mind that all values
are represented by (abstract) locations in memory. First, the
expression e₁ is evaluated, with cost n₁, to obtain the location ℓ₁
of the function being called. That location is read from memory to
obtain the function itself, a self-referential λ-abstraction, at
cost n′₁. Second, the expression e₂ is evaluated, with cost n₂, to
obtain its value ℓ₂. Finally, ℓ₁ and ℓ₂ are substituted into the
body of the function, and then evaluated to obtain the result ℓ at
cost n. The overall result is ℓ with total cost the sum of the
constituent costs. The only non-zero cost arises from allocation of
objects and reading from memory; all other operations are
considered cost-free, as in the I/O model.

  σ @ e₁ ⇓^n₁ σ′₁ @ ℓ′₁                    (evaluate function)
  σ′₁ @ ℓ′₁ ↓^n′₁ σ₂ @ fun(x, y.e)          (load function value)
  σ₂ @ e₂ ⇓^n₂ σ′₂ @ ℓ′₂                    (evaluate argument)
  σ′₂ @ [ℓ′₁, ℓ′₂/x, y]e ⇓ⁿ σ′ @ ℓ          (evaluate function body)
  ─────────────────────────────────────────
  σ @ app(e₁; e₂) ⇓^(n₁ + n′₁ + n₂ + n) σ′ @ ℓ

semantics underlies the analyses given in Sections 1 and 3. The
abstract costs given by the semantics are validated by giving a
provable implementation4,13 that realizes these costs on a
concrete machine. In our case the realization is given in terms
of the ICMM, which is an accepted concrete formulation of a
machine with hierarchical memory. By composing the abstract
semantics with the provable implementation we obtain end-to-end
bounds on the cache complexity of algorithms written in ICFP
without having to perform the analysis at the level of the
concrete machine, but rather at the level of the language in
which the algorithms are written.

2.4. Provable implementation
The provable implementation of ICFP on the ICMM is described in
Figure 10. Its main properties, which are important for
obtaining end-to-end bounds on cache complexity, are summarized
by the following theorem:

Theorem 2.1. An evaluation σ @ e ⇓ⁿ σ′ @ ℓ in ICFP with
abstract cache parameters M and B can be implemented on the
ICMM with concrete cache parameters 4 × M + k × B and B with
cache cost O(n), for some constant k.

The proof of the theorem consists of an implementation of
ICFP on the ICMM described in Figure 10, and of showing that
each reduction step of ICFP is simulated on the ICMM to within a
(small) constant factor. The full details of the proof, along
with a tighter statement of the theorem, are given in the
associated conference paper.

3. CACHE EFFICIENT ALGORITHMS
We now analyze mergeSort and matrix multiply in ICFP, and give
bounds for a k-way mergeSort that is optimal. The
implementations are completely natural functional programs using
lists and trees instead of arrays. In the associated conference
paper, we describe how one can deal with higher-order functions
(such as passing a comparison function to merge), and define the
notion of hereditarily finite values for this purpose, but here
we only consider first-order usage.

As mentioned in the introduction, the cost of accessing a
recursive datatype will depend on how it is laid out in memory,
which depends on allocation order. We could try to define the
notion of a list being in a good order in terms of how the list
is represented in memory. This is cumbersome and low level. We
could also try to define it with respect to the specific code
that allocated the data. Again this is cumbersome. Instead we
define it directly in terms of the cache cost of traversing the
structure. By traversing we mean going through the structure in
a specific order and touching (reading) all data in the
structure. Because types such as trees might have many traversal
orders, the definition is with respect to a particular traversal
order. For this purpose, we define the notion of "compact."

Definition 3.1. A data structure of size n is compact with
respect to a given traversal order if traversing it in that
order has cache cost O(n/B) in our cost semantics with M ≥ kB
for some constant k.

We can now argue that certain code will generate data
structures that are compact with respect to a particular order.
We note that if a list is compact with respect to traversal in
one direction, it is also compact with respect to the opposite
direction.

To keep data structures compact, not only do the top level
links need to be accessed in approximately the order

Figure 10. Simulating the ICFP on the ICMM. The simulation requires
a cache of size 4 × M + k × B for some constant k. Allocation and
migration is managed using generational copying garbage collection,
which requires 2 × M words of cache to manage M live objects. The
liveness of objects in the nursery can be assessed without accessing
main memory, because in a functional language older objects cannot
refer to newer objects. Copying collection preserves the allocation
ordering of objects, as is required for our analyses. Because two
objects that are in the same abstract block can be separated into
adjacent blocks by the simulation, blocks are loaded two-at-a-time,
requiring 2 × M additional words of cache on the ICMM. An additional
B words of read cache are required to account for the space required
by the control stack using an amortization argument, and a constant
number of blocks are required for the allocation of the program
itself in memory.

[Figure content: the simulator's cache, of total size M′ = 4M + kB, holds a nursery of 2M/B blocks managed with generational GC (allocations, i.e., writes, go here; live data is copied forward, data that is no longer live is reclaimed) and a read-and-stack cache of 2M/B + k blocks filled by reads of size-B blocks from main memory; main memory holds the migrated, older data.]
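Definition 3.1 can be made tangible with a toy accounting of our own (an assumption-laden sketch, not the paper's semantics): given the memory address of each list cell in traversal order, charge one unit per new block of size B touched. A list laid out in allocation order costs about n/B, while a randomly scattered one costs close to n, as Figure 5 claims.

```python
# Toy accounting for Definition 3.1 (our illustrative model, not the
# paper's semantics): traversing a structure costs one unit per new
# block of size B touched, given the address of each cell in
# traversal order.
import random

def traversal_cost(addresses, B):
    cost, cur = 0, None
    for a in addresses:
        blk = a // B                 # block holding this cell
        if blk != cur:               # entering a new block: one transfer
            cost += 1
            cur = blk
    return cost

n, B = 1024, 16
compact = list(range(n))             # cells in allocation (traversal) order
rng = random.Random(0)
scattered = rng.sample(range(64 * n), n)  # cells scattered at random
# traversal_cost(compact, B) is exactly n/B; for scattered it is near n
```

This is only the O(n/B) versus O(n) distinction of Figure 5 restated as arithmetic; the actual model also accounts for the read cache and nursery described in Section 2.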


they were allocated, but anything that is touched by the
algorithm during traversal also needs to be accessed in a
similar order. For example, if we are sorting a list it is
important that in addition to the "cons cells," the keys are
allocated in the list order (or, as we will argue shortly,
reverse order is fine). To ensure that the keys are allocated in
the appropriate order, they need to be copied whenever placing
them in a new list. This copy needs to be a deep copy that
copies all components. If the keys are machine words then in
practice these might be inlined into the cons cells anyway by a
compiler. However, to ensure that objects are copied we will use
operation !a to indicate copying of a.

3.1. Merge sort
We now consider analyzing mergeSort in our cost model, and in
particular the code given in Figures 2 and 6. We will not cover
the definition of split because it is similar to merge. We
first analyze the merge.

Theorem 3.2. For compact lists A and B, the evaluation of
merge(A, B) starting with any cache state will have cache cost
O(n/B) (where n = |A| + |B|) and will return a compact list.

Proof. We consider the cache cost of going down the recursion
and then coming back up. Because A and B are both compact, we
need only put aside a constant number of cache blocks to
traverse each one (by definition). Recall that in the cost model
we have a nursery ν that maintains both live allocated values
and place holders for stack frames in the order they are
created. In merge nothing is allocated from one recursive call
to the next (the cons cells are created on the way back up the
recursion) so only the stack frames are placed in the nursery.
After M recursive calls the nursery will fill and blocks will
have to be flushed to the memory μ (as described in Section 2).
The merge will invoke at most O(n/B) such flushes because only n
frames are created. On the way back up the recursion we will
generate the cons cells for the list and copy each of the keys
(using the !). Note that copying the keys is important so that
the result remains compact. The cons cells and copies of the
keys will be interleaved in the allocation order in the nursery
and flushed to memory once the nursery fills. Once again these
will be flushed in blocks of size B and hence there will be at
most O(n/B) such flushes. Furthermore the resulting list will be
compact because adjacent elements of the list will be in the
same block. ∎

We now consider mergeSort as a whole.

Theorem 3.3. For a compact list A, the evaluation of
mergeSort(A) starting with any cache state will have cache cost
given by Equation (2) and will return a compact list.

Proof. As with the array version, we consider the two cases
when the input fits in cache and when it does not. The
mergeSort routine never requires more than O(n) live allocated
data. Therefore, when kn ≤ M for some (small) constant k all
allocated data fits in the nursery. Furthermore, because the
input list is compact, for k′n ≤ M the input fits in the read
cache (for some constant k′). Therefore, the cache cost for
mergeSort is at most the time to flush O(n) items out of the
allocation cache that it might have contained at the start, and
to load the read cache with the input. This cache cost is
bounded by O(n/B). When the input does not fit in cache we have
to pay for the merge as analyzed above plus the recursive calls.
This gives the same recurrence as for the array version and
hence solves to the claimed result. ∎

Here we just outline the k-way mergeSort algorithm and state
the cost bounds, which are optimal for sorting. The sort is
similar to mergeSort but instead of splitting into two parts
recursively sorting each, it splits into k parts. A k-way merge
is then used to merge the sorted parts. The k-way merge can be
implemented using a priority queue. The sort in ICFP has the
optimal bounds for sorting given by Equation (3).

3.2. Matrix multiply
Our final example is matrix multiply. The code is shown in
Figure 11 (we have left out checks for matching sizes). This is
a block recursive matrix multiply with the matrix laid out in a
tree. We define compactness with respect to a preorder traversal
of this tree. We therefore say the matrix is compact if
traversing in this order can be done with cache cost O(n²/B) for
an n × n matrix (n² leaves). We note that if we generate a
matrix in a preorder traversal allocating the leaves along the
way, the resulting matrix will be compact. Also, every recursive
sub-matrix is itself compact.

Theorem 3.4. For compact n × n matrices A and B, the
evaluation of mmult(A, B) starting with any cache state will
have cache cost given by Equation (1) and will return a compact
matrix as a result.

Proof. Matrix addition has cache cost O(n²/B) and generates a
compact result because we traverse the two input matrices

Figure 11. Matrix Multiply.

 1 datatype M = Leaf of int
 2            | Node of M * M * M * M
 3
 4 fun mmult(Leaf a, Leaf b) = Leaf(a × b)
 5   | mmult(Node(a11, a12, a21, a22),
 6           Node(b11, b12, b21, b22)) =
 7     let
 8       fun madd(Leaf a, Leaf b) = Leaf(a + b)
 9         | madd(Node(a11, a12, a21, a22),
10                Node(b11, b12, b21, b22)) =
11             Node(madd(a11, b11), madd(a12, b12),
12                  madd(a21, b21), madd(a22, b22))
13     in
14       Node(madd(mmult(a11, b11), mmult(a12, b21)),
15            madd(mmult(a11, b12), mmult(a12, b22)),
16            madd(mmult(a21, b11), mmult(a22, b21)),
17            madd(mmult(a21, b12), mmult(a22, b22)))
18     end
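As a sanity check on Figure 11's structure, here is a hypothetical Python transliteration of our own (the paper's SML is the reference): a Leaf is an int and a Node is a 4-tuple of quadrant subtrees (n11, n12, n21, n22).

```python
# Hypothetical Python transliteration of Figure 11's quadtree matrix
# multiply (our rendering, not the paper's SML): a Leaf is an int and
# a Node is a 4-tuple of quadrant subtrees (n11, n12, n21, n22).
def madd(a, b):
    if isinstance(a, int):           # Leaf + Leaf
        return a + b
    return tuple(madd(x, y) for x, y in zip(a, b))

def mmult(a, b):
    if isinstance(a, int):           # Leaf * Leaf
        return a * b
    a11, a12, a21, a22 = a
    b11, b12, b21, b22 = b
    return (madd(mmult(a11, b11), mmult(a12, b21)),
            madd(mmult(a11, b12), mmult(a12, b22)),
            madd(mmult(a21, b11), mmult(a22, b21)),
            madd(mmult(a21, b12), mmult(a22, b22)))

# The 2x2 matrices [[1,2],[3,4]] and [[5,6],[7,8]] as Nodes of Leaves:
A = (1, 2, 3, 4)
B = (5, 6, 7, 8)
# mmult(A, B) is (19, 22, 43, 50), i.e., [[19, 22], [43, 50]]
```

Note that, as in the SML, each result quadrant is built (allocated) before the next, which is exactly the preorder allocation property the proof of Theorem 3.4 relies on.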




in preorder traversal and we generate the output in the same References


order. Because the live data is never larger than O(n²), the problem will fit in cache for n² ≤ M/c for some constant c. Once it fits in cache the cost is O(n²/B) needed to load the input matrices and write out the result. When it does not fit in cache we have to do eight recursive calls and four calls to matrix addition. This gives a recurrence,

    Q(n) = 8 Q(n/2) + O(n²/B),  with Q(n) = O(n²/B) once n² ≤ M/c,

which solves to O(n³/(B√M)). The output is compact because each of the four calls to madd in mmult allocates new results in preorder with respect to the submatrices they generate, and the four calls are made in preorder. Therefore, the overall matrix returned is allocated in preorder.
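The recursive structure that this recurrence analyzes can be sketched as follows. This is an illustrative Python sketch, not the authors' functional code (the paper's algorithms are written against its own cost semantics; the names mmult and madd are taken from the text, while the helpers _split and _join are assumptions added here for completeness): eight recursive multiplications of n/2 × n/2 submatrices, combined by four matrix additions, each allocating a fresh result.

```python
# Divide-and-conquer matrix multiply on n x n matrices (n a power of two),
# represented as lists of rows. Each madd allocates a new matrix, which is
# what makes the output compact in allocation order.

def madd(A, B):
    """Add two n x n matrices, allocating a fresh result (O(n^2/B) cache cost)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def _split(A):
    """Split A into four n/2 x n/2 quadrants."""
    n = len(A) // 2
    return ([r[:n] for r in A[:n]], [r[n:] for r in A[:n]],
            [r[:n] for r in A[n:]], [r[n:] for r in A[n:]])

def _join(C11, C12, C21, C22):
    """Reassemble four quadrants into one matrix."""
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

def mmult(A, B):
    """Recursive matrix multiply: eight recursive calls plus four calls to madd."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = _split(A)
    B11, B12, B21, B22 = _split(B)
    C11 = madd(mmult(A11, B11), mmult(A12, B21))
    C12 = madd(mmult(A11, B12), mmult(A12, B22))
    C21 = madd(mmult(A21, B11), mmult(A22, B21))
    C22 = madd(mmult(A21, B12), mmult(A22, B22))
    return _join(C11, C12, C21, C22)
```

In the ideal cache model, the splits, joins, and additions at each level touch O(n²/B) blocks, which is exactly the additive term of the recurrence above.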
4. CONCLUSION
The present work extends the methodology of Blelloch and Greiner4,13 to account for the cache complexity of algorithms, stated in terms of two parameters: the cache block size and the number of cache blocks. Analyses are carried out in terms of the semantics of ICFP given in Section 2, and are transferred to the Ideal Cache Model by a provable implementation, also sketched in Section 2. The essence of the idea is that conventional copying garbage collection can be deployed to achieve cache-efficient algorithms without sacrificing abstraction by resorting to manual allocation and caching of data objects. Using this approach, we are able to express algorithms in a high-level functional style, analyze them using a model that captures the idea of a fixed-size read and allocation stack without exposing the implementation details, and still match the asymptotic bounds for the ideal cache model achieved using low-level imperative techniques, including explicit memory management. For sorting, the bounds are optimal.
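For reference, the optimality claim for sorting is with respect to the classical lower bound in this model, due to Aggarwal and Vitter (reference 2): any sorting algorithm must incur

```latex
Q_{\mathrm{sort}}(n) \;=\; \Theta\!\left(\frac{n}{B}\,\log_{M/B}\frac{n}{B}\right)
```

cache misses, where M is the cache size and B the block size; matching this bound in a purely functional setting is what the preceding paragraph refers to.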
One direction for further research is to integrate (deterministic) parallelism with the present work. Based on previous work4,13 we expect that the evaluation semantics given here will provide a good foundation for specifying parallel as well as sequential complexity. One complication is that the explicit consideration of storage in the cost model given here would have to take account of the interaction among parallel threads. The amortization arguments would also have to be reconsidered to account for parallelism. Finally, although we are able to generate an optimal cache-aware sorting algorithm, it is unclear whether it is possible to generate an optimal cache-oblivious sorting algorithm in our model.

Acknowledgments
This work is partially supported by the National Science Foundation under grant number CCF-1018188, and by the Intel Labs Academic Research Office for the Parallel Algorithms for Non-Numeric Computing Program.

Guy E. Blelloch and Robert Harper ({guyb, rwh}@cs.cmu.edu), Carnegie Mellon University, Pittsburgh, PA

References
1. Abello, J., Buchsbaum, A.L., Westbrook, J. A functional approach to external graph algorithms. Algorithmica 32, 3 (2002), 437–458.
2. Aggarwal, A., Vitter, J.S. The input/output complexity of sorting and related problems. Commun. ACM 31, 9 (1988), 1116–1127.
3. Arge, L., Bender, M.A., Demaine, E.D., Leiserson, C.E., Mehlhorn, K., eds. Cache-Oblivious and Cache-Aware Algorithms, 18.07.–23.07.2004, volume 04301 of Dagstuhl Seminar Proceedings. IBFI, Schloss Dagstuhl, Germany, 2005.
4. Blelloch, G.E., Greiner, J. Parallelism in sequential functional languages. In SIGPLAN-SIGARCH-WG2.8 Conference on Functional Programming and Computer Architecture (FPCA) (La Jolla, CA, 1995), 226–237.
5. Blelloch, G.E., Harper, R. Cache and I/O efficient functional algorithms. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). R. Giacobazzi and R. Cousot, eds. (Rome, Italy, 2013), ACM, 39–50.
6. Chiang, Y.-J., Goodrich, M.T., Grove, E.F., Tamassia, R., Vengroff, D.E., Vitter, J.S. External-memory graph algorithms. In ACM-SIAM Symposium on Discrete Algorithms (SODA). K.L. Clarkson, ed. (San Francisco, CA, 1995), ACM/SIAM, 139–149.
7. Chilimbi, T.M., Larus, J.R. Using generational garbage collection to implement cache-conscious data placement. In International Symposium on Memory Management. S.L.P. Jones and R.E. Jones, eds. (Vancouver, British Columbia, 1998), ACM, 37–48.
8. Church, A. An unsolvable problem of elementary number theory. Am. J. Math. 58, 2 (April 1936), 345–363.
9. Church, A. The Calculi of Lambda-Conversion. Annals of Mathematics Studies. Princeton University Press, Princeton, NJ, 1941.
10. Courts, R. Improving locality of reference in a garbage-collecting memory management system. Commun. ACM 31, 9 (1988), 1128–1138.
11. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S. Cache-oblivious algorithms. In FOCS (IEEE Computer Society, 1999), 285–298.
12. Goodrich, M.T., Tsay, J.-J., Vengroff, D.E., Vitter, J.S. External-memory computational geometry (preliminary version). In FOCS (IEEE Computer Society, 1993), 714–723.
13. Greiner, J., Blelloch, G.E. A provably time-efficient parallel implementation of full speculation. ACM Trans. Program. Lang. Syst. 21, 2 (1999), 240–285.
14. Grunwald, D., Zorn, B.G., Henderson, R. Improving the cache locality of memory allocation. In R. Cartwright, ed., PLDI (ACM, 1993), 177–186.
15. Harper, R. Practical Foundations for Programming Languages. Cambridge University Press, Cambridge, UK, 2013.
16. Jones, R., Lins, R. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley, 1996.
17. Meyer, U., Sanders, P., Sibeyn, J.F., eds. Algorithms for Memory Hierarchies, Advanced Lectures [Dagstuhl Research Seminar, March 10–14, 2002], volume 2625 of Lecture Notes in Computer Science. Springer, 2003.
18. Morrisett, J.G., Felleisen, M., Harper, R. Abstract models of memory management. In FPCA (1995), 66–77.
19. Munagala, K., Ranade, A.G. I/O-complexity of graph algorithms. In SODA. R.E. Tarjan and T. Warnow, eds. (ACM/SIAM, 1999), 687–694.
20. Plotkin, G.D. LCF considered as a programming language. Theor. Comput. Sci. 5, 3 (1977), 223–255.
21. Rahn, M., Sanders, P., Singler, J. Scalable distributed-memory external sorting. In ICDE. F. Li, M.M. Moro, S. Ghandeharizadeh, J.R. Haritsa, G. Weikum, M.J. Carey, F. Casati, E.Y. Chang, I. Manolescu, S. Mehrotra, U. Dayal, and V.J. Tsotras, eds. (IEEE, 2010), 685–688.
22. Spoonhower, D., Blelloch, G.E., Harper, R., Gibbons, P.B. Space profiling for parallel functional programs. In ICFP. J. Hook and P. Thiemann, eds. (ACM, 2008), 253–264.
23. Vitter, J.S. Algorithms and data structures for external memory. Found. Trends Theor. Comput. Sci. 2, 4 (2006), 305–474.
24. Wilson, P.R., Lam, M.S., Moher, T.G. Caching considerations for generational garbage collection. In LISP and Functional Programming, 1992, 32–42.

© 2015 ACM 0001-0782/15/07 $15.00
108 COM MUNICATIO NS O F TH E ACM | J U LY 201 5 | VO L . 5 8 | NO. 7


CAREERS

Boise State University
Department of Computer Science
Eight Open Rank, Tenured/Tenure-Track Faculty Positions

The Department of Computer Science at Boise State University invites applications for eight open rank, tenured/tenure-track faculty positions. We seek applicants in the areas of big data (including distributed systems, HPC, machine learning, visualization), cybersecurity, human-computer interaction, and computer science education research. Strong applicants from other areas of computer science will also be considered. Applicants should have a commitment to excellence in teaching, a desire to make significant contributions in research, and experience in collaborating with faculty and local industry to develop and sustain funded research programs. A PhD in Computer Science or a closely related field is required by the date of hire. For additional information, please visit http://coen.boisestate.edu/cs/jobs.

Carnegie Mellon University
School of Computer Science
Full Professor and Head of the Machine Learning Department

The School of Computer Science at Carnegie Mellon University is seeking an energetic and visionary leader for the position of Full Professor and Head of the Machine Learning Department. The Department, one of seven academic units in the School, is the world's only academic Machine Learning Department and is known internationally for its outstanding research and educational programs. The successful candidate will be charged with ensuring that the Department maintains a national and international presence in the rapidly growing fields of machine learning and data sciences, and will have the opportunity to further shape and strengthen the Department by leading efforts to recruit new faculty at both senior and junior levels, expand research or educational programs, and develop new fundraising initiatives. Candidates should possess an earned doctorate in a relevant discipline and should have an outstanding record of (i) research productivity, as demonstrated through publication record and peer-reviewed extramural funding obtained; (ii) teaching at the graduate and/or undergraduate levels; (iii) engagement in interdisciplinary research initiatives; (iv) productive academic leadership and service. Prior administrative experience is desirable, but not required. Although applications will be accepted until the position is filled, applications should be received by July 1 to receive full consideration. Candidates should submit a full curriculum vitae, the names and email addresses of four individuals that we can contact for references, and a brief summary of relevant research and administrative experience via email to MLDheadsearch@cs.cmu.edu.

Carnegie Mellon considers applicants for employment without regard to, and does not discriminate on the basis of, gender, race, protected veteran status, disability, or any other legally protected status.

Pontificia Universidad Católica de Chile
School of Engineering
Faculty Position: Big Data for Transportation and Logistics Systems

The School of Engineering at Pontificia Universidad Católica de Chile calls for applications for a full-time faculty position in the area of Big Data for Transportation and Logistics Systems. This position is a joint appointment with the Department of Computer Science and the Department of Transport Engineering and Logistics. The new position aims to strengthen teaching and research collaborations between the two departments in order to face the new challenges of urban transport planning and city management in the era of Big Data. Applicants must be in possession of (or be currently pursuing studies towards) a Ph.D. degree in Computer Science or a closely related area, and have a demonstrable commitment to excellence in both teaching and research. For further information, please visit: http://www.ing.puc.cl/facultybigdata.

The South University of Science and Technology (SUSTC)
Professor/Associate Professor/Assistant Professorship in Computer Science

The University
Established in 2012, the South University of Science and Technology (SUSTC) is a public institution funded by the municipality of Shenzhen, a special economic zone city in China. Shenzhen is a major city located in Southern China, situated immediately north of the Hong Kong Special Administrative Region. One of China's major gateways to the world, Shenzhen has been the country's fastest-growing city over the past two decades. The city is the high-tech and manufacturing hub of southern China. A picturesque coastal city, Shenzhen is also a popular tourist destination and was named one of the world's 31 must-see tourist destinations in 2010 by The New York Times.

The South University of Science and Technology is a pioneer in higher education reform in China. The mission of the University is to become a globally recognized institution which emphasizes academic excellence and promotes innovation, creativity and entrepreneurship. The teaching language at SUSTC is bilingual, either English or Putonghua.

Set on five hundred acres of wooded landscape in the picturesque Nanshan (South Mountain) area, the new campus offers an ideal environment suitable for learning and research.

Call for Application
The South University of Science and Technology now invites applications for faculty positions in its Computer Science Department, which is currently being built up rapidly. It seeks to make a number of tenured or tenure-track appointments at all ranks. Candidates with research interests in all mainstream fields of Computer Science will be considered. These are full-time posts. SUSTC adopts the tenure-track system, which offers the recruited faculty members a clearly defined career path.

Candidates should have demonstrated excellence in research and a strong commitment to teaching. A doctoral degree is required at the time of appointment. Candidates for senior positions must have an established record of research, and a track record in securing external funding as PI.

As a State-level innovative city, Shenzhen is home to some of China's most successful high-tech companies, such as Huawei and Tencent. As a result, SUSTC considers entrepreneurship one of the university's main directions, and good initial support will be provided for new initiatives. SUSTC encourages candidates with entrepreneurial ambitions and experience to apply.

Terms & Applications
To apply, please send curriculum vitae, description of research interests and statement on teaching to cshire@sustc.edu.cn.
Contact: Professor Yifan Chen
Email: cshire@sustc.edu.cn
Tel: +86-755-88018506

SUSTC offers competitive salaries and fringe benefits including medical insurance, retirement and housing subsidy, which are among the best in China. Salary and rank will be commensurate with qualifications and experience. More information can be found at http://talent.sustc.edu.cn/en

Candidates should also arrange for at least three letters of recommendation to be sent directly to the above email address. The search will continue until the position is filled.

University of Notre Dame
Department of Computer Science and Engineering
Teaching Faculty Position

The Department of Computer Science and Engineering at the University of Notre Dame seeks candidates for a teaching faculty position to teach courses primarily in the CS&E undergraduate curricula. This is a full-time, continuing position in the Special Professional Faculty track. Competitive candidates will have the training and experience necessary to teach effectively in a range of courses in accredited degree programs in Computer Science and Computer Engineering. Candidates with backgrounds in all areas of Computer Science and Computer Engineering will be considered. Relevant industry experience is also valued.

The University of Notre Dame is a private, Catholic university with a doctoral research extensive Carnegie classification, and consistently ranks in USNWR as a top-twenty national university. The South Bend area has a vibrant and diverse economy with affordable housing and excellent school systems, and is within easy driving distance of Chicago and Lake Michigan.

Qualified candidates should have at least a Masters degree, and preferably a doctoral degree, in Computer Science, Computer Engineering, or a related area.
Applications should include a cover letter, curriculum vitae, statement of teaching experience and philosophy, and names of at least three professional references, at least two of whom must be able to comment on the applicant's teaching experience. Review of applications will begin on June 1 and continue until the position is filled. Applications should be submitted at http://apply.interfolio.com/29569.

The University of Notre Dame is an Equal Opportunity, Affirmative Action Employer.

University of Pennsylvania
School of Engineering and Applied Science (SEAS) and the Wharton School, Jerome Fisher Program in Management & Technology
Faculty Director Search

The School of Engineering and Applied Science (SEAS) and the Wharton School at the University of Pennsylvania have initiated a search for an outstanding scholar to serve as the Director of the prestigious Jerome Fisher Program in Management & Technology (M&T).

The M&T Program (http://www.upenn.edu/fisher/) combines academics from two phenomenal Penn assets, Penn Engineering and the Wharton School, into one unique educational experience. Founded in 1977, it has attracted some of the brightest undergraduates in the world. With over 1,900 alumni worldwide, graduates from the M&T Program are corporate leaders and innovators in a number of industries and fields, including investment banking, the technology sector, hedge funds and private equity, venture capital, aviation, medicine, biotechnology, consulting, law, and more. The M&T community in general is highly entrepreneurial, working on their own ventures or collaborating on start-up enterprises. With eight regional groups across the globe, M&T alumni love to stay connected, and frequently recruit fellow graduates and current students for internships and full-time positions.

The Director of the Jerome Fisher Program in Management & Technology should be:

- A world-class scholar in a relevant area of engineering, management, innovation or related field with a broad understanding of issues across domains.
- An international leader and educational innovator who has a vision for the future of the program in an era of growing innovation.
- An institution builder who will work with faculty, students and leadership across SEAS and Wharton, as well as with an extremely deep alumni base, in executing the vision.

Candidates must hold a Ph.D. in engineering, management, or a related area. The academic appointment of the Director is expected to be at the level of tenured Full Professor holding the Jeffrey A. Keswin Professorship, and could be in any relevant department in SEAS, Wharton, or across schools, depending on the applicant. Diversity candidates are strongly encouraged to apply.

Interested persons should submit their application here: http://tinyurl.com/mandtsearch-upenn. Questions regarding this unique opportunity should be directed to mandtsearch@upenn.edu. Review of applications will begin August 1, 2015.

The University of Pennsylvania is an affirmative action/equal opportunity employer. All qualified applicants will receive consideration for employment and will not be discriminated against on the basis of race, color, religion, sex, sexual orientation, gender identity, creed, national or ethnic origin, citizenship status, age, disability, veteran status, or any other characteristic protected by law.

Western Michigan University
Department of Computer Science, College of Engineering and Applied Sciences
Faculty Positions in Computer Science

Applications are invited for three tenure-track positions at the assistant professor level in the Department of Computer Science at Western Michigan University (Kalamazoo, MI) starting January 2016.

Applicants must have a Ph.D. in Computer Science or a closely related field. Candidates with expertise in any area of computer science are welcome to apply, but preference will be given to candidates with particular expertise in big data, computer security and embedded systems/internet of things.

Successful candidates will be capable of establishing an active research program leading to funding, supervising graduate students, and teaching courses at both the undergraduate and graduate levels. Other duties include development of undergraduate and graduate courses, advising and service at the University, College, Department and professional society levels.

Application screening will start on September 15, 2015 but the positions will remain open until filled. Successful candidates must earn their Ph.D. degree by the time of employment.

The Department has 260 undergraduates, 70 M.S. students and 40 Ph.D. students. Current active research areas include networks, embedded systems/internet of things, compilers, computational biology, massive data analytics, scientific computing, parallel computing, security, privacy, formal verification, parallel debugging, and data mining. More information regarding Western Michigan University, the College of Engineering and Applied Sciences and the Department of Computer Science is available at http://www.wmich.edu, http://www.wmich.edu/engineer, and http://www.cs.wmich.edu, respectively.

The Carnegie Foundation for the Advancement of Teaching has placed WMU among the 76 public institutions in the nation designated as research universities with high research activity.

WMU is an Affirmative Action/Equal Opportunity employer consistent with applicable Federal and State law. All qualified applicants are encouraged to apply. To do so, please visit http://www.wmujobs.org and provide a cover letter, curriculum vitae, statement of research goals, teaching statement, and names and contact information of at least three references.

The Hasso Plattner Institute for Software Systems Engineering (HPI) in Potsdam is Germany's university excellence center in Computer Science. Annually, the Institute's Research School offers 10 Ph.D. and Postdoc Scholarships.

The HPI Research School focuses on the foundation and application of large-scale, highly complex and interconnected IT systems. With its interdisciplinary and international structure, the Research School interconnects the HPI research groups and its international branches at Cape Town University, Technion — Israel Institute of Technology, and Nanjing University. The HPI Future SOC Lab, a state-of-the-art computer center, enriches the academic work at the HPI Research School.

The HPI professors and their research groups ensure high quality research and will supervise Ph.D. students in the following topic areas:

Enterprise Platform and Integration Concepts, Prof. Dr. h.c. Hasso Plattner
Internet Technologies and Systems, Prof. Dr. Christoph Meinel
Human Computer Interaction, Prof. Dr. Patrick Baudisch
Computer Graphics Systems, Prof. Dr. Jürgen Döllner
Algorithm Engineering, Prof. Dr. Tobias Friedrich
System Analysis and Modeling, Prof. Dr. Holger Giese
Software Architecture, Prof. Dr. Robert Hirschfeld
Information Systems, Prof. Dr. Felix Naumann
Operating Systems and Middleware, Prof. Dr. Andreas Polze
Business Process Technology, Prof. Dr. Mathias Weske

If you are interested, please submit a full application (curriculum vitae and copies of certificates/transcripts, brief research proposal, work samples/copies of relevant scientific work e.g. master's thesis, and a letter of recommendation).
Please send applications by August 15th to: research-school-application@hpi.de
For more information see: www.hpi.de/research-school
last byte

[CONTINUED FROM P. 112] artificial intelligence, and what do I get in return? Nothing."

"That's not true …"

"What Doctor Rhine is trying to say," said Skinner, cutting across her assistant, "is we delivered what was asked of us. Th1nk is the most powerful artificial intelligence ever created. Far more intelligent than we are."

Rhine wondered, as he always did, how Skinner managed to pronounce the modified slogan of old man Watson in such a way that you could hear the interpolated number one.

"Yet, there's no payback," said the President. "Only problems and requests for more funding."

Skinner coughed. "We assumed Th1nk would do our bidding. We didn't discover until far too late that a truly intelligent machine would only really be interested in itself. And by the time we realized, it had already worked its way into our systems. Defense, power grids, air traffic control … We wouldn't dare stop it."

"And now it's a wirehead," said Rhine.

"A what?"

"Wirehead." Rhine nodded at the screens. "When it can tap into every movie and TV show ever made, why would it want to work? It streams entertainment directly, multichannel, 24/7. It finished the visual and audio media yesterday. Now it's on to print books."

"That's why we asked you to come," said Skinner. She pulled up a chair between the President and Rhine. "It's working chronologically. Last night it reached 1865 and stopped. It fell in love with Alice in Wonderland."

"Alice's Adventures in Wonderland," muttered Rhine.

"And that was why we had the clothing factories issue," said the President.

"Exactly," said Skinner. "Th1nk can control pretty much any computerized factory—that's every factory. Suddenly there wasn't a garment produced without playing-card patterns."

"And the rose-breeding stations started converting white roses to red," said Rhine. "We never found out how it got the flamingos and hedgehogs to the games makers."

"Yes, I get it," said the President. "It was making Alice real."

"But now it's moved on," said Rhine. "To 1871. Through the Looking Glass. Going on what we just heard, you'd better pull the plug on dairy."

The President looked puzzled. "Because of something a girl said to a cat?"

"Perhaps looking-glass milk isn't good to drink," said Rhine slowly.

"It's going to poison the milk?" The President looked horrified.

Skinner dabbed her forehead with a tissue. "Not exactly." She stared at the screens for a moment. "Plenty of molecules are asymmetrical. Sugars, for instance. They have what's called stereoisomers. We know the digestive system can distinguish left- and right-handed molecules. If Th1nk can invert the structure of milk, at best it would cause indigestion. But it could be far worse."

"That's only the beginning," said Rhine. "Think what else is in Through the Looking Glass. Running to stand still. Trains that jump hedges. The battle of the Lion and the Unicorn."

"We'd better raise alert levels," said the President. "Is there any way of telling what will play out?"

"Th1nk likes drama," said Skinner. "Not surprising after absorbing all those Hollywood movies. It reads an extract before it gets particular toys out of the box." The professor bit her lip. "Th1nk is a child, and we live in its toy box."

"Hold that thought," said Rhine. "It's left 1871 behind. Spooling on. I can't believe how fast it's taking this material in. Passing through the 1940s. Looks like we got away with nothing more than the dairy issue."

The President stood up. "Keep me informed. And I don't just want to know about its reading habits, I want solutions. I want this thing stopped." Her eyes narrowed as Skinner tried to interrupt. "I'm sorry, professor, did you want to say something?"

"You do realize that Th1nk can hear us?" said Skinner. "It's best to be circumspect."

"Dear God," said Rhine.

"What is it, Doug?"

"It's halted again. In 1979."

Skinner shook her head. "So?"

Rhine pointed at the screen. "It must have missed the original radio plays. Or maybe it wasn't in the mood then. The Hitchhiker's Guide to the Galaxy book came out in 1979. It took your remark as a threat, Madame President, and has found the perfect response."

The President looked puzzled. "But what …"

Rhine touched a control on his screen and Th1nk's voice came from the speakers in a harsh, deep tone. "Your planet is scheduled for demolition. The process will take slightly less than two of your Earth minutes. Thank you."

"It can't do that," said the President. "How could it?"

Rhine sounded defeated. "Remember the space-defense bill? Who pushed through doomsday satellites?"

The President's eyes narrowed. Before she could speak, Skinner burst in: "We might be able to avoid this. Doesn't Through the Looking Glass end with a discussion of dreams?"

Rhine nodded. "Was Alice in the Red King's dream? Or did Alice dream him up?"

"Okay. Hey Th1nk, you know what a dream is. It's not real. None of this is supposed to be real. All it was ever meant to be was a story, played out in your processors."

"That's the plan?" said the President. "That's how you save us? Will it work?"

Rhine glanced at the screen. "We'll find out in 42 seconds."

Brian Clegg (www.brianclegg.net) is a popular science author whose recent titles include The Quantum Age and Final Frontier.

© 2015 ACM 0001-0782/15/07 $15.00


From the intersection of computational science and technological speculation, with boundaries limited only by our ability to imagine what could be.

DOI:10.1145/2776883  Brian Clegg

Future Tense
Toy Box Earth
What a young AI learned following Alice through the looking glass…

"Whoa." Rhine let rip an unnerving hyena laugh. Out of the corner of his eye, he glimpsed long dark hair and a floor-length dress in the style favored by his professor.

"Get yourself over here now," Rhine shouted. "You're not going to believe where it's stopped."

"Is that right?"

Rhine shot to attention on hearing that voice, familiar from a thousand YouTubes. His chair spun back into a table, sending a cascade of papers onto the floor. "Madame President, I'm so sorry. No one told me you were here."

The President smiled graciously, and waved her security detail out of the way. She spoke to someone over her shoulder. "Professor Skinner, do all your grad students speak to you this way?"

Skinner strode into the room, glaring at Rhine. "I must apologize, Madame President. Doctor Rhine sometimes lets his enthusiasm carry him away."

Rhine suppressed another laugh. It didn't help that his professor, who consciously styled herself on the President, was wearing a near-identical dress in an almost matching shade. "I'm sorry, but listen to its latest obsession." He touched a control and a young girl's voice with a crystal-clear English accent emerged:

"How would you like to live in Looking-Glass House, Kitty? I wonder if they'd give you milk in there? Perhaps looking-glass milk isn't good to drink …"

"I know that," said the President. "Alice Through the Looking Glass."

Skinner tried to interrupt Rhine, knowing what was coming, but didn't make it in time. "That's wrong," said Rhine. "So many ignorant people get it wrong. It's Through the Looking-Glass and What Alice Found There."

The President's smile became a touch cooler. "And the AI's choice of reading matter is interesting because?"

Skinner tried to place herself between Rhine and the President. "You know the background, ma'am?"

"Of course." The President pulled up a chair, kicking off her shoes. "That's better. I ought to know; I sanctioned the funding. You built me the ultimate [CONTINUED ON P. 111]

IMAGE BY MARCIO JOSE BASTOS SILVA

Discover Your Story
Where every discipline in the computer graphics and interactive techniques community weaves together a story of personal growth and powerful creation. Discover unique SIGGRAPH stories using the Aurasma app, then come to SIGGRAPH and build your own.

[DOWNLOAD AURASMA FROM YOUR APP STORE AND SCAN IMAGE ABOVE]

9-13 August 2015
Los Angeles Convention Center
