Peter Haber
Thomas Lampoltshammer
Manfred Mayr Eds.

Data Science –
Analytics
and Applications
Proceedings of the 1st International Data
Science Conference – iDSC2017
Editors
Peter Haber
Informationstechnik & System-Management
Fachhochschule Salzburg, Puch/Salzburg, Österreich

Manfred Mayr
Informationstechnik & System-Management
Fachhochschule Salzburg, Puch/Salzburg, Österreich

Thomas Lampoltshammer
Department für E-Governance in Wirtschaft und Verwaltung
Donau-Universität Krems, Krems an der Donau / Österreich

ISBN 978-3-658-19286-0 ISBN 978-3-658-19287-7 (eBook)


https://doi.org/10.1007/978-3-658-19287-7

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed
bibliographic data are available on the Internet at http://dnb.d-nb.de.

Springer Vieweg
© Springer Fachmedien Wiesbaden GmbH 2017
This work, including all of its parts, is protected by copyright. Any use that is not expressly
permitted by copyright law requires the prior consent of the publisher. This applies in particular to
reproductions, adaptations, translations, microfilming, and storage and processing in
electronic systems.
The use of common names, trade names, product designations, etc. in this work does not imply, even in the
absence of a specific statement, that such names are to be regarded as free under trademark and brand
protection legislation and may therefore be used by anyone.
The publisher, the authors and the editors assume that the statements and information in this work are
complete and correct at the time of publication. Neither the publisher nor the authors or the editors
give any warranty, express or implied, for the content of the work or for any errors or statements. The publisher
remains neutral with regard to geographical designations and territorial names in published maps and institutional
addresses.

Printed on acid-free and chlorine-free bleached paper

Springer Vieweg is part of Springer Nature
The registered company is Springer Fachmedien Wiesbaden GmbH
The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany
Preface

It is with deep satisfaction that we write this foreword for the Proceedings of the 1st International Data
Science Conference (iDSC) held in Salzburg, Austria, June 12th - 13th 2017. The conference program
and the resulting proceedings represent the efforts of many people. We want to express our gratitude
towards the members of our program committee as well as towards our external reviewers for their hard
work during the reviewing process.
iDSC proved itself to be an innovative conference, which gave its participants the opportunity to delve
into state-of-the-art research and best practice in the fields of Data Science and data-driven business
concepts. Our research track offered a series of presentations by Data Science researchers regarding
their current work in the fields of Data Mining, Machine Learning, Data Management, and the entire
spectrum of Data Science.
In our industry track, practitioners demonstrated showcases of data-driven business concepts and how
they use Data Science to achieve organisational goals, with a focus on manufacturing, retail, and
financial services. Within each of these areas, experts described their experience, demonstrated their
practical solutions, and provided an outlook into the future of Data Science in the business domain.
Besides these two parallel tracks, a European symposium on Text and Data Mining was integrated
into the conference. This symposium highlighted the EU project FutureTDM, granting insights into the
future of Text and Data Mining, and introducing overarching policy recommendations and
sector-specific guidelines to help stakeholders overcome the legal and technical barriers, as well as the
lack of skills that have been identified.
Our sponsors had their own, special platform via workshops to provide hands-on interaction with tools
or to learn approaches towards concrete solutions. In addition, an exhibition of products and services
offered by our sponsors took place throughout the conference, with the opportunity for our participants
to seek contact and advice.
Completing the picture of our program, we proudly presented keynote presentations from leaders in
Data Science and data-driven business, both researchers and practitioners. These keynotes provided all
participants the opportunity to come together and share views on challenges and trends in Data Science.
In addition to the contributed papers, five invited keynote presentations were given by: Euro Beinat (CS
Research, Salzburg University), Mario Meir-Huber (Microsoft Austria), Mike Olson (Cloudera), Ralf
Klinkenberg (RapidMiner) and Janek Strycharz (Digital Center Poland). We thank the invited speakers
for sharing their insights with our community.
The conference chair John Thompson also helped us in many ways in setting up the industry track, for
which we are grateful. We would especially like to thank our two colleagues, Astrid Karnutsch and
Maximilian Tschuchnig, for their enormous and constructive commitment to organizing and conducting
the conference. The paper submission and reviewing process was managed using the EasyChair system.
These proceedings will provide scientists and practitioners with an excellent reference to current
activities in the Data Science domain. We also trust that they will provide an impetus for further
studies, research activities and applications in all areas discussed, supported by our publisher
Springer Vieweg, Wiesbaden, Germany.
Finally, the conference would not have been possible without the excellent papers contributed by our
authors. We thank them for their contributions and their participation at iDSC 2017.

Peter Haber, Thomas Lampoltshammer and Manfred Mayr


Conference Chairs
Future TDM Symposium Recap

FutureTDM is a European project focusing on reducing barriers and increasing uptake of Text and
Data Mining (TDM) for research environments in Europe. The outcomes of the project were
presented at the Symposium, which also served to connect key actors and interest groups and to
promote open dialogue via discussion panels and informal workshops. The FTDM Symposium
was scheduled alongside iDSC 2017, given that both events address similar target groups and
share a common perspective: both aimed at creating a communication network among the
members of the TDM community, where experts can exchange ideas and share the most up-to-date
research results, as well as legal and industrial advances relevant to TDM. The audience targeted
by the iDSC conference was the broad community of researchers and industry practitioners as well
as other stakeholders, making it ideal for disseminating the project’s results.
The project’s objective has been to detect the barriers to TDM, reveal best practices and put
together sets of recommendations for TDM practitioners through a collaborative knowledge and
open information approach. The barriers recorded were grouped around four pillars: a) legal, b)
economic, c) skills, d) technical. These categories emerged after discussions with the respective
stakeholders, such as researchers, developers, publishers and SMEs, during Knowledge Cafés run
across Europe (the Netherlands, the United Kingdom, Italy, Slovenia, Germany, Poland, etc.) and
two workshops held in Brussels¹ (on 27 September 2016 and 29 March 2017).
The Symposium² was a chance to invite experts from all over Europe to share their experience and
expertise in different domains. It was also a great opportunity to announce the guidelines and
recommendations formulated in order to increase TDM uptake. It started with a brief introduction
by Bernhard Jäger (SYNYO)³ underlining the need to bring together different groups of
stakeholders, such as policy makers and legislators, developers and users, who would benefit from
the project’s findings and the respective recommendations formed by the FTDM working groups.
It continued with a keynote speech by Janek Strycharz (Projekt Polska Foundation) dedicated to
the Economic Potential of Data Analytics. Janek Strycharz elaborated on different types of Big
Data and the variety of possibilities they offer, and explained how Big Data and TDM could bring
benefits at both a global and a European scale (the European GDP alone would be
increased by USD 200 billion).

¹ FutureTDM Workshop I and II outcomes can be found at
http://www.futuretdm.eu/knowledge-cafes/futuretdm-workshop/ and
http://www.futuretdm.eu/knowledge-cafes/futuretdm-workshop-2/
² All presentation slides are available online at www.slideshare.net/FutureTDM/presentations
³ The presentation "Introduction to the FutureTDM project" is available at
https://www.slideshare.net/FutureTDM/introduction-to-the-future-tdm-project

The first session entitled “Data Analytics and the Legal Landscape: Intellectual Property and
Data Protection” included Freyja van den Boom, researcher from Open Knowledge
International/Content Mine who presented the legal barriers identified and the respective
recommendations created under the subject "Dealing with the legal bumps on the road to further
TDM uptake". The focus of the presentation was on the principles identified to counterbalance
barriers: Awareness and Clarity, TDM Without Boundaries, and Equitable Access. The session
was chaired by Ben White (Head of Intellectual Property at the British Library) and included the
following panelists: i) Duncan Campbell (John Wiley & Sons, Inc.), representing the publisher’s
perspective, ii) Prodromos Tsiavos (Onassis Cultural Centre/IP Advisor), providing an
organization’s point of view, iii) Marie Timmermann (Science Europe), offering her point of view
as the EU Legislation and Regulatory Affairs Officer and iv) Romy Sigl (AustrianStartups) sharing
her experience from startups. The discussion revolved around regulations which must address the
implementation of the law and its exceptions, copyright issues, the distinction between commercial
and noncommercial activities, the need for better communication between different groups of
stakeholders and the importance and value of TDM for publishers.
During the following session the projects ContentMine (Stefan Kasberger), PLAZI (Donat Agosti),
CORE (Petr Knoth), RapidMiner (Ralf Klinkenberg), clarin:el (Maria Gavrilidou) and ALCIDE
(Alessio Palmero Aprosio) were introduced, and the presenters were available to present their work
in more detail to attendees interested in learning more. The researchers shared their experience of
the technical and legal problems they had encountered and demonstrated the TDM applications and
infrastructures they had created.
The next session offered an overview of FTDM case studies from Startups to Multinationals.
The presentation entitled "Stakeholder consultations - The Highlights" was given by Freyja van
den Boom (Open Knowledge International/Content Mine) who talked about the findings from
continuous stakeholder consultations throughout the project. The session was chaired by Maria
Eskevich (Radboud University) and included as panelists Donat Agosti (PLAZI), Petr Knoth
(CORE), Kim Nilsson (PIVIGO), and Peter Murray-Rust (ContentMine). The issues raised during
discussion pinpointed the need for realistic solutions to infrastructures, community engagement,
and open source and data.
Kiera McNeice (British Library) was the presenter in the fourth session and her presentation was
entitled "Supporting TDM in the Education Sector". The session focusing on “Universities, TDM
and the need for strategic thinking on educating researchers” was chaired by Ben White (Head
of Intellectual Property at the British Library) and included the panelists Claire Sewell (Cambridge University
Library), Jonas Holm (Stockholm University Library), and Kim Nilsson (PIVIGO). The discussion
which followed touched upon issues such as the future of Data Science and the nature of Data
Scientists. Some of the key concepts which were discussed were that of inclusion and diversity,
gender imbalance and nationality characteristics, which all affect access to Data Science and the
ability to become a Data Scientist. Concerns were expressed as to whether anyone could become a
Data Scientist, and whether the focus should be on becoming a Data Scientist or a more efficient
TDM user.
The challenges and solutions regarding technologies and infrastructures supporting Text and
Data Analytics were the topic of the fifth session, the main presenter of which was Maria Eskevich
(Radboud University). She focused on "The TDM Landscape: Infrastructure and Technical
Implementation" and touched upon the business and scientific perspectives on TDM by showing
the investment made by the EU in the five economic sectors. She also talked about the
barriers/challenges encountered in terms of accessibility and interoperability of infrastructures,
sustainability of data and digital readiness of language resources. The following discussion,
chaired by Stelios Piperidis (ARC) with Mihai Lupu (Data Market Austria), Maria Gavrilidou
(clarin:el) and Nelson Silva (Know-Center), revolved around real TDM problems and the solutions
the researchers came up with, and closed with the requirements of an effective TDM infrastructure.

The final session of the Symposium was dedicated to the Next Steps: A Roadmap to promoting
greater uptake of Data Analytics in Europe. A presentation was made by Kiera McNeice
(British Library) who briefly summarised what the project had achieved so far and focussed on the
key principles from the FutureTDM Policy Framework⁴ which must underlie all future efforts
in Legal Policies, Skills and Education, Economy and Incentives, and Technical
and Infrastructure.
The Symposium closed with a presentation by Bernhard Jäger and Burcu Akinci (SYNYO) of the
FutureTDM platform (http://www.futuretdm.eu/), which is populated with the project outcomes
and findings. The platform will continue to exist after the end of the project and will be
continuously revised and updated in order to maintain a coherent and up-to-date view of the TDM
landscape open to the public.

Kornella Pouli
Athena RIC/ILSP, Athens

Burcu Akinci
SYNYO GmbH, Vienna

⁴ http://www.futuretdm.eu/policy-framework/
Organisation

Organising Institutions
Salzburg University of Applied Sciences
Information Professionals GmbH

Conference Chairs
Peter Haber Salzburg University of Applied Sciences
Thomas J. Lampoltshammer Danube University Krems
Manfred Mayr Salzburg University of Applied Sciences
John A. Thompson Information Professionals GmbH

Organising Committee
Peter Haber Salzburg University of Applied Sciences
Astrid Karnutsch Salzburg University of Applied Sciences
Thomas J. Lampoltshammer Danube University Krems
Manfred Mayr Salzburg University of Applied Sciences
John A. Thompson Information Professionals GmbH
Susanne Schnitzer Information Professionals GmbH
Maximilian E. Tschuchnig Salzburg University of Applied Sciences

Program Committee
David C. Anastasiu San Jose State University
Vera Andrejcenko University of Antwerp
Christian Bauckhage University of Bonn
Markus Breunig Rosenheim University of Applied Sciences
Stefanie Cox IT Innovation Centre
Werner Dubitzky University of Ulster, Coleraine
Gnther Eibl Salzburg University of Applied Sciences
Sleyman Eken University Kocaeli
Karl Entacher Salzburg University of Applied Sciences
Edison Pignaton de Freitas Federal University of Rio Grande do Sul
Bernhard Geissler Danube University Krems
Charlotte Gerritsen Netherlands Institute for the Study of Crime and Law Enforcement (NSCR)
Mohammad Ghoniem Luxembourg Institute of Science and Technology
Peter Haber Salzburg University of Applied Sciences
Johann Hchtl Danube University Krems
Martin Kaltenbck Semantic Web Company
Astrid Karnutsch Salzburg University of Applied Sciences
Elmar Kiesling Vienna University of Technology
Robert Krimmer University of Tallinn
Peer Krger Ludwig-Maximilians-Universitt Mnchen
Thomas J. Lampoltshammer Danube University Krems
Michael Leitner Louisiana State University
Giuseppe Manco University of Calabria
Manfred Mayr Salzburg University of Applied Sciences
Mark-David McLaughlin Bentley University
Robert Merz Salzburg University of Applied Sciences
Elena Lloret Pastor University of Alicante
Cody Ryan Peeples Cisco
Gabriela Viale Pereira Fundação Getúlio Vargas – EAESP
Peter Ranacher University of Zurich
Siegfried Reich Salzburg Research Forschungsgesellschaft mbH
Eric Rozier Iowa State University
Johannes Scholz Graz University of Technology
Maximilian E. Tschuchnig Salzburg University of Applied Sciences
Jrgen Umbrich Vienna University of Economics and Business
Andreas Unterweger Salzburg University of Applied Sciences
Eveline Wandl-Vogt Austrian Academy of Sciences
Stefan Wegenkittl Salzburg University of Applied Sciences
Stefanie Wiegand IT Innovation Centre / University of Southampton
Peter Wild Austrian Institute of Technology
Radboud Winkels University of Amsterdam
Anneke Zuiderwijk - van Eijk Delft University of Technology

Reviewers
David C. Anastasiu San Jose State University
Christian Bauckhage University of Bonn
Markus Breunig Rosenheim University of Applied Sciences
Cornelia Ferner Salzburg University of Applied Sciences
Werner Dubitzky University of Ulster, Coleraine
Gnther Eibl Salzburg University of Applied Sciences
Karl Entacher Salzburg University of Applied Sciences
Bernhard Geissler Danube University Krems Hchtl
Martin Kaltenbck Semantic Web Company
Peer Krger Ludwig-Maximilians-Universitt Mnchen
Thomas J. Lampoltshammer Danube University Krems
Michael Leitner Louisiana State University
Elena Lloret Pastor University of Alicante
Manfred Mayr Salzburg University of Applied Sciences
Robert Merz Salzburg University of Applied Sciences
Edison Pignaton de Freitas Federal University of Rio Grande do Sul
Siegfried Reich Salzburg Research Forschungsgesellschaft mbH
Eric Rozier Iowa State University
Johannes Scholz Graz University of Technology
Maximilian E. Tschuchnig Salzburg University of Applied Sciences
Jrgen Umbrich Vienna University of Economics and Business
Andreas Unterweger Salzburg University of Applied Sciences
Stefan Wegenkittl Salzburg University of Applied Sciences
Sponsors of the conference

Platinum Sponsors

Cloudera GmbH
Apache Hadoop-based software, support and services, and training
www.cloudera.com

Silver Sponsors

The unbelievable Machine Company GmbH
Full-service provider for Big Data, cloud
services & hosting
www.unbelievable-machine.com

F&F GmbH
IT consulting, solutions and Big Data Analytics
www.fff-muenchen.de
www.ff-muenchen.de

The MathWorks GmbH


Mathematical computing software
www.mathworks.com

RapidMiner GmbH
Data science software platform for data preparation, machine learning, deep learning,
text mining, and predictive analytics
www.rapidminer.com

ITG: innovative consulting and location development
ITG is Salzburg’s innovation centre
www.itg-salzburg.at

Table of Contents

German Abstracts ................................................................................................................................. 1


Full Papers – Double Blind Reviewed ................................................................................................. 9
Reasoning and Predictive Analytics................................................................................................... 11
Circadian Cycles and Work Under Pressure:
A Stochastic Process Model for E-learning Population Dynamics .......................................... 13
César Ojeda, Rafet Sifa and Christian Bauckhage
Investigating and Forecasting User Activities in Newsblogs:
A Study of Seasonality, Volatility and Attention Burst ........................................................... 19
César Ojeda, Rafet Sifa and Christian Bauckhage
Knowledge-based Short-Term Load Forecasting for Maritime Container Terminals.............. 25
Norman Ihle and Axel Hahn
Data Analytics in Community Networks ........................................................................................... 31
Beyond Spectral Clustering:
A Comparative Study of Community Detection for Document Clustering.............................. 33
César Ojeda, Rafet Sifa, Kostadin Cvejoski and Christian Bauckhage
Third Party Effect: Community Based Spreading in Complex Networks ................................ 39
César Ojeda, Shubham Agarwal, Rafet Sifa and Christian Bauckhage
Cosine Approximate Nearest Neighbors .................................................................................. 45
David C. Anastasiu
Data Analytics through Sentiment Analysis ..................................................................................... 51
Information Extraction Engine for Sentiment-Topic Matching
in Product Intelligence Applications ........................................................................................ 53
Cornelia Ferner, Werner Pomwenger, Stefan Wegenkittl, Martin Schnöll, Veronika
Haaf and Arnold Keller
Towards German Word Embeddings: A Use Case with Predictive Sentiment Analysis ......... 59
Eduardo Brito, Rafet Sifa, Kostadin Cvejoski, César Ojeda and Christian Bauckhage
User/Customer-centric Data Analytics .............................................................................................. 63
Feature Extraction and Large Activity-Set Recognition Using Mobile Phone Sensors ........... 65
Wassim El Hajj, Ghassen Ben Brahim, Cynthia El-Hayek and Hazem Hajj
The Choice of Metric for Clustering of Electrical Power Distribution Consumers ................. 71
Nikola Obrenović, Goran Vidaković and Ivan Luković
Evolution of the Bitcoin Address Graph .................................................................................. 77
Erwin Filtz, Axel Polleres, Roman Karl and Bernhard Haslhofer

Data Analytics in Industrial Application Scenarios ......................................................................... 83


A Reference Architecture for Quality Improvement in Steel Production ................................ 85
David Arnu, Edwin Yaqub, Claudio Mocci, Valentina Colla, Marcus Neuer, Gabriel
Fricout, Xavier Renard, Christophe Mozzati and Patrick Gallinari
Anomaly Detection and Structural Analysis in Industrial Production Environments .............. 91
Martin Atzmueller, David Arnu and Andreas Schmidt
Semantically Annotated Manufacturing Data to support Decision Making in Industry 4.0:
A Use-Case Driven Approach .................................................................................................. 97
Stefan Schabus and Johannes Scholz
Short Papers and Student Contributions ........................................................................................ 103
Improving Maintenance Processes with Data Science
How Machine Learning Opens Up New Possibilities ............................................................ 105
Dorian Prill, Simon Kranzer, Robert Merz
ouRframe - A Graphical Workflow Tool for R ...................................................................... 109
Marco Gruber, Elisabeth Birnbacher and Tobias Fellner
Sentiment Analysis - A Students Point of View..................................................................... 111
Hofer Dominik
German Abstracts

Reasoning and Predictive Analytics

Circadian Cycles and Work Under Pressure: A Stochastic Process Model for E-learning
Population Dynamics
Web analytics techniques, designed to quantify patterns of Internet usage, allow a deeper
understanding of human behaviour. Recent models of human behavioural dynamics have shown that,
in contrast to randomly distributed events, people engage in activities that exhibit bursty
behaviour. Participation in online courses in particular often shows periods of inactivity and
procrastination followed by frequent visits shortly before exams. Here we propose a stochastic
process model that characterises such patterns and incorporates the circadian rhythms of human
activity. We evaluate our model on real data collected over a period of two years on a platform for
university courses. We then propose a dynamic model that accounts for both periods of
procrastination and periods of working under time pressure. Since circadian rhythms and
procrastination-pressure cycles are fundamental to human behaviour, our method can be extended
to other activities, such as the analysis of browsing habits and customer purchasing behaviour.

Investigating and Forecasting User Activities in Newsblogs: A Study of Seasonality,
Volatility and Attention Burst
The study of collective attention is a major topic in web science, as we want to know how the
popularity of a particular news topic or meme rises or falls over time. Recent research has focused
on developing methods to quantify the success and popularity of topics and has studied their
dynamics over time. However, aggregate user behaviour across content creation platforms has
largely been ignored, even though the popularity of news articles is also linked to the way users
use Internet platforms. In this paper we present a novel framework that studies the shift of
attention of populations with respect to news blogs. We concentrate on the commenting behaviour
of users on news posts, which serves as a proxy for attention towards Internet content. We use
methods from signal processing and econometrics to uncover behavioural patterns of users, which
then allow us to simulate the behaviour of a population and ultimately to predict when a shift of
attention occurs. After studying data sets from over 200 blogs with 14 million news posts, we
identified cyclic regularities in commenting behaviour: activity cycles of 7 days and 24 days,
which may be related to known scales of meme lifetimes.

Knowledge-based Short-Term Load Forecasting for Maritime Container Terminals


With the rise of load and demand management in modern energy systems, short-term load
forecasting for individual industrial customers is receiving more and more attention. For industrial
sites it appears worthwhile to incorporate knowledge about planned operations into the next-day
load forecasting process. In the case of a maritime container terminal, these operation plans are
based on the list of arriving and departing ships. This paper presents two approaches that
incorporate this knowledge in different ways: while case-based reasoning learns lazily during the
forecasting process, artificial neural networks must first be trained before a forecast can be made.
It can be shown that incorporating more knowledge into the forecasting process yields better
results in terms of forecast accuracy.

Data Analytics in Community Networks

Beyond Spectral Clustering: A Comparative Study of Community Detection for
Document Clustering
Document clustering is a ubiquitous problem in data mining, as text data is one of the most
common forms of communication. The richness of the data requires methods tailored to different
tasks, depending on the characteristics of the information to be mined. Recently, graph-based
methods have been developed that allow hierarchical, fuzzy and non-Gaussian density features to
identify structures in complicated data sets. In this paper we present a new methodology for
document clustering based on a graph defined by a vector space model. We use an overlapping
hierarchical algorithm and show the equivalence of our quality function with that of Ncut. We
compare our method with spectral clustering and other graph-based models and find that our
method is a good and flexible alternative for news clustering when fine-grained details between
topics are required.

Third Party Effect: Community Based Spreading in Complex Networks


A substantial part of network research has been devoted to the study of spreading processes and
community detection, without considering the role of communities in the characteristics of the
spreading processes. Here we generalise the SIR model of epidemics by introducing a matrix of
community infection rates in order to capture the heterogeneous nature of spreading, which is
defined by the natural characteristics of communities. We find that the spreading capability of one
community towards another is influenced by the internal behaviour of third-party communities.
Our results offer insights into systems with rich information structures and into populations with
diverse immune responses.

Cosine Approximate Nearest Neighbors


Cosine similarity graph construction, or all-pairs similarity search, is an important kernel in many data mining and machine learning methods. Graph construction is a difficult task: up to n^2 pairs of objects would naively have to be compared to solve the problem for a set of n objects. For large object sets, approximate solutions have been proposed that address the complexity of the task by retrieving most, but not necessarily all, of the nearest neighbors. We propose a new approximate graph construction method that combines properties of the object vectors to effectively select fewer comparison candidates that are likely to be neighbors. In addition, our method incorporates recently developed filtering strategies to quickly eliminate unpromising candidates, which leads to fewer full similarity computations and higher efficiency. We compare our method against several state-of-the-art approximate and exact baselines on six real-world data sets. Our results show that our approach offers a good tradeoff between efficiency and effectiveness, with a 35.81-fold efficiency improvement over the best alternative at 0.9 recall.

Data Analytics through Sentiment Analysis

Information Extraction Engine for Sentiment-Topic Matching in Product Intelligence Applications

Product reviews are a valuable source of information for both companies and customers. While companies use this information to improve their products, customers need it to support their decision making. Many online shops use ratings, comments, and additional information to encourage potential customers to buy on their site. However, current online reviews lack a short summary of how well particular product components meet customer expectations, which makes product comparison difficult. We have therefore developed a product information tool that combines common technologies in a natural language processing engine. The engine is able to collect and store product-related online data and to extract metadata and opinions. The engine is applied to technical online product reviews for component-level sentiment analysis. The fully automated process searches the Internet for expert reviews that refer to product components and aggregates the sentiment values of the reviews.

Towards German Word Embeddings: A Use Case with Predictive Sentiment Analysis

Despite the research boom in word embeddings and their text mining applications in recent years, the majority of publications focus exclusively on the English language. Moreover, hyperparameter tuning is a process that is rarely well documented (especially for non-English texts), yet it is very important for obtaining high-quality word representations. In this work, we show how different hyperparameter combinations affect the resulting German word vectors and how these word representations can be part of a more complex model. Specifically, we first perform an intrinsic evaluation of our German word embeddings, which are later used in a predictive sentiment analysis model. The latter serves not only as an intrinsic evaluation of the German word embeddings, but also shows whether customer wishes can be predicted from document embeddings alone.

User/Customer-centric Data Analytics

Feature Extraction and Large Activity-Set Recognition Using Mobile Phone Sensors

This work addresses the problem of activity recognition using data collected from the user's mobile phone. We begin by reviewing and assessing the limitations of common activity recognition approaches for mobile phones. We then present our approach to recognizing a large number of activities that covers most user activities. In addition, different environments are supported, such as at home, at work, and on the move. Our approach proposes a single-level classification model that classifies activities accurately, covers a large number of activities comprehensively, and is feasible to apply in real environments. No single approach in the literature combines all three properties. As a rule, existing approaches optimize their models for one or at most two of the following properties: accuracy, coverage, and applicability. Our results show that our approach achieves sufficient performance in terms of accuracy on a realistic data set, despite a considerably larger number of activities than in common activity recognition models.

The Choice of Metric for Clustering of Electrical Power Distribution Consumers


An important part of every system data model for power distribution management is a model of load types. A load type represents the typical load behavior of a group of similar customers, e.g. a group of residential, industrial, or commercial customers. A common method of creating load types is to cluster individual energy consumers on the basis of their annual electricity consumption. To achieve a satisfactory level of load type quality, choosing an appropriate similarity measure for clustering is crucial. In this paper, we present a comparison of different metrics used as similarity measures in our load type creation process. In addition, we present a new metric, which is also included in the comparison. The metrics and the quality of the load types created with them are examined using real data sets collected via smart meters in the distribution network.

Evolution of the Bitcoin Address Graph


Bitcoin is a decentralized virtual currency that can be used to execute pseudo-anonymous payments worldwide within a short time and with comparatively low transaction costs. In this paper, we present the first results of a longitudinal study of the Bitcoin address graph, which contains all addresses and transactions from the launch of Bitcoin in January 2009 until 31 August 2016. Our analysis reveals a highly skewed degree distribution with a small number of outliers and shows that the entire graph is expanding rapidly. It also demonstrates the power of address clustering heuristics for identifying real-world actors who prefer to use Bitcoin for transferring value rather than storing it. We expect that this paper offers new insights into virtual currency ecosystems and can serve as a basis for the design of future analysis methods and infrastructures.

Data Analytics in Industrial Application Scenarios

A Reference Architecture for Quality Improvement in Steel Production


There is increased worldwide demand for steel, but steel production is an enormously demanding and cost-intensive process in which good quality is hard to achieve. Improving quality is still the greatest challenge facing the steel industry. The EU project PRESED (Predictive Sensor Data Mining for Product Quality Improvement) takes on this challenge by focusing on widespread, recurring problems. The variety and veracity of the data, as well as the changing properties of the material under study, make the data difficult to interpret. In this paper, we present the PRESED reference architecture, which was specifically designed to address the central concerns of data management and operationalization. The architecture combines big and smart data concepts with data mining algorithms. Data preprocessing and predictive analysis tasks are supported by a flexible data model. The approach allows users to design processes and evaluate multiple algorithms that specifically address the problem at hand. The concept comprises storing and using complete production data instead of relying on aggregated values. First data modeling results show that detailed preprocessing of time series data through feature detection and prediction offers superior insights compared with traditionally used aggregation statistics.

Anomaly Detection and Structural Analysis in Industrial Production Environments


Detecting anomalous behavior can be of critical importance in industrial applications. While modern production plants are equipped with sophisticated alarm management systems, these mainly react to single events. Because of the large number and variety of data sources, a unified approach to anomaly detection is not always possible. One widespread type of data is log entries of alarm messages. Compared with raw sensor data, they allow a higher level of abstraction. In an industrial production scenario, we use sequential alarm data for anomaly detection and analysis, based on first-order Markov chain models. We outline hypothesis-driven and description-oriented modeling options. In addition, we provide an interactive dashboard for exploring and presenting the results.

Semantically Annotated Manufacturing Data to support Decision Making in Industry 4.0: A Use-Case Driven Approach

Smart manufacturing, or Industry 4.0, is a key concept for increasing productivity and quality in industrial manufacturing companies through automation and data-driven methods. Smart manufacturing draws on theories of cyber-physical systems, the Internet of Things, and cloud computing. In this paper, the authors focus on ontology and (spatial) semantics as technologies for ensuring semantic interoperability of manufacturing data. In addition, the paper proposes structuring manufacturing-relevant data by introducing geography and semantics as ordering formats. The approach pursued in this paper stores manufacturing data from different IT systems in a graph database. During the data integration process, the system systematically annotates the data, based on an ontology developed for this purpose, and attaches spatial information. The approach presented in this paper enables an analysis of manufacturing data with respect to semantics and the spatial dimension. The methodology is applied to two use cases for a semiconductor manufacturing company. The first use case deals with data analysis for incident analysis using semantic similarities. The second use case supports decision making in the manufacturing environment by identifying potential bottlenecks in the semiconductor production line.
Full Papers – Double Blind Reviewed
Reasoning and Predictive Analytics
Circadian Cycles and Work Under Pressure:
A Stochastic Process Model for E-learning
Population Dynamics
César Ojeda∗, Rafet Sifa∗† and Christian Bauckhage∗†
∗ Fraunhofer IAIS, St. Augustin, Germany
† University of Bonn, Germany

Abstract—Web analytics techniques designed to quantify Web usage patterns allow for a deeper understanding of human behavior. Recent models of human behavior dynamics have shown that, in contrast to randomly distributed events, people engage in activities which show bursty behavior. In particular, participation in online courses often shows periods of inactivity and procrastination followed by frequent visits shortly before examination deadlines. Here, we propose a stochastic process model which characterizes such patterns and incorporates circadian cycles of human activities. We validate our model against real data spanning two years of activity on a university course platform. We then propose a dynamical model which accounts for both periods of procrastination and work under pressure. Since circadian and procrastination-pressure cycles are fundamental to human activities, our method can be extended to other tasks such as analyzing browsing behaviors or customer purchasing patterns.

I. INTRODUCTION

Over the past few years, several platforms for open online courses have been launched that cater to the demand for continuous learning in the knowledge society. Benefits of these systems are that they allow a single professor to reach thousands of students, facilitate personalized learning, and, last but not least, allow for gathering information as to the behavior of large populations of students. The latter makes it possible to track learning progress and to automatically recommend content so as to optimize the learning experience.

Among the many data collected, visitation patterns play an especially crucial role and were found to reflect priority based decision making behaviors of human agents [1]–[4]. Such decision making phenomena are especially characteristic for online courses on university eLearning platforms where students are provided with the course material and are expected to cover it over the period of a semester. At the same time, research on online communication patterns has shown that inter-event times can be characterized in terms of inhomogeneous Poisson processes where the Poisson rate λ changes over time so as to account for the circadian cycles and weekly cycles in human activity [5]. In this paper, we therefore consider the use of such models in analyzing the behavior dynamics of a population engaged in online courses. We extend the inhomogeneous Poisson process and incorporate a dynamic equation that accounts for the sudden change in attention as the population reacts to a given deadline.

A. Empirical Basis

The empirical basis for our work in this paper consists of population behavior data collected from a course management system of an anonymous German university. In total, the system provides access to 1,147 different online lectures from 115 different courses. Our data set captures a total of 186,658 anonymized, time-stamped visits from 30,497 different IP addresses covering the four semesters in the time from April 2012 to March 2014. For each course covered in our data, students attend weekly lectures and exercises. Whenever a deadline for course work is scheduled, we typically observe students to react in terms of increasingly frequent visits to the course site. Immediately after each deadline, however, access counts drop significantly and this pattern tends to persist throughout the duration of the course. In addition to behaviors induced by course specific deadlines, we observe a fluctuating visiting rate caused by the personal schedules of students. Finally, prior to the final examinations, we typically observe highly frequent visits to course sites where, for some courses, the time in which students react to the final deadline spans several weeks while, for other courses, it is of the order of days. From an abstract point of view, increased visits prior to deadlines for course work and examination provide an example of a well known decision based queuing process [1] since deadlines cause priorities of students to shift. Looking at our data on a finer, say, daily level, we can observe how students allocate time to common activities such as eating, resting, or sleeping. These natural activities cause idle periods in our data where site visitation rates drop and we note that such periods cannot be explained in terms of simple Poisson processes.

The examples in Fig. 1 illustrate these general behaviors. In particular, the figure shows three proxies: the number of visits to a course site, time spent on the course (working time in seconds), and number of different media (videos, course notes, ...) viewed per day. For better comparison, we rescaled each proxy to the same maximum value. Apparently, these proxies only vary minimally in the diurnal cycles from lecture to deadline periods.

B. Contributions

Addressing the problem of modeling the population behavior on eLearning sites, our main contribution in this paper is to introduce a model of human behavior dynamics. In

© Springer Fachmedien Wiesbaden GmbH 2017


P. Haber et al. (Hrsg.), Data Science – Analytics and
Applications, http://doi.org/10.1007/978-3-658-19287-7_1
14 C. Ojeda, R. Sifa and C. Bauckhage

[Figure 1: (a) distribution of proxies over a semester (different media, number of visits, work time; y-axis: Rescaled Proxy, x-axis: Semester Dates); (b) average distribution of visits per day (Distribution of Hourly Visits, 0:00 to 21:00, for calm and deadline periods).]
Fig. 1. Example of the temporal distribution of several proxies for the activity on a course related site in an eLearning system.

the following, we will refer to this model as a Piecewise Procrastination Reaction Cascade (PPRC), which allows for answering the following questions: a) Given a particular university course, can we describe the inter-event time distribution of visits to the system? b) Given a course, can we predict from the behavior observed in the initial stages of the reaction period whether a population will cover most of the material required?

II. MODEL DEFINITION

Our goal is to model the visiting behavior of a population of students to a video lecture platform over the period of a semester in which students access course content which consists of video lectures and reading material. In particular, we aim at representing the behavior in the time span between the beginning of the semester and the final examination. Should a course require several examinations, we model the behavior over the period prior to the "most important examination" (defined by the amount of course content to be covered). We assume that there are N_u different users and that their number remains fixed during a semester. In order to be of practical use, our model should account for:
1) circadian cycles of human activity
2) the stochastic nature of aggregated behaviors of many different users
3) the tendency of students to learn most of the material close to the deadline
4) the short term behavior of a session of study

A. Inhomogeneous Poisson Process

The visiting pattern of the population is defined by a set of point data in a one dimensional domain S ⊂ R. The elements of S will be called t_i and indicate the number of seconds which elapsed from the beginning of the semester to the point in time at which visit i ∈ {1, ..., N} occurs, where N is the total number of visits in the semester for the course we are trying to model.

Poisson processes are widely used for modeling point data. In particular, if the rate of arrivals changes over time, inhomogeneous Poisson processes allow for a variation in the arrival rate, defined through an intensity function

\[ \lambda(t) : S \to \mathbb{R}^{+}. \tag{1} \]

This intensity function contains information about the visiting behavior, because the probability of the number of visits between time t and time t + δ, Pr{Ñ(t, t + δ)}, must, by definition [6], satisfy

\[ \Pr\{\tilde{N}(t, t+\delta) = 0\} = 1 - \delta\lambda(t) - o(\delta) \tag{2} \]
\[ \Pr\{\tilde{N}(t, t+\delta) = 1\} = \delta\lambda(t) + o(\delta) \tag{3} \]

for all t ≥ 0 and some vanishingly small δ. The probability of a visit to occur at time t then depends on the value of λ at t and, by imposing conditions on λ, one can devise inhomogeneous Poisson process models of a wide variety of behaviors. In the following, we incorporate the known empirical behavior by characterizing the intensity function.

B. Circadian Cycles

As seen in Fig. 1(b), circadian cycles influence human behavior over the course of a day and, not surprisingly, lead to reduced visits to eLearning sites during night times.

We incorporate this kind of prior knowledge using empirically determined histograms of hourly visits as shown in the figure and define a function P_d(t) which indicates the probability of a visit to occur at time t (we need to evaluate the hour of the day for the value of t). We define P_d to be periodic, P_d(t + τ_d) = P_d(t), where the period τ_d corresponds to the number of seconds in a day. Given these preliminaries, we can then incorporate P_d in the behavior of the Poisson rate using

\[ \lambda(t) = V_d(t)\,P_d(t) \tag{4} \]

where V_d(t) indicates the rate of visits per day. Next, we introduce a dynamical model for this rate.
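
The circadian modulation in (4) can be illustrated with a minimal sketch. It assumes visit times given in seconds since the start of the semester and estimates P_d as a 24-bin hourly histogram; the function names, the synthetic visit data and the constant daily rate are hypothetical and only serve the illustration.

```python
import random

SECONDS_PER_DAY = 24 * 3600  # the period tau_d of P_d

def hourly_profile(visit_times):
    """Empirical P_d: fraction of all visits that fall into each hour of the day."""
    counts = [0] * 24
    for t in visit_times:
        counts[int(t % SECONDS_PER_DAY) // 3600] += 1
    total = sum(counts)
    return [c / total for c in counts]

def intensity(t, daily_rate, profile):
    """lambda(t) = V_d(t) * P_d(t); P_d is periodic, so only the hour of day matters.

    With an hourly histogram the result is the expected number of visits
    in the hour that contains t.
    """
    hour = int(t % SECONDS_PER_DAY) // 3600
    return daily_rate(t) * profile[hour]

# Hypothetical data: three days of visits clustered around 14:00.
rng = random.Random(0)
visits = [day * SECONDS_PER_DAY + rng.gauss(14 * 3600, 7200)
          for day in range(3) for _ in range(200)]
visits = [t for t in visits if t >= 0]
profile = hourly_profile(visits)

def constant_rate(t):
    """Placeholder for V_d(t): 50 visits per day."""
    return 50.0

night = intensity(3 * 3600, constant_rate, profile)
afternoon = intensity(14 * 3600 + SECONDS_PER_DAY, constant_rate, profile)
```

Because P_d is looked up by hour of day, shifting t by a whole day leaves the intensity unchanged whenever the daily rate is the same, which is the periodicity required above.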

[Figure 2: observed and simulated visits over the semester; (a) Procrastination and Reaction, (b) Piecewise Procrastination and Reaction.]
Fig. 2. (a) Fit of the continuous model for courses with more than 500 visits. (b) Course semester with different course deadlines, and simulated Cox process intensity. Notice that we intend to reproduce the statistical behavior as opposed to a smooth fit.

C. Procrastination and Reaction (PR)

The exemplary course data in Fig. 1 shows a population's reaction to an examination deadline. The reaction manifests itself in terms of first increasing and then peaking activity in the time between January 2013 and February 2013. In this example, reactions are observable on a time scale of the order of weeks. Other courses, however, were observed to elicit shorter reaction times in that students tried to cover most of the material in just a few days. This renders data driven prediction of user behavior a considerable challenge, since only a few data points are available in the region of excited activity. Also, many statistical models do not capture the intrinsic non-stationary nature of the phenomenon where the awareness of an upcoming deadline causes changes in the activity.

In order to model the temporal evolution of the daily rate V_d(t), we propose an ordinary differential equation that accounts for the following minimal principles [7], [8]:
1) users try to learn the whole material by the time the deadline approaches
2) there is a procrastination parameter β which establishes when the visits to the system become a priority
3) users react over time according to some function g(t).

We will model the function g(t) as a power of time. This accords with theories of human behavior dynamics which make similar assumptions, for instance, to model how attention to news items, memes and games declines over time [9]–[12]. In our model, however, the inverse behavior is expected: attention rises as time progresses. Hence, we expect a positive power, g(t) ∝ t^α. We thus consider the following differential equation

\[ \frac{d}{dt} V_d(t) = \frac{\alpha}{\beta}\left(\frac{t}{\beta}\right)^{\alpha}\left[T_e - V_d(t)\right] \tag{5} \]

which is akin to that of an epidemic in that students who engage with the platform are thought of as infected. In contrast to real epidemics, however, infections are exclusively driven by time and not by other infected students, and they are limited by the amount of material which can be covered daily. The role of the infection rate is played by the reaction rate

\[ g(t) = \left(\frac{t}{\beta}\right)^{\alpha} \tag{6} \]

and the term α/β is included for scaling. Finally, T_e − V_d(t) accounts for natural limitations of the number of visits, since T_e is the maximum number of visits expected per day and is limited by the number of users and the average number of visits needed to cover the course material.

Note that (5) has a closed form solution

\[ V_d(t) = \begin{cases} T_e - (T_e - V_0)\,e^{-(t/\beta)^{\alpha}} & \text{if } t < t_d \\ 0 & \text{otherwise} \end{cases} \tag{7} \]

Also note that if t > β the fraction in the exponent is bigger than one and the expression (t/β)^α grows rapidly. The negative exponent causes the exponential term to die out and V_d(t) → T_e. V_0 is the initial daily rate of visits, V_d(0) = V_0. The steepness of this jump depends on the value of α/β. If we differentiate (7) and evaluate at β we obtain α(T_e − V_0)/(eβ), i.e. the slope at the reaction date. The time t_d indicates the day of the examination after which activities typically drop suddenly as students do not need to study urgently anymore.

It is important to note that, as a continuous model, equation (7) holds for courses which have a large enough population of attendees and a large enough workload. A large workload will guarantee that the users will remain engaged over several days, as each user requires many visits in order to cover the material.

For the purpose of parameter estimation from data, we denote the complete set of parameters of our dynamical model as

\[ \theta_{pr} = \{T_e, V_0, \beta, \alpha\} \tag{8} \]

and determine θ_pr as those parameters that minimize the following error

\[ D(t, \theta_{pr}) = \sum_{\tau=1}^{T_d} \left( V_d(\tau) - \hat{V}_d(\tau)[\theta_{pr}] \right)^2 \tag{9} \]

To minimize this expression, we resort to the Levenberg-Marquardt algorithm and let τ = 1, ..., T_d vary over the whole number of days of the semester. The fluctuating nature of the distribution of visits is accounted for via a Gaussian noise assumption with variance σ_n^2.

D. Piecewise PR for Course Workload

Although some courses will have the characteristics required by the continuous model in (7), most of the behavior related to a course will be characterized by rapid shocks occurring on days which define a deadline of some sort (examination or course work). This behavior, however, can be easily incorporated into (7) by requiring a high value of α. This will produce a high visit rate on only a few days around β, up to the deadline.

To fully specify a semester we then need to define one such shock for each deadline and therefore assume

\[ V_d(t) = T_a - (T_a - V_0)\,e^{-\left(\frac{t}{t_a - \delta}\right)^{\alpha}} \tag{10} \]

if t_{a−1} < t < t_a, which provides a piecewise approximation of our model, i.e. one solution of (5) for each course deadline, where t_0 = 0 and t_a defines the point in time of deadline a ∈ {0, 1, ..., M}, where M is the number of deadlines in a course. For simplicity, we assume that all deadlines are separated by δ days from the day at which students begin to react (reaction width). Finally, we let T_a be the number of visits of the population at that deadline. Since we do not know in advance how many visits will happen for a given deadline, we model shocks in terms of random variables. From our empirical data we found via Kolmogorov-Smirnov tests that the distribution that best fits our model follows the gamma distribution

\[ \mathrm{Gamma}(T \mid \rho, \nu) = \frac{\rho^{\nu}}{\Gamma(\nu)}\, T^{\nu-1} e^{-\rho T} \tag{11} \]

This is again very much in line with known models of human attention [9].

Finally, we observe that this kind of assumption in which we define the intensity function of an inhomogeneous Poisson process through another stochastic process is known as a doubly stochastic Poisson process or as a Cox process [13].

E. Cascade Rates

In earlier related work [5], it has been pointed out that short term behaviors of user populations can be modeled via a Poisson process in which another rate is imposed after the initial visit. Since we do not know which of the visits will generate a cascade of activities, we define a variable p_c as the proportion of initial Poisson events which give rise to a cascade. Furthermore, λ_c defines the rate of the uniform Poisson process which characterizes the Poisson distribution.

Algorithm 1 Generating a Cox Process
Require: Semester τ_s, Distribution Parameters θ_pw, E
 1: {t̂_i}_{i=1}^{E} ∼ Uniform(τ_s)
 2: T_a ∼ P(T | θ)
 3: U ← ∅
 4: for i ← 1, ..., E do
 5:     u_i ∼ Uniform(0, 1)
 6:     r_i ∼ V(t̂_i | T_a)
 7:     if u_i < r_i then
 8:         U ← U ∪ t̂_i
 9:     end if
10: end for
11: return U

F. PPRC

Summarizing all of the above, we refer to our full model as the piecewise procrastination reaction model (PPRC). The complete set of parameters of our model is given by

\[ \theta_{pw} = \{\delta, V_0, \sigma_0, \nu, \rho, P_d(t), \lambda_c, p_c\} \tag{12} \]

and characterizes the main aspects of human behavior for the inter-event time distribution. Circadian characteristics are captured by P_d(t), the reaction of the population towards a given task is parametrized by ν, ρ, and δ, baseline behaviors are expressed via V_0 and σ_0, and short term behaviors via λ_c and p_c.

III. METHODOLOGY

Next, we outline the procedure we use for model fitting. First we show how to obtain the procrastination reaction parameters and then introduce an algorithm for simulating a Cox process. Finally, we discuss a training procedure based on simulated annealing which allows us to obtain the parameters θ_pw of the PPRC as defined by (12).

A. Thinning Algorithm

In order to simulate and generate data from our model, we proceed via a modification of the rejection sampling algorithm for point data, known as thinning.

Given our overall collection of observed data, we intend to generate a set of seconds {t_i} ranging from 0 to τ_s, the total number of seconds in one semester. Traditionally, inhomogeneous Poisson process generation [14], [15] requires us to sample from a uniform Poisson distribution via a maximum intensity λ∗, since an inhomogeneous Poisson process with intensity λ(s) requires its number of events to be distributed via N(S) = \int \lambda(s)\,ds.

In our case, however, we do not know the contribution to the distribution of the number of events in a cascade, so we directly generate E events from a uniform distribution over the interval (0, τ_s). We then generate a gamma distributed sample T_a (see again (11)) and random noise from N(0, σ_n). This then allows us to create the stochastic intensity function of the PPRC model.
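
The daily rate on which this intensity builds is the closed form (7). As a minimal sketch, the code below evaluates (7) and checks numerically that its slope at t = β equals α(T_e − V_0)/(eβ). The parameter values are hypothetical and chosen only for illustration; time is measured in days.

```python
import math

def daily_rate(t, T_e, V_0, beta, alpha, t_d):
    """Closed-form daily visit rate of (7); zero once the deadline t_d has passed."""
    if t >= t_d:
        return 0.0
    return T_e - (T_e - V_0) * math.exp(-((t / beta) ** alpha))

def slope_at_beta(T_e, V_0, beta, alpha):
    """Analytic slope of (7) at t = beta: alpha * (T_e - V_0) / (e * beta)."""
    return alpha * (T_e - V_0) / (math.e * beta)

# Hypothetical parameters: at most 60 visits per day, 5 at the start,
# reaction around day 90 of a 120-day period, sharp onset (alpha = 8).
params = dict(T_e=60.0, V_0=5.0, beta=90.0, alpha=8.0)
t_d = 120.0

# Central finite difference around t = beta as a numerical cross-check.
h = 1e-4
numeric = (daily_rate(90.0 + h, t_d=t_d, **params)
           - daily_rate(90.0 - h, t_d=t_d, **params)) / (2 * h)
analytic = slope_at_beta(**params)
```

A larger α makes the rise around β steeper, which is how the piecewise variant produces short shocks just before each deadline.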

[Figure 3: (a) Area Statistics versus # Iterations, curves labeled 0.01, 0.1, 0.5 and 0.9; (b) Cum. Interevent Dist. versus Log Inter Event Time (s), curves labeled simulated, real, expo and pareto.]
Fig. 3. Exemplary results of the behavior of model in training and empirical data fitting. (a) Area test statistic obtained from iterating a simulated annealing procedure for different temperature values. (b) Logarithm of the cumulative distribution of the inter-event times.

We generate the desired intensity shape by using a set of uniform random variates {u_i}_{i=1}^{E} in [0, 1] and evaluate each t_i in the obtained PPRC function. This way (i.e. using Algorithm 1 for Cox process generation), we obtain a number Ẽ of events {t_j}_{j=1}^{Ẽ}. Finally, we incorporate short term behaviors by choosing f_c = p_c Ẽ different events. From these events, we generate a sample from a Poisson process with rate λ_c.
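
The generation steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: times are in days rather than seconds, the circadian factor P_d and the Gaussian noise are left out, the acceptance probability is taken to be the rate (10) divided by its largest possible value, and each cascade seed receives exactly one follow-up visit. The function names and the values of the deadlines, ν, ρ, δ, α and λ_c are hypothetical.

```python
import bisect
import math
import random

def piecewise_rate(t, deadlines, peaks, V_0, delta, alpha):
    """Daily rate of (10) on the interval (t_{a-1}, t_a); zero after the last deadline."""
    if t >= deadlines[-1]:
        return 0.0
    a = bisect.bisect_left(deadlines, t)  # index of the next deadline
    T_a, t_a = peaks[a], deadlines[a]
    return T_a - (T_a - V_0) * math.exp(-((t / (t_a - delta)) ** alpha))

def simulate_cox(n_candidates, deadlines, nu, rho, V_0, delta, alpha, rng):
    """Thinning in the spirit of Algorithm 1: keep a uniform candidate if u_i < r_i."""
    candidates = [rng.uniform(0.0, deadlines[-1]) for _ in range(n_candidates)]
    # One gamma-distributed peak height T_a per deadline, see (11); scale = 1 / rho.
    peaks = [rng.gammavariate(nu, 1.0 / rho) for _ in deadlines]
    ceiling = max(max(peaks), V_0)  # assumption: normalise the rate to [0, 1]
    accepted = []
    for t in candidates:
        r = piecewise_rate(t, deadlines, peaks, V_0, delta, alpha) / ceiling
        if rng.random() < r:
            accepted.append(t)
    return sorted(accepted)

def add_cascades(events, p_c, lambda_c, rng):
    """Give a fraction p_c of the events one follow-up visit after an Exp(lambda_c) gap."""
    seeds = rng.sample(events, int(p_c * len(events)))
    follow_ups = [t + rng.expovariate(lambda_c) for t in seeds]
    return sorted(events + follow_ups)

rng = random.Random(1)
base = simulate_cox(4000, [30.0, 60.0, 90.0], nu=2.0, rho=0.05,
                    V_0=5.0, delta=5.0, alpha=20.0, rng=rng)
visits = add_cascades(base, p_c=0.7, lambda_c=50.0, rng=rng)
```

Because the peak heights are drawn anew on every call, two runs with the same parameters give different intensity curves, which is the doubly stochastic behavior of a Cox process.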

B. Tr
Training
Given a sample of empirical point visits {te}, we next wish to estimate the parameters θpw of our model which best reproduce the data. The daily behavior of the model as established by Pd(t) can be obtained directly from the histogram of the hours of {te}. To obtain the PPRC parameters, we first need to obtain the daily visits by a histogram of the point data for each day.

We then obtain the peak values corresponding to the relevant course work by normalizing the daily visits histogram and consecutively choosing the biggest peaks of the distribution (located at {ta} with values {Ta}) until the cumulative distribution possesses a standard deviation bigger than 0.25, up to a maximum of 24 different peaks (which would correspond to the 24 weeks per semester and a maximum of one homework per week).

We obtain the parameters of the gamma distribution using maximum likelihood estimation from the values of the {Ta} found by this procedure. In order to train the remaining parameters of our model, we require that the inter-event distribution PM(u|θpw) as sampled via the numerical Cox algorithm reproduces the empirical cumulative distribution PD(u). The objective function we consider is given by the area test statistic $A = \int |P_D(u) - P_M(u|\theta_{pw})|\,du$ where we choose u = log(t) for numerical convenience. We thus obtain the cumulative model distribution as a sample statistic from our model. As such, for given values of the parameters, different samples will generate different values of the area statistic. In order to minimize the stochastic surface defined by the parameters, we use simulated annealing [16] by random displacement on the space defined by the variables (V0, σ0, δ, pc, λc, E). As an example of the results of this model fitting procedure, we show the outcome w.r.t. a computer science course in Fig. 3 where we compare the performance of our model to a Pareto and an exponential distribution chosen as baseline models.

Fig. 4. Real distribution of visits over a period of three weeks before an examination deadline related to one of the courses in our data set (lower panel), and simulated point process of visits using the Cox process discussed in the text. Note that idle periods reflect reduced activities over night.

IV. RESULTS
In a series of practical experiments, we trained our models for time marks of 20 different courses. In each case, we initialized the model parameters to E = 4000, λr = 1500, pc = 0.7, V0 = 5, σ0 = 1, δ = 0.1 and used a total of 40 000 iterations and an annealing temperature of 0.2. After training, we obtained an average Kolmogorov-Smirnoff (KS) divergence statistic of 0.003 ± 0.01 for the whole data set as well as a cascade rate of 1930 ± 691 s, corresponding to an average cascade rate of about 30 minutes. Finally, the reaction width δ found in our data was 2.32 ± 1.6 days before the deadline.

Table I presents goodness-of-fit statistics for a random selection of highly attended courses. Overall, the Pairwise Procrastination Reaction Cascade model proposed in this paper was found to fit (almost surprisingly) well to our empirical
18 C. Ojeda, R. Sifa and C. Bauckhage

data. This suggests that our proposed extension of previous inhomogeneous Poisson process models of human activity dynamics on the Web [5] does indeed capture the peculiar dynamics observed on eLearning sites.

TABLE I
AREA- AND KOLMOGOROV-SMIRNOFF STATISTICS FOR RANDOMLY SELECTED COURSES IN OUR DATA SET.

Course Name                 Area Statistic    KS Divergence
Computer Science            0.038             0.016
Databases                   0.053             0.014
U.S.-American Literature    0.054             0.019
School Studies              0.13              0.051
Multivariable Calculus      0.075             0.026

V. CONCLUSIONS

Our goal in this paper was to devise a model for the access behavior of a population of students on a university eLearning site. The particular challenge was to account for characteristic procrastination and reaction patterns observable prior to final examination deadlines.

Though rather intricate machine learning techniques are necessary to fit our model to the given data, we emphasize that the model itself is not a black box but was derived from first principles. This is to say that each of the constituent parts of our model is interpretable. In particular, they represent characteristics of human behavior: circadian cycles, procrastination and reaction in the presence of a deadline, as well as short term use of the system. These components were integrated via a dynamical Cox process model which is intrinsically non-stationary and allows for analyzing data of poor quality (few observations only). In practical experiments, the resulting Pairwise Procrastination Reaction Cascade model was found to be well capable of reproducing the statistics of the behavior of a student population.

To the best of our knowledge, the work reported here is the first such study undertaken on a large data set of access patterns to an eLearning platform. Accordingly, there are numerous directions for future work. In particular, we are currently working on extending our approach towards a practical application that allows teachers to schedule their deadlines for course work and final examinations such that the expected workload for students is more equally distributed over the course of a semester.

REFERENCES
[1] A. L. Barabasi, “The origin of bursts and heavy tails in human dynamics,” Nature, vol. 435, no. 7039, pp. 207–211, 2005.
[2] F. Wu and B. Huberman, “Novelty and collective attention,” PNAS, vol. 104, no. 45, pp. 17 599–17 601, 2007.
[3] R. Sifa, F. Hadiji, J. Runge, A. Drachen, K. Kersting, and C. Bauckhage, “Predicting Purchase Decisions in Mobile Free-to-Play Games,” in Proc. of AAAI AIIDE, 2015.
[4] C. Ojeda, K. Cvejoski, R. Sifa, and C. Bauckhage, “Variable Attention and Variable Noise: Forecasting User Activity,” in Proc. of LWDA KDML, 2016.
[5] R. D. Malmgren, D. B. Stouffer, A. E. Motter, and L. A. N. Amaral, “A Poissonian explanation for heavy tails in e-mail communication,” PNAS, vol. 105, no. 47, pp. 18 153–18 158, 2008. [Online]. Available: http://www.pnas.org/content/105/47/18153.abstract
[6] R. G. Gallager, Discrete stochastic processes. Kluwer Academic Publishers Boston, 1996, vol. 101.
[7] M. A. Alvarez, D. Luengo, and N. D. Lawrence, “Latent force models,” in Proc. AISTATS, 2009.
[8] T. Gunter, C. Lloyd, M. A. Osborne, and S. J. Roberts, “Efficient bayesian nonparametric modelling of structured point processes,” arXiv preprint arXiv:1407.6949, 2014.
[9] C. Bauckhage, “Insights into internet memes,” in Proc. ICWSM, 2011.
[10] Y. Matsubara, Y. Sakurai, B. A. Prakash, L. Li, and C. Faloutsos, “Rise and fall patterns of information diffusion: model and implications,” in Proc. KDD, 2012.
[11] C. Bauckhage, K. Kersting, and F. Hadiji, “Mathematical models of fads explain the temporal dynamics of internet memes,” in Proc. ICWSM, 2013.
[12] R. Sifa, C. Bauckhage, and A. Drachen, “The Playtime Principle: Large-scale Cross-games Interest Modeling,” in Proc. of IEEE CIG, 2014.
[13] D. R. Cox, “Some statistical methods connected with series of events,” J. Royal Statistical Society B, vol. 17, no. 2, pp. 129–164, 1955.
[14] P. A. Lewis and G. S. Shedler, “Simulation of nonhomogeneous poisson processes by thinning,” Naval Research Logistics Quarterly, vol. 26, no. 3, pp. 403–413, 1979.
[15] R. P. Adams, I. Murray, and D. J. MacKay, “Tractable nonparametric bayesian inference in poisson processes with gaussian process intensities,” in Proc. ICML, 2009.
[16] S. Kirkpatrick, M. Vecchi et al., “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
Investigating and Forecasting User Activities in
Newsblogs: A Study of Seasonality, Volatility and
Attention Burst
César Ojeda∗, Rafet Sifa∗† and Christian Bauckhage∗†
∗ Fraunhofer IAIS, St. Augustin, Germany
† University of Bonn, Germany

Abstract—The study of collective attention is a major topic in the area of Web science as we are interested in knowing how a particular news topic or meme is gaining or losing popularity over time. Recent research focused on developing methods which quantify the success and popularity of topics and studied their dynamics over time. Yet, the aggregate behavior of users across content creation platforms has been largely ignored even though the popularity of news items is also linked to the way users interact with the Web platforms. In this paper, we present a novel framework of research which studies the shifts of attention of populations over newsblogs. We concentrate on the commenting behavior of users for news articles which serves as a proxy for attention to Web content. We make use of methods from signal processing and econometrics to uncover patterns in the behavior of users which then allow us to simulate and hence to forecast the behavior of a population once an attention shift occurs. Studying a data set of over 200 blogs with 14 million comments, we found periodic regularities in the commenting behavior, namely cycles of 7 days as well as 24 days of activity which may be related to known scales of meme lifetimes.

I. INTRODUCTION

Much recent research on Web analytics has concentrated on developing theories of collective attention where the main object of study is the evolution of the popularity of topics, ideas, or sets of news [1], [2]. In this context, the concept of a meme has arisen as the main atom of modern quantitative social science [1], [3] and researchers seek to understand whether a particular meme will remain popular and, if so, for how long. Under this line of research, virality is the main phenomenon to model. Usually, a contagious (i.e. network based) approach is followed, and virality is literally treated as an infection in a given population: as an item becomes popular, i.e. as a piece of news is retweeted or discussed in the media, the population unit, whether blog or twitter account, is considered infected.

It is important to note that this paradigm of research ignores the impact of the source of the meme. The website or content generating media plays a key role in generating (or hindering) the evolution of a particular idea or topic. Naturally, if a given website has a large number of users, it is likely that certain topics become popular and capture the attention of the population. We might thus ask whether the baseline behavior of the users of a website is enough to generate popularity. Is it the content which is popular or is it the website which is popular? If we can use the baseline behavior of a given website as a reference, we might use such information as a general guideline for forecasting. Information regarding the population behavior over a website will provide the basis for understanding both the evolution of that particular website as well as the possible success of the future content.

The population behavior on a given website is unavoidably stochastic, as we cannot know a priori whether a particular user or a set of users will visit or not (see Fig. 1 for an exemplary time series of activities of blog commenters). Statistical time series analysis has a rich history of success in fields as diverse as electronics, computer science, and economics and we know that, given that relevant information regarding the behavior of the system is properly modeled in the phenomena that are measured, and randomness is realized under proper bounds, estimates can be achieved and predictive analytics becomes possible.

In this work we present a bottom up case study to analyze the attention of blog users (particularly commenters) in order to understand and predict their activity patterns. It exploits the fact that many websites as well as blogs have years of user history which can be mined in search of relevant patterns.

II. RELATED WORK

The study of information diffusion on the Web intends to model pathways and dynamics through which ideas propagate. One of the goals is to infer the structure of networks in which information propagates [4]. Researchers often try to devise equations which govern the aggregate behavior [2]; these equations typically have parameters which depend on the population size, the rate of the spreading process, and universal features of how attention fades. They model time series which show how much activity a particular topic or meme attracts. Forecasting can then be performed once the parameters of a particular population have been determined. In [5] Matsubara et al. learned the parameters of the attention time series related to the first Harry Potter movie and then predicted the behavior for the next movies by measuring only the initial population reaction. More importantly, the natural limitations of human attention and human behavior have been shown to define the overall behavior of the population as they access the information [6], [7], [8], [9].

Blog dynamics are traditionally studied in the context of conversation trees established between different blogs [10],

© Springer Fachmedien Wiesbaden GmbH 2017


P. Haber et al. (Hrsg.), Data Science – Analytics and
Applications, http://doi.org/10.1007/978-3-658-19287-7_2

[11]. In this context, models of the structural and dynamical properties of networks of hyperlinks are sought. McGlohon et al. [11] try to find representative temporal and topological features for grouping blogs together and to understand how information propagates among different blogs. The model in [10] defines rules which indicate how the different links are created. Typical properties include the degree distribution of the edges to a blog, which presents a power law behavior, the size of conversation trees (how many posts are created in a given theme) and the popularity in time, as measured with the amount of edges in time.

Here, we intend to make use of methods of statistical time series modeling in order to study and forecast the behavior of a population on a given webpage. In particular, we focus on time series of users' commenting patterns. We base our work on well-established methods in economics [12], [13] and signal processing [14]. In a nutshell, we attempt to describe the statistical process which generates the time series, modeling both temporal correlations as well as signal noise. Due to the world financial crisis of 2008, a great deal of attention in the field is devoted to understanding volatility or strong fluctuations in the market [15], [16]. At their core, the problems of volatility and collective attention shifts are similar and we exploit these similarities in our approach.

III. DATA SET DESCRIPTION

The empirical basis for our work consists in a collection of blogs hosted on Wordpress¹. Wordpress blogs are personal or company owned web sites that allow users to create content in the form of posts. Users (or, in blogger terminology, writers or authors) decide whether or not their readers can comment on particular posts. Comments on posts provide a significant proxy when it comes to measuring the attention a topic receives.

Wordpress is a content management system and a frequently used blogging platform accounting for more than 23.3% of the top 10 million websites as of January 2015². As of this writing, there are 43 million posts and more than 60 million users. Notably, the user base of Wordpress ranges from individual hobby bloggers to publishers such as Time magazine³. And while traditional mass media limit the number of topics accessible to the public [17], blogs have arisen as a social medium allowing for personal opinion and information variety. Nevertheless, inequality is a pervasive characteristic of social phenomena [18], [19] and many of the consumers of blog contents concentrate on sites with a large community of frequent users.

Our dataset contains the entire post and comment history of newsblogs (from year 2006 to 2014) that ever made it to the daily updated list of Blogs of the Day⁴. Having crawled the content of over 6580 different blogs, we choose to work with the time series of activities through comments and posting of the most active (throughout the available time span) 203 blogs from a variety of genres in order to study long term dynamics. The blogs we analyze overall contain 713,122 posts and 14,883,752 comments. Table I summarizes the overall characteristics of our data and Table II shows the top 10 blogs sorted with respect to the number of viral posts. So as to protect the identity of commenters, we randomly hashed the author identities. The key features we extracted from each post are
• CommentsTime: Time of each comment
• MainURL: URL of the post
• CommentsInfo: Content of the comment
• Title: Title of the post
• PubDate: Date of last update of the post
• Blog: URL of the blog
• Info: Summary of the post

number of blogs          203
number of posts          713,122
number of comments       14,883,752
time span of activity    2006-2014

TABLE I: Characteristics of the Wordpress Data Set

Blog URL                         News Genre       Viral Post Count
le-grove.co.uk                   Sport            620
politicalticker.blogs.cnn.com    Politics         542
order-order.com                  Politics         294
sloone.wordpress.com             Personal         248
technologizer.com                Technology       136
snsdkorean.wordpress.com         Entertainment    116
collegecandy.com                 Magazine         108
seokyualways.wordpress.com       Personal         96
religion.blogs.cnn.com           Religion         93
kickdefella.wordpress.com        Personal         89

TABLE II: Top 10 Newsblogs from our dataset.

¹ https://www.wordpress.com/
² Information retrieved on 5.3.2013 from http://w3techs.com/technologies/overview/content_management/all/
³ https://vip.wordpress.com/2014/03/06/time-com-launches-on-wordpress-com-vip/
⁴ https://botd.wordpress.com/

IV. EMPIRICAL OBSERVATIONS

Our main objective in this paper is understanding the dynamics of the aggregate behavior of a population of users of a webpage or blog. For each blog we determine the number of comments Ct on all posts at time t. Figure 1 presents a typical example of such a time series of daily commenting behavior spanning a period of 3 years together with a monthly moving average. Averaged over months, the time series shows a slowly varying behavior which reflects the fact that, over time, the attention the blog receives does not vary much. On the other hand, on the time scale of days rapid variations can be observed, as well as strong peaks which indicate the presence of particularly engaging content. We aim at applying forecasting methods to these very fluctuations. The main goal of our work is to propose a forecasting procedure that can characterize these shifts of attention.

We assume that a prototypical webpage has a steady user base. This user base can be understood as a set of followers and as different followers visit the website, comments will appear as natural stochastic fluctuations. However, if a particular news item is highly relevant or hot for a segment of the population,

then users will show more activity, increase their commenting rate, and also share the content with other users who are not part of the baseline set of followers of the blog. Consequently, some of the comments will appear as a result of a spreading process due to the web site followers. The popularity of a web page translates into popularity of the content.

Assuming a somewhat stable set of users has the added advantage of exploiting seasonality effects, as people are known to follow seasonal behavior [20]. Such periodicities can help to better understand the repeating behavior (for instance weekly visits of average users) as well as to forecast activities.

Fig. 1: Time series of the number of comments Ct and bursts yt for a newsblog (http://politicalticker.blogs.cnn.com/). The upper figure shows the daily fluctuations of activities (blue) and the monthly moving average trend (red). The lower figure shows a burst representation that allows for locating noticeable changes in the overall behavior.

V. TIME SERIES ANALYSIS

We model the user commenting behavior as a discrete stochastic process, i.e. as a collection of random variables X1, X2, X3, ..., Xk indexed by time. Since we aim at attention modeling, our first variable of study is Ct, the number of comments an entire blog has at time t. Considering daily samples accords with the pace of publication in most blogs.

A. Seasonality

One approach to time series prediction using stochastic processes requires detecting periodicities. If periodicities are recovered by detecting their frequencies and associated amplitudes, we can predict the behavior at each cycle. A common method considers power spectra of the stochastic process at hand via the discrete Fourier transform

$$f_x(\omega) = \sqrt{\frac{1}{T}} \sum_{t=1}^{T} x_t \exp(-i\omega t). \tag{1}$$

We then square the transform

$$I(\omega) = \frac{1}{2\pi} f_x(\omega) \hat{f}_x(\omega) \tag{2}$$

to obtain the so called periodogram of a time series.

If seasonal behavior is present, peaks will appear in the periodogram. For example, Fig. 2 shows the periodogram for a time series of comment counts from our dataset. Notice the peaks at frequencies of 0.13, 0.29 and 0.31 which account for weekly and half weekly seasonal behavior in the comment patterns, indicating that the users of that particular newsblog leave comments with periods of three-and-a-half and seven days.

Fig. 2: Seasonality study through a semilog plot of the periodogram for estimation of the power spectral density from an example newsblog. Notice the presence of 3 frequencies among the noisy behavior that account for commenting activity in periods of three-and-a-half and seven days.

B. Burst and Volatility

In order to characterize the popularity of a particular news source, we require a quantitative measurement of attention shifts. These can be described via an inter-burst measure [21], namely

$$y_t = \frac{C_t - C_{t-1}}{C_t}. \tag{3}$$

Eq. (3) is also known as the logarithmic derivative of the process Ct. Due to its rescaled nature, yt allows for comparison between different newsblogs and times, but, for our data set, it provides limited value for forecasting. We thus also consider the detrended comment count $\tilde{C}_t$ which is obtained after using the Baxter-King filter (see the Appendix).

C. Modeling Volatility

Having introduced $\tilde{C}_t$, we next need an approach that can account for fluctuations. Since after the filter is applied we obtain a zero mean behavior, we can assume that the comments behavior will vary as

$$\tilde{C}_t = \sigma_t \epsilon_t \tag{4}$$

where $\sigma_t$ represents the standard deviation of the fluctuation at time t and $\epsilon_t \sim N(0, 1)$ a noise value at time t, sampled from a normal distribution of mean 0 and variance 1. In finance, when modeling the fluctuations in returns (as opposed to $\tilde{C}_t$), $\sigma_t$ is known as the volatility. The simplest models for volatility occur in econometrics under the names of ARCH (autoregressive conditional heteroscedasticity) and generalized ARCH or GARCH [22]. Here, heteroscedasticity refers to variations in the fluctuation of the variable of interest, in our case Ct. Formally, this implies that the covariance Cov(Ct, Ct−k) depends on time t. The dependence is modelled through past values of the noise $\epsilon_t$ as well as past values of $\sigma_t$

$$\sigma_t^2 = \omega + \sum_{k=1}^{q} \alpha_k \epsilon_{t-k}^2 + \sum_{l=1}^{p} \beta_l \sigma_{t-l}^2. \tag{5}$$

The values of p and q determine how correlated $\sigma_t$ is with past fluctuations so that the model is called GARCH(p, q). To learn these parameters from data, we have to take into account that for GARCH(1, 0) the squares of the fluctuations are $\tilde{C}_t^2 = \alpha_0 + \alpha_1 \tilde{C}_{t-1}^2 + \text{error}$, i.e. that they behave as an AR(1) (an autoregressive model of order 1) [12]. Hence, the values of the partial autocorrelations of the squared variable give us a hint of the dependence of the model.
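
The quantities introduced in this section can be sketched in a few lines of Python. The block below is a minimal illustration under stated assumptions rather than the code used for the reported experiments: it evaluates the periodogram of Eqs. (1) and (2) at the Fourier frequencies via the FFT, computes the burst measure of Eq. (3), and simulates a GARCH(1, 1) series following Eqs. (4) and (5). The function names and the synthetic weekly series are hypothetical, and the α term is applied to the squared past fluctuation, which is the usual GARCH convention.

```python
import numpy as np


def periodogram(x):
    """Periodogram of Eqs. (1)-(2) at the Fourier frequencies (cycles/sample)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    # The FFT indexes time from 0 instead of 1; this only changes the phase,
    # which cancels when the transform is multiplied by its conjugate.
    f = np.fft.rfft(x) / np.sqrt(T)
    freqs = np.fft.rfftfreq(T, d=1.0)
    power = (f * np.conj(f)).real / (2.0 * np.pi)
    return freqs, power


def burst(c):
    """Burst measure y_t = (C_t - C_{t-1}) / C_t; counts must be non-zero."""
    c = np.asarray(c, dtype=float)
    return (c[1:] - c[:-1]) / c[1:]


def simulate_garch11(omega, alpha, beta, n, rng):
    """Simulate x_t = sigma_t * eps_t with a GARCH(1, 1) variance recursion."""
    x = np.empty(n)
    sigma2 = np.empty(n)
    # Start from the unconditional variance (requires alpha + beta < 1)
    sigma2[0] = omega / (1.0 - alpha - beta)
    x[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n):
        sigma2[t] = omega + alpha * x[t - 1] ** 2 + beta * sigma2[t - 1]
        x[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return x, np.sqrt(sigma2)


rng = np.random.default_rng(0)
days = np.arange(280)
# Hypothetical daily comment counts with a weekly cycle plus noise
counts = 100.0 + 20.0 * np.sin(2.0 * np.pi * days / 7.0) + rng.normal(0.0, 5.0, days.size)
freqs, power = periodogram(counts - counts.mean())
peak = freqs[1:][np.argmax(power[1:])]   # skip the zero frequency
print("dominant period in days:", 1.0 / peak)
print("first bursts:", burst(counts)[:3])
```

For daily samples the frequencies are expressed in cycles per day, so a weekly rhythm shows up near 1/7 ≈ 0.14. The simulated GARCH series can be used to check a fitting routine before applying it to real comment data.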
VI. MAIN RESULTS

A. Attention Proxy

Traditional attention proxies for collective attention comprise direct measures such as the number of hyperlinks to a web site, retweets, or the number of likes via the facebook platform. Sometimes an internal measure of the website is preferred; for instance, the web site digg.com uses a user based ranking mechanism which ultimately decides which content is displayed on the main page. In order to validate our use of comments as an attention proxy, we make use of the fact that Wordpress regularly features the most popular posts⁵. In the following, we refer to posts not presented in this list as normal posts and to posts presented in the list as viral or important posts, and we note that, indeed, posts featured on the most popular list attract more comments than others.

Since we work with a dataset that covers several years of activity, it is important to note that the amount of comments expected in a given year is different from the amount of comments in another if the blog shows growth or decay in its user base. Yet, this issue is accounted for by our use of the de-trended behavior $\tilde{C}_t$.

For our whole data set of relevant blogs, we observe an average fluctuation value of 71.12 for the important posts, whereas for the normal posts we observe an average fluctuation value of -96.14. This shows that most of the comment behavior arises from popular or viral posts.

Figure 3 shows the distribution of comments for normal and important posts where we observe positively skewed distributions that can be modeled using pareto and lognorm distributions (see the Appendix). The pareto behavior indicates that there are strong fluctuations in the normal data set, meaning that there are in fact important posts not considered by the Wordpress listing procedure.

Fig. 3: The distribution of comments (probability vs. number of comments). The upper figure shows the distribution of comments for posts pertaining to the non-relevant data set. A Pareto distribution is fitted with an average value of 43 and p value = 0.43. The lower figure on the other hand shows the distribution of comments for relevant posts as determined from the Wordpress data set, with an average of 254 comments per post and p value = 0.73.

B. Seasonality

In this section we perform the periodogram analysis to identify the cycles of commentors leaving comments in the newsblogs. It is important to note that such an analysis should be realized for each blog independently, as it will reveal the cycles of the followers or induced cycles due to the posting patterns of the particular website we are analyzing. Nevertheless, it is commonly known that both human behavior and virality patterns often show universal trends. For instance, attention to memes grows and fades over time in highly regular patterns [1]. Also, people show repetitive behavior because of work schedules and life habits [20]. If there is in fact such universal behavior for our case, a periodogram analysis should reveal common characteristics across different blogs.

Indeed, based on periodogram analysis, we obtained the most relevant frequencies for each periodogram as the frequencies above 4 standard deviations over the statistics of each individually calculated periodogram. Figure 4 shows the histogram of the relevant frequencies where we observe relevant peaks at periods of 17 and 7 days.

The peak at the higher frequency, corresponding to a period of 7 days, can be attributed to a weekly behavior of the users. Business days constrain the publishing efforts of websites as well as the visiting patterns of users. The 17-day period, on the other hand, seems to correspond to the natural attention scale of publishing behavior. A news content provider, for example, publishes several posts on a given topic and this 17-day period might correspond to the saturation of attention resources people can

⁵ https://botd.wordpress.com/top-posts/

17 days 7 days
300
300 RealizedVolatility
250 GARCH p: 6 q: 6
250
200

Volatiilliitty σ
200 150
# Counts

100
150

50
100

0
1 1
2008 ov 2008 ai 2009 ov 2009 ai 2010 ov 2010 ai 201 ov 201 ai 2012
50
Mai N M N M N M N M
Dates
0

0.00 0.05 0.10 0.15 0.20


Fig. 5: Reaalized volatility vs Conditional Volatility
Volatility
Frequency (1/Days)
as obtained from the best GARCH model for the
Fig. 4: Seasonality study through analyzing the histogram for http://marisacaat.wordpress.com/ website.
the relevant frequencies over the whole population of blogs.
For each blog we selected the frequencies which exceed the Blogs Best Model MSE MSE p=1 q=1 # Comments
le-grove.co.uk// (7, 7) 2.00 1.87 1625633
4 standard deviation. We
We observe peaks for 7 and 17 days politicalticker.bblogs.cnn.com/ (3, 2) 4.48 4.16 1461266
order-order.com/m/ (5, 5) 4.87 3.80 1289484
indicating users’ periodic commenting frequencies. sloone.wordpreess.com/ (8, 7) 1.39 1.16 93493
technologizer.ccom/ (9, 1) 594.34 649.79 165856
snsdkorean.woordpress.com/ (2, 5) 0.40 0.28 83081
collegecandy.ccom/ (9, 1) 0.71 0.61 51495
seokyualways.wordpress.com/ (1, 6) 3.76 2.41 130460
religion.blogs.cnn.com/
cnn.com/ (9, 9) 21.66 18.74 2426437
allocate to diffferent
ferent subjects [9]. Unless we perform an actual content analysis,
however, we will not be able to pinpoint the true latent cause of this
periodicity. Nevertheless, our results show a consistent appearance of
this frequency throughout our Wordpress data set.

kickdefella.wordpress.com/ (6, 8) 0.22 0.17 39312

TABLE III: Fitting results for the top 10 newsblogs.

C. Forecasting

Having analyzed the nature of user behavior and having uncovered
periodic patterns in the behavior of commentors, we next approach the
ultimate goal of analytics which, in our case, is forecasting user
fluctuations.

1) A Forecasting Protocol: To this end, we propose the following
protocol in order to study the fluctuations in attention to blogs:

• Create the time series of comments per day C_t (different time
  scales such as hours, days, or weeks may be considered, provided
  that enough samples are available).
• Apply a detrending procedure such as the Baxter King filter in order
  to obtain a stationary version C̃_t of the given time series.
• Obtain the periodogram of the time series so as to identify relevant
  frequencies.
• Compute the autocorrelation (Eq. (9)) and the partial
  autocorrelation of C̃_t.
• If relevant autocorrelations can be observed, the blog is amenable
  to forecasting of fluctuations.
• Fit the ARCH or GARCH model to the time series.
• Once the values of β_p and α_q are known, we can obtain the
  conditional volatility, i.e. the values of σ_t given past values of
  q and p.

In order to fit the parameters of the ARCH and GARCH processes, many
commercial and free software packages⁶ are available [22]. We performed
a statistical test of heteroscedasticity for all of the blogs in our
data set [16] and selected the top 100 web sites. We present the
results for the top 10 blogs in Table III. We fitted GARCH models with
a grid search on the parameters p ∈ {1, ..., 10} and q ∈ {1, ..., 10}
and selected the best model according to the Akaike Information
Criterion (AIC) [23]. For comparison, we obtained the realized
volatility [22], i.e. the standard deviation over the past 14-day
window (σ for {C̃_{i−14}, ..., C̃_{i−1}, C̃_i}), and compared it to the
conditional volatility σ_t as predicted by GARCH given the past values
of the models one day in advance (according to the selected q and p).
Figure 5 shows an exemplary GARCH fit of the fluctuations of the
commenting behavior of a newsblog. For the top 100 blogs, we obtained
average MSE values of 0.739100 for the models with best AIC values and
0.663600 for the models with p=1, q=1.

VII. CONCLUSIONS AND FUTURE WORK

We were concerned with the problem of understanding the dynamics of
the collective attention of populations of readers of newsblogs and
presented a novel time series analysis approach. Our approach relies
on two main concepts: seasonality and volatility. The seasonality is
uncovered by the analysis of the periodogram function of the time
series of comments. Volatility, on the other hand, is uncovered by
analyzing the deviation of the variable fluctuations in time. We found
universal seasonal behavior of newsblog users in our dataset, which
hints at universal patterns of human behavior. Additionally, we showed
how GARCH modeling of the time series presents a way to forecast
possible future scenarios for

⁶ An example open source Python package for fitting ARCH models can be
found under https://pypi.python.org/pypi/arch.
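The last steps of the forecasting protocol above (autocorrelation check, conditional volatility, comparison against a 14-day realized volatility by MSE) can be illustrated with a small standard-library sketch. This is not the authors' implementation. It assumes the textbook GARCH(1,1) recursion sigma2[t] = omega + alpha * x[t-1]**2 + beta * sigma2[t-1], takes the parameters as given rather than fitting them, and runs on synthetic data. All function names and the values of omega, alpha and beta are hypothetical. An actual fit with a grid search over p and q would normally rely on a dedicated package.

```python
import math
import random
import statistics


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom


def realized_volatility(series, window=14):
    """Population standard deviation over each trailing window."""
    return [statistics.pstdev(series[t - window + 1:t + 1])
            for t in range(window - 1, len(series))]


def garch11_volatility(series, omega, alpha, beta):
    """One-step-ahead conditional volatility from a GARCH(1,1) recursion.

    Assumption: the recursion starts from the sample variance of the
    (detrended, roughly zero-mean) series.
    """
    sigma2 = [statistics.pvariance(series)]
    for t in range(1, len(series)):
        sigma2.append(omega + alpha * series[t - 1] ** 2 + beta * sigma2[t - 1])
    return [math.sqrt(v) for v in sigma2]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    random.seed(0)
    omega, alpha, beta = 0.1, 0.2, 0.7  # hypothetical values, not fitted
    # Synthetic stand-in for a detrended comment series.
    series, var = [], omega / (1 - alpha - beta)
    for _ in range(200):
        shock = random.gauss(0.0, math.sqrt(var))
        series.append(shock)
        var = omega + alpha * shock ** 2 + beta * var

    window = 14
    conditional = garch11_volatility(series, omega, alpha, beta)
    realized = realized_volatility(series, window)
    squared = [x * x for x in series]
    print("lag-1 autocorrelation of squared series:",
          round(autocorrelation(squared, 1), 3))
    print("MSE, conditional vs realized volatility:",
          round(mse(conditional[window - 1:], realized), 3))
```

The conditional volatility is aligned with the realized volatility by dropping the first `window - 1` values, so both sequences refer to the same days before the MSE is taken.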
found numerously within 100 miles of it. They adhere to stones in
rapid water, and differ from the Melaniidae of the Old World and of S.
America in the absence of a fringe to the mantle and in being
oviparous. They do not occur north of the St. Lawrence River, or
north of U.S. territory in the west, or in New England. Three-quarters
of all the known species inhabit the rough square formed by the
Tennessee River, the Mississippi, the Chattahoochee River, and the
Gulf of Mexico. The Mississippi is a formidable barrier to their
extension, and a whole section (Trypanostoma, with the four genera
Io, Pleurocera, Angitrema, and Lithasia) does not occur west of that
river. The Viviparidae are also very largely developed, the genera
Melantho, Lioplax, and Tulotoma being peculiar. The Pulmonata are
also abundant, while the richness of the Unionidae may be gathered
from the fact that Wetherby states[377] that in 1874 no less than 832
species in all had been described.
The entire Mississippi basin is inhabited by a common
assemblage of Unionidae, and a considerable number of the species
are distributed over the whole of this area, Texas, and parts of E.
Mexico. Some species have spread out of this area into Michigan,
Canada, the Red River, and Hudson’s Bay district, and even into
streams in New York which drain into the Atlantic. An entirely
different set of forms occupy the great majority of the rivers falling
into the Atlantic, the Appalachian Mountains acting as an effective
barrier between the two groups of species, which appear to mingle
below the southern end of the range. In many cases Unionidae seem
to have no difficulty in migrating from river to river, if the distance is
not extreme; they probably are carried across overflowed districts in
time of flood.[378]
Fig. 227.—Helix (Arionta)
fidelis Gray, Oregon.
(2) The Californian Sub-region is markedly distinct from the rest
of N. America. The characteristic sombre Helices of the Eastern
States are almost entirely wanting, and are replaced by Arionta (20
sp.), a larger and more varied group, which may have some affinity
to Chinese forms. Glyptostoma (1 sp.) is also peculiar. Selenites
here has its metropolis, and Pristiloma is a remarkable group of
small Hyalinia (Zonites), but the larger forms of the Eastern States
are wanting. Several remarkable and quite peculiar forms of slug
occur, namely, Ariolimax (whose nearest relation is Arion),
Prophysaon, Hemphillia, and Binneya. There are no land
operculates.
Not more than 15 to 20 species of the Pleuroceridae (sect.
Goniobasis) occur west of the Rocky Mountains, and only a single
Unio, 5 Anodonta, and 1 Margaritana, which is common to New
England. Pompholyx is a very remarkable ultra-dextral form of
Limnaea, apparently akin to the Choanomphalus of L. Baikal.
Bithynia, absent from the Eastern States, is represented by two
species. The general indications are in favour of the Californian
fauna having migrated from an Old World source after the upheaval
of the Sierras; the American fauna, on the other hand, is purely
indigenous, with no recent Old World influence at all.
Land Mollusca of the Nearctic Region
Glandina 4
Selenites 6
Limax 4
Vitrina 4
Vitrinozonites 1
Mesomphix 15
Hyalinia 22
Conulus 1
Gastrodonta 9
Pristiloma 2
Tebennophorus 4
Ariolimax 6
Prophysaon 2
Hemphillia 1
Binneya 1
Patula 18
Punctum 2
Arionta 20
Praticola 2
Glyptostoma 1
Mesodon 27
Stenotrema 11
Triodopsis 21
Polygyra 23
Polygyrella 2
Gonostoma 1
Vallonia 1
Strobila 2
Pupa 18
Vertigo 8
Holospira 2
Cionella 1
Bulimulus 6
Macroceramus 1
Succinea 21
Vaginulus 1
Helicina 2

F. The Neotropical Region


The land Mollusca of the Neotropical Region stand in complete
contrast to those of the Nearctic. Instead of being scanty, they are
exceedingly abundant; instead of being small and obscure, they are
among the largest in size, most brilliant in colour, and most singular
in shape that are known to exist. At the same time they are, as a
whole, isolated in type, and exhibit but little relation with the Mollusca
of any other region.
The most marked feature is the predominance of the peculiar
genera Bulimus and Bulimulus, the centre of whose development
appears to lie in Peru, Ecuador, and Bolivia, but which diminish, both
in numbers and variety of form, in the eastern portion of the region.
In the forests of Central America, Venezuela, and Ecuador, and, to a
lesser degree, in those of Peru and Brazil, occurs the genus
Orthalicus, whose tree-climbing habits recall the Cochlostyla of the
Philippines. These three groups of bulimoid forms constitute, as far
as the mainland is concerned, the preponderating mass of the land
Mollusca. Helix proper is most strongly developed in the Greater
Antilles, which possess several peculiar groups of great beauty. In
Central America Helix is comparatively scarce, but in the northern
portions of the continent several fine genera (Labyrinthus, Isomeria,
Solaropsis) occur, which disappear altogether towards the south.
Carnivorous land Mollusca are, so far as Central America is
concerned, more highly developed than in any other quarter of the
world, particularly in the genera Glandina and Streptostyla. These
genera also penetrate the northern portions of the continent,
Glandina reaching as far as Ecuador, and Streptostyla as far as
Peru. The Greater Antilles have also characteristic forms of these
genera. Streptaxis is tolerably abundant all over tropical South
America, and is the one pulmonate genus which shows any affinity
with the African fauna.
The slugs are exceedingly scarce. Vaginula occurs throughout,
and is the only genus in any sense characteristic.
Clausilia, in the sub-genus Nenia, occurs along the Andean chain
from the extreme north (but not in Central America) as far south as
Bolivia. It has in all probability made its way into S. America in
exceedingly remote ages from its headquarters in Eastern Asia. No
species survives in N. America, and a single straggler is found in
Porto Rico. The genera Macroceramus, Cylindrella, and Strophia,
are characteristic West Indian forms, which are only slightly
represented on the mainland. Homalonyx, a curious form akin to
Succinea, is peculiar to the region.

Fig. 228.—Homalonyx unguis Fér., Demerara. sh, Shell (shown also
separate); p.o, pulmonary orifice.
Land operculates attain a most extraordinary development in the
Greater Antilles, and constitute, in some cases, nearly one-half of the
whole Molluscan fauna. Several groups of the Cyclostomatidae find
their headquarters here, and some spread no farther. On the
mainland this prominence does not continue. West Indian influence
is felt in Central America and on the northern coast district, and
some Antillean genera make their way as far as Ecuador. The whole
group entirely disappears in Chili and Argentina, becoming scarce
even in Brazil.
Among the fresh-water operculates, Ampullaria is abundant, and
widely distributed. Vivipara, so characteristic of N. America, is
entirely absent. Chilina, a remarkable fresh-water pulmonate, akin to
Limnaea, is peculiar to Chili, Patagonia, and Southern Brazil, but is
not found in the tropical portion of the continent. Of the fresh-water
Pelecypoda Mycetopus, Hyria, Castalia, Leila, and Mülleria are
peculiar forms, akin to the Unionidae.
(1) The Antillean Sub-region surpasses all other districts in the
world in respect of (1) extraordinary abundance of species, (2) sharp
definition of limits as a whole, (3) extreme localisation of the fauna of
the separate islands. The sub-region includes the whole of the half-
circle of islands from the Bahamas to Grenada, together with the
extreme southern end of the peninsula of Florida, which was once,
no doubt, a number of small islands like the Bahamas. Trinidad, and
probably Tobago, although containing an Antillean element, belong
to the mainland of S. America, from which they are only separated
by very shallow water.
The sub-region appears to fall into four provinces:—
(a) Cuba, the Bahamas, and S. Florida; (b) Jamaica; (c) San
Domingo (Haiti), Porto Rico, and the Virgin Is., with the Anguilla and
St. Bartholomew group; (d) the islands from Guadeloupe to
Grenada. The first three provinces contain the mass of the
characteristic Antillean fauna, the primary feature being the
extraordinary development of the land operculates, which here
reaches a point unsurpassed in any other quarter of the globe. The
relative numbers are as follows:—
               Cuba   Jamaica   San Domingo   Porto Rico
Inoperculate    362     221         152           75
Operculate      252     242         100           23
It appears, then, that the proportion of operculate to inoperculate
species, while very high in Cuba (about 41 per cent of the whole),
reaches its maximum in Jamaica (where the operculates are actually
in a majority), begins to decline in San Domingo (about 40 per cent),
and continues to do so in Porto Rico, where they are not more than
24 per cent of the whole. These operculates almost all belong to the
families Cyclostomatidae and Helicinidae, only two genera
(Aperostoma and Megalomastoma) belonging to the Cyclophorus
group. Comparatively few genera are absolutely peculiar to the
islands, one or two species of most of them occurring in Central or S.
America, but of the several hundreds of operculate species which
occur on the islands, not two score are common to the mainland.
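The proportions quoted above can be checked directly from the four pairs of counts given for the islands. The short sketch below is only an arithmetic check. The dictionary and function names are illustrative and not part of the text.

```python
# (inoperculate, operculate) species counts as quoted for each island.
counts = {
    "Cuba": (362, 252),
    "Jamaica": (221, 242),
    "San Domingo": (152, 100),
    "Porto Rico": (75, 23),
}


def operculate_share(inoperculate, operculate):
    """Operculates as a percentage of all land species on an island."""
    return 100.0 * operculate / (inoperculate + operculate)


shares = {island: operculate_share(*pair) for island, pair in counts.items()}

for island, share in shares.items():
    print(f"{island}: {share:.1f} per cent operculate")
```

On these figures Jamaica is the only island where the operculates exceed one-half of the total, which agrees with the statement that they are there "actually in a majority".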
Map to illustrate the
GEOGRAPHICAL DISTRIBUTION
of the Land Mollusca of the
WEST INDIES.
The red line marks the 100 fathom line.
London: Macmillan and Co. London: Stanford’s Geogl. Estabt.
The next special feature of the sub-region is a remarkable
development of peculiar sub-genera of Helix. In this respect the
Antilles present a striking contrast to both Central and S. America,
where the prime feature of the land Pulmonata is the profusion of
Bulimus and Bulimulus, and Helix is relatively obscured. No less
than 14 sub-genera of Helix, some of which contain species of
almost unique beauty and size, are quite peculiar to the Greater
Antilles, and some are peculiar to individual islands.
Here, too, is the metropolis of Cylindrella (of which there are 130
species in Cuba alone), a genus which just reaches S. America, and
has a few species along the eastern sea-board of the Gulf of Mexico.
Macroceramus and Strophia are quite peculiar; the former, a genus
allied to Cylindrella, which attains its maximum in Cuba and San
Domingo, is scarcely represented in Jamaica, and disappears south
of Anguilla; the latter, a singular form, resembling a large Pupa in
shape, which also attains its maximum in Cuba, is entirely wanting in
Jamaica, and has its last representative in S. Croix. One species
irregularly occurs at Curaçao.
The carnivorous group of land Mollusca are represented by
several peculiar forms of Glandina, which attain their maximum in
Jamaica and Cuba, but entirely disappear in the Lesser Antilles.
A certain number of the characteristic N. American genera are
found in the Antillean Sub-region, indicating a former connexion,
more or less intimate, between the W. Indies and the mainland. The
genera are all of small size. The characteristic N. American Hyalinia
are represented in Cuba, San Domingo, and Porto Rico; among the
Helicidae, Polygyra reaches Cuba, but no farther, and Strobila
Jamaica. The fresh-water Pulmonata are of a N. American type, as
far as the Greater Antilles are concerned, but the occurrence of
Gundlachia (Tasmania and Trinidad only) in Cuba is an unexplained
problem at present. Unionidae significantly occur only at the two
ends of the chain of islands, not reaching farther than Cuba (Unio 3
sp.) at one end, and Trinidad (which is S. American) at the other.
A small amount of S. American influence is perceptible throughout
the Antilles, chiefly in the occurrence of a few species of Bulimulus
and Simpulopsis. The S. American element may have strayed into
the sub-region by three distinct routes: (1) by way of Trinidad,
Tobago, and the islands northward; (2) by a north-easterly extension
of Honduras towards Jamaica, forming a series of islands of which
the Rosalind and Pedro banks are perhaps the remains; (3) by a
similar approximation of the peninsula of Yucatan and the western
extremity of Cuba. Central America is essentially S. American in its
fauna, and the characteristic genera of Antillean operculates which
occur on its eastern coasts are sufficient evidence of the previous
existence of a land connexion more or less intimate (see map).
(a) Cuba is by far the richest of the Antilles in land Mollusca, but it
must be remembered that it is also much better explored than San
Domingo, the only island likely to rival it in point of numbers. It
contains in all 658 species, of which 620 are land and 38 fresh-
water, the land operculates alone amounting to 252.
Carnivorous genera form but a small proportion of the whole.
There are 18 Glandina (which belong to the sections Varicella and
Boltenia) and 4 Streptostyla, the occurrence of this latter genus
being peculiar to Cuba and Haiti (1 sp.) among the Antilles, and
associating them closely with the mainland of Central America,
where Streptostyla is abundant. These two genera alone represent
the Agnatha throughout the sub-region.
There are no less than 84 species of Helix, belonging to 12 sub-
genera. Only one of these (Polymita) is quite peculiar to Cuba, but of
7 known species of Jeanerettia and 8 of Coryda, 6 and 7
respectively are Cuban. Thelidomus has 15 species (Jamaica 3,
Porto Rico 3); Polydontes has 3, the only other being from Porto
Rico; Hemitrochus has 12 (Jamaica 1, Bahamas 6); Cysticopsis 9
(Jamaica 6); Eurycampta 4 (Bahamas 1).
The Cylindrellidae find their maximum development in Cuba. As
many as 34 Macroceramus occur (two-thirds of the known species),
and 130 Cylindrella, some of the latter being most remarkable in
form (see Fig. 151, B, p. 247).
The land operculates belong principally to the families
Cyclostomatidae and Helicinidae. Of the former, Cuba is the
metropolis of Ctenopoma and Chondropoma, the former of which
includes 30 Cuban species, as compared with 1 from San Domingo
and 2 from Jamaica. Megalomastoma (Cyclophoridae) is also
Haitian and Porto Rican, but not Jamaican. Blaesospira, Xenopoma,
and Diplopoma are peculiar. The Helicinidae consist mainly of
Helicina proper (58 sp.), which here attains by far its finest
development in point of size and beauty, and of Eutrochatella (21
sp.), which is peculiar to the three great islands (Jamaica 6 sp., San
Domingo 6 sp.).
The Bahamas, consisting in all of more than 700 islands, are very
imperfectly known, but appear to be related partly to Cuba, partly to
San Domingo, from each of which they are separated by a narrow
channel of very deep water. They are certainly not rich in the
characteristic groups of the Greater Antilles. The principal forms of
Helix are Plagioptycha (6 sp.), common with San Domingo, and
Hemitrochus (6 sp.), common with Cuba. Strophia is exceedingly
abundant, but Cylindrella, Macroceramus, and Glandina have but
few species. There are a few species of Ctenopoma, Chondropoma,
and Cistula, while a single Schasicheila (absent from the rest of the
sub-region) forms a link with Mexico.

Fig. 229.—Characteristic Cuban Helices. A, Polydontes imperator
Montf. B, Caracolus rostrata Pfr. C, Polymita muscarum Lea.
Southern Florida, with one or two species each of Hemitrochus,
Cylindrella, Macroceramus, Strophia, Ctenopoma, and
Chondropoma, belongs to this province.
(b) Jamaica.—The land Mollusca of Jamaica are, in point of
numbers and variety, quite unequalled in the world. There are in all
as many as 56 genera and more than 440 species, the latter being
nearly all peculiar. The principal features are the Glandinae, the
Helicidae, and the land operculates. The Glandinae belong
principally to the sub-genera Varicella, Melia, and Volutaxis,
Streptostyla being absent, although occurring in Cuba and San
Domingo. There are 10 genera of Helix, of which Pleurodonta is
quite peculiar, while Sagda (13 sp.) is common only with S.W. San
Domingo (2 sp.), and Leptoloma (8 sp.) only with Cuba (1 sp.). The
single Strobila seems to be a straggler from a N. American source.
Macroceramus has only 2 species as against 34 in Cuba, and of
Cylindrella, in which Cuba (130 sp.) is so rich, only 36 species occur.
The genus Leia, however (14 sp.), is all but peculiar, occurring
elsewhere only in the neighbouring angle of San Domingo, which is
so closely allied with Jamaica. The complete absence of Strophia is
remarkable.

Fig. 230.—Characteristic Jamaican and Haitian Mollusca: A, Sagda
epistylium Müll., Jamaica; B, Chondropoma salleanum Pfr., San
Domingo; C, Eutrochatella Tankervillei Gray, Jamaica; D,
Cylindrella agnesiana C. B. Ad., Jamaica.
The land operculates form the bulk of the land fauna, there being
actually 242 species, as against 221 of land Pulmonata, a proportion
never again approached in any part of the world. As many as 80 of
these belong to the curious little genus Stoastoma, which is all but
peculiar to the island, one species having been found in San
Domingo, and one in Porto Rico. Geomelania and Chittya, two
singular inland forms akin to Truncatella, are quite peculiar. Alcadia
reaches its maximum of 14 species, as against 4 species in San
Domingo and 9 species in Cuba, and Lucidella is common to San
Domingo only; but, if Stoastoma be omitted, the Helicinidae
generally are not represented by so many or by so striking forms as
in Cuba, which has 90 species, as against Jamaica 44, and San
Domingo 35.
(c) San Domingo, although not characterised by the extraordinary
richness of Cuba and Jamaica, possesses many specially
remarkable forms of land Mollusca, to which a thorough exploration,
when circumstances permit, will no doubt make important additions.
From its geographical position, impinging as it does on all the islands
of the Greater Antilles, it would be expected that the fauna of San
Domingo would not exhibit equal signs of isolation, but would appear
to be influenced by them severally. This is exactly what occurs, and
San Domingo is consequently, although very rich in peculiar species,
not equally so in peculiar genera. The south-west district shows
distinct relations with Jamaica, the Jamaican genera Leia,
Stoastoma, Lucidella, and the Thaumasia section of Cylindrella
occurring here only. The north and north-west districts are related to
Cuba, while the central district, consisting of the long band of
mountainous country which traverses the island, contains the more
characteristic Haitian forms.
The Helicidae are the most noteworthy of the San Domingo land
Mollusca. The group Eurycratera, which contains some of the finest
existing land snails, is quite peculiar, while Parthena, Cepolis,
Plagioptycha, and Caracolus here reach their maximum. The
Cylindrellidae are very abundant, but no section is peculiar. Land
operculates do not bear quite the same proportion to the Pulmonata
as in Cuba and Jamaica, but they are well represented (100 to 152);
Rolleia is the only peculiar genus.
The relations of San Domingo to the neighbouring islands are
considerably obscured by the fact that they are well known, while
San Domingo is comparatively little explored. To this may perhaps
be due the curious fact that there are actually more species common
to Cuba and Porto Rico (26) than to Porto Rico and San Domingo.
Cuba shares with San Domingo its small-sized Caracolus and also
Liguus, but the great Eurycratera, Parthena, and Plagioptycha are
wholly wanting in Cuba. The land operculates are partly related to
Cuba, partly to Jamaica, thus Choanopoma, Ctenopoma, Cistula,
Tudora, and many others, are represented on all these islands, while
the Jamaican Stoastoma occurs on San Domingo and Porto Rico,
but not on Cuba, and Lucidella is common to San Domingo and
Jamaica alone. An especial link between Jamaica and San Domingo
is the occurrence in the south-west district of the latter island of
Sagda (2 sp.). The relative numbers of the genera Strophia,
Macroceramus, and Helicina, as given below (p. 351), are of interest
in this connexion.
Porto Rico, with Vièque, is practically a fragment of San Domingo.
The points of close relationship are the occurrence of Caracolus,
Cepolis, and Parthena among the Helicidae, and of Simpulopsis,
Pseudobalea, and Stoastoma. Cylindrella and Macroceramus are
but poorly represented, but Strophia still occurs. The land
operculates (see the Table) show equal signs of removal from the
headquarters of development. Megalomastoma, however, has some
striking forms. The appearance of a single Clausilia, whose nearest
relations are in the northern Andes, is very remarkable. Gaeotis,
which is allied to Peltella (Ecuador only), is peculiar.
Fig. 231.—Examples of West Indian
Helices: A, Helix (Parthena)
angulata Fér., Porto Rico; B,
Helix (Thelidomus) lima Fér.,
Vièque; C, Helix (Dentellaria) nux
denticulata Chem., Martinique.
Land Mollusca of the Greater Antilles
Cuba. Jamaica. S. Domingo. Porto Rico.
Glandina 18 24 15 8
Streptostyla 4 ... 2 ...
Volutaxis ... 11 (?) 1 ...
Selenites 1 ... ... ...
Hyalinia 4 11 5 6
Patula 5 1 ... ...
Sagda ... 13 2 ...
Microphysa 7 18 8 3
Cysticopsis 9 6 ... ...
Hygromia (?) ... ... 3 ...
Leptaxis (?) ... ... 1 ...
Polygyra 2 ... ... ...
Jeanerettia 6 ... ... 1
Euclasta ... ... ... 4
Plagioptycha ... ... 14 2
Strobila ... 1 ... ...
Dialeuca ... 1 ... ...
Leptoloma 1 8 ... ...
Eurycampta 4 ... ... ...
Coryda 7 ... ... ...
Thelidomus 15 3 ... 3
Eurycratera ... ... 7 ...
Parthena ... ... 2 2
Cepolis ... ... 3 1
Caracolus 8 ... 6 2
Polydontes 3 ... ... 1
Hemitrochus 12 1 ... ...
Polymita 5 ... ... ...
Pleurodonta ... 34 ... ...
Inc. sed. 5 ... ... ...
Simpulopsis ... ... 1 1
Bulimulus 3 3 6 7
Orthalicus 1 1 ... ...
Liguus 3 ... 1 ...
Gaeotis ... ... ... 3
Pineria 2 ... ... 1
Macroceramus 34 2 14 3
Leia ... 14 2 ...
Cylindrella 130 36 35 3
Pseudobalea 2 ... 1 1
Stenogyra 6 7 (?) ...
Opeas 8 (?) 4 6
Subulina 6 14 2 2
Glandinella 1 ... ... ...
Spiraxis 2 (?) 2 1
Melaniella 7 ... ... ...
Geostilbia 1 ... 1 ...
Cionella 2 ... ... ...
Leptinaria ... 1 ... 3
Obeliscus ... ... 1 2
Pupa 2 7 3 2
Vertigo 4 ... ... ...
Strophia 19 ... 3 2
Clausilia ... ... ... 1
Succinea 11 2 5 3
Vaginula 2 2 2 1
Megalomastoma 13 ... 1 3
Neocyclotus 1 33(?) ... ...
Licina 1 ... 3 ...
Jamaicia ... 2 ... ...
Crocidopoma ... 1 3 ...
Rolleia ... ... 1 ...
Choanopoma 25 12 19 3
Ctenopoma 30 2 1 ...
Cistula 15 3 3 3
Chondropoma 57 (?) 19 4
Tudora 7 17 5 ...
Adamsiella 1 12 ... ...
Blaesospira 1 ... ... ...
Xenopoma 1 ... ... ...
Cistula 15 3 3 ...
Colobostylus 4 13 5 ...
Diplopoma 1 ... ... ...
Geomelania ... 21 ... ...
Chittya ... 1 ... ...
Blandiella ... ... 1 ...
Stoastoma ... 80 1 1
Eutrochatella 21 6 6 ...
Lucidella ... 4 1 ...
Alcadia 9 14 4 ...
Helicina 58 16 24 9
Proserpina 2 4 ... ...
The Virgin Is., with St. Croix, Anguilla, and the St. Bartholomew
group (all of which are non-volcanic islands), are related to Porto
Rico, while Guadeloupe and all the islands to the south, up to
Grenada (all of which are volcanic), show marked traces of S.
American influence. St. Kitt’s, Antigua, and Montserrat may be
regarded as intermediate between the two groups. St. Thomas, St.
John, and Tortola have each one Plagioptycha and one Thelidomus,
while St. Croix has two sub-fossil Caracolus which are now living in
Porto Rico, together with one Plagioptycha and one Thelidomus
(sub-fossil). The gradual disappearance of some of the characteristic
greater Antillean forms, and the appearance of S. American forms in
the Lesser Antilles, is shown by the following table:—
Island        Bulimulus  Cylindrella  Macroceramus  Cyclostomatidae, etc.  Dentellaria  Cyclophorus  Amphibulimus  Homalonyx
Porto Rico    7          3            3             23                     .            .            .             .
St. Thomas    4          2            1             4                      .            .            .             .
St. Jan       2          1            1             1                      .            .            .             .
St. Croix     4          1            .             5                      .            .            .             .
Tortola       1          1            2             1                      .            .            .             .
Anguilla      2          .            1             1                      .            .            .             .
St. Kitt’s    2          .            .             1                      1            .            .             .
Antigua       3          .            .             .                      1            .            .             .
Guadeloupe    8          .            .             4                      8            1            2             1
Dominica      9          1            .             .                      5            2            3             1
Martinique    5          1            .             .                      11           2            1             .
St. Lucia     3          1            .             .                      2            .            .             .
Barbados      3          1            .             .                      2            .            .             .
St. Vincent   6          .            .             .                      .            .            .             .
Grenada       2          .            .             .                      1            .            .             .
Trinidad      4          1            .             1                      1            .            .             .

(d) In Guadeloupe we find Cyclophorus, Amphibulimus,
Homalonyx, and Pellicula, which are characteristic of S. America,
and nearly all recur in Dominica and Martinique. These islands are
the metropolis of Dentellaria, a group of Helix, evidently related to
some of the forms developed in the Greater Antilles. Stragglers
occur as far north as St. Kitt’s and Antigua, and there are several on
the mainland as far south as Cayenne. Traces of the great Bulimus,
so characteristic of South America, occur as far north as S. Lucia,
where also is found a Parthena (San Domingo and Porto Rico).
Trinidad is markedly S. American; 55 species in all are known, of
which 22 are peculiar, 28 are common to S. America (8 of these
reach no farther north along the islands), and only 5 are common to
the Antilles, but not to S. America. The occurrence of Gundlachia in
Trinidad has already been mentioned.
The Bermudas show no very marked relationship either to the N.
American or to the West Indian fauna. In common with the former
they possess a Polygyra, with the latter (introduced species being
excluded) one species each of Hyalosagda, Subulina, Vaginula, and
Helicina, so that, on the whole, they may be called West Indian. The
only peculiar group is Poecilozonites, a rather large and depressed
shell of the Hyalinia type.
(2) The Central American Sub-region may be regarded as
extending from the political boundary of Mexico in the north to the
isthmus of Panama in the south. It thus impinges on three important
districts—the N. American, West Indian, and S. American; and it
appears, as we should perhaps expect, that the two latter of these
regions have considerably more influence upon its fauna than the
former. Of the N. American Helicidae, Polygyra is abundant in
Mexico only, and two species of Strobila reach N. Guatemala, while
the Californian Arionta occurs in Mexico. S. American Helicidae, in
the sub-genera Solaropsis and Labyrinthus, occur no farther north
than Costa Rica. Not a single representative of any of the
characteristic West Indian Helicidae occurs. Bulimulus and
Otostomus, which form so large a proportion of the Mollusca of
Venezuela, Colombia, Ecuador, and Peru, together with Orthalicus,
are abundant all over the region. Again, Cylindrella, Macroceramus,
and some of the characteristic Antillean operculates, are
represented, their occurrence being in most cases limited to the
eastern coast-line and eastern slope of the central range.
Besides these external elements, the region is rich in indigenous
genera. Central America is remarkable for an immense number of
large carnivorous Mollusca possessing shells. There are 49 species
of Glandina, the bulk of which occur in eastern and southern Mexico;
36 of Streptostyla (S.E. Mexico and Guatemala, only 1 species
reaching Venezuela and another Peru); 5 of Salasiella, 2 of Petenia,
and 1 of Strebelia; the last three genera being peculiar. Streptaxis,
fairly common in S. America, does not occur. Velifera and
Cryptostracon, two remarkable slug-like forms, each with a single
species, are peculiar to Costa Rica. Among the especial peculiarities
of the region are the giant forms belonging to the Cylindrellidae,
which are known as Holospira, Eucalodium, and Coelocentrum (Fig.
232). They are almost entirely peculiar to Mexico, only 7 out of a
total of 33 reaching south of that district, and only 1 not occurring in it
at all.
Fig. 232.—Examples of
characteristic Mexican
Mollusca: A, Coelocentrum
turris Pfr.; B, Streptostyla
Delattrei Pfr.
The land operculates are but scanty. Tomocyclus and
Amphicyclotus are peculiar, and Schasicheila, a form of Helicina,
occurs elsewhere only in the Bahamas. Ceres (see Fig. 18, C, p. 21)
and Proserpinella, two remarkable forms of non-operculate
Helicinidae (compare the Chinese Heudeia), are quite peculiar.
Pachychilus, one of the characteristic fresh-water genera, belongs to
the S. American (Melaniidae) type, not to the N. American
(Pleuroceridae). Among the fresh-water Pulmonata, the Aplecta are
remarkable for their great size and beauty. In the accompanying
table “Mexico” is to be taken as including the region from the United
States border up to and including the isthmus of Tehuantepec, and
“Central America” as the whole region south of that point.
Land Mollusca of Central America
Mexico only.  Central America only.  Common to both.
Strebelia 1 ... ...
Glandina 33 13 3
Salasiella 4 ... 1
Streptostyla 18 12 6
Petenia ... 1 1
Limax ... 1 ...
Velifera ... 1 ...
Omphalina 10 1 1
Hyalinia 2 5 3
Guppya ... 8 3
Pseudohyalina 2 ... 2
Tebennophorus 1 ... ...
Cryptostracon ... 1 ...
Xanthonyx 4 ... ...
Patula 3 ... 4
Acanthinula 1 2 2
Vallonia ... 1 ...
Trichodiscus 2 2 3
Praticolella 1 ... 1
Arionta 3 ... ...
Lysinoe 1 1 1
Oxychona 2 5 ...
Solaropsis ... 2 ...
Polygyra 14 1 2
Strobila 1 1 ...
Labyrinthus ... 5 ...
Otostomus 23 20 7
Bulimulus 6 5 2
Berendtia 1 ... ...
Orthalicus 6 3 3
Pupa 1 1 1
Vertigo 1 ... ...
Holospira 12 ... ...
Coelocentrum 6 1 1
Eucalodium 15 ... 5
Cylindrella 6 4 ...
Macroceramus 2 1 ...
Simpulopsis 2 1 ...
Caecilianella 1 ... ...
Opeas 1 2 3
Spiraxis 8 2 1
Leptinaria ... 2 ...
Subulina 2 3 4
Succinea 11 3 1
Vaginula 1 ... ...
Aperostoma ... 4 1
Amphicyclotus 2 1 2
Cystopoma 2 ... ...
Tomocyclus ... 1 2
