DEBS 2022
ABSTRACT
Recent developments in real-time analytics frameworks and real-time
database management systems coping with high-velocity data, together
with affordable cluster hardware, make it possible to rethink legacy
financial systems. The DEBS 2022 contest deals with computing specific
trend indicators and detecting patterns in real time, in order to advise
customers on buying or selling on the financial markets. In response to
the contest, we design and implement a scalable solution for real-time
analytics and recommendations on financial instruments. Our solution is
based on Apache Spark, an open-source unified analytics engine for
large-scale data processing.
CCS CONCEPTS
• Information systems → Data stream mining; Data analytics;
3 DESCRIPTION OF THE SOLUTION
Our solution is based on Apache Spark [3], which supports in-memory
processing across a cluster of machines. In-memory computing accelerates
data processing. Spark is also highly fault-tolerant: if one node fails,
its tasks are redistributed across the other nodes. Moreover, the cluster
scales horizontally with no downtime. With respect to the requirements
described in [2], we propose the workflow illustrated in Figure 2. We
implement the workflow in Java on Apache Spark, using resilient
distributed datasets (RDDs). Each newly received batch of events is
stored in an RDD, which is partitioned to enable parallel processing;
by default, an RDD partition has a typical size of 128 MB. Both the
received batch and the EMAs and Crossovers per Symbol snapshot have a
very small memory footprint; consequently, they should be explicitly
partitioned. Although symbols are strings, we checked that events are
evenly distributed over partitions, and no data skew is observed with
the default partitioner.

In each batch, we look up, for each symbol, the last received event.
This step outputs the Latest Event per Symbol RDD. Processing the first
batch initializes the EMA measures for each symbol received in that
batch and computes the EMAs and Crossovers per Symbol snapshot.
Subsequent batches include already received symbols as well as new
symbols. This type of processing requires upsert operations. Since RDDs
are immutable and upserts are not supported over them, we rebuild the
EMAs and Crossovers per Symbol snapshot by performing a full outer join
between the snapshot RDD and the Latest Event per Symbol RDD of the
current batch. The EMAs and Crossovers of symbols present in both pair
RDDs are updated, while those of new symbols are initialized. Each batch
also includes a list of symbols for which the EMAs (Q1) and the last
three crossover events (Q2) are requested. If the list size is greater
than a threshold value, we create an RDD from this list and perform a
broadcast hash join with each partition of the EMAs and Crossovers per
Symbol snapshot; otherwise, we broadcast the list to all workers and use
pair RDD filtering. The result of this step is a pair RDD that is
collected as a Map at the Spark driver in order to send the results of
Q1 and Q2.

[Figure 2 diagram: New Batch of Events → find the last received event
per symbol (MapToPair, ReduceByKey) → Latest Event per Symbol → first
batch: initialize EMAs for each received symbol (MapToPair); i-th batch:
upsert EMAs (FullOuterJoin, MapToPair) using the EMA38p, EMA100p, EMA38c,
EMA100c coefficients → Snapshot of EMAs and Crossovers per Symbol →
filter by list of symbols OR perform a broadcast hash join (run Q1 and
Q2 on summary)]
Figure 2: Workflow of a batch data processing: tasks are shaded in blue,
data flows are shaded in purple, static data are shaded in yellow, and
the big data snapshot over batches is shaded in green.

4 PERFORMANCE ANALYSIS
We set up scripts for reserving and configuring nodes to run Apache
Spark in distributed mode on the French cluster testbed Grid'5000 [?].
We conduct the experiments with a batch size of 10,000 events. Two types
of Apache Spark deployments were used, namely:
• Apache Spark Local Mode: this test fails to complete the processing of
all batches. It completes the first 1,000 batches within 6 minutes,
2,000 batches within 20 minutes, and 3,700 batches within 1 hour. It
crashes with java.lang.OutOfMemoryError: Java heap space after
processing 3,750 batches and consuming 42 GB of memory.
• Apache Spark Standalone Mode: this deployment fails to process the
received batches. The Kryo serializer does not cope automatically with
Google protocol buffer and Java collection types. We are working on this
issue to make the code run on this deployment using existing serializer
implementations². The alternative is to create Java classes wrapping the
generated classes using serializable data types (Event.class,
CrossoverEvent.class, et cetera).

5 CONCLUSION AND FUTURE WORK
In summary, in this paper we propose a workflow for processing batches,
as well as the design and implementation of a software application based
on Apache Spark. Current work mainly focuses on making the code perform
well on a distributed cluster testbed.

² https://github.com/magro/kryo-serializers

REFERENCES
[1] Sebastian Frischbier, Mario Paic, Alexander Echler, and Christian
Roth. 2019. Managing the Complexity of Processing Financial Data at
Scale - An Experience Report. In Complex Systems Design and Management.
Springer International Publishing, 14–26.
[2] Ruben Mayer and Jawad Tahir. 2022. The DEBS 2022 Grand Challenge. In
Proceedings of the 16th ACM International Conference on Distributed and
Event-Based Systems.
[3] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin
Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica.
2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing. In Proceedings of the 9th USENIX Symposium
on Networked Systems Design and Implementation. 15–28.
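The per-batch snapshot upsert of Section 3 can be sketched in plain Java,
with HashMaps standing in for Spark pair RDDs: symbols present on both
sides are updated, new symbols are initialized, and symbols absent from
the batch are carried over unchanged, mirroring the full outer join. The
EMA smoothing factor 2/(1 + j) follows the challenge definition [2]; all
class and method names below are illustrative, not taken from our
implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the per-batch snapshot upsert (not our Spark code).
public class EmaUpsertSketch {

    /** EMA state kept per symbol in the snapshot. */
    public static final class EmaState {
        public final double ema38;
        public final double ema100;
        public EmaState(double ema38, double ema100) {
            this.ema38 = ema38;
            this.ema100 = ema100;
        }
    }

    /** One EMA step: price * a + previous * (1 - a), with a = 2 / (1 + j). */
    public static double ema(double price, double previous, int j) {
        double a = 2.0 / (1 + j);
        return price * a + previous * (1 - a);
    }

    /**
     * Rebuilds the snapshot from the latest price per symbol, mimicking the
     * full outer join: existing symbols are updated, first-seen symbols are
     * initialized (previous EMA taken as 0), absent symbols carry over.
     */
    public static Map<String, EmaState> upsert(Map<String, EmaState> snapshot,
                                               Map<String, Double> latestPrice) {
        Map<String, EmaState> next = new HashMap<>(snapshot); // carry over
        for (Map.Entry<String, Double> e : latestPrice.entrySet()) {
            EmaState prev = snapshot.get(e.getKey());
            double p38 = prev == null ? 0.0 : prev.ema38;
            double p100 = prev == null ? 0.0 : prev.ema100;
            double price = e.getValue();
            next.put(e.getKey(), new EmaState(ema(price, p38, 38),
                                              ema(price, p100, 100)));
        }
        return next;
    }
}
```

In the actual Spark implementation, the same merge semantics are obtained
with fullOuterJoin followed by mapToPair, which rebuilds a fresh snapshot
RDD instead of mutating state in place.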
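The crossover detection behind Q2 can likewise be sketched under the
usual reading of the challenge [2]: a bullish event is emitted when EMA38
crosses above EMA100 between two consecutive EMA computations, a bearish
event when it crosses below, and only the last three events per symbol
are retained. Names below are illustrative, not taken from our Spark code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the Q2 crossover logic (assumed semantics, not our Spark code).
public class CrossoverSketch {

    public enum Crossover { NONE, BULLISH, BEARISH }

    /** Compares the previous and current EMA pairs of one symbol. */
    public static Crossover detect(double ema38Prev, double ema100Prev,
                                   double ema38Cur, double ema100Cur) {
        if (ema38Prev <= ema100Prev && ema38Cur > ema100Cur)
            return Crossover.BULLISH; // crossed above: buy advice
        if (ema38Prev >= ema100Prev && ema38Cur < ema100Cur)
            return Crossover.BEARISH; // crossed below: sell advice
        return Crossover.NONE;
    }

    /** Appends a crossover event, keeping only the last three (Q2). */
    public static void record(Deque<Crossover> lastThree, Crossover ev) {
        if (ev == Crossover.NONE) return;
        lastThree.addLast(ev);
        if (lastThree.size() > 3) lastThree.removeFirst();
    }
}
```

The bounded three-element buffer is what the snapshot stores per symbol,
so the broadcast join or filter step of the workflow can answer Q2
directly from the snapshot without revisiting past batches.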