DEBS 2022

Real-time Analytics and Recommendations on Financial

Instruments Flows Using Apache Spark


Rim Moussa
rim.moussa@enicarthage.rnu.tn
University of Carthage
Carthage, Tunisia

ABSTRACT
Recent developments in real-time analytics frameworks and real-time database management systems coping with high velocity, together with affordable cluster hardware, allow rethinking legacy financial systems. The DEBS'2022 contest deals with computing specific trend indicators in real time and detecting patterns in real time, in order to advise customers on buying or selling on the financial markets. In response to the DEBS'2022 contest, we design and implement a scalable solution for real-time analytics and recommendations on financial instruments. It is based on Apache Spark, an open-source unified analytics engine for large-scale data processing.

CCS CONCEPTS
• Information systems → Data stream mining; Data analytics;

KEYWORDS
Financial Symbols, Time-series Analytics, Big Data

Figure 1: Big picture.
ACM Reference Format:
Rim Moussa. 2021. Real-time Analytics and Recommendations on Financial Instruments Flows Using Apache Spark. In The 15th ACM International Conference on Distributed and Event-based Systems (DEBS '21), June 28-July 2, 2021, Virtual Event, Italy. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3465480.3466931

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DEBS '21, June 28-July 2, 2021, Virtual Event, Italy
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8555-8/21/06...$15.00
https://doi.org/10.1145/3465480.3466931

1 INTRODUCTION
Recent developments in in-memory analytics frameworks (Apache Spark, Apache Flink) and real-time database management systems (key-value stores), both coping with high velocity, together with affordable cluster hardware, allow rethinking legacy financial systems. Traders track and evaluate quantitative indicators to identify trends in the development of instruments' prices. Trends can be either upwards (i.e., prices increase) or downwards (i.e., prices start to drop). Identifying the start of an uptrend is crucial in order to buy while the price is still low and to sell as soon as a downtrend begins, so that customers are advised correctly. The speed of getting insights from real-time financial instruments' data allows fast and wise decisions.

The DEBS'2022 experiment concerns the real-time processing of more than 5,504 financial instruments from three major European exchanges: Paris (FR), Amsterdam (NL), and Frankfurt/Xetra (ETR). The data flows transmitted in real time contain 289 million events received by Infront Financial Technology GmbH¹ from November 8 to 14, 2021. These events relate to a particular type of instrument. In reality, velocity is much higher: in 2019, Infront Financial Technology GmbH captured an average of 18 trillion notifications per day, and this increased in 2021 to reach 24 trillion notifications [1].

¹ https://www.infrontfinance.com/

The paper outline is as follows: first, we briefly describe the dataset and the two queries; then, we describe our solution and report performance measurements.

2 DATASET AND WORKLOAD
The input is a batch of n events. Each event encloses pricing information at a timestamp. Two queries are stated in the DEBS'2022 contest [2]. Given a list of financial symbols to track,
• Query 1: returns the latest EMA-38 and EMA-100 (abbrev. Exponential Moving Average) measures for each symbol in the list.
• Query 2: returns the last three Crossover Events, i.e., buy/sell advice with timestamps, for each symbol in the list. A Crossover detects a breakout, recommending either to buy or to sell a financial symbol.
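The EMA recurrence and the crossover rule behind Q1 and Q2 can be sketched in plain Java. This is a minimal illustration assuming the usual smoothing factor 2/(1+j) with EMA initialized to 0; the class and method names are ours, not taken from the paper's implementation:

```java
/** Illustrative sketch of the per-symbol EMA and crossover logic (names are ours). */
public class EmaCrossover {

    /** One EMA step for window j: ema_i = price * (2/(1+j)) + ema_{i-1} * (1 - 2/(1+j)). */
    static double nextEma(double prevEma, double lastPrice, int j) {
        double alpha = 2.0 / (1 + j);
        return lastPrice * alpha + prevEma * (1 - alpha);
    }

    /** "BUY" when EMA-38 crosses EMA-100 from below, "SELL" on the opposite crossover, null otherwise. */
    static String crossover(double prev38, double prev100, double cur38, double cur100) {
        if (prev38 <= prev100 && cur38 > cur100) return "BUY";
        if (prev38 >= prev100 && cur38 < cur100) return "SELL";
        return null;
    }

    public static void main(String[] args) {
        double ema38 = 0, ema100 = 0;
        for (double price : new double[] {10.0, 10.5, 11.0}) {
            double n38 = nextEma(ema38, price, 38);
            double n100 = nextEma(ema100, price, 100);
            String signal = crossover(ema38, ema100, n38, n100);
            System.out.printf("ema38=%.4f ema100=%.4f signal=%s%n", n38, n100, signal);
            ema38 = n38;
            ema100 = n100;
        }
    }
}
```

Because the shorter window reacts faster to price changes, EMA-38 overtaking EMA-100 signals an emerging uptrend (buy), and the reverse signals a downtrend (sell).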
DEBS ’21, June 28-July 2, 2021, Virtual Event, Italy R. Moussa

[Figure 2 flowchart: New Batch of Events → for each symbol, find out the last received event (MapToPair, ReduceByKey) → Latest Event per Symbol → case 1st batch: initialize EMAs for each received symbol (MapToPair); case i-th batch: upsert EMAs (FullOuterJoin, MapToPair), using the EMA38p, EMA100p, EMA38c, EMA100c coefficients → rebuild → Snapshot of EMAs and Crossovers per Symbol → filter by list of symbols OR perform a broadcast hash join [run Q1 and Q2 on summary].]

Figure 2: Workflow of batch data processing: tasks are shaded in blue, data flows are shaded in purple, static data are shaded in yellow, and the big data snapshot over batches is shaded in green.

3 DESCRIPTION OF THE SOLUTION
Our solution is based on Apache Spark [3], which supports in-memory processing across a cluster of machines. In-memory computing accelerates data processing. Spark is also highly fault-tolerant: if one node fails, the failed tasks are redistributed across the other nodes. Moreover, the cluster scales horizontally with no downtime. With respect to the requirements described in [2], we propose the workflow illustrated in Figure 2. We implement the workflow in the Java programming language using Apache Spark's resilient distributed datasets (RDDs). Each newly received batch of events is stored in an RDD, which is partitioned to enable parallel processing; by default, a typical RDD partition size is 128 MB. Both the received batch and the EMAs and Crossovers per Symbol snapshot have a very small memory footprint; consequently, they should be partitioned. Although symbols are strings, we checked that events are evenly distributed over partitions, and no data skew is observed with the default partitioner.

In each batch, we look up the last received Event for each symbol. This step outputs the Latest Event per Symbol RDD. Processing the first batch initializes the EMA measures for each symbol received in it and computes the EMAs and Crossovers per Symbol snapshot. Subsequent batches include already-received symbols as well as new symbols; this type of processing requires upsert operations. Since RDDs are immutable and upserts are not supported over RDDs, we rebuild the EMAs and Crossovers per Symbol snapshot by performing a full outer join between the EMAs and Crossovers per Symbol snapshot RDD and the Latest Event per Symbol RDD of the current batch. The EMAs and Crossovers of symbols existing in both pair RDDs are updated, while the EMAs and Crossovers of new symbols are initialized. Each batch also includes a list of symbols for which to look up the EMAs (Q1) and the last three Crossover Events (Q2). If the list size is greater than a threshold value, we create an RDD from this list and perform a broadcast hash join with each partition of the EMAs and Crossovers per Symbol snapshot; otherwise, we broadcast the list to all workers and use pair RDD filtering. The result of this step is a pair RDD, which is collected as a Map at the Spark driver in order to send the results of Q1 and Q2.

4 PERFORMANCE ANALYSIS
We set up scripts for reserving and configuring nodes to run Apache Spark in distributed mode on the French cluster utility Grid'5000 [?]. We conduct the experiments with a batch size of 10,000 events. Two types of Apache Spark deployments were used, namely,
• Apache Spark Local Mode: this test fails to complete the processing of all batches. It completes the first 1,000 batches within 6 minutes, 2,000 batches within 20 minutes, and 3,700 batches within 1 hour. It crashes with java.lang.OutOfMemoryError: Java heap space after processing 3,750 batches and consuming 42 GB of memory.
• Apache Spark Standalone Mode: this deployment fails to process the received batches. The Kryo serializer does not automatically cope with Google protocol buffer and Java collection types. We are working on this issue to make the code run on this deployment using existing implementations². The alternative is to create Java classes wrapping the generated classes using serializable data types (Event.class, CrossoverEvent.class, et cetera).

5 CONCLUSION AND FUTURE WORK
Summarizing, in this paper we propose a workflow for processing batches, as well as the design and implementation of a software application based on Apache Spark. Current work is mainly oriented toward getting the code to perform well on a distributed cluster utility.

REFERENCES
[1] Sebastian Frischbier, Mario Paic, Alexander Echler, and Christian Roth. 2019. Managing the Complexity of Processing Financial Data at Scale - An Experience Report. In Complex Systems Design and Management. Springer International Publishing, 14-26.
[2] Ruben Mayer and Jawad Tahir. 2022. The DEBS 2022 Grand Challenge. In Proceedings of the 16th ACM Intl Conference on Distributed and Event-Based Systems.
[3] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proc. of the 9th USENIX Symposium on Networked Systems Design and Implementation. 15-28.

² https://github.com/magro/kryo-serializers
