
Enhancing Cross-Sectional Currency Strategies by Context-Aware Learning to Rank with Self-Attention

Daniel Poh
Department of Engineering Science, Oxford-Man Institute of Quantitative Finance, University of Oxford
Oxford, United Kingdom
dp@robots.ox.ac.uk

Bryan Lim
Department of Engineering Science, Oxford-Man Institute of Quantitative Finance, University of Oxford
Oxford, United Kingdom
blim@eng.oxon.org

Stefan Zohren
Department of Engineering Science, Oxford-Man Institute of Quantitative Finance, University of Oxford
Oxford, United Kingdom
stefan.zohren@eng.ox.ac.uk

Stephen Roberts
Department of Engineering Science, Oxford-Man Institute of Quantitative Finance, University of Oxford
Oxford, United Kingdom
sjrob@robots.ox.ac.uk

arXiv:2105.10019v2 [q-fin.PM] 27 Jan 2022

ABSTRACT
The performance of a cross-sectional currency strategy depends crucially on accurately ranking instruments prior to portfolio construction. While this ranking step is traditionally performed using heuristics, or by sorting the outputs produced by pointwise regression or classification techniques, strategies using Learning to Rank algorithms have recently presented themselves as competitive and viable alternatives. Although the rankers at the core of these strategies are learned globally and improve ranking accuracy on average, they ignore the differences between the distributions of asset features over the times when the portfolio is rebalanced. This flaw renders them susceptible to producing sub-optimal rankings, possibly at important periods when accuracy is actually needed the most. For example, this might happen during critical risk-off episodes, which consequently exposes the portfolio to substantial, unwanted drawdowns. We tackle this shortcoming with an analogous idea from information retrieval: that a query's top retrieved documents or the local ranking context provide vital information about the query's own characteristics, which can then be used to refine the initial ranked list. In this work, we use a context-aware Learning-to-Rank model that is based on the Transformer architecture to encode top/bottom ranked assets, learn the context and exploit this information to re-rank the initial results. Backtesting on a slate of 31 currencies, our proposed methodology increases the Sharpe ratio by around 30% and significantly enhances various performance metrics. Additionally, this approach also improves the Sharpe ratio when separately conditioning on normal and risk-off market states.

KEYWORDS
momentum strategies, learning to rank, neural networks, machine learning, information retrieval, foreign exchange markets, portfolio construction

1 INTRODUCTION
Cross-sectional strategies are a popular trading style, with numerous works in academic finance documenting technical variations and their use across different asset classes [7]. Unlike time-series methods, which adopt a narrow, longitudinal focus and trade individual assets independently [34], cross-sectional strategies cover a broader slate of instruments and typically involve buying assets with the highest expected returns (Winners) while simultaneously selling those with the lowest (Losers). The classical cross-sectional momentum (CSM) of [25], for instance, was originally applied to equities and assumes the persistence of returns – trading some top segment of stocks against an equally-sized bottom segment after ranking them by returns over the past 12 months. Here we focus on the foreign exchange (FX) market, which is more liquid and enjoys larger trading volumes as well as lower transaction costs [32]. The confluence of these factors, and the fact that participants are typically sophisticated investors, raises the bar for generating consistent excess returns over time [32].

Along with the proliferation of machine learning, numerous cross-sectional strategies incorporating advanced prediction techniques such as [21, 22, 26] have been developed. One notable subset of these models employs Learning to Rank (LTR) algorithms. This class of methods explicitly accounts for the crucial expected order of returns, which is in turn an important driver of cross-sectional trading performance. The usefulness of this approach is empirically validated by [40], who demonstrate the superiority of CSM techniques employing LTR over conventional baselines on US equities.

LTR is an active area of research within Information Retrieval (IR), and focuses on training models using supervised techniques to perform ranking tasks [29]. Among the numerous studies seeking to develop better LTR algorithms, one particular broad branch, with works such as [3, 23, 39, 47], concentrates on incorporating inter-item dependencies not just at the loss level, but also within the scoring function. Within this area, some studies [2, 8, 36, 54] use re-ranking, where a sorted list from a previous ranking algorithm is used as input to model the complex interactions among items. From this, a scoring function is learned and is used to refine the original sorted list. Adopting this approach, [2] tackle a weakness of globally learned ranking functions – that is, they can sort objects accurately on average, but fare poorly over certain queries as they ignore query-specific feature distributions. Motivated by the idea that a query's top retrieved documents or the local ranking context contain information that can boost the performance of LTR systems, the authors develop the DLCM (Deep Listwise Context Model). Central to the DLCM (which was a state-of-the-art method for encoding feature vectors [36]) is a recurrent neural network (RNN) that is used to encode the local ranking context obtained by an initial sort for re-ranking. Recognising the limitations of the RNN architecture used by the DLCM, [39] replace it with a Transformer encoder [49]. By incorporating positional embeddings as a further modification, they show that this approach is suitable for re-ranking and report significant improvements over the DLCM.

By the same token, current CSM strategies built on top of globally trained LTR algorithms are able to sort assets accurately on average. However, relying on these global rankers subjects the broader strategy to inaccurate rankings at certain points in time – potentially at instances when accuracy is actually needed the most, such as during critical (and increasingly frequent) risk-off episodes, as shown in Figure 1∗. Following [39], we use a context-aware LTR model that is based on the Transformer and has established state-of-the-art performance on the MSLR-WEB30K† data set. Starting with a sorted list obtained from a previous ranking algorithm, we cast our problem in a similar supervised setting utilised by LTR and re-rank this list using our proposed model. These results are then concretely evaluated against a mix of LTR and conventional benchmarks. On a set of 31 currency pairs, we demonstrate how the performance of the original strategy can be enhanced with our approach.

Figure 1: Risk-off episodes (e.g. the prolonged period of risk-aversion around the GFC in 2008) appear to coincide with large moves in the cumulative returns of an equally-weighted portfolio of G10 currencies. Risk-off states have also increased in frequency since then.

∗ We define a single risk-off episode or state as an instance when the VIX is 5 percentage points higher than its rolling 60-day moving average; normal states are then simply states which are not risk-off states. All currencies are expressed versus the USD, and the vertical lines indicating risk-off states have been broadened to improve visibility.
† This is a large-scale LTR data set released by Microsoft that is often used to benchmark the performance of ranking algorithms.

2 RELATED WORKS
2.1 Cross-sectional Currency Momentum
Momentum strategies can be classified as being either time-series or cross-sectional. In time-series momentum, which was first proposed by [34], an instrument's trading rule relies only on its own previous returns. With cross-sectional momentum (CSM), the relative performance of assets is the essence of the strategy – after ranking securities based on some scoring model, the long/short portfolios are constructed by trading the extremes (e.g. top and bottom deciles) of the ranked list. Since [25], which demonstrates the profitability of this strategy on US equities, the literature has been galvanised by a broad spectrum of works – ranging from technical refinements spanning varying levels of sophistication [7, 22, 26, 38] to reports documenting the ubiquity of momentum in different asset classes and markets [16, 18–20, 28, 45]. Unlike the extensive works focused on equity momentum, currency momentum is centred on the time-series of individual currency pairs and is often cast as "technical trading rules", for which [33] provides a broad overview. While studies such as [12, 35] offer evidence of CSM being profitable in FX markets, their results involve trading a narrow cross-section involving only major pairs and lack a unifying analysis that explains their returns. [32] addresses this research gap and also demonstrates that portfolios constructed by trading the winner/loser segments (analogous to [25] in the equity literature) can generate high unconditional excess returns. In [7], the authors adopt a more sophisticated approach by using volatility-scaled moving-average convergence divergence (MACD) indicators as inputs. Beyond this, and unlike the comparatively abundant work on equity momentum, we find little published in terms of constructing currency CSM portfolios.

2.2 Learning to Rank
Learning to Rank (LTR) is an active research area within information retrieval focusing on developing models that learn how to sort lists of objects in order to maximise utility. Its prominence grew in parallel with the accessibility of modern computing hardware and software, and its algorithms utilise sophisticated architectures that allow them to learn in a data-driven manner – unlike their predecessors such as BM25 [44] and LMIR [41], which require no training but which instead need handcrafting and explicit design [29]. We point the interested reader to [30] for a comprehensive overview.

Given the widespread adoption of LTR across numerous applications such as search engines, e-commerce [46] and entertainment [37], there is motivation both in academia and industry to develop better models. Commonly used methods typically learn a ranking function that assigns scores to items individually (i.e. without the context of other objects in the list). Although these models are trained by optimising a loss function which may capture the interactions among items, they score items individually at inference time without accounting for any mutual influences among the items [39]. To this end, a number of studies [3, 23, 39, 47] overcome this disadvantage by modelling inter-item dependencies at both the loss level and within the scoring function. Some works [2, 36, 52, 54] within this area achieve this with re-ranking. The main idea is to explicitly model the interactions between sorted items provided by some previous ranking algorithm, and subsequently use this information to refine the same sorted list. In [2], re-ranking with the top retrieved items or local ranking context is used to remedy the deficiency of globally-learned ranking functions ignoring differences between feature distributions across queries. Specifically, the authors note that depending on query characteristics, pertinent items for different queries are often distributed differently in the feature space. Hence, while a global ranker is able to sort accurately on average, it may struggle to produce optimal results for certain queries. Inspired by [27, 43, 53], which document the effectiveness of the local ranking context in boosting the performance of text-based retrieval systems, [2] propose the DLCM model. The DLCM uses an RNN to sequentially encode the features of top results in order to learn a local context model, which is later applied to re-rank the same top results. A major drawback with RNN-based approaches, however, is that the encoded feature information degrades with encoding distance. Motivated by Transformer [49] architectures as a workaround, [36] and [39] utilise an adapted variant – exploiting both the self-attention mechanism that learns inter-item dependencies without any decay over distance, as well as the encoding procedure that allows for parallelisation [36]. By incorporating positional encodings, [39] demonstrate that this modification enables the model to perform re-ranking.

Surveying the LTR-related literature on finance, we note that existing works such as [40, 48, 50] are essentially based on globally learned rankers and are therefore affected by the concerns discussed earlier. Given that context-aware models based on Transformers have been successfully used for re-ranking [36, 39], we adopt a similar approach of applying a context-aware ranker to refine the initial results of popular LTR algorithms such as ListNet [13] and LambdaMART [11], and show that this improves performance.
3 PROBLEM DEFINITION
For a given portfolio of currencies that is rebalanced daily, the returns for a cross-sectional momentum (CSM) strategy at day t can be expressed as follows:

r_{t,t+1}^{CSM} = \frac{1}{n} \sum_{i=1}^{n} S_t^{(i)} \, \frac{\sigma_{tgt}}{\sigma_t^{(i)}} \, r_{t,t+1}^{(i)}    (1)

where r_{t,t+1}^{CSM} denotes the realised portfolio returns going from day t to t+1, n refers to the number of currency pairs in the portfolio, and S_t^{(i)} ∈ {−1, 0, 1} signifies the cross-sectional momentum signal or trading rule for pair i. The annualised target volatility σ_tgt is set at 15%, and asset returns are scaled with σ_t^{(i)}, which is an estimator of ex-ante daily volatility. For simplicity we use a rolling exponentially weighted standard deviation with a 63-day span on daily returns for σ_t^{(i)}, and note that other sophisticated methods such as GARCH [10] can be employed.
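For concreteness, the volatility-scaled portfolio return of Equation (1) can be computed with a short routine like the one below. This is a minimal sketch only: it assumes daily returns and signals are held in pandas DataFrames indexed by date with one column per currency pair, and that both σ_tgt and σ_t^{(i)} are expressed on an annualised basis (hence the sqrt(252) factor); the names `returns`, `signals` and `vol_target` are illustrative rather than taken from the paper's code.

```python
import numpy as np
import pandas as pd

def csm_portfolio_returns(returns: pd.DataFrame,
                          signals: pd.DataFrame,
                          vol_target: float = 0.15,
                          span: int = 63) -> pd.Series:
    """Volatility-scaled CSM returns in the spirit of Equation (1).

    returns: daily returns, one column per currency pair.
    signals: entries in {-1, 0, 1}, aligned with `returns`.
    """
    # Ex-ante daily volatility: rolling EW std with a 63-day span, annualised (assumption).
    daily_vol = returns.ewm(span=span).std()
    annualised_vol = daily_vol * np.sqrt(252)

    # Scale each pair's next-period return to the annualised volatility target.
    scaled = signals * (vol_target / annualised_vol) * returns.shift(-1)

    # Equal weighting across the n pairs in the portfolio.
    return scaled.mean(axis=1)
```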
3.1 Problem Formulation
We now formalise the problem setup of adapting the current LTR framework with information provided by top retrieved items. The CSM strategy can be framed similarly to an LTR problem‡, where assets are sorted based on their momentum signals to maximise trading profits when the highest/lowest ranked instruments are traded. Given a specific time t at which the portfolio is rebalanced, the feature vector of an instrument c that is available for trading can be represented as x_(t,c). Conventional LTR algorithms assume the existence of a global scoring function f that, once learned, takes x_(t,c) as input and then generates a ranking score for this object. For the same function parameterised by θ, learning its optimal form is achieved by minimising the empirical loss:

\mathcal{L}(\theta) = \sum_{t \in \tau} \ell\left( \left\{ y_{(t,c)},\ f(\mathbf{x}_{(t,c)}; \theta) \;\middle|\; c \in \mathcal{C} \right\} \right)    (2)

where τ represents all the times at which the portfolio is rebalanced, ℓ is the loss function computed with the ground truth y_(t,c) (the sorted decile labels for the next period) and the predicted score f(x_(t,c); θ). Lastly, C denotes the universe of tradeable currencies.

Suppose we now have I(R_t^m, X_t^m, f(·; θ))§, which is the local ranking context provided by some previous ranking model f. Here, R_t^m = {top-m c sorted by f(x_(t,c); θ)} and X_t^m = {x_(t,c) | c ∈ R_t^m}. The loss from ranking with the local context can be formulated as:

\mathcal{L}(\tilde{\theta}) = \sum_{t \in \tau} \ell\left( \left\{ y_{(t,c)},\ \tilde{f}(\mathbf{x}_{(t,c)}; \tilde{\theta}) \;\middle|\; c \in I(R_t^m, X_t^m, f(\cdot\,; \theta)) \right\} \right)    (3)

where f̃(·; θ̃) stands for the context-aware scoring function that is parameterised by θ̃. For the bottom retrieved items, R_t^m is defined as the lowest-m items sorted by the same function f(·; θ).

In this work, the Transformer-based architecture proposed by [39] plays the role of the context-aware ranking model f̃. This model is separately trained using the pairwise and listwise losses described in Section 5.3. We apply our model to re-rank some top and bottom m subset (which is similar in spirit to [2]), with m < n and where n = |C|. For the remainder of this paper, we drop the subscripts in x_(t,c) and y_(t,c) for brevity.

‡ For a background on the parallels between the LTR frameworks as applied in the IR and CSM settings, we refer the reader to [40].
§ [2] represent the local ranking context with I(R_t^m, X_t^m), but we include the dependence on f(·; θ) as different ranking models produce different top/bottom retrieved results, and to also better distinguish f from the proposed context-aware ranking model f̃(·; θ̃).
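To make the supervised setup of Equations (2)–(3) concrete, the decile ground-truth labels y_(t,c) can be built by bucketing each rebalance date's cross-section of next-period returns into ten groups. A minimal sketch is given below; it assumes `next_returns` is a pandas DataFrame of next-period returns indexed by rebalance date with one column per currency, and the helper name is our own.

```python
import pandas as pd

def decile_labels(next_returns: pd.DataFrame) -> pd.DataFrame:
    """Grade each currency 0-9 by its next-period return within the cross-section.

    Higher labels correspond to higher future returns, so they can serve as the
    graded relevance y_(t,c) in the LTR losses described in Section 5.3.
    """
    return next_returns.apply(
        # qcut assigns integer decile buckets 0..9 per rebalance date;
        # duplicates="drop" guards against tied return values.
        lambda row: pd.Series(pd.qcut(row, 10, labels=False, duplicates="drop"),
                              index=row.index),
        axis=1,
    )
```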
4 CONTEXT-AWARE MODEL FOR RE-RANKING
In this section, we explain the context-aware model as applied to long positions and note that the procedure is similar for shorts. Presented with an initial list of m top retrieved assets produced by a previous ranking algorithm, we want to learn a context-aware scoring function that refines this list by incorporating information across items.

Figure 2: Flow diagrams of both traditional LTR (upper half) and the context-aware model (lower half) applied to the CSM strategy. The latter exploits the ranking context provided by some previous ranking algorithm to re-rank the initial set of scores provided by the same algorithm.

4.1 Model Inputs
Inputs comprise a list of m top retrieved currencies x after an initial round of ranking by some previous LTR algorithm. Each item is first passed through a shared fully connected input layer of size d_fc.

4.1.1 Positional Encodings. To allow the model to leverage the item positions provided by the previous ranker, we add positional encodings to the input embeddings. While [49] propose both fixed and learnable positional encodings to encode item order in a list, we use the former, which is given as:

PE_{(p,\,2i)} = \sin\left( \frac{p}{10000^{2i/d_{model}}} \right)    (4)

PE_{(p,\,2i+1)} = \cos\left( \frac{p}{10000^{2i/d_{model}}} \right)    (5)

where p denotes the item's position and i indexes the embedding dimension.
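The fixed sinusoidal encodings of Equations (4)–(5) can be generated directly from the item positions. A brief NumPy sketch follows; the function name and the assumption of an even embedding dimension are our own.

```python
import numpy as np

def positional_encoding(num_items: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings, Equations (4)-(5).

    Returns an array of shape (num_items, d_model), assuming d_model is even,
    which is added to the item embeddings so the model can exploit the order
    supplied by the initial ranker.
    """
    positions = np.arange(num_items)[:, None]          # p
    dims = np.arange(0, d_model, 2)[None, :]           # 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((num_items, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even embedding dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd embedding dimensions
    return pe
```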
4.2 Encoder Layer
The encoder layer facilitates learning higher-order representations of objects in the list, and it achieves this with the attention mechanism and a feed-forward network, alongside various techniques such as dropout, residual connections and layer normalisation. A schematic of a single encoder layer is displayed on the left of Figure 3.

4.2.1 Self-Attention Mechanism. This was introduced in [49] and is the key component within the broader model architecture allowing item representations to be constructed. The attention mechanism uses the query, key and value matrices as inputs. Self-attention is a variant of attention where these matrices are identical; in our work, they are representations of the top retrieved items. We compute attention using the form referred to as Scaled Dot-Product¶:

\mathrm{Att}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left( \frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{model}}} \right) \mathbf{V}    (6)

where Q is a query matrix of dimension d_model that contains all items in the list, K and V are respectively the key and value matrices, and lastly 1/\sqrt{d_{model}} is a scaling factor that mitigates issues arising from small gradients in the softmax operator.

¶ There are different ways of computing attention, such as [31] and [5].

The model's capacity to learn representations can be further enhanced by ensembling multiple attention modules in what [49] refer to as multi-head attention:

\mathrm{MHA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,\mathbf{W}^{O}    (7)

\mathrm{head}_i = \mathrm{Att}(\mathbf{Q}\mathbf{W}_i^{Q},\ \mathbf{K}\mathbf{W}_i^{K},\ \mathbf{V}\mathbf{W}_i^{V})    (8)

Here, each head_i (out of h heads) refers to the i-th attention mechanism of Equation 6, with learned parameter matrices

\mathbf{W}_i^{Q} \in \mathbb{R}^{d_{model} \times d_q}, \quad \mathbf{W}_i^{K} \in \mathbb{R}^{d_{model} \times d_k}, \quad \mathbf{W}_i^{V} \in \mathbb{R}^{d_{model} \times d_v}, \quad \mathbf{W}^{O} \in \mathbb{R}^{h d_v \times d_{model}}

where typically d_q = d_k = d_v = d_model / h is used.

4.2.2 Feed-Forward Network. This component within the model introduces non-linearity via its activation and facilitates interactions across different parts of the inputs.

4.2.3 Stacking Encoder Layers. A single encoder block ξ(·) can be expressed as:

\xi(\mathbf{x}) = \Gamma\big( \mathbf{z} + \delta(\phi(\mathbf{z})) \big)    (9)

\mathbf{z} = \Gamma\big( \mathbf{x} + \delta(\mathrm{MHA}(\mathbf{x})) \big)    (10)

where Γ(·) and δ(·) are respectively the layer normalisation and dropout functions. Additionally, φ(·) represents a projection onto a fully-connected layer, and MHA(·) is the multi-head attention module. Stacking multiple encoder layers iteratively feeds an encoder's output into the next, and this grows the model's ability to learn more complex representations:

\boldsymbol{\xi}(\mathbf{x}) = \xi_1\big( \dots ( \xi_N(\mathbf{x}) ) \big)    (11)

where ξ(x) in Equation 11 denotes a stacked encoder involving a series of N encoder layers.
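As an illustration, Equations (9)–(10) map onto a standard Keras encoder block as sketched below. The paper's models are built with TensorFlow [1], but the exact layer composition here (e.g. the two-layer feed-forward network with a ReLU activation) is an assumption on our part rather than the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class EncoderLayer(layers.Layer):
    """One encoder block in the spirit of Equations (9)-(10): multi-head
    self-attention and a feed-forward projection, each wrapped with dropout,
    a residual connection and layer normalisation."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float):
        super().__init__()
        self.mha = layers.MultiHeadAttention(num_heads=num_heads,
                                             key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            layers.Dense(d_ff, activation="relu"),  # phi(.): non-linear projection (assumed form)
            layers.Dense(d_model),
        ])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.drop1 = layers.Dropout(dropout)
        self.drop2 = layers.Dropout(dropout)

    def call(self, x, training=False):
        # z = LayerNorm(x + Dropout(MHA(x)))  -- Equation (10); self-attention: Q = K = V = x.
        attn = self.mha(x, x)
        z = self.norm1(x + self.drop1(attn, training=training))
        # xi(x) = LayerNorm(z + Dropout(FFN(z)))  -- Equation (9).
        return self.norm2(z + self.drop2(self.ffn(z), training=training))
```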
Figure 3: Schematic of a single Encoder Layer (left), and the broader proposed architecture for the context-aware ranking model (right).

4.3 Model Architecture
Following the approach of [39], our context-aware ranking model closely resembles the Transformer's encoder with an additional linear projection on the input. Presented with the top retrieved currencies provided by an initial ranking algorithm, we pass the input through a fully connected layer of size d_fc. Next, we feed this through N stacked encoders before sending the results to another fully-connected layer that is used to compute scores. This entire pipeline can be compactly expressed as:

\tilde{f}(\mathbf{x}) = \phi_{output}\big( \boldsymbol{\xi}( \phi_{input}(\mathbf{x}) ) \big)    (12)

and is graphically represented on the right of Figure 3.
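A compact sketch of the scoring pipeline in Equation (12), reusing the EncoderLayer and imports from the previous sketch, is given below. The hyperparameter names (d_fc, number of encoder layers) mirror Appendix B, but the composition is our own illustration under those assumptions.

```python
class ContextAwareRanker(tf.keras.Model):
    """f~(x) = phi_output( xi( phi_input(x) ) )  -- Equation (12)."""

    def __init__(self, d_fc: int, num_layers: int, num_heads: int,
                 d_ff: int, dropout: float):
        super().__init__()
        self.phi_input = layers.Dense(d_fc)       # shared fully connected input layer
        self.encoders = [EncoderLayer(d_fc, num_heads, d_ff, dropout)
                         for _ in range(num_layers)]
        self.phi_output = layers.Dense(1)         # one ranking score per item

    def call(self, x, training=False):
        # x: (batch, m, num_features) -- the m top (or bottom) retrieved currencies.
        h = self.phi_input(x)
        # Positional encodings from the initial ranking would be added to h here.
        for encoder in self.encoders:
            h = encoder(h, training=training)
        return tf.squeeze(self.phi_output(h), axis=-1)   # (batch, m) scores
```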

5 PERFORMANCE EVALUATION
5.1 Dataset Overview
We use daily data obtained from the Bank for International Settlements (BIS) [6]. From this, daily portfolios are constructed using the same set of 31 currencies as in [7]. Our study spans 2000 to 2020, with the set of currencies all expressed versus the USD. For conditioning our daily data on normal and risk-off market states, we make use of the daily close of VIX historical data from Cboe Global Markets [14].

5.2 Predictor Description
We use a simple combination of returns-based features for our predictors:
(1) 1-month raw returns – Based on the best model in [32], which involves scoring instruments based on returns calculated over the previous one month.
(2) Normalised returns – Returns over the past 1, 3, 5, 10 and 21-day periods, standardised by daily volatility and then scaled to the appropriate time scale (a sketch is given after this list).
(3) MACD-based indicators – The final momentum trading indicator along with its constituent raw signals as defined in [7] and [40].
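As an illustration of item (2), a k-day normalised return can be computed as the k-day return divided by daily volatility scaled to that horizon. The sketch below uses pandas; the exponentially weighted volatility estimator and the sqrt(k) scaling are our assumptions about the exact construction.

```python
import numpy as np
import pandas as pd

def normalised_returns(prices: pd.DataFrame,
                       windows=(1, 3, 5, 10, 21),
                       span: int = 63) -> dict:
    """Past k-day returns standardised by daily volatility, scaled by sqrt(k)."""
    daily_returns = prices.pct_change()
    daily_vol = daily_returns.ewm(span=span).std()
    features = {}
    for k in windows:
        k_day_return = prices.pct_change(k)
        # Divide by daily volatility scaled to the k-day horizon.
        features[f"norm_ret_{k}d"] = k_day_return / (daily_vol * np.sqrt(k))
    return features
```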


5.3 Loss Functions
For the loss function ℓ used to train the model, we use a pairwise (Pairwise Logistic) loss and a listwise (ListNet) loss. Both losses are commonly used and are described below. In the equations that follow, x refers to a list of inputs, f(x) represents the score vector obtained from the ranking function f and, lastly, y denotes the vector of ground-truth labels.

5.3.1 Pairwise Logistic Loss. Proposed by [4], this loss is defined as:

\ell(\mathbf{y}, f(\mathbf{x})) = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{I}(y_i > y_j)\, \log\big( 1 + \exp( f(\mathbf{x})_j - f(\mathbf{x})_i ) \big)    (13)

where I(·) is the indicator function.

5.3.2 ListNet Loss. The ListNet loss [13] is a listwise loss and is given as:

\ell(\mathbf{y}, f(\mathbf{x})) = - \sum_{i} \mathrm{softmax}(\mathbf{y})_i \times \log\big( \mathrm{softmax}(f(\mathbf{x}))_i \big)    (14)
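Both losses are straightforward to express on a score vector. The sketch below is a direct NumPy transcription of Equations (13)–(14) for a single list; batching and framework-specific numerical safeguards are omitted.

```python
import numpy as np

def pairwise_logistic_loss(y: np.ndarray, scores: np.ndarray) -> float:
    """Equation (13): sum over ordered pairs (i, j) with y_i > y_j."""
    diff = scores[None, :] - scores[:, None]            # diff[i, j] = f(x)_j - f(x)_i
    indicator = (y[:, None] > y[None, :]).astype(float)  # I(y_i > y_j)
    return float(np.sum(indicator * np.log1p(np.exp(diff))))

def listnet_loss(y: np.ndarray, scores: np.ndarray) -> float:
    """Equation (14): cross-entropy between top-one probabilities of labels and scores."""
    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()
    return float(-np.sum(softmax(y) * np.log(softmax(scores))))
```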
We next list and discuss both the baselines and the proposed models used in our study, along with the metrics used to evaluate performance.

5.4 Models and Comparison Metrics
We first list the benchmark models (with their corresponding shorthand in parentheses) used in this paper. The following 3 models are collectively referred to as conventional benchmarks:
(1) Random (Rand) – Buys/sells at random. Included to give some absolute baseline sense of performance.
(2) Volatility Normalised MACD (Baz) – Heuristics-based ranker with a sophisticated trend estimator proposed by [7].
(3) Multi-Layer Perceptron (MLP) – Characterises the typical regress-then-rank model used by contemporary strategies.
The next 4 models are grouped as LTR benchmarks:
(4) Pairwise Logistic Regression (PW) – Neural network trained using the Pairwise Logistic Loss of [4].
(5) ListMLE (ML) – Listwise neural LTR model by [51].
(6) ListNet (LN) – Listwise neural LTR model by [13].
(7) LambdaMART (LM) – Pairwise tree-based ranker by [11].
With all benchmarks, we follow [7] – rebalancing daily and forming equally weighted long/short decile portfolios with six currencies (i.e. using the 3 most extreme pairs on each side).

For the context-aware rankers, we have the following:
• Context-aware PW models: PW+P, PW+L – each is based on the baseline PW model that is re-ranked with the Transformer and calibrated with the pairwise and listwise loss respectively.
• Context-aware ML models: ML+P, ML+L.
• Context-aware LN models: LN+P, LN+L.
• Context-aware LM models: LM+P, LM+L.
Finally, we re-rank the top 10 retrieved instruments (i.e., m = 10) with a context-aware model, and use the resulting top 3 items to construct our long portfolio. For shorts, we re-rank the bottom 10 and select the bottom 3.

The financial and ranking performance of the various models are evaluated using the following metrics:
• Profitability: Expected returns (E[Returns]) and hit rate (percentage of positive returns at the portfolio level obtained over the out-of-sample period).
• Risks: Volatility, Maximum Drawdown (MDD) and Downside Deviation.
• Financial Performance: Sharpe (E[Returns]/Volatility), Sortino (E[Returns]/Downside Deviation) and Calmar (E[Returns]/MDD) ratios are used to gauge risk-adjusted performance. We also include the average profit divided by the average loss (Avg. Profits/Avg. Loss).
• Ranking Performance: NDCG@3. This is based on the Normalised Discounted Cumulative Gain (NDCG) [24], which is a graded relevance measure. However, rather than assessing NDCG across the entire cross-section of instruments, we focus on the top/bottom 3 assets, which is directly linked to strategy performance (a short sketch follows this list).
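For reference, NDCG@k compares the discounted gain of the predicted ordering against that of the ideal ordering. The sketch below uses the common exponential gain; the paper does not spell out its gain function, so that choice is an assumption.

```python
import numpy as np

def ndcg_at_k(y_true: np.ndarray, scores: np.ndarray, k: int = 3) -> float:
    """NDCG@k for one rebalance date.

    y_true: graded relevance labels (e.g. decile labels) for each instrument.
    scores: the model's predicted ranking scores for the same instruments.
    """
    def dcg(order):
        gains = 2.0 ** y_true[order[:k]] - 1.0
        discounts = np.log2(np.arange(2, k + 2))
        return float(np.sum(gains / discounts))

    predicted_order = np.argsort(-scores)
    ideal_order = np.argsort(-y_true)
    ideal = dcg(ideal_order)
    return dcg(predicted_order) / ideal if ideal > 0 else 0.0
```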
5.5 Backtest and Training Details
All LTR models (baseline and context-aware) and the MLP were tuned in blocks of 5-year intervals, with the calibrated weights and hyperparameters fixed and then used for portfolio rebalancing in the following out-of-sample 5-year window. For a given training set, 90% of the data was used for training and the remainder was reserved for validation. Backpropagation was conducted for a maximum of 100 epochs. For the baseline LTR models and the MLP, we used 2 hidden layers with the layer width treated as a tunable hyperparameter. Early stopping was applied to mitigate overfitting, and was triggered when a model's loss on the validation set did not improve for 25 consecutive epochs. Across all models, hyperparameters were tuned using a 50-iteration search with HyperOpt [9]. Further details on calibrating the hyperparameters can be found in Section B of the Appendix.
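The walk-forward scheme described above (tune on one 5-year block with a 90/10 train/validation split, then trade the following 5-year window out of sample) can be sketched as follows; the exact block boundaries and helper names are illustrative.

```python
import pandas as pd

def five_year_blocks(dates: pd.DatetimeIndex, start_year: int = 2000, end_year: int = 2020):
    """Yield (train, validation, test) date index slices for each walk-forward step."""
    for block_start in range(start_year, end_year, 5):
        train_val = dates[(dates.year >= block_start) & (dates.year < block_start + 5)]
        # Out-of-sample window: the next 5 years (clipped to the available history).
        test = dates[(dates.year >= block_start + 5) & (dates.year < block_start + 10)]
        if len(test) == 0:
            continue
        split = int(len(train_val) * 0.9)   # 90% training, 10% validation
        yield train_val[:split], train_val[split:], test
```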
5.6 Results and Discussion
Table 1 consolidates performance figures for both the conventional and LTR benchmarks, while Table 2 compares the performance of the same LTR models against their context-aware counterparts that are produced by re-ranking under the pairwise and listwise losses described in Section 5.3∥. For the results contained in both tables, we perform an additional layer of volatility scaling at the portfolio level – facilitating comparison and aligning returns with our 15% volatility target. All returns are computed in the absence of transaction costs to focus on the models' raw predictive ability.

With the set of baselines in Table 1, we see that the group of models employing LTR perform just as well if not better than the other benchmarks over various measures of performance. Among the set of LTR benchmarks, LambdaMART dominates all other rankers across most measures of performance – echoing [42], who document its superiority over neural rankers on three public LTR data sets. We also note that the MLP only marginally outperforms Baz despite its relatively sophisticated construction involving neural networks – likely a consequence of over-fitting and the sub-optimality of the regress-then-rank approach [40].

From the results in Table 2, we observe that the set of context-aware models possess Sharpe ratios that are larger by approximately 30% on average; on other risk and performance-based measures, they are generally comparable if not better. In Figure 4, it is clear that these models also have higher cumulative returns. Taken together, it is clear that our proposed set of context-aware rankers are superior to the benchmarks.

∥ For all tables in this section, bold figures represent the best statistic among models within the same group for the respective performance measure. Doubly asterisked figures are the best statistics across all models for the respective performance measure.

5.6.1 Ranking and Strategy Performances over Different Market States. To gauge ranking accuracy across models, we use the Normalised Discounted Cumulative Gain (NDCG) [24], which is a measure of graded relevance and is commonly used in the IR literature. In particular, we focus on the aggregated NDCG@3 computed over the top/bottom 3 currency pairs, which assesses the models' accuracy in forming the traded portfolios. From Table 3, which averages this metric for each benchmark over all rebalancing periods, we see that the (globally learned) LTR baselines typically possess values that are comparable to if not higher than the conventional models – empirically supporting the point that CSM strategies developed using LTR can sort more accurately on average.

To investigate the extent to which context-awareness is able to improve ranking accuracy over different trading conditions, we condition strategy performance on (i) normal and (ii) risk-off market states. We define a risk-off state similarly to [17], as an instance when the VIX is 5 percentage points∗∗ higher than its rolling 60-day moving average; normal states are then simply states which are not risk-off states. With this definition, approximately 4% of our data is categorised as risk-off. Table 4 indicates that the context-aware ranking models are able to more precisely select instruments over both normal and risk-off environments, with larger differences observed for the latter (e.g. between LN and LN+L). This superiority is mirrored in Table 5: the context-aware strategies trade at higher Sharpe ratios, with the largest improvements coinciding with the widest disparities exhibited in Table 4. Importantly, these statistics establish that incorporating context-awareness into LTR models has a pronounced effect on ranking accuracy – not just on average, but over normal and risk-off market states. This in turn leads to strategies with superior Sharpe ratios.

∗∗ [17] use 10% as their threshold.
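The market-state classification used here reduces to a simple rule on the VIX series. A minimal pandas sketch is given below; reading "5 percentage points" as 5 VIX points above the 60-day moving average is our interpretation of the text.

```python
import pandas as pd

def risk_off_states(vix_close: pd.Series, threshold: float = 5.0, window: int = 60) -> pd.Series:
    """Boolean series: True when the VIX closes `threshold` points above its
    rolling 60-day moving average; all remaining days are 'normal' states."""
    moving_average = vix_close.rolling(window).mean()
    return vix_close > (moving_average + threshold)
```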
6 CONCLUSION
Cross-sectional momentum (CSM) strategies developed using Learning to Rank (LTR) algorithms are able to outperform contemporary methods for currency momentum by producing asset rankings that are more accurate on average. However, the underlying ranking algorithms at the heart of these models are typically globally learned and ignore the differences between feature distributions over the different periods when the portfolio is rebalanced. This flaw renders the broader strategy susceptible to producing inaccurate rankings, potentially at critical points in time when accuracy is actually needed the most. For example, this can happen during risk-off episodes – leading to large, unwanted portfolio drawdowns. We rectify this shortcoming with a context-aware LTR model that is based on the Transformer architecture. The model encodes features of the top/bottom retrieved assets provided by a previous ranking algorithm, learns the local ranking context, and uses this to refine the original list.

Backtesting on a slate of 31 currencies, our proposed methodology increases the Sharpe ratio by around 30% and significantly enhances various performance metrics. When we condition on the trading environment, our approach similarly improves the Sharpe ratio in both normal and risk-off market states. New directions for future work include conducting a comprehensive performance study of context-aware LTR models across different data sets (e.g. different asset classes, higher-frequency LOB data), as well as further innovating on the underlying Transformer-based architecture for (re-)ranking.
Table 1: Performance metrics for all benchmark models, rescaled to the 15% annual target volatility. The set of LTR baselines perform just as well if not better than the conventional baselines.

Conventional LTR
Rand Baz MLP PW ML LN LM
E[returns] -0.010 0.086 0.091 0.098 0.096 0.098 **0.129
Volatility **0.155 0.160 0.158 0.157 0.164 0.162 0.157
Sharpe Ratio -0.064 0.539 0.572 0.624 0.586 0.602 **0.822
Downside Dev. 0.108 0.106 0.108 0.110 **0.106 0.108 0.110
Max Drawdown 0.365 0.365 0.543 **0.240 0.538 0.326 0.264
Sortino Ratio -0.092 0.814 0.843 0.890 0.907 0.905 **1.179
Calmar Ratio -0.027 0.236 0.167 0.407 0.179 0.299 **0.490
Hit-rate 0.502 0.520 0.530 0.523 0.510 0.515 **0.531
AP/AL 0.980 1.014 0.979 1.016 **1.068 1.047 1.018

Table 2: Performance metrics for all LTR models (original and context-aware), rescaled to the 15% annual target volatility.
Context-aware LTR models outperform across all measures.

PW-based models ML-based models LN-based models LM-based models


PW PW+P PW+L ML ML+P ML+L LN LN+P LN+L LM LM+P LM+L
E[returns] 0.098 0.141 0.117 0.096 0.116 0.120 0.098 0.157 0.133 0.129 0.148 **0.157
Volatility 0.157 0.156 0.157 0.164 0.158 0.157 0.162 0.165 0.165 0.157 0.157 **0.156
Sharpe Ratio 0.624 0.905 0.746 0.586 0.729 0.764 0.602 0.948 0.806 0.822 0.941 **1.008
Downside Dev. 0.110 0.107 0.108 0.106 0.110 0.107 0.108 0.109 0.110 0.110 0.109 **0.106
Max Drawdown 0.240 **0.234 0.287 0.538 0.533 0.540 0.326 0.285 0.239 0.264 0.236 0.279
Sortino Ratio 0.890 1.315 1.086 0.907 1.050 1.119 0.905 1.431 1.210 1.179 1.357 **1.479
Calmar Ratio 0.407 0.604 0.410 0.179 0.217 0.223 0.299 0.549 0.559 0.490 **0.626 0.564
Hit-rate 0.523 0.532 0.518 0.510 0.519 0.523 0.515 0.531 0.514 0.531 **0.535 0.535
AP/AL 1.016 1.025 1.056 1.068 1.050 1.038 1.047 1.047 **1.093 1.018 1.022 1.032

Figure 4: Cumulative returns for both baseline LTR and context-aware LTR models, rescaled to target volatility. Models are
grouped by their respective baseline ranking models. For example, PW and both context-aware PW models re-ranked with
the pairwise and listwise losses (PW+P and PW+L respectively) are grouped together. The plots collectively indicate that in-
corporating context-awareness improves the cumulative wealth of all original LTR strategies.

Table 3: Long/Short ranking performance for conventional and LTR baselines. Focusing on the average NDCG@3 across all
rebalances, the LTR models are generally as accurate if not more accurate than the conventional benchmarks.

Conventional LTR
Rand Baz MLP PW ML LN LM
NDCG@3 0.553 0.556 0.556 0.559 0.556 0.555 **0.560

Table 4: Long/Short ranking performance for baseline LTR and context-aware LTR models conditioning on market environ-
ments. Models are grouped by their respective baseline ranking models. Across both normal and risk-off market states, the
context-aware ranking models are able to more accurately select the winners and losers, as measured by the NDCG@3 aver-
aged over each state. This difference in ability is greater in risk-off states.

PW-based models ML-based models LN-based models LM-based models


PW PW+P PW+L ML ML+P ML+L LN LN+P LN+L LM LM+P LM+L
Normal 0.558 0.560 0.558 0.556 0.557 0.558 0.554 0.557 0.555 0.560 **0.560 0.560
Risk-Off 0.564 0.572 0.574 0.549 0.566 0.555 0.560 0.569 **0.580 0.550 0.556 0.551

Table 5: Sharpe ratios for LTR and context-aware LTR models over normal and risk-off market states. Context-aware ranking
models typically possess higher Sharpe ratios over both market environments.

PW-based models ML-based models LN-based models LM-based models


PW PW+P PW+L ML ML+P ML+L LN LN+P LN+L LM LM+P LM+L
Normal 0.705 0.872 0.724 0.691 0.775 0.870 0.573 0.936 0.735 0.913 1.041 **1.065
Risk-off -0.652 1.495 1.137 -1.174 0.109 -1.049 1.151 1.200 **2.126 -0.706 -0.720 0.003

ACKNOWLEDGMENTS
We would like to thank the Oxford-Man Institute of Quantitative Finance for financial and computing support, and SR thanks the UK Royal Academy of Engineering.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. https://www.tensorflow.org/
[2] Qingyao Ai, Keping Bi, Jiafeng Guo, and W. Bruce Croft. 2018. Learning a Deep Listwise Context Model for Ranking Refinement. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (June 2018), 135–144. https://doi.org/10.1145/3209978.3209985 arXiv:1804.05936
[3] Qingyao Ai, Xuanhui Wang, Sebastian Bruch, Nadav Golbandi, Michael Bendersky, and Marc Najork. 2019. Learning Groupwise Multivariate Scoring Functions Using Deep Neural Networks. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR '19). Association for Computing Machinery, New York, NY, USA, 85–92. https://doi.org/10.1145/3341981.3344218
[4] J. P. Arias-Nicolás, C. J. Pérez, and J. Martín. 2008. A Logistic Regression-Based Pairwise Comparison Method to Aggregate Preferences. Group Decision and Negotiation 17, 3 (May 2008), 237–247. https://doi.org/10.1007/s10726-007-9071-0
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473. http://arxiv.org/abs/1409.0473 (Accepted at ICLR 2015 as oral presentation.)
[6] Bank for International Settlements (BIS). 2021. US Dollar Exchange Rates. Calculated (or derived) based on BIS data. https://www.bis.org/statistics/xrusd.htm
[7] Jamil Baz, Nicolas M. Granger, Campbell R. Harvey, Nicolas Le Roux, and Sandy Rattray. 2015. Dissecting Investment Strategies in the Cross Section and Time Series. SSRN Electronic Journal (2015). https://doi.org/10.2139/ssrn.2695101
[8] Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Huai-hsin Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. 2018. Seq2Slate: Re-Ranking and Slate Optimization with RNNs. CoRR abs/1810.02019 (2018). arXiv:1810.02019
[9] James Bergstra, Brent Komer, Chris Eliasmith, Dan Yamins, and David D. Cox. 2015. Hyperopt: A Python Library for Model Selection and Hyperparameter Optimization. Computational Science & Discovery 8, 1 (July 2015), 014008. https://doi.org/10.1088/1749-4699/8/1/014008
[10] Tim Bollerslev. 1986. Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31, 3 (April 1986), 307–327. https://ideas.repec.org/a/eee/econom/v31y1986i3p307-327.html
[11] Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report MSR-TR-2010-82. Microsoft Research.
[12] Craig Burnside, Martin Eichenbaum, and Sergio Rebelo. 2011. Carry Trade and Momentum in Currency Markets. Technical Report w16942. National Bureau of Economic Research, Cambridge, MA. https://doi.org/10.3386/w16942
[13] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to Rank: From Pairwise Approach to Listwise Approach. In Proceedings of the 24th International Conference on Machine Learning (ICML '07). ACM Press, Corvalis, Oregon, 129–136. https://doi.org/10.1145/1273496.1273513
[14] Cboe Global Markets. 2021. VIX Historical Price Data. Calculated (or derived) based on Cboe data. https://www.cboe.com/tradable_products/vix/vix_historical_data/
[15] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
[16] Andy C. W. Chui, Sheridan Titman, and K. C. John Wei. 2010. Individualism and Momentum around the World. The Journal of Finance 65, 1 (2010), 361–392. https://doi.org/10.1111/j.1540-6261.2009.01532.x
[17] Reinout De Bock and Irineu de Carvalho Filho. 2015. The Behavior of Currencies during Risk-off Episodes. Journal of International Money and Finance 53 (May 2015), 218–234. https://doi.org/10.1016/j.jimonfin.2014.12.009
[18] Wilma de Groot, Juan Pang, and Laurens Swinkels. 2012. The Cross-Section of Stock Returns in Frontier Emerging Markets. Journal of Empirical Finance 19, 5 (2012), 796–818. https://doi.org/10.1016/j.jempfin.2012.08.007
[19] Claude B. Erb and Campbell R. Harvey. 2006. The Strategic and Tactical Value of Commodity Futures. Financial Analysts Journal 62, 2 (March 2006), 69–97. https://doi.org/10.2469/faj.v62.n2.4084
[20] John M. Griffin, Xiuqing Ji, and J. Spencer Martin. 2003. Momentum Investing and Business Cycle Risk: Evidence from Pole to Pole. The Journal of Finance 58, 6 (2003), 2515–2547. https://doi.org/10.1046/j.1540-6261.2003.00614.x
[21] Shihao Gu, Bryan Kelly, and Dacheng Xiu. 2018. Empirical Asset Pricing via Machine Learning. Technical Report w25398. National Bureau of Economic Research, Cambridge, MA. https://doi.org/10.3386/w25398
[22] Shihao Gu, Bryan T. Kelly, and Dacheng Xiu. 2019. Autoencoder Asset Pricing Models. SSRN Electronic Journal (2019). https://doi.org/10.2139/ssrn.3335536
[23] Saratchandra Indrakanti, Svetlana Strunjas, Shubhangi Tandon, and Manojkumar Rangasamy Kannadasan. 2019. Influence of Neighborhood on the Preference of an Item in eCommerce Search. arXiv:1908.03825 [cs] (Oct. 2019).
[24] Kalervo Järvelin and Jaana Kekäläinen. 2000. IR Evaluation Methods for Retrieving Highly Relevant Documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00). ACM Press, Athens, Greece, 41–48. https://doi.org/10.1145/345508.345545
[25] Narasimhan Jegadeesh and Sheridan Titman. 1993. Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency. The Journal of Finance 48, 1 (March 1993), 65–91. https://doi.org/10.1111/j.1540-6261.1993.tb04702.x
[26] Saejoon Kim. 2019. Enhancing the Momentum Strategy through Deep Regression. Quantitative Finance 19, 7 (July 2019), 1121–1133. https://doi.org/10.1080/14697688.2018.1563707
[27] Victor Lavrenko and W. Bruce Croft. 2001. Relevance Based Language Models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '01). Association for Computing Machinery, New York, NY, USA, 120–127. https://doi.org/10.1145/383952.383972
[28] Blake LeBaron. 1996. Technical Trading Rule Profitability and Foreign Exchange Intervention. Technical Report w5505. National Bureau of Economic Research, Cambridge, MA. https://doi.org/10.3386/w5505
[29] Hang Li. 2014. Learning to Rank for Information Retrieval and Natural Language Processing, Second Edition. Synthesis Lectures on Human Language Technologies 7, 3 (Oct. 2014), 1–121. https://doi.org/10.2200/S00607ED2V01Y201410HLT026
[30] Tie-Yan Liu. 2011. Learning to Rank for Information Retrieval. Springer Berlin Heidelberg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14267-3
[31] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 1412–1421. https://doi.org/10.18653/v1/D15-1166
[32] Lukas Menkhoff, Lucio Sarno, Maik Schmeling, and Andreas Schrimpf. 2012. Currency Momentum Strategies. Journal of Financial Economics 106, 3 (Dec. 2012), 660–684. https://doi.org/10.1016/j.jfineco.2012.06.009
[33] Lukas Menkhoff and Mark P. Taylor. 2007. The Obstinate Passion of Foreign Exchange Professionals: Technical Analysis. Journal of Economic Literature 45, 4 (Dec. 2007), 936–972. https://doi.org/10.1257/jel.45.4.936
[34] Tobias J. Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen. 2012. Time Series Momentum. Journal of Financial Economics 104, 2 (May 2012), 228–250. https://doi.org/10.1016/j.jfineco.2011.11.003
[35] John Okunev and Derek White. 2003. Do Momentum-Based Strategies Still Work in Foreign Currency Markets? The Journal of Financial and Quantitative Analysis 38, 2 (June 2003), 425. https://doi.org/10.2307/4126758
[36] Changhua Pei, Yi Zhang, Yongfeng Zhang, Fei Sun, Xiao Lin, Hanxiao Sun, Jian Wu, Peng Jiang, and Wenwu Ou. 2019. Personalized Re-Ranking for Recommendation. arXiv:1904.06813 [cs] (July 2019).
[37] Bruno L. Pereira, Alberto Ueda, Gustavo Penha, Rodrygo L. T. Santos, and Nivio Ziviani. 2019. Online Learning to Rank for Sequential Music Recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems. ACM, Copenhagen, Denmark, 237–245. https://doi.org/10.1145/3298689.3347019
[38] Craig Pirrong. 2005. Momentum in Futures Markets. SSRN Electronic Journal (2005). https://doi.org/10.2139/ssrn.671841
[39] Przemysław Pobrotyn, Tomasz Bartczak, Mikołaj Synowiec, Radosław Białobrzeski, and Jarosław Bojar. 2020. Context-Aware Learning to Rank with Self-Attention. In SIGIR eCom '20, Virtual Event, China. arXiv:2005.10084 [cs]
[40] Daniel Poh, Bryan Lim, Stefan Zohren, and Stephen Roberts. 2021. Building Cross-Sectional Systematic Strategies by Learning to Rank. The Journal of Financial Data Science (March 2021), 70–86. https://doi.org/10.3905/jfds.2021.1.060
[41] Jay M. Ponte and W. Bruce Croft. 1998. A Language Modeling Approach to Information Retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98). ACM Press, Melbourne, Australia, 275–281. https://doi.org/10.1145/290941.291008
[42] Zhen Qin, Le Yan, Honglei Zhuang, Yi Tay, Rama Kumar Pasumarthi, Xuanhui Wang, Mike Bendersky, and Marc Najork. 2021. Are Neural Rankers Still Outperformed by Gradient Boosted Decision Trees? In International Conference on Learning Representations (ICLR).
[43] S. E. Robertson and K. Sparck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Science 27, 3 (1976), 129–146. https://doi.org/10.1002/asi.4630270302
[44] S. E. Robertson and S. Walker. 1994. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In SIGIR '94, Bruce W. Croft and C. J. van Rijsbergen (Eds.). Springer London, London, 232–241. https://doi.org/10.1007/978-1-4471-2099-5_24
[45] K. Geert Rouwenhorst. 1998. International Momentum Strategies. The Journal of Finance 53, 1 (1998), 267–284. https://doi.org/10.1111/0022-1082.95722
[46] Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. 2017. On Application of Learning to Rank for E-Commerce Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (Aug. 2017), 475–484. https://doi.org/10.1145/3077136.3080838 arXiv:1903.04263
[47] Kihyuk Sohn, Xinchen Yan, and Honglak Lee. 2015. Learning Structured Output Representation Using Deep Conditional Generative Models. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'15). MIT Press, Cambridge, MA, USA, 3483–3491.
[48] Qiang Song, Anqi Liu, and Steve Y. Yang. 2017. Stock Portfolio Selection Using Learning-to-Rank Algorithms with News Sentiment. Neurocomputing 264 (Nov. 2017), 20–28. https://doi.org/10.1016/j.neucom.2017.02.097
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
[50] Liang Wang and Khaled Rasheed. 2018. Stock Ranking with Market Microstructure, Technical Indicator and News. In Proceedings of the International Conference on Artificial Intelligence (ICAI). The Steering Committee of The World Congress in Computer Science, Computer ..., 322–328.
[51] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise Approach to Learning to Rank: Theory and Algorithm. In Proceedings of the 25th International Conference on Machine Learning (ICML '08). Association for Computing Machinery, New York, NY, USA, 1192–1199. https://doi.org/10.1145/1390156.1390306
[52] Dawei Yin, Yuening Hu, Jiliang Tang, Tim Daly, Mianwei Zhou, Hua Ouyang, Jianhui Chen, Changsung Kang, Hongbo Deng, Chikashi Nobata, Jean-Marc Langlois, and Yi Chang. 2016. Ranking Relevance in Yahoo Search. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, San Francisco, California, USA, 323–332. https://doi.org/10.1145/2939672.2939677
[53] Chengxiang Zhai and John Lafferty. 2001. Model-Based Feedback in the Language Modeling Approach to Information Retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM '01). Association for Computing Machinery, New York, NY, USA, 403–410. https://doi.org/10.1145/502585.502654
[54] Tao Zhuang, Wenwu Ou, and Zhirong Wang. 2018. Globally Optimized Mutual Influence Aware Ranking in E-Commerce Search. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18). AAAI Press, 3725–3731.

A CURRENCY DATASET DETAILS
Following [7], we work with 31 currencies all expressed versus the USD over the period 2-May-2000 to 31-Dec-2020. The full currency list is as follows:
• G10: AUD, CAD, CHF, EUR, GBP, JPY, NOK, NZD, SEK, USD
• EM Asia: HKD, INR, IDR, KRW, MYR, PHP, SGD, TWD, THB
• EM Latam: BRL, CLP, COP, PEN, MXN
• CEEMEA: CZK, HUF, ILS, PLN, RUB, TRY
• Africa: ZAR

In order to reduce the impact of outliers, we winsorise the data by capping and flooring it to be within 5 times its exponentially weighted moving (EWM) standard deviation from its EWM average, which is calculated using a 252-day span.
B ADDITIONAL TRAINING DETAILS
Python Libraries: LambdaMART uses XGBoost [15]. The MLP, the Pairwise Logistic Regression model, ListMLE, ListNet, as well as the Transformer underlying all context-aware ranking models (e.g. PW+P, PW+L), are developed using TensorFlow [1].

Hyperparameter Optimisation: Hyperparameters are tuned using HyperOpt [9]. For LambdaMART, we refer to the hyperparameters as they are named in the XGBoost library. A sketch of the search setup is given at the end of this appendix.

Multi-layer Perceptron (MLP):
• Dropout Rate – [0.0, 0.2, 0.4, 0.6, 0.8]
• Hidden Width – [16, 32, 64, 128, 256]
• Learning Rate – [10^-7, 10^-6, 10^-5, 10^-4, 10^-3]

ListNet:
• Dropout Rate – [0.0, 0.2, 0.4, 0.6, 0.8]
• Hidden Width – [8, 16, 32, 64, 128]
• Learning Rate – [10^-5, 10^-4, 10^-3, 10^-2, 10^-1]

ListMLE, Pairwise Logistic Regression:
• Dropout Rate – [0.0, 0.2, 0.4, 0.6, 0.8]
• Hidden Width – [8, 16, 32, 64, 128]
• Learning Rate – [10^-7, 10^-6, 10^-5, 10^-4, 10^-3]

LambdaMART:
• 'objective' – 'rank:pairwise'
• 'eval_metric' – 'ndcg'
• 'eta' – [10^-5, 10^-4, 10^-3, 10^-2, 10^-1]
• 'num_boost_round' – [5, 10, 20, 40, 80, 160, 320]
• 'max_depth' – [2, 4, 6, 8, 10]
• 'reg_alpha' – [10^-5, 10^-4, 10^-3, 10^-2, 10^-1]
• 'reg_lambda' – [10^-5, 10^-4, 10^-3, 10^-2, 10^-1]
• 'tree_method' – 'gpu_hist'

Transformer for Context-aware LTR:
• d_fc – [16, 32, 64, 128, 256]
• d_ff – [16, 32, 64, 128, 256]
• Dropout Rate – [0.0, 0.2, 0.4, 0.6, 0.8]
• Learning Rate – [10^-5, 10^-4, 10^-3, 10^-2, 10^-1]
• No. of Encoder Layers – [1, 2, 3, 4]
• No. of Attention Heads – 1
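For illustration, a 50-iteration HyperOpt search over one of these grids can be set up as below. This is a sketch: the objective, which would train a model under the given configuration and return its validation loss, is only stubbed out, and `train_and_validate` is a hypothetical helper rather than part of the paper's code.

```python
from hyperopt import Trials, fmin, hp, tpe

# Search space for the context-aware Transformer, mirroring the grid above.
space = {
    "d_fc": hp.choice("d_fc", [16, 32, 64, 128, 256]),
    "d_ff": hp.choice("d_ff", [16, 32, 64, 128, 256]),
    "dropout": hp.choice("dropout", [0.0, 0.2, 0.4, 0.6, 0.8]),
    "learning_rate": hp.choice("learning_rate", [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]),
    "num_layers": hp.choice("num_layers", [1, 2, 3, 4]),
}

def objective(params):
    # Placeholder: build the model with `params`, train with early stopping,
    # and return the validation loss to be minimised.
    return train_and_validate(params)  # hypothetical helper

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=Trials())
```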
