
A Note on the M6 Forecasting Competition:

Designing Parametric Models with Hypernetworks*


Filip Staněk†
CERGE-EI
February 12, 2023

Abstract
This short note outlines the general approach used for the forecasting part of the M6 forecasting competition. It describes a meta-learning approach that is based on an encoder-decoder hypernetwork, capable of identifying the most appropriate parametric model for a given family of related prediction tasks. In addition to its application in the M6 forecasting competition, we also evaluate it on the sinusoidal regression problem. There, the proposed method outperforms established methods by an order of magnitude, achieving near-oracle level performance.

Keywords: M6 Forecasting Competition, Meta-learning, Hypernetworks


JEL classification codes: C14, C51, C53

* This research project is ongoing, and this text is only a rough preliminary draft intended
to spark discussion and elicit feedback. With that in mind, thank you for taking the time to
read it.
† E-mail: filip.stanek@cerge-ei.cz.
CERGE-EI, a joint workplace of Charles University and the Economics Institute of the
Czech Academy of Sciences, Politickych veznu 7, 111 21 Prague, Czech Republic.



1 Introduction
The method proposed in this note is not limited to time series forecasting,
and even less so to financial forecasting. Because of that, we begin by briefly
outlining the general motivation for the method and its underlying principles.
Only then do we delve into specific applications, demonstrating its potential on
the sinusoidal regression problem and the M6 forecasting competition.
We study the problem of finding the most suitable parametric model for a given family of related but not necessarily identical prediction tasks. Typically, this is a complex endeavor that requires a unique combination of domain knowledge and statistical expertise. The model should have enough parameters to accurately capture the variation between tasks so that it can adapt to the specifics of a particular task. However, it is desirable to have as few parameters as possible to prevent estimation noise from reducing the quality of predictions for a typical number of observations per task. Using an overly flexible or even task-agnostic non-/semi-parametric model is generally suboptimal because there are limited observations per task and many similarities in the DGPs across tasks.
In this note, we show that the search for the most suitable parametric model can be, to some extent, automated. By connecting an encoder-decoder network that accepts a task identifier to the parameters of another network that processes inputs to generate predictions for that task, we can perform a simultaneous search over the space of parametric functions and their parameter values. Importantly, the resulting hypernetwork allows for full backpropagation and does not rely on the computation of higher-order derivatives for its training, unlike many alternative meta-learning approaches. This way, even with very limited computational resources, one can design a parametric model specifically for a given family of tasks/DGPs so that the degrees of freedom allotted per task are used to best capture the variability between tasks. The method outlined in this note falls under the field of meta-learning. For an excellent review of the field, please refer to Hospedales et al. (2021) or Huisman et al. (2021).
The rest of the text is structured as follows. Section 2 introduces the notation, related literature, and the method itself. Section 3 showcases three applications: the commonly used sinusoidal regression benchmark, the M4 forecasting competition, and the M6 forecasting competition. Section 4 contains some concluding remarks and discusses ongoing research.

2 Method
2.1 Notation
Following the notation of Hospedales et al. (2021), a task T = {D, L} consists of data D = {(xi, yi)}_{i=1}^N and a loss function L. The loss function L(D; θ, ω) measures the prediction performance on dataset D, given a vector of task-specific parameters θ and a vector of meta parameters ω, which are shared across tasks. Throughout the text, we focus on the canonical supervised learning problem,


where the loss function L has the same form for all tasks:
\[
L(D; \hat{\theta}, \omega) = \frac{1}{N} \sum_{i=1}^{N} \gamma\big(y_i, f_{\omega}(x_i; \hat{\theta})\big), \tag{1}
\]
\[
\hat{\theta} = \kappa_{\omega}(D, L) \approx \arg\min_{\theta} L(D; \theta, \omega), \tag{2}
\]

where fω is the forecasting function, γ is a contrast function measuring the quality of the prediction ŷi = fω(xi; θ̂) relative to yi, and κω, parametrized by ω, approximates the optimization routine for the most suitable task-specific parameters θ. Oftentimes, the information contained in ω regarding which forecasting function fω to use and which optimization routine is most appropriate is determined through expert judgment, based on informal prior knowledge about the task and/or ad hoc hyperparameter tuning.
By considering a family of tasks distributed according to p(T), we can formalize the problem of finding the most suitable model for the given family. For this purpose, it is common to partition D into two separate datasets, designated for training and evaluation: D^train = {(xi, yi)}_{i=1}^K and D^val = {(xi, yi)}_{i=K+1}^N. The objective is to find a model such that, when observing D^train and adapting accordingly through θ, the expected loss on the yet unobserved D^val is minimized. Formally:

\[
\omega^{*} = \arg\min_{\omega} \; \mathbb{E}_{T \sim p(T)}\big[L(D^{val}; \hat{\theta}, \omega)\big] \tag{3}
\]
\[
\text{s.t.:}\quad \hat{\theta} = \kappa_{\omega}(D^{train}, L) \approx \arg\min_{\theta} L(D^{train}; \theta, \omega).
\]

Given a collection of M observed tasks {Tm}_{m=1}^M, the finite-sample equivalent of the problem is stated as follows:
\[
\hat{\omega} = \arg\min_{\omega} \frac{1}{M} \sum_{m=1}^{M} L(D_m^{val}; \hat{\theta}_m, \omega) \tag{4}
\]
\[
\text{s.t.:}\quad \hat{\theta}_m = \kappa_{\omega}(D_m^{train}, L_m) \approx \arg\min_{\theta_m} L(D_m^{train}; \theta_m, \omega).
\]

2.2 Related Meta-Learning Models


AREA UNDER CONSTRUCTION

2.3 MtMs Model


To proceed with introducing the model, we make the following three assumptions, which greatly simplify the problem:

A1: The optimization routine κω(·) is not directly dependent on ω:
\[
\hat{\theta}_m = \kappa(D_m^{train}, L_m) = \arg\min_{\theta_m} L(D_m^{train}; \theta_m, \omega). \tag{5}
\]



A2: The prediction function fω(·) depends on ω through its parametrization βm = g(θm; ω):
\[
f_{\omega}(x_{m,i}; \theta_m) = f\big(x_{m,i}; \underbrace{g(\theta_m; \omega)}_{\beta_m}\big). \tag{6}
\]

A3: The training is done using a train-train split, as is common in multi-task learning, rather than with the train-val split that is typical for meta-learning:
\[
\hat{\omega} = \arg\min_{\omega} \frac{1}{M} \sum_{m=1}^{M} L(D_m^{train}; \hat{\theta}_m, \omega)
\]
\[
\text{s.t.:}\quad \hat{\theta}_m = \kappa(D_m^{train}, L_m) = \arg\min_{\theta_m} L(D_m^{train}; \theta_m, \omega), \tag{7}
\]
with {L^val_m}_{m=1}^M being used for early stopping.

Assumption A1 is pragmatically motivated by the fact that our aim is to study optimal parametric models, not how to best initialize optimization routines. This is in sharp contrast to the most popular family of meta-learning approaches, which focus on optimization routines and in which ω represents the initialization θ0 and some additional information on how to best adapt from θ0 (see, for example, Finn et al. (2017), Li et al. (2017), and Park and Oliva (2019)).
A2 is a technical assumption that makes the dependence relation explicit.
Assumption A3, on the other hand, is probably the most controversial, as the training process in this case does not strictly correspond to the way the model will be deployed in practice, that is, to the situation of observing a completely new task T_{M+1} and being asked to adapt θ_{M+1} based on D^train_{M+1} in order to predict y in D^val_{M+1} while keeping ω fixed. This assumption, however, appears justifiable in light of recent studies demonstrating that, for meta-learning, the commonly adopted train-val split might not always be preferable to a simpler train-train split (Bai et al., 2021), and that meta-learning and multi-task learning problems are very closely connected (Wang et al., 2021).
Under assumptions A1-A3, the bilevel optimization problem simplifies to a straightforward optimization problem
\[
\left( \hat{\omega}, \{\hat{\theta}_m\}_{m=1}^{M} \right) = \arg\min_{\omega, \{\theta_m\}_{m=1}^{M}} \; \frac{1}{M} \sum_{m=1}^{M} \left( \frac{1}{K} \sum_{i=1}^{K} \gamma\big(y_{m,i}, f(x_{m,i}; g(\theta_m; \omega))\big) \right), \tag{8}
\]
where both ω̂ and {θ̂m}_{m=1}^M are optimized simultaneously. In effect, a simultaneous search over both the parametric functions fω(·; θm) = f(·; g(θm; ω)) and the corresponding parameters {θm}_{m=1}^M is performed.
To allow for maximal flexibility, we express both the function f(·; β) and β = g(·; ω) as feedforward neural networks. The total size of the network f(·; β), represented by dβ = card(β), controls the level of complexity with which the predicted values ŷi depend on the input xi. The size of the mesa parameters, dθ = card(θ), and the size of the network g(·; ω), represented by dω = card(ω), regulate the number of degrees of freedom allotted to each task and the nonlinearity of the model's response to these parameters, respectively. The network g does not necessarily have to be fully connected. To reduce computational complexity, it is possible to leave some output nodes as orphaned constants, which allows the mesa parameters θm to affect only a part of the base model f, such as only its last layers. Figure 1 shows a diagram of the whole model. For brevity, we refer to it simply as MtMs henceforth, to emphasize the simultaneous training of both the global meta parameters ω and the task-specific mesa parameters θm.

[Figure 1: A diagram illustrating MtMs. The mesa module maps the one-hot task indicator q to the mesa parameters θ = Θ q⊤; the meta module maps θ to the base-model parameters β = g(θ; ω); and the base model produces the prediction ŷ = f(x; β).]


We denote q ∈ {0, 1}M as the indicator of the task under consideration and
Θ = (θ1 , θ2 , ... , θM ).
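
To make the architecture concrete, below is a minimal PyTorch sketch of an MtMs-style model. It is a simplified illustration under the assumptions above, not the authors' implementation, and all names (MtMs, d_theta, hidden, forward_from_theta) are ours. The base model f is a one-hidden-layer MLP whose parameter vector β is produced by a linear meta module g from the mesa parameters θm of the task selected by q; the commented loop at the end indicates how the joint optimization of Eq. (8) could be run.

import torch
import torch.nn as nn

class MtMs(nn.Module):
    """Hypothetical sketch: mesa module -> meta module g -> base model f."""

    def __init__(self, n_tasks, d_theta, d_in, hidden, d_out):
        super().__init__()
        self.d_in, self.hidden, self.d_out = d_in, hidden, d_out
        d_beta = hidden * d_in + hidden + d_out * hidden + d_out   # card(beta)
        # Mesa module: Theta holds one column of mesa parameters per task,
        # so theta_m = Theta @ q_m^T for the one-hot task indicator q_m.
        self.Theta = nn.Parameter(0.01 * torch.randn(d_theta, n_tasks))
        # Meta module g(theta; omega): a single linear layer, so beta depends
        # linearly on theta (no hidden layers).
        self.g = nn.Linear(d_theta, d_beta)

    def forward_from_theta(self, x, theta):
        beta = self.g(theta)                      # base-model parameters
        # Unpack beta into the weights of the base model f(x; beta).
        n1 = self.hidden * self.d_in
        W1 = beta[:n1].view(self.hidden, self.d_in)
        b1 = beta[n1:n1 + self.hidden]
        n2 = n1 + self.hidden
        W2 = beta[n2:n2 + self.d_out * self.hidden].view(self.d_out, self.hidden)
        b2 = beta[n2 + self.d_out * self.hidden:]
        h = torch.relu(x @ W1.T + b1)
        return h @ W2.T + b2

    def forward(self, x, task_id):
        return self.forward_from_theta(x, self.Theta[:, task_id])

# Joint training of omega (the parameters of g) and all theta_m, as in Eq. (8):
# model = MtMs(n_tasks=M, d_theta=2, d_in=1, hidden=40, d_out=1)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# for x, y, task_id in loader:                  # each minibatch from a single task
#     opt.zero_grad()
#     loss = torch.mean((model(x, task_id) - y) ** 2)
#     loss.backward()
#     opt.step()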

Despite being trained under the multi-task learning paradigm, MtMs can be deployed for both multi-task and meta-learning problems. For multi-task problems, it can be used without further optimization, simply by providing more data from an already observed task. For meta-learning problems, it is sufficient to optimize over the mesa parameters θ to estimate the parameters of the new task:


\[
\hat{\theta}_{M+1} = \arg\min_{\theta_{M+1}} \frac{1}{K} \sum_{i=1}^{K} \gamma\big(y_{M+1,i}, f(x_{M+1,i}; g(\theta_{M+1}; \omega))\big), \tag{9}
\]
using either backpropagation or a conventional numerical method. Note that the entries of the low-dimensional vector θ_{M+1} are proper parameters. As extremum estimates, one can even perform inference on them (provided that regularity conditions are met). Furthermore, as demonstrated in subsequent sections, they often influence the prediction function f in a well-interpretable way, similar to the parameters of a model crafted manually by a human expert.
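
As a hedged illustration of this adaptation step, the snippet below continues the hypothetical sketch from Section 2.3: all trained MtMs parameters (and hence ω) are frozen, and only the low-dimensional θ_{M+1} is estimated from the new task's training data by backpropagation. The variables model, x_train, and y_train are assumed to come from that sketch and from the new task.

# Continues the earlier sketch; `model` is a trained MtMs instance and
# (x_train, y_train) is D^train_{M+1} of the new task.
for p in model.parameters():
    p.requires_grad_(False)                      # keep omega (and Theta) fixed

theta_new = torch.zeros(2, requires_grad=True)   # d_theta = 2 in this example
opt = torch.optim.Adam([theta_new], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    pred = model.forward_from_theta(x_train, theta_new)
    loss = torch.mean((pred - y_train) ** 2)     # gamma = squared error
    loss.backward()
    opt.step()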

3 Applications
3.1 Sinusoidal Task
To evaluate the potential of MtMs to find the most appropriate parametric model for a given family of prediction problems, we first consider a simulation exercise originally proposed by Finn et al. (2017) to test the performance of MAML. Since then, this environment has frequently been used to compare competing meta-learning methods.
In particular, the tasks Tm = {Dm, Lm} are generated according to the following DGP:
\[
\begin{aligned}
A_m &\sim U(0.1, 5), \\
b_m &\sim U(0, \pi), \\
x_{m,i} \mid A_m, b_m &\sim U(-5, 5), \\
y_{m,i} \mid x_{m,i}, A_m, b_m &= A_m \sin(x_{m,i} + b_m).
\end{aligned} \tag{10}
\]
The goal is to find the best model that can predict ym,i based on xm,i for i > K after observing only D^train_m, as measured by the mean squared error:
\[
L_m(D_m^{val}; \hat{\theta}_m, \omega) = \frac{1}{N-K} \sum_{i=K+1}^{N} \big(y_{m,i} - f_{\omega}(x_{m,i}; \hat{\theta}_m)\big)^2 \tag{11}
\]
\[
\text{s.t.:}\quad \hat{\theta}_m = \kappa_{\omega}(D_m^{train}, L_m).
\]

For a fair comparison, we follow Finn et al. (2017) and set the base model to be a feedforward neural network with two hidden layers of size 40 and ReLU non-linearities. The number of mesa parameters, dθ, is set to 2, and the meta module g(·; ω) is a simple fully connected feedforward network with no hidden layers or non-linearities. For training the MtMs, it is entirely sufficient to use only 1,000 distinct tasks. This is far fewer than the 70,000 tasks originally used in Finn et al. (2017) and in the follow-up studies. Likewise, the training is done with a fraction of the computational resources: it takes approximately half an hour on a consumer-grade mid-range CPU¹, which is in sharp contrast to the powerful GPUs used for training in other studies. Other simulation details follow Zhao et al. (2020) and are available in the replication repository².

¹ AMD Ryzen 7 4700U
² https://github.com/stanek-fi/MtMs_sinusoidal_task
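
For concreteness, a small sketch of how task data for this experiment can be generated according to Eq. (10) is given below; it mirrors the description above rather than the replication code, and the MtMs sketch from Section 2.3 would need a second hidden layer of size 40 to match the stated base model.

import numpy as np

rng = np.random.default_rng(0)
M, K = 1000, 5                           # number of tasks, shots per task
A = rng.uniform(0.1, 5.0, size=M)        # amplitudes A_m
b = rng.uniform(0.0, np.pi, size=M)      # phases b_m
x = rng.uniform(-5.0, 5.0, size=(M, K))  # K training inputs per task
y = A[:, None] * np.sin(x + b[:, None])  # y_{m,i} = A_m sin(x_{m,i} + b_m)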
Table 1 shows the mean squared error achieved by the MtMs for 5-shot and 10-shot learning. For comparison, we include the losses of commonly used meta-learning methods on this task (the performance of competing methods is taken from Park and Oliva (2019) and Zhao et al. (2020)). The proposed MtMs framework outperforms all benchmark methods by an order of magnitude for both 5-shot and 10-shot learning of the sinusoidal task. In fact, the losses are in both cases very close to the theoretical minimum of 0, indicating that the MtMs is capable of recovering the data-generating process to such a degree that, when faced with as few as 5 observations (xm,i, ym,i) from task m, it is able to almost perfectly infer ym,· as a function of xm,· over the whole range [−5, 5].
Figure 2 shows the predictions of the model fω(x; θ) as a function of x for different values of the mesa parameters θ. As is apparent from the fact that all predictions closely resemble different sine waves, the MtMs correctly determines that each generated task follows a sine function with varying phase and amplitude. However, the mesa parameters θ = [θ[1], θ[2]] explaining the variability between tasks do not directly correspond to the amplitude A and the phase b. Instead, θ[1] primarily regulates the amplitude (negatively) and, to a lesser degree, also the phase (positively), while θ[2] primarily regulates the phase (positively) and, to a lesser degree, also the amplitude. This is not surprising, as there are infinitely many parametric models that are observationally equivalent to the DGP described in Eq. 10. In particular, any two linearly independent vectors in R² span the whole space of [b, A] just as well as the basis vectors used in Eq. 10. The MtMs hence generally converges to one of these equivalent parametrizations, not necessarily to the exact parametrization used to simulate the data.
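
As a brief aside (our own restatement, not part of the original argument), this identification issue can be written out explicitly: for any invertible matrix T ∈ R^{2×2}, the reparametrized model with mesa parameters θ̃m = T⁻¹θm and meta module g̃(θ) = g(Tθ) produces exactly the same predictions,
\[
f\big(x; \tilde{g}(\tilde{\theta}_m)\big) = f\big(x; g(T T^{-1} \theta_m)\big) = f\big(x; g(\theta_m)\big) \quad \text{for all } x,
\]
so the learned mesa parameters are identified only up to such an invertible linear transformation.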
Admittedly, the sinusoidal regression problem is relatively favorable to the MtMs because the data are generated by a clearly defined low-dimensional model, and MtMs is, at its core, a method for recovering unknown parametric models. To demonstrate that the good performance is not limited to artificial tasks like this one, we apply it in the next subsections to real-life forecasting problems from the M4 and M6 competitions.

3.2 M4 Forecasting Competition


AREA UNDER CONSTRUCTION

3.3 M6 Forecasting Competition

For the M6 forecasting competition (Makridakis et al., 2022), we adopted the following strategy. First, we construct 9 analogous datasets, each consisting of 50 stocks and 50 ETFs.


Method                            K = 5           K = 10
MAML (Finn et al., 2017)          0.686 ± 0.070   0.435 ± 0.039
LayerLR (Park and Oliva, 2019)    0.528 ± 0.068   0.269 ± 0.027
Meta-SGD (Li et al., 2017)        0.482 ± 0.061   0.258 ± 0.026
MC1 (Park and Oliva, 2019)        0.426 ± 0.054   0.239 ± 0.025
MC2 (Park and Oliva, 2019)        0.405 ± 0.048   0.201 ± 0.020
MH (Zhao et al., 2020)            0.501 ± 0.082   0.281 ± 0.072
MtMs (ours)                       0.022 ± 0.003   0.014 ± 0.001

Table 1: Losses for the sinusoidal task. Mean squared errors and corresponding 95% confidence intervals for different meta-learning methods.

[Figure 2: MtMs predictions for the sinusoidal task (K = 5). Plots of fω(x; θ) as a function of x for different values of the mesa parameter vector θ. In the upper panel, the first mesa parameter θ[1] varies while θ[2] is fixed at its median value; in the lower panel, the second mesa parameter θ[2] varies while θ[1] is fixed at its median value.]



These stocks and ETFs are chosen so that their volatility and trading volume best correspond to the distribution of volatility and volume observed in the 100 assets selected by the organizers. Since predictions are to be made at a 4-week frequency, we further augment the datasets by considering alternative 4-week intervals shifted by 1, 2, and 3 weeks. We use data going back to the year 2000, with approximately one year at the end of the interval being used for testing.
For each stock and each of the four possible interval shifts, we calculate up to seven lags of 4-week returns and volatility, as well as a battery of trading indicators using the R package TTR (Ulrich, 2021). The TTR package is particularly convenient for this purpose because it provides a unified interface for a wide range of indicators, minimizing the need for manual corrections and allowing us to simply loop over the available functions. Besides the indicator of whether the asset is a stock or an ETF, all features are functions of past prices.
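
As a rough illustration of the return and volatility part of this feature set (the TTR-based indicators are omitted), the sketch below computes lagged 4-week returns and a rolling volatility proxy from a daily price series. The function name, the weekly resampling rule, and the 20-day volatility window are our own assumptions, not the exact construction used for the competition.

import pandas as pd

def four_week_features(prices: pd.Series, shift_weeks: int = 0, n_lags: int = 7) -> pd.DataFrame:
    # prices: daily closing prices with a DatetimeIndex; shift_weeks in {0, 1, 2, 3}.
    weekly = prices.resample("W-FRI").last()
    ends = weekly.iloc[shift_weeks::4]               # endpoints of the shifted 4-week intervals
    ret = ends.pct_change()                          # 4-week returns
    vol = prices.pct_change().rolling(20).std().reindex(ends.index, method="ffill")
    feats = {f"ret_lag{l}": ret.shift(l) for l in range(1, n_lags + 1)}
    feats.update({f"vol_lag{l}": vol.shift(l) for l in range(1, n_lags + 1)})
    return pd.DataFrame(feats)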
For the base model f, we opt for a simple feedforward network consisting of hidden layers of sizes {32, 8} with a leaky ReLU nonlinearity, a softmax on the last 5 output nodes, and a dropout probability of 0.2. The meta module g is a trivial feedforward network with no hidden layers, which connects only to the last layer of f; the remaining output nodes of g are left as orphaned constants. Lastly, we allow for one mesa parameter per asset, that is, dθ = 1. The fact that each asset has one degree of freedom with which it can alter the behavior of f allows us to utilize a much broader universe of assets than the 100 selected by the organizers, as the heterogeneity in their DGPs is absorbed by θ.
To train the model, we first train the base model f on pooled data, disregarding any information about which asset a given observation belongs to, with a minibatch size of 200. This trained model then serves as the initialization for the MtMs, which subsequently identifies any between-task variation. This two-step optimization process has proven to be more effective than starting the MtMs from a random initialization. The training of the MtMs took less than 1 hour on a mid-range CPU. Given that predictions are to be made for tasks already included in the training dataset, we can make predictions directly under the multi-task learning paradigm.
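
A hedged sketch of how this second step could be wired is shown below, reusing the notation of the earlier hypothetical MtMs sketch. The class name, the feature and asset counts, and the initialization details are assumptions; only the layer sizes {32, 8}, the leaky ReLU, the dropout of 0.2, the 5 softmax outputs, and dθ = 1 are taken from the text.

import torch
import torch.nn as nn

n_features, n_assets = 40, 200                       # assumed dimensions

# Step 1: pooled base model f, trained while ignoring asset identity (not shown).
base = nn.Sequential(
    nn.Linear(n_features, 32), nn.LeakyReLU(), nn.Dropout(0.2),
    nn.Linear(32, 8), nn.LeakyReLU(), nn.Dropout(0.2),
    nn.Linear(8, 5),                                 # 5 quintile logits
)

# Step 2: only the last layer of f is generated by g(theta; omega); the earlier
# layers are carried over unchanged, i.e. the corresponding output nodes of g
# are left as orphaned constants.
class M6MtMs(nn.Module):
    def __init__(self, pretrained_base, n_assets, d_theta=1):
        super().__init__()
        self.trunk = pretrained_base[:-1]            # all layers except the last
        last = pretrained_base[-1]                   # nn.Linear(8, 5)
        self.W_shape = last.weight.shape
        d_last = last.weight.numel() + last.bias.numel()
        self.Theta = nn.Parameter(torch.zeros(d_theta, n_assets))
        self.g = nn.Linear(d_theta, d_last)
        with torch.no_grad():                        # theta = 0 reproduces the pooled model
            self.g.bias.copy_(torch.cat([last.weight.flatten(), last.bias]))

    def forward(self, x, asset_id):
        beta = self.g(self.Theta[:, asset_id])
        n_w = self.W_shape[0] * self.W_shape[1]
        W, b = beta[:n_w].view(self.W_shape), beta[n_w:]
        h = self.trunk(x)
        return torch.softmax(h @ W.T + b, dim=-1)    # quintile probabilities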
As of now, MtMs ranks in 4th place with an RPS of 0.15689, outperforming both the naive benchmark (M6 Dummy, RPS: 0.16) and the market benchmark (microprediction, RPS: 0.15832; see Microprediction, 2023). These results are, however, still tentative. It appears that the good forecasting performance is achieved almost entirely by predicting volatility, as the predictions for the first (resp. second) quintile are generally very close to the predictions for the fifth (resp. fourth) quintile. Indeed, in our tests, interchanging the quintiles leads to only a marginal deterioration of predictive performance, much of which is likely attributable to the asymmetry of return distributions rather than to a genuine ability to predict variations in mean returns across assets.
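
For reference, the snippet below computes the ranked probability score for five ordered rank categories in its standard form, which is how we read the RPS figures above; the authoritative definition is the one in the competition guidelines (Makridakis et al., 2022), so treat this as an approximation.

import numpy as np

def rps(probs: np.ndarray, outcomes: np.ndarray) -> float:
    # probs, outcomes: arrays of shape (n_assets, 5); outcomes is one-hot in the realized quintile.
    cum_p = np.cumsum(probs, axis=1)
    cum_o = np.cumsum(outcomes, axis=1)
    return float(np.mean(np.sum((cum_p - cum_o) ** 2, axis=1) / 5))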
It is also important to disclose that MtMs was used only for the prediction part of the competition. Our relatively good performance on the investment part might be partly related to strategic decisions leveraging the specifics of the IR evaluation metric and the adversarial nature of the competition itself. Much of it, however, is without question attributable to luck, which is inevitable for such a fat-tailed metric computed over the span of a single year. A more detailed discussion of the investment part of the competition will be provided in a separate document, as it is unrelated to MtMs, along with the original repository used to make predictions for the competition (including its history, as the model slightly evolved during the competition).

4 Concluding remarks
The satisfactory performance in the M6 forecasting competition and the unparalleled performance on the sinusoidal regression task indicate that the proposed MtMs approach might be useful in practical applications. Currently, we are testing MtMs on data from the M4 forecasting competition (Makridakis et al., 2020), and preliminary results seem promising. The mesa parameters here generally tend to capture some combination of time series persistence and seasonality. Generally, MtMs appears well suited for time series forecasting, as the DGPs typically share some common patterns (e.g., seasonality), yet at the same time they are not identical. Another promising area might be few-shot image recognition, a typical domain of meta-learning approaches. However, we are not currently pursuing this avenue due to the lack of computational resources.




References
Bai, Y., M. Chen, P. Zhou, T. Zhao, J. Lee, S. Kakade, H. Wang, and C. Xiong
(2021). How Important is the Train-Validation Split in Meta-Learning? In
International Conference on Machine Learning, pp. 543–553. PMLR.
Finn, C., P. Abbeel, and S. Levine (2017, July). Model-Agnostic Meta-Learning
for Fast Adaptation of Deep Networks. In Proceedings of the 34th Interna-
tional Conference on Machine Learning, pp. 1126–1135. PMLR.
Hospedales, T., A. Antoniou, P. Micaelli, and A. Storkey (2021). Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (9), 5149–5169.
Huisman, M., J. N. van Rijn, and A. Plaat (2021, August). A survey of deep
meta-learning. Artificial Intelligence Review 54 (6), 4483–4541.
Li, Z., F. Zhou, F. Chen, and H. Li (2017, September). Meta-SGD: Learning to
Learn Quickly for Few-Shot Learning.
Makridakis, S., A. Gaba, R. Hollyman, F. Petropoulos, E. Spiliotis, and
N. Swanson (2022). The M6 Financial Duathlon Competition Guidelines.
Makridakis, S., E. Spiliotis, and V. Assimakopoulos (2020, January). The M4
Competition: 100,000 time series and 61 forecasting methods. International
Journal of Forecasting 36 (1), 54–74.
Microprediction (2023, February). The Options Market Beat 94% of Participants in the M6 Financial Forecasting Contest. https://medium.com/geekculture/the-options-market-beat-94-of-participants-in-the-m6-financial-forecasting-contest-fa4f47f57d33.
Park, E. and J. B. Oliva (2019). Meta-curvature. Advances in Neural Informa-
tion Processing Systems 32.

Ulrich, J. (2021, December). TTR: Technical Trading Rules.


Wang, H., H. Zhao, and B. Li (2021). Bridging multi-task learning and meta-
learning: Towards efficient training and effective adaptation. In International
Conference on Machine Learning, pp. 10991–11002. PMLR.

Zhao, D., S. Kobayashi, J. Sacramento, and J. von Oswald (2020, December). Meta-Learning via Hypernetworks. In 4th Workshop on Meta-Learning at NeurIPS 2020 (MetaLearn 2020), Virtual Conference.


