WQP GRN Onboarding Document


How to use our platform
Disclaimer: This document gives an overview of model
creation on our platform. Please read the entire document
before trying anything out on the platform, as you need a
basic understanding of how it works. At the end of the
document you will find a “tutorial” task that you can use
to try out the mechanics and capabilities of the platform.
Research Problems
Typical Workflow
Research problems are typically onboarded and addressed in the following fashion:
1. Use case is specified internally or by a client.
2. Exploratory data analysis is done to understand data requirements and the
sorts of questions we could answer for the use case.
3. For each problem that is to be answered for the use case, a platform problem
is framed by research leads in the form of playbooks and accompanying
command streams that cover the following:
a. ETL for creating the base data set.
b. What is the problem to be answered? More specifically, what are the
inputs provided by the model end-user and what kind of output
should the model end-user expect?
c. What subset of data is made available to researchers for building
models, and what additional data is reserved for testing (i.e. only
available to the model test environment, or MTE; this would be a
superset of the researcher data)?
d. What are the test inputs and what are the expected/ideal outputs? I.e.,
create a command stream for testing. Command streams should be
provided for two scenarios: using only the researcher data for fitting,
and using the MTE data (a superset of researcher data) for fitting.
4. The problem and its base data are documented in the shared Google
Drive in the appropriate project folder.
© WorldQuant Predictive Confidential

5. Researchers develop models using only the researcher data and test them
against the researcher test inputs. When they feel the models are ready, they
can submit the models for production, where the models are fitted using
MTE data and tested against the MTE inputs.
6. Research leads evaluate submitted models and promote a model into
production if it has sufficient predictive power and meets the other required
criteria on latency, etc.
7. An ensemble model would be created using some or all of the production
models. This ensemble model is refreshed periodically with new production
models.
8. Internal or client deliverables would be constructed from the ensemble
model output.
Steps before using the platform

Check your inbox for the invite to AWS Workspace. Click on the link to download
the Workspace Client app and proceed with the registration. Then access your AWS
Workspace.
Note: to use the wqpt commands on Windows, you need Cygwin, which provides an
emulated Linux command prompt. On older images (<v1.3) the default Cygwin path
is not on the D: drive, so make sure you are working on drive D: (run pwd and look
for /cygdrive/d in the output; if it is missing, use cd D:). We suggest creating a
work folder (mkdir work) and changing to that directory (cd work).
Run the Cygwin terminal and check whether Anaconda is properly configured
by running the following command:
which python
The expected output shows the python executable from the Anaconda folder. If it
does not, contact the support engineer for help.
The wqpt utility should already be installed. To configure it, run the following
command:
wqpt authtoken <token>
where <token> is replaced by your authentication token (the link to
your auth token is sent to you personally through Slack).

If needed, you can update the wqpt package by running the wqpt-update
command. This document assumes wqpt version 2.0.0+2887 or later.
You can check your version by running wqpt --version.
Updating WQPT Utility

In the near future, we will automatically update WQPT Utility for you, but in the
meantime you should run wqpt-update on your machine daily or weekly to ensure
you are always on the latest version of WQPT.

Platform Primer
Creating Models
You can set up a model skeleton for a given problem locally using:
$ wqpt create <NAME_OF_MODEL> --playbook=<NAME_OF_PROBLEM> --lang=<IMPLEMENTATION_LANGUAGE>
NOTE: If you are trying out the platform, please use demand-multivariate as the
playbook, e.g.:
wqpt create demand-pred-test01 --playbook=demand-multivariate --lang=py
If you see an error message similar to the following, the VPN connection is
probably inactive:
Error occurred: failed to fetch playbook: 'demand-multivariate'.
Status: Forbidden<403>
This creates a directory called <NAME_OF_MODEL> (a name of your choosing) that
already contains the required directory structure and stub files that comprise a
model for the problem <NAME_OF_PROBLEM>. The implementation languages currently
supported by the platform are py (Python) and R (R); the default is py (Python)
if the --lang option is not specified.
The minimal required directory structure for a model is as follows:
● .wqpt/Playbook.yml: This file contains the problem specification, i.e.
model inputs and expected outputs, as well as how models are to be tested.
● Alphafile.yml: This file contains metadata for your model.
● alpha.py or alpha.R: This is the file that forms the entry point into your
model.
● environment.yml: This file contains the dependencies for your model.
● .wqpt/commands.jsonl: This file contains the sequence of testing inputs to
your model and the ground truth values of any predictions.
● files/*: Other files in the files subdirectory contain the base data and
other data or code necessary for running a minimal model.
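As a quick sanity check of a freshly created skeleton, the list above can be turned into a small script. This is only a sketch: the helper name missing_skeleton_files is hypothetical and not part of wqpt, and the contents of files/ are not checked because they differ per problem.

```python
from pathlib import Path

# Files named in the list above. alpha.py / alpha.R is handled separately
# because only one of the two is expected, depending on --lang.
REQUIRED_FILES = [
    ".wqpt/Playbook.yml",
    ".wqpt/commands.jsonl",
    "Alphafile.yml",
    "environment.yml",
]
ENTRY_POINTS = ["alpha.py", "alpha.R"]


def missing_skeleton_files(model_dir):
    """Return the expected skeleton files that are absent from model_dir.

    Hypothetical helper for illustration; it is not part of the wqpt utility.
    """
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in ENTRY_POINTS):
        missing.append("alpha.py or alpha.R")
    return missing
```

Calling missing_skeleton_files(".") from the model's base directory returns an empty list when all of the listed files are present.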

Example:
.wqpt/Playbook.yml
The .wqpt/Playbook.yml file specifies the problem the model addresses.
It has six main sections:
● name: This is the name of the problem.

● description: This section describes the problem to be answered.
● notes: These are additional notes for the given problem, e.g. details about
testing setup, etc.
● functions: This describes the function signatures for calls to the model and
the return type from those calls, if any. Each entry in functions is the name
of a function. Each function’s entry has a parameters entry that lists
parameters for that function and their types, and may have an output entry
that lists the return type from that function. Typically, signatures should be
provided for the predict() function, which asks the model to make a
prediction, and for the set_state() function, which indicates to the model
what data it has access to at that point in time.
● statistics: This is the list of statistical modules to run in order to generate
metrics for the goodness of the model. These correspond to the predefined
module names that come as part of the platform, or to names of custom
modules specified in an optional custom.py file.
● testflow: This lists in order the different tests to be run. As of 2020-01-02,
only command_stream tests are supported. A command stream is a file in
JSON Lines format where each line is a JSON dictionary representing a
command to run on the model (along with the ground truth value for
predict() commands), and commands are run in sequence.
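To make the JSON Lines format concrete, here is a minimal sketch of reading a command stream into a list of dictionaries. The key names in the sample ("command", "args", "truth") are assumptions made for illustration only; the actual keys are defined by the platform and can be seen in your model's .wqpt/commands.jsonl file.

```python
import json


def read_command_stream(lines):
    """Parse an iterable of JSON Lines text into a list of command dicts."""
    commands = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        command = json.loads(line)
        if not isinstance(command, dict):
            raise ValueError(f"line {line_number} is not a JSON dictionary")
        commands.append(command)
    return commands


# Hypothetical sample: the key names below are illustrative assumptions,
# not the platform's actual command schema.
sample = [
    '{"command": "set_state", "args": {"date": "2019-06-01"}}',
    '{"command": "fit"}',
    '{"command": "predict", "args": {"Upc": "000"}, "truth": 1.5}',
]
commands = read_command_stream(sample)
```

The commands are kept in file order because, as described above, they are run in sequence.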
Example:

Alphafile.yml
The Alphafile.yml file contains metadata for the model. Currently, it has three main
sections:
- runtimeVersion, which indicates the platform runtime version this model was built
with.
- description, which should introduce your alpha: the methods used and the
main steps of the algorithm. Fill it in before submitting your alpha.
- alpha, which contains information about the model.
The following entries in alpha are required:
● name: This is the name of the model, which is used to identify the model in the catalog.
● playbook: This is the name of the playbook (i.e. corresponding to a research
problem) in the catalog that this model addresses.
● version: This is the version of the model. If model code is refreshed, this should be
incremented.
● runtime: This is the runtime used by the model, and is dependent on the
implementation language specified during model skeleton creation.
● entrypoint: This is the entry point into the model, i.e. the root file containing the
model, and is typically alpha.py for Python models or alpha.R for R models.
Example:

alpha.py or alpha.R
The alpha.py (or alpha.R) file is the entry point into your model if you are using
Python (or R) as the implementation language. Your model should implement an
interface with three public functions: set_state, fit, and one of predict or
predict_batch.
● set_state(param1, param2, …): This function is called in order to “set
the state of the world” for the model, i.e. it tells the model which
data it is allowed to access. E.g. after set_state(“2019-06-01”) is called,
the model should constrain the data used for any training to
“2019-06-01” or earlier. The exact semantics of this function (i.e. what the
state parameters represent in terms of data access) are problem-dependent.
The function signature must match what is specified in the
.wqpt/Playbook.yml file.
● fit(): When this function is called, the model should fit, re-train, or update
its parameters using the data available in the current state. E.g., if the
state is “2019-06-01” and fit() is called, the model should re-fit its
parameters using data up to “2019-06-01”.
● predict(param1, param2, …): When this function is called, the model
should make a prediction for the scenario described by the given parameters
param1, param2, … . The function signature must match what is specified in
the .wqpt/Playbook.yml file. Additionally, the function should return a
value of type specified in .wqpt/Playbook.yml. It may also return the None
(Python) or NULL (R) value if it is unable to make a prediction for the given
combination of inputs. Either this function or predict_batch must be
implemented. If this is implemented, comment out predict_batch parts.
● predict_batch(items): When this function is called, the model should
make predictions for the list of scenarios in the given items list. Each
member of items is a dict (Python) or list (R) of key-value pairs where
the keys correspond to the signature of predict. The function should return
a list of values of type specified in .wqpt/Playbook.yml. A value can also be
None (Python) or NULL (R) if the model is unable to make a prediction (or
decides not to) for the corresponding member of items. Either this function
or predict must be implemented. If this is implemented, comment out
predict parts.
Note that the model should persist its state between calls. Additionally, it is
recommended to implement predict_batch rather than predict to reduce
network overhead and leverage the ability of most standard ML packages to make
vectorized predictions, which is much faster than one at a time. The platform will
preferentially use predict_batch over predict if both functions are
implemented.
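The interface described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the platform's template: the parameter names (date, Upc), the in-memory _ROWS data, and the per-UPC mean used as the "model" are all hypothetical, and the real signatures must match your .wqpt/Playbook.yml.

```python
# Module-level variables persist between platform calls.
_state_date = None   # latest date the model may use, captured by set_state
_mean_by_upc = {}    # fitted parameters: mean target value per UPC

# Hypothetical in-memory data; a real model would load the base data
# from the files/ directory instead.
_ROWS = [
    {"date": "2019-05-01", "Upc": "A", "SPPD": 10.0},
    {"date": "2019-05-15", "Upc": "A", "SPPD": 14.0},
    {"date": "2019-07-01", "Upc": "A", "SPPD": 100.0},
    {"date": "2019-05-01", "Upc": "B", "SPPD": 3.0},
]


def set_state(date):
    """Capture the cutoff date; later data must not be used for fitting."""
    global _state_date
    _state_date = date


def fit():
    """Re-fit the per-UPC means using only rows up to the state date."""
    global _mean_by_upc
    values = {}
    for row in _ROWS:
        # ISO-formatted dates compare correctly as strings.
        if _state_date is not None and row["date"] > _state_date:
            continue
        values.setdefault(row["Upc"], []).append(row["SPPD"])
    _mean_by_upc = {upc: sum(v) / len(v) for upc, v in values.items()}


def predict_batch(items):
    """Return one prediction per item, or None for an unknown UPC."""
    return [_mean_by_upc.get(item["Upc"]) for item in items]
```

After set_state("2019-06-01") and fit(), the July row is excluded from fitting, and predict_batch returns None for any UPC the model has not seen.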

Examples:
● Python model showing the capturing of state in set_state, using the captured
state to limit the data available in fit as expected, and making batch
predictions using predict_batch instead of predict
● R model base template
environment.yml
The environment.yml file lists the package dependencies of your model. There
should be one entry per line, and each entry should be the name of a package from
Conda, PyPI (Python model) or CRAN (R model).
Example:
● Python model

● R model: same format as for the Python model
Here is an example from a real model in R:

custom.py and custom statistical metrics (optional)
The custom.py file contains code to define and register custom statistical modules
that compute metrics aside from the pre-defined ones. To use a registered stats
module, it has to be added to the statistics section of the .wqpt/Playbook.yml
file. Note that any additional statistical metrics you add to the
.wqpt/Playbook.yml file will only be generated in local tests and not on any
remote ones.
Example:
To add a new custom statistical module, you will need to add or edit the custom.py
file to register the module and edit the .wqpt/Playbook.yml file to add the
registered module to the list of statistical metrics to compute when running
$ wqpt stats
The following file shows how you could define a new custom statistical metric. In
this case, a new statistical module is defined to compute the Root Mean Squared
Logarithmic Error, which we register as RMSLE.
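Independently of how a module is registered with the platform, the metric itself can be sketched as a plain Python function. The function name rmsle and the choice to skip None predictions are assumptions made for this illustration; the platform's registration API is not shown here.

```python
import math


def rmsle(truths, predictions):
    """Root Mean Squared Logarithmic Error over paired values.

    Pairs whose prediction is None are skipped (an assumption made for
    this sketch). Returns None if no usable pairs remain.
    """
    pairs = [(t, p) for t, p in zip(truths, predictions) if p is not None]
    if not pairs:
        return None
    # log1p(x) = log(1 + x), which keeps zero values well-defined.
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in pairs]
    return math.sqrt(sum(squared) / len(squared))
```

Because the error is computed on log(1 + x), it penalizes relative rather than absolute differences, which suits quantities such as demand that span several orders of magnitude.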

The following file shows the changes to be made to .wqpt/Playbook.yml to use
the RMSLE module to compute the Root Mean Squared Logarithmic Error metric.
This only works for local model runs, i.e. only when you run
$ wqpt test

Model test runs
Local test runs using the researcher data set
Once your model code is ready, you can run a local test using the CLI commands in
the model’s base directory:
$ wqpt test
Remote test runs using the researcher data set
When your alpha is ready, in the directory containing your alpha, execute the
following:
$ wqpt upload -r
Remember that you need to be connected to the VPN to submit any remote runs.
Remote test runs using the researcher data set without local checks
Append -f if some local checks (e.g. requirements) fail but the model should still run on the server:
$ wqpt upload -r -f
Remote test runs using the reserved MTE data set
When your alpha is ready, in the directory containing your alpha, execute the
following:
$ wqpt upload
Remote test runs using the reserved MTE data set without local checks
Append -f if some local checks (e.g. requirements) fail but the model should still run on the server:
$ wqpt upload -f

Adding notes to a submission
Append -m and a short note about what has changed in your alpha since the last
submission. Think of it as a changelog message that helps you identify
versions.
$ wqpt upload -m "fixed exception on missing input values in any column"
Test feedback
Feedback for local test runs
Model performance is displayed in the console when running local tests, but you can query
the results again any time by executing the following command in the directory containing
your alpha:
$ wqpt stats
Feedback for remote test runs

To see how your model performed on the remote tests, please ask your WQP mentor
via Slack. During the onboarding process, until you are assigned to an internal
researcher, you can ask for feedback about model performance via the #onboarding_help
Slack channel.
Saving and loading a trained model

Using the optional capture and restore functions, a trained model can be saved and loaded.
The main goal is to skip the possibly long fitting step when it is not necessary. The alpha does
not write model data to a file directly; instead, the data is retrieved and set as a
dictionary-type variable. The alpha does not need to call capture and restore explicitly,
because the platform calls them when needed.
The capture function returns a JSON-serializable object. If capture is not defined within the
alpha, the runtime substitutes a function that returns None.
The restore function accepts a hashable object and returns True if the state is restored and
False otherwise. If restore is not defined within the alpha, the runtime substitutes a
function that returns False.
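A minimal sketch of such a pair of functions is shown below. It assumes that restore receives the same object that capture returned, and the _mean_by_upc parameter dictionary is a hypothetical stand-in for whatever your model actually fits.

```python
# Hypothetical fitted parameters held at module level by the alpha.
_mean_by_upc = {}


def capture():
    """Return the fitted parameters as a JSON-serializable object."""
    return {"mean_by_upc": dict(_mean_by_upc)}


def restore(saved):
    """Load previously captured parameters.

    Returns True if the state was restored and False otherwise, so the
    platform knows whether fitting can be skipped. Assumes `saved` is the
    object produced by capture().
    """
    global _mean_by_upc
    if not isinstance(saved, dict) or "mean_by_upc" not in saved:
        return False
    _mean_by_upc = dict(saved["mean_by_upc"])
    return True
```

Returning False for anything unrecognized mirrors the default behavior described above for an alpha that does not define restore.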

Introductory Tasks
Try building a model for one of the established problems on the platform
The goal of this task is to help establish your own research workflow and
familiarize yourself with the platform, especially the wqpt command line tool.
There are a number of problems that can be tackled, but it is recommended to try
the following one as this problem is easy to understand and well-documented:
● Demand prediction
Demand prediction for Beverages
Overview
We aim to create demand predictions at various levels of aggregation (from
UPC (Universal Product Code) level to subcategory level) in the beverage category
from the given dataset.
Performance metric: MAPE (mean absolute percentage error)
Success criteria: MAPE <30%
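For reference, MAPE can be computed as in the sketch below. The function name mape and the choice to skip zero ground-truth values (where the percentage error is undefined) are assumptions of this illustration; the platform's own statistics module may handle such cases differently.

```python
def mape(truths, predictions):
    """Mean absolute percentage error, expressed in percent.

    Pairs with a zero ground-truth value are skipped because the
    percentage error is undefined for them (an assumption of this sketch).
    """
    errors = [abs((t - p) / t) for t, p in zip(truths, predictions) if t != 0]
    if not errors:
        raise ValueError("no non-zero ground truth values")
    return 100.0 * sum(errors) / len(errors)
```

For example, predictions of 110 and 180 against true values of 100 and 200 are each off by 10%, so the MAPE is about 10%, which would meet the success criterion.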
Research question
What is the expected demand of known UPCs as a function of distribution, price, and
promo variables?

Research problem
Given a date, predict the SPPD based on given input variables:
Platform Problem Name: demand-multivariate
Scope files: Available in files folder after creation of the model in wqpt platform
Input variables (levers):
Time
● TimePeriodEndDate
Product Features (UPC level)
● Upc
Financials
● base_price
● discount_perc
● AvgPctAcv
Promo variables
● AvgPctAcvAnyDisplay
● AvgPctAcvAnyFeature
● AvgPctAcvFeatureAndDisplay
● AvgPctAcvTPR
Output variable (predicted): SPPD
Calculated fields
# | Field | Description | Formula
1 | SPPD | Sales per Point of Distribution (https://www.cpgdatainsights.com/measure-sales/velocity-how-prod-really-sells/) | [Units]/[Avg % ACV]
2 | base_price | unpromoted price (no TPR or discount) | [Base Dollars]/[Base Units]
3 | base_price_yago | year ago unpromoted price (no TPR or discount) | [Base Dollars, Yago]/[Base Units, Yago]
4 | promo_price | promoted price (TPR or discount) | [Dollars, Promo]/[Units, Promo]
5 | promo_price_yago | year ago promoted price (TPR or discount) | [Dollars, Promo, Yago]/[Units, Promo, Yago]
6 | discount_pct | the depth of the discount as a percentage vs base price | 1-[promo_price]/[base_price]
7 | discount_pct_yago | the depth of the discount as a percentage vs base price year ago | 1-[promo_price_yago]/[base_price_yago]
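The formulas for the calculated fields translate directly into code. The sketch below uses hypothetical function names and made-up figures for one UPC in one period; only the formulas themselves come from the table.

```python
def sppd(units, avg_pct_acv):
    """Sales per Point of Distribution: [Units]/[Avg % ACV]."""
    return units / avg_pct_acv


def price(dollars, units):
    """Price as dollars per unit, e.g. [Base Dollars]/[Base Units]."""
    return dollars / units


def discount_pct(promo_price, base_price):
    """Depth of the discount vs base price: 1-[promo_price]/[base_price]."""
    return 1 - promo_price / base_price


# Hypothetical figures, for illustration only.
base_price = price(dollars=500.0, units=100)   # 5.0 per unit
promo_price = price(dollars=160.0, units=40)   # 4.0 per unit
depth = discount_pct(promo_price, base_price)  # roughly 0.2, i.e. a 20% discount
velocity = sppd(units=140, avg_pct_acv=70.0)   # 2.0 units per point of distribution
```

The year-ago fields use the same formulas applied to the corresponding Yago inputs.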

Terminology
EQ units: Equalized units. Physical volume of product sold at retail, expressed in a
common unit relevant to the category. Use when comparing products of different sizes.
Some common EQ units are pounds (LBS), gallons, ounces, cases.
Formula: Units*Size(of product)
suffix "wqp": Denotes a measure, attribute or dimension which has been transformed
(e.g. binned, grouped, renamed) vs its original value in the raw data. Purpose is to
reduce the tail / outlier values or to improve data consistency. Often involves a
mapping table created manually.
price & promo variables: Collectively, in the context of SPINS data they encompass the
following:
● BasePrice
● DiscountPct
● AvgPctACV
● AvgPctACVAnyDisplay
● AvgPctACVAnyFeature
● AvgPctACVFeatureAndDisplay
● AvgPctACVTPR
Product Attributes: Also referred to as "features", attributes are a set of qualitative
characteristics which define a product. The features can be either taken as is from
SPINS data or engineered (see suffix "wqp").
ACV: All commodity volume, total retail dollar sales for an entire store across all
products and categories.
TPR: Temporary price reduction
