
Scalable Realistic Recommendation Datasets

through Fractal Expansions


Francois Belletti, Karthik Lakshmanan, Walid Krichene, Yi-Fan Chen, John Anderson
Google AI, Google
belletti@google.com, lakshmanan@google.com, walidk@google.com, yifanchen@google.com, janders@google.com

Abstract—Recommender System research currently suffers from a disconnect between the size of academic data sets and the scale of industrial production systems. In order to bridge that gap we propose to generate more massive user/item interaction data sets by expanding pre-existing public data sets. User/item incidence matrices record interactions between users and items on a given platform as a large sparse matrix whose rows correspond to users and whose columns correspond to items. Our technique expands such matrices to larger numbers of rows (users), columns (items) and non-zero values (interactions) while preserving key higher order statistical properties. We adapt Kronecker Graph Theory to user/item incidence matrices and show that the corresponding fractal expansions preserve the fat-tailed distributions of user engagement, item popularity and singular value spectra of user/item interaction matrices. Preserving such properties is key to building large realistic synthetic data sets which in turn can be employed reliably to benchmark Recommender Systems and the systems employed to train them. We provide algorithms to produce such expansions and apply them to the MovieLens 20 million data set comprising 20 million ratings of 27K movies by 138K users. The resulting expanded data set has 10 billion ratings, 864K items and 2 million users in its smaller version and can be scaled up or down. A larger version features 655 billion ratings, 7 million items and 17 million users.

Index Terms—Machine Learning, Deep Learning, Recommender Systems, Graph Theory, Simulation

TABLE I: Size of MovieLens 20M [19] vs. the industrial dataset in [55].

                   MovieLens 20M    Industrial
  #users           138K             Hundreds of Millions
  #items           27K              2M
  #topics          19               600K
  #observations    20M              Hundreds of Billions

I. INTRODUCTION

Machine Learning (ML) benchmarks compare the capabilities of models, distributed training systems and linear algebra accelerators on realistic problems at scale. For these benchmarks to be effective, results need to be reproducible by many different groups, which implies that publicly shared data sets need to be available.

Unfortunately, while Recommendation Systems constitute a key industrial application of ML at scale, large public data sets recording user/item interactions on online platforms are not yet available. For instance, although the Netflix data set [6] and the MovieLens data set [19] are publicly available, they are orders of magnitude smaller than proprietary data [3], [12], [55].

Proprietary data sets and privacy: While releasing large anonymized proprietary recommendation data sets may seem an acceptable solution from a technical standpoint, it is a non-trivial problem to preserve user privacy while still maintaining useful characteristics of the dataset. For instance, [37] shows a privacy breach of the Netflix prize dataset. More importantly, publishing anonymized industrial data sets runs counter to user expectations that their data may only be used in a restricted manner to improve the quality of their experience on the platform.

Therefore, we decide not to make user data more broadly available in order to preserve the privacy of users. We instead choose to produce synthetic yet realistic data sets whose scale is commensurable with that of our production problems while only consuming already publicly available data.

Producing a realistic MovieLens 10 billion+ dataset: In this work, we focus on the MovieLens dataset, which only entails movie ratings posted publicly by users of the MovieLens platform. The MovieLens data set has now become a standard benchmark for academic research in Recommender Systems; [1], [8], [20], [22], [26], [32], [34], [38], [44], [48], [49], [54], [56], [58] are only a few of the many recent research articles relying on MovieLens, whose latest version [19] has accrued more than 800 citations according to Google Scholar. Unfortunately, the data set comprises only few observed interactions and, more importantly, a very small catalogue of users and items — when compared to industrial proprietary recommendation data.

In order to provide a new data set — more aligned with the needs of production scale Recommender Systems — we aim at expanding publicly available data by creating a realistic surrogate. The following constraints help create a production-size synthetic recommendation problem similar to, and at least as hard an ML problem as, the original one for matrix factorization approaches to recommendations [20], [25]:
• orders of magnitude more users and items are present in the synthetic dataset;
• the synthetic dataset is realistic in that its first and second order statistics match those of the original dataset presented in Figure 1.

Key first and second order statistics of interest we aim to preserve are summarized in Figure 1 — the details of their computation are given in Section IV.

Fig. 1: Key first and second order properties of the original MovieLens 20m user/item rating matrix (after centering and re-scaling into [−1, 1]) we aim to preserve while synthetically expanding the data set. Top: item popularity distribution (total ratings of each item). Middle: user engagement distribution (total ratings of each user). Bottom: dominant singular values of the rating matrix (core to the difficulty of matrix factorization tasks). In all log/log plots the small fraction of non-positive row-wise and column-wise sums is removed.

Adapting Kronecker Graph expansions to user/item feedback: We employ the Kronecker Graph Theory introduced in [28] to achieve a suitable fractal expansion of recommendation data to benchmark linear and non-linear user/item factorization approaches for recommendations [20], [25]. Consider a recommendation problem comprising m users and n items. Let (R_{i,j})_{i=1...m, j=1...n} be the sparse matrix of recorded interactions (e.g. the rating left by user i for item j if any, and 0 otherwise). The key insight we develop in the present paper is that a carefully crafted fractal expansion of R can preserve high level statistics of the original data set while scaling its size up by multiple orders of magnitude.

Many different transforms can be applied to the matrix R, which can be considered a standard sparse 2-dimensional image. A recent approach to creating synthetic recommendation data sets consists in making parametric assumptions on user behavior by instantiating a user model interacting with an online platform [11], [46]. Unfortunately, such methods (even calibrated to reproduce empirical facts in actual data sets) do not provide strong guarantees that the resulting interaction data is similar to the original. Therefore, instead of simulating recommendations in a parametric user-centric way as in [11], [46], we choose a non-parametric approach operating directly in the space of user/item affinity. In order to synthesize a large realistic dataset in a principled manner, we adapt the Kronecker expansions which have previously been employed to produce large realistic graphs in [28]. We employ a non-parametric, analytically tractable simulation of the evolution of the user/item bi-partite graph to create a large synthetic data set. Our choice is to trade off realism for analytic tractability; we emphasize the latter.

While Kronecker Graph Theory is developed in [28], [29] on square adjacency matrices, the Kronecker product operator is well defined on rectangular matrices and therefore we can apply a similar technique to user/item interaction data sets — which was already noted in [29] but not developed extensively. The Kronecker Graph generation paradigm has to be changed for the present data set in other aspects however: we need to decrease the expansion rate to generate data sets with the scale we desire, not orders of magnitude too large. We need to do so while maintaining key conservation properties of the original algorithm [29].

In order to reliably employ Kronecker based fractal expansions on recommender system data we devise the following contributions:
• we develop a new technique based on linear algebra to adapt fractal Kronecker expansions to recommendation problems;
• we demonstrate that key recommendation system specific properties of the original dataset are preserved by our technique;
• we also show that the resulting algorithm we develop is scalable and easily parallelizable as we employ it on the actual MovieLens 20 million dataset;
• we produce a synthetic yet realistic MovieLens 655 billion dataset to help recommender system research scale up in computational benchmarks for model training.

The present article is organized as follows: we first recapitulate prior research on ML for recommendations and large synthetic dataset generation; we then develop an adaptation of Kronecker Graphs to user/item interaction matrices and prove key theoretical properties; finally we apply the resulting algorithm experimentally to MovieLens 20m data and validate its statistical properties.
II. RELATED WORK

Recommender Systems constitute the workhorse of many e-commerce, social networking and entertainment platforms. In the present paper we focus on the classical setting where the key role of a recommender system is to suggest relevant items to a given user. Although other approaches are very popular, such as content based recommendations [43] or social recommendations [10], collaborative filtering remains a prevalent approach to the recommendation problem [33], [42], [45].

Collaborative filtering: The key insight behind collaborative filtering is to learn affinities between users and items based on previously collected user/item interaction data. Collaborative filtering exists in different flavors. Neighborhood methods group users by inter-user similarity and will recommend items to a given user that have been consumed by neighbors [43]. Latent factor methods such as matrix factorization [25] try to decompose user/item affinity as the result of the interaction of a few underlying representative factors characterizing the user and the item. Although other latent models have been developed and could be used to construct synthetic recommendation data sets (e.g. Principal Component Analysis [23] or Latent Dirichlet Allocation [9]), we focus on insights derived from matrix factorization.

The matrix factorization approach represents the affinity a_{i,j} between a user i and an item j with an inner product x_i^T y_j where x_i and y_j are two vectors in R^d representing the user and the item respectively. Given a sparse matrix of user/item interactions R = (r_{i,j})_{i=1...m, j=1...n}, user and item factors can therefore be learned by approximating R with a low rank matrix X Y^T where X ∈ R^{m,k} entails the user factors and Y ∈ R^{n,k} contains the item factors. The data set R represents ratings as in the MovieLens dataset [19] or item consumption (r_{i,j} = 1 if and only if user i has consumed item j [6]). The matrix factorization approach is an example of a solution to the rating matrix completion problem, which aims at predicting the rating of an item j by a user i that has not been observed yet and corresponds to a value of 0 in the sparse original rating matrix. Such a factorization method learns an approximation of the data that preserves a few higher order properties of the rating matrix R. In particular, the low rank approximation tries to mimic the singular value spectrum of the original data set. We draw inspiration from matrix factorization to tackle synthetic data generation. The present paper adopts a similar approach to extend collaborative filtering data sets: besides trying to preserve the spectral properties of the original data, we operate under the constraint of conserving its first and second order statistical properties.
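As a minimal illustration of this low-rank view (our sketch, not the training procedure of [25], which optimizes specialized losses over observed entries only; here the zeros standing for unobserved ratings are treated as data, purely for illustration), a rank-k approximation of a small rating matrix can be obtained with a truncated SVD:

    import numpy as np

    def low_rank_completion_sketch(R, k):
        # Truncated SVD: the best rank-k approximation of R in
        # Frobenius norm (Eckart-Young). Zeros stand for unobserved
        # ratings, as in the sparse rating matrix described above.
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        X = U[:, :k] * s[:k]          # user factors, shape (m, k)
        Y = Vt[:k, :].T               # item factors, shape (n, k)
        return X @ Y.T                # dense scores for every (i, j)

    # Toy example: 4 users x 3 items, 0 means unobserved.
    R = np.array([[5., 0., 1.],
                  [4., 1., 0.],
                  [0., 1., 5.],
                  [1., 0., 4.]])
    scores = low_rank_completion_sketch(R, k=2)
    print(scores.round(2))  # predicted affinity for unobserved pairs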
Deep Learning for Recommender Systems: Collaborative filtering has known many recent developments which motivate our objective of expanding public data sets in a realistic manner. Deep Neural Networks (DNNs) are now becoming common in both non-linear matrix factorization tasks [12], [20], [51] and sequential recommendations [18], [47], [57]. The mapping between user/item pairs and ratings is generally learned by training the neural model to predict user behavior on a large data set of previously observed user/item interactions.

DNNs consume large quantities of data and are computationally expensive to train; therefore, they give rise to commonly shared benchmarks aimed at speeding up training. For training, a Stochastic Gradient Descent method is employed [27], which requires forward model computation and back-propagation to be run on many mini-batches of (user, item, score) examples. The matrix completion task still consists in predicting a rating for the interaction of user i and item j although (i, j) has not been observed in the original data set. The model is typically run on billions of examples as the training procedure iterates over the training data set.

Model freshness is generally critical to industrial recommendations [12], which implies that only limited time is available to re-train the model on newly available data. The throughput of the trainer is therefore crucial to providing more engaging recommendation experiences and presenting more novel items. Unfortunately, public recommendation data sets are too small to provide training-time-to-accuracy benchmarks that can be realistically employed for industrial applications. Too few different examples are available in MovieLens 20m for instance, and the number of different available items is orders of magnitude too small. In many industrial settings, millions of items (e.g. products, videos, songs) have to be taken into account by recommendation models. The recommendation model learns an embedding matrix of size (N, d) where d ~ 10 to 10^3 and N ~ 10^6 to 10^9 are typical values. As a consequence, the memory footprint of this matrix may dominate that of the rest of the model by several orders of magnitude. During training, the latency and bandwidth of the access to such embedding matrices have a prominent influence on the final throughput in examples/second. Such computational difficulties associated with learning large embedding matrices are worthwhile solving in benchmarks. A higher throughput enables training models with more examples, which enables better statistical regularization and architectural expressiveness. The multi-billion interaction size of the data set used for training is also a major factor that affects modeling choices and infrastructure development in the industry.

Our aim is therefore to enable a comparison of modeling approaches, software engineering frameworks and hardware accelerators for ML in the context of industry scale recommendations. In the present paper we focus on enabling a better evaluation of examples/sec throughput and training-time-to-accuracy for Neural Collaborative Filtering approaches [20] and Matrix Factorization approaches [25]. A major issue with this endeavor is, as we mentioned, the size of publicly available collaborative filtering data sets, which is orders of magnitude smaller than production grade data (see Table I) and has mis-representative orders of magnitude in terms of numbers of distinct users and items. The present paper offers a first solution to this problem by providing a simple and tractable non-parametric fractal approach to scaling up public recommendation data sets by several orders of magnitude.
Fig. 2: Typical user/item interaction patterns in recommendation data sets: a block structure relating user groups to item groups within the user/item interaction matrix. Self-similarity appears as a natural key feature of the hierarchical organization of users and items into groups of various granularity.

Such data sets will help build a first set of benchmarks for model training accelerators based on publicly available data. We plan to publish the expanded data. However, MovieLens 20m is publicly available and our method can already be applied to recreate such an expanded data set. Meta-data is also common in industrial applications [3], [7], [12] but we consider its expansion outside the scope of this first development.

III. FRACTAL EXPANSIONS OF USER/ITEM INTERACTION DATA SETS

The present section delineates the insights orienting our design decisions when expanding public recommendation data sets.

1) Self-similarity in user/item interactions: Interactions between users and items follow a natural hierarchy in data sets where items can be organized in topics, genres, categories etc. [55]. There is for instance an item-level fractal structure in MovieLens 20m with a tree-like structure of genres, sub-genres, and directors. If users were clustered according to their demographics and tastes, another hierarchy would be formed [43]. The corresponding structured user/item interaction matrix is illustrated in Figure 2. The hierarchical nature of user/item interactions (topical and demographic) makes the recommendation data set structurally self-similar (i.e. patterns that occur at more granular scales resemble those affecting coarser scales [36]).

One can therefore build a user-group/item-category incidence matrix with user-groups as rows and item-categories as columns — a coarse interaction matrix. As each user group consists of many individuals and each item category comprises multiple movies, the original individual-level user/item interaction matrix may be considered as an expanded version of the coarse interaction matrix. We choose to expand the user/item interaction matrix by extrapolating this self-similar structure and simulating its growth to yet another level of granularity: each original item is treated as a synthetic topic in the expanded data set and each actual user is considered a fictional user group.

A key advantage of this fractal procedure is that it may be entirely non-parametric and designed to preserve high level properties of the original dataset. In particular, a fractal expansion re-introduces the patterns originally observed in the entire real dataset within each block of local interactions of the synthetic user/item matrix. By carefully designing the way such blocks are produced and laid out, we can therefore hope to produce a realistic yet much larger rating matrix. In the following, we show how the Kronecker operator enables such a construction.

2) Fractal expansion through Kronecker products: The Kronecker product — denoted ⊗ — is a non-standard matrix operator with an intrinsic self-similar structure:

$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix} \quad (1)$$

where A ∈ R^{m,n}, B ∈ R^{p,q} and A ⊗ B ∈ R^{mp,nq}.
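For concreteness, the self-similar block structure of Eq. (1) can be inspected directly with numpy (a small sketch of ours, not part of the paper's tooling):

    import numpy as np

    A = np.array([[1., 2.],
                  [0., 3.]])
    B = np.array([[1., 0., 2.],
                  [0., 4., 0.]])

    K = np.kron(A, B)  # Kronecker product, shape (mp, nq) = (4, 6)
    assert K.shape == (A.shape[0] * B.shape[0], A.shape[1] * B.shape[1])

    # Each (i, j) block of K is a_ij * B: a rescaled copy of the whole
    # of B, which is exactly the self-similar structure of Eq. (1).
    print(np.allclose(K[:2, :3], A[0, 0] * B))  # True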
In the original presentation of Kronecker Graph Theory [28], as well as in the stochastic extension [35] and the extended theory [29], the Kronecker product is the core operator enabling the synthesis of graphs with exponentially growing adjacency matrices. As in the present work, the insight underlying the use of Kronecker Graph Theory in [29] is to produce large synthetic yet realistic graphs. The fractal nature of the Kronecker operator as it is applied multiple times (see Figure 2 in [29] for an illustration) fits the self-similar statistical properties of real world graphs such as the internet, the web or social networks [28].

If A is the adjacency matrix of the original graph, fractal expansions are created in [28] by chaining Kronecker products as follows:

$$A \otimes A \otimes \cdots \otimes A.$$

As adjacency matrices are square, Kronecker Graphs have not been employed on rectangular matrices in pre-existing work, although the operation is well defined. Another slight divergence between the present and pre-existing work is that — in their stochastic version — Kronecker Graphs carry Bernoulli probability distribution parameters in [0, 1] while MovieLens ratings lie in {0.5, 1.0, . . . , 5.0} originally and in [−1, 1] after we center and rescale them. We show that these differences do not prevent Kronecker products from preserving core properties of rating matrices. A more important challenge is the size of the original matrix we deal with: R ∈ R^{138×10^3, 27×10^3}. A naive Kronecker expansion R ⊗ R would therefore synthesize a rating matrix with 19 billion users, which is too large.
Thus, although Kronecker products seem like an ideal candidate for the mechanism at the core of the self-similar synthesis of a larger recommendation dataset, some modifications are needed to the algorithms developed in [29].

3) Reduced Kronecker expansions: We choose to synthesize a user/item rating matrix

$$\tilde{R} = \hat{R} \otimes R$$

where R̂ is a matrix derived from R but much smaller (for instance R̂ ∈ R^{128,256}). For reasons that will become apparent as we explore some theoretical properties of Kronecker fractal expansions, we want to construct a smaller derived matrix R̂ that shares similarities with R. In particular, we seek R̂ with a similar row-wise sum distribution (user engagement distribution), column-wise sum distribution (item popularity distribution) and singular value spectrum (signal to noise ratio distribution in the matrix factorization).

4) Implementation at scale and algorithmic extensions: Computing a Kronecker product between two matrices A and B is an inherently parallel operation. It is sufficient to broadcast B to each element (i, j) of A and then multiply B by a_{i,j}. Such a property implies that scaling can be achieved. Another advantage of the operator is that even a single machine can produce a large output data set by sequentially iterating on the values of A. Only storage space commensurable with the size of the original matrix is needed to compute each block of the Kronecker product. It is noteworthy that generalized fractal expansions can be defined by altering the standard Kronecker product. We consider such extensions here as candidates to engineer more challenging synthetic data sets. One drawback though is that these extensions may not preserve analytic tractability.

A first generalization defines a binary operator ⊗_F with F : R × R^{m,n} × N → R^{p,q} as follows:

$$A \otimes_F B = \begin{pmatrix} F(a_{11}, B, \omega_{11}) & \cdots & F(a_{1n}, B, \omega_{1n}) \\ \vdots & \ddots & \vdots \\ F(a_{m1}, B, \omega_{m1}) & \cdots & F(a_{mn}, B, \omega_{mn}) \end{pmatrix} \quad (2)$$

where ω_{11}, . . . , ω_{mn} is a sequence of pseudo-random numbers. Including randomization and non-linearity in F appears as a simple way to synthesize data sets entailing more varied patterns. The algorithm we employ to compute Kronecker products is presented in Algorithm 1. The implementation we employ is trivially parallelizable. We only create a list of Kronecker blocks to dump entire rows (users) of the output matrix to file. This is not necessary and can be removed to enable as many processes to run simultaneously and independently as there are elements in R̂ (provided pseudo-random numbers are generated in parallel in an appropriate manner).

Algorithm 1 Kronecker fractal expansion
  for i = 1 to m do
    kBlocks ← empty list
    for j = 1 to n do
      ω ← next pseudo-random number
      kBlock ← F(R̂(i, j), R, ω)
      append kBlock to kBlocks
    end for
    outputToFile(kBlocks)
  end for
are generated in parallel in an appropriate manner). [55]. To characterize such a behavior in the MovieLens data
The only reason why we need a reduced version R b of R is set, we take a look at the distribution of the total ratings
to control the size of the expansion. Also, A ⊗ B and B ⊗ A along the item axis and the user axis. In other words, we
are equal after a row-wise and a column-wise permutation. compute row-wise and column-wise sums for the rating matrix
Therefore, another family of appropriate extensions may be R and observe their distributions. The corresponding ranked
obtained by considering distributions are exposed in Figure 1 and do exhibit a clear
  “power-law” behavior for rather popular items. However we
b11 G(A, ω11 ) . . . b1n G(A, ω1n ) observe that tail items have a higher popularity decay rate.
B ⊗G A =  .. .. ..
 (3) Similarly, the engagement decay rate increases for the group
 
. . .
bm1 G(A, ωm1 ) . . . bmn G(A, ωmn ) of less engaged users.
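The centering and re-scaling step described above, together with the ranked row-wise and column-wise sums of Figure 1, can be reproduced along the following lines (a sketch of ours assuming the standard ratings.csv layout of MovieLens 20m with userId, movieId and rating columns; the column names and re-indexing are our assumptions):

    import numpy as np
    import pandas as pd
    from scipy import sparse

    df = pd.read_csv("ratings.csv")  # columns: userId, movieId, rating, ...
    users = df["userId"].astype("category").cat.codes.to_numpy()
    items = df["movieId"].astype("category").cat.codes.to_numpy()

    # Center on the global mean (about 3.53 for MovieLens 20m) so that the
    # implicit zeros of the sparse matrix mean "unobserved", then re-scale
    # the observed values into [-1, 1].
    vals = df["rating"].to_numpy() - df["rating"].mean()
    vals = vals / np.abs(vals).max()

    R = sparse.csr_matrix((vals, (users, items)))
    user_engagement = np.asarray(R.sum(axis=1)).ravel()  # row-wise sums
    item_popularity = np.asarray(R.sum(axis=0)).ravel()  # column-wise sums

    # Ranked (sorted) distributions, as plotted in log/log in Figure 1.
    ranked_users = np.sort(user_engagement)[::-1]
    ranked_items = np.sort(item_popularity)[::-1]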
The other approximate "power-law" we find in Figure 1 lies in the singular value spectrum of the MovieLens dataset. We compute the top k singular values [21] of the MovieLens rating matrix R by approximate iterative methods (e.g. power iteration), which can scale to its large (138K, 27K) dimension. The method yields the dominant singular values of R and the corresponding singular vectors so that one can classically approximate R by R ≃ UΣV where Σ is diagonal of dimension (k, k), U is column-orthogonal of dimension (m, k) and V is row-orthogonal of dimension (k, n) — which yields the rank k matrix closest to R in Frobenius norm.

Examining the distribution of the 2048 top singular values of R in the MovieLens dataset (which has at most 27K non-zero singular values) in Figure 1 highlights a clear "power-law" behavior in the highest magnitude part of the spectrum of R. We observe in the spectral distribution an inflection for smaller singular values, whose magnitude decays at a higher rate than that of larger singular values. Such a spectral distribution is a key feature of the original dataset, in particular in that it conditions the difficulty of low-rank approximation approaches to the matrix completion problem. Therefore, we also want the expanded dataset to exhibit a similar behavior in terms of spectral properties.

In all the high level statistics we present, we want to preserve the approximate "power-law" decay as well as its inflection for smaller values. Our requirements for the expanding transform which we apply to R are therefore threefold: we want to preserve the distributions of the row-wise sums of R, of the column-wise sums of R and of the singular values of R. Additional requirements, beyond first and second order high level statistics, would further increase the confidence in the realism of the expanded synthetic dataset. Nevertheless, we consider that focusing on these first three properties is a good starting point.

It is noteworthy that we do not consider here the temporal structure of the MovieLens data set. We leave the study of sequential user behavior — often found to be Long Range Dependent [5], [13], [41] — and the extension of synthetic data generation to sequential recommendations [3], [48], [53] for further work.

2) Preserving MovieLens data properties while expanding it: We now expose the fractal transform design we rely on to preserve the key statistical properties of the previous section.

Definition 1: Consider A ∈ R^{m,n} = (a_{i,j})_{i=1...m, j=1...n}. We denote by R(A) the set {∑_{j=1}^{n} a_{i,j}, i = 1 . . . m} of row-wise sums of A, by C(A) the set {∑_{i=1}^{m} a_{i,j}, j = 1 . . . n} of column-wise sums of A, and by S(A) the set of non-zero singular values of A.

Definition 2: Consider an integer i and a non-zero positive integer p. We denote the integer part of i − 1 in base p by ⌊i − 1⌋_p = ⌊(i − 1)/p⌋ and the corresponding fractional part by {i − 1}_p = (i − 1) − p⌊i − 1⌋_p.

First we focus on conservation properties in terms of row-wise and column-wise sums, which correspond respectively to marginalized user engagement and item popularity distributions. In the following, × denotes the Minkowski product of two sets, i.e. A × B = {a × b | a ∈ A, b ∈ B}.

Proposition 1: Consider A ∈ R^{m,n} and B ∈ R^{p,q} and their Kronecker product K = A ⊗ B. Then

R(K) = R(A) × R(B) and C(K) = C(A) × C(B).

Proof 1: Consider the i-th row of K. By definition of K, K_{i,j} = a_{⌊i−1⌋_p+1, ⌊j−1⌋_q+1} b_{{i−1}_p+1, {j−1}_q+1}, so the corresponding sum can be rewritten as follows:

$$\sum_{j=1}^{nq} K_{i,j} = \sum_{j=1}^{nq} a_{\lfloor i-1 \rfloor_p + 1,\, \lfloor j-1 \rfloor_q + 1} \; b_{\{i-1\}_p + 1,\, \{j-1\}_q + 1}$$

which in turn equals

$$\Big( \sum_{j=1}^{n} a_{\lfloor i-1 \rfloor_p + 1,\, j} \Big) \Big( \sum_{j'=1}^{q} b_{\{i-1\}_p + 1,\, j'} \Big).$$

Refactoring the two sums concludes the proof for the row-wise sum properties. The proof for column-wise properties is identical. □

Theorem 1: Consider A ∈ R^{m,n} and B ∈ R^{p,q} and their Kronecker product K = A ⊗ B. Then

S(K) = S(A) × S(B).

Proof 2: One can easily check that (XY) ⊗ (VW) = (X ⊗ V)(Y ⊗ W) for any quadruple of matrices X, Y, V, W for which the notation makes sense, and that (X ⊗ Y)^T = X^T ⊗ Y^T. Let A = U_A Σ_A V_A be the SVD of A and B = U_B Σ_B V_B the SVD of B. Then A ⊗ B = (U_A ⊗ U_B)(Σ_A ⊗ Σ_B)(V_A ⊗ V_B). Now, (U_A ⊗ U_B)^T (U_A ⊗ U_B) = (U_A^T ⊗ U_B^T)(U_A ⊗ U_B) = (U_A^T U_A) ⊗ (U_B^T U_B). Writing the same decomposition for (V_A ⊗ V_B)(V_A ⊗ V_B)^T and considering that U_A, U_B are column-orthogonal while V_A, V_B are row-orthogonal concludes the proof. □

The properties above imply that knowing the row-wise sums, column-wise sums and singular value spectrum of the reduced rating matrix R̂ and the original rating matrix R is enough to deduce the corresponding properties for the expanded rating matrix R̃ — analytically. As in [29], the Kronecker product enables analytic tractability while expanding data sets in a fractal manner to orders of magnitude more data.
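Proposition 1 and Theorem 1 are easy to sanity-check numerically on small random matrices (an illustrative verification of ours):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 4))
    B = rng.normal(size=(2, 5))
    K = np.kron(A, B)  # shape (6, 20)

    # Proposition 1: row sums of K = Minkowski product of the row sums.
    row_K = np.sort(K.sum(axis=1))
    row_AB = np.sort(np.outer(A.sum(axis=1), B.sum(axis=1)).ravel())
    assert np.allclose(row_K, row_AB)

    # Theorem 1: singular values of K = pairwise products of those of A, B.
    s_K = np.sort(np.linalg.svd(K, compute_uv=False))
    s_AB = np.sort(np.outer(np.linalg.svd(A, compute_uv=False),
                            np.linalg.svd(B, compute_uv=False)).ravel())
    assert np.allclose(s_K, s_AB)
    print("Proposition 1 and Theorem 1 verified on this example.")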
3) Constructing a reduced R̂ matrix with a similar spectrum: Considering that the quasi "power-law" properties of R imply — as in [29] — that S(R) × S(R) has a distribution similar to that of S(R), we seek a small R̂ whose high order statistical properties are similar to those of R. As we want to generate a dataset with several billion user/item interactions, millions of distinct users and millions of distinct items, we are looking for a matrix R̂ with a few hundred or thousand rows and columns. The reduced matrix R̂ we seek is therefore orders of magnitude smaller than R. In order to produce a reduced matrix R̂ of dimensions (1000, 1700) one could use the older, reduced-size MovieLens 100K dataset [19]. Such a dataset can be interpreted as a sub-sampled reduced version of MovieLens 20m with similar properties. However, the two data sets have been collected seven years apart and therefore temporal non-stationarity issues become concerning. Also, we aim to produce an expansion method where the expansion multipliers can be chosen flexibly by practitioners.
In our experiments, it is noteworthy that naive uniform user and item sampling strategies have not yielded smaller matrices R̂ with properties similar to those of R. Different random projections [2], [15], [31] could more generally be employed; however, we rely on a procedure better tailored to our specific statistical requirements.

We now describe the technique we employed to produce a reduced size matrix R̂ with first and second order properties close to those of R, which in turn led to constructing an expansion matrix R̃ = R̂ ⊗ R similar to R. We want the dimensions of R̂ to be (m', n') with m' ≪ m and n' ≪ n. Consider again the approximate Singular Value Decomposition (SVD) [21] of R with the k = min(m', n') principal singular values of R:

$$R \simeq U \Sigma V \quad (4)$$

where U ∈ R^{m,k} has orthogonal columns, V ∈ R^{k,n} has orthogonal rows, and Σ ∈ R^{k,k} is diagonal with non-negative terms.

To reduce the number of rows and columns of R while preserving its top k singular values, a trivial solution would consist in replacing U and V by small random orthogonal matrices with few rows and columns respectively. Unfortunately, such a method would only seemingly preserve the spectral properties of R, as the principal singular vectors would be widely changed. Such properties are important: one of the key advantages of employing Kronecker products in [29] is the preservation of the network values, i.e. the distributions of singular vector components of a graph's adjacency matrix.

To obtain a matrix Ũ ∈ R^{m',k} with fewer rows than U but column-orthogonal and similar to U in the distribution of its values, we use the following procedure. We re-size U down to m' rows with m' < m through an averaging-based down-scaling method that can classically be found in standard image processing libraries (e.g. skimage.transform.resize in the scikit-image library [50]). Let Ū ∈ R^{m',k} be the corresponding resized version of U. We then construct Ũ as the column-orthogonal matrix in R^{m',k} closest in Frobenius norm to Ū. Therefore, as in [17], we compute

$$\tilde{U} = \bar{U} \left( \bar{U}^T \bar{U} \right)^{-1/2}. \quad (5)$$

We apply a similar procedure to V to reduce its number of columns, which yields a row-orthogonal matrix Ṽ ∈ R^{k,n'} with n' < n. The orthogonality of Ũ (column-wise) and Ṽ (row-wise) guarantees that the singular value spectrum of

$$\hat{R} = \tilde{U} \Sigma \tilde{V} \quad (6)$$

consists exactly of the k = min(m', n') leading components of the singular value spectrum of R. Like R, R̂ is re-scaled to take values in [−1, 1]. The whole procedure to reduce R down to R̂ is summarized in Algorithm 2.

Algorithm 2 Compute reduced matrix R̂
  (U, Σ, V) ← sparseSVD(R, k)
  Ū ← imageResize(U, m', k)
  V̄ ← imageResize(V, k, n')
  Ũ ← Ū (Ū^T Ū)^{−1/2}
  Ṽ ← (V̄ V̄^T)^{−1/2} V̄
  R̂_tmp ← Ũ Σ Ṽ
  M ← max(R̂_tmp)
  m ← min(R̂_tmp)
  return R̂_tmp / (M − m)
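A compact sketch of Algorithm 2 with scipy and scikit-image (our illustration; the paper's primitives sparseSVD and imageResize are mapped here to scipy.sparse.linalg.svds and skimage.transform.resize, and the inverse square roots of Eqs. (5)-(6) are computed through an eigendecomposition; the exact production code may differ):

    import numpy as np
    from scipy.sparse.linalg import svds
    from skimage.transform import resize

    def inv_sqrt_psd(M, eps=1e-12):
        # M^{-1/2} for a symmetric positive semi-definite matrix.
        w, Q = np.linalg.eigh(M)
        return (Q / np.sqrt(np.maximum(w, eps))) @ Q.T

    def reduced_matrix(R_sparse, m_red, n_red):
        k = min(m_red, n_red)
        U, s, Vt = svds(R_sparse, k=k)       # leading k singular triplets
        U_bar = resize(U, (m_red, k))        # averaging-based down-scaling
        Vt_bar = resize(Vt, (k, n_red))
        U_t = U_bar @ inv_sqrt_psd(U_bar.T @ U_bar)       # Eq. (5)
        Vt_t = inv_sqrt_psd(Vt_bar @ Vt_bar.T) @ Vt_bar   # row-orthogonal analogue
        R_hat = U_t @ np.diag(s) @ Vt_t                   # Eq. (6)
        return R_hat / (R_hat.max() - R_hat.min())        # re-scale as in Algorithm 2

    # Example: R_hat = reduced_matrix(R, 16, 32), as in Section V.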
We verify empirically that the distributions of values of the reduced singular vectors in Ũ and Ṽ are similar to those of U and V respectively, so as to preserve first order properties of R and the value distributions of its singular vectors. Such properties are demonstrated through numerical experiments in the next section.

V. EXPERIMENTATION ON MOVIELENS 20 MILLION DATA

The MovieLens 20m data comprises 20m ratings given by 138 thousand users to 27 thousand items. In the present section, we demonstrate how the fractal Kronecker expansion technique we devised and presented helps scale up this dataset to orders of magnitude more users, items and interactions — all in a parallelizable and analytically tractable manner.

1) Size of expanded data set: In the present experiments we construct a reduced rating matrix R̂ of size (16, 32), which implies the resulting expanded data set will comprise 10 billion interactions between 2 million users and 864K items. In the appendix, we present the results obtained for a reduced rating matrix R̂ of size (128, 256) and a synthetic data set consisting of 655 billion interactions between 17 million users and 7 million items.

Such a high number of interactions and items enables the training of deep neural collaborative models such as the Neural Collaborative Filtering model [20] at a scale which is now more representative of industrial settings. Moreover, the increased data set size helps construct benchmarks for deep learning software packages and ML accelerators that employ the same orders of magnitude as production settings in terms of user base size, item vocabulary size and number of observations.

2) Empirical properties of the reduced R̂ matrix: The construction technique of R̂ had for objective to produce, just like in [29], a matrix sharing the properties of R ⊗ R though smaller in size. To that end, we aimed at constructing a matrix R̂ of dimension (16, 32) with properties close to those of R in terms of column-wise sum, row-wise sum and singular value spectrum distributions.

We now check that the construction procedure we devised does produce an R̂ with the properties we expected. As the impact of the re-sizing step is unclear from an analytic standpoint, we had to resort to numerical experiments to validate our method.

In Figure 3, one can assess that the first and second order properties of R and R̂ match with high enough fidelity. In particular, the higher magnitude column-wise and row-wise sum distributions follow a "power-law" behavior similar to that of the original matrix. Similar observations can be made about the singular value spectra of R̂ and R.
Fig. 3: Properties of the reduced dataset R̂ ∈ R^{16,32} built according to Eqs. (4), (5) and (6). We validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̂. Note here that as R is large we only compute its leading singular values. As we want to preserve statistical "power-laws", we focus on the preservation of the relative distribution of values and not their absolute magnitude.

Fig. 4: High order statistical properties of the expanded dataset R̂ ⊗ R. We validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̂ ⊗ R. Here we leverage the tractability of Kronecker products as they impact column-wise and row-wise sum distributions as well as singular value spectra. The plots corresponding to the extended dataset are derived analytically based on the corresponding properties of the reduced matrix R̂ and the original matrix R. Note here that as we only computed the leading singular values of R, we only show the leading singular values of R̂ ⊗ R. In all plots we can observe the preservation of the linear log-log correspondence for the higher values in the distributions of interest (row-wise sums, column-wise sums and singular values) as well as the accelerated decay of the smaller values in those distributions.

There is therefore now a reasonable likelihood that our adapted Kronecker expansion — although somewhat differing from the method originally presented in [29] — will enjoy the same benefits in terms of enabling data set expansion while preserving high order statistical properties.

3) Empirical properties of the expanded data set R̃: We now verify empirically that the expanded rating matrix R̃ = R̂ ⊗ R does share common first and second order properties with the original rating matrix R. The new data size is 2 orders of magnitude larger in terms of number of rows and columns and 4 orders of magnitude larger in terms of number of non-zero terms. Notice here that, as in general R̂ is a dense matrix, the level of sparsity of the expanded data set is the same as that of the original.

Another benefit of using a fractal expansion method with analytic tractability is that we can deduce high order statistics of the expanded data set beforehand, without having to instantiate it. In particular, Proposition 1 implies that knowing the column-wise and row-wise sum distributions of R̂ and R is sufficient to determine the corresponding marginals for the expanded data set R̃ = R̂ ⊗ R. Similarly, the leading singular values of the Kronecker product can be computed with Theorem 1 just based on the leading singular values of R and the singular values of R̂.
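Concretely, these expanded-data-set statistics can be obtained without ever materializing the multi-billion entry matrix R̃ (a small sketch of ours built directly on Proposition 1 and Theorem 1):

    import numpy as np

    def expanded_statistics(row_sums_R, col_sums_R, sv_R,
                            row_sums_Rhat, col_sums_Rhat, sv_Rhat):
        """Statistics of R_hat ⊗ R from those of R_hat and R alone."""
        # Proposition 1: Minkowski products give the expanded marginals.
        row_sums = np.outer(row_sums_Rhat, row_sums_R).ravel()
        col_sums = np.outer(col_sums_Rhat, col_sums_R).ravel()
        # Theorem 1: pairwise products give the expanded spectrum; with
        # only the leading singular values of R, this yields the leading
        # singular values of the expansion.
        singular_values = np.sort(np.outer(sv_Rhat, sv_R).ravel())[::-1]
        return row_sums, col_sums, singular_values

Sorting the returned arrays in decreasing order reproduces the ranked log/log plots of Figure 4 directly from the statistics of R̂ and R.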
In Figure 4, one can confirm that the spectral properties of the expanded data set as well as the user engagement (row-wise sums) and item popularity (column-wise sums) are similar to those of the original data set. Such observations indicate that the resulting data set is representative — in its fat-tailed data distribution and quasi "power-law" singular value spectrum — of problems encountered in ML for collaborative filtering.
Furthermore, the expanded data set reproduces some irregularities of the original data, in particular the accelerating decay of values in the ranked row-wise and column-wise sums as well as in the singular value spectrum.

Fig. 5: Sorted ratings in the original MovieLens 20m data set and the extended data. We sample 20m ratings from the new data set. Multiplications inherent to Kronecker products create a finer granularity rating scale which somewhat differs from the original rating scale. However, we can check that the new rating scale is still representative of common recommendation problems in that most rating values are close to the average and few user/item interactions are very positive or negative.

4) Limitations: Although it does not condition the difficulty of rating matrix factorization problems — which depends primarily on the interaction distribution, the number of users, the number of items and the sparsity of the rating matrix — the distribution of ratings is still an important statistical property of the MovieLens 20m dataset. Figure 5 shows that there is a certain degree of divergence between the rating scales of the original and synthetic data sets. In particular, many more rating values are present in the expanded data set as a result of the multiplication of terms from R̂ and R. The transformations turning R into R̂ create a smoother scale of values, leading to a Kronecker product with many different possible ratings. Although the non-zero value distribution is a divergence point between the two data sets, ratings in the synthetic data set are dominated by values which are close to the average, as in the original MovieLens 20m. Therefore the synthetic ratings do have a certain degree of realism as they represent user/item interactions where strong reactions (positive or negative) from users are much less likely than neutral interactions.

Another limitation of the synthetic data set is the block-wise repetitive structure of Kronecker products. Although the synthetic data set is still hard to factorize as the product of two low rank matrices — because its singular values are still distributed similarly to those of the original data set — it is now easy to factorize with a Kronecker SVD [24], which takes advantage of the block-wise repetitions in Eq (1). The randomized fractal expansions we presented in Eq (2) and Eq (3) address this issue. A simple example of such a randomized variation around the Kronecker product consists in shuffling the rows and columns of each block in Eq (1) independently at random. The shuffles will break the block-wise repetitive structure and prevent a Kronecker SVD from producing a trivial solution to the factorization problem.
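Such a shuffled block corresponds to choosing F(a, B, ω) = a · Π_ω B Π'_ω in Eq. (2), with Π_ω and Π'_ω random permutation matrices; a minimal sketch of ours:

    import numpy as np

    def shuffled_block(a_ij, B, rng):
        # One randomized block of Eq. (2): a rescaled copy of B whose rows
        # and columns are independently permuted, breaking the block-wise
        # repetition that a Kronecker SVD would otherwise exploit.
        rows = rng.permutation(B.shape[0])
        cols = rng.permutation(B.shape[1])
        return a_ij * B[np.ix_(rows, cols)]

    rng = np.random.default_rng(0)
    B = np.arange(6.0).reshape(2, 3)
    print(shuffled_block(0.5, B, rng))

Note that row and column permutations leave row-wise sum multisets, column-wise sum multisets and singular values of each block unchanged, so the conservation properties of Section IV still hold block-wise under this randomization.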
As a result, the expansion technique we present appears as a reliable first candidate to train linear matrix factorization models [43] and non-linear user/item similarity scoring models [20].

VI. CONCLUSION

In conclusion, this paper presents a first attempt at synthesizing realistic large-scale recommendation data sets without having to make compromises in terms of user privacy. We use a small publicly available data set, MovieLens 20m, and expand it to orders of magnitude more users, items and observed ratings. Our expansion model is rooted in the hierarchical structure of user/item interactions, which naturally suggests a fractal extrapolation model.

We leverage Kronecker products as self-similar operators on user/item rating matrices that impact key properties of row-wise and column-wise sums as well as singular value spectra in an analytically tractable manner. We modify the original Kronecker Graph generation method to enable an expansion of the original data by orders of magnitude that yields a synthetic data set matching industrial recommendation data sets in scale. Our numerical experiments demonstrate that the data set we create has key first and second order properties similar to those of the original MovieLens 20m rating matrix.

Our next steps consist in making large synthetic data sets publicly available, although any researcher can readily use the techniques we presented to scale up any user/item interaction matrix. Another possible direction is to adapt the present method to recommendation data sets featuring meta-data (e.g. timestamps, topics, device information). The use of meta-data is indeed critical to solve the "cold-start" problem of users and items having no interaction history with the platform. We also plan to benchmark the performance of well established baselines on the new large scale realistic synthetic data we produce.

REFERENCES

[1] Abdollahpouri, H., Burke, R., and Mobasher, B. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (2017), ACM, pp. 42-46.
[2] Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66, 4 (2003), 671-687.
[3] Belletti, F., Beutel, A., Jain, S., and Chi, E. Factorized recurrent neural architectures for longer range dependence. In International Conference on Artificial Intelligence and Statistics (2018), pp. 1522-1530.
[4] Belletti, F., Lakshmanan, K., Krichene, W., Chen, Y.-F., and Anderson, J. Scalable realistic recommendation datasets through fractal expansions. arXiv preprint arXiv:1901.08910 (2019).
[5] Belletti, F., Sparks, E., Bayen, A., and Gonzalez, J. Random projection design for scalable implicit smoothing of randomly observed stochastic processes. In Artificial Intelligence and Statistics (2017), pp. 700-708.
[6] Bennett, J., Lanning, S., et al. The Netflix prize. In Proceedings of KDD Cup and Workshop (2007), vol. 2007, New York, NY, USA, p. 35.
[7] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. Latent cross: Making use of context in recurrent recommender systems. In International Conference on Web Search and Data Mining (2018), ACM, pp. 46-54.
[8] Bhargava, A., Ganti, R., and Nowak, R. Active positive semidefinite matrix completion: Algorithms, theory and applications. In Artificial Intelligence and Statistics (2017), pp. 1349-1357.
[9] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993-1022.
[10] Chaney, A. J., Blei, D. M., and Eliassi-Rad, T. A probabilistic model for using social networks in personalized item recommendation. In Proceedings of the 9th ACM Conference on Recommender Systems (2015), ACM, pp. 43-50.
[11] Chaney, A. J., Stewart, B. M., and Engelhardt, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. arXiv preprint arXiv:1710.11214 (2017).
[12] Covington, P., Adams, J., and Sargin, E. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 191-198.
[13] Crane, R., and Sornette, D. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105, 41 (2008), 15649-15653.
[14] Cremonesi, P., Koren, Y., and Turrin, R. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (2010), ACM, pp. 39-46.
[15] Fradkin, D., and Madigan, D. Experiments with random projections for machine learning. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003), ACM, pp. 517-522.
[16] Goel, S., Broder, A., Gabrilovich, E., and Pang, B. Anatomy of the long tail: ordinary people with extraordinary tastes. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (2010), ACM, pp. 201-210.
[17] Golub, G. H., and Van Loan, C. F. Matrix Computations, vol. 3. JHU Press, 2012.
[18] Hariri, N., Mobasher, B., and Burke, R. Context-aware music recommendation based on latent topic sequential patterns. In Proceedings of the Sixth ACM Conference on Recommender Systems (2012), ACM, pp. 131-138.
[19] Harper, F. M., and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2016), 19.
[20] He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (2017), International World Wide Web Conferences Steering Committee, pp. 173-182.
[21] Horn, R. A., Horn, R. A., and Johnson, C. R. Matrix Analysis. Cambridge University Press, 1990.
[22] Jawanpuria, P., and Mishra, B. A unified framework for structured low-rank matrix learning. In International Conference on Machine Learning (2018), pp. 2259-2268.
[23] Jolliffe, I. Principal component analysis. In International Encyclopedia of Statistical Science. Springer, 2011, pp. 1094-1096.
[24] Kamm, J., and Nagy, J. G. Optimal Kronecker product approximation of block Toeplitz matrices. SIAM Journal on Matrix Analysis and Applications 22, 1 (2000), 155-172.
[25] Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 8 (2009), 30-37.
[26] Krichene, W., Mayoraz, N., Rendle, S., Zhang, L., Yi, X., Hong, L., Chi, E., and Anderson, J. Efficient training on very large corpora via gramian estimation. arXiv preprint arXiv:1807.07187 (2018).
[27] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436.
[28] Leskovec, J., Chakrabarti, D., Kleinberg, J., and Faloutsos, C. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In European Conference on Principles of Data Mining and Knowledge Discovery (2005), Springer, pp. 133-145.
[29] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11, Feb (2010), 985-1042.
[30] Levy, M., and Bosteels, K. Music recommendation and the long tail. In 1st Workshop on Music Recommendation and Discovery (WOMRAD), ACM RecSys, 2010, Barcelona, Spain (2010), Citeseer.
[31] Li, P., Hastie, T. J., and Church, K. W. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006), ACM, pp. 287-296.
[32] Liang, D., Altosaar, J., Charlin, L., and Blei, D. M. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 59-66.
[33] Linden, G., Smith, B., and York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 1 (2003), 76-80.
[34] Lu, J., Liang, G., Sun, J., and Bi, J. A sparse interactive model for matrix completion with side information. In Advances in Neural Information Processing Systems (2016), pp. 4071-4079.
[35] Mahdian, M., and Xu, Y. Stochastic Kronecker graphs. In International Workshop on Algorithms and Models for the Web-Graph (2007), Springer, pp. 179-186.
[36] Mandelbrot, B. B. The Fractal Geometry of Nature, vol. 1. WH Freeman, New York, 1982.
[37] Narayanan, A., and Shmatikov, V. How to break anonymity of the Netflix prize dataset. arXiv preprint cs/0610105 (2006).
[38] Nimishakavi, M., Jawanpuria, P. K., and Mishra, B. A dual framework for low-rank tensor completion. In Advances in Neural Information Processing Systems (2018), pp. 5489-5500.
[39] Oestreicher-Singer, G., and Sundararajan, A. Recommendation networks and the long tail of electronic commerce. MIS Quarterly (2012), 65-83.
[40] Park, Y.-J., and Tuzhilin, A. The long tail of recommender systems and how to leverage it. In Proceedings of the 2008 ACM Conference on Recommender Systems (2008), ACM, pp. 11-18.
[41] Pipiras, V., and Taqqu, M. S. Long-Range Dependence and Self-Similarity, vol. 45. Cambridge University Press, 2017.
[42] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work (1994), ACM, pp. 175-186.
[43] Ricci, F., Rokach, L., and Shapira, B. Recommender systems: introduction and challenges. In Recommender Systems Handbook. Springer, 2015, pp. 1-34.
[44] Rudolph, M., Ruiz, F., Mandt, S., and Blei, D. Exponential family embeddings. In Advances in Neural Information Processing Systems (2016), pp. 478-486.
[45] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (2001), ACM, pp. 285-295.
[46] Schmit, S., and Riquelme, C. Human interaction with recommendation systems. arXiv preprint arXiv:1703.00535 (2017).
[47] Shani, G., Heckerman, D., and Brafman, R. I. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265-1295.
[48] Tang, J., and Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In International Conference on Web Search and Data Mining (2018), IEEE, pp. 565-573.
[49] Tu, K., Cui, P., Wang, X., Wang, F., and Zhu, W. Structural deep embedding for hyper-networks. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
[50] Van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., and Yu, T. scikit-image: image processing in Python. PeerJ 2 (2014), e453.
[51] Wang, H., Wang, N., and Yeung, D.-Y. Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), ACM, pp. 1235-1244.
[52] Yin, H., Cui, B., Li, J., Yao, J., and Chen, C. Challenging the long tail recommendation. Proceedings of the VLDB Endowment 5, 9 (2012), 896-907.
[53] Yu, F., Liu, Q., Wu, S., Wang, L., and Tan, T. A dynamic recurrent model for next basket recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (2016), ACM, pp. 729-732.
[54] Zhang, J.-D., Chow, C.-Y., and Xu, J. Enabling kernel-based attribute-aware matrix factorization for rating prediction. IEEE Transactions on Knowledge and Data Engineering 29, 4 (2017), 798-812.
[55] Zhao, Q., Chen, J., Chen, M., Jain, S., Beutel, A., Belletti, F., and Chi, E. H. Categorical-attributes-based item classification for recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems (2018), ACM, pp. 320-328.
[56] Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. arXiv preprint arXiv:1605.09477 (2016).
[57] Zhou, B., Hui, S. C., and Chang, K. An intelligent recommender system using sequential web access patterns. In IEEE Conference on Cybernetics and Intelligent Systems (2004), vol. 1, IEEE Singapore, pp. 393-398.
[58] Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), ACM, pp. 1059-1068.
APPENDIX
MovieLens 655 billion

In this section, we present the properties of a larger expansion for MovieLens. The reduced rating matrix R̂ is now of size (128, 256) and the synthetic data consists of 655 billion interactions between 17 million users and 7 million items.

Empirical properties of the reduced matrix R̂: We assess the scalability of the approach we present to synthesize R̂. In particular, we check that with a size of (128, 256) instead of (16, 32), R̂ still shares common statistical properties with the original matrix R. Figure 6 demonstrates that the construction method we devised for R̂ still preserves key statistical properties of R.

Fig. 6: Properties of the reduced dataset R̂ ∈ R^{128,256}. Once more, we validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̂.

Numerical validation for the expanded data set: We now verify that even with different extension factors and a much larger size, the synthetic data set we generate is similar to the original MovieLens 20m. We focus on the distributions of column-wise and row-wise sums in R̃ = R̂ ⊗ R as well as the singular value distribution of the expanded matrix. In Figure 7, we find again that the "power-law" statistical behaviors and their inflections are preserved by the expansion procedure we designed.
Fig. 7: We again validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̂ ⊗ R. Here as well, we can observe the similarity between key features of the original rating matrix and its synthetic expansion.

Fig. 8: Sorted ratings in the original MovieLens 20m data set and the extended data. We sample 20m ratings from the new data set. Again we observe a certain divergence between the synthetic data set and the original. There is some realism though in that most interactions are near neutral.

Limitations: The previous observations demonstrate the scalability and robustness of our expansion method, even with an expansion factor of (128, 256). However, the same limitations are present as in the smaller case, and Figure 8 shows that a similar divergence in rating scales exists between the original data set and its expanded synthetic version. Like before, the synthetic ratings remain realistic in that their majority is near average. The present section therefore demonstrates that our method scales up and is able to synthesize very large realistic recommendation data sets.
