
Received November 18, 2019, accepted December 15, 2019, date of publication December 20, 2019, date of current version December 31, 2019.


Digital Object Identifier 10.1109/ACCESS.2019.2961182

Multi-Level Coupling Network for Non-IID Sequential Recommendation
YATONG SUN¹, GUIBING GUO¹, XIAODONG HE², AND XIAOHUA LIU³
¹Software College, Northeastern University, Shenyang 110819, China
²JD AI Research, Beijing 100050, China
³AINemo, Montreal, QC H4B 2L9, Canada

Corresponding author: Guibing Guo (guogb@swc.neu.edu.cn)


This work was supported in part by the National Natural Science Foundation of China under Grant 61972078 and Grant 61702084, and in part by the Fundamental Research Funds for the Central Universities under Grant N181705007.

The associate editor coordinating the review of this manuscript and approving it for publication was Ruqiang Yan.

ABSTRACT Sequential recommendation has recently attracted much attention, aiming to suggest users the next items to interact with. However, most traditional studies implicitly assume that users and items are independent and identically distributed (IID) and ignore the couplings within and between users and items. Although deep learning techniques allow the model to learn coupling relationships, they capture only a partial picture of the underlying non-IID problem. We argue that it is essential to explicitly distinguish the interactions between multiple aspects at different levels for modeling users' dynamic preferences. On the other hand, existing non-IID recommendation methods are not well designed for the sequential recommendation task, since users' interaction sequences cause more complicated couplings within the system. Hence, to systematically exploit non-IID theory for coupling learning in sequential recommendation, we propose a non-IID sequential recommendation framework, which extracts users' dynamic preferences from the complex couplings among three aspects (users, interacted items and target items) at the feature, entity and preference levels. Furthermore, to capture the couplings effectively, we implement the framework on the basis of a capsule network, showing that the basic ideas of capsules seamlessly suit our purpose of modeling sophisticated couplings. Finally, we conduct extensive experiments to demonstrate the effectiveness of our approach. The results on three public datasets show that our model consistently outperforms six state-of-the-art methods over three ranking metrics.

INDEX TERMS Capsule network, coupling network, non-IID recommendation, sequential recommendation.
I. INTRODUCTION
With the rapid development of personalized services, the sequential patterns of user-item interactions have been viewed as a valuable source of information to enhance traditional recommender systems, as user interest generally evolves over time. For instance, on YouTube, the videos played (by the same user) at consecutive time slots could be correlated. To better understand and characterize users' dynamic preferences, the task of sequential recommendation has attracted much attention. It aims to recommend the top-N items with which a user is likely to interact in subsequent time steps.

However, most traditional sequential recommendation methods implicitly assume that users and items are independent and identically distributed (IID). Specifically, they overlook the complex couplings within and between features, users and items, which could be crucial for extracting users' dynamic preferences. For instance, suppose there is a user who likes comedy movies in general and has recently watched several movies starring Denzel Washington. A traditional sequential recommender would then find it reasonable to recommend Carbon Copy, a comedy starring Denzel Washington, to the user. However, this decision might not be appropriate, as Denzel Washington is a great action-film actor but less convincing in comedies. In other words, the influences of Denzel Washington, comedy and action should not be considered independent of the user's preference; the feature coupling of Denzel Washington and comedy should lead to a low recommendation score. Apart from features, the items interacted with by different users are also usually coupled together by sharing similar or correlated properties. Furthermore, for sequential recommendation, it is important to consider the couplings of users' long- and short-term preferences. Whereas existing non-IID


recommendation approaches are only designed to model users' static preferences and ignore the influence of users' purchase sequences.

On the other hand, the sequential relationships in recommendation tasks are different from those in natural language processing (NLP) applications. First, there are no strict grammar rules that govern the order of items. A user's new interactions are usually generated by sharing similar or correlated features (feature-feature coupling) and items (item-item coupling) with the user's historical feedback. Therefore, extracting couplings effectively is crucial for learning sequential patterns from the roughly ordered interactions, which cannot be solved properly by copying NLP solutions. Second, the data of sequential recommendation is much sparser than that of NLP tasks, since the interaction sequences are generated by different users and the model aims to make accurate personalized predictions. Hence, it is necessary to take user-item couplings into consideration for modeling each user's dynamic preference.

To capture the complex couplings of sequential recommendation, we propose a non-IID sequential recommendation framework as shown in Figure 1(c). It incorporates comprehensive couplings at three levels. To begin with, we adopt user/item intra-couplings to capture the couplings between features and form the representation of each entity (each user and item). Then we extract users' long-, short- and long-short term preferences from the couplings between users and their interacted items. Finally, each user's dynamic preference is composed of the coupling of her multi-term preferences.

The main contributions of this paper are as follows.
• We highlight the significance of modeling couplings for sequential recommendation and propose a novel Multi-level Coupling Network to fill the gap between non-IID recommender systems and sequential recommendation. It considers comprehensive couplings within and between multiple aspects at different levels for accurate modeling of users' dynamic preferences.
• We implement the framework on the basis of a capsule network and show that the basic idea of capsules seamlessly suits our need of modeling sophisticated coupling relationships, while also avoiding the intrinsic limitations of widely used CNN- and RNN-based methods. In addition, we devise three routing strategies to model user-item couplings with different influential levels of user information. To the best of our knowledge, this is the first work to demonstrate the effectiveness of capsule networks for sequential recommendation.
• We conduct extensive experiments on three real datasets. The results show that our approach significantly outperforms six state-of-the-art models. We also demonstrate the effectiveness of each of our designs with comprehensive ablation tests.

II. RELATED WORK
Our work synthesizes two tracks of research, namely non-IID recommendation and sequential recommendation.

A. NON-IID RECOMMENDER SYSTEMS
The study of non-IID recommender systems was introduced to cater for the non-IID nature of users and items [1]–[3], and can be summarized into two categories.

Similarity-based methods adopt manually designed similarity measures to capture the couplings in the system. CMF [4] exploits coupled user/item similarity [5] to find similar users and items. CPF [6] captures user/item interactions by factorizing the similarity matrix with Poisson factorization. Wang et al. [5] propose a similarity measure for nominal attributes, arguing that couplings exist not only between attributes, but also between the values of each attribute. Besides, Wang et al. [7] put forward a similarity measure for numerical attributes, which extends the Pearson correlation coefficient [8] with a Taylor-like non-linear expansion.

The basic framework of similarity-based methods is summarized in Figure 1(a). In general, similarity-based methods capture coupling relationships hierarchically. They first model feature-level couplings from inter- and intra-feature interactions, which are further used to form the coupling representations of users and items. Then they capture entity-level couplings by assuming that like-minded users/items also share similar coupling representations. Finally, they match each user-item pair via their coupling representations to approximate the rating matrix.

Neural network-based methods exploit deep neural networks to extract couplings from comprehensive interactions. NCF [9] captures linear and non-linear couplings between users and items with an MF module and a multi-layer perceptron (MLP) respectively. ConvNCF [10] uses an outer product to explicitly model pairwise interactions between feature dimensions. CoupledCF [11] jointly learns explicit and implicit couplings within and between users and items with a CNN and an MLP respectively. PGCR [12] puts forward a policy-gradient method for contextual recommendations to capture high-order non-linear couplings of features in the contexts. HERS [13] aggregates user-user/item-item relations within a given context for coupling learning.

The general framework of neural network-based methods is summarized in Figure 1(b). Similar to similarity-based methods, they also extract couplings from comprehensive interactions. The main difference is that they learn couplings implicitly with a hierarchical neural network, instead of distinguishing the couplings at different levels explicitly. In other words, similarity-based methods are built upon a strong prior about how the couplings are organized in the system, while neural network-based methods mainly learn the couplings from data.

However, both approaches have their limitations. To be specific, the performance of similarity-based methods relies heavily on their coupling assumptions, while neural network-based methods require high-quality datasets to support the training of their expressive models. Moreover, existing non-IID recommender frameworks are not well designed for the sequential recommendation task. Given the dynamic characteristic of user preference, the coupling relationships become more complicated within and between more aspects at different levels, as shown in Figure 1(c).

FIGURE 1. Basic frameworks of non-IID recommendation methods, where (a, b) show the frameworks of most existing work and (c) is our proposed
generic non-IID sequential recommendation framework.

B. SEQUENTIAL RECOMMENDATION
The development of sequential recommendation can be roughly summarized into two stages.

In the early stage, approaches utilize Markov chains (MC) to model the sequential data. For instance, FPMC [14] learns a transition cube where each slice is a user-specific transition matrix of an underlying MC based on the user's purchase history. The model is trained by an adaption of the Bayesian Personalized Ranking [15] framework so that it can bring together the advantages of MC and matrix factorization (MF). Fossil [16] shares a similar idea with FPMC but replaces the MF part with FISM [17]. The major problem of FPMC and Fossil is that they assume each interacted item influences users' behavior linearly and independently, which makes them unable to capture the couplings between interacted items. HRM [18] partially addresses the problem by summarizing multiple interacting factors through a nonlinear max pooling operation. However, this aggregation operation fails to capture the combination patterns of local adjacent items, i.e., the feature-level couplings within and between items. Besides, these methods also have difficulty in learning long-short interactive preferences effectively.

Recently, many deep learning methods have been proposed to capture accurate sequential features, and they achieve state-of-the-art recommendation accuracy in comparison with the traditional approaches. For instance, DREAM [19] and GRU4Rec [20], [21] embed all of a user's historical interactions into the final hidden state of an RNN, so as to represent the current preference. Other RNN-based methods enhance the performance by integrating user personal information [22], attention mechanisms [23], [24] or other contextual information [25]–[28]. Memory networks [29], [30] have also been utilized to read and write a user's dynamic preference explicitly, under the assumption that the items/features similar to the target item should be given more attention. KSR [31] combines a knowledge graph with a memory network to provide interpretability for sequential recommendation. On the other hand, CNNs have also been recognized as a suitable structure to mine local sequential patterns of item sequences. Specifically, Caser [32] embeds a user's recent interactions into an 'image' which is then processed by convolutional networks, so as to form the user's short-term preference. 3D-CNN [33] integrates additional side information to enhance the performance of the CNN-based framework. However, CNNs are good at detecting features but less effective at exploring the spatial relationships among features due to the pooling operation. To solve this issue, NextItNet [34] adopts a stack of holed (dilated) convolutional layers to model long-range dependencies without pooling. M3R [35] employs a mixture of models, each with a different temporal range; these models are combined by a learned gating mechanism capable of exerting different model combinations given different contextual information. BERT4Rec [36] employs deep bidirectional self-attention to model user behavior sequences.

Although existing sequential recommendation methods have achieved promising results, they are not able to explicitly distinguish the complicated couplings within the system. Specifically, RNN-based methods update a user's dynamic preference by directly interacting with the representation of the next item, and thus lose the important hierarchical interactions. For CNN-based methods, the feature-level coupling is partially captured by the convolution operation, but the learned coupling types are limited by the shape and number of filters; they also lack the capacity to model entity- and preference-level couplings explicitly. Furthermore, recent


studies show that straightforward transformations for fusing embeddings of different representations usually limit the effectiveness [37], [38]. However, most existing approaches adopt oversimplified strategies, like concatenation, to combine users' general (long-term) and short-term preferences, which may fail to model users' complex long-short interactive preferences.

III. NON-IID SEQUENTIAL RECOMMENDATION FRAMEWORK WITH CAPSULE NETWORK

A. NON-IID SEQUENTIAL RECOMMENDATION FRAMEWORK
To solve the challenges of non-IID sequential recommendation, we borrow the basic ideas of both similarity-based methods and neural network-based methods, and construct our non-IID sequential recommendation framework as shown in Figure 1(c).

In particular, we capture the couplings explicitly with a hierarchical architecture, as similarity-based methods do. At first, we conduct feature interactions within each embedding vector to form the coupling representations of each user and her interacted items. Then we perform entity-level user-item and item-item interactions to build multi-term preferences. Finally, each user's dynamic preference is composed of the coupling of her multi-term preferences. It is worth noting that we do not incorporate the target item for hierarchical coupling learning, for two reasons. In the first place, users' dynamic preferences should not be influenced by interactions in the future. Besides, the computation cost would be too large to extract a different dynamic preference embedding for every target item. At the same time, we implement the hierarchical framework on the basis of a capsule network to effectively learn the sophisticated couplings. The details of the architecture are illustrated in the next section.

B. CAPSULE NETWORK
In this paper, we realize the non-IID sequential recommendation framework on the basis of a capsule network. In this section, we first analyze why the capsule network fits our requirements of modeling sophisticated couplings, and then elaborate the technical details of its implementation.

Firstly, the main intuition of capsules suits our need of managing expanded feature representations. The capsule was proposed to extend the expressive capacity of a neuron: it replaces each neuron with an activity vector which represents the instantiation parameters of a specific type of entity, such as an object or an object part [39]. Similarly, we expand the representation of each feature for coupling learning [5], [7], as each feature is represented not only by its own extent, but also by its relations with other features. It is then reasonable to adopt capsules to arrange these expanded representations for higher-level couplings.

Secondly, the dynamic routing mechanism of the capsule network is suitable for arranging the part-whole relationships of representations in our multi-level coupling network, where the higher-level couplings are built on top of the lower-level couplings. Specifically, the transformation matrices between capsules at different levels learn to convert lower-level features into higher-level representations. On the other hand, the dynamic routing between capsules at the same level learns to highlight the salient common information and eliminate the noise. This process is crucial for sequential recommendation, since the common ground of different aspects usually reflects a user's current interest, whereas accidental feedback can confuse the model when learning the real preference.

Formally, the dynamic routing process is summarized in Algorithm 1.

Algorithm 1: Dynamic Routing
Input: outputs of the capsules at layer y, where the output embedding of capsule i is denoted as m_i.
Output: the activity vectors of the capsules at the subsequent layer y+1, where capsule j's activity vector is denoted as v̂_j.
 1: for each i ∈ y and j ∈ y+1 do
 2:     m̂_{j|i} ← W_{ij} m_i                              // build vote vectors
 3:     b_{ij} ← 0                                         // initialize weights
 4: for T iterations do
 5:     for each i ∈ y and j ∈ y+1 do
 6:         c_{ij} ← exp(b_{ij}) / Σ_k exp(b_{ik})
 7:     for each j ∈ y+1 do
 8:         s_j ← Σ_i c_{ij} m̂_{j|i}                      // weighted sum
 9:         v̂_j ← (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)  // squashing
10:     for each i ∈ y and j ∈ y+1 do
11:         b_{ij} ← b_{ij} + m̂_{j|i}ᵀ v̂_j                // update
12: return all the v̂_j at layer y+1

The input to a capsule j at layer y+1 is the set of outputs of the capsules at layer y. At line 2, the vector sent from capsule i to capsule j is first transformed by a matrix W_{ij} ∈ R^{D×D}; the generated vector m̂_{j|i} is termed the vote vector from capsule i to capsule j. b_{ij} is the log prior probability that capsule i should be coupled to capsule j, initialized to 0. At lines 5-6, for each capsule i at layer y, we adopt softmax to calculate the probability that capsule i will be coupled with each capsule j at layer y+1, denoted as c_{ij}. Next, at line 8, for each capsule j at layer y+1, we calculate the weighted sum of its received vote vectors, denoted as s_j. Then, at line 9, we squash s_j into the vector v̂_j, the activity vector of capsule j, where ‖·‖ denotes the L2 norm of a vector. In this way, the L2 norm of v̂_j is squashed into (0, 1), representing the activation degree of capsule j, while the direction of v̂_j indicates the instantiation parameters of capsule j. Finally, at lines 10-11, all the log prior probabilities b_{ij} are updated w.r.t. the dot similarity between m̂_{j|i} and v̂_j. Intuitively, the agreements between vote vectors are highlighted while anomalous information is suppressed over the routing iterations. Figure 5(b) presents an example of a simple capsule network with four input embeddings and three output capsules.


With the help of dynamic routing, features can be constituted hierarchically through the capsule network. For instance, in computer vision tasks [40]–[43], the features are learned in the form of edges, shapes and objects. In natural language processing tasks [44], the features are learned in the form of words, phrases and sentences. Wang et al. [45] propose a capsule architecture for sentiment analysis, but replace the routing process with an attention mechanism.

C. DYNAMIC ROUTING WITH USER INFORMATION
When modeling user-item couplings with a capsule network, the straightforward intuition is to pay more attention to the items which are similar to the user's general interest. However, the extent of user influence is hard to determine. To investigate an appropriate manner of guiding routing with user information, we devise three dynamic routing strategies with different influential levels.

1) USER AS DATA POINT
FIGURE 2. Treat user as data point for routing.

During dynamic routing, the lower-level capsules can be viewed as data points that will be clustered into several Gaussian distributions. Therefore, the most straightforward way of integrating user information is to treat the user as a data point in the same item space. That is, the user participates in the routing procedure just as the items do, as shown in Figure 2. This strategy has a relatively loose effect on the routing process, since the user's prediction vector will be deactivated when its vote strongly disagrees with the items. In other words, if a user's current taste differs a lot from her general interest, the routing output will be dominated by the user's short-term preference.

2) USER AS ACTIVATION REGULARIZER
FIGURE 3. Treat user as activation regularizer for routing.

To pose more influence on the routing process, we assume that capsules agreeing with the user's general interest should be activated more easily. In comparison with the first strategy, we make the user influence non-negligible by regularizing the activation calculation. Formally, we add the following updating rule for v̂_j after line 9 of Algorithm 1:

\[
\hat{v}_j = \hat{v}_j \cdot \frac{\hat{v}_j^{\top}\,\hat{m}_{j|u}}{\|\hat{v}_j\|\,\|\hat{m}_{j|u}\|} \tag{1}
\]

where m̂_{j|u} is the squashed user vote vector for capsule j, calculated by the same process as for the items. In this way, we re-scale the activity vector of capsule j w.r.t. its cosine similarity with the user's vote vector while keeping the original direction, as shown in Figure 3. We adopt cosine similarity instead of the dot product since we mainly focus on the similarity between their properties (orientation). In this view, if a higher-level capsule's activity vector is far from the user's long-term preference, this capsule will be difficult to activate. This strategy only restricts the activation degree of higher-level capsules, but imposes no constraints on their properties.

3) USER AS ROUTING BIAS
FIGURE 4. Treat user as bias for routing.

The third strategy is to treat the user information as the bias of all lower-level capsules' votes for one higher-level capsule. Formally, we modify line 8 of Algorithm 1 as follows:

\[
\mu_j \leftarrow \frac{\sum_i c_{ij}\,\hat{m}_{j|i}}{\sum_i c_{ij}} \tag{2}
\]

\[
s_j^h \leftarrow \sqrt{\frac{\sum_i c_{ij}\,\big(\hat{m}_{j|i}^h - \mu_j^h\big)^2}{\sum_i c_{ij}}} + \hat{m}_{j|u}^h \tag{3}
\]

where μ_j denotes the expectation of the items' vote vectors for capsule j, and the superscript h denotes the h-th dimension of a vector. The new s_j is thus constituted by the standard deviation of the items' votes plus the user's general preference, as shown in Figure 4. Similar to the second strategy, the routing procedure is forced to consider the influence of the user information, regardless of its agreement with the other items. But it restricts routing harder, as it contributes to the calculation of the output properties directly. When there are strong disagreements between a user and her latest interacted items, their properties will be fused together to form the user's long-short term preference.
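As a rough sketch of how each strategy modifies the routing loop above (again our own illustration; `m_hat_u` stands for the user's squashed vote vector m̂_{j|u}, and all function names are ours):

```python
import numpy as np

def user_as_data_point(item_votes, user_vote):
    """Strategy 1: the user simply joins the items as one more vote (Figure 2)."""
    return np.concatenate([item_votes, user_vote[None, :]], axis=0)

def user_as_activation_regularizer(v_j, m_hat_u):
    """Strategy 2, Eq. (1): re-scale the activity vector by its cosine
    similarity with the user's vote, keeping its direction (Figure 3)."""
    cos = v_j @ m_hat_u / (np.linalg.norm(v_j) * np.linalg.norm(m_hat_u) + 1e-9)
    return v_j * cos

def user_as_routing_bias(c_j, m_hat_j, m_hat_u):
    """Strategy 3, Eqs. (2)-(3): replace the weighted sum s_j by the weighted
    standard deviation of the items' votes plus the user's vote (Figure 4).

    c_j: (num_in,) coupling coefficients c_ij for a fixed capsule j.
    m_hat_j: (num_in, D) item vote vectors for capsule j."""
    mu = (c_j[:, None] * m_hat_j).sum(0) / c_j.sum()               # Eq. (2)
    var = (c_j[:, None] * (m_hat_j - mu) ** 2).sum(0) / c_j.sum()
    return np.sqrt(var) + m_hat_u                                  # Eq. (3)
```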


IV. MULTI-LEVEL COUPLING NETWORK
In this section, we will first introduce the general framework of our multi-level coupling network (MCN), followed by the detailed realization of the feature-, entity- and preference-level couplings.

A. GENERAL FRAMEWORK
Before illustrating our model's architecture, we begin with the basic notation. Suppose the system involves a set of users U = {u_1, u_2, ..., u_M} and a set of items I = {i_1, i_2, ..., i_N}, where M and N denote the numbers of users and items respectively. Each user u has interacted with a sequence of items H_u = {i^u_1, i^u_2, ..., i^u_{|H_u|}}. We use H_{u,t} = {i^u_{t-1}, i^u_{t-2}, ..., i^u_{t-L}} to denote the previous L items before u interacts with t. Let p_u ∈ R^D and q_i ∈ R^D be the embeddings of user u and item i, taken from P ∈ R^{M×D} and Q ∈ R^{N×D}, where D denotes the dimension of each embedding vector.

Figure 5 shows the overall architecture of MCN. The basic idea of MCN is that the user's dynamic interest should be modeled from hierarchical couplings. Specifically, given a user u and a target item t ∈ H_u, we can obtain the corresponding latest L interacted items, denoted as i ∈ H_{u,t}. We first retrieve their embedding vectors as the inputs and capture the couplings from the interactions within and between them via the following process: 1) we model the couplings between features within the same embedding vector with an outer product operation; 2) then we capture user-item and item-item couplings from the interactions between their feature vectors with capsules; 3) finally, we extract the user's dynamic preference from the couplings between multi-term preferences with a capsule network.

B. FEATURE COUPLING WITH OUTER PRODUCT
The idea of modeling feature couplings with the outer product is inspired by both similarity-based and neural network-based non-IID methods. Specifically, similarity-based models expand the feature representation as they assume the similarity of two features depends on their relationships with other features [4], [5], [7]. Neural network-based methods adopt a CNN or MLP to extract higher-level couplings from the feature interaction map generated by the outer product between user and item embeddings [10], [11]. As a fusion of these two ideas, we apply the outer product to each embedding vector itself as follows:

\[
C^u = p_u \otimes p_u, \qquad C^i = q_i \otimes q_i \tag{4}
\]

where ⊗ denotes the outer product operation, and C^u ∈ R^{D×D} and C^i ∈ R^{D×D} are the output feature coupling representations of user u and item i (i ∈ H_{u,t}) respectively. By doing this, each feature is represented by a vector, including its own extent value and its relationships with the other features. Each user and item is then modeled by a feature interaction matrix, which is further used as the source for extracting higher-level couplings in the following layers. In addition, we inherit the advantages and avoid the limitations of the two original solutions. On the one hand, manually designed similarities might not be flexible for various datasets and can only be applied with explicit attribute data. On the other hand, the deep learning methods directly use pairwise feature interactions to model the couplings between entities (users and items), which may lose some nuanced feature-level coupling relationships.

C. ENTITY COUPLING WITH CAPSULE NETWORK
After obtaining the feature interaction maps of user u and her interacted items, we are able to model the higher entity-level couplings, i.e., the user-item and item-item couplings. In MCN, we adopt the capsule network for entity-level couplings. Unlike other applications of capsules [44], we only follow the design philosophy of capsules, instead of adopting the original architecture proposed in [39], [46]. This is because the original version is designed for tasks with dense inputs such as images or texts, and consists of several consecutive capsule layers that extract increasingly abstract representations. However, in our recommendation task, the inputs are sparse user/item ids. Moreover, the observable feedback covers only a very small fraction of the full rating matrix, which is known as the issue of data sparsity. Consequently, it is difficult to learn the huge number of parameters of the original capsule network.

Formally, for item-item couplings, we feed the same-dimension row vectors of the L interaction maps into the same capsule. Thus the height of our capsule network is D. The output of each capsule is a D-dimensional vector, representing the agreements between the L interacted items at this dimension. Besides, to enable couplings for different agreements, we pose K capsules for each feature dimension; thus, the depth of our capsule network is K. Figure 6 presents a toy example with L = 4, D = 5, K = 3. The formal output of the item-item coupling at dimension d with the k-th capsule is computed as follows:

\[
O_d^k = \mathrm{Routing}\big(C_d^{t-1}, C_d^{t-2}, \ldots, C_d^{t-L}\big) \tag{5}
\]

where C_d^{t-l} denotes row d of item i^u_{t-l}'s coupling matrix C^{t-l}, and the Routing procedure is summarized in Algorithm 1. All the item-item coupling outputs at depth k form a matrix O^k ∈ R^{D×D}, and all K matrices at different depths form a D × D × K cube O, representing the short-term preference of user u.

For user-item coupling, the computation process is similar. The main difference is the additional input of the user vector C^u_d under one of the three routing strategies with different influential levels of user information, which are summarized in Section III-C. Formally, the output of the user-item coupling at dimension d with the k-th capsule is computed as follows:

\[
O_d^{\prime k} = \mathrm{Routing}\big(C_d^{t-1}, C_d^{t-2}, \ldots, C_d^{t-L}, C_d^{u}\big) \tag{6}
\]

Similar to O, O′ also denotes a D × D × K cube. Besides, with the help of user information, we are able to highlight the overlapping parts of the user's general interest and her latest tastes with different influential degrees of user information. The generated cube O′ represents the long-short term preference of the user.
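Putting Eqs. (4)-(6) together, a minimal sketch of the feature- and item-item coupling computation might look as follows. This reuses the `dynamic_routing` helper from the sketch above; the random matrices stand in for learned parameters, and the toy sizes follow Figure 6.

```python
import numpy as np

D, L, K = 5, 4, 3   # toy sizes as in Figure 6

p_u = np.random.randn(D)               # user embedding
q_items = np.random.randn(L, D)        # embeddings of the L latest interacted items

# Eq. (4): feature-level coupling maps; row d of each map represents
# feature d by its own extent and its relations to all other features.
C_u = np.outer(p_u, p_u)                                # (D, D)
C_items = np.stack([np.outer(q, q) for q in q_items])   # (L, D, D)

# Eq. (5): the d-th rows of all L item coupling maps are routed together
# by K capsules per dimension, yielding the D x D x K short-term cube O.
O = np.empty((K, D, D))
for d in range(D):
    rows_d = C_items[:, d, :]            # (L, D) inputs at dimension d
    W_d = np.random.randn(L, K, D, D)    # learned W_ij (random placeholders here)
    O[:, d, :] = dynamic_routing(rows_d, W_d)

# Eq. (6) differs only by appending the user's row C_u[d] to rows_d,
# under one of the three routing strategies of Section III-C.
```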


FIGURE 5. Illustration of the proposed MCN model. The left subfigure shows the overall architecture of MCN, and the right subfigure presents a simple example of the capsule network with four input embeddings and three output capsules.

FIGURE 6. A toy example of coupling learning with capsule network. It includes L D × D interaction maps as inputs for a D × K capsule network, which outputs a D × D × K cube.

D. PREFERENCE COUPLING WITH CAPSULE NETWORK
After obtaining the user's long-, short- and long-short term preferences, we are able to model the user's dynamic preference as the agreements between them. For preference coupling, we adopt the same capsule network structure as for the entity-level coupling:

\[
E_d^k = \mathrm{Routing}\big(O_d^1, \ldots, O_d^K, O_d^{\prime 1}, \ldots, O_d^{\prime K}, C_d^{u}\big) \tag{7}
\]

All the routing outputs E^k_d at different feature dimensions d and depths k also form a D × D × K cube. We flatten the cube into a D-dimensional vector by adding the K matrices together and then summing the D row vectors:

\[
e_u = \sum_{k=1}^{K} \sum_{d=1}^{D} E_d^k \tag{8}
\]

Hence, the final prediction for user u on target item t can be estimated by the inner product of the user's dynamic preference and the item's embedding:

\[
\hat{r}_{ut} = e_u^{\top} q_t \tag{9}
\]

The whole model can be jointly trained by minimizing the following pairwise cost function:

\[
\arg\min_{\Theta} \sum_{(u,t,t')} -\log\big(\mathrm{logistic}(e_u^{\top} q_t - e_u^{\top} q_{t'})\big) + \lambda\|\Theta\|_F^2 \tag{10}
\]

where t′ denotes the sampled negative item, λ is the coefficient of the L2-norm regularizer, and Θ denotes the set of model parameters.
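As a sketch of Eqs. (8)-(10), assuming the preference-coupling cube from Eq. (7) is stored as a (K, D, D) array (function and variable names are ours):

```python
import numpy as np

def mcn_prediction_and_loss(E, q_t, q_neg, lam=0.001, theta_sq_norm=0.0):
    # Eq. (8): sum the K depth matrices and the D row vectors into e_u.
    e_u = E.sum(axis=(0, 1))                              # (D,)
    # Eq. (9): score a target item by the inner product with e_u.
    r_pos, r_neg = e_u @ q_t, e_u @ q_neg
    # Eq. (10): pairwise logistic (BPR-style) loss with L2 regularization,
    # where theta_sq_norm stands for ||Theta||_F^2 of all model parameters.
    logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(logistic(r_pos - r_neg) + 1e-9) + lam * theta_sq_norm
    return r_pos, loss
```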


V. EXPERIMENTS AND ANALYSIS
In this section, we evaluate the proposed model on a wide spectrum of real-world datasets to answer the following research questions:
RQ1: Does the proposed multi-level coupling network achieve state-of-the-art performance?
RQ2: How do MCN's key hyper-parameters influence its performance?
RQ3: Is each design of the proposed model effective?

TABLE 1. Basic statistics of datasets.

A. EXPERIMENTAL SETTINGS
1) DATASETS
We evaluate our model in two different domains to demonstrate its generality. Specifically, Movielens¹ is a dataset containing users' watching behaviors on different movies. Amazon [47] is an e-commerce dataset, from which we choose two categories, Electronics and Kindle, to cover different data characteristics. We transformed the explicit ratings into implicit data, where each entry is marked as 0 or 1 indicating whether the user has rated the item. To collect sufficient user preferences, we select users with at least 10 records for the experiments, as in [9], [29]. The statistics of the processed datasets are shown in Table 1.

2) EVALUATION PROCEDURE
To evaluate the recommendation performance, each of our datasets is split into training/validation/testing sets. For each user, we reserve the last two interactions for the validation and testing sets, while the remaining interactions are used for training. To reduce the expensive cost of ranking all items for each user during evaluation, we follow the common strategy [9], [11], [48] of randomly sampling 99 items that have not been interacted with by the user, and ranking the target item among these 100 items. We adopt three well-known metrics to evaluate the ranking quality, namely Precision, Recall, and Normalized Discounted Cumulative Gain (NDCG). We omit their definitions due to space limitations. Generally, higher metric values indicate better ranking performance.

3) COMPARISON METHODS
As our work synthesizes non-IID recommendation and sequential recommendation, we adopt two groups of baselines, one for each track of research.
• Non-IID Recommendation: NCF [9] is the combination of an MF model and a multi-layer perceptron (MLP), which capture the linear and non-linear user-item couplings respectively. CoupledCF [11] adopts a CNN and an MLP to learn the explicit and implicit user-item couplings respectively. As we do not consider other side information for recommendation, we take the outer product of the user and item embedding vectors as the input of its CNN module.
• Sequential Recommendation: FPMC [14] is a traditional method for sequential recommendation. It is a combination of matrix factorization and a Markov chain, which learn users' long- and short-term preferences respectively. GRU4Rec [20] is a deep learning based recommendation method that adopts an RNN to capture sequential dependencies and make predictions. Caser [32] utilizes a CNN to explore users' short-term preferences by capturing local sequential patterns of adjacent items, which are then fused with users' general interests for prediction. RUM [29] adopts a memory network to store external memories for sequential recommendation; we take the item-level RUM for comparison.

B. COMPLEXITY ANALYSIS AND IMPLEMENTATION
For each user, the computation cost of MCN at the feature coupling layer is O((L+1)D²), simplified as O(LD²). Next, the computation cost of a dynamic routing procedure with L D-dimensional input embeddings, K output capsules and T iterations is O(LKD² + T(LK + K(LD + D) + LKD)), simplified as O(LKD(D + T)). For item-item coupling, user-item coupling, and preference coupling, the numbers of input embeddings are L, L+1, and 2K+1 respectively, each with D capsule networks. Thus, the computation cost of entity coupling and preference coupling is O(KD²(L+K)(T+D)). Note that the hyper-parameters K, L and T are much smaller than D, and each user's inferred dynamic preference embedding is the same across different target items. Therefore, the computation cost of matching M users with N items is about O(MK(L+K)D³ + MND), reformulated as O(MD(K(L+K)D² + N)). In practice, K(L+K)D² and the item number N are of about the same order of magnitude, indicating that the time complexity of MCN is approximately O(MND), the same order of magnitude as traditional matrix factorization methods like BPR [15]. On the other hand, existing non-IID models like NCF [9] and CoupledCF [11] feed each user-item pair into a deep neural network to compute the matching score, so their overall computation cost at inference is about O(MNDγ), where γ denotes the depth of the network.

Empirically, BPR, MCN, NCF, and CoupledCF cost around 120s, 260s, 590s, and 670s respectively for matching all user-item pairs on the Amazon-Kindle dataset. As we can see, MCN achieves computation complexity comparable to the traditional matrix factorization method BPR, and is much more efficient than existing non-IID approaches (NCF and CoupledCF). This demonstrates that MCN is very scalable for real applications. We implement MCN with TensorFlow on a single NVIDIA GeForce GTX 1080 Ti GPU without any pre-training. In practice, MCN can further benefit from multi-GPU servers with a pure C++ implementation and CUDA acceleration.

Parameter Settings: All the models are trained with TensorFlow² using the Adam optimizer [49] with a batch size of 256. The optimal experimental settings for the comparison methods are determined either by our experiments or as suggested by previous works. Specifically, the learning rate and regularization coefficient are commonly set to 0.001. The latent dimension D is set to 50, and the length of the item sequence L is empirically set to 5. For NCF and CoupledCF, we employ three hidden layers for their MLP modules with tower structures of {64, 32, 16} for Movielens and {32, 16, 8} for the other datasets. The filter shape of CoupledCF is set to (8, 8) with 8 channels for all datasets. For Caser, the number of horizontal filters is set to 16 for each height, and the number of vertical filters is set to 4. For GRU4Rec, the hidden layer size is set to 100. For MCN, the capsule number K for each feature dimension and the iteration number T of dynamic routing are both set to 3; the influence of these two parameters is analyzed in the following experiments.

¹grouplens.org/datasets/movielens/1m/  ²www.tensorflow.org/
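For convenience, the MCN settings reported above can be collected into a single configuration sketch (the key names are illustrative, not from a released implementation):

```python
# MCN hyper-parameters as reported in this section.
MCN_CONFIG = {
    "embedding_dim_D": 50,      # latent dimension of user/item embeddings
    "sequence_length_L": 5,     # number of latest interacted items
    "capsules_per_dim_K": 3,    # capsule depth per feature dimension
    "routing_iterations_T": 3,  # dynamic routing iterations
    "learning_rate": 0.001,
    "l2_coefficient": 0.001,    # regularization coefficient lambda
    "batch_size": 256,
    "optimizer": "Adam",
}
```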


TABLE 2. Performance comparison of MCN with other models; the best performance among all baseline methods is boldfaced. All the improvements of MCN over the best baseline are significant at the level of 0.01.

FIGURE 7. Parameter analysis of MCN in terms of routing iteration number T and capsule number K.

C. MODEL COMPARISON
In this part, we answer RQ1 by comparing the performance of MCN with six state-of-the-art baselines on three benchmark datasets. For clarity, we report results with the first routing procedure (user as data point) for the overall comparison; a detailed analysis of the different routing strategies is given in the following tests. Table 2 shows the overall performance of all models. It is noted that the two non-IID baselines (NCF and CoupledCF) perform worse than the sequential recommendation models, shedding light on the importance of modeling users' dynamic preferences in real-world applications. In addition, NCF is consistently inferior to CoupledCF, indicating that the outer product operation over user/item embedding vectors is effective for capturing feature-level couplings. All the deep learning based models (RUM, GRU4Rec, Caser) consistently outperform the traditional method (FPMC), indicating that neural networks can better capture the coupling relationships than traditional structures with oversimplified IID assumptions. Even so, the performance of the deep neural models varies across datasets. Specifically, GRU4Rec outperforms RUM and Caser on the Electronics dataset, while performing worse on the Movielens and Kindle datasets. This may be explained by the fact that the Electronics dataset contains many more short-term dependencies than the other datasets, and GRU4Rec is good at modeling users' short-term preferences. On the other hand, users' preferences towards movies and books are relatively stable and thus require fine-grained coupling modeling between users' long- and short-term preferences; however, GRU4Rec lacks the ability to distinguish users' multi-term preferences explicitly, as RUM and Caser do. Although RUM and Caser capture users' long- and short-term interests separately, their straightforward fusion strategies cannot capture fine-grained couplings effectively. Furthermore, they are incapable of modeling entity-level couplings explicitly.

After addressing the limitations of existing works, our MCN model outperforms all the counterparts consistently across all three datasets over the three evaluation metrics. The maximal relative improvement is up to 10% on the Electronics dataset, and often stays around 5% on the other datasets. This demonstrates the effectiveness of our practice of extracting users' dynamic preferences from the complex couplings between three aspects (users, interacted items and target items) at the feature, entity and preference levels. The superiority of MCN also verifies that the capsule network is effective for modeling sophisticated coupling relationships.

TABLE 3. The relative performance decline of MCN variants w.r.t. the original version. The worst performance is boldfaced.

FIGURE 8. The training loss and test accuracy comparison of MCN, MCN_matrix, and MCN_vector. We conduct early stopping if the test accuracy does not increase in 100 continuous iterations.

D. PARAMETER ANALYSIS
In this subsection, we aim to answer RQ2 by analyzing the effect of the routing iteration number T and the capsule number K of MCN. The results are presented in Figure 7.

Recall that routing is used to highlight the salient common features and eliminate the accidental noise in the inputs. Therefore, a higher iteration number T will focus on a more specific fraction of the input and discard the other information as noise. As shown in Figure 7(a), the performance is best when T = 3, indicating that it is helpful to exclude noise with routing, but discarding too much information hinders the extraction of accurate features.

On the other hand, the capsule number K controls the complexity of the model by projecting the input vectors into K different capsules for routing. As shown in Figure 7(b), the accuracy stops growing when K is greater than 3 on Movielens and greater than 2 on the other datasets. This observation indicates that it is redundant to conduct routing with too many capsules. In addition, a denser dataset is able to reach higher performance by training with a larger K.

E. MODEL ABLATION AND DISCUSSION
In this part, we aim to answer RQ3 by diving into an in-depth model analysis. Specifically, we need to demonstrate the effectiveness of capturing coupling relationships with the hierarchical structure, capturing feature-level couplings with the outer product operation, building hierarchical couplings with the capsule network, building the user's dynamic interest from three terms of preference, and modeling user-item couplings with different routing strategies.

1) IMPACT OF HIERARCHICAL STRUCTURE
In the previous analysis, we assumed that the coupling relationships exist at different levels in the non-IID sequential recommendation setting, and we implemented the model with the corresponding hierarchical architecture. To verify our assumption and the structure, we reduce the entity- and preference-level coupling layers into a single-layer capsule network and test its performance. The results are denoted as no_hierarchy in Table 3. We observe that the hierarchical architecture indeed improves the performance consistently.

2) IMPACT OF OUTER PRODUCT OPERATION FOR FEATURE LEVEL COUPLING
Although there are some works on modeling user-item interactions with the outer product [10], [11], there is little study on learning feature interactions within the same embedding vector. As illustrated before, we aim to represent each feature with its own extent and its relationships with other features together. To investigate whether this operation is helpful, the experiment is two-fold. On the one hand, we build the MCN_matrix variant, which abandons the outer product operation and directly represents each user/item with an embedding matrix. On the other hand, we build MCN_vector, which feeds the user/item embedding vector into the capsules directly; that is, the height of the capsule network is reduced to 1. As shown in Table 3, both variants deteriorate significantly. From Figure 8, we observe that MCN_matrix reaches a much lower training error than the original MCN model, but its test accuracy does not increase correspondingly. In other words, MCN_matrix suffers from over-fitting as it incorporates too many redundant parameters. On the contrary, MCN_vector reaches a higher loss and lower accuracy compared with the raw version, i.e., it suffers from under-fitting, since its capacity for coupling learning is limited by the simplified structure.

FIGURE 9. The visualization of MCN_Atten and MCN with different routing iterations. We project the learned feature vectors into two dimensions with PCA on the Movielens dataset. We pick a random user with her latest interacted items as the input of MCN and MCN_Atten, and visualize the output of one capsule of the user-item coupling network, denoted as the red point. Then we feed the user's next interacted item into the same capsule and get its prediction vector, denoted as the blue point. Finally, we sample 1000 negative items and feed them into the same capsule; their prediction vectors are denoted as the gray points.

Additionally, the results show that MCN_matrix outperforms MCN_vector on Movielens, while the opposite holds on Electronics. This can be explained by the fact that the over-fitting of MCN_matrix becomes worse on sparser datasets.

3) IMPACT OF CAPSULE NETWORK
As illustrated before, we adopt the capsule network to capture entity/preference-level couplings as its basic idea perfectly suits our requirements. To justify our analysis, we conduct the following experiments. To begin with, we build the MCN_MLP variant, which replaces each capsule network in the original architecture with an MLP; namely, MCN_MLP learns all the entity- and preference-level couplings implicitly with the MLPs. Secondly, we build the MCN_Atten variant, which replaces each capsule with a self-attention [50] network (i.e., multi-head attention) that arranges the weights of the feature vectors explicitly. From Table 3, we find that MCN_Atten beats MCN_MLP and most of the baselines in Table 2, indicating that it is helpful to learn the couplings explicitly.

However, MCN_Atten cannot reach the performance of the original MCN, which gives evidence of the superiority of the capsule network. This might be explained by the fact that self-attention is good at learning the relationships between different features, but lacks the ability to build higher-level representations as capsules do. In addition, self-attention cannot enhance the salient common information explicitly. As shown in Figure 9(a) and (b), the predictions of MCN_Atten and of MCN with one routing iteration are both inaccurate. However, Figure 9(b), (c) and (d) show that the feature vectors of the input and the target item get closer through the routing iterations. Moreover, in a self-attention network, all the queries share the same transformation matrix and thus lose the spatial information about the inputs. The original paper solves this issue by incorporating a position embedding for each input, whereas the capsule network poses a different parameter matrix for each feature dimension and input position, which gives more flexibility for learning hierarchical couplings and avoids the need for position embeddings.

TABLE 4. The relative performance decline of MCN variants for preference level coupling learning.

4) IMPACT OF MULTI-TERM PREFERENCES
In the preference coupling layer of MCN, we consider three terms of preference, namely the long-, short- and long-short term preferences. To investigate the influence of each part, we conduct an ablation test by removing each preference and measuring the performance decline. The results are shown in Table 4, where −l, −s, and −ls denote deleting the long-term, short-term, and long-short-term preference respectively. We can observe that all the variants are inferior to the original model, indicating that it is necessary to incorporate all three terms of preference for preference coupling. Besides, −l performs the worst among all the variants in most settings, indicating that the user's general taste is crucial for modeling the dynamic preference. It also suggests that the influence of the long-term preference might vanish during user-item coupling. Lastly, −s causes the smallest accuracy decline. This can be explained by the fact that the long-short term preference captured by the user-item coupling is capable of compensating for the user's short-term interest to some extent.

5) IMPACT OF ROUTING STRATEGIES
Recall that to guide the routing procedure of the capsule network with user information, we devise three strategies with incremental user influential levels: 1) user as data point; 2) user as activation regularizer; and 3) user as vote bias.

TABLE 5. The performance of MCN variants with different routing strategies.

We denote these approaches as point, activ and bias respectively. The results are presented in Table 5. Specifically, the accuracy increases with the promotion of user influential levels on the Movielens dataset, while decreasing on the Electronics and Kindle datasets, indicating that user influence is more reliable on denser datasets. This pattern can also be observed in Table 4, where the −l variant shows the highest performance decline on the Movielens dataset.

VI. CONCLUSION
In this paper, we proposed a novel multi-level coupling network (MCN) for non-IID sequential recommendation. It considers comprehensive couplings within and between multiple aspects at different levels: 1) feature-level couplings within each user and her interacted items; 2) user-item and item-item entity-level couplings; and 3) long-, short- and long-short term preference couplings. We implemented the framework on the basis of a capsule network, as its basic ideas are seamlessly suitable for sophisticated coupling learning, and we provided a detailed analysis of the advantages of adopting the capsule network. We conducted extensive experiments on three real datasets. The results showed that our approach significantly outperforms six state-of-the-art models. We also conducted comprehensive ablation experiments to verify the effectiveness of each design of the model.

REFERENCES
[1] L. Cao, "Non-IID recommender systems: A review and framework of recommendation paradigm shifting," Engineering, vol. 2, no. 2, pp. 212–224, 2016.
[2] L. Cao, Y. Ou, and P. S. Yu, "Coupled behavior analysis with applications," IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1378–1392, Aug. 2012.
[3] L. Cao, "Coupling learning of complex interactions," J. Inf. Process. Manage., vol. 51, no. 2, pp. 167–186, 2015.
[4] F. Li, G. Xu, and L. Cao, "Coupled matrix factorization within non-IID context," in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining (PAKDD). New York, NY, USA: Springer, 2015, pp. 707–719.
[5] C. Wang, L. Cao, M. Wang, J. Li, W. Wei, and Y. Ou, "Coupled nominal similarity in unsupervised learning," in Proc. 20th ACM Int. Conf. Inf. Knowl. Manage. (CIKM), 2011, pp. 973–978.
[6] T. D. T. Do and L. Cao, "Coupled Poisson factorization integrated with user/item metadata for modeling popular and sparse ratings in scalable recommendation," in Proc. AAAI Conf. Artif. Intell. (AAAI), 2018, pp. 1–7.
[7] C. Wang, Z. She, and L. Cao, "Coupled attribute analysis on numerical data," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2013.
[8] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications, vol. 20. Philadelphia, PA, USA: SIAM, 2007.
[9] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proc. 26th Int. Conf. World Wide Web (WWW), 2017, pp. 173–182.
[10] X. He, X. Du, X. Wang, F. Tian, J. Tang, and T.-S. Chua, "Outer product-based neural collaborative filtering," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2018.
[11] Q. Zhang, L. Cao, C. Zhu, Z. Li, and J. Sun, "CoupledCF: Learning explicit and implicit user-item couplings in recommendation for deep collaborative filtering," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2018, pp. 3662–3668.
[12] F. Pan, Q. Cai, P. Tang, F. Zhuang, and Q. He, "Policy gradients for contextual recommendations," in Proc. 28th Int. Conf. World Wide Web (WWW), 2019.
[13] L. Hu, S. Jian, L. Cao, Z. Gu, Q. Chen, and A. Amirbekyan, "HERS: Modeling influential contexts with heterogeneous relations for sparse and cold-start recommendation," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 3830–3837.
[14] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, "Factorizing personalized Markov chains for next-basket recommendation," in Proc. Int. Conf. World Wide Web (WWW), 2010, pp. 811–820.
[15] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, "BPR: Bayesian personalized ranking from implicit feedback," in Proc. 25th Conf. Uncertainty Artif. Intell. (UAI), 2009, pp. 452–461.
[16] R. He and J. McAuley, "Fusing similarity models with Markov chains for sparse sequential recommendation," in Proc. IEEE Int. Conf. Data Mining (ICDM), Dec. 2016, pp. 191–200.
[17] S. Kabbur, X. Ning, and G. Karypis, "FISM: Factored item similarity models for top-N recommender systems," in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2013, pp. 659–667.
[18] P. Wang, J. Guo, Y. Lan, J. Xu, S. Wan, and X. Cheng, "Learning hierarchical representation model for next basket recommendation," in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2015, pp. 403–412.
[19] F. Yu, Q. Liu, S. Wu, L. Wang, and T. Tan, "A dynamic recurrent model for next basket recommendation," in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2016, pp. 729–732.
[20] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, "Session-based recommendations with recurrent neural networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016.
[21] B. Hidasi and A. Karatzoglou, "Recurrent neural networks with top-k gains for session-based recommendations," 2018, arXiv:1706.03847. [Online]. Available: https://arxiv.org/abs/1706.03847
[22] M. Quadrana, A. Karatzoglou, B. Hidasi, and P. Cremonesi, "Personalizing session-based recommendations with hierarchical recurrent neural networks," in Proc. 11th ACM Conf. Recommender Syst. (RecSys), 2017, pp. 130–137.
[23] W. Pei, J. Yang, Z. Sun, J. Zhang, A. Bozzon, and D. M. J. Tax, "Interacting attention-gated recurrent networks for recommendation," in Proc. Int. Conf. Inf. Knowl. Manage. (CIKM), 2017, pp. 1459–1468.
[24] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma, "Neural attentive session-based recommendation," in Proc. Int. Conf. Inf. Knowl. Manage. (CIKM), 2017.
[25] Q. Cui, S. Wu, Y. Huang, and L. Wang, "A hierarchical contextual attention-based GRU network for sequential recommendation," 2017, arXiv:1711.05114. [Online]. Available: https://arxiv.org/abs/1711.05114
[26] P. Sun, L. Wu, and M. Wang, "Attentive recurrent social recommendation," in Proc. 41st Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2018, pp. 185–194.
[27] Z. Sun, J. Yang, J. Zhang, A. Bozzon, L.-K. Huang, and C. Xu, "Recurrent knowledge graph embedding for effective recommendation," in Proc. 12th ACM Conf. Recommender Syst. (RecSys), 2018, pp. 297–305.
[28] R. Li, Y. Shen, and Y. Zhu, "Next point-of-interest recommendation with temporal and multi-level context attention," in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 1110–1115.
[29] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin, and H. Zha, "Sequential recommendation with user memory networks," in Proc. 11th ACM Int. Conf. Web Search Data Mining (WSDM), 2018, pp. 108–116.
[30] G. Guo, S. Ouyang, X. He, F. Yuan, and X. Liu, "Dynamic item block and prediction enhancing block for sequential recommendation," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2019.

[31] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, and E. Y. Chang, "Improving sequential recommendation with knowledge-enhanced memory networks," in Proc. 41st Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2018.
[32] J. Tang and K. Wang, "Personalized top-N sequential recommendation via convolutional sequence embedding," in Proc. 11th ACM Int. Conf. Web Search Data Mining (WSDM), 2018, pp. 565–573.
[33] T. X. Tuan and T. M. Phuong, "3D convolutional networks for session-based recommendation with content features," in Proc. 11th ACM Conf. Recommender Syst. (RecSys), 2017, pp. 138–146.
[34] F. Yuan, A. Karatzoglou, I. Arapakis, J. M. Jose, and X. He, "A simple but hard-to-beat baseline for session-based recommendations," 2018, arXiv:1808.05163. [Online]. Available: https://arxiv.org/abs/1808.05163
[35] J. Tang, F. Belletti, S. Jain, M. Chen, A. Beutel, C. Xu, and E. H. Chi, "Towards neural mixture recommender for long range dependent user sequences," in Proc. 28th Int. Conf. World Wide Web (WWW), 2019, pp. 1782–1793.
[36] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, "BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer," in Proc. Int. Conf. Inf. Knowl. Manage. (CIKM), 2019.
[37] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, and M. Rohrbach, "Multimodal compact bilinear pooling for visual question answering and visual grounding," in Proc. Empirical Methods Natural Lang. Process. (EMNLP), 2016.
[38] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, "MUTAN: Multimodal Tucker fusion for visual question answering," in Proc. Empirical Methods Natural Lang. Process. (EMNLP), 2017.
[39] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017.
[40] D. Wang and Q. Liu, "An optimization view on dynamic routing between capsules," in Proc. ICLR Workshop, 2018.
[41] P. Afshar, A. Mohammadi, and K. N. Plataniotis, "Brain tumor type classification via capsule networks," 2018, arXiv:1802.10200. [Online]. Available: https://arxiv.org/abs/1802.10200
[42] R. LaLonde and U. Bagci, "Capsules for object segmentation," 2018, arXiv:1804.04241. [Online]. Available: https://arxiv.org/abs/1804.04241
[43] M. T. Bahadori, "Spectral capsule networks," in Proc. ICLR Workshop, 2018.
[44] M. Yang, W. Zhao, J. Ye, Z. Lei, Z. Zhao, and S. Zhang, "Investigating capsule networks with dynamic routing for text classification," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2018, pp. 3110–3119.
[45] Y. Wang, A. Sun, J. Han, Y. Liu, and X. Zhu, "Sentiment analysis by capsules," in Proc. Int. Conf. World Wide Web (WWW), 2018.
[46] G. E. Hinton, S. Sabour, and N. Frosst, "Matrix capsules with EM routing," in Proc. Int. Conf. Learn. Represent. (ICLR), 2018.
[47] J. McAuley and J. Leskovec, "Hidden factors and hidden topics: Understanding rating dimensions with review text," in Proc. 7th ACM Conf. Recommender Syst. (RecSys), 2013, pp. 165–172.
[48] Y. Koren, "Factorization meets the neighborhood: A multifaceted collaborative filtering model," in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2008, pp. 426–434.
[49] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–15.
[50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 5998–6008.

YATONG SUN is currently pursuing the M.S. degree in software engineering with Northeastern University, China. Since September 2018, he has been an Algorithm Engineer with JD AI Research. His research interests include machine learning, data mining, and recommender systems. Specifically, he is focusing on incorporating knowledge-based information into sequential recommendation tasks.

GUIBING GUO received the Ph.D. degree in computer science from Nanyang Technological University, Singapore, in 2015. He is currently a Professor with the Software College, Northeastern University, China. His research interests include recommender systems, deep learning, natural language processing, and data mining.

XIAODONG HE is currently the Deputy Managing Director of JD AI Research, the Head of the Deep Learning, NLP and Speech Lab, and the Technical Vice President of JD.com. He is also an Affiliate Professor with the University of Washington, Seattle, and serves on Ph.D. supervisory committees. Before joining JD.com, he was with Microsoft for about 15 years, where he served as a Principal Researcher and the Research Manager of the DLTC with Microsoft Research, Redmond, WA, USA.

XIAOHUA LIU is currently a Researcher with AINemo. He is also a Postdoctoral Research Fellow with the University of Montreal. Prior to that, he was a Project Lead and a Researcher with Microsoft Research Asia, Natural Language Computing Group, and JD AI Research. His research interests include big data processing, social content mining, NLP for ranking, and machine learning.
