Multi-Level Coupling Network For Non-IID Sequential Recommendation
ABSTRACT Sequential recommendation has recently attracted a lot of attention for suggesting the next items users may interact with. However, most traditional studies implicitly assume that users and items are independent and identically distributed (IID) and ignore the couplings within and between users and items. Although deep learning techniques allow a model to learn coupling relationships, they only capture a partial picture of the underlying non-IID problem. We argue that it is essential to explicitly distinguish the interactions between multiple aspects at different levels when modeling users' dynamic preferences. On the other hand, existing non-IID recommendation methods are not well designed for the sequential recommendation task, since users' interaction sequences cause more complicated couplings within the system. Hence, to systematically exploit the non-IID theory for coupling learning in sequential recommendation, we propose a non-IID sequential recommendation framework which extracts users' dynamic preferences from the complex couplings between three aspects, namely users, interacted items and target items, at the feature level, entity level and preference level. Furthermore, to capture the couplings effectively, we implement the framework on the basis of a capsule network. We show that the basic ideas of the capsule seamlessly suit our purpose of modeling sophisticated couplings. Finally, we conduct extensive experiments to demonstrate the effectiveness of our approach. The results on three public datasets show that our model consistently outperforms six state-of-the-art methods on three ranking metrics.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
VOLUME 7, 2019 186247
Y. Sun et al.: MCN for Non-IID Sequential Recommendation
FIGURE 1. Basic frameworks of non-IID recommendation methods, where (a, b) show the frameworks of most existing work and (c) is our proposed
generic non-IID sequential recommendation framework.
will be more complicated within and between more aspects at different levels, as shown in Figure 1(c).

B. SEQUENTIAL RECOMMENDATION
The development of sequential recommendation can be roughly summarized into two stages.

In the early stage, approaches utilized Markov chains (MC) to model the sequential data. For instance, FPMC [14] learns a transition cube where each slice is a user-specific transition matrix of an underlying MC based on users' purchase history. The model is trained by an adaptation of the Bayesian Personalized Ranking [15] framework so that it brings together the advantages of MC and matrix factorization (MF). Fossil [16] shares a similar idea with FPMC but replaces the MF part with FISM [17]. The major problem of FPMC and Fossil is that they assume each interacted item influences users' behavior linearly and independently, which makes them unable to capture the couplings between interacted items. HRM [18] partially addresses the problem by summarizing multiple interacting factors through a nonlinear max pooling operation. However, this aggregation operation fails to capture the combination patterns of locally adjacent items, i.e., the feature level couplings within and between items. Besides, these methods also have difficulty in learning long-short interactive preferences effectively.

Recently, many deep learning methods have been proposed to capture accurate sequential features, and they have achieved state-of-the-art recommendation accuracy in comparison with the traditional approaches. For instance, DREAM [19] and GRU4Rec [20], [21] embed all of a user's historical interactions into the final hidden state of an RNN to represent the current preference. Other RNN-based methods enhance the performance by integrating user personal information [22], attention mechanisms [23], [24] or other contextual information [25]–[28]. Memory networks [29], [30] have also been utilized to read and write a user's dynamic preference explicitly, under the assumption that the items/features that are similar to the target item should be given more attention. KSR [31] combines a knowledge graph with a memory network to provide interpretability for sequential recommendation. On the other hand, CNNs have also been recognized as a suitable structure to mine local sequential patterns of item sequences. Specifically, Caser [32] embeds a user's recent interactions into an 'image' which is then processed by convolution networks to form the user's short-term preference. 3D-CNN [33] integrates additional side information to enhance the performance of the CNN-based framework. However, CNNs are good at detecting features but less effective at exploring the spatial relationships among features due to the pooling operation. To solve this issue, NextItNet [34] adopts a stack of dilated convolutional layers to model long-range dependencies without pooling. M3R [35] employs a mixture of models, each with a different temporal range; these models are combined by a learned gating mechanism capable of exerting different model combinations given different contextual information. BERT4Rec [36] employs deep bidirectional self-attention to model user behavior sequences.

Although existing sequential recommendation methods have achieved promising results, they are not able to explicitly distinguish the complicated couplings within the system. Specifically, RNN-based methods update a user's dynamic preference by directly interacting with the representation of the next item, and thus lose the important hierarchical interactions. For CNN-based methods, the feature level coupling is partially captured by the convolution operation, but the learned coupling types are limited by the shape and number of filters. CNNs also lack the capacity of modeling entity- and preference-level couplings explicitly. Furthermore, recent studies show that straightforward transformations for fusing embeddings of different representations usually limit the effectiveness [37], [38]. Nevertheless, most existing approaches adopt oversimplified strategies, like concatenation, to combine users' general (long-term) and short-term preferences, which may fail to model the complex long-short interactive preference of users.

III. NON-IID SEQUENTIAL RECOMMENDATION FRAMEWORK WITH CAPSULE NETWORK
A. NON-IID SEQUENTIAL RECOMMENDATION FRAMEWORK
To solve the challenges of non-IID sequential recommendation, we borrow the basic ideas of both similarity-based methods and neural network-based methods, and construct our non-IID sequential recommendation framework as shown in Figure 1(c).

In particular, we capture the couplings explicitly with a hierarchical architecture, as similarity-based methods do. At first, we conduct feature interactions within each embedding vector to form the coupling representations of each user and her interacted items. Then we perform entity level user-item and item-item interactions to build multi-term preferences. Finally, each user's dynamic preference is composed of the coupling of her multi-term preferences. It is worth noting that we do not incorporate the target item for hierarchical coupling learning, and the reason is two-fold. In the first place, a user's dynamic preference should not be influenced by interactions in the future. Besides, the computation cost would be too high to extract a different dynamic preference embedding for every target item. At the same time, we implement the hierarchical framework on the basis of a capsule network to effectively learn the sophisticated couplings. The details of the architecture will be illustrated in the next section.

B. CAPSULE NETWORK
In this paper, we realize the non-IID sequential recommendation framework on the basis of a capsule network. In this section, we will first analyze why the capsule network fits our requirements of modeling sophisticated couplings. Then we will elaborate the technical details of its implementation.

Firstly, the main intuition of the capsule suits our need for managing expanded feature representations. The capsule was proposed to extend the expressive capacity of a neuron: it replaces each neuron with an activity vector which represents the instantiation parameters of a specific type of entity such as an object or an object part [39]. Similar to our requirements, we expand the representation of each feature for coupling learning [5], [7], as each feature is represented not only by its own extent, but also by its relation with other features. It is therefore reasonable to adopt capsules to arrange these expanded representations for higher level couplings.

Secondly, the dynamic routing mechanism of the capsule network is suitable for arranging the part-whole relationships of the representations of our multi-level coupling network, where the higher-level couplings are built on top of the lower-level couplings. Specifically, the transformation matrices between capsules at different levels learn to convert lower-level features to higher-level representations. On the other hand, the dynamic routing between capsules at the same level learns to highlight the salient common information and eliminate the noise. This process is crucial for sequential recommendation, since the common ground of different aspects usually reflects a user's current interest, whereas accidental feedback can confuse the model when learning the real preference.

Formally, the dynamic routing process is summarized in Algorithm 1.

Algorithm 1 Dynamic Routing Algorithm
Input: outputs of capsules at layer y, where the output embedding of capsule i is denoted as m_i
Output: the activity vectors of capsules at the subsequent layer y + 1, where capsule j's activity vector is denoted as v̂_j
 1  for each i ∈ y and j ∈ y + 1 do
 2      m̂_{j|i} ← W_{ij} m_i                                   // build vote vectors
 3      b_{ij} ← 0                                              // initialize weights
 4  for T iterations do
 5      for each i ∈ y and j ∈ y + 1 do
 6          c_{ij} ← exp(b_{ij}) / Σ_k exp(b_{ik})
 7      for each j ∈ y + 1 do
 8          s_j ← Σ_i c_{ij} m̂_{j|i}                           // weighted sum
 9          v̂_j ← (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)       // squashing
10      for each i ∈ y and j ∈ y + 1 do
11          b_{ij} ← b_{ij} + m̂_{j|i}ᵀ v̂_j                     // update
12  return all the v̂_j at layer y + 1

The input to a capsule j at layer (y + 1) is the set of outputs of the capsules at layer y. At line 2, the vector sent from capsule i to capsule j is first transformed by a matrix W_{ij} ∈ R^{D×D}. The generated vector m̂_{j|i} is termed the vote vector from capsule i to capsule j. b_{ij} is the log prior probability that capsule i should be coupled to capsule j, which is initialized to 0. From line 5 to line 6, for each capsule i at layer y, we adopt softmax to calculate the probability that capsule i will be coupled with each capsule j at layer y + 1, denoted as c_{ij}. Next, at line 8, for each capsule j at layer y + 1, we calculate the weighted sum of its received vote vectors, denoted as s_j. Then at line 9, we squash s_j into the vector v̂_j, the activity vector of capsule j; ‖·‖ denotes the L2 norm of a vector. In this way, the L2 norm of v̂_j is squashed into (0, 1), representing the activation degree of capsule j, while the direction of v̂_j indicates the instantiation parameters of capsule j. Next, at lines 10 and 11, all the log prior probabilities b_{ij} are updated w.r.t. the dot similarity between m̂_{j|i} and v̂_j. Intuitively, the agreements between vote vectors are highlighted while the anomalous information is reduced during the routing iterations. Figure 5(b) presents an
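As a companion to the walk-through of Algorithm 1 above, here is a minimal NumPy sketch of the dynamic routing procedure. It is an illustration only, not the authors' implementation: the shapes (L input capsules, J output capsules, dimension D) and the small epsilon in the squashing step are our own assumptions.

```python
import numpy as np

def squash(s):
    """Squash s so its L2 norm lies in (0, 1) while keeping its direction (line 9)."""
    norm_sq = np.sum(s * s)
    return (norm_sq / (1.0 + norm_sq)) * (s / (np.sqrt(norm_sq) + 1e-9))

def dynamic_routing(m, W, T=3):
    """Route L input capsule outputs m (L, D) to J output capsules.

    W has shape (L, J, D, D): one transformation matrix per (input, output) pair.
    Returns the activity vectors, shape (J, D).
    """
    L, J = W.shape[0], W.shape[1]
    # Line 2: build vote vectors m_hat[j|i] = W_ij @ m_i
    m_hat = np.einsum('ijab,ib->ija', W, m)                    # (L, J, D)
    b = np.zeros((L, J))                                       # line 3: log priors
    for _ in range(T):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # line 6: softmax over j
        s = np.einsum('ij,ija->ja', c, m_hat)                  # line 8: weighted sum
        v = np.stack([squash(s[j]) for j in range(J)])         # line 9: squashing
        b = b + np.einsum('ija,ja->ij', m_hat, v)              # line 11: agreement update
    return v
```

Note that, as in the algorithm, the softmax is taken over the output capsules for each input capsule, so each lower-level capsule distributes a unit of "routing mass" among the higher-level capsules.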
FIGURE 2. Treat user as data point for routing.
FIGURE 3. Treat user as activation regularizer for routing.
output properties directly. When there are strong disagreements between a user and her latest interacted items, their properties will be fused together to form the user's long-short term preference.

IV. MULTI-LEVEL COUPLING NETWORK
In this section, we will first introduce the general framework of our multi-level coupling network (MCN), followed by the detailed realization of the feature-, entity- and preference-level couplings.

A. GENERAL FRAMEWORK
Before the illustration of our model's architecture, we begin with the basic notations. Suppose the system involves a set of users U = {u_1, u_2, ..., u_M} and a set of items I = {i_1, i_2, ..., i_N}, where M and N denote the number of users and items respectively. Each user u has interacted with a sequence of items H_u = {i^u_1, i^u_2, ..., i^u_{|H_u|}}. We use H_{u,t} = {i^u_{t−1}, i^u_{t−2}, ..., i^u_{t−L}} to denote the previous L items before u interacts with t. Let p_u ∈ R^D and q_i ∈ R^D be the embeddings of user u and item i, which are obtained from P ∈ R^{M×D} and Q ∈ R^{N×D}, where D denotes the dimension size of each embedding vector.

Figure 5 shows the overall architecture of MCN. The basic idea of MCN is that the user's dynamic interest should be modeled from hierarchical couplings. Specifically, given a user u and a target item t ∈ H_u, we can obtain the corresponding latest L interacted items, denoted as i ∈ H_{u,t}. We first retrieve their embedding vectors as the inputs and capture the couplings from the interactions within and between them via the following process: 1) we model the couplings between the features within the same embedding vector with an outer product operation; 2) we then capture user-item and item-item couplings from the interactions between their feature vectors with capsules; 3) finally, we extract the user's dynamic preference from the couplings between the multi-term preferences with a capsule network.

B. FEATURE COUPLING WITH OUTER PRODUCT
The idea of modeling feature couplings with the outer product is inspired by both similarity-based and neural network based non-IID methods. Specifically, similarity-based models expand the feature representation as they assume the similarity of two features depends on their relationships with other features [4], [5], [7]. Neural network based methods adopt a CNN or an MLP to extract higher level couplings from the feature interaction map generated by the outer product between user and item embeddings [10], [11]. As a fusion of these two ideas, we conduct the outer product on each embedding vector itself as follows:

C^u = p_u ⊗ p_u,    C^i = q_i ⊗ q_i    (4)

where ⊗ denotes the outer product operation. C^u ∈ R^{D×D} and C^i ∈ R^{D×D} are the output feature coupling representations of user u and item i (i ∈ H_{u,t}) respectively. By doing this, each feature is represented by a vector, including its own extent value and its relationships with other features. Each user and item is then modeled by a feature interaction matrix, which is further used as the source for extracting higher level couplings in the following layers. In addition, we also inherit the advantages and avoid the limitations of the two original solutions. On the one hand, manually designed similarities might not be flexible enough for various datasets and can only be applied with explicit attribute data. On the other hand, the deep learning methods directly use pairwise feature interactions to model the couplings between entities (users and items), which may lose some nuanced feature level coupling relationships.

C. ENTITY COUPLING WITH CAPSULE NETWORK
After obtaining the feature interaction maps of user u and her interacted items, we are able to model the higher entity level couplings, i.e., the user-item couplings and item-item couplings. In MCN, we adopt a capsule network for entity level couplings. Unlike other applications of capsules [44], we only follow the design philosophy of the capsule, instead of adopting the original architecture proposed in [39], [46]. This is because the original version is designed for tasks with dense inputs such as images or texts, and consists of several consecutive capsule layers that extract increasingly abstract representations. However, in our recommendation task, the inputs are sparse user/item ids. Moreover, the observable feedback only covers a very small fraction of the full rating matrix, which is known as the issue of data sparsity. Consequently, it is difficult to learn a huge number of parameters with the original capsule network.

Formally, for item-item couplings, we feed the rows with the same dimension index from the L interaction maps into the same capsule. Thus the height of our capsule network is D. The output of each capsule is a D-dimensional vector, representing the agreements between the L interacted items at this dimension. Besides, to enable couplings for different agreements, we pose K capsules for each feature dimension. Thus, the depth of our capsule network is K. Figure 6 presents a toy example with L = 4, D = 5, K = 3. The formal output of item-item coupling at dimension d with the k-th capsule is computed as follows:

O^k_d = Routing(C^{t−1}_d, C^{t−2}_d, ..., C^{t−L}_d)    (5)

where C^{t−l}_d denotes row d of item i^u_{t−l}'s coupling matrix C^{t−l}. The procedure of Routing is summarized in Algorithm 1. All the item-item coupling outputs at depth k then form a matrix O^k ∈ R^{D×D}, and all K matrices at different depths form a D × D × K cube O, representing the short-term preference of user u.

For user-item coupling, the computation process is similar. The main difference is the additional input of the user vector C^u_d with three influential levels of routing strategies, which are summarized in Section III-C. Formally, the output of user-item coupling at dimension d with the k-th capsule is computed as
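To make the feature coupling step concrete, the following NumPy sketch builds the self outer-product maps of Eq. (4) and then arranges row d across the L maps as the input group for one capsule, as Eq. (5) prescribes. The sizes and the random embeddings are illustrative assumptions, not values from the paper.

```python
import numpy as np

D, L = 5, 4                          # embedding size and number of recent items (illustrative)
rng = np.random.default_rng(42)
q = rng.normal(size=(L, D))          # embeddings q_i of the L recently interacted items

# Eq. (4): feature coupling map of each item via the outer product with itself,
# so entry (a, b) of C[l] couples feature a with feature b of item l.
C = np.einsum('la,lb->lab', q, q)    # shape (L, D, D)

# Eq. (5) input arrangement: for a fixed dimension d, row d of every item's
# coupling matrix is routed by the same capsule, i.e. L vectors of size D.
d = 2
capsule_input = C[:, d, :]           # shape (L, D), one vote source per item
```

The `capsule_input` array is exactly what a `Routing` call at dimension d would consume; repeating this for every d and for K capsules per dimension yields the D × D × K cube O described above.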
follows:

O'^k_d = Routing(C^{t−1}_d, C^{t−2}_d, ..., C^{t−L}_d, C^u_d)    (6)

Similar to O, O' also denotes a D × D × K cube. Besides, with the help of the user information, we are able to highlight the overlapping parts of the user's general interest and her latest tastes with different influential degrees of user information. The generated cube O' represents the long-short term preference of the user.

FIGURE 5. Illustration of the proposed MCN model. The left subfigure shows the overall architecture of MCN, and the right subfigure presents a simple example of the capsule network with four input embeddings and three output capsules.

FIGURE 6. A toy example of coupling learning with a capsule network. It takes L D × D interaction maps as inputs for a D × K capsule network, which outputs a D × D × K cube.

D. PREFERENCE COUPLING WITH CAPSULE NETWORK
After obtaining the user's long-, short- and long-short term preferences, we are able to model the user's dynamic preference as the agreements between them. For preference coupling, we adopt the same capsule network structure as for entity level coupling:

E^k_d = Routing(O^1_d, ..., O^K_d, O'^1_d, ..., O'^K_d, C^u_d)    (7)

All the routing outputs E^k_d at the different feature dimensions d and depths k also form a D × D × K cube. We flatten the cube to a D-dimensional vector by adding the K matrices together and then summing the resulting D row vectors:

e_u = Σ_{k=1}^{K} Σ_{d=1}^{D} E^k_d    (8)

Hence, the final prediction for user u on target item t can be estimated by the inner product of the user's dynamic preference and the item's embedding:

r̂_{ut} = e_uᵀ q_t    (9)

The whole model can be jointly trained by minimizing the following pairwise cost function:

arg min_Θ Σ_{(u,t,t')} −log(logistic(e_uᵀ q_t − e_uᵀ q_{t'})) + λ‖Θ‖²_F    (10)

where t' denotes the sampled negative item, λ is the coefficient of the L2 norm regularizer, and Θ denotes the set of model parameters.

V. EXPERIMENTS AND ANALYSIS
In this section, we evaluate the proposed model on a wide spectrum of real-world datasets to answer the following research questions:
RQ1 Does the proposed multi-level coupling network achieve state-of-the-art performance?
RQ2 How do MCN's key hyper-parameters influence its performance?
RQ3 Is each design of the proposed model effective?
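The aggregation, prediction and training objective of Eqs. (8)–(10) can be sketched as follows. This is an illustration under stated assumptions: the cube E, the item embeddings q_t and q_neg are random stand-ins, and the regularizer term is a placeholder, since a real implementation would regularize the full parameter set Θ rather than E alone.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

D, K = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(K, D, D))        # routing outputs E^k_d stacked into a D x D x K cube

# Eq. (8): flatten the cube into the dynamic preference vector e_u by summing
# over the K depths and the D feature dimensions.
e_u = E.sum(axis=(0, 1))              # shape (D,)

# Eq. (9): score a target item t by the inner product with its embedding.
q_t, q_neg = rng.normal(size=D), rng.normal(size=D)
r_ut = e_u @ q_t

# Eq. (10): pairwise BPR-style loss for one (u, t, t') triple; lam stands for lambda
# and theta_sq is a stand-in for ||Theta||_F^2 over all model parameters.
lam = 0.01
theta_sq = np.sum(E * E)
loss = -np.log(logistic(e_u @ q_t - e_u @ q_neg)) + lam * theta_sq
```

Because −log(logistic(x)) shrinks as the positive item is scored above the negative one, minimizing this loss pushes r̂_{ut} above r̂_{ut'} for sampled negatives t'.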
TABLE 1. Basic statistics of datasets.

A. EXPERIMENTAL SETTINGS
1) DATASETS
We evaluate our model in two different domains to demonstrate its generality. Specifically, Movielens¹ is a dataset containing user watching behaviors on different movies. Amazon [47] is an e-commerce dataset, from which we choose two categories, Electronics and Kindle, to cover different data characteristics. We transformed the explicit ratings into implicit data, where each entry is marked as 0 or 1 indicating whether the user has rated the item. To collect sufficient user preference, we select the users with at least 10 records for the experiments, as in [9], [29]. The statistics of the processed datasets are shown in Table 1.

2) EVALUATION PROCEDURE
To evaluate the recommendation performance, each of our datasets is split into training/validation/testing sets. For each user, we reserve the last two interactions for the validation and testing sets, while the remaining interactions are used for training. To reduce the expensive cost of ranking all items for each user during evaluation, we follow the common strategy [9], [11], [48] of randomly sampling 99 items that have not been interacted with by the user, and ranking the target item among these 100 items. We adopt three well-known metrics to evaluate the ranking quality, namely Precision, Recall, and Normalized Discounted Cumulative Gain (NDCG). We omit their definitions due to space limitations. Generally, higher metric values indicate better ranking performance.

3) COMPARISON METHODS
As our work synthesizes non-IID recommendation and sequential recommendation, we adopt two groups of baselines, one for each track of the research:
• Non-IID Recommendation: NCF [9] is the combination of an MF model and a multi-layer perceptron (MLP), which capture the linear and non-linear user-item couplings respectively. CoupledCF [11] adopts a CNN and an MLP to learn the explicit and implicit user-item couplings respectively. As we do not consider other side information for recommendation, we take the outer product of the user and item embedding vectors as the input of its CNN module.
• Sequential Recommendation: FPMC [14] is a traditional method for sequential recommendation. It is a combination of matrix factorization and a Markov chain, which learn users' long- and short-term preferences respectively. GRU4Rec [20] is a deep learning based recommendation method that adopts an RNN to capture sequential dependencies and make predictions. Caser [32] utilizes a CNN to explore the user's short-term preference by capturing local sequential patterns of adjacent items, which is then fused with the user's general interest for prediction. RUM [29] adopts a memory network to store external memories for sequential recommendation; we take the item-level RUM for comparison.

B. COMPLEXITY ANALYSIS AND IMPLEMENTATION
For each user, the computation cost of MCN at the feature coupling layer is O((L + 1)D²), simplified as O(LD²). Next, the computation cost of a dynamic routing procedure with L D-dimensional input embeddings, K output capsules and T iterations is O(LKD² + T(LK + K(LD + D) + LKD)), simplified as O(LKD(D + T)). For item-item coupling, user-item coupling, and preference coupling, the numbers of input embeddings are L, L + 1, and 2K + 1 respectively, each with D capsule networks. Thus, the computation cost of entity coupling and preference coupling together is O(KD²(L + K)(T + D)). Note that the hyper-parameters K, L and T are much smaller than D, and each user's inferred dynamic preference embedding stays the same across different target items. Therefore, the computation cost of matching M users with N items is about O(MK(L + K)D³ + MND), reformulated as O(MD(K(L + K)D² + N)). In practice, K(L + K)D² and the item number N are of about the same order of magnitude, indicating that the time complexity of MCN is approximately O(MND), which is of the same order of magnitude as traditional matrix factorization methods like BPR [15]. On the other hand, existing non-IID models like NCF [9] and CoupledCF [11] feed each user-item pair into a deep neural network to compute the matching score. Their overall inference cost is about O(MNDγ), where γ denotes the depth of the network.

Empirically, BPR, MCN, NCF, and CoupledCF cost around 120s, 260s, 590s, and 670s respectively for matching all user-item pairs on the Amazon-Kindle dataset. As we can see, MCN achieves computation complexity comparable to the traditional matrix factorization method BPR, being much more efficient than the existing non-IID approaches (NCF and CoupledCF). This demonstrates that MCN is very scalable for real applications. We implement MCN with TensorFlow on a single NVIDIA GeForce GTX 1080 Ti GPU without any pre-training. In practice, MCN can further benefit from multi-GPU servers with a pure C++ implementation and CUDA acceleration.

Parameter Settings: All the models are trained by TensorFlow² using the Adam optimizer [49] with a batch size of 256. The optimal experimental settings for the comparison methods are determined either by our experiments or as suggested by previous works. Specifically, the learning rate and regularization coefficient are commonly set to 0.001. The latent

¹ grouplens.org/datasets/movielens/1m/
² www.tensorflow.org/
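The leave-last-out evaluation protocol described above, ranking the held-out target among 99 sampled negatives, can be sketched as follows. The scores are synthetic; this only illustrates how, with a single relevant item, Recall@k reduces to a hit indicator and NDCG@k reduces to a function of the target's rank (the function and variable names are our own, not from the paper).

```python
import numpy as np

def rank_metrics(scores, target_idx, k=10):
    """Rank one held-out target among its candidate set (target + 99 negatives)
    and return (Recall@k, NDCG@k) for this single-relevant-item case."""
    rank = int(np.sum(scores > scores[target_idx]))   # 0-based rank of the target
    hit = 1.0 if rank < k else 0.0                    # Recall@k with one relevant item
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hit, ndcg

rng = np.random.default_rng(1)
scores = rng.normal(size=100)      # model scores for the target + 99 sampled negatives
scores[0] += 10.0                  # pretend index 0 is the target, scored far above the rest
hit, ndcg = rank_metrics(scores, target_idx=0, k=10)
```

Averaging these per-user values over all test users yields the dataset-level Recall@k and NDCG@k reported in the experiments.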
TABLE 2. Performance comparison of MCN with other models; the best performance among all baseline methods is boldfaced. All the improvements of MCN over the best baseline are significant at the level of 0.01.

FIGURE 9. The visualization of MCN_atten and MCN with different routing iterations. We project the learned feature vectors into two dimensions with PCA on the movielens dataset. We pick a random user with her latest interacted items as the input of MCN and MCN_atten, and visualize the output of one capsule of the user-item coupling network, which is denoted as the red point. Then we feed the user's next interacted item into the same capsule and get its prediction vector, which is denoted as the blue point. Finally we sample 1000 negative items and feed them into the same capsule; their prediction vectors are denoted as the gray points.

its capacity of coupling learning is limited by the simplified structure. Additionally, it shows that MCN_matrix outperforms MCN_vector on movielens while the reverse holds on electronics. This can be explained by the over-fitting of MCN_matrix becoming worse on sparser datasets.

TABLE 4. The relative performance decline of MCN variants for preference level coupling learning.
TABLE 5. The performance of MCN variants with different routing strategies.

as activation regularizer; and 3) user as vote bias. We denote these approaches as point, activ and bias respectively. The results are presented in Table 5. Specifically, the accuracy increases with the promotion of the user influential levels on the movielens dataset, while it decreases on the electronics and kindle datasets, indicating that user influence is more reliable on denser datasets. This pattern can also be observed from Table 4, as the −l variant shows the highest performance decline on the movielens dataset.

VI. CONCLUSION
In this paper, we proposed a novel multi-level coupling network (MCN) for non-IID sequential recommendation. It considers comprehensive couplings within and between multiple aspects at different levels: 1) feature level couplings within each user and her interacted items; 2) user-item and item-item entity level couplings; and 3) long-, short- and long-short term preference couplings. We implemented the framework on the basis of a capsule network, as its basic ideas are seamlessly suitable for sophisticated coupling learning, and we provided a detailed analysis of the advantages of adopting the capsule network. We conducted extensive experiments on three real datasets. The results showed that our approach significantly outperformed six state-of-the-art models. We also conducted comprehensive ablation experiments to verify the effectiveness of each design of the model.

REFERENCES
[1] L. Cao, "Non-IID recommender systems: A review and framework of recommendation paradigm shifting," Engineering, vol. 2, no. 2, pp. 212–224, 2016.
[2] L. Cao, Y. Ou, and P. S. Yu, "Coupled behavior analysis with applications," IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1378–1392, Aug. 2012.
[3] L. Cao, "Coupling learning of complex interactions," J. Inf. Process. Manage., vol. 51, no. 2, pp. 167–186, 2015.
[4] F. Li, G. Xu, and L. Cao, "Coupled matrix factorization within non-IID context," in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining (PAKDD). New York, NY, USA: Springer, 2015, pp. 707–719.
[5] C. Wang, L. Cao, M. Wang, J. Li, W. Wei, and Y. Ou, "Coupled nominal similarity in unsupervised learning," in Proc. 20th ACM Int. Conf. Inf. Knowl. Manage. (CIKM), 2011, pp. 973–978.
[6] T. D. T. Do and L. Cao, "Coupled Poisson factorization integrated with user/item metadata for modeling popular and sparse ratings in scalable recommendation," in Proc. AAAI Conf. Artif. Intell. (AAAI), 2018, pp. 1–7.
[7] C. Wang, Z. She, and L. Cao, "Coupled attribute analysis on numerical data," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2013.
[8] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications, vol. 20. Philadelphia, PA, USA: SIAM, 2007.
[9] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proc. 26th Int. Conf. World Wide Web (WWW), 2017, pp. 173–182.
[10] X. He, X. Du, X. Wang, F. Tian, J. Tang, and T.-S. Chua, "Outer product-based neural collaborative filtering," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2018.
[11] Q. Zhang, L. Cao, C. Zhu, Z. Li, and J. Sun, "CoupledCF: Learning explicit and implicit user-item couplings in recommendation for deep collaborative filtering," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2018, pp. 3662–3668.
[12] F. Pan, Q. Cai, P. Tang, F. Zhuang, and Q. He, "Policy gradients for contextual recommendations," in Proc. 28th Int. Conf. World Wide Web (WWW), 2019.
[13] L. Hu, S. Jian, L. Cao, Z. Gu, Q. Chen, and A. Amirbekyan, "HERS: Modeling influential contexts with heterogeneous relations for sparse and cold-start recommendation," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 3830–3837.
[14] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, "Factorizing personalized Markov chains for next-basket recommendation," in Proc. Int. Conf. World Wide Web (WWW), 2010, pp. 811–820.
[15] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, "BPR: Bayesian personalized ranking from implicit feedback," in Proc. 25th Conf. Uncertainty Artif. Intell. (UAI), 2009, pp. 452–461.
[16] R. He and J. McAuley, "Fusing similarity models with Markov chains for sparse sequential recommendation," in Proc. IEEE Int. Conf. Data Mining (ICDM), Dec. 2016, pp. 191–200.
[17] S. Kabbur, X. Ning, and G. Karypis, "FISM: Factored item similarity models for top-N recommender systems," in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2013, pp. 659–667.
[18] P. Wang, J. Guo, Y. Lan, J. Xu, S. Wan, and X. Cheng, "Learning hierarchical representation model for next basket recommendation," in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2015, pp. 403–412.
[19] F. Yu, Q. Liu, S. Wu, L. Wang, and T. Tan, "A dynamic recurrent model for next basket recommendation," in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2016, pp. 729–732.
[20] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, "Session-based recommendations with recurrent neural networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016.
[21] B. Hidasi and A. Karatzoglou, "Recurrent neural networks with top-k gains for session-based recommendations," 2018, arXiv:1706.03847. [Online]. Available: https://arxiv.org/abs/1706.03847
[22] M. Quadrana, A. Karatzoglou, B. Hidasi, and P. Cremonesi, "Personalizing session-based recommendations with hierarchical recurrent neural networks," in Proc. 11th ACM Conf. Recommender Syst. (RecSys), 2017, pp. 130–137.
[23] W. Pei, J. Yang, Z. Sun, J. Zhang, A. Bozzon, and D. M. J. Tax, "Interacting attention-gated recurrent networks for recommendation," in Proc. Int. Conf. Inf. Knowl. Manage. (CIKM), 2017, pp. 1459–1468.
[24] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma, "Neural attentive session-based recommendation," in Proc. Int. Conf. Inf. Knowl. Manage. (CIKM), 2017.
[25] C. Qiang, W. Shu, H. Yan, and L. Wang, "A hierarchical contextual attention-based GRU network for sequential recommendation," 2017, arXiv:1711.05114. [Online]. Available: https://arxiv.org/abs/1711.05114
[26] P. Sun, L. Wu, and M. Wang, "Attentive recurrent social recommendation," in Proc. 41st Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2018, pp. 185–194.
[27] Z. Sun, J. Yang, J. Zhang, A. Bozzon, L.-K. Huang, and C. Xu, "Recurrent knowledge graph embedding for effective recommendation," in Proc. 12th ACM Conf. Recommender Syst. (RecSys), 2018, pp. 297–305.
[28] R. Li, Y. Shen, and Y. Zhu, "Next point-of-interest recommendation with temporal and multi-level context attention," in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 1110–1115.
[29] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin, and H. Zha, "Sequential recommendation with user memory networks," in Proc. 11th ACM Int. Conf. Web Search Data Mining (WSDM), 2018, pp. 108–116.
[30] G. Guo, S. Ouyang, X. He, F. Yuan, and X. Liu, "Dynamic item block and prediction enhancing block for sequential recommendation," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2019.
[31] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, and E. Y. Chang, "Improving sequential recommendation with knowledge-enhanced memory networks," in Proc. 11th ACM Int. Conf. Web Search Data Mining (WSDM), 2018, pp. 108–116.
[32] J. Tang and K. Wang, "Personalized top-N sequential recommendation via convolutional sequence embedding," in Proc. 11th ACM Int. Conf. Web Search Data Mining (WSDM), 2018, pp. 565–573.

YATONG SUN is currently pursuing the M.S. degree in software engineering with Northeastern University, China. Since September 2018, he has been an Algorithm Engineer with JD AI Research. His research interests include machine learning, data mining, and recommender systems. Specifically, he is focusing on incorporating
[33] T. X. Tuan and M. P. Tu, ‘‘3d convolutional networks for session-based
knowledge-based information into sequential rec-
recommendation with content features,’’ in Proc. 11th ACM Conf. Recom-
ommendation tasks.
mender Syst. (RecSys), 2017, pp. 138–146.
[34] F. Yuan, A. Karatzoglou, I. Arapakis, J. M. Jose, and X. He, ‘‘A sim-
ple but hard-to-beat baseline for session-based recommendations,’’ 2018,
arXiv:1808.05163. [Online]. Available: https://arxiv.org/abs/1808.05163
[35] J. Tang, F. Belletti, S. Jain, M. Chen, A. Beutel, C. Xu, and E. H. Chi,
‘‘Towards neural mixture recommender for long range dependent user
sequences,’’ in Proc. 28th Int. Conf. World Wide Web (WWW), 2019, GUIBING GUO received the Ph.D. degree in
pp. 1782–1793. computer science from Nanyang Technological
[36] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, ‘‘BERT4Rec: University, Singapore, in 2015. He is currently a
Sequential recommendation with bidirectional encoder representations Professor with the Software College, Northeastern
from transformer,’’ in Proc. Int. Conf. Inf. Knowl. Manage. (CIKM), 2019. University, China. His research interests include
[37] A. Fukui, H. P. Dong, D. Yang, A. Rohrbach, and M. Rohrbach, ‘‘Mul- recommender systems, deep learning, natural lan-
timodal compact bilinear pooling for visual question answering and guage processing, and data mining.
visual grounding,’’ in Proc. Empirical Methods Natural Lang. Process.
(EMNLP), 2016.
[38] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, ‘‘MUTAN: Multi-
modal tucker fusion for visual question answering,’’ in Proc. Empirical
Methods Natural Lang. Process. (EMNLP), 2017.
[39] S. Sabour, N. Frosst, and G. E. Hinton, ‘‘Dynamic routing between cap-
sules,’’ in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017.
[40] D. Wang and Q. Liu, ‘‘An optimization view on dynamic routing between
capsules,’’ in Proc. ICLR Workshop, 2018. XIAODONG HE is currently the Deputy Manag-
[41] P. Afshar, A. Mohammadi, and K. N. Plataniotis, ‘‘Brain tumor type ing Director of the JD AI Research and the Head
classification via capsule networks,’’ 2018, arXiv:1802.10200. [Online]. of the Deep learning, NLP and Speech Lab, and
Available: https://arxiv.org/abs/1802.10200 the Technical Vice President of JD.com. He is
[42] R. LaLonde and U. Bagci, ‘‘Capsules for object segmentation,’’ 2018, also an Affiliate Professor with the University of
arXiv:1804.04241. [Online]. Available: https://arxiv.org/abs/1804.04241 Washington, Seattle, serves in Ph.D. supervisory
[43] M. T. Bahadori, ‘‘Spectral capsule networks,’’ in Proc. ICLR Workshop, committees. Before joining JD.com, he was with
2018. Microsoft for about 15 years, served as a Princi-
[44] M. Yang, W. Zhao, J. Ye, Z. Lei, Z. Zhao, and S. Zhang, ‘‘Investigat- pal Researcher and the Research Manager of the
ing capsule networks with dynamic routing for text classification,’’ in DLTC with Microsoft Research, Redmond,
Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2018,
WA, USA.
pp. 3110–3119.
[45] Y. Wang, A. Sun, J. Han, Y. Liu, and X. Zhu, ‘‘Sentiment analysis by
capsules,’’ in Proc. Int. Conf. World Wide Web (WWW), 2018.
[46] H. Geoffrey, S. Sara, and F. Nicholas, ‘‘Matrix capsules with EM routing,’’
in Proc. Int. Conf. Learn. Represent. (ICLR), 2018.
[47] J. McAuley and J. Leskovec, ‘‘Hidden factors and hidden topics: Under-
XIAOHUA LIU is currently a Researcher with
standing rating dimensions with review text,’’ in Proc. 7th ACM Conf.
AINemo. He is also a Postdoctoral Research Fel-
Recommender Syst. (RecSys), 2013, pp. 165–172.
[48] Y. Koren, ‘‘Factorization meets the neighborhood: A multifaceted collab- low with the University of Montreal. Prior to
orative filtering model,’’ in Proc. 14th ACM SIGKDD Int. Conf. Knowl. that, he was a Project Lead and a Researcher
Discovery Data Mining (KDD), 2008, pp. 426–434. with Microsoft Research Asia, Natural Language
[49] D. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’ in Computing Group, and JD AI Research. His
Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–15. research interests include big data processing,
[50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, social content mining, NLP for ranking, and
L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. machine learning.
Neural Inf. Process. Syst. (NIPS), 2017, pp. 5998–6008.