
2021 IEEE International Conference on Big Data (Big Data)

Deep Reinforcement Learning based Recommender System with State Representation

1st Peng Jiang
Polytechnic Institute
ZheJiang University
Hangzhou, China
jiangpeng6@zju.edu.cn

2nd Jiafeng Ma
College of Control Science and Engineering
ZheJiang University
Hangzhou, China
mjf_zju@zju.edu.cn

3rd Jianming Zhang
College of Control Science and Engineering
ZheJiang University
Hangzhou, China
ncsl@zju.edu.cn
Abstract—With the scale of E-commerce increasing year by year, the importance of recommender systems is receiving increasing attention. Based on deep reinforcement learning, we can model the recommendation task as an interactive and sequential decision procedure between the system and its users, instead of a static process, which can improve the recommendation quality to a large extent. By integrating a state representation module, the quality of modeling the interaction between users and the system can be further improved. In this article, we propose a recommender system based on user-commodity state representation integrated with deep reinforcement learning, named UCSRDRL, and conduct experiments on the datasets offered by FUXI AI Lab; the outcome performs better than the baseline. The score of UCSRDRL ranked third in the competition.

Keywords—Recommender system, Deep Reinforcement Learning, Actor-Critic

I. INTRODUCTION

E-commerce has gained great popularity in recent years, and the scale of E-commerce transactions in China has increased from 4.55 trillion yuan to 34.81 trillion yuan, an average annual compound growth rate of 25%. Among the intelligent applications in E-commerce, recommender systems are becoming an inseparable part. These systems help users with their search tasks by recommending relevant items, services and information, and have been widely utilized in various domains including movies and music. Traditionally, the recommendation task has long been seen as a pure classification and prediction problem, and many techniques have been proposed to deal with it, including collaborative filtering, content-based filtering and matrix factorization-based methods. Despite the progress of these techniques, two limitations remain: (1) they treat the recommendation task as a static procedure, ignoring its sequential and interactive nature; (2) they focus on the immediate rewards during the recommendation, without considering the long-term benefits of some items.

To tackle these limitations, the sequential process can be modeled as a Markov Decision Process, and Reinforcement Learning (RL) can be introduced to handle it. To break the limitations of conventional RL in situations with large state dimensionality and action spaces, deep neural networks are introduced to improve the approximation ability, and the improved RL is called deep reinforcement learning (DRL).

In this article, we propose a recommender system based on deep reinforcement learning; to improve the recommendation quality, we introduce a state representation module. In summary, we make the following contributions:

• We propose a recommender system based on deep reinforcement learning. Instead of treating the recommendation task as a static procedure of classification and prediction, it models the task as a sequential decision process and takes both immediate and long-term rewards into account.

• We utilize the Actor-Critic framework and integrate a state representation module into the system. We also adopt embedding to deal with the item information and user-portraits, and the resulting vectors form the input of the state; these choices help to improve the recommendation quality.

• Using the system, we conduct experiments with the data and virtual environment offered by FUXI AI Lab, and the result outperforms the baseline to a large extent.

II. RELATED WORK

A. Reinforcement Learning based Recommender Systems

Reinforcement learning is an important kind of machine learning, alongside supervised learning and unsupervised learning. According to Sutton [1], an RL problem is characterized by three features: (1) the problem is closed-loop; (2) there is no tutor to teach the agent how to finish the task, but the agent can learn by trial and error; (3) the actions performed by the agent not only affect the immediate reward, but also have an impact on the reward in the long run. During the procedure of reinforcement learning, in order to obtain the highest possible reward in a certain state of a certain environment, continuous attempts, which are also called actions, are made by the agent through trial and error.

There has been a lot of work on recommender systems based on reinforcement learning, utilizing both tabular and approximate approaches. In 1997, Thorsten Joachims et al. [2] utilized RL to improve the quality of recommender systems, which was probably the first recommendation algorithm using RL. They modelled the web page recommendation task as a reinforcement learning problem and improved the quality using the Q-learning method.

Later, in 2005, Anongnart Srivihok et al. [3] suggested a personalized support system applying the Q-learning algorithm to learn customer behavior and then recommend products that meet users' interests. Using a personalization learner based on cluster properties and user behavior, and conducting experiments with real users on a small scale, the study revealed the possibility of developing a personalized support system utilizing RL, but the problems of large state and action dimensionality and of reward function design remained to be tackled. Nima Taghipour et al. [4] and Tariq Mahmood et al. [5] proposed recommender systems using an N-gram model to address the large state dimensionality problem. Elad Liebman et al. [6] factored the reward function into the listeners' preference for individual songs and the songs' transition patterns.

Despite the great success of RL-based recommender systems, there are still some defects. When dealing with tasks with large action spaces and state dimensionality, the computational expenses grow rapidly and the recommendation quality declines significantly. To break these limitations, deep neural networks are introduced to improve the approximation ability, and the improved RL is called deep reinforcement learning (DRL).

B. Deep Reinforcement Learning based Recommender Systems

Thanks to the recent progress in combining deep learning with reinforcement learning, deep reinforcement learning (DRL) can deal with tasks with massive state and action spaces, including recommendation tasks.

In DRL-based recommender systems, deep learning is used to approximate the policy or value function. There are mainly three kinds of methods adopted by such recommender systems: Deep Q-learning Network (DQN), Actor-Critic, and REINFORCE.

Aniruddh Raghu et al. [7] proposed an approach based on DQN to discover sepsis treatment strategies; they used continuous state-space and discrete action models with a clinically aided reward function, and utilized DQN to optimize the policy. In view of the inherent problems of DQN, various algorithms have been proposed: Double DQN [8] chooses actions according to the updating Q-network to reduce the overestimation of the Q value, and Prioritized Experience Replay (PER) [9] replays important transitions more frequently to learn more efficiently; both are used in several studies [10], [11].

Gabriel Dulac-Arnold et al. [12] proposed a policy architecture using Actor-Critic to handle tasks with a large number of actions and performed the optimization using Deep Deterministic Policy Gradient (DDPG) [13]. Xiangyu Zhao et al. [14] proposed a page-wise recommendation framework based on DRL, named DeepPage. It designed the Critic network as an approximator of the action-value function, which judges how well the action generated by the Actor network fits the current state. Feng Liu et al. [15] proposed a recommendation framework based on DRL, called DRR. It adopted the Actor-Critic framework to explicitly model the dynamic user-item interactions and incorporated a state representation module into the system to improve its performance. Later, in DRRS proposed by the same authors [16], they used the Actor-Critic framework and a ranking function composed of CTR and bid to generate recommendations. Minmin Chen et al. [17] presented a general recipe for addressing biases in a top-K recommender system for recommending videos at YouTube; the system was built with REINFORCE [18], a policy-gradient-based algorithm.

III. PROBLEM STATEMENT AND DATA ANALYSIS

A. Problem Statement

This item recommendation task is characterized by its mid-dimensional action space (roughly 400 item candidates after reduction) and by special interaction rules. As shown in Fig. 1, for each user request the recommendation engine responds with 3 item lists of 3 items each, and the next item list cannot be purchased unless the items of the current list are sold out. The users' response therefore depends not only on the current item list but also on the items of the next list.

Fig. 1. The item lists

So, the goal of the system is to recommend nine items for each user request to maximize the aggregated reward. Because other user satisfaction indicators are difficult to quantify, the contest defines the reward as the sum of the user's purchase amount over the nine items. The task is to generate the best item combination, with no repeated items, for each user to maximize the expected reward.
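To make the interaction rule concrete, the following minimal sketch shows how a single request's aggregated reward could be computed under this protocol, with nine items arranged as three lists of three and a later list counting only if every item of the previous list has been bought. The function name, toy prices and purchase decisions are hypothetical; in the contest the purchase feedback comes from the FUXI virtual environment.

```python
# Hypothetical sketch of the contest interaction rule: 9 recommended items are
# split into 3 lists of 3, and list k+1 only becomes purchasable once every
# item of list k has been bought. Reward = total purchase amount.

def simulate_session_reward(recommended_items, prices, purchases):
    """recommended_items: 9 item ids (3 lists of 3, in order).
    prices: dict item_id -> price.
    purchases: dict item_id -> bool, the (simulated) user's buy decision."""
    assert len(recommended_items) == 9
    reward = 0.0
    for k in range(3):                       # iterate over the 3 item lists
        current_list = recommended_items[3 * k: 3 * k + 3]
        bought = [purchases.get(i, False) for i in current_list]
        reward += sum(prices[i] for i, b in zip(current_list, bought) if b)
        if not all(bought):                  # later lists stay locked
            break
    return reward


if __name__ == "__main__":
    items = [11, 12, 13, 21, 22, 23, 31, 32, 33]          # toy item ids
    prices = {i: 10.0 + i for i in items}                  # toy prices
    purchases = {11: True, 12: True, 13: True, 21: True}   # user stops in list 2
    print(simulate_session_reward(items, prices, purchases))
```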
B. Data Analysis

There are three datasets that can be used during the contest, namely item-info, trainset and track2_testset [19]. The item data consists of 381 items with their ids, corresponding vectors, prices and places. The training data involves more than 250k sessions, 400 items, and 40k users; it consists of the id, click history and portrait of the users, the items exposed by the recommender system to the users, the users' corresponding responses, and the time when each event happens. The test data for track 2 consists of the users' information.

In item-info.csv, there are four kinds of data, namely item-id, item-vec, price and location. "item-id" is a unique id for each item, which is the basis for distinguishing different items and is what the recommender system outputs. "item-vec" is a five-dimensional vector characterizing each item, which is the main basis for calculating the similarity between items via the cosine distance of the embedded vectors. The location of an item indicates the number of the item list in which the item is allowed to be recommended, which can be viewed as a piecewise function of the item id.
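Since the item similarity is computed from the cosine distance of the item vectors, a minimal sketch of that computation is given below. The DataFrame layout and the space-separated encoding of "item-vec" are assumptions about the CSV format, not details confirmed by the paper.

```python
# Hypothetical sketch: pairwise cosine distance between the 5-dimensional item vectors.
import numpy as np
import pandas as pd

# In practice: item_info = pd.read_csv("item_info.csv")  (file/column names assumed)
item_info = pd.DataFrame({
    "item_id": [1, 2, 3],
    "item_vec": ["0.1 0.3 0.5 0.2 0.9", "0.2 0.1 0.4 0.8 0.7", "0.9 0.5 0.1 0.3 0.2"],
})

# Parse the space-separated strings into a (num_items, 5) matrix.
vecs = np.stack(item_info["item_vec"].apply(lambda s: np.fromstring(s, sep=" ")).to_numpy())

norms = np.linalg.norm(vecs, axis=1, keepdims=True)
unit = vecs / np.clip(norms, 1e-12, None)      # avoid division by zero
cosine_sim = unit @ unit.T                     # (num_items, num_items) similarity matrix
cosine_dist = 1.0 - cosine_sim                 # cosine distance used as the similarity basis

# Example: items ranked by similarity to item 0 (excluding itself).
print(np.argsort(cosine_dist[0])[1:])
```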

In trainset.csv, there are six kinds of data, namely user-id, user-click-history, user-portrait, exposed-items, labels and time. The first three describe the user: user-id is the unique identifier of the user, changing with the training session; user-click-history records the user's click history, which reflects his interests and is used to estimate the probability of his buying the recommended items. Together with the exposed-items and the labels indicating the result of the user's buying, the click history is used in the design of the reward function. The user-portrait is a ten-dimensional vector indicating the portrait features of the user, which serves as the basis of our system's recommendations.

In track2_test.csv, there is only the users' information, consisting of user-id, user-click-history and user-portrait, and their usage is the same as above.
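A small parsing sketch for one training record is shown below, assuming the list-valued fields are serialized as comma-separated strings; the exact column names and encoding in trainset.csv are assumptions.

```python
# Hypothetical sketch: turning one trainset.csv record into the pieces used later
# (click history and portrait feed the state; exposed items + labels feed the reward).
# In practice the record would come from pandas.read_csv("trainset.csv").
row = {
    "user_id": 1,
    "user_click_history": "12,57,103",
    "user_portrait": "0,3,1,2,0,5,1,0,2,4",
    "exposed_items": "11,12,13,21,22,23,31,32,33",
    "labels": "1,1,1,1,0,0,0,0,0",
}

def parse_row(row):
    """Field names and the comma-separated encoding are assumptions."""
    click_history = [int(x) for x in row["user_click_history"].split(",") if x]
    portrait      = [float(x) for x in row["user_portrait"].split(",")]   # 10-dim
    exposed_items = [int(x) for x in row["exposed_items"].split(",")]     # 9 items
    labels        = [int(x) for x in row["labels"].split(",")]            # buy / not buy
    return click_history, portrait, exposed_items, labels

click_history, portrait, exposed_items, labels = parse_row(row)
print(len(click_history), len(portrait), len(exposed_items), len(labels))
```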
IV. THE PROPOSED UCSRDRL FRAMEWORK

A. Strategy

We model the recommendation procedure as a sequential decision-making problem, in which the recommender (agent) interacts with users (environment) to suggest lists of items sequentially over the timesteps, maximizing the cumulative reward of the whole recommendation procedure. The structure of a recommender system based on reinforcement learning is shown in Fig. 2.

Fig. 2. The structure of the recommender system

More specifically, the recommendation procedure is modeled by an MDP, as follows:

• Agent. In reinforcement learning, the learner or decision maker is called the agent. Here the recommender system can be seen as the agent: it tries to recommend the best items to the user and to maximize the user's satisfaction, which is analogous to the typical RL setting in which an agent tries to maximize the reward by interacting with the environment.

• Environment. The environment is everything outside the agent, including the users of the system and the items. After the agent takes an action at a state, meaning that the recommender system outputs a page of items to the user, the environment gives its feedback and changes to the next state, meaning that the user browses the lists, gives his feedback, and the system moves on to another user.

• States. A state (S) is the representation of the user's positive interaction history with the recommender, as well as her demographic information (if it exists in the datasets).

• Action. An action (A) is a continuous parameter vector. Each item has a ranking score, defined as the inner product of the action and the item embedding, and the top-ranked items are recommended.

• Transition. The state is modeled as the representation of the user's positive interaction history. Hence, once the user's feedback is collected, the state transition (P) is determined.

• Reward. Given the recommendation based on the action a and the user state (S), the user provides his feedback, such as buying or not, and the recommender receives an immediate reward (R) according to this feedback. We use different reward functions in the training sessions and the test sessions.

During the training sessions, having the data of click histories and the labels of the exposed items, we define the reward function as follows:

R = \sum_{i=1}^{9} act_i \, (itemprice_i - price_{ave}) / t    (1)

act_i = \begin{cases} 1, & \text{if item } i \text{ is in the list of exposed items and its label is 1} \\ 1, & \text{if item } i \text{ is in the list of click histories and the random decimal is larger than 0.3} \\ 1, & \text{if item } i \text{ is in neither list and the random decimal is larger than 0.7} \\ 0, & \text{otherwise} \end{cases}    (2)

act_i = 0 \text{ for every item } i \text{ of the } (k+1)\text{-th and } (k+2)\text{-th lists (if they exist), if not all items of the } k\text{-th list are sold out}    (3)

where the random decimal is chosen uniformly at random from [0, 1], t = 3 × (891 + 1358 + 2641), and price_ave is the average price of all the items that can be shown in the same list.

During the test process, without the labels of the exposed items, we define the reward function as follows:

R = \sum_{i=1}^{9} act_i \, (itemprice_i - price_{ave}) / t    (4)

act_i = \begin{cases} 1, & \text{if item } i \text{ is in the list of click histories and the random decimal is larger than 0.3} \\ 1, & \text{if item } i \text{ is in neither the list of exposed items nor the list of histories, and the random decimal is larger than 0.7} \\ 0, & \text{otherwise} \end{cases}    (5)
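The following sketch implements the reward functions (1)-(5) as read from the definitions above; the helper names, the way the per-list average price is supplied, and the first-match ordering of the cases in (2) are our own choices.

```python
# Hypothetical sketch of reward functions (1)-(5); helper names are ours.
import random

T = 3 * (891 + 1358 + 2641)            # the constant t used in (1) and (4)

def act_train(item, click_history, exposed_items, labels):
    """Eq. (2): act_i during training (cases checked in the listed order)."""
    r = random.random()                # the "random decimal" in [0, 1]
    if item in exposed_items and labels[exposed_items.index(item)] == 1:
        return 1
    if item in click_history and r > 0.3:
        return 1
    if item not in exposed_items and item not in click_history and r > 0.7:
        return 1
    return 0

def act_test(item, click_history, exposed_items):
    """Eq. (5): act_i at test time, when no labels are available."""
    r = random.random()
    if item in click_history and r > 0.3:
        return 1
    if item not in exposed_items and item not in click_history and r > 0.7:
        return 1
    return 0

def session_reward(recommended, acts, prices, list_avg_price, t=T):
    """Eqs. (1)/(4): R = sum_i act_i * (itemprice_i - price_ave) / t, where the
    acts of later lists are forced to 0 when the previous list is not fully
    sold out, as stated in (3). list_avg_price[k] is the per-list average price."""
    acts = list(acts)
    for k in range(2):                 # lists 0 and 1 gate lists 1 and 2
        if not all(acts[3 * k:3 * k + 3]):
            acts[3 * (k + 1):] = [0] * (9 - 3 * (k + 1))
            break
    return sum(a * (prices[item] - list_avg_price[i // 3]) / t
               for i, (a, item) in enumerate(zip(acts, recommended)))
```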
The output of our network is a raw action vector, which represents the relationship between the user-portrait and the item-portrait. For each list we choose the top 3 items according to the inner product of the corresponding sub-action vector and the item embedding vectors; after these three steps, the recommender system recommends 9 items to the user.
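A sketch of this scoring step is given below: the raw action vector is split into three sub-action vectors, each sub-action scores the candidate items of its list position by an inner product with the item embeddings, and the top 3 not-yet-chosen items per list are kept. The array shapes and the encoding of the location constraint are assumptions.

```python
# Hypothetical sketch of turning a raw action vector into 9 recommended items.
import numpy as np

def recommend_nine(raw_action, item_embeddings, item_location, emb_dim):
    """raw_action: (3 * emb_dim,) vector from the actor, one sub-action per list.
    item_embeddings: (num_items, emb_dim).
    item_location[i] in {1, 2, 3}: the list in which item i may be shown
    (assumed encoding of the 'location' field in item-info)."""
    chosen = []
    for k in range(3):                                    # the three item lists
        sub_action = raw_action[k * emb_dim:(k + 1) * emb_dim]
        scores = item_embeddings @ sub_action             # inner-product ranking score
        candidates = np.where(item_location == k + 1)[0]  # items allowed in list k+1
        candidates = [i for i in candidates if i not in chosen]   # no repeated items
        top3 = sorted(candidates, key=lambda i: scores[i], reverse=True)[:3]
        chosen.extend(top3)
    return chosen                                         # 9 item indices, 3 per list

# Toy usage with random data (381 items, 5-dimensional embeddings as in item-info):
rng = np.random.default_rng(0)
emb = rng.normal(size=(381, 5))
loc = rng.integers(1, 4, size=381)
print(recommend_nine(rng.normal(size=15), emb, loc, emb_dim=5))
```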

B. Based Models

The embeddings of users and items from the historical interactions are fed into a carefully designed multi-layer network, which explicitly models the interactions between users and items, to produce a continuous state representation of the user in terms of her underlying sequential behaviors.

The actor receives the state from the state representation module and generates actions using three fully connected layers with two ReLU activations and one Tanh layer. The state representation module features three structures: product-based item, product-based item-user, and averaged (pooled) product-based item-user. The critic is a modified DQN module, which judges the action generated by the actor: of its three fully connected layers, the first processes the state input, and its output is then concatenated with the action as the input to the remaining layers.
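A minimal PyTorch sketch of the two networks as described (an actor with three fully connected layers, two ReLU activations and a Tanh output, and a critic that processes the state with one layer before concatenating it with the action) is given below; the hidden-layer widths are assumptions.

```python
# Hypothetical PyTorch sketch of the Actor and Critic described above.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Three fully connected layers, two ReLU activations, one Tanh output."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)                 # raw action vector

class Critic(nn.Module):
    """One layer for the state, then concatenation with the action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.state_layer = nn.Linear(state_dim, hidden)
        self.rest = nn.Sequential(
            nn.Linear(hidden + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        s = torch.relu(self.state_layer(state))
        return self.rest(torch.cat([s, action], dim=-1))   # estimated Q(s, a)
```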
Finally, in order to train the model, DDPG is used. The framework of our model is shown in Fig. 3 and Fig. 4.

Fig. 3. The structure of DDPG used for training

Fig. 4. The structure of Actor-Critic used in UCSRDRL
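For completeness, a compact sketch of one standard DDPG update step [13] over these networks is shown below, with target networks and soft updates; the hyper-parameters and the replay-buffer interface are assumptions rather than values reported in the paper.

```python
# Hypothetical sketch of one DDPG training step (standard algorithm [13]);
# hyper-parameters and the replay-buffer interface are assumptions.
import copy
import torch
import torch.nn.functional as F

def make_ddpg(actor, critic, lr=1e-3):
    """Target copies and optimizers for the actor/critic defined above."""
    target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
    a_opt = torch.optim.Adam(actor.parameters(), lr=lr)
    c_opt = torch.optim.Adam(critic.parameters(), lr=lr)
    return target_actor, target_critic, a_opt, c_opt

def ddpg_step(batch, actor, critic, target_actor, target_critic,
              a_opt, c_opt, gamma=0.99, tau=0.005):
    # batch tensors from a replay buffer; reward and done have shape (B, 1).
    state, action, reward, next_state, done = batch
    with torch.no_grad():                               # TD target from target networks
        next_q = target_critic(next_state, target_actor(next_state))
        target_q = reward + gamma * (1.0 - done) * next_q
    critic_loss = F.mse_loss(critic(state, action), target_q)
    c_opt.zero_grad()
    critic_loss.backward()
    c_opt.step()

    actor_loss = -critic(state, actor(state)).mean()    # deterministic policy gradient
    a_opt.zero_grad()
    actor_loss.backward()
    a_opt.step()

    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)    # soft target update
    return critic_loss.item(), actor_loss.item()
```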
V. EXPERIMENTAL SETUP

A. Data Preprocessing

The most important step in the preprocessing stage is data embedding, and we design the state-representation model around it. As mentioned before, the lengths of different users' click-history vectors vary, so they must be preprocessed before being fed into the network. Data embedding is a crucial way to keep the data in the same dimensions. Here we transform the vector representing each user's click history into a 1×500-dimensional vector through an embedding layer. After that, because the click-history vectors have different lengths, we take the average of each user's history embeddings to obtain a fixed-length representation. We use the same trick to deal with the user-portrait data.
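A sketch of this preprocessing step is shown below: each id in a variable-length click history is embedded into a 500-dimensional vector and the embeddings are averaged into one fixed-length vector. The 500-dimensional size follows the text; the vocabulary size, padding convention and the use of nn.Embedding are assumptions.

```python
# Hypothetical sketch of the click-history preprocessing: embed each clicked
# item into a 500-dimensional vector and average over the (variable-length)
# history to obtain one fixed-size vector per user. Vocabulary size is assumed.
import torch
import torch.nn as nn

NUM_ITEMS, EMB_DIM = 381, 500
item_embedding = nn.Embedding(NUM_ITEMS + 1, EMB_DIM, padding_idx=0)  # 0 = padding

def embed_history(click_history):
    """click_history: list of item ids (1-based); returns an (EMB_DIM,) tensor."""
    if not click_history:
        return torch.zeros(EMB_DIM)
    ids = torch.tensor(click_history, dtype=torch.long)
    return item_embedding(ids).mean(dim=0)      # average pooling over the history

# The same trick is applied to the 10-dimensional user-portrait (layout assumed).
print(embed_history([3, 17, 42]).shape)          # torch.Size([500])
```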
When embedding is finished, the state representation module matters in both the Actor network and the Critic network. Inspired by related studies, we design the state representation module by explicitly modeling the interactions between users and items; its structure is shown in Fig. 5.

Fig. 5. The state representation module
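Fig. 5 is not reproduced here. As one plausible reading of the module, modeled on the DRR-style interaction structures named in Section IV-B (product-based item, product-based item-user, and averaged pooling), a hedged sketch is given below; the exact wiring in the paper may differ.

```python
# Hypothetical sketch of a DRR-style state representation: element-wise
# user-item interaction products plus average pooling over the history items.
# The exact structure of the module in the paper (Fig. 5) may differ.
import torch
import torch.nn as nn

class StateRepresentation(nn.Module):
    def __init__(self, emb_dim=500, history_len=10):
        super().__init__()
        # learnable weights over the history positions before pooling (assumed)
        self.position_weights = nn.Parameter(torch.ones(history_len) / history_len)

    def forward(self, user_emb, history_item_embs):
        """user_emb: (B, emb_dim); history_item_embs: (B, history_len, emb_dim)."""
        weighted = history_item_embs * self.position_weights.view(1, -1, 1)
        pooled_items = weighted.mean(dim=1)            # averaged (pooled) item part
        interaction = user_emb * pooled_items          # product-based item-user part
        # state = [user, user-item interaction, pooled items]
        return torch.cat([user_emb, interaction, pooled_items], dim=-1)

state_module = StateRepresentation()
s = state_module(torch.randn(4, 500), torch.randn(4, 10, 500))
print(s.shape)                                         # torch.Size([4, 1500])
```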
B. Experiment

Using the model described above, we conducted experiments on the offered datasets. We first train the model on the training set, saving the weights and training rewards every 20000 episodes.

Then we feed the user-portraits of the test set into the trained system, save the recommended item lists, and input the lists into the virtual environment offered by FUXI AI Lab to obtain a score for the recommendation lists. The quality of submissions is evaluated using the pure reinforcement learning measure, reward.
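The overall experimental procedure can be summarized by the hedged outline below; run_episode, recommend and score stand in for the DDPG training rollout, the trained policy and the FUXI virtual environment, and only the checkpoint interval of 20000 episodes follows the text.

```python
# Hypothetical outline of the experiment loop; run_episode, recommend and score
# stand in for the training rollout, the policy, and the FUXI virtual environment.
import torch

def run_experiment(actor, critic, run_episode, recommend, score, test_users,
                   train_episodes=120_000, checkpoint_every=20_000):
    """run_episode(actor, critic) -> float training reward for one episode.
    recommend(actor, user_id) -> list of 9 item ids.
    score(dict of user_id -> item list) -> evaluation reward from the virtual env."""
    rewards = []
    for episode in range(1, train_episodes + 1):
        rewards.append(run_episode(actor, critic))
        if episode % checkpoint_every == 0:            # save weights every 20000 episodes
            torch.save({"actor": actor.state_dict(),
                        "critic": critic.state_dict(),
                        "rewards": rewards}, f"ucsrdrl_{episode}.pt")
    lists = {user_id: recommend(actor, user_id) for user_id in test_users}
    return score(lists)                                 # reward returned by the virtual env
```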
VI. RESULTS AND DISCUSSION

A. Results

Because of our relatively simple network structure, the output of our model converges quickly, as shown in Fig. 6.

Fig. 6. The average reward of every 100 episodes during training

As shown in Fig. 6, the training reward shows an upward trend before 10000 episodes, but after that the reward fluctuates and decreases. Considering the uncomplicated network structure, we choose the weights after 100000 training sessions to prevent over-fitting, and the result shows satisfactory performance.

As for the evaluation of our recommender system, we use the pre-trained large-scale RL environment designed by FUXI AI Lab from massive data collected under real circumstances to accomplish the testing. Using the 206096 user profiles provided by track2_testset.csv, we get a list of items for each user and input it into the virtual environment. A score is given according to the purchasing outcome; the scores of the models mentioned are listed in TABLE I. For comparison, we set the score of using logged offline actions and that of a model trained with DDPG in an LSTM-based simulated environment as the baselines.

TABLE I. THE RESULTS OF THE RECOMMENDER SYSTEM

Models                                       Score
Trained after 60000 episodes                 432511736
Trained after 80000 episodes                 506765120
Trained after 100000 episodes                1115440564
Trained after 120000 episodes                303672114
Using logged offline actions                 770378225
LSTM-based environment simulator + DDPG      1033481948

As we can see from TABLE I, the result corresponds to our expectation: the model with the weights trained after 100000 episodes performs best and outperforms the baselines to a large extent. In the competition, the score of UCSRDRL ranked third.

VII. SUMMARY AND CONCLUSIONS

In this competition, we propose a deep reinforcement learning based recommender system with a state-representation module, named UCSRDRL, and utilize it to perform the recommendation task. Unlike conventional studies, this method treats the recommendation as a sequential decision-making process and adopts an Actor-Critic learning scheme, which can take both the immediate and long-term rewards into account. A state representation module is incorporated, which can explicitly model the interactions between users and items.

We conduct the experiments on the datasets using the system and, according to the training reward, we choose the model with the weights trained after 100000 episodes. To evaluate the quality of our recommender system, we perform the evaluation using the pre-trained large-scale RL environment designed by FUXI AI Lab, and the result is in line with our expectation. The recommender system built with deep reinforcement learning integrated with a state representation module outperforms the system offered as the baseline, and the quality of recommendation is improved to a large extent.

ACKNOWLEDGMENT

This material is based upon work supported by the FUXI AI Lab, NetEase.

REFERENCES

[1] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 2. MIT Press, Cambridge, 2017.
[2] Thorsten Joachims, Dayne Freitag, Tom Mitchell, et al. WebWatcher: A tour guide for the World Wide Web. In IJCAI (1), pages 770-777. Citeseer, 1997.
[3] Anongnart Srivihok and Pisit Sukonmanee. E-commerce intelligent agent: personalization travel support agent using Q-learning. In Proceedings of the 7th International Conference on Electronic Commerce, pages 287-292, 2005.
[4] Nima Taghipour, Ahmad Kardan, and Saeed Shiry Ghidary. Usage-based web recommendations: a reinforcement learning approach. In Proceedings of the 2007 ACM Conference on Recommender Systems, pages 113-120, 2007.
[5] Tariq Mahmood, Ghulam Mujtaba, and Adriano Venturini. Dynamic personalization in conversational recommender systems. Information Systems and e-Business Management, 12(2):213-238, 2014.
[6] Elad Liebman, Maytal Saar-Tsechansky, and Peter Stone. DJ-MC: A reinforcement learning agent for music playlist recommendation. arXiv preprint arXiv:1401.1880, 2014.
[7] Aniruddh Raghu, Matthieu Komorowski, Imran Ahmed, Leo Celi, Peter Szolovits, and Marzyeh Ghassemi. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602, 2017.
[8] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), 2016.
[9] Tom Schaul, John Quan, Ioannis Antonoglou, et al. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[10] Shi-Yong Chen, Yang Yu, Qing Da, et al. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1187-1196, 2018.
[11] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pages 167-176, 2018.
[12] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, et al. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
[13] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[14] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 95-103, 2018.
[15] Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, and Yuzhou Zhang. Deep reinforcement learning based recommendation with explicit user-item interactions modeling. arXiv preprint arXiv:1810.12027, 2018.
[16] Jianhua Han, Yong Yu, Feng Liu, Ruiming Tang, and Yuzhou Zhang. Optimizing ranking algorithm in recommender system via deep reinforcement learning. In 2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM), pages 22-26. IEEE, 2019.
[17] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. Top-K off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456-464, 2019.
[18] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
[19] Kai Wang, Zhene Zou, Qilin Deng, Yue Shang, Minghao Zhao, Runze Wu, Xudong Shen, Tangjie Lyu, and Changjie Fan. RL4RS: A real-world benchmark for reinforcement learning based recommender system. arXiv preprint arXiv:2110.11073.