Abstract—With the scale of E-commerce increasing year by year, the importance of recommender systems is receiving increasing attention. Based on deep reinforcement learning, we can model the recommendation task as an interactive, sequential decision procedure between the system and its users, instead of a static process, which can improve the recommendation quality to a large extent. By integrating a state representation module, the quality of modeling the interaction between users and the system can be improved further. In this article, we propose a recommender system based on deep reinforcement learning integrated with a user-commodity state representation, named UCSRDRL, and conduct experiments on the datasets offered by FUXI AI Lab; the outcome performs better than the baselines. The score of UCSRDRL ranked third in the competition.

Keywords—Recommender system, Deep Reinforcement Learning, Actor-Critic

I. INTRODUCTION

E-commerce has gained great popularity in recent years, and the scale of E-commerce transactions in China has increased from 4.55 trillion yuan to 34.81 trillion yuan, an average annual compound growth rate of 25%. Among the intelligent applications in E-commerce, recommender systems are becoming an inseparable part. They help users with their search tasks by recommending relevant items, services and information, and have been widely adopted in various domains including movies and music. Traditionally, the recommendation task has long been treated as a pure classification and prediction problem, and many techniques have been proposed for it, including collaborative filtering, content-based filtering and matrix factorization-based methods. Despite the progress of these techniques, two limitations remain: (1) they treat the recommendation task as a static procedure, ignoring its sequential and interactive nature; (2) they focus on the immediate rewards during the recommendation, without considering the long-term benefits of some items.

To tackle these limitations, the sequential process can be modeled as a Markov Decision Process, and Reinforcement Learning (RL) can be introduced to handle it. To overcome the limitations of conventional RL in situations with large state dimensionality and action spaces, deep neural networks are introduced to improve the approximation ability; the resulting method is called deep reinforcement learning (DRL).

In this article, we propose a recommender system based on deep reinforcement learning; to improve the recommendation quality, we introduce a state representation module. In summary, we make the following contributions:

• We propose a recommender system based on deep reinforcement learning. Instead of treating the recommendation task as a static procedure of classification and prediction, it models the task as a sequential decision process and takes both immediate and long-term reward into account.

• We utilize the Actor-Critic framework and integrate a state representation module into the system. We also adopt embeddings for the item information and user portraits, whose output vectors form the input of the state; these help to improve the recommendation quality.

• Using the system, we conduct experiments with the data and virtual environment offered by FUXI AI Lab, and the result outperforms the baselines to a large extent.

II. RELATED WORK

A. Reinforcement Learning based Recommender Systems

Reinforcement learning is an important kind of machine learning, alongside supervised learning and unsupervised learning. According to Sutton [1], an RL problem is characterized by three features: (1) the problem is closed-loop; (2) there is no tutor to teach the agent how to finish the task, but the agent can learn by trial and error; (3) the actions performed by the agent not only affect the immediate reward, but also have an impact on the reward in the long run. During the reinforcement learning procedure, in order to obtain the highest possible reward in a given state of the environment, continuous attempts, also called actions, are made by the agent through trial and error.

There has been a lot of work on recommender systems based on reinforcement learning, using both tabular and approximate approaches. In 1997, Thorsten Joachims et al. [2] utilized RL to improve the quality of recommender systems, which was probably the first recommendation algorithm using RL; they modelled the web page recommendation task as a reinforcement learning problem and improved the quality using the Q-learning method. Later, in 2005, Anongnart Srivihok et al. [3] suggested a personalized support system applying the Q-learning method.
Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on December 04,2023 at 19:04:06 UTC from IEEE Xplore. Restrictions apply.
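The agent-environment loop that underlies this sequential decision framing can be sketched in a few lines. This is a toy illustration only: the policy and user-feedback functions below are stand-ins we invented, not the UCSRDRL network or the FUXI simulator.

```python
import random

def toy_policy(state):
    """Stand-in recommender: pick a slate of 3 out of 10 hypothetical item ids."""
    random.seed(state)  # deterministic per state, for illustration
    return random.sample(range(10), 3)

def toy_user_feedback(state, items):
    """Stand-in user: 'buys' every recommended item id divisible by 3."""
    return sum(1 for i in items if i % 3 == 0)

def run_episode(steps=5):
    state, total_reward = 0, 0
    for _ in range(steps):
        action = toy_policy(state)                  # agent recommends a slate
        reward = toy_user_feedback(state, action)   # environment gives feedback
        total_reward += reward                      # cumulative, not just immediate
        state += 1                                  # transition to the next state
    return total_reward

total = run_episode()
```

The key point the loop illustrates is that the return being maximized is the cumulative reward over the whole session, not the immediate reward of a single recommendation.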
The first three fields describe the information of the users: user_id is the unique identifier of the user, changing with the training session; user-click-history is the feature of the user's click history, which reflects his interests and is used to estimate the probability of his buying the recommended items. Together with the exposed-items and the labels indicating the result of the user's purchases, the click history is used in the design of the reward function. The user-portrait is a ten-dimensional vector indicating the portrait features of the user, which serves as the basis of our system's recommendations.

In track2_test.csv there is the users' information only, consisting of user-id, user-click-history and user-portrait; their usage is the same as above.

IV. THE PROPOSED UCSRDRL FRAMEWORK

A. Strategy

We model the recommendation procedure as a sequential decision-making problem, in which the recommender (agent) interacts with users (environment) to suggest a list of items sequentially over the timesteps, maximizing the cumulative rewards of the whole recommendation procedure.

• Agent. In reinforcement learning, the learner or decision maker is called the agent. Here the recommender system can be seen as the agent: it tries to recommend the best items to the user and to maximize the user's satisfaction, which is analogous to the typical RL setting in which an agent tries to maximize the reward by interacting with the environment.

• Environment. The environment is everything outside the agent, including the users of the system and the items. After the agent takes an action at a state, meaning that the recommender system outputs a page of items to the user, the environment gives its feedback and changes to the next state, meaning that the user browses the list, gives his feedback, and changes to another user.

• States. A state (S) is the representation of the user's positive interaction history with the recommender, as well as her demographic information (if it exists).

• Transition. The state is modeled as the representation of the user's positive interaction history. Hence, once the user's feedback is collected, the state transition (P) is determined.

• Reward. Given the recommendation based on the action a and the user state (S), the user will provide his feedback, such as buying or not. The recommender receives an immediate reward (R) according to the user's feedback, and we use different reward functions in the training sessions and the test sessions.

Fig. 2. The structure of the recommender system

During the training sessions, having the data of the click history and the labels of the exposed items, we define the reward function as follows:

\[
\mathrm{act}_i =
\begin{cases}
1, & \text{if item } i \text{ is in the list of click histories and the random decimal is larger than } 0.3 \\
1, & \text{if item } i \text{ is in the list of exposed items and the label is } 1 \\
1, & \text{if item } i \text{ is neither in the list of exposed items nor in the list of histories, and the random decimal is larger than } 0.7 \\
0, & \text{otherwise}
\end{cases}
\tag{2}
\]

where the random decimal is chosen uniformly at random from [0,1], t = 3*(891+1358+2641), the act of the (i+1)-th and (i+2)-th rows of items (if they exist) will be 0 if not all the items of the i-th row are sold out, and price_ave is the average price of all the items that can be shown in the same list.

(3)

During the test process, without the labels of the exposed items, we define the reward function as follows:

\[
\mathrm{act}_i =
\begin{cases}
1, & \text{if item } i \text{ is neither in the list of exposed items nor in the list of histories, and the random decimal is larger than } 0.7 \\
0, & \text{otherwise}
\end{cases}
\tag{5}
\]
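The act rules of the two reward functions above transcribe directly into code. A minimal sketch, assuming the "random decimal" is drawn once per item; the argument names (`click_history`, `exposed_items`, `labels`) are ours, not the authors':

```python
def act_train(item, click_history, exposed_items, labels, rnd):
    """Training-time act rule of Eq. (2); rnd is the random decimal in [0, 1]."""
    if item in click_history and rnd > 0.3:
        return 1
    if item in exposed_items and labels.get(item) == 1:
        return 1
    if item not in exposed_items and item not in click_history and rnd > 0.7:
        return 1
    return 0

def act_test(item, click_history, exposed_items, rnd):
    """Test-time act rule of Eq. (5): only the unseen-item case remains."""
    if item not in exposed_items and item not in click_history and rnd > 0.7:
        return 1
    return 0
```

Note how the test-time rule is simply the training-time rule with the two label-dependent cases removed, since exposure labels are unavailable at test time.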
The output of our network is a raw action vector, which represents the relationship between the user-portrait and the item-portrait. We choose the top 3 items according to the inner product of the sub-action vector and the item embeddings.
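This inner-product slate selection can be sketched as follows; the embedding matrix and sub-action vector below are random stand-in data, and the sizes (50 items, dimension 8) are our assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(50, 8))  # 50 hypothetical items, embedding dim 8
sub_action = rng.normal(size=8)             # one sub-vector of the raw action

scores = item_embeddings @ sub_action       # inner product of action with each item
top3 = np.argsort(scores)[-3:][::-1]        # indices of the 3 highest-scoring items
```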
Fig. 3. The structure of DDPG used for training
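One piece of machinery in any DDPG trainer is the pair of target networks for actor and critic, updated softly each step as θ′ ← τθ + (1 − τ)θ′. A minimal sketch with plain lists standing in for network weights; τ = 0.005 is a common choice in the DDPG literature, not a value stated in this paper:

```python
TAU = 0.005  # soft-update rate (assumption; not specified by the authors)

def soft_update(target, online, tau=TAU):
    """Move each target weight a small step toward its online counterpart."""
    return [tau * w + (1.0 - tau) * t for w, t in zip(online, target)]

target = [0.0, 0.0]
online = [1.0, 2.0]
target = soft_update(target, online)  # approximately [0.005, 0.01]
```

The slow-moving targets stabilize the critic's bootstrapped value estimates, which is why DDPG-style training tends to diverge without them.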
As shown in Fig. 6, the training reward shows an upward trend before 10000 episodes, but after that the reward fluctuates and decreases. Considering the uncomplicated network structure, we choose the weights after 100000 training sessions to prevent over-fitting, and the result shows satisfactory performance.

As for the evaluation of our recommender system, we use the pre-trained large-scale RL environment designed by FUXI AI Lab, built from massive data collected under real circumstances, to accomplish the testing. Using the 206096 users' profiles provided in track2_testset.csv, we get a list of items for each user and input it into the virtual environment. A score is then given according to the purchasing outcome; the scores of the models mentioned are listed in TABLE I. For comparison, we take as baselines the score of using logged offline actions and the model trained with DDPG in an LSTM-based simulated environment.

TABLE I. THE RESULT OF THE RECOMMENDER SYSTEM

Models                                         Score
Trained after 60000 episodes               432511736
Trained after 80000 episodes               506765120
Trained after 100000 episodes             1115440564
Trained after 120000 episodes              303672114
Using logged offline actions               770378225
LSTM-based environment simulator + DDPG   1033481948

As we can see from TABLE I, the result corresponds to our expectation: the model with the weights trained after 100000 episodes performs best and outperforms the baselines to a large extent. In the competition, the score of UCSRDRL ranked third.

VII. SUMMARY AND CONCLUSIONS

In this competition, we propose a deep reinforcement learning based recommender system with a state representation module, named UCSRDRL, and utilize it to perform the recommendation task. Unlike conventional studies, this method treats the recommendation as a sequential decision-making process and adopts an "Actor-Critic" learning scheme, which can take both the immediate and long-term rewards into account. A state representation module is incorporated, which can explicitly model the interactions between users and items.

We conduct the experiments on the datasets using the system, and according to the training reward we choose the model with the weights trained after 100000 episodes. To evaluate the quality of our recommender system, we perform the evaluation using the pre-trained large-scale RL environment designed by FUXI AI Lab, and the result is in line with our expectation. The recommender system built with deep reinforcement learning integrated with a state representation module outperforms the baseline systems, and the quality of recommendation is improved to a large extent.

ACKNOWLEDGMENT

This material is based upon work supported by the FUXI AI Lab, Netease.

REFERENCES

[1] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 2. MIT Press, Cambridge, 2017.
[2] Thorsten Joachims, Dayne Freitag, Tom Mitchell, et al. Webwatcher: A tour guide for the world wide web. In IJCAI (1), pages 770-777. Citeseer, 1997.
[3] Anongnart Srivihok and Pisit Sukonmanee. E-commerce intelligent agent: personalization travel support agent using Q-learning. In Proceedings of the 7th International Conference on Electronic Commerce, pages 287-292, 2005.
[4] Nima Taghipour, Ahmad Kardan, and Saeed Shiry Ghidary. Usage-based web recommendations: a reinforcement learning approach. In Proceedings of the 2007 ACM Conference on Recommender Systems, pages 113-120, 2007.
[5] Tariq Mahmood, Ghulam Mujtaba, and Adriano Venturini. Dynamic personalization in conversational recommender systems. Information Systems and e-Business Management, 12(2):213-238, 2014.
[6] Elad Liebman, Maytal Saar-Tsechansky, and Peter Stone. DJ-MC: A reinforcement learning agent for music playlist recommendation. arXiv preprint arXiv:1401.1880, 2014.
[7] Aniruddh Raghu, Matthieu Komorowski, Imran Ahmed, Leo Celi, Peter Szolovits, and Marzyeh Ghassemi. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602, 2017.
[8] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), 2016.
[9] Tom Schaul, John Quan, Ioannis Antonoglou, et al. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[10] Shi-Yong Chen, Yang Yu, Qing Da, et al. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1187-1196, 2018.
[11] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pages 167-176, 2018.
[12] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, et al. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
[13] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[14] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 95-103, 2018.
[15] Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, and Yuzhou Zhang. Deep reinforcement learning based recommendation with explicit user-item interactions modeling. arXiv preprint arXiv:1810.12027, 2018.
[16] Jianhua Han, Yong Yu, Feng Liu, Ruiming Tang, and Yuzhou Zhang. Optimizing ranking algorithm in recommender system via deep reinforcement learning. In 2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM), pages 22-26. IEEE, 2019.
[17] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456-464, 2019.
[18] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
[19] Kai Wang, Zhene Zou, Qilin Deng, Yue Shang, Minghao Zhao, Runze Wu, Xudong Shen, Tangjie Lyu, and Changjie Fan. RL4RS: A real-world benchmark for reinforcement learning based recommender system. arXiv preprint arXiv:2110.11073.