
Neurocomputing 547 (2023) 126314


A deep actor critic reinforcement learning framework for learning to rank

Vaibhav Padhye *, Kailasam Lakshmanan
Department of Computer Science, IIT BHU, Varanasi, India

* Corresponding author. E-mail address: vaibhavpadhye10@gmail.com (V. Padhye).

ARTICLE INFO

Article history:
Received 5 July 2022
Revised 14 December 2022
Accepted 7 May 2023
Available online 18 May 2023
Communicated by Zidong Wang

Keywords:
Reinforcement learning
Learning to Rank
Deep reinforcement learning
Policy gradient

ABSTRACT

In this paper, we propose a Deep Reinforcement Learning based approach for the Learning to Rank task. Reinforcement Learning has been applied to the ranking task with good success, but the existing policy gradient based approaches suffer from noisy gradients and high variance, resulting in unstable learning. The natural policy gradient methods like REINFORCE perform Monte Carlo sampling, thus taking samples randomly, which leads to high variance. As the action space becomes large, i.e., with a very large number of documents, traditional RL techniques lack the complex model required in this scenario to deal with a large number of items. We propose a Deep Reinforcement Learning based approach for the learning to rank task to address these issues. By combining deep learning with the Reinforcement Learning framework, our approach can learn a complex function, as deep neural networks can provide significant function approximation. We use an actor-critic framework where the critic network can reduce variance by utilizing techniques such as delayed policy updates, clipped double Q-learning, etc. Also, due to the enormous space of the web, the most relevant results need to be returned for the corresponding query from within a large action space. Policy gradient algorithms have been effectively applied to problems in large action spaces (items) with deep neural networks as they do not rely on finding a value for each action (item) as in value-based methods. Further, we use an actor network with a CNN layer in the ranking process to capture the sequential patterns among the documents. We utilize the TD3 method to train our Reinforcement Learning agent with a listwise loss function, which performs delayed policy updates resulting in value estimates with lower variance. To the best of our knowledge, this is the first Deep Reinforcement Learning method applied in Learning to Rank for document retrieval. We performed experiments on the various LETOR datasets and showed that our method outperforms various state-of-the-art baselines.

© 2023 Elsevier B.V. All rights reserved.
1. Introduction

Learning to rank [1,2] generally addresses the problem where the system needs to produce a ranking over the input. Its numerous applications include web search ranking, recommendation systems, sentiment analysis, document retrieval, etc. In these problems, the learning function presents to the user a list of ranked items using machine learning methods [3–5]. During the learning phase, the learning to rank model tries to learn the ranking function from the query-document pairs and the corresponding relevance labels and subsequently builds a model to generate the relevance for the documents and rank them accordingly. Learning to rank techniques that directly optimize an evaluation metric, such as MAP, NDCG, etc., have become popular.

Reinforcement Learning has been effectively applied in the ranking task with good success [5–7]. Most of the existing RL-based approaches utilize natural policy gradient methods [8–13]; however, these policy gradient based approaches suffer from noisy gradients and high variance. Natural policy gradient algorithms like REINFORCE update the policy parameters through Monte Carlo updates. The random sampling from different stochastic policies leads to high variability in log probabilities and cumulative reward values, as the trajectories can deviate significantly from each other during training [14]. As a result, it leads to noisy gradients and unstable learning. Even other techniques like providing a baseline function are not as effective, as the state space is extremely large due to the large number of query-document pairs. Reinforcement learning (RL) has proven to be effective in problems of different domains with non-differentiable metrics through the policy gradient approach. For instance, successful results have been achieved in summarization [15], machine translation [16], image captioning [17], recommender systems [18], etc., using RL methods.

Bandit-based Learning to Rank methods based on implicit feedback, i.e., clicks, have also been proposed for the ranking task [19–23]. However, clicks are considered weak relevance signals because they are often affected by a number of noise sources and biases such as position bias. With position bias, higher ranked documents have more chance of being observed and therefore accumulate more clicks, even if they are not relevant [24–26]. In [27], the authors proposed a reinforcement learning based approach for query formulation. However, their approach comprised semantic matching between a query and a candidate term, which is ineffective in capturing relevance signals such as term importance, document frequency, document length, etc. [28,29].

Due to the vast size of the web search space, it is essential to find the most relevant documents for a specific query; hence ranking the search results is an important task in information retrieval. With a large action space, i.e., a very large number of items, value-based methods in Reinforcement Learning are less effective as it becomes infeasible to calculate the value function estimates of all the actions in a state. We address the issue of Learning to Rank over large action spaces with Deep Reinforcement Learning using policy gradient algorithms, which are known to be effective over large-scale datasets [30] as they directly learn the optimal policy from the usage data. Further, the powerful function approximation properties of deep neural networks can be very effective in large action spaces. Deep Reinforcement Learning (DRL) has achieved significant success by scaling up to complex problems which were earlier intractable with traditional RL methods due to the curse of dimensionality. DRL provides the powerful function approximation properties of deep neural networks and thus has been effectively applied in different complex problems with large action spaces (a large number of items), thus combining the advantages of Deep Learning and Reinforcement Learning [18]. On the other hand, policy gradient algorithms learn the optimal policy directly by learning the policy parameters through neural networks. Policy-based approaches are useful for dealing with large action spaces, including continuous spaces with an infinite number of actions. In policy-based techniques, the agent learns the policy directly and chooses an action from a probability distribution over the action space [31]. Also, for continuous actions, we can learn the parameters of a probability distribution and choose a value from it instead. Further, policy-based methods can learn stochastic policies, unlike value-based methods, and thus handle the exploration/exploitation dilemma automatically.

When used in RL, the fundamental constraints of Deep Learning, such as instability and unpredictability, are exacerbated due to correlated (non-i.i.d.) sample sets, estimation variance, and bias in function approximation [32]. Methods such as DDPG also update the Q-value in a similar way to DQN, which makes them prone to overestimating Q-values for actions, resulting in higher bias and instability in convergence. When updating the critic, deterministic policy techniques have a tendency to produce target values with a high variance, which occurs due to overfitting to the skews in the value estimate [33,14].

Further, when the action space is high dimensional, as in a web search task with millions of documents, the variance in gradient estimation becomes large, leading to slow convergence of the algorithm and an increase in sample complexity [23]. To address these issues, we propose a Deep Reinforcement Learning based methodology for the learning to rank task by modeling the ranking process as a Markov Decision Process. We propose our solution, DRLRank, based on the state-of-the-art algorithm twin-delayed DDPG (TD3) [34], to improve on these shortcomings by employing techniques such as clipped double Q-learning, delayed policy updates, etc., for training the RL agent. In order to reduce the variance, TD3 performs noise regularization by adding small random noise averaged over the mini-batches. With the actor-critic architecture, we use a CNN layer in the state representation module for generating a diverse list of items in the ranking process. By feeding the document vectors into the "image", the CNN layer learns the sequential features of the documents from the embedded image [35]. Also, most of the existing methods in Learning to Rank ignore the dependence between the documents [36]. In our solution, we utilize the similarity between the documents during the state-action generation process as described in Section 4.1. To the best of our knowledge, this is the first Deep Reinforcement Learning method applied in Learning to Rank for document retrieval.

We summarize the contributions of our work as follows:

1. We propose a Deep Reinforcement Learning based approach for the learning to rank problem. We address issues such as high variance and noisy gradients associated with other Reinforcement Learning approaches with a state-of-the-art TD3-based Learning to Rank algorithm utilizing techniques such as delayed policy updates, clipped double Q-learning, etc.
2. We combine the Reinforcement Learning framework with deep learning; thus, our model is able to learn a complex function as deep neural networks can provide significant function approximation. Further, we use a policy-based approach for large-scale ranking, as policy gradient algorithms have been effectively applied to problems in large action spaces (a large number of items) since they do not rely on finding a value for each action (item) like value-based methods.
3. We conduct extensive experiments on the LETOR datasets for different evaluation metrics such as NDCG and MAP. Experimental results demonstrate that our proposed solution is able to achieve better performance than various state-of-the-art baselines.

The rest of the paper is organized as follows. We present the literature survey and other important work in Section 2. In Section 3, we give an overview of Deep Reinforcement Learning and actor-critic methods. We describe our approach in detail in Section 4. Section 5 presents the experimental work, including experimental settings, baseline methods, and the experimental results of our proposed method. Finally, Section 6 provides the conclusion and the future scope related to our work.

2. Related Work

Learning to Rank methods are generally grouped into pointwise [37], pairwise [38,39] or listwise methods [40,41]. Pointwise methods for the Learning to Rank problem are generally classification or regression techniques and are thus reduced to predicting a class or relevance label for the incoming document. Any general classification or regression technique could be applied for pointwise based approaches. Pairwise methods take pairs of documents of different relevance and learn a loss function to perform the classification. Listwise methods instead take the entire list of elements as instances during the learning process and perform the prediction over the list through the trained model. Different machine learning based methods have been proposed that directly optimize an evaluation metric such as Precision, NDCG, etc., to learn the ranking function during the training process [4,42–45]. Support vector machine based methods such as SVMMAP [43] and SVMNDCG [4] have been applied to the task by optimizing the different information retrieval metrics, MAP and NDCG, respectively. The authors in [3] applied the boosting algorithm AdaRank to the Learning to Rank problem by optimizing the NDCG and Precision metrics. In [22], a bandit-based method was proposed based on the Cascade model for the ranking of web pages.

Deep Learning based methods for learning to rank have been proposed recently [29,46]. Recently, reinforcement learning [31] methods have been applied to the learning to rank problem [8,47,48,11]. In [8], the authors proposed a Reinforcement Learning based solution for the Learning to Rank task using a policy gradient algorithm, improving the performance over the NDCG metric. A multi-armed bandit approach was proposed in [42] which utilized user click behavior for ranking web documents. A dual-agent bandit game modeled as a POMDP was proposed in [49] for the session search problem. In [48], the authors proposed a Monte Carlo based exploratory algorithm using tree search. A dueling bandit based online learning framework for real-time learning from implicit feedback was proposed in [20] using pairwise comparisons. Further, other Reinforcement Learning models have been applied successfully in ranking and search result diversification [48,11,50]. A multi-page search scenario formalizing the ranking of documents as a Markov Decision Process was proposed in [51]. In [23], a log-document-based re-ranking problem modeled as a POMDP using the user's click behaviors was proposed. A multi-agent Reinforcement Learning based solution for Learning to Rank was proposed in [10], modeling each document as an agent and the ranking process as an interactive MDP between the agents.

3. Background

3.1. Deep Reinforcement Learning

Reinforcement Learning [31] is a semi-supervised machine learning approach that uses a feedback mechanism to train the learning agent. In RL, the training agent learns in an interactive environment by trial and error using feedback it receives from its own actions and experiences. In reinforcement learning, the agent learns a policy that can maximize the long-term reward by interacting with the environment. In RL, the problem is generally modeled as a Markov Decision Process (MDP). An MDP consists of a tuple of five elements $(S, A, P, R, \gamma)$, where $S$ is the set of states, $A$ is the set of actions, $P(s_{t+1}|s_t, a): S \times A \times S \to \mathbb{R}$ is the transition probability of reaching state $s_{t+1}$ after executing action $a$ in state $s_t$, $R(s, a): S \times A \to \mathbb{R}$ is the reward the agent receives after executing action $a$ in state $s_t$, and $\gamma$ denotes the discount factor. Deep Reinforcement Learning (DRL) incorporates the functionality of deep learning into Reinforcement Learning algorithms [52]. Deep RL algorithms can process very large input spaces and have been used to solve a diverse range of complex decision problems that were previously too complex for traditional RL techniques [53,54]. Deep RL techniques have been applied to various new applications in domains such as robotics, finance, video games, natural language processing, computer vision, etc. Deep reinforcement learning algorithms utilize a deep learning framework for solving complex problems, representing the policy or value functions as a neural network and integrating them with RL techniques, creating specialized algorithms that perform well in this setting. They can be used in various practical decision-making problems where the states of the MDP are high-dimensional.

3.1.1. Actor-critic Methods

Policy gradient algorithms [6] optimize the policy directly rather than evaluating value functions for each state-action pair. The policy is generally parameterized with $\theta$, written $\pi_\theta(a|s)$. Let $\pi_\theta$ denote a policy with parameters $\theta$, and $J(\pi_\theta)$ denote the expected finite-horizon undiscounted return of the policy. The objective of a policy gradient algorithm is to find the optimal $\theta$ that maximizes the cumulative return over all trajectories. It can be formulated as $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is a trajectory and $R(\tau)$ denotes the reward over the trajectory.

The gradient of $J(\pi_\theta)$, also known as the naive policy gradient [7], is

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\Big], \qquad (1)$

where $G_t$ denotes the return of the current trajectory. We can expand the above equation as

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s_0,a_0,\ldots,s_t,a_t}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\Big]. \qquad (2)$

Using the equality $\mathbb{E}_{s_0,a_0,\ldots,s_t,a_t}[G_t] = Q(s_t, a_t)$, we can formulate the above equation as

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s_0,a_0,\ldots,s_t,a_t}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q(s_t, a_t)\Big] = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q(s_t, a_t)\Big]. \qquad (3)$

The gradient of the policy performance, $\nabla_\theta J(\pi_\theta)$, is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms. The Q value can be learned by using a neural network to parameterize the Q function, from which we get the actor-critic algorithm, with $v(s)$ denoting the value function of the state $s$ and $\gamma$ representing the discount factor. The actor-critic algorithm uses two separate modules for calculating the policy gradient: the policy and the value function. The actor-critic framework thus consists of two models:

- Actor: The actor module decides what action is to be taken and updates the actor parameters in the direction suggested by the critic.
- Critic: The critic module evaluates whether the action performed was good or not by updating the value function parameters. The actor-critic framework is shown in Fig. 1.

Fig. 1. Actor Critic framework.
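To make Eqs. (1)–(3) concrete, the following is a minimal Monte Carlo policy-gradient (REINFORCE-style) sketch for a small discrete action space. It is our own illustration rather than the authors' implementation; the network sizes, optimizer, and the shape of the trajectory data are assumptions.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Illustrative stochastic policy: state vector -> action probabilities."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One Monte Carlo update of Eq. (1): grad J = E[sum_t grad log pi(a_t|s_t) G_t].

    `trajectory` is a list of (state, action, reward) tuples from one episode.
    """
    returns, g = [], 0.0
    for _, _, reward in reversed(trajectory):
        g = reward + gamma * g              # discounted return G_t
        returns.append(g)
    returns.reverse()

    loss = torch.tensor(0.0)
    for (state, action, _), g_t in zip(trajectory, returns):
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        loss = loss - torch.log(probs[action]) * g_t   # minimizing -log pi * G ascends J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because $G_t$ is estimated from a single sampled trajectory, this estimator exhibits exactly the high variance discussed in Section 1, which motivates the actor-critic and TD3-based design used later in the paper.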

4. Problem Statement

A general Learning to Rank problem is composed of $N$ labeled queries. Each query has associated with it different documents and their corresponding relevance judgement labels. We can formally describe it as $\{Q_i, X_i, Y_i\}_{i=1}^{N}$, where $Q_i$ denotes the query and $X_i = \{x_1, \ldots, x_k\}$ and $Y_i = \{y_1, \ldots, y_k\}$ denote the candidate documents and the corresponding relevance labels associated with it, respectively. After the user submits a query, they are provided with the corresponding list of documents. The goal in LTR, thus, is to rank the candidate documents according to their relevance and provide the list to the user.

4.1. Formulation of Learning to Rank as MDP

We formulate the Learning to Rank problem in a Markov Decision Process setting as described in Section 3.1. At each time step $t$, the agent observes the particular state $s_t$ and uses it to perform the action $a_t$ following the policy $\phi_t$. It then obtains the reward $r_t$. The agent then moves to the next state $s_{t+1}$. We further describe the different components of the MDP as follows:

State: The state space $S$ is a set of states which describe the environment. The state $s_t$ at time $t$ is $[X_t, D_t]$, where $X_t$ comprises the list of $k$ documents and $D_t$ is the candidate set of all items. For designing the state space, initially, we combine the vector list of $k$ documents into a matrix. We then pass the matrix through a Convolutional Neural Network layer [55] to extract the sequential patterns among the documents by capturing the features of the image [35]. The output of the convolution module is then fed into two fully-connected ReLU network layers to obtain higher-level features, $v_t$. We then perform the cross product with the previous state vector $s_{t-1}$ to obtain the current state $s_t$. Initially, the previous state vector $s_0$ is taken as a vector of all 1's. The state representation module is shown in Fig. 2.

Fig. 2. State Representation.
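A minimal sketch of this state-representation idea follows: the $k$ document vectors are stacked into a matrix, treated as a one-channel "image", passed through a convolutional layer and two fully-connected ReLU layers, and combined with the previous state vector. The kernel size, channel count, and the element-wise product used to combine $v_t$ with $s_{t-1}$ (the text describes a "cross product") are our illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Encode a k x d document matrix into a state vector s_t."""
    def __init__(self, k, feat_dim, state_dim=64):
        super().__init__()
        # Treat the k x d document matrix as a one-channel "image".
        self.conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
        self.fc = nn.Sequential(
            nn.Linear(8 * k * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim), nn.ReLU(),
        )

    def forward(self, doc_matrix, prev_state):
        # doc_matrix: (k, feat_dim); prev_state: (state_dim,), a vector of ones initially.
        x = doc_matrix.unsqueeze(0).unsqueeze(0)     # (1, 1, k, feat_dim)
        x = torch.relu(self.conv(x)).flatten()
        v_t = self.fc(x)                             # higher-level features v_t
        return v_t * prev_state                      # combine with s_{t-1} (element-wise here)

# Usage: s_0 is a vector of all 1's, as in the paper; 46 features per document (LETOR-style).
enc = StateEncoder(k=10, feat_dim=46)
s0 = torch.ones(64)
s1 = enc(torch.randn(10, 46), s0)
```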
such as clipped double-q learning, delayed policy updates, etc., for
the intermediate action, a . We then use a , to generate the action,
training the RL agent. It uses noise regularization for smoothing the
i.e., the ranking list of k documents as described in Algorithm 1.
target policy in order to reduce the variance. Further, TD3 uses
Reward: After generating the action i.e. the list, the environ-
clipped double Q learning where it takes the smallest value of
ment than provides the feedback to the agent in terms of reward.
the two critic networks for the underestimation of Q values. This,
In the mdp formulation for the ranking problem, the reward can
coupled with the delayed update, results in stable approximation
be viewed as an assessment of the selected document’s relevance
and lower bias. TD3 is an off-policy method that stores historical
to the query. The reward then is estimated as the increase in the
events in an experience replay buffer, then randomly samples tran-
value of DCG betwen the list at time step t and t + 1. For eg, let
sitions from it and feeds the sample data to actor and critic net-
the documents be D1 ; D2 ; . . . . . . ; D10 and their corresponding rele-
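The DCG formula and the DCG-gain reward can be checked with a short helper. This is a sketch of the stated formula, not the authors' implementation; `list_t` and `list_t1` are hypothetical relevance lists for consecutive time steps.

```python
import math

def dcg(relevances):
    """DCG_p = sum_i rel_i / log2(i + 1), with ranks i starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

# Relevance judgements from the example in the text.
rels = [1, 0, 2, 1, 0, 0, 0, 2, 1, 1]
print(round(dcg(rels), 4))

def dcg_gain_reward(list_t, list_t1):
    """Reward at step t: increase in DCG between the list at t and at t+1."""
    return dcg(list_t1) - dcg(list_t)
```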
works, thus also improving the efficiency of sample utilization.
vance judgements be ½1; 0; 2; 1; 0; 0; 0; 2; 1; 1. The DCG is then cal-
P We describe the actor-critic architecture that mitigates these
culated with the formula DCGp ¼ pi¼1 logrelðiþ1Þ i
, where i denotes issues in TD3 below. The algorithm is presented in Algorithm 2.
2

the index of the document, reli denotes the relevance of the ith doc- The algorithm takes as input query document pairs and initializes
ument. Using this method, the reward function can directly opti- the policy and target parameters (Line 1–2). The algorithm
mize the IR evaluation metric, which has proven to be effective observes the state from the environment and then uses the actor
in recent works. network to generate the sub-action (Line 4–5). Further, it then
Transition: After receiving the reward, the agent moves to the obtains the action, i.e., a list of ranked documents from Algorithm 1
next state stþ1 ¼ ½at ; Dt n at . We use the positive feedback for the (Line 6). The algorithm then presents the list to the user, receives
documents for transitioning to the next state. The documents the reward, and moves to the next state (Line 7–9). The algorithm
which the user finds relevant are to appended to the next state then stores the state, next state, reward, and action in the replay
and we remove the same number of documents from the current buffer (Line 10). The algorithm computes target action and target
state. For e.g., let the current state st be fd1 ; d2 ; . . . :; d10 g, and the Q value from the sampled transition using clipped double Q learn-
action at be list of documents fa1 ; a2 ; . . . a10 g. If the documents rel- ing and then uses Gradient Descent to update Q-function (Line 11–
evant to the query are a4 and a8 , then the next state stþ1 will be 14). Finally, it updates policy and target networks (Line 15–20).
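The transition described above can be sketched as a small helper that appends the relevant documents and drops the same number from the current state; the data structures (plain lists of document ids) are illustrative assumptions.

```python
def transition(state_docs, action_list, relevant_ids):
    """Build s_{t+1} from s_t and the presented list a_t.

    state_docs:   document ids currently in the state
    action_list:  ranked document ids shown to the user
    relevant_ids: ids the user found relevant (positive feedback)
    """
    liked = [d for d in action_list if d in relevant_ids]
    if not liked:                                    # no relevant documents: state unchanged
        return list(state_docs)
    # Append liked documents, drop the same number of documents from the current state.
    return liked + list(state_docs[len(liked):])

# Example from the text: relevant documents a4 and a8.
s_t = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
a_t = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"]
print(transition(s_t, a_t, {"a4", "a8"}))            # ['a4', 'a8', 'd3', ..., 'd10']
```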

4.2. Algorithm

In order to address the issues related to variance and instability described in Section 1, we propose an algorithm, DRLRANK, based on the state-of-the-art algorithm twin-delayed DDPG (TD3) [34], to improve on these shortcomings by applying techniques such as clipped double Q-learning, delayed policy updates, etc., for training the RL agent. It uses noise regularization for smoothing the target policy in order to reduce the variance. Further, TD3 uses clipped double Q-learning, where it takes the smaller value of the two critic networks, favoring underestimation of Q values. This, coupled with the delayed update, results in a stable approximation and lower bias. TD3 is an off-policy method that stores historical events in an experience replay buffer, then randomly samples transitions from it and feeds the sampled data to the actor and critic networks, thus also improving the efficiency of sample utilization.

We describe the actor-critic architecture that mitigates these issues in TD3 below. The algorithm is presented in Algorithm 2. The algorithm takes as input query-document pairs and initializes the policy and target parameters (Lines 1–2). The algorithm observes the state from the environment and then uses the actor network to generate the sub-action (Lines 4–5). Further, it then obtains the action, i.e., a list of ranked documents, from Algorithm 1 (Line 6). The algorithm then presents the list to the user, receives the reward, and moves to the next state (Lines 7–9). The algorithm then stores the state, next state, reward, and action in the replay buffer (Line 10). The algorithm computes the target action and target Q value from the sampled transitions using clipped double Q-learning and then uses gradient descent to update the Q-functions (Lines 11–14). Finally, it updates the policy and target networks (Lines 15–20).

Algorithm 1: Generating Action: Ranked Document List
1: Input: Intermediate action $\tilde{a}$, list of candidate documents $X_i$
2: Output: Action list $a_t$
3: // Generate the list of documents from the intermediate action
4: for each $x_i \in X_i$ do
5:     compute $\tilde{a} \cdot x_i$
6: end for
7: // Select the top $k$ documents by the scores from line 5 as $a_t$
8: return $a_t$
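Algorithm 1 amounts to scoring every candidate document against the intermediate action $\tilde{a}$ and keeping the top $k$ by score; a minimal NumPy sketch (our own, with illustrative shapes) is shown below.

```python
import numpy as np

def generate_action(sub_action, candidate_docs, k):
    """Score each document x_i by the dot product a~ . x_i and return the indices
    of the k highest-scoring documents, ordered by descending score (the list a_t)."""
    scores = candidate_docs @ sub_action          # one score per candidate document
    top_k = np.argsort(-scores)[:k]               # indices in descending score order
    return top_k, scores[top_k]

# Illustrative usage: 100 candidate documents with 46 LETOR-style features each.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 46))
a_tilde = rng.normal(size=46)
ranked_ids, ranked_scores = generate_action(a_tilde, docs, k=10)
```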

Algorithm 2: DRLRANK
1: Input: Query-document pairs $\{Q_i, X_i, Y_i\}_{i=1}^{N}$, initial policy parameters $\phi$, Q-function parameters $\theta_1, \theta_2$, empty replay buffer $\mathcal{D}$
2: Set target parameters equal to main parameters: $\phi_{targ} \leftarrow \phi$, $\theta_{targ,1} \leftarrow \theta_1$, $\theta_{targ,2} \leftarrow \theta_2$
3: for $t$ in range $T$ do
4:     Observe the current state $s$
5:     Select sub-action $\tilde{a} = \mathrm{clip}(\mu_\phi(s) + \epsilon, a_{Low}, a_{High})$, where $\epsilon \sim \mathcal{N}$   // Generate the sub-action using the policy network
6:     Generate the action $a$, a list of ranked documents, from Algorithm 1
7:     Execute $a$ in the environment   // Present the list to the user
8:     Observe the next state $s'$ and reward $r$
9:     Store $(s, a, r, s')$ in replay buffer $\mathcal{D}$
10:    If $s'$ is terminal, reset the environment state
11:    Randomly sample a batch of transitions $B = \{(s, a, r, s', d)\}$ from $\mathcal{D}$
12:    Compute target actions
           $a'(s') = \mathrm{clip}\big(\mu_{\phi_{targ}}(s') + \mathrm{clip}(\epsilon, -c, c),\, a_{Low},\, a_{High}\big), \quad \epsilon \sim \mathcal{N}(0, \sigma)$
13:    Compute targets
           $y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\theta_{targ,i}}(s', a'(s'))$
14:    Update the Q-functions by one step of gradient descent using
           $\nabla_{\theta_i} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \big(Q_{\theta_i}(s, a) - y(r, s', d)\big)^2 \quad \text{for } i = 1, 2$
15:    if $t \bmod d = 0$ then   // Delayed policy update
16:        Update the policy by one step of deterministic gradient ascent using
               $\nabla_\phi J(\phi) = \frac{1}{|B|} \sum_{j} \nabla_a Q_{\theta_1}(s, a)\big|_{a=\pi_\phi(s)}\, \nabla_\phi \pi_\phi(s)$
17:        Update the target networks:
18:            $\theta_{targ,i} \leftarrow \rho\, \theta_{targ,i} + (1 - \rho)\, \theta_i$
19:            $\phi_{targ} \leftarrow \rho\, \phi_{targ} + (1 - \rho)\, \phi$
20:    end if
21: end for
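Lines 9–11 of Algorithm 2 rely on an experience replay buffer; a minimal sketch of such a buffer (the capacity and uniform sampling scheme are our assumptions) follows.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions and samples uniform mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```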


Actor Network: The actor defines the policy. It takes the state as input and outputs the action. The actor network iteratively updates its parameters $\phi$ and selects the current action $a$ according to the current state $s$. We define our actor network as

$f_\phi : s \to a. \qquad (4)$

We use a deep neural network to design the network as described in Section 4.1. A target network is responsible for selecting the optimal next action $a'$ according to the next state $s'$ sampled from the experience replay buffer. We select the top $k$ documents having the highest score to obtain the ranked list, $a_t = \arg\max_i \tilde{a} \cdot x_i$, where $x_i$ represents the document vectors. Gaussian noise is added to the action, which places the values in a range that the environment supports; it also adds exploration of the environment:

$\tilde{a} = \mathrm{clip}\big(\mu_{\theta_{targ}}(s') + \mathrm{clip}(\epsilon, -c, c)\big), \quad \epsilon \sim \mathcal{N}(0, \sigma). \qquad (5)$

Target network parameters $\phi'$ are periodically copied from the actor parameters $\phi$. For the actor network, the deterministic policy is used to optimize the parameters, and the policy gradient is defined as

$\nabla_\phi J(\phi) = \frac{1}{|B|} \sum_{j} \nabla_a Q_{\theta_1}(s, a)\big|_{a=\pi_\phi(s)}\, \nabla_\phi \pi_\phi(s). \qquad (6)$

The target network uses a deep function approximator to offer a steady objective during the learning process, resulting in improved convergence. However, it can lead to a sub-optimal policy due to errors in the observed states. The learning diverges when an erroneous policy is overestimated, and this is further aggravated as the policy keeps updating on states with errors. Thus the policy network needs to be updated less frequently than the value function to minimize error propagation. The policy network is, therefore, designed to update with a lower frequency than the value network. Less frequent policy changes result in a smaller variance in the Q-value function updates, resulting in a better policy.
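A sketch of an actor network matching the description in Section 4.1 — two fully connected ReLU layers followed by a tanh output that yields the intermediate action $\tilde{a}$ — is shown below; the layer widths and noise scale are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu_phi: state -> intermediate action a~ in [-1, 1]^action_dim."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

def noisy_action(actor, state, sigma=0.1, low=-1.0, high=1.0):
    """Exploration: add Gaussian noise to the deterministic action and clip it
    to the range the environment supports, as in line 5 of Algorithm 2."""
    with torch.no_grad():
        a = actor(state)
        a = a + sigma * torch.randn_like(a)
    return a.clamp(low, high)
```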
Critic Network: The aim of the critic network $Q_{\theta_i}$ is to predict the Q value based on the current state and action. The value function is utilized to provide feedback to the actor for the current action. In the TD3 algorithm, there are two critic networks and two critic target networks, which are updated less frequently than the critic models. The TD3 algorithm applies certain improvements to prevent the overestimation of the Q-values. It utilizes the concept of Double Q-learning, using two critics instead:

$y_1 = r + \gamma Q_{\theta_2}(s', \mu_\phi(s')), \qquad y_2 = r + \gamma Q_{\theta_1}(s', \mu_\phi(s')). \qquad (7)$

The clipped double Q-learning objective instead uses the minimum of the two estimated Q-values in order to avoid the overestimation bias:

$y = r + \gamma \min_{i=1,2} Q_{\theta_i}(s', \mu_\phi(s')). \qquad (8)$

This underestimation bias is not an issue since, unlike in the case of overestimation of Q values, low values are not propagated through the algorithm. This results in a more stable approximation, which improves the algorithm's overall stability. Finally, to avoid overfitting, the computation of Q-values has to be smoothed in order to resolve the trade-off between bias and variance. Consequently, a truncated normal distribution noise is applied to each action as a regularization, resulting in the adjusted target update:

$y = r + \gamma Q_{\theta'}(s', \mu_\phi(s') + \epsilon). \qquad (9)$

Thus, target policy smoothing mitigates the issue of exploiting incorrect actions when the Q-function approximator learns an incorrect sharp peak for some actions, which would otherwise lead to sub-optimal performance. The TD3 algorithm also delays the updates to the target networks for the actor and critics, updating them less frequently using Polyak averaging: $\theta_{targ,i} \leftarrow \rho\, \theta_{targ,i} + (1-\rho)\, \theta_i$, $\phi_{targ} \leftarrow \rho\, \phi_{targ} + (1-\rho)\, \phi$.
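The clipped double-Q target of Eq. (8), the smoothed target of Eq. (9), and the Polyak-averaged target updates can be sketched as a generic TD3-style computation. The critic modules `q1_targ`/`q2_targ` and the hyperparameter values are hypothetical placeholders, not the authors' code.

```python
import torch

def td3_target(reward, next_state, done, actor_targ, q1_targ, q2_targ,
               gamma=0.99, sigma=0.2, noise_clip=0.5, a_low=-1.0, a_high=1.0):
    """y = r + gamma * (1 - done) * min_i Q_targ_i(s', mu_targ(s') + clipped noise)."""
    with torch.no_grad():
        a_next = actor_targ(next_state)
        noise = (sigma * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(a_low, a_high)          # target policy smoothing
        q_min = torch.min(q1_targ(next_state, a_next),
                          q2_targ(next_state, a_next))          # clipped double Q-learning
        return reward + gamma * (1.0 - done) * q_min

def polyak_update(target_net, main_net, rho=0.995):
    """theta_targ <- rho * theta_targ + (1 - rho) * theta (delayed, soft target update)."""
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), main_net.parameters()):
            p_targ.mul_(rho)
            p_targ.add_((1.0 - rho) * p)
```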

5. Experiments

5.1. Evaluation datasets and experimental settings

We conducted experiments to test the performance of our algorithm on the traditional Learning to Rank datasets [56]: OHSUMED, Million Query Track 2007 (MQ2007), and Million Query Track 2008 (MQ2008). The datasets consist of a set of queries along with the corresponding sets of documents and relevance judgments. The three relevance labels in the datasets are relevant (2), partially relevant (1), and non-relevant (0). OHSUMED consists of 45 features, whereas MQ2007 and MQ2008 consist of 46 features. We used the standard LETOR features in all the experiments, with each document represented through the corresponding vector. Further statistics of the three datasets are described in Table 1. Each row in the datasets corresponds to a query-document pair represented through a vector; for example, query id, relevance label, feature 1, feature 2, ..., feature 45 represents a vector with a query id, its corresponding features, and a relevance label. The features of the datasets include the document's TF-IDF, IDF, BM25, PageRank, URL dwell time, etc.

We compared the results of our experiments with the following state-of-the-art baselines: RankSVM [39], RankNet [38], ListNet [41], AdaRank [3], A2C [57], DDPG [58], ACDRL [59] and MDPRank [8]. We used neural network based implementations for RankNet, ListNet and AdaRank. We employed two hidden layers for the neural networks and varied the number of nodes in the hidden layer from 50 to 200. The different baselines were trained with the Adam optimizer based on each model's respective ranking loss function. The hyperparameters for the various baselines were chosen using a grid search over the values described in Table 2. For RankSVM, we used a linear kernel with a regularization rate of $10^{-3}$. For the MDPRank algorithm, we used a learning rate of $10^{-3}$ and a discount rate of 0.99 and trained the RL agent for 1000 episodes. For A2C and DDPG we used a learning rate of $10^{-3}$ and a discount rate of 0.99. Further, we set the batch size to 128 and used a buffer size of 10000 for the DDPG algorithm. For our algorithm, we used a learning rate of $10^{-3}$ for the actor and $10^{-4}$ for the critic and a discount rate of 0.99. We set a learning rate of $10^{-4}$ for the actor network and $10^{-5}$ for the critic and used the Adam optimizer for the network parameters. For the ACDRL method, we used the standard parameters specified in the paper. We used 5-fold cross-validation on these datasets in all of the trials and reported the average of the five folds' results. We conducted our experiments on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz processor, x86_64 architecture, with 264 GB RAM.

Evaluation measures: We employ two common evaluation measures in IR to assess the performance of the proposed algorithm on the datasets, Normalized Discounted Cumulative Gain (NDCG) [60] and Mean Average Precision (MAP) [61]. In the experiments, we evaluated NDCG at positions 1, 3, 5, and 10.
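For reference, a minimal sketch of the two measures as commonly defined is given below; these are our own helper functions, not the evaluation scripts shipped with LETOR, and graded labels greater than 0 are treated as relevant for MAP.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@k for one query; `relevances` are graded labels in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))           # log2(i + 1), i = 1..k
    dcg = np.sum(rel / discounts)
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum(ideal / discounts[:ideal.size])
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(relevances):
    """AP for one query, with graded labels > 0 treated as relevant."""
    rel = np.asarray(relevances) > 0
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, rel.size + 1)
    return float(np.sum(precisions * rel) / rel.sum())

# MAP is the mean of the per-query average precision values.
```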


Table 1
Statistics of the OHSUMED, MQ2007 and MQ2008 datasets.

Dataset    Queries   Documents   Features   Relevance labels
OHSUMED    106       16,140      45         3
MQ2007     1692      69,623      46         3
MQ2008     784       9360        46         3

Table 2
Hyperparameter grid.

Dropout rate     0         0.2       0.4       0.8
Learning rate    10^-2     10^-3     10^-4     10^-5
Epochs           200       400       800       1600
Batch size       32        64        128       256
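The grid search mentioned in Section 5.1 can be expressed as a simple sweep over the values in Table 2. The sketch below assumes a hypothetical `train_and_validate` function that returns a validation score (e.g., NDCG@10); it is an illustration of the procedure, not the authors' tuning script.

```python
from itertools import product

grid = {
    "dropout": [0, 0.2, 0.4, 0.8],
    "lr": [1e-2, 1e-3, 1e-4, 1e-5],
    "epochs": [200, 400, 800, 1600],
    "batch_size": [32, 64, 128, 256],
}

def grid_search(train_and_validate):
    """Exhaustively evaluate every configuration in the grid and keep the best."""
    best_score, best_cfg = float("-inf"), None
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = train_and_validate(**cfg)      # hypothetical: returns validation NDCG
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```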

5.2. Experimental Results

The experimental results on OHSUMED, MQ2007 and MQ2008 are displayed in Table 3, Table 4 and Table 5, respectively. The evaluation measures are NDCG and MAP, with the NDCG metric evaluated at positions 1, 3, 5, and 10. For the OHSUMED dataset, it can be observed that our algorithm outperforms the different baselines for most NDCG measures except at NDCG@5. Similarly, for the MQ2007 and MQ2008 datasets, our algorithm performs better than the other baselines apart from NDCG@3 on the MQ2008 dataset. In OHSUMED, our algorithm shows significant improvement over the baselines for NDCG@1 and NDCG@10. For the MQ2007 and MQ2008 datasets, our algorithm outperforms all other algorithms on the different metrics apart from NDCG@3 and NDCG@5 on MQ2008. Overall, the improvement on the MQ2007 dataset is the largest among the three datasets for our algorithm. We speculate that the reason for this is that MQ2007 has the largest number of training documents and queries among the three datasets, so the Deep Reinforcement Learning agent is able to learn a more complex model compared to the other two. Also, the performance of the different baselines, including our algorithm, is more stable on MQ2007 and MQ2008 compared to the OHSUMED dataset. This behavior can also be interpreted from the query space of the three datasets; while MQ2007 has about 1700 queries and MQ2008 has around 800 queries, the OHSUMED dataset has only about 100 queries. The policy gradient method is able to learn a better model when much more sample data is available, which leads to lower variance and more stable learning with gradient descent. Further, we can observe that the AdaRank algorithm performs the worst on OHSUMED compared to the other datasets, while RankNet performs the worst on MQ2007. The reward distribution of our algorithm over the different datasets for different episodes is shown in Fig. 3. We can observe that our algorithm, DRLRank, is able to achieve stable learning for the three different datasets as the number of episodes increases. Further, we can see that our algorithm achieves better stability on MQ2007 from the reward distribution over the episodes. We compared the performance of our algorithm for varying discount factors and show the results for the MQ2007 dataset in Fig. 4.

Further, we show the performance of our algorithm on the NDCG metric for varying discount factors in Fig. 5. We can observe from the results that our algorithm gives the best reward and NDCG performance over the different datasets with a discount factor of 0.99. We applied significance testing (a two-tailed Student's t-test) [62] with p-value < 0.05, comparing the statistical significance of the results of our method against the different baselines. Our results were statistically significant for the majority of metrics on the different datasets (denoted by † in Tables 3–5).
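The two-tailed Student's t-test reported above can be reproduced with a standard library call; the sketch below assumes a paired test over per-query scores, and the score arrays are hypothetical placeholders.

```python
import numpy as np
from scipy import stats

# Hypothetical per-query NDCG@10 scores for DRLRank and one baseline (same queries).
drlrank_scores = np.array([0.46, 0.44, 0.41, 0.47, 0.43])
baseline_scores = np.array([0.44, 0.42, 0.40, 0.44, 0.42])

# Paired, two-tailed t-test over the shared set of queries.
t_stat, p_value = stats.ttest_rel(drlrank_scores, baseline_scores)
significant = p_value < 0.05
```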
In Figs. 6 and 7 we study the effect of other parameters, such as batch size and episode length, on the NDCG metric. We can observe that as the batch size increases, the performance of the model also increases. For MQ2007 and OHSUMED the model gives the best results for NDCG@1 at a batch size of 128, while for MQ2008 the best results are obtained at a batch size of 64. However, for the OHSUMED and MQ2008 datasets we can observe that the performance diminishes slightly beyond a batch size of 128. One reason could be that a reduced ability to generalize at higher batch sizes leads to the lower performance. Fig. 7 shows how the episode length affects the performance of the algorithm. We observe that performance increases initially with the increase in episode length on the different datasets. We can speculate that when the episode length is too small, the agent does not fully interact with the environment and does not gather sufficient experience for learning. We can also see that our algorithm has better performance on the MQ2007 dataset over the different ranges of episode length among the datasets. On the other hand, we can also observe that performance drops after a certain episode length, while the intermediate range gives the better results on the different datasets. We may state that the reason is the trade-off between exploration and exploitation, and thus an optimum episode length is required to maintain a balance of both.

Table 3
Results for different metrics on the OHSUMED dataset. Superscript † indicates statistical significance of results at p-value < 0.05.

Method     NDCG@1            NDCG@3            NDCG@5            NDCG@10           MAP
RankSVM    0.4921 ± 0.0024†  0.4711 ± 0.0013†  0.4423 ± 0.0046†  0.4554 ± 0.0039†  0.4321 ± 0.0027†
Ranknet    0.4803 ± 0.0064†  0.4601 ± 0.0014†  0.4344 ± 0.0044†  0.4392 ± 0.0076†  0.4223 ± 0.0013†
Listnet    0.5082 ± 0.0022†  0.4754 ± 0.0055   0.4361 ± 0.0014†  0.4488 ± 0.0066†  0.4176 ± 0.0038†
Adarank    0.4810 ± 0.0028†  0.4166 ± 0.0081†  0.4392 ± 0.0033†  0.4286 ± 0.0063   0.4209 ± 0.0010†
ACDRL      0.4921 ± 0.0017†  0.4622 ± 0.0028†  0.4215 ± 0.0062†  0.4302 ± 0.0020   0.4212 ± 0.0013†
A2C        0.4764 ± 0.0031†  0.4180 ± 0.0048†  0.4229 ± 0.0037†  0.4264 ± 0.0012   0.4166 ± 0.0059†
DDPG       0.5011 ± 0.0058†  0.4716 ± 0.0016†  0.4383 ± 0.0061†  0.4608 ± 0.0047†  0.4281 ± 0.0033†
MDPRank    0.5108 ± 0.0022†  0.4633 ± 0.0048†  0.4322 ± 0.0029†  0.4424 ± 0.0012†  0.4387 ± 0.0087†
DRLRank    0.5273 ± 0.0036   0.4782 ± 0.0019   0.4379 ± 0.0028   0.4626 ± 0.0044   0.4406 ± 0.0018


Table 4
Results for different metrics on the MQ2007 dataset. Superscript † indicates statistical significance of results at p-value < 0.05.

Method     NDCG@1            NDCG@3            NDCG@5            NDCG@10           MAP
RankSVM    0.3956 ± 0.0045†  0.3982 ± 0.0037†  0.4057 ± 0.0069†  0.4124 ± 0.0081†  0.4237 ± 0.0032†
Ranknet    0.3884 ± 0.0018†  0.3902 ± 0.0038†  0.3938 ± 0.0044†  0.4131 ± 0.0028†  0.4188 ± 0.0040†
Listnet    0.4002 ± 0.0026†  0.4072 ± 0.0055†  0.4143 ± 0.0048   0.4238 ± 0.0033†  0.4214 ± 0.0037†
Adarank    0.3877 ± 0.0054†  0.3954 ± 0.0081†  0.3921 ± 0.0049†  0.4126 ± 0.0091†  0.4221 ± 0.0022†
ACDRL      0.4023 ± 0.0036†  0.4116 ± 0.0077†  0.3910 ± 0.0071†  0.4112 ± 0.0041   0.4267 ± 0.0020†
A2C        0.3961 ± 0.0045†  0.4042 ± 0.0055†  0.4130 ± 0.0017   0.4258 ± 0.0021†  0.4287 ± 0.0029†
DDPG       0.4170 ± 0.0037†  0.4118 ± 0.0067†  0.4252 ± 0.0054†  0.4295 ± 0.0029†  0.4277 ± 0.0054†
MDPRank    0.4152 ± 0.0043†  0.4164 ± 0.0074†  0.4218 ± 0.0028†  0.4224 ± 0.0058†  0.4203 ± 0.0025†
DRLRank    0.4206 ± 0.0019   0.4243 ± 0.0027   0.4310 ± 0.0064   0.4353 ± 0.0053   0.4388 ± 0.0021

Table 5
Results for different metrics on the MQ2008 dataset. Superscript † indicates statistical significance of results at p-value < 0.05.

Method     NDCG@1            NDCG@3            NDCG@5            NDCG@10           MAP
RankSVM    0.3686 ± 0.0037†  0.4198 ± 0.0043†  0.4550 ± 0.0018†  0.2268 ± 0.0014†  0.4464 ± 0.0022†
Ranknet    0.3341 ± 0.0059   0.3902 ± 0.0077†  0.4321 ± 0.0016†  0.2172 ± 0.0022†  0.4408 ± 0.0029†
Listnet    0.3802 ± 0.0047†  0.4409 ± 0.0028†  0.4612 ± 0.0033†  0.2134 ± 0.0067†  0.4505 ± 0.0047†
Adarank    0.3878 ± 0.0038†  0.4434 ± 0.0022†  0.4781 ± 0.0015   0.2405 ± 0.0028†  0.4402 ± 0.0061†
ACDRL      0.3827 ± 0.0025†  0.4319 ± 0.0044†  0.4484 ± 0.0063†  0.2465 ± 0.0033   0.4508 ± 0.0059†
A2C        0.3811 ± 0.0048†  0.4283 ± 0.0026†  0.4676 ± 0.0022   0.2101 ± 0.0011†  0.4449 ± 0.0062†
DDPG       0.3774 ± 0.0028   0.4378 ± 0.0016†  0.4723 ± 0.0038†  0.2551 ± 0.0048   0.4391 ± 0.0055†
MDPRank    0.3846 ± 0.0051†  0.4354 ± 0.0040†  0.4714 ± 0.0064†  0.2641 ± 0.0028   0.4481 ± 0.0033†
DRLRank    0.3984 ± 0.0059   0.4417 ± 0.0088   0.4682 ± 0.0034   0.2831 ± 0.0065   0.4706 ± 0.0067

Fig. 3. Reward Distribution.
Fig. 4. Varying Discount factor.

In Fig. 8, we further compare the performance of our algorithm with the other DRL algorithms, DDPG and A2C. We can observe that DRLRank converges better than the other two, while DDPG compares relatively better than A2C. One of the reasons could be the overestimation bias of Q values in DDPG. This occurs when the Q values estimated by the agent are larger than the actual values, so the RL algorithm overestimates the future rewards. Thus the policy might wrongly exploit the overestimated values and suffer from erroneous learning. In DRLRank, we utilize the clipped double Q-learning and target policy smoothing techniques to tackle the overestimation bias. Further, while baseline techniques such as the advantage function in A2C reduce the variance, which can result in improved performance, their effectiveness also reduces when the state-action space is high dimensional, as in the ranking task.

From the results we can infer that the DRLRank algorithm works well on LTR problems, and we can also observe that the more query-document data is available to the policy gradient algorithm, the better the performance becomes. This validates the notion that policy-based methods are more effective on large data sets, as there are more sample trajectories available for the algorithm, which reduces the variance of the experiments. We further believe the Deep RL algorithm can be extended to even larger scale datasets. We attribute the improvement of the algorithm to the deep RL paradigm, which is able to learn a complex model in a high-dimensional action space. Further, in the state-action generation process, the CNN based module is able to exploit the shared information between the documents, and as the training continues the algorithm is able to find more similar relevant documents. Comparing against the other Deep RL algorithms in Fig. 8, we can observe that DRLRank converges better.


Fig. 5. NDCG metric with different discount factors; x-axis: discount factor, y-axis: NDCG.
Fig. 6. Varying Batch Size.
Fig. 7. Varying Episode Length.

6. Conclusion

In this paper, we proposed a Deep Reinforcement Learning based approach for the Learning to Rank task. We formulated our problem as a Markov Decision Process and utilized a Deep Reinforcement Learning framework with a TD3-based algorithm for addressing the issues such as high variance and unstable learning associated with the natural policy gradient approaches. Also, a deep neural network architecture builds a much more complex model compared to a traditional RL-based solution, which can provide a stable approximation, faster learning, and more stability. For future work, we plan to extend this work with a multi-agent architecture framework for further incorporating diversified search results in the ranking task. In a multi-agent setting, different agents can each run a Reinforcement Learning agent and can communicate with other agents to achieve a shared goal in the ranking task.

Fig. 8. Comparing DRLRank with other DRL algorithms.

CRediT authorship contribution statement

Vaibhav Padhye: Conceptualization, Methodology. Kailasam Lakshmanan: Supervision, Writing - review & editing.

Data availability

Data will be made available on request.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] H. Li, Learning to Rank for Information Retrieval and Natural Language Processing, Morgan & Claypool Publishers, 2011.
[2] T.-Y. Liu, Learning to rank for information retrieval, Found. Trends Inf. Retr. 3 (3) (2009) 225–331.
[3] J. Xu, H. Li, AdaRank: A boosting algorithm for information retrieval, in: SIGIR '07, 2007, pp. 391–398.
[4] S. Chakrabarti, R. Khanna, U. Sawant, C. Bhattacharyya, Structured learning for non-smooth ranking losses, in: KDD '08, 2008, pp. 88–96.
[5] K. Hofmann, S. Whiteson, M. Rijke, Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval, Inf. Retr. 16 (1) (2013) 63–90.
[6] R.S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: Advances in Neural Information Processing Systems, Vol. 12, MIT Press, 2000.
[7] R.J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8 (3) (1992) 229–256.
[8] Z. Wei, J. Xu, Y. Lan, J. Guo, X. Cheng, Reinforcement learning to rank with Markov decision process, in: SIGIR '17, 2017, pp. 945–948.
[9] J. Feng, H. Li, M. Huang, S. Liu, W. Ou, Z. Wang, X. Zhu, Learning to collaborate: Multi-scenario ranking via multi-agent reinforcement learning, in: WWW '18, 2018, pp. 1939–1948.
[10] S. Zou, Z. Li, M. Akbari, J. Wang, P. Zhang, MarlRank: Multi-agent reinforced learning to rank, in: CIKM '19, 2019, pp. 2073–2076.
[11] L. Xia, J. Xu, Y. Lan, J. Guo, W. Zeng, X. Cheng, Adapting Markov decision process for search result diversification, in: SIGIR '17, 2017, pp. 535–544.
[12] A. Montazeralghaem, H. Zamani, J. Allan, A reinforcement learning framework for relevance feedback, ACM, 2020, pp. 59–68. https://doi.org/10.1145/3397271.3401099.
[13] J. Yao, Z. Dou, J. Xu, J.-R. Wen, RLPer: A reinforcement learning model for personalized search, ACM, 2020, pp. 2298–2308. https://doi.org/10.1145/3366423.3380294.
[14] R.S. Sutton, S. Singh, D. McAllester, Comparing policy-gradient algorithms, IEEE Transactions on Systems, Man, and Cybernetics.
[15] R. Paulus, C. Xiong, R. Socher, A deep reinforced model for abstractive summarization, arXiv preprint arXiv:1705.04304.
[16] M. Ranzato, S. Chopra, M. Auli, W. Zaremba, Sequence level training with recurrent neural networks, arXiv preprint arXiv:1511.06732.
[17] S.J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, V. Goel, Self-critical sequence training for image captioning, in: IEEE CVPR, 2017, pp. 7008–7024.
[18] X. Chen, L. Yao, J. McAuley, G. Zhou, X. Wang, A survey of deep reinforcement learning in recommender systems: A systematic review and future directions, arXiv preprint arXiv:2109.03540.
[19] F. Radlinski, R. Kleinberg, T. Joachims, Learning diverse rankings with multi-armed bandits, in: ICML '08, 2008, pp. 784–791.
[20] Y. Yue, T. Joachims, Interactively optimizing information retrieval systems as a dueling bandits problem, in: ICML '09, 2009, pp. 1201–1208.
[21] A. Slivkins, F. Radlinski, S. Gollapudi, Ranked bandits in metric spaces: Learning diverse rankings over large document collections, J. Mach. Learn. Res. 14 (1) (2013) 399–436.
[22] B. Kveton, C. Szepesvari, Z. Wen, A. Ashkan, Cascading bandits: Learning to rank in the cascade model, in: ICML, PMLR, 2015, pp. 767–776.
[23] H. Wang, S. Kim, E. McCord-Snook, Q. Wu, H. Wang, Variance reduction in gradient exploration for online learning to rank, in: SIGIR '19, 2019, pp. 835–844.
[24] B. Pan, H. Hembrooke, T. Joachims, L. Lorigo, G. Gay, L. Granka, In Google we trust: Users' decisions on rank, position, and relevance, Journal of Computer-Mediated Communication 12 (3) (2007) 801–823.
[25] T. Joachims, A. Swaminathan, T. Schnabel, Unbiased learning-to-rank with biased feedback, in: WSDM '17, 2017, pp. 781–789.
[26] S. Zhuang, Z. Qiao, G. Zuccon, Reinforcement online learning to rank with unbiased reward shaping, arXiv preprint arXiv:2201.01534.
[27] R. Nogueira, K. Cho, Task-oriented query reformulation with reinforcement learning, arXiv preprint arXiv:1704.04572.
[28] S. Clinchant, E. Gaussier, A theoretical analysis of pseudo-relevance feedback models, in: ICTIR '13, 2013, pp. 6–13.
[29] J. Guo, Y. Fan, Q. Ai, W.B. Croft, A deep relevance matching model for ad-hoc retrieval, in: CIKM '16, 2016, pp. 55–64.
[30] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, B. Coppin, Deep reinforcement learning in large discrete action spaces, arXiv preprint arXiv:1512.07679.
[31] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, The MIT Press, 2018.
[32] H. Dong, Z. Ding, R. Huang, S. Zhang, Deep Reinforcement Learning: Fundamentals, Research, and Applications, Springer Nature, 2020.
[33] T. Zhao, H. Hachiya, G. Niu, M. Sugiyama, Analysis and improvement of policy gradient estimation, Advances in Neural Information Processing Systems 24.
[34] S. Fujimoto, H. Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in: ICML, PMLR, 2018, pp. 1587–1596.
[35] J. Tang, K. Wang, Personalized top-N sequential recommendation via convolutional sequence embedding, in: WSDM '18, 2018, pp. 565–573.
[36] E. Ghanbari, A. Shakery, ERR.Rank: An algorithm based on learning to rank for direct optimization of expected reciprocal rank, Applied Intelligence 49 (3) (2019) 1185–1199.
[37] K. Crammer, Y. Singer, Pranking with ranking, in: Advances in Neural Information Processing Systems, Vol. 14, MIT Press, 2002.
[38] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in: ICML '05, 2005, pp. 89–96.
[39] T. Joachims, Optimizing search engines using clickthrough data, in: KDD '02, 2002, pp. 133–142.
[40] C.J. Burges, From RankNet to LambdaRank to LambdaMART: An overview, Tech. Rep. MSR-TR-2010-82, June 2010.
[41] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, H. Li, Learning to rank: From pairwise approach to listwise approach, in: ICML '07, 2007, pp. 129–136.
[42] M. Taylor, J. Guiver, S. Robertson, T. Minka, SoftRank: Optimizing non-smooth rank metrics, in: WSDM '08, 2008, pp. 77–86.
[43] Y. Yue, T. Finley, F. Radlinski, T. Joachims, A support vector method for optimizing average precision, in: SIGIR '07, 2007, pp. 271–278.
[44] H. Valizadegan, R. Jin, R. Zhang, J. Mao, Learning to rank by optimizing NDCG measure, in: NIPS, Vol. 22, 2009, pp. 1883–1891.
[45] Y. Yang, S. Gopal, Multilabel classification with meta-level features in a learning-to-rank framework, Machine Learning 88 (1–2) (2012) 47–68.
[46] A. Rahangdale, S. Raut, Deep neural network regularization for feature selection in learning-to-rank, IEEE, Vol. 7, 2019, pp. 53988–54006.
[47] G. Shani, R.I. Brafman, D. Heckerman, An MDP-based recommender system, in: UAI '02, 2002, pp. 453–460.
[48] Y. Feng, J. Xu, Y. Lan, J. Guo, W. Zeng, X. Cheng, From greedy selection to exploratory decision-making: Diverse ranking with policy-value networks, in: SIGIR '18, 2018, pp. 125–134.
[49] J. Luo, S. Zhang, H. Yang, Win-win search: Dual-agent stochastic game in session search, in: SIGIR '14, 2014, pp. 587–596.
[50] F. Liu, R. Tang, X. Li, Y. Ye, H. Guo, X. He, Novel approaches to accelerating the convergence rate of Markov decision process for search result diversification, CoRR abs/1802.08401, arXiv:1802.08401.
[51] W. Zeng, J. Xu, Y. Lan, J. Guo, X. Cheng, Multi page search with reinforcement learning to rank, in: ICTIR '18, 2018, pp. 175–178.
[52] K. Arulkumaran, M.P. Deisenroth, M. Brundage, A.A. Bharath, Deep reinforcement learning: A brief survey, IEEE Signal Processing Magazine 34 (6) (2017) 26–38.
[53] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
[54] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602.
[55] K. O'Shea, R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458.
[56] T. Qin, T.-Y. Liu, J. Xu, H. Li, LETOR: A benchmark collection for research on learning to rank for information retrieval, Inf. Retr. 13 (4) (2010) 346–374.
[57] V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: ICML, PMLR, 2016, pp. 1928–1937.
[58] T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, arXiv preprint arXiv:1509.02971.
[59] R. Lu, Z. Jiang, H. Wu, Y. Ding, D. Wang, H.-T. Zhang, Reward shaping-based actor-critic deep reinforcement learning for residential energy management, IEEE Transactions on Industrial Informatics (2022) 1–12.
[60] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst. 20 (4) (2002) 422–446.
[61] K. Järvelin, J. Kekäläinen, IR evaluation methods for retrieving highly relevant documents, in: SIGIR '00, 2000, pp. 41–48.
[62] D.W. Zimmerman, Comparative power of Student t test and Mann-Whitney U test for unequal sample sizes and variances, The Journal of Experimental Education 55 (3) (1987) 171–174.

Vaibhav Padhye obtained his Master's degree in Computer Science and Engineering from BITS Pilani, Pilani, India and is currently pursuing his Ph.D. at the Indian Institute of Technology (BHU), Varanasi, India. His research interests include Reinforcement Learning, Information Retrieval, Policy Gradient Learning, Machine Learning, etc.

Kailasam Lakshmanan is an Assistant Professor with the Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India. He received his Ph.D. in Computer Science and Automation from IISc Bangalore and completed Post-Docs at IIT Bombay, the University of Leoben, Austria, and the National University of Singapore. His research interests include Reinforcement Learning, Optimization, Machine Learning, etc.