
Usage of Neural Networks in Information Retrieval

Seminar & Technical Writing-II


Department of Electrical Engineering,
NIT Rourkela

Presented by:
Ansh Abhay Balde
715EE4018
1
Outline

• Introduce Information Retrieval

• Describe the use of neural network based methods in information retrieval


– Feedforward neural network
– Back propagation algorithm
– Recurrent neural networks
– Sequence-to-sequence models
– Convolutional neural networks

• Summarize where the field is and outline what will be covered in future presentations

2
Introduction
• Information retrieval (IR) is the activity of obtaining, from a collection of information
resources, the resources relevant to an information need. Searches can be based on
full-text or other content-based indexing.
• Search engines such as Google, Yahoo, and Yandex (which is especially popular in
Russia) are among the best-known applications of IR.

Fig. 1: Search Process


3
Multi-layer perceptron a.k.a. feedforward neural network

[Figure: a network with input layer x = (x1, x2, x3, x4), hidden layers connected by weights w_ij, and an output layer producing predictions ŷ1, ŷ2, ŷ3 that are compared against targets y1, y2, y3 by a cost function, e.g. ½(ŷ − y)²; x_ij denotes node j at level i; φ is the activation function, e.g. the sigmoid 1/(1 + e^−o)]

Fig. 2: Feedforward neural network

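To make the forward pass concrete, here is a minimal NumPy sketch of a single-hidden-layer network with sigmoid activations; the layer sizes, weight names, and random initialization are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(o):
    # Activation function phi: 1 / (1 + e^-o)
    return 1.0 / (1.0 + np.exp(-o))

def forward(x, W_hidden, W_out):
    """One forward pass through a single-hidden-layer feedforward network."""
    hidden = sigmoid(W_hidden @ x)    # hidden-layer activations
    y_hat = sigmoid(W_out @ hidden)   # output-layer predictions
    return hidden, y_hat

# Toy dimensions: 4 inputs, 5 hidden units, 3 outputs (arbitrary choices).
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(5, 4))
W_out = rng.normal(scale=0.1, size=(3, 5))

x = np.array([0.2, 0.4, 0.1, 0.7])      # input vector
y = np.array([1.0, 0.0, 0.0])           # target
hidden, y_hat = forward(x, W_hidden, W_out)
cost = 0.5 * np.sum((y_hat - y) ** 2)   # cost function 1/2 (y_hat - y)^2
```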
4
Back propagation

• Until convergence:
  i.  do a forward pass
  ii. compute the cost/error
  iii. adjust the weights ← how?

• Adjust every weight w_ij by (α = learning rate):

  Δw_ij = −α ∂cost/∂w_ij
        = −α (∂cost/∂x_ij) (∂x_ij/∂w_ij)
        = −α (∂cost/∂x_ij) (∂x_ij/∂o_ij) (∂o_ij/∂w_ij)
        = −α (x_ij − y_j) · x_ij (1 − x_ij) · x_{i−1,j}
          (learning rate · cost derivative · activation derivative · input)
        = −α δ x_{i−1,j}

Fig. 3: Backpropagation Algorithm and Cost Function


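As a minimal sketch of the update rule above for the output layer (squared-error cost and sigmoid activation, reusing the toy network from the previous slide's sketch; variable names are my own):

```python
import numpy as np

def output_layer_update(x_prev, x_out, y, alpha=0.1):
    """Delta-rule update for the output-layer weights.

    x_prev : activations feeding into this layer (x_{i-1})
    x_out  : activations of this layer (x_i, the predictions)
    y      : target values
    """
    # delta = d(cost)/d(x) * d(x)/d(o) = (x - y) * x * (1 - x)
    delta = (x_out - y) * x_out * (1.0 - x_out)
    # Delta w_{i,j} = -alpha * delta * x_{i-1,j}, computed for every weight at once.
    return -alpha * np.outer(delta, x_prev)

# Usage with the previous sketch: W_out += output_layer_update(hidden, y_hat, y)
```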
5
Recurrent neural networks

• Lots of information is sequential and requires a memory for successful processing
• Sequences as input, sequences as output
• Recurrent neural networks (RNNs) are called recurrent because they perform the same
  task for every element of a sequence, with the output dependent on previous computations
• RNNs have a memory that captures information about what has been computed so far
• RNNs can make use of information in arbitrarily long sequences – in practice they are
  limited to looking back only a few steps

Fig. 5: RNNs
Image credits: http://karpathy.github.io/assets/rnn/diags.jpeg
6
Recurrent neural networks

• An RNN can be unrolled (or unfolded) into a full network
• Unrolling: write out the network for the complete sequence

Fig. 6: Unrolled RNN

• Formulas governing the computation:
  – x_t: the input at time step t
  – s_t: the hidden state at time step t – the memory of the network, calculated from the
    previous hidden state and the input at the current step: s_t = f(U x_t + W s_{t−1});
    f is usually a nonlinearity, e.g., tanh or ReLU; s_{−1} is typically initialized to all zeroes
  – o_t: the output at step t. E.g., if we want to predict the next word in a sentence, a vector
    of probabilities across the vocabulary: o_t = softmax(V s_t)

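A minimal NumPy sketch of these two formulas; the vocabulary and hidden-state sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # shift for numerical stability
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    """One RNN time step: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = softmax(V @ s_t)
    return s_t, o_t

# Toy sizes: vocabulary of 10 (one-hot inputs), hidden state of size 8.
rng = np.random.default_rng(1)
U = rng.normal(scale=0.1, size=(8, 10))
W = rng.normal(scale=0.1, size=(8, 8))
V = rng.normal(scale=0.1, size=(10, 8))

s = np.zeros(8)                  # s_{-1} initialized to all zeroes
x = np.eye(10)[3]                # one-hot vector for word index 3
s, o = rnn_step(x, s, U, W, V)   # o is a probability vector over the vocabulary
```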
7
Language modeling using RNNs

• A language model allows us to predict the probability of observing a sentence (in a given
  dataset) as: P(w_1, . . . , w_m) = ∏_{i=1..m} P(w_i | w_1, . . . , w_{i−1})
• In an RNN, set o_t = x_{t+1}: we want the output at step t to be the actual next word
• The input x is a sequence of words; each x_t is a single word, represented as a one-hot
  vector of size vocabulary_size
• Initialize the parameters U, V, W to small random values around 0
• Cross-entropy loss as the loss function
• For N training examples (words in the text) and C classes (the size of our vocabulary),
  the loss with respect to the predictions o and the true labels y is:
  L(y, o) = −(1/N) Σ_{n∈N} y_n log o_n
• Training an RNN is similar to training a traditional NN: the backpropagation algorithm,
  but with a small twist
• Parameters are shared by all time steps, so the gradient at each output depends on the
  calculations of previous time steps: Backpropagation Through Time
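A small sketch of the cross-entropy loss above for one-hot targets; the toy predictions below are made up for illustration.

```python
import numpy as np

def cross_entropy_loss(predictions, targets):
    """L(y, o) = -(1/N) * sum_n y_n . log(o_n), for one-hot targets.

    predictions : (N, C) array of softmax outputs o_n
    targets     : (N, C) array of one-hot true labels y_n
    """
    n = predictions.shape[0]
    # Only the log-probability of the true class survives the element-wise product.
    return -np.sum(targets * np.log(predictions + 1e-12)) / n

# Example: two training words from a vocabulary of 4 classes.
o = np.array([[0.1, 0.7, 0.1, 0.1],
              [0.3, 0.2, 0.4, 0.1]])
y = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0]])
print(cross_entropy_loss(o, y))   # average negative log-likelihood
```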
8
Vanishing and exploding gradients
• For training RNNs, we calculate gradients for U, V, W – this is fine for V, but for W and U . . .
• Gradient for W at step 3:

  ∂L_3/∂W = (∂L_3/∂o_3)(∂o_3/∂s_3)(∂s_3/∂W)
          = Σ_{k=0..3} (∂L_3/∂o_3)(∂o_3/∂s_3)(∂s_3/∂s_k)(∂s_k/∂W)

• More generally:

  ∂L/∂s_t = (∂L/∂s_m)(∂s_m/∂s_{m−1})(∂s_{m−1}/∂s_{m−2}) · · · (∂s_{t+1}/∂s_t)

  where each factor is < 1, so the product shrinks toward zero
• Gradient contributions from far-away steps become zero: the state at those steps doesn't
  contribute to what you are learning.

Fig. 7: Backpropagated gradient and the vanishing gradient

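A tiny numerical illustration of why these products vanish: if each factor ∂s_{k+1}/∂s_k contributes a value of roughly 0.8 (an arbitrary number below 1, chosen for illustration), the combined contribution of far-away steps decays exponentially.

```python
import numpy as np

factor = 0.8                       # assumed magnitude of each per-step factor (< 1)
steps = np.arange(1, 51)
gradient_scale = factor ** steps   # product of `steps` such factors

# After 10 steps the contribution is ~0.11; after 50 steps it is ~1.4e-5.
print(gradient_scale[9], gradient_scale[49])
```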
9
Long Short Term Memory [Hochreiter and Schmidhuber, 1997]
LSTMs are designed to combat vanishing gradients through a gating mechanism

• i, f, o: the input, forget and output gates
• Gates optionally let information through: each is composed of a sigmoid neural net layer
  and a pointwise multiplication operation
• g is a candidate hidden state computed from the current input and the previous hidden state
• c_t is the internal memory of the LSTM unit: it combines the previous memory c_{t−1},
  multiplied by the forget gate, with the newly computed hidden state g, multiplied by the
  input gate
• Compute the output hidden state s_t by multiplying the memory with the output gate
• Plain RNNs are a special case of LSTMs:
  – fix the input gate to all 1's
  – fix the forget gate to all 0's (always forget the previous memory)
  – fix the output gate to all 1's (expose the whole memory)
  – the additional tanh squashes the output
• The gating mechanism allows LSTMs to model long-term dependencies
• We learn parameters for the gates, i.e., we learn how the memory should behave
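A minimal sketch of one LSTM step following the description above (the standard formulation, with gates computed from the concatenated previous state and input; weight names and sizes are assumptions, and biases are omitted).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, Wg, Wi, Wf, Wo):
    """One LSTM time step."""
    z = np.concatenate([s_prev, x_t])
    g = np.tanh(Wg @ z)           # candidate hidden state
    i = sigmoid(Wi @ z)           # input gate
    f = sigmoid(Wf @ z)           # forget gate
    o = sigmoid(Wo @ z)           # output gate
    c_t = f * c_prev + i * g      # internal memory: keep part of the old, add part of the new
    s_t = o * np.tanh(c_t)        # output hidden state: expose part of the (squashed) memory
    return s_t, c_t

# Toy sizes: input of size 10, hidden state / memory of size 8.
rng = np.random.default_rng(2)
Wg, Wi, Wf, Wo = (rng.normal(scale=0.1, size=(8, 18)) for _ in range(4))
s, c = np.zeros(8), np.zeros(8)
x = np.eye(10)[3]
s, c = lstm_step(x, s, c, Wg, Wi, Wf, Wo)
```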
10
Sequence-to-sequence models
• seq2seq models build on top of language models
  – Encoder step: a model converts the input sequence into a fixed representation
  – Decoder step: a language model is trained on both the output sequence (e.g., the
    translated sentence) and the fixed representation from the encoder
  – Since the decoder model sees the encoded representation of the input sequence as well
    as the output sequence, it can make more intelligent predictions about future words
    based on the current word (see the sketch below)

Image credits: [Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014]
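A highly simplified sketch of the encoder-decoder idea, reusing plain RNN steps and greedy decoding; all parameter names, the shared embedding, and the sizes are illustrative assumptions, not the models from the papers cited above.

```python
import numpy as np

def encode(input_ids, embed, U, W):
    """Encoder: run an RNN over the input and keep its final hidden state."""
    s = np.zeros(W.shape[0])
    for idx in input_ids:
        s = np.tanh(U @ embed[idx] + W @ s)
    return s                                   # fixed representation of the input sequence

def decode(context, embed, U, W, V, start_id, max_len=10):
    """Decoder: a language model whose state is initialized with the encoder output."""
    s, idx, output = context, start_id, []
    for _ in range(max_len):
        s = np.tanh(U @ embed[idx] + W @ s)    # in practice the decoder has its own parameters
        logits = V @ s
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        idx = int(np.argmax(probs))            # greedy choice of the next word
        output.append(idx)
    return output

# Toy setup: vocabulary of 12, hidden state of size 8.
rng = np.random.default_rng(3)
embed = rng.normal(scale=0.1, size=(12, 8))
U = rng.normal(scale=0.1, size=(8, 8))
W = rng.normal(scale=0.1, size=(8, 8))
V = rng.normal(scale=0.1, size=(12, 8))

context = encode([4, 7, 2], embed, U, W)
print(decode(context, embed, U, W, V, start_id=0))
```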
11
Sequence-to-sequence models
Used for a “traditional information retrieval task”

12
Convolutional neural networks
Major breakthroughs in image classification – at the core of many computer vision systems
There are some initial applications of CNNs to problems in text and information retrieval
What is a convolution? Intuition: a sliding window function applied to a matrix
Example: convolution with a 3 × 3 filter

Multiply the filter values element-wise with the original matrix, then sum. Slide the window over the whole matrix.

Image credits: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
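A minimal sketch of this sliding-window operation (a "valid" convolution without filter flipping, i.e. cross-correlation as used in most CNN libraries; the example image and filter are made up).

```python
import numpy as np

def conv2d(matrix, kernel):
    """Slide `kernel` over `matrix`; at each position, multiply element-wise and sum."""
    kh, kw = kernel.shape
    out_h = matrix.shape[0] - kh + 1
    out_w = matrix.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(matrix[i:i + kh, j:j + kw] * kernel)
    return out

# Example: a 3 x 3 averaging filter applied to a 5 x 5 checkerboard-like matrix.
image = (np.arange(25).reshape(5, 5) % 2).astype(float)
kernel = np.ones((3, 3)) / 9.0
print(conv2d(image, kernel))   # 3 x 3 output feature map
```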
13
Convolutional neural networks

• Use convolutions over the input layer to compute the output
• This yields local connections: each region of the input is connected to a neuron in the output
• Each layer applies different filters and combines the results
• Pooling (subsampling) layers (see the max-pooling sketch below)
• During training, the CNN learns the values of its filters
• In image classification, a CNN may learn to detect edges from raw pixels in the first layer
• It then uses the edges to detect simple shapes in the second layer
• It then uses the shapes to detect higher-level features, such as facial shapes, in higher layers
• The last layer is then a classifier that uses these high-level features
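And a minimal sketch of a max-pooling (subsampling) layer, as mentioned in the list above; the 2 × 2 pool size and the toy feature map are illustrative choices.

```python
import numpy as np

def max_pool2d(feature_map, pool=2):
    """Downsample by taking the maximum over non-overlapping pool x pool regions."""
    h, w = feature_map.shape
    h, w = h - h % pool, w - w % pool                      # drop any ragged border
    fm = feature_map[:h, :w]
    return fm.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))

# Example: pool a 4 x 4 feature map down to 2 x 2.
fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fm))
```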
14
CNNs in text

Example uses in IR
• MSR: how to learn semantically meaningful representations of sentences that can be used
  for information retrieval
• Recommending potentially interesting documents to users based on what they are
  currently reading
• Sentence representations are trained on search engine log data
• Gao et al. Modeling Interestingness with Deep Neural Networks. EMNLP 2014;
  Shen et al. A Latent Semantic Model with Convolutional-Pooling Structure for
  Information Retrieval. CIKM 2014.

15
Future Presentations

• We will learn about text matching: how queries are matched with documents

• How are documents ranked?

• How is user behaviour taken into account to improve search results?

• How is the Neural Click Model used in practice to improve results over probabilistic
  click models?

• How are responses generated on the final result page?

16
