Usage of Neural Networks in Information Retrieval

Seminar & Technical Writing-II


Department of Electrical Engineering,
NIT Rourkela

Presented by:
Ansh Abhay Balde
715EE4018
1
Outline

• Introduce Information Retrieval

• Describe the use of neural network based methods in information retrieval


- Feedforward neural network
- Backpropagation algorithm
- Recurrent neural networks
- Sequence-to-sequence models
- Convolutional neural networks

• Summarize where the field is and outline what future presentations will cover.

2
Introduction
• Information retrieval (IR) is the activity of obtaining information system resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing.
• One of the best-known applications of IR is in web search engines such as Google, Yahoo and Yandex (the latter being especially popular in Russia).

Fig. 1: Search Process


3
Multi-layer perceptron a.k.a. feedforward neural network

Fig. 2: Feedforward neural network – input x = (x1, x2, x3, x4), one or more hidden layers, and an output layer producing predictions ŷ that are compared to targets y; node j at level i computes x_{i,j} = φ(o_{i,j}), where o_{i,j} is the weighted sum of the activations x_{i−1,·} from the layer below, φ is an activation function (e.g., the sigmoid 1 / (1 + e^{−o})), and the cost function is, e.g., ½ (ŷ − y)²
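As a concrete illustration of the figure, here is a minimal NumPy sketch of a forward pass through one hidden layer with a sigmoid activation and the squared-error cost. The layer sizes and random weights are arbitrary placeholders for the example, not values from the slide.

```python
import numpy as np

def sigmoid(o):
    """Activation phi(o) = 1 / (1 + e^(-o))."""
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)

# Toy dimensions: 4 inputs, 5 hidden units, 3 outputs (arbitrary for the example).
x = rng.normal(size=4)          # input vector x
W1 = rng.normal(size=(5, 4))    # weights input -> hidden
W2 = rng.normal(size=(3, 5))    # weights hidden -> output
y = np.array([0.0, 1.0, 0.0])   # target y

# Forward pass: each node applies phi to the weighted sum of the layer below.
h = sigmoid(W1 @ x)             # hidden-layer activations
y_hat = sigmoid(W2 @ h)         # output/prediction y^

# Cost function, e.g. 1/2 * (y^ - y)^2, summed over the outputs.
cost = 0.5 * np.sum((y_hat - y) ** 2)
print(y_hat, cost)
```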

4
Back propagation
• Until convergence:
  i. do a forward pass
  ii. compute the cost/error, e.g., ½ (ŷ − y)²
  iii. adjust the weights ← how?

• Adjust every weight w_{i,j} by:

  Δw_{i,j} = −α ∂cost/∂w_{i,j}   (α = learning rate)
           = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂w_{i,j})
           = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂o_{i,j}) (∂o_{i,j}/∂w_{i,j})
           = −α (x_{i,j} − y_j) · x_{i,j}(1 − x_{i,j}) · x_{i−1,j}
           = learning rate · cost gradient · activation derivative · input
           = −α δ_{i,j} x_{i−1,j}

Fig. 3: Backpropagation Algorithm and Cost Function
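A minimal sketch of the update rule above, applied to the output layer of a toy network and assuming sigmoid units with the squared-error cost; the delta term folds the cost gradient and the activation derivative together, as in the last line of the derivation. Sizes and values are placeholders.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)
alpha = 0.1                       # learning rate

x_prev = rng.normal(size=5)       # activations x_{i-1} of the layer below
W = rng.normal(size=(3, 5))       # weights w_{i,j} into the output layer
y = np.array([0.0, 1.0, 0.0])     # targets

# Forward pass.
x_out = sigmoid(W @ x_prev)       # x_{i,j}: output-layer activations

# delta = (cost gradient) * (activation derivative) = (x - y) * x * (1 - x)
delta = (x_out - y) * x_out * (1.0 - x_out)

# Delta w_{i,j} = -alpha * delta_j * x_{i-1,j}
delta_W = -alpha * np.outer(delta, x_prev)
W = W + delta_W                   # one gradient step
```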


5
Recurrent neural networks

• Lots of information is sequential and requires a memory for successful processing
• Sequences as input, sequences as output
• Recurrent neural networks (RNNs) are called recurrent because they perform the same task for every element of a sequence, with the output dependent on previous computations
• RNNs have a memory that captures information about what has been computed so far
• RNNs can make use of information in arbitrarily long sequences – in practice they are limited to looking back only a few steps

Fig. 5: RNNs
Image credits: http://karpathy.github.io/assets/rnn/diags.jpeg
6
Recurrent neural networks

• An RNN being unrolled (or unfolded) into a full network
• Unrolling: write out the network for the complete sequence

Fig. 6: Unrolled RNN


• Formulas governing the computation (sketched in code below):
  - x_t: the input at time step t
  - s_t: the hidden state at time step t – the memory of the network, calculated from the previous hidden state and the input at the current step: s_t = f(U x_t + W s_{t−1}); f is usually a nonlinearity, e.g., tanh or ReLU; s_{−1} is typically initialized to all zeroes
  - o_t: the output at step t; e.g., if we want to predict the next word in a sentence, a vector of probabilities across the vocabulary: o_t = softmax(V s_t)
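A minimal NumPy sketch of these two equations for a single time step; the dimensions and random parameter values are placeholders for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, hidden_size = 8, 4    # toy sizes for the example

U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

s_prev = np.zeros(hidden_size)    # s_{-1} initialized to all zeroes
x_t = np.eye(vocab_size)[3]       # one-hot input at time step t

# s_t = tanh(U x_t + W s_{t-1});  o_t = softmax(V s_t)
s_t = np.tanh(U @ x_t + W @ s_prev)
o_t = softmax(V @ s_t)
print(o_t)
```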

7
Language modeling using RNNs

• A language model allows us to predict the probability of observing a sentence (in a given dataset) as: P(w_1, …, w_m) = ∏_{i=1}^{m} P(w_i | w_1, …, w_{i−1})
• In an RNN, set o_t = x_{t+1}: we want the output at step t to be the actual next word
• The input x is a sequence of words; each x_t is a single word; we represent each word as a one-hot vector of size vocabulary_size
• Initialize the parameters U, V, W to small random values around 0
• Cross-entropy loss as the loss function
• For N training examples (words in the text) and C classes (the size of our vocabulary), the loss with respect to predictions o and true labels y is: L(y, o) = −(1/N) Σ_{n∈N} y_n log o_n
• Training an RNN is similar to training a traditional NN: the backpropagation algorithm, but with a small twist
• Parameters are shared by all time steps, so the gradient at each output depends on the calculations of previous time steps: Backpropagation Through Time
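A sketch of the cross-entropy loss over one-hot targets, matching L(y, o) = −(1/N) Σ y_n log o_n above; the predicted distributions here are made-up values for illustration.

```python
import numpy as np

def cross_entropy(y_true, o_pred):
    """L(y, o) = -1/N * sum_n y_n . log o_n, with each y_n one-hot and each
    o_n a probability distribution over the vocabulary."""
    N = y_true.shape[0]
    return -np.sum(y_true * np.log(o_pred)) / N

vocab_size = 5
# Two training examples (words): the true next words are indices 2 and 0.
y_true = np.eye(vocab_size)[[2, 0]]
# Predicted probability distributions o_t (placeholder values).
o_pred = np.array([[0.1, 0.1, 0.6, 0.1, 0.1],
                   [0.3, 0.3, 0.2, 0.1, 0.1]])
print(cross_entropy(y_true, o_pred))
```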
8
Vanishing and exploding gradients
• For training RNNs, we calculate gradients for U, V, W – fine for V, but for W and U . . .
• Gradient for W:
  ∂L_3/∂W = (∂L_3/∂o_3)(∂o_3/∂s_3)(∂s_3/∂W) = Σ_{k=0}^{3} (∂L_3/∂o_3)(∂o_3/∂s_3)(∂s_3/∂s_k)(∂s_k/∂W)
• More generally:
  ∂L/∂s_t = (∂L/∂s_m) · (∂s_m/∂s_{m−1}) · (∂s_{m−1}/∂s_{m−2}) · … · (∂s_{t+1}/∂s_t)
  with each factor < 1, so the product shrinks towards 0
• Gradient contributions from far-away steps become zero: the state at those steps doesn't contribute to what you are learning.

Fig. 7: Gradient back propagated and Vanishing Gradient
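A small numerical illustration of the effect (not the actual BPTT computation): repeatedly multiplying a gradient by Jacobian-like factors whose norm is below one drives the contribution from distant steps toward zero. The matrix size, scaling and number of steps are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 4, 30

# A recurrent Jacobian-like matrix scaled so its largest singular value is < 1.
J = rng.normal(size=(hidden_size, hidden_size))
J = 0.9 * J / np.linalg.norm(J, 2)

grad = np.ones(hidden_size)       # gradient arriving at the last time step
for t in range(steps):
    grad = J.T @ grad             # one factor ds_m/ds_{m-1} per step back
    if t % 10 == 9:
        print(f"after {t + 1:2d} steps back, |grad| = {np.linalg.norm(grad):.2e}")
```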

9
Long Short Term Memory [Hochreiter and Schmidhuber, 1997]
LSTMs are designed to combat vanishing gradients through a gating mechanism
• i, f, o: the input, forget and output gates
• Gates optionally let information through: composed of a sigmoid neural net layer and a pointwise multiplication operation
• g is a candidate hidden state computed based on the current input and the previous hidden state
• c_t is the internal memory of the LSTM unit: it combines the previous memory c_{t−1} multiplied by the forget gate, and the newly computed hidden state g multiplied by the input gate
• Compute the output hidden state s_t by multiplying the memory with the output gate
• Plain RNNs are a special case of LSTMs:
  - Fix the input gate to all 1's
  - Fix the forget gate to all 0's (always forget the previous memory)
  - Fix the output gate to all 1's (expose the whole memory)
  - An additional tanh squashes the output
• The gating mechanism allows LSTMs to model long-term dependencies
• Learn parameters for the gates, to learn how the memory should behave
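A minimal sketch of one LSTM step following the gates described above (i, f, o, candidate g, memory c_t). The parameter shapes and values are placeholders, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# One weight matrix per gate, acting on [x_t; s_{t-1}] (biases omitted).
Wi, Wf, Wo, Wg = (rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))
                  for _ in range(4))

x_t = rng.normal(size=input_size)
s_prev = np.zeros(hidden_size)    # previous hidden state
c_prev = np.zeros(hidden_size)    # previous internal memory

z = np.concatenate([x_t, s_prev])
i = sigmoid(Wi @ z)               # input gate
f = sigmoid(Wf @ z)               # forget gate
o = sigmoid(Wo @ z)               # output gate
g = np.tanh(Wg @ z)               # candidate hidden state

c_t = f * c_prev + i * g          # new internal memory
s_t = o * np.tanh(c_t)            # output hidden state
print(s_t)
```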
10
Sequence-to-sequence models
• seq2seq models build on top of language models
  - Encoder step: a model converts the input sequence into a fixed representation
  - Decoder step: a language model is trained on both the output sequence (e.g., a translated sentence) and the fixed representation from the encoder
  - Since the decoder model sees the encoded representation of the input sequence as well as the output sequence, it can make more intelligent predictions about future words based on the current word

Image credits: [Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014]
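A schematic sketch of the encoder/decoder split using the simple RNN step from earlier slides; this is an illustrative outline under assumed toy dimensions and a made-up start token, not the architecture of any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 8, 4

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(i):
    return np.eye(vocab)[i]

# Toy parameters for the encoder and decoder RNNs (placeholders).
U_enc = rng.normal(scale=0.1, size=(hidden, vocab))
W_enc = rng.normal(scale=0.1, size=(hidden, hidden))
U_dec = rng.normal(scale=0.1, size=(hidden, vocab))
W_dec = rng.normal(scale=0.1, size=(hidden, hidden))
V_dec = rng.normal(scale=0.1, size=(vocab, hidden))

# Encoder step: fold the input sequence into a fixed representation.
source = [1, 4, 2]
s = np.zeros(hidden)
for w in source:
    s = np.tanh(U_enc @ one_hot(w) + W_enc @ s)
encoding = s                       # fixed representation of the input sequence

# Decoder step: a language model conditioned on the encoding, predicting
# the output sequence one word at a time.
s = encoding
prev_word = 0                      # assumed start-of-sequence token
for _ in range(3):
    s = np.tanh(U_dec @ one_hot(prev_word) + W_dec @ s)
    probs = softmax(V_dec @ s)
    prev_word = int(np.argmax(probs))
    print(prev_word)
```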
11
Sequence-to-sequence models
Used for a “traditional information retrieval task”

12
Convolutional neural networks
Major breakthroughs in image classification – at the core of many computer vision systems
Some initial applications of CNNs to problems in text and information retrieval
What is a convolution? Intuition: a sliding window function applied to a matrix
Example: convolution with a 3 × 3 filter

Multiply the values element-wise with the original matrix, then sum. Slide over the whole matrix (see the sketch below).

Image credits: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
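A sketch of the sliding-window intuition: convolving a small matrix with a 3 × 3 filter by multiplying element-wise and summing at every position. The input values and filter are arbitrary examples.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2D convolution by sliding the kernel over the image:
    multiply element-wise with the patch under the window, then sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input matrix
edge_filter = np.array([[1., 0., -1.],             # a simple 3x3 filter
                        [1., 0., -1.],
                        [1., 0., -1.]])
print(convolve2d(image, edge_filter))              # 3x3 output map
```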
13
Convolutional neural networks

• Use convolutions over the input layer to compute the output
• This yields local connections: each region of the input is connected to a neuron in the output
• Each layer applies different filters and combines the results
• Pooling (subsampling) layers (a small sketch follows after this list)
• During training, the CNN learns the values of its filters
• For image classification, a CNN may learn to detect edges from raw pixels in the first layer
• Then use the edges to detect simple shapes in the second layer
• Then use the shapes to detect higher-level features, such as facial shapes, in higher layers
• The last layer is then a classifier that uses these high-level features
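A small sketch of a max-pooling (subsampling) layer, which downsamples each 2 × 2 region of a feature map to its maximum; the window size is a common choice, not one prescribed by the slide.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max-pooling: keep the largest value in each size x size block."""
    h, w = feature_map.shape
    out = feature_map[:h - h % size, :w - w % size]
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 feature map
print(max_pool(fm))                              # 2x2 pooled output
```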
14
CNNs in text

Example uses in IR
• MSR: how to learn semantically meaningful representations of sentences that can be used for information retrieval
• Recommending potentially interesting documents to users based on what they are currently reading
• Sentence representations are trained based on search engine log data
• Gao et al. Modeling Interestingness with Deep Neural Networks. EMNLP 2014; Shen et al. A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. CIKM 2014.

15
Future Presentations

• We'll learn about text matching: how queries are matched with documents

• How are documents ranked?

• How is user behaviour taken into account to improve search results?

• How are neural click models used in practice to improve results over probabilistic click models?

• How are responses generated on the final result page?

16
References
• Qingyao Ai, Liu Yang, Jiafeng Guo, and W. Bruce Croft. 2016a. Analysis of the Paragraph Vector Model for Information Retrieval. In ICTIR. ACM,
133–142.
• Qingyao Ai, Liu Yang, Jiafeng Guo, and W Bruce Croft. 2016b. Improving language estimation with the paragraph vector model for ad-hoc retrieval.
In SIGIR. ACM, 869–872.
• Qingyao Ai, Yongfeng Zhang, Keping Bi, Xu Chen, and Bruce W. Croft. 2017. Learning a Hierarchical Embedding Model for Personalized Product
Search. In SIGIR.
• Nima Asadi, Donald Metzler, Tamer Elsayed, and Jimmy Lin. 2011. Pseudo test collections for learning web search ranking functions. In SIGIR.
ACM, 1073–1082.
• Leif Azzopardi, Maarten de Rijke, and Krisztian Balog. 2007. Building simulated queries for known-item topics: An analysis using six European
languages. In SIGIR. ACM.
• Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1). 238–247.
• Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks 19, 4 (2008), 713–722.
• Yoshua Bengio, Jean-Sébastien Senécal, and others. 2003. Quick Training of Probabilistic Neural Nets by Importance Sampling. In AISTATS.
• Richard Berendsen, Manos Tsagkias, Wouter Weerkamp, and Maarten de Rijke. 2013. Pseudo test collections for training and tuning
microblog rankers. In SIGIR. ACM, 53–62.
• David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. JMLR 3 (2003), 993–1022.
• Antoine Bordes and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. ICLR (2017).
• Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. 2016. A neural click model for web search. In Proceedings of the 25th
International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 531–541.
• Chris Burges. 2015. RankNet: A ranking retrospective. (2015). https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/ Accessed January 15, 2019.

17
