
Query Subtopic Mining Based on Cluster Ranking

Presenters
1. Amrita Das Gupta (150101073)
2. Tasnima Saddaki Tani (150101069)
3. Medha Joita Sikder (150101062)
4. Jannatun Nime Jotey (150101083)

Supervisor
Dr. S. M. Sadakatul Bari
Asst. Professor
Department of Computer Science & Engineering

Department of Computer Science and Engineering
Bangladesh Army University of Science and Technology

Presentation Outline
• Introduction and Motivation
• Related Work
• Proposed Method
    Overview
    Subtopic Candidate Generation
    Subtopic Feature Extraction
    Subtopic Clustering
    Subtopic Ranking
• Experiments and Evaluation
    Dataset
    Experimental Setup
    Experimental Results
• Conclusions and Future Directions

Introduction & Motivation

What is subtopic mining?

Identifying subtopics that cover the diverse user intents behind a query, in order to disambiguate the query.

Cont.

Why use subtopic mining?

Cont.

For the query "jaguar", all results are related to the car.

But users might be looking for:
1. Jaguar, the animal
2. Jaguar sanitary services

Cont.
Problem with Web Search

• Different users issue the same query with different information needs.
• Queries are often broad or ambiguous and may carry multiple intents.

Example: the query "Apple" may refer to
• Apple iPhone
• Apple MacBook
• Apple Product Review
• Apple New York office
• Apple Company
• Apple Fruit

Problem Statement

• Problems with current web search:
    Different users issue the same query with different information needs.
    Queries are often broad or ambiguous and may carry multiple intents.

Objectives

• Study the different methods of query subtopic mining used for web search.
• Develop an enhanced method for query subtopic mining based on cluster ranking.
• Implement the developed method in the Python programming language.
• Compare the results using standard datasets.

Related Work
• Santos et al. [1] use search engine suggestions as subtopic candidates to uncover query intents.
• Kim et al. [2] proposed a method that mines subtopics using simple patterns and a hierarchical structure of subtopics, exploiting a set of relevant documents.
• Ren et al. [3] proposed a system that mines subtopics from a heterogeneous graph and improves subtopic quality with the help of Wikipedia concepts. It applies heterogeneous graph-based soft clustering to obtain an intent indicator for each object in the constructed graph.
• Hu et al. [4] identified the intents of the input query by mapping the query into the Wikipedia representation space.
• Zheng et al. [5] integrated information from both structured and unstructured data to extract high-quality subtopics.
• Shajalal et al. [6] proposed applying soft clustering to the subtopic candidates, based on frequent phrases, to group subtopics with similar intents.

Proposed Method

Overview of the pipeline:
Query → Search Engine Suggestions (Google, Yahoo, Bing) → Subtopic Candidates → Feature Extraction → Subtopic Clustering → Subtopic Ranking → Ranked Subtopics

Subtopic Candidate Generation

Query → Search Engine Suggestions (Bing, Google, Yahoo) → Subtopic Candidates

Example query: Pocono

Search engine suggestions:
• Bing: pocono record; poconos all inclusive; poconos family resorts; pocono mountain resorts
• Google: pocono record; pocono raceway; poconos vacation; pocono medical center
• Yahoo: pocono raceway; pocono record; pocono medical center; pocono mountains

Subtopic candidates:
• pocono record
• poconos all inclusive
• poconos family resorts
• pocono mountain resorts
• pocono raceway
• poconos resorts
• pocono medical center
• pocono mountains
• poconos vacation

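The merging step in the Pocono example can be sketched in Python as follows. This is only a minimal illustration, not the presenters' implementation: the function name `merge_suggestions`, the whitespace and case normalization, and the hard-coded suggestion lists (as read from the slide's example) are assumptions. Fetching suggestions from the live search engines is deliberately left out.

```python
def merge_suggestions(suggestions_by_engine):
    """Merge per-engine suggestion lists into de-duplicated subtopic candidates.

    Returns a dict mapping each candidate to the list of engines that
    suggested it, in first-seen order (hypothetical helper, not from the slides).
    """
    candidates = {}
    for engine, suggestions in suggestions_by_engine.items():
        for text in suggestions:
            # Assumed normalization: lowercase and collapse whitespace.
            key = " ".join(text.lower().split())
            if key:
                candidates.setdefault(key, []).append(engine)
    return candidates


# Suggestion lists as read from the slide's Pocono example.
suggestions = {
    "Bing": ["pocono record", "poconos all inclusive",
             "poconos family resorts", "pocono mountain resorts"],
    "Google": ["pocono record", "pocono raceway",
               "poconos vacation", "pocono medical center"],
    "Yahoo": ["pocono raceway", "pocono record",
              "pocono medical center", "pocono mountains"],
}

candidates = merge_suggestions(suggestions)
for candidate, engines in candidates.items():
    print(candidate, engines)
```

Keeping the list of engines per candidate is a design choice made here for illustration; it would make a voting-style feature easy to derive later.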
Feature Extraction
• Query Dependent Features
    Average Concept Similarity
    WordNet Path Similarity
    Lexical Similarity
    Query Term Overlap
    Query Synonym Overlap
    Exact Match
    Hit Count
    Point-wise Mutual Information
• Query Independent Features
    Selective POS Percentage
    Avg. Term Length
    Reciprocal Rank
    Voting
Feature Extraction
Query Dependent Features (1/2)
1. Average Concept Similarity (ACS)
The average concept similarity between query $Q$ and subtopic $S$:

\[ f_{ACS}(Q, S) = \frac{\sum_{t_i \in Q} \sum_{t_j \in S} \mathrm{ConSim}(t_i, t_j)}{|Q| \cdot |S|} \]

where $\mathrm{ConSim}(t_i, t_j)$ denotes the similarity between two concepts $t_i$ and $t_j$ in a large conceptual domain:

\[ \mathrm{ConSim}(t_i, t_j) = \frac{2 \cdot depth_1}{depth_2 + depth_3 + 2 \cdot depth_1} \]

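The ACS formula can be illustrated with a small Python sketch. The slide does not define the three depth terms, so this sketch assumes a Wu-Palmer style reading: depth1 is the depth of the lowest common ancestor of the two concepts (root at depth 1), and depth2 and depth3 are the path lengths from each concept up to that ancestor. The toy taxonomy `PARENT` and the function names are hypothetical; a real system would presumably use WordNet or a similar resource.

```python
# Hypothetical toy taxonomy (child -> parent); stands in for a real concept hierarchy.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "cat": "animal",
    "jaguar": "cat",
    "car": "vehicle",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def con_sim(a, b):
    """ConSim = 2*depth1 / (depth2 + depth3 + 2*depth1), under the assumed reading."""
    if a not in PARENT or b not in PARENT:
        return 0.0  # assumption: unknown concepts contribute no similarity
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    lca = next(c for c in path_a if c in ancestors_b)  # lowest common ancestor
    depth1 = len(path_to_root(lca))  # root has depth 1
    depth2 = path_a.index(lca)       # steps from a up to the ancestor
    depth3 = path_b.index(lca)       # steps from b up to the ancestor
    return 2 * depth1 / (depth2 + depth3 + 2 * depth1)


def f_acs(query_terms, subtopic_terms):
    """Average ConSim over all (query term, subtopic term) pairs."""
    if not query_terms or not subtopic_terms:
        return 0.0
    total = sum(con_sim(q, s) for q in query_terms for s in subtopic_terms)
    return total / (len(query_terms) * len(subtopic_terms))


print(f_acs(["jaguar"], ["jaguar", "cat"]))
print(f_acs(["jaguar"], ["car"]))
```

In this toy hierarchy a subtopic about cats scores much higher against "jaguar" than one about cars, which is the behavior the feature is meant to capture.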
Cont...
Query Dependent Features (2/2)
2. WordNet Path Similarity (WPS)

\[ f_{WPS}(Q, S) = \frac{b_Q \, W \, b_S^{T}}{|b_Q| \cdot |b_S|} \]

where $b_Q$ and $b_S$ are binary vectors for query $Q$ and subtopic $S$:

\[ b_Q = \big( I(t \in Q) \big)_{t \in V}, \qquad b_S = \big( I(t \in S) \big)_{t \in V} \]

where
$V$ is a vector containing the terms of the query and the subtopic.
$I(\cdot)$ returns 1 if its argument is true, 0 otherwise.
$W$ is a symmetric matrix containing the pairwise concept similarities of the terms in $V$.

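A minimal Python sketch of the WPS computation follows. Two points are assumptions rather than statements from the slide: the vector norms are taken to be Euclidean, and the term-similarity function is passed in as a parameter `sim`. The feature name suggests WordNet path similarity is intended for `sim`; the `exact_match` function below is only a stand-in so the example runs without external data.

```python
import math


def f_wps(query_terms, subtopic_terms, sim):
    """Compute b_Q * W * b_S^T / (|b_Q| * |b_S|) for a given similarity function."""
    vocab = sorted(set(query_terms) | set(subtopic_terms))       # V
    b_q = [1 if t in query_terms else 0 for t in vocab]          # binary query vector
    b_s = [1 if t in subtopic_terms else 0 for t in vocab]       # binary subtopic vector
    w = [[sim(a, b) for b in vocab] for a in vocab]              # pairwise similarities
    n = len(vocab)
    numerator = sum(b_q[i] * w[i][j] * b_s[j] for i in range(n) for j in range(n))
    # Assumption: Euclidean norm; for binary vectors this is sqrt(number of ones).
    norm = math.sqrt(sum(b_q)) * math.sqrt(sum(b_s))
    return numerator / norm if norm else 0.0


def exact_match(a, b):
    """Stand-in similarity: 1.0 for identical terms, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


score = f_wps(["pluto"], ["pluto", "pictures"], exact_match)
print(score)
```

With the stand-in similarity, $W$ is the identity matrix and the score reduces to cosine similarity between the two binary vectors; a graded similarity would additionally reward related but non-identical terms.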
Cont...
Query Independent Features
3. Selective POS Percentage (SPP)

\[ f_{SPP}(S) = \frac{\sum_{t \in S} I(POS(t) \in M)}{|S|} \]

where $POS(t)$ returns the part-of-speech tag of a term $t$ and $M$ is the set of selective POS tags, such as

\[ M = \{Noun, Adjective, Verb, Adverb\} \]

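The SPP feature can be sketched in Python as below. To keep the example self-contained, it assumes the subtopic has already been tagged by some POS tagger that emits Penn Treebank tags; the hand-written tags in the example, the mapping in `coarse_pos`, and the function names are illustrative assumptions.

```python
# M from the slide: the selective part-of-speech categories.
SELECTIVE_POS = {"NOUN", "ADJ", "VERB", "ADV"}


def coarse_pos(penn_tag):
    """Map a Penn Treebank tag to a coarse category (assumed mapping)."""
    if penn_tag.startswith("NN"):
        return "NOUN"
    if penn_tag.startswith("JJ"):
        return "ADJ"
    if penn_tag.startswith("VB"):
        return "VERB"
    if penn_tag.startswith("RB"):
        return "ADV"
    return "OTHER"


def f_spp(tagged_terms):
    """Fraction of subtopic terms whose POS category is in SELECTIVE_POS."""
    if not tagged_terms:
        return 0.0
    hits = sum(1 for _, tag in tagged_terms if coarse_pos(tag) in SELECTIVE_POS)
    return hits / len(tagged_terms)


# Hand-tagged example; a real pipeline would obtain tags from a POS tagger.
tagged = [("how", "WRB"), ("to", "TO"), ("cook", "VB"),
          ("pork", "NN"), ("tenderloin", "NN")]
print(f_spp(tagged))
```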
Subtopic Clustering

Query: Pluto
    Intent: Picture of pluto
        Pluto pictures
        Pluto pictures nasa
        Pictures of pluto
        Latest pictures of pluto
    Intent: Pluto planet
        Pluto reinstated as a planet
        Pluto a planet again
        Pluto planet

Query: Pork tenderloin
    Intent: Pork tenderloin cooking
        Pork tenderloin cooking instructions
        Pork tenderloin cooking time
        How to cook pork tenderloin
    Intent: Pork tenderloin recipe
        Baked pork tenderloin recipe
        Pork tenderloin recipes crock pot
        Pork tenderloin recipes oven
        Pork tenderloin recipes

Subtopic Ranking

Round-Robin Selection (𝐶):

Input: List of clusters 𝐶, where the subtopics inside each cluster are already ranked.
Output: Ranked subtopics, 𝑅
1: 𝑅 = [] // ranked subtopics
2: While at least one cluster in 𝐶 is non-empty
3:     For each non-empty cluster 𝑐𝑖 in 𝐶
4:         𝑆 = remove the top-ranked subtopic from 𝑐𝑖
5:         append 𝑆 to 𝑅
6: return 𝑅

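The round-robin selection above translates to a few lines of Python. This is a sketch under the stated input assumption (each cluster is already sorted best-first, and the clusters themselves are in their ranked order); the function name and the example clusters, taken from the Pluto illustration, are chosen here for demonstration.

```python
from collections import deque


def round_robin_select(clusters):
    """Interleave subtopics by taking the top item of each cluster in turn."""
    queues = [deque(cluster) for cluster in clusters]  # copy; inputs stay untouched
    ranked = []
    while any(queues):            # an empty deque is falsy
        for queue in queues:
            if queue:
                ranked.append(queue.popleft())
    return ranked


clusters = [
    ["pluto pictures", "pluto pictures nasa", "pictures of pluto"],
    ["pluto planet", "pluto a planet again"],
]
ranked = round_robin_select(clusters)
print(ranked)
```

Because one subtopic is taken from every cluster before any cluster contributes a second one, the top of the final list covers as many intents as possible.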
Dataset

• NTCIR-10 INTENT-2 English Subtopic Mining test collection
• NTCIR-12 IMine-2 English Subtopic Mining test collection

Experimental Setup
Table 1: Configuration of different runs

Run Name              Run Configuration
Linear Ranking        Linear Ranking with Extracted Features
Subtopic Clustering   Subtopic Clustering + Round Robin Selection from Cluster
Cluster Ranking       Round Robin Selection

Experimental Result
Top 10 ranked list

Feature Importance
Random Forest:

Figure: Importance estimation of selected features

Conclusion and Future Directions

• Conclusion
    Introduced a subtopic mining method based on cluster ranking
    Proposed three features:
        Average Concept Similarity
        WordNet Path Similarity
        Selective POS Percentage

• Future Directions
    Design subtopic patterns
    Explore more features
    Search result diversification using mined subtopics

Future Plan

Figure: planned extended pipeline. Stages named in the figure: Query; Search Engine Suggestions (Google, Yahoo, Bing); Subtopic Candidates; Feature Extraction; Subtopic Relevance Estimation; Subtopic Clustering; Cluster Diversification; Ranked Clusters; Subtopic Diversification within Ranked Cluster; Subtopic Ranking; Ranked Subtopics.

References
• [1] R. L. Santos, C. Macdonald, and I. Ounis, “Exploiting query reformulations for web search result diversification,” in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 881–890.

• [2] S. J. Kim, J. Shin, and J. H. Lee, “Subtopic mining based on three-level hierarchical search intentions,” in Advances in Information Retrieval. Springer, 2016, pp. 741–747.

• [3] X. Ren, Y. Wang, X. Yu, J. Yan, Z. Chen, and J. Han, “Heterogeneous graph-based intent learning with queries, web pages and Wikipedia concepts,” in Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 2014, pp. 23–32.

• [4] J. Hu, G. Wang, F. Lochovsky, J. T. Sun, and Z. Chen, “Understanding user’s query intent with Wikipedia,” in Proceedings of the 18th International Conference on World Wide Web. ACM, 2009, pp. 471–480.

• [5] W. Zheng, H. Fang, C. Yao, and M. Wang, “Leveraging integrated information to extract query subtopics for search result diversification,” Information Retrieval, vol. 17, no. 1, pp. 52–73, 2014.

• [6] M. Shajalal, M. Z. Ullah, A. N. Chy, and M. Aono, “Query subtopic diversification based on cluster ranking and semantic features,” in 2016 International Conference on Advanced Informatics: Concepts, Theory and Application (ICAICTA). IEEE, 2016, pp. 1–6.