Solr Cluster Installation Tool "Anuenue" and "Did You Mean?" For Japanese

Takahiko Ito
mixi, Inc.
1
mixi?
• One of the largest social networking services in Japan.
• Many services that promote communication among users.
  • Blog, news, game platform, etc.
• Most of the services come with search.
• 15M monthly active users
2
Our current (urgent) project …
Replace our in-house search engines with an up-to-date search platform!
We have
• selected Apache Solr as the search platform
• created a simple OSS package (Anuenue) that wraps Solr
Project URL: http://code.google.com/p/anuenue-wrapper/
3
Why we made Anuenue
Deployment and daily operation of a Solr search cluster are somewhat difficult for ordinary engineers.
• We need to edit the configuration files of every Solr instance individually.
• Commands that operate on a whole cluster are not provided.
  • We need to write client commands ourselves.
  • By contrast, Hadoop provides utility commands for clusters, e.g., start-all.sh (start processes), fsck (check all disks), balancer (rebalance the data blocks).
What does Anuenue provide?
• Handy configuration of search clusters
• Commands for clusters
  • Simple commands (post, delete, update, commit, etc.)
  • Start and stop commands for the processes in a cluster
• Japanese support
  • Implementation of Japanese Did-You-Mean facilities
  • Japanese tokenizers (Sen and Kuromoji)
5
Today’s Topics
• Anuenue
  • Handy configuration of search clusters
  • Commands for search clusters
• Did-You-Mean facilities for Japanese queries
  • Common problem in Did-You-Mean implementations
  • Mining a Japanese Did-You-Mean dictionary from query log data
6
Cluster configuration with Anuenue
• Cluster setup is done with a special configuration file.
• Anuenue assigns one or more roles to each instance.
  • Roles are the functions an instance performs in a cluster.
  • Anuenue supports three roles (Master, Slave, Merger).
7
Role: master
• Indexes the input data.
NOTE: Anuenue provides a command that distributes the input data to the master instances (building Solr shard indexes).
[Diagram: the Input Data is split across Master-1, Master-2 and Master-3, each of which builds a shard index.]
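One way such a distribution command could work is to hash each document ID and use the hash to pick a master. The sketch below is only an illustration under that assumption; the slides do not say how Anuenue splits the data, and `assign_shard`, `distribute` and the host names are hypothetical.

```python
import zlib
from collections import defaultdict

def assign_shard(doc_id, masters):
    """Pick a master for a document by hashing its ID (stable across runs)."""
    index = zlib.crc32(doc_id.encode("utf-8")) % len(masters)
    return masters[index]

def distribute(doc_ids, masters):
    """Group document IDs by the master that would index them."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[assign_shard(doc_id, masters)].append(doc_id)
    return dict(shards)

# Hypothetical master instances (host:port).
masters = ["master-1:8983", "master-2:8983", "master-3:8983"]
shards = distribute([f"doc{i}" for i in range(9)], masters)
for master, docs in sorted(shards.items()):
    print(master, docs)
```

Hashing keeps the assignment deterministic, so re-posting the same document sends it to the same shard.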
8
Role: slave
Has three functions:
• Copies (replicates) the index from a master.
• Accepts queries from mergers and then searches its own index.
• Returns the results to the merger instance.
[Diagram: Merger-1 submits queries to Slave-1 and Slave-2; the slaves replicate the indexes of Master-1 and Master-2, which index the Input Data.]
9
Role: merger
• Forwards queries from clients to slaves.
  • Note: clients do not need to know the slave instances (the merger adds the ‘shards’ parameter listing the slave instances).
• Merges the results from all the slave instances and returns the merged results.
[Diagram: Client-1 and Client-2 submit queries to the Merger, which forwards the queries to Slave-1 and Slave-2.]
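The forwarding step can be sketched as building a Solr distributed-search URL. This is a minimal illustration, not Anuenue's code: `build_distributed_query` is a hypothetical helper, and the `/solr/select` path and host names are assumptions. The `shards` parameter itself is Solr's standard way to list the instances a query should be fanned out to.

```python
from urllib.parse import parse_qs, urlencode, urlparse

def build_distributed_query(merger, slaves, query):
    """Build the URL a merger could use to fan a query out to its slaves."""
    params = {
        "q": query,
        # Solr's distributed search takes a comma-separated list of shards.
        "shards": ",".join(f"{host}:{port}/solr" for host, port in slaves),
    }
    host, port = merger
    return f"http://{host}:{port}/solr/select?{urlencode(params)}"

# Hypothetical cluster: merger on aa, slaves on dd and ee.
url = build_distributed_query(("aa", 8983), [("dd", 8983), ("ee", 8983)], "python")
print(url)
print(parse_qs(urlparse(url).query)["shards"][0])
```

Because the merger fills in `shards`, a client only needs the merger's address and the plain `q` parameter.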
10
Example: Anuenue cluster
The cluster consists of five machines.
• Each machine has one Anuenue instance.
Instances
• Merger: aa
• Master: bb, cc
• Slave: dd, ee
[Diagram: Client-1 and Client-2 send queries to the merger, which forwards them to the slaves; the slaves replicate the indexes of the masters, which index the Input Data.]
11
How to assign roles to instances?
Edit the cluster configuration file, anuenue-nodes.xml.
• Add three elements (mergers, slaves and masters).
• In each element, add the information (machine name and port number) for one or more instances.
12
Configuration example
Case: there is one merger instance on machine aa (port 7000).
<mergers>
  <merger>
    <host>aa</host>
    <port>7000</port>
  </merger>
</mergers>
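To check what such a fragment declares, it can be read with Python's standard XML parser. The sketch below is an illustration only: the `<nodes>` root element is an assumption (the slides show only the inner elements), and `read_instances` is a hypothetical helper, not part of Anuenue.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content; the <nodes> root element is an assumption.
NODES_XML = """
<nodes>
  <mergers>
    <merger>
      <host>aa</host>
      <port>7000</port>
    </merger>
  </mergers>
</nodes>
"""

def read_instances(xml_text, group, role):
    """Return (host, port) tuples for every <role> element under <group>."""
    root = ET.fromstring(xml_text)
    return [
        (node.findtext("host"), int(node.findtext("port")))
        for node in root.findall(f"{group}/{role}")
    ]

mergers = read_instances(NODES_XML, "mergers", "merger")
print(mergers)
```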
13
Specify the index to replicate
• Name each master instance with the iname attribute.
• In each slave, add a replicate element that names the master instance whose index is copied.
<masters>
  <master iname="master1">
    <host>aaaa</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    <host>bbbb</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>
14
Example: simple cluster settings
[Diagram: Client-1 and Client-2 send queries to aa; aa forwards queries to cc; cc replicates the index of bb; bb indexes the Input Data.]
<mergers>
  <merger>
    <host>aa</host>
    <port>8983</port>
  </merger>
</mergers>
<masters>
  <master iname="master1">
    <host>bb</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    <host>cc</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>
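A common mistake in a file like this is a replicate element that names a master that does not exist. The following sketch shows one way to catch that before starting the cluster. It is not an Anuenue feature: `dangling_replicas` is a hypothetical helper, and the `<nodes>` root element is again an assumption.

```python
import xml.etree.ElementTree as ET

# The settings above, wrapped in an assumed <nodes> root element.
CLUSTER_XML = """
<nodes>
  <mergers><merger><host>aa</host><port>8983</port></merger></mergers>
  <masters><master iname="master1"><host>bb</host><port>8983</port></master></masters>
  <slaves><slave><host>cc</host><port>8983</port><replicate>master1</replicate></slave></slaves>
</nodes>
"""

def dangling_replicas(xml_text):
    """Return hosts of slaves whose <replicate> names no declared master."""
    root = ET.fromstring(xml_text)
    master_names = {m.get("iname") for m in root.findall("masters/master")}
    return [
        slave.findtext("host")
        for slave in root.findall("slaves/slave")
        if slave.findtext("replicate") not in master_names
    ]

problems = dangling_replicas(CLUSTER_XML)
print(problems or "all slaves point at a declared master")
```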
15
Cluster setup with Anuenue
• Flexible: supports various types of search clusters.
• For example…
16
Assign multiple roles
[Diagram: Client1 and Client2 submit queries to a single instance, which also indexes the Input Data.]
17
Large clusters to handle huge data with high QPS
[Diagram: Client1 … ClientN submit queries to Merger1–Merger3, which forward them to Slave1–Slave6; the slaves replicate the indexes of Master1–Master6, which index the Input Data.]
18
After setting up the cluster
We can make use of commands for clusters.
Anuenue provides
• start / stop commands
• commands to manipulate the index
Start and stop clusters
Users can start or stop a cluster with a single command (anuenue-distdaemon.sh).
Usage:
$ sh bin/anuenue-distdaemon.sh [start|stop]
Simple commands for clusters
Anuenue also provides basic commands (‘post’, ‘delete’, ‘commit’, ‘optimize’ and ‘update’) for search clusters.
• The commands are multi-threaded.
E.g.,
$ sh bin/anuenue-distcommands.sh post -arg inputDir
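The idea of a multi-threaded cluster command can be sketched with a thread pool that sends the same command to every instance in parallel. This is an assumption-laden illustration, not Anuenue's implementation: `run_on_cluster` is hypothetical, and `fake_send` stands in for the HTTP request a real client would make to each Solr instance.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_cluster(instances, command, send):
    """Run `command` on every instance in parallel and collect the results."""
    with ThreadPoolExecutor(max_workers=len(instances)) as pool:
        futures = {inst: pool.submit(send, inst, command) for inst in instances}
        return {inst: future.result() for inst, future in futures.items()}

def fake_send(instance, command):
    # Stand-in for an HTTP request to a Solr instance.
    return f"{command} ok on {instance}"

# Hypothetical master instances.
results = run_on_cluster(["bb:8983", "cc:8983"], "commit", fake_send)
for instance, outcome in results.items():
    print(instance, "->", outcome)
```

Passing the sender in as a function keeps the sketch runnable without a live cluster.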
Today’s Topics
• Anuenue
  • Handy configuration of search clusters
  • Commands for search clusters
• Did-You-Mean facilities for Japanese queries
  • Common problem in Did-You-Mean implementations
  • Mining a Japanese Did-You-Mean dictionary from query log data
22
What is a Did-You-Mean service?
• Suggests the correct spelling when users submit misspelled queries
• Increases the usability of the search service
23
Example: Did-You-Mean service
(English: Ugly Betty)
24
Common implementation
Many search engines (including Solr) apply distance measures such as edit distance [Levenshtein, 1965].
Edit distance: a measure of the distance between two sequences.
Simply speaking, the fewer edits needed to turn one sequence into the other, the smaller the distance.
E.g.,
like → likes (small distance)
like → foobar (large distance)
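For reference, the standard recurrence for the Levenshtein edit distance between strings a (length m) and b (length n) can be written as below. The symbols d, a_i and b_j are notation introduced here, not taken from the slides; the distance is d(m, n), and the bracket is 1 when the characters differ and 0 otherwise.

```latex
\begin{aligned}
d(i, 0) &= i, \qquad d(0, j) = j, \\
d(i, j) &= \min
\begin{cases}
  d(i-1, j) + 1 & \text{(deletion)} \\
  d(i, j-1) + 1 & \text{(insertion)} \\
  d(i-1, j-1) + [a_i \neq b_j] & \text{(substitution or match)}
\end{cases}
\end{aligned}
```

With this definition, like → likes has distance 1 (one insertion), while like → foobar has distance 6 (the strings share no characters, so four substitutions and two insertions are needed).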
25
Common procedure: Did-You-Mean
When a user submits a query,
1. The Did-You-Mean service computes the edit distance between the input query and the words in the index.
2. If there is a word whose distance is small, the Did-You-Mean handler suggests it.
E.g., when a user submits the query “pthon”, the Did-You-Mean service suggests “python”, a word in the index with a small distance.
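The procedure above can be sketched in a few lines. This is a generic illustration, not Solr's spellchecker: `edit_distance` is a plain Levenshtein implementation, and `suggest`, the word list and the `max_distance` threshold of 1 are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]

def suggest(query, index_words, max_distance=1):
    """Return the closest index word within max_distance, else None."""
    if query in index_words:
        return None  # the query is already a known word
    best = min(index_words, key=lambda w: edit_distance(query, w), default=None)
    if best is not None and edit_distance(query, best) <= max_distance:
        return best
    return None

index_words = ["python", "java", "tomcat"]  # hypothetical index vocabulary
print(suggest("pthon", index_words))
```

A real service would avoid scanning the whole vocabulary for every query, but the scan keeps the sketch short.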
26
Problem: Japanese queries
A simple application of edit distance does not work for Japanese.
• Misspelled queries are sometimes totally different from the correct one (large distance).
E.g.,
• 墨ともふどうさん (correct: 住友不動産)
• 米事案セット (correct: ベイジアンセット)
• These cases arise from the Japanese input method.
27
Typing a Japanese query
We input Japanese (query) words in two steps.
1. Type the reading of the Japanese word in the Latin alphabet.
2. Select the desired word from the list of candidates.
Step 2 can cause a spelling mistake whose distance to the correct spelling is too large.
28
Example: Typing in Japanese queries
Assume a user wants to submit the query オバマ (Obama).
1. Type the reading in the Latin alphabet.
   Reading: obama
2. Select the correct spelling.
   Possible candidates: オバマ (correct), おばま, 小浜, etc.
29
Japanese Did-You-Mean dictionary
• Because of the large-distance problem, simple distance measures (edit distance) do not work.
• To handle this problem, Anuenue supports a special dictionary for the Japanese Did-You-Mean service.
30
Dictionary for Japanese Did-You-Mean service
The dictionary has two columns:
1. Query with mistakes
2. Correct query

Query with mistakes | Correct query
墨ともふどうさん | 住友不動産
歌だ光る | 宇多田ヒカル
米事案セット | ベイジアンセット
31
Implementing Did-You-Mean service with the dictionary
• When a user submits a query that appears in the dictionary as a query with mistakes, the Did-You-Mean service suggests the corresponding correct query.
NOTE: Anuenue provides handlers for the dictionary format.
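A dictionary-based lookup is simple enough to sketch directly. The example below is not Anuenue's handler: it assumes a tab-separated file with one entry per line (the slides do not state the delimiter), and `load_dictionary` and `did_you_mean` are hypothetical names. The entries are the three pairs shown in the slides.

```python
# Assumed format: "query with mistakes<TAB>correct query", one entry per line.
DICTIONARY_TSV = (
    "墨ともふどうさん\t住友不動産\n"
    "歌だ光る\t宇多田ヒカル\n"
    "米事案セット\tベイジアンセット\n"
)

def load_dictionary(text):
    """Parse the two-column dictionary into a {misspelled: correct} mapping."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        wrong, correct = line.split("\t")
        entries[wrong] = correct
    return entries

def did_you_mean(query, dictionary):
    """Return the correct query for a known misspelling, else None."""
    return dictionary.get(query)

dictionary = load_dictionary(DICTIONARY_TSV)
print(did_you_mean("歌だ光る", dictionary))
```

Because the lookup is an exact match, it sidesteps the large-distance problem entirely; its coverage depends on how the dictionary was built.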
32
Problem…
How can we create the dictionary?
• We can use Oluolu, a query log mining tool.
33
Oluolu
• Creates a spelling correction dictionary from a query log
  • Extracts pairs of queries (query with spelling mistakes, query with correct spelling)
• Supports Japanese spelling mistakes (from version 0.2)
• Runs on the Hadoop framework
Project URL: http://code.google.com/p/oluolu/
34
Input to Oluolu: query log
Three columns:
1. User Id
2. Query string
3. Time of query submission

User Id | Query | Time
438904 | Pthon | 2009-11-21 11:16:12
34443 | Java | 2009-11-21 12:16:13
438904 | Python | 2009-11-21 12:16:20
8975 | Java Tomcat | 2009-11-21 12:16:25
35
Procedure: creating a Japanese Did-You-Mean dictionary with Oluolu
Oluolu extracts the entries of the Japanese Did-You-Mean dictionary in two steps.
1. Extract all the query pairs in the same session.
2. Validate the query pairs.
36
Step 1: extract query pairs
• Oluolu extracts pairs of queries in the same session.
  E.g., Oluolu extracts the pair (Pthon, Python).
• Queries in the same session: a set of queries submitted by the same user within a short time range.
• Extracted pairs are candidates for (misspelled query, correct query).

User ID | Query | Time
438904 | Pthon | 2009-11-21 12:16:12
34443 | Java | 2009-11-21 12:16:13
438904 | Python | 2009-11-21 12:16:20
8975 | Tomcat | 2009-11-21 12:16:25
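Step 1 can be sketched on a single machine as follows. Oluolu itself runs on Hadoop, so this is only an illustration of the pairing logic: `session_pairs` is a hypothetical function, and the five-minute session window is an assumption (the slides say only "small time range").

```python
from datetime import datetime, timedelta
from itertools import groupby

# The log rows from the slide: (user ID, query, time).
LOG = [
    ("438904", "Pthon", "2009-11-21 12:16:12"),
    ("34443", "Java", "2009-11-21 12:16:13"),
    ("438904", "Python", "2009-11-21 12:16:20"),
    ("8975", "Tomcat", "2009-11-21 12:16:25"),
]

def session_pairs(log, window=timedelta(minutes=5)):
    """Return (earlier, later) query pairs from the same user within `window`."""
    parsed = sorted(
        (user, datetime.strptime(time, "%Y-%m-%d %H:%M:%S"), query)
        for user, query, time in log
    )
    pairs = []
    # After sorting, each user's queries are adjacent and in time order.
    for _, rows in groupby(parsed, key=lambda row: row[0]):
        rows = list(rows)
        for i, (_, t1, q1) in enumerate(rows):
            for _, t2, q2 in rows[i + 1:]:
                if t2 - t1 <= window and q1 != q2:
                    pairs.append((q1, q2))
    return pairs

print(session_pairs(LOG))
```

The pairs produced here are only candidates; unrelated queries typed in the same session also end up paired, which is why a validation step follows.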
37
Step 2: validate candidate pairs
• Oluolu validates all the query pairs extracted in step 1.
• In the validation phase (step 2), Oluolu makes use of query readings.
38
Reading of Japanese words
• Japanese words can be converted into readings in the Latin alphabet.
  • こんにちは (reading: konnichiha)
  • 伊藤 (reading: itou)
FACT: even when a misspelled Japanese query is totally different from the correct query, the readings are the same or the distance between them is small!
39
Validate candidate pairs with readings
Given a query pair, Oluolu validates it in two steps.
1. Convert the queries into readings in the Latin alphabet.
2. Compute the edit distance between the two readings.
• When the distance is small, the two queries are extracted as an entry of the Did-You-Mean dictionary.
40
Example: step 2
Given a pair of queries: (墨ともふどうさん, 住友不動産)
1. Convert them into readings.
   • The readings are the same: “sumitomofudousan”.
2. Compute the distance between the readings.
   • The distance is zero.
• The pair is extracted as an entry of the Did-You-Mean dictionary.
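The validation step can be sketched as below. Converting Japanese text to its reading normally requires a morphological analyzer, so this example substitutes a small hand-written `READINGS` table; that table, the `is_valid_pair` name and the `max_distance` threshold of 1 are all assumptions for illustration, not Oluolu's implementation.

```python
# Toy reading table standing in for a real Japanese reading converter.
READINGS = {
    "墨ともふどうさん": "sumitomofudousan",
    "住友不動産": "sumitomofudousan",
    "オバマ": "obama",
    "東京": "toukyou",
}

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]

def is_valid_pair(wrong, correct, max_distance=1):
    """Accept a candidate pair when the two readings are close."""
    r1, r2 = READINGS.get(wrong), READINGS.get(correct)
    if r1 is None or r2 is None:
        return False  # no reading available, so the pair cannot be validated
    return edit_distance(r1, r2) <= max_distance

print(is_valid_pair("墨ともふどうさん", "住友不動産"))
print(is_valid_pair("オバマ", "東京"))
```

The first pair passes because both readings are identical; the second is rejected because the readings are far apart even though the queries could share a session.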
41
Creating a Japanese Did-You-Mean dictionary with Oluolu
• Installation requirements
  • Java 1.6.0 or greater
  • Hadoop 0.20.0 or greater
  • Oluolu 0.2.0 or greater
• Copy the input query log into HDFS
• Run the spellcheck task of Oluolu
$ bin/oluolu spellcheck \
    -input testInput.txt \
    -output output \
    -inputLanguage ja
42
Preliminary experiments
• Experimental settings
  • Input data: log file from a mixi service (community search), 5 GB
• Extracted dictionary
  • The number of entries is over 100,000.
  • Succeeded in extracting query pairs with a large edit distance:
    • (議 Ν, ギニュー)
    • (不動有利, 不動裕理)
Current status
• Finished functional tests and stress tests.
• Now replacing the in-house search engine of a small search service with Anuenue.
• In the next phase, we will apply Anuenue to search services with large data and high QPS.
44
Future work
• Integrate SolrCloud and ZooKeeper
  • Support failover and index rebalancing
• Kuromoji, a new OSS Japanese tokenizer
45
Summary
• Introduced Anuenue
• Described a Did-You-Mean facility for Japanese queries
46
Thank you for your attention!
47
