Solr Cluster Installation Tool "Anuenue" and "Did You Mean?" For Japanese
Takahiko Ito
mixi, Inc.
1
mixi?
One of the largest social networking services in Japan.
Many services to promote communication among users.
Blog, news, game platform, etc.
Most of the services come with search.
15M monthly active users
2
Our current (urgent) project …
Replace in-house search engines with an up-to-date search platform!
We have
selected Apache Solr as the search platform!
created a simple OSS package (Anuenue) which wraps Solr
3
Why we made Anuenue
Deployment and daily operation of a Solr search cluster are a bit difficult for ordinary engineers.
We need to edit the configuration files of every Solr instance individually.
Commands for whole clusters are not provided.
• We need to write client commands ourselves.
• Hadoop, by comparison, provides utility commands for clusters.
E.g., start-all.sh (start processes), fsck (check all disks), balancer (rebalance the data blocks)
What does Anuenue provide?
Handy configuration of search clusters
Commands for clusters
Simple commands (post, delete, update, commit etc)
Start and stop commands for processes in cluster.
Japanese support
Implementation of Japanese Did-You-Mean facilities
Japanese tokenizer (Sen and Kuromoji)
5
Today’s Topics
Anuenue
Handy configuration of search clusters
Commands for search clusters
6
Cluster configuration with Anuenue
Cluster setup is done with a special configuration file
7
Role: master
Index input data.
Input Data
8
Role: slave
Has three functions
Copy (replicate) the index from the master
Accept queries from mergers and then search its own index
Return the results to the merger instance
(Diagram: Merger-1 submits queries to Slave-1 and Slave-2; the slaves replicate the index from Master-1 and Master-2, which receive the input data.)
9
Role: merger
Forwards queries from clients to slaves.
Note: clients need not know the slave instances (the merger adds the ‘shards’ parameter listing the slave instances).
Merges the results from all the slave instances and returns the merged results.
(Diagram: Client-1 and Client-2 submit queries to the Merger, which forwards them to Slave-1 and Slave-2.)
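The forwarding step can be sketched as follows. This is a minimal illustration, not Anuenue's actual code: the host names `slave-1` and `slave-2`, the `/solr` path, and the function name `build_distributed_query` are assumptions made for the example. Only the `shards` parameter itself (a comma-separated list of `host:port/path` entries) is standard Solr distributed search.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_distributed_query(merger, slaves, query):
    """Return the URL a merger could use to fan a query out to its slaves."""
    # Solr distributed search takes a comma-separated `shards` list.
    # The "/solr" path is an assumption; adjust it to the real deployment.
    shards = ",".join(f"{host}:{port}/solr" for host, port in slaves)
    params = urlencode({"q": query, "shards": shards})
    host, port = merger
    return f"http://{host}:{port}/solr/select?{params}"

# Hypothetical hosts: one merger on aa, two slaves.
url = build_distributed_query(("aa", 8983), [("slave-1", 8983), ("slave-2", 8983)], "python")
print(parse_qs(urlsplit(url).query)["shards"][0])
```

Because the merger builds this list itself, clients only ever need the merger's address.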
10
Example: Anuenue cluster
The cluster consists of five machines.
Each has one Anuenue instance.
(Diagram: Client-1 and Client-2; machine aa forwards queries; instances cc and dd; input data.)
11
How to assign roles to instances?
12
Configuration example
Case: there is one merger instance on machine aa (port 7000)
<mergers>
<merger>
<host>aa</host>
<port>7000</port>
</merger>
</mergers>
13
Specify the index to replicate
Add the name of the master instance with the iname attribute.
Specify the master instance to copy the index from by adding a replicate element.
<masters>
  <master iname="master1">
    <host>aaaa</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    <host>bbbb</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>
14
Example: simple cluster settings
<mergers>
  <merger>
    <host>aa</host>
    <port>8983</port>
  </merger>
</mergers>
<masters>
  <master iname="master1">
    <host>bb</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    <host>cc</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>
(Diagram: Client-1 and Client-2 query aa; aa forwards queries to cc; cc replicates the index from bb; bb indexes the input data.)
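To see how the iname and replicate elements tie a slave to its master, the settings above can be read with a few lines of Python. This is only a sketch: the slides do not show the root element of the configuration file, so `<nodes>` is a placeholder, and the helper names `address` and `replication_map` are made up for the example.

```python
import xml.etree.ElementTree as ET

# The <nodes> root is a placeholder; the real root element is not shown in the slides.
CONFIG = """
<nodes>
  <mergers>
    <merger><host>aa</host><port>8983</port></merger>
  </mergers>
  <masters>
    <master iname="master1"><host>bb</host><port>8983</port></master>
  </masters>
  <slaves>
    <slave><host>cc</host><port>8983</port><replicate>master1</replicate></slave>
  </slaves>
</nodes>
"""

def address(node):
    """Format an instance element as host:port."""
    return f"{node.findtext('host')}:{node.findtext('port')}"

def replication_map(xml_text):
    """Map each slave address to the address of the master it replicates from."""
    root = ET.fromstring(xml_text.strip())
    masters = {m.get("iname"): address(m) for m in root.iter("master")}
    return {address(s): masters[s.findtext("replicate")] for s in root.iter("slave")}

print(replication_map(CONFIG))
```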
15
Cluster setup with Anuenue
Flexible, and supports various types of search clusters.
For example…
16
Assign multiple roles
(Diagram: Client1 and Client2 submit queries to a single instance, which also takes the input data.)
17
Large clusters to handle huge data with high QPS
(Diagram: Client1, Client2, Client3, …, ClientN; input data.)
18
After setting up the cluster
We can make use of commands for clusters.
Anuenue provides
start / stop commands
commands to manipulate the index
Start and stop clusters
Users can start / stop a whole cluster with a single command (anuenue-distdaemon.sh).
Usage:
$ sh bin/anuenue-distdaemon.sh [start|stop]
Simple commands for clusters
Anuenue also provides basic commands (‘post’, ‘delete’, ‘commit’, ‘optimize’ and ‘update’) for search clusters.
The commands are multi-threaded.
E.g.,
$ sh bin/anuenue-distcommands.sh post -arg inputDir
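The idea of running one command against every instance in parallel can be sketched with a thread pool. This is not how anuenue-distcommands.sh is implemented (the slides do not show that); `run_on_cluster`, `fake_send`, and the instance addresses are hypothetical, and `fake_send` stands in for a real HTTP call to a Solr instance.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_cluster(instances, command, send):
    """Apply send(instance, command) to every instance in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(instances))) as pool:
        # pool.map keeps the results in the same order as the instances.
        results = pool.map(lambda inst: send(inst, command), instances)
        return dict(zip(instances, results))

def fake_send(instance, command):
    """Stand-in for a real request to a Solr instance (assumption)."""
    return f"{command} ok on {instance}"

print(run_on_cluster(["bb:8983", "cc:8983"], "commit", fake_send))
```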
Today’s Topics
Anuenue
Handy configuration of search clusters
Commands for search clusters
22
What is a Did-You-Mean service?
Suggests the correct spelling when users submit queries with mistakes.
Increases the usability of the search service.
23
Example: Did-You-Mean service
24
Common implementation
Many search engines (including Solr) apply distance measures such as edit distance [Levenshtein, 1965].
E.g.,
like → likes (small distance)
like → foobar (large distance)
25
Common procedure: Did-You-Mean
When a user submits a query,
1. The Did-You-Mean service computes the edit distance between the input query and the words in the index.
2. If there is a word whose distance is small, the Did-You-Mean handler suggests that word.
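The two steps above can be sketched in a few lines. This is a plain dynamic-programming Levenshtein distance, not Solr's spellchecker; the function names, the sample index words, and the threshold `max_distance=1` are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def suggest(query, index_words, max_distance=1):
    """Return the closest index word if it is near enough, else None."""
    if query in index_words:
        return None  # the query is already a known word
    distance, word = min(((edit_distance(query, w), w) for w in index_words),
                         default=(max_distance + 1, None))
    return word if distance <= max_distance else None

print(edit_distance("like", "likes"), edit_distance("like", "foobar"))
print(suggest("Pthon", ["Python", "Java", "Tomcat"]))
```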
26
Problem: Japanese queries
27
Typing in Japanese query
We input Japanese (query) words in two steps.
1. Type the reading of the Japanese word in the Latin alphabet.
2. Select the desired word from the list of candidates.
28
Example: Typing in Japanese queries
Assume a user wants to submit a query:
オバマ (Obama)
29
Japanese Did-You-Mean dictionary
Because of the large distance problem, simple distance
measures (edit distance) do not work.
30
Dictionary for Japanese Did-You-Mean service
歌だ光る → 宇多田ヒカル
米事案セット → ベイジアンセット
31
Implementing Did-You-Mean service with the dictionary
When users submit a query with mistakes that is in the dictionary, the Did-You-Mean service suggests the correct query.
Query with mistakes    Correct query
墨ともふどうさん       住友不動産
歌だ光る               宇多田ヒカル
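With such a dictionary, serving a suggestion reduces to an exact-match lookup. The sketch below assumes the dictionary fits in memory as a Python dict; the name `did_you_mean` is made up for the example, and the entries are the pairs shown on the slides.

```python
# Pairs from the slides: query with mistakes -> correct query.
DID_YOU_MEAN = {
    "墨ともふどうさん": "住友不動産",
    "歌だ光る": "宇多田ヒカル",
    "米事案セット": "ベイジアンセット",
}

def did_you_mean(query, dictionary=DID_YOU_MEAN):
    """Return the correct query if the input is a known mistake, else None."""
    return dictionary.get(query)

print(did_you_mean("歌だ光る"))
```

No distance computation is needed at query time; the work moves to building the dictionary.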
32
Problem…
How can we create the dictionary?
We can make use of Oluolu, a query log mining tool.
33
Oluolu
Creates a spelling correction dictionary from a query log
Extracts pairs of queries (query with spelling mistakes, query with correct spelling)
Supports Japanese spelling mistakes (from version 0.2)
Runs on the Hadoop framework
34
Input to Oluolu: query log
Three columns
1. User Id
2. Query string
3. Time of query submission
User Id    Query    Time
438904     Pthon    2009-11-21 11:16:12
34443      Java     2009-11-21 12:16:13
35
Procedure: creating Japanese Did-You-Mean dictionary with Oluolu
Oluolu extracts the elements of the Japanese Did-You-Mean dictionary in two steps.
1. Extract all the query pairs in the same session
2. Validate the query pairs
36
Step1: extract query pairs
Oluolu extracts pairs of queries in the same session.
E.g., Oluolu extracts the pair (Pthon and Python).
Queries in the same session: a set of queries submitted by the same user within a small time range.
User ID    Query     Time
438904     Pthon     2009-11-21 12:16:12
34443      Java      2009-11-21 12:16:13
438904     Python    2009-11-21 12:16:20
8975       Tomcat    2009-11-21 12:16:25
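Step 1 can be sketched on a single machine as follows. Oluolu itself runs on Hadoop and its internals are not shown in the slides, so this is only an illustration: the 60-second window, the choice to pair consecutive queries only, and the name `extract_pairs` are assumptions.

```python
from datetime import datetime, timedelta
from itertools import groupby

# The log rows from the table above: (user id, query, time).
LOG = [
    ("438904", "Pthon", "2009-11-21 12:16:12"),
    ("34443", "Java", "2009-11-21 12:16:13"),
    ("438904", "Python", "2009-11-21 12:16:20"),
    ("8975", "Tomcat", "2009-11-21 12:16:25"),
]

def extract_pairs(log, window=timedelta(seconds=60)):
    """Pair consecutive, different queries by the same user within `window`."""
    rows = sorted((user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), query)
                  for user, query, ts in log)
    pairs = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        session = list(group)  # one user's queries, ordered by time
        for (_, t1, q1), (_, t2, q2) in zip(session, session[1:]):
            if q1 != q2 and t2 - t1 <= window:
                pairs.append((q1, q2))
    return pairs

print(extract_pairs(LOG))
```

Pairs found this way are only candidates; unrelated queries from the same session are filtered out in the validation step.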
38
Reading of Japanese words
Japanese words can be converted into their readings in the Latin alphabet.
こんにちは (reading: konnichiha)
伊藤 (reading: itou)
39
Validate candidate pair with reading
Given a query pair, Oluolu validates the queries in two steps.
1. Convert the queries into their readings in the Latin alphabet
2. Compute the edit distance between the two readings
When the distance is small, the two queries are extracted as an element of the Did-You-Mean dictionary.
40
Example: step 2
Given a pair of queries: ( 墨ともふどうさん , 住友不動産 )
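A worked version of this validation might look like the sketch below. Converting Japanese text to its reading normally requires a morphological analyzer; here a small hand-written table stands in for that step, and the romanized readings in it are assumptions supplied for the example, as are the function names and the threshold `max_distance=2`.

```python
from functools import lru_cache

# Stand-in for a real reading converter (assumed romanizations).
READINGS = {
    "墨ともふどうさん": "sumitomofudousan",
    "住友不動産": "sumitomofudousan",
    "オバマ": "obama",
}

def edit_distance(a, b):
    """Levenshtein distance, computed recursively with memoization."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0 or j == 0:
            return i + j
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    return d(len(a), len(b))

def is_valid_pair(query_a, query_b, max_distance=2):
    """Accept the pair when the readings of both queries are close."""
    return edit_distance(READINGS[query_a], READINGS[query_b]) <= max_distance

print(is_valid_pair("墨ともふどうさん", "住友不動産"))
print(is_valid_pair("墨ともふどうさん", "オバマ"))
```

The written forms of the first pair differ in almost every character, yet their readings coincide, which is why the comparison is done on readings rather than on the surface strings.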
41
Creating Japanese Did-You-Mean dictionary with Oluolu
Installation requirements
Java 1.6.0 or greater
Hadoop 0.20.0 or greater
Oluolu 0.2.0 or greater
Copy the input query log into HDFS
Run the spellcheck task of Oluolu
$ bin/oluolu spellcheck \
    -input testInput.txt \
    -output output \
    -inputLanguage ja
42
Preliminary experiments
Experimental settings
Input data: log file from a mixi service (community
search).
• 5 GB data
Extracted dictionary
The number of elements is over 100,000.
Succeeded in extracting query pairs with large edit distance.
• ( 議 Ν, ギニュー )
• ( 不動有利 , 不動裕理 )
Current status
Finished functional tests and stress tests.
Now replacing an in-house search engine in a small
search service with Anuenue.
In the next phase, we will apply Anuenue to a search service with large data and high QPS.
44
Future work
Integrate SolrCloud and Zookeeper
Support failover, and rebalance the index
45
Summary
Introduction of Anuenue
Described a Did-You-Mean facility for Japanese queries
46
Thank you for your attention!
47