Solr Cluster Installation Tool "Anuenue"


Solr Cluster installation tool "Anuenue" and "Did You Mean?" for Japanese

Takahiko Ito mixi, Inc.


1

mixi?
One of the largest social networking services in Japan, with 15M monthly active users. Many services to promote communication among users: blog, news, game platform, etc. Most of the services come with search.

Our current (urgent) project


Replace in-house search engines with an up-to-date search platform! We have selected Apache Solr as the search platform and created a simple OSS package (Anuenue) that wraps Solr. Project URL: http://code.google.com/p/anuenue-wrapper/

Reasons why we made Anuenue


Deployment and daily operations of a Solr search cluster are a bit difficult for ordinary engineers. We need to edit the configuration files of every Solr instance separately. Commands for the whole cluster are not provided, so we need to write client commands ourselves. By contrast, Hadoop provides utility commands for clusters, e.g., start-all.sh (start processes), fsck (check all disks), balancer (rebalance the data blocks).

What does Anuenue provide?


Handy configuration of search clusters. Commands for clusters: simple commands (post, delete, update, commit, etc.) and start/stop commands for the processes in a cluster. Japanese support: an implementation of Japanese Did-You-Mean facilities and Japanese tokenizers (Sen and Kuromoji).

Today's Topics
Anuenue: handy configuration of search clusters; commands for search clusters. Did-You-Mean facility for Japanese queries: the common problem in Did-You-Mean implementations; mining a Japanese Did-You-Mean dictionary from query log data.

Cluster configuration with Anuenue


Cluster setup is done with a special configuration file. Anuenue assigns one or more roles to each instance; roles are the functions in a cluster. Anuenue supports three roles (Master, Slave, Merger).

Role: master
Indexes input data. NOTE: Anuenue provides a command to distribute the input data to the master instances (building the Solr shard indexes).

(Diagram: the Input Data is distributed to Master-1, Master-2 and Master-3, which build the Solr shard indexes.)
8

Role: slave
Has three functions: copy (replicate) the index from a master; accept queries from mergers and search its own index; return the results to the merger instance.
(Diagram: Merger-1 submits queries to Slave-1 and Slave-2; the slaves replicate their indexes from Master-1 and Master-2, which index the Input Data.)
9

Role: merger
Forwards queries from clients to slaves. Note: clients do not need to know the slave instances (the merger adds the shards parameter listing the slave instances). Merges the results from all the slave instances and returns the merged result.
(Diagram: Client-1 and Client-2 submit queries to the Merger, which forwards them to Slave-1 and Slave-2.)
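For reference, this forwarding relies on Solr's standard distributed search: the query handled by the merger carries a shards parameter listing the slaves, roughly like the following (host names are taken from the example cluster shown later in these slides; the exact URL shape is illustrative, not Anuenue's literal output):

http://aa:8983/solr/select?q=python&shards=dd:8983/solr,ee:8983/solr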

10

Example: Anuenue cluster


The cluster consists of five machines, each running one Anuenue instance. Instances: merger aa; masters bb and cc; slaves dd and ee.
(Diagram: Client-1 and Client-2 query merger aa, which forwards the queries to slaves dd and ee; dd and ee replicate their indexes from masters bb and cc, which index the Input Data.)


11

How to assign roles to instances?


Edit the cluster configuration file, anuenue-nodes.xml. Add three elements (mergers, slaves and masters). In each element, add one or more instance entries (machine name and port number).

12

Configuration example
Case: there is one merger instance on machine aa (port 7000).

<mergers>
  <merger>
    <host>aa</host>
    <port>7000</port>
  </merger>
</mergers>

13

Specify the index to replicate


Add the name of a master instance with the iname attribute; in a slave, specify the master instance to copy the index from by adding a replicate element.

<masters>
  <master iname="master1">
    <host>aaaa</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    <host>bbbb</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>

14

Example: simple cluster settings


<mergers>
  <merger>
    <host>aa</host>
    <port>8983</port>
  </merger>
</mergers>
<masters>
  <master iname="master1">
    <host>bb</host>
    <port>8983</port>
  </master>
</masters>
<slaves>
  <slave>
    <host>cc</host>
    <port>8983</port>
    <replicate>master1</replicate>
  </slave>
</slaves>

(Diagram: Client-1 and Client-2 query merger aa; aa forwards queries to slave cc; cc replicates its index from master bb, which indexes the Input Data.)


15

Cluster setup with Anuenue


Flexible; supports various types of search clusters. For example:

16

Assign multiple roles

(Diagram: Client1 and Client2 submit queries to a single instance that holds multiple roles and indexes the Input Data.)

17

Large clusters to handle huge data with high QPS


(Diagram: Client1 through ClientN query Merger1-Merger3, which fan queries out to Slave1-Slave6; the slaves replicate their indexes from Master1-Master6, which index the Input Data.)
18

After setting up cluster


We can make use of commands for clusters. Anuenue provides start/stop commands and commands to manipulate the index.

Start and stop clusters


Users can start/stop a cluster with a single command (anuenue-distdaemon.sh). Usage: $ sh bin/anuenue-distdaemon.sh [start|stop]

Simple commands for clusters


Anuenue also provides basic commands (post, delete, commit, optimize and update) for the search cluster. The commands are implemented with multiple threads. E.g., $ sh bin/anuenue-distcommands.sh post -arg inputDir
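For example, a minimal indexing run might chain two of these commands. The post arguments come from the line above; the assumption here is that commit follows the same calling convention and needs no further arguments:

$ sh bin/anuenue-distcommands.sh post -arg inputDir
$ sh bin/anuenue-distcommands.sh commit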

Today's Topics
Anuenue: handy configuration of search clusters; commands for search clusters. Did-You-Mean facility for Japanese queries: the common problem in Did-You-Mean implementations; mining a Japanese Did-You-Mean dictionary from query log data.

22

What is Did-You-Mean service?


Suggests the correct spelling when users submit queries with mistakes. Increases the usability of the search service.

23

Example: Did-You-Mean service

(Screenshot: a Did-You-Mean suggestion for a misspelled Japanese query; in English: Ugly Betty.)

24

Common implementation
Many search engines (including Solr) apply distance measures such as edit distance [Levenshtein, 1965]. Edit distance is a measure of the distance between two sequences: simply speaking, when two sequences share more characters, the distance is smaller. E.g., "like" vs. "likes" (small distance); "like" vs. "foobar" (large distance).
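A minimal Java sketch of the edit distance computation, for illustration only (this is not Solr's or Anuenue's actual implementation):

static int editDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;          // delete the whole prefix of a
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;          // insert the whole prefix of b
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;   // substitution cost
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,             // deletion
                                        d[i][j - 1] + 1),            // insertion
                               d[i - 1][j - 1] + cost);              // match or substitution
        }
    }
    return d[a.length()][b.length()];
}
// editDistance("like", "likes")   -> 1  (small distance: a plausible suggestion)
// editDistance("like", "foobar")  -> 6  (large distance: not a suggestion)
// editDistance("pthon", "python") -> 1  (the example used on the next slide)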

25

Common procedure: Did-You-Mean


When a user submits a query: 1. The Did-You-Mean service computes the edit distance between the input query and the words in the index. 2. If there is a word whose distance is small, the Did-You-Mean handler suggests it. E.g., when a user submits the query "pthon", the service suggests the word in the index with a small distance, "python".

26

Problem: Japanese queries


Simple application of edit distance does not work for Japanese: misspelled Japanese queries are sometimes totally different from the correct one (large distance). These cases are caused by the Japanese input method.

27

Typing in Japanese query


We input Japanese (query) words in two steps: 1. Type the reading of the Japanese word in the Latin alphabet. 2. Select the desired word from a list of candidates. This second step causes spelling mistakes whose distance to the correct spelling is too large.

28

Example: Typing in Japanese queries


Assume a user wants to submit a query written in Japanese ("Obama"). 1. Type in the reading in the Latin alphabet: obama. 2. Select the correct spelling from the list of candidate words that share that reading, which includes the correct word and several others.

29

Japanese Did-You-Mean dictionary


Because of this large-distance problem, simple distance measures (edit distance) do not work. To handle this problem, Anuenue supports a special dictionary for the Japanese Did-You-Mean service.

30

Dictionary for Japanese Did-You-Mean service


The dictionary has two columns: 1. query with mistakes; 2. correct query.

31

Implementing Did-You-Mean service with the dictionary


When users submit a query that matches a "query with mistakes" entry in the dictionary, the Did-You-Mean service suggests the corresponding correct query. NOTE: Anuenue provides handlers for the dictionary format.
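Conceptually the dictionary lookup is just a map from a misspelled query to its correct form. A minimal Java sketch (not Anuenue's actual handler; the entry below is a hypothetical placeholder):

import java.util.HashMap;
import java.util.Map;

Map<String, String> didYouMean = new HashMap<String, String>();
didYouMean.put("pthon", "python");              // hypothetical entry: query with mistakes -> correct query
String userQuery = "pthon";
String suggestion = didYouMean.get(userQuery);  // null when there is nothing to suggest
if (suggestion != null) {
    System.out.println("Did you mean: " + suggestion + "?");
}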

32

Problem
How can we create the dictionary? We can make use of a query log mining tool, Oluolu.

33

Oluolu
Creates a spelling correction dictionary from a query log. Extracts pairs of queries (a query with spelling mistakes and the query with correct spelling). Supports Japanese spelling mistakes (from version 0.2). Runs on the Hadoop framework. Project URL: http://code.google.com/p/oluolu/

34

Input to Oluolu: query log


Three columns: 1. User ID; 2. Query string; 3. Time of query submission.

User ID   Query    Time
438904    Pthon    2009-11-21 12:16:12
34443     Java     2009-11-21 12:16:13
438904    Python   2009-11-21 12:16:20
8975      Tomcat   2009-11-21 12:16:25


35

Procedure: creating a Japanese Did-You-Mean dictionary with Oluolu


Oluolu extracts the entries of the Japanese Did-You-Mean dictionary in two steps: 1. Extract all the query pairs that occur in the same session. 2. Validate the query pairs.

36

Step 1: extract query pairs


Oluolu extracts pairs of queries submitted in the same session. E.g., from the query log shown above, Oluolu extracts the pair (Pthon, Python), both submitted by user 438904. Queries in the same session are a set of queries submitted by the same user within a small time range. An extracted pair can be a misspelled query and its correct query.
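A single-machine Java sketch of this session-pair idea (Oluolu itself runs this as a Hadoop job, so this only illustrates the logic, not Oluolu's code; the 10-minute window is an assumed threshold):

import java.util.*;

class SessionPairSketch {
    static class LogRecord {
        final String userId, query;
        final long time;  // query submission time (epoch millis)
        LogRecord(String userId, String query, long time) {
            this.userId = userId; this.query = query; this.time = time;
        }
    }

    static final long SESSION_WINDOW_MS = 10 * 60 * 1000;  // assumed session threshold

    // Group the log by user, sort each group by time, and emit consecutive
    // queries submitted within the same session window as candidate pairs.
    static List<String[]> extractCandidatePairs(List<LogRecord> log) {
        Map<String, List<LogRecord>> byUser = new HashMap<String, List<LogRecord>>();
        for (LogRecord r : log) {
            if (!byUser.containsKey(r.userId)) byUser.put(r.userId, new ArrayList<LogRecord>());
            byUser.get(r.userId).add(r);
        }
        List<String[]> pairs = new ArrayList<String[]>();
        for (List<LogRecord> records : byUser.values()) {
            Collections.sort(records, new Comparator<LogRecord>() {
                public int compare(LogRecord a, LogRecord b) {
                    return a.time < b.time ? -1 : (a.time > b.time ? 1 : 0);
                }
            });
            for (int i = 0; i + 1 < records.size(); i++) {
                if (records.get(i + 1).time - records.get(i).time <= SESSION_WINDOW_MS) {
                    pairs.add(new String[] { records.get(i).query, records.get(i + 1).query });
                }
            }
        }
        return pairs;
    }
}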

37

Step 2: validate candidate pairs


Oluolu validates all the query pairs extracted in step 1. In the validation phase (step 2), Oluolu makes use of query readings.

38

Reading of Japanese words


Japanese words can be converted into readings in the Latin alphabet (e.g., konnichiha, itou). FACT: even when a Japanese query with spelling mistakes is totally different from the correct query, the readings are the same or their distance is small!

39

Validate candidate pair with reading


Given a query pair, Oluolu validates the queries in two steps: 1. Convert the queries into readings in the Latin alphabet. 2. Compute the edit distance between the two readings. When the distance is small, the two queries are extracted as an entry of the Did-You-Mean dictionary.
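A sketch of this check in Java, assuming a hypothetical toReading() helper that returns the Latin-alphabet reading of a Japanese query (in practice a reading could be obtained with a morphological analyzer such as Sen or Kuromoji) and the editDistance() method sketched earlier:

// toReading() is a hypothetical helper for illustration, not an Oluolu API.
static boolean isValidDidYouMeanPair(String query1, String query2, int threshold) {
    String reading1 = toReading(query1);   // e.g. "sumitomofudousan"
    String reading2 = toReading(query2);
    // keep the pair only when the readings are the same or very close
    return editDistance(reading1, reading2) <= threshold;
}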

40

Example: step 2
Given a pair of queries (two different Japanese spellings with the same reading): 1. Convert them into readings; the readings are identical, sumitomofudousan. 2. Compute the distance between the readings; the distance is zero, so the pair is extracted as an entry of the Did-You-Mean dictionary.

41

Creating Japanese Did-You-Mean dictionary with Oluolu


Installation requirements: Java 1.6.0 or greater, Hadoop 0.20.0 or greater, Oluolu 0.2.0 or greater. Copy the input query log into HDFS, then run the spellcheck task of Oluolu: $ bin/oluolu spellcheck -input testInput.txt -output output -inputLanguage ja
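Copying the log into HDFS uses the standard Hadoop shell; querylog.txt here is just a placeholder local file name matching the -input path of the command above:

$ hadoop fs -put querylog.txt testInput.txt
$ bin/oluolu spellcheck -input testInput.txt -output output -inputLanguage ja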

42

Preliminary experiments
Experimental settings: the input data is a log file from a mixi service (community search), 5 GB of data. Extracted dictionary: the number of entries is over 100,000; we succeeded in extracting query pairs with large edit distance.

Current status
Finished functional tests and stress tests. Now replacing the in-house search engine of a small search service with Anuenue. In the next phase, we will apply Anuenue to search services with large data and high QPS.

44

Future work
Integrate SolrCloud and ZooKeeper (support failover, and rebalance the index). Kuromoji, a new OSS Japanese tokenizer.

45

Summary
Introduced Anuenue. Described a Did-You-Mean facility for Japanese queries.

46

Thank you for your attention!

47
