Working With The English Lemmatizer


Full-Text Diary

Dec 5, 2013. Working with the English Lemmatizer


If you follow the Full-Text Diary closely, you may have already heard that Sphinx 2.2.1-beta supports English and
German lemmatization. In a previous post we announced support for Russian lemmatization and gave a general
outline of stemming and lemmatization; see that post for background information. Now we're going to show you
some of the differences between English stemming and lemmatization. Read on to learn more.

“Will”
To highlight the differences between English stemming and lemmatization, it's best to use an example. A simple
one is the word 'will'. It has several meanings: it can be a modal verb, a noun ('will' as philosophy uses the
term), and a short form of the name William. On top of that, 'will' is also the root of a number of other words.
So it's a good candidate for our testing.

Let’s run a quick test searching ‘will’ on the same data indexed with the English stemmer and with the English
lemmatizer:
Stemmer:

mysql> select count(*) from wikipedia where match('@title will');


+----------+
| count(*) |
+----------+
|     1202 |
+----------+
1 row in set (0.01 sec)

Lemmatizer:

mysql> select count(*) from wikipedia2 where match('@title will');


+----------+
| count(*) |
+----------+
|     1149 |
+----------+
1 row in set (0.01 sec)

So, the lemmatizer gave us fewer results. Why? Because with the stemmer, a bunch of unrelated terms were
tokenized as 'will'. To see the difference more clearly, we used a simple trick: we added an integer attribute that
holds a number for each index ("1" for the stemmer index, "2" for the lemmatizer index) and then performed
one search across both indexes, making it easy to tell stemmed results from lemmatized ones. For rows common
to both, we get the result from the lemmatized index, since it's last in the index list.
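In sphinx.conf terms, the trick looks roughly like this (a sketch, not the exact configuration used for this test; the database table, field, and source names are placeholders):

```
source wikipedia_src
{
    # ...database connection settings...
    # constant attribute: every row in this index carries idx = 1
    sql_query     = SELECT id, title, content, 1 AS idx FROM articles
    sql_attr_uint = idx
}

source wikipedia2_src : wikipedia_src
{
    # same data, marked with idx = 2 instead
    sql_query     = SELECT id, title, content, 2 AS idx FROM articles
}

index wikipedia
{
    source     = wikipedia_src
    path       = /var/data/wikipedia
    morphology = stem_en
}

index wikipedia2
{
    source     = wikipedia2_src
    path       = /var/data/wikipedia2
    morphology = lemmatize_en
}
```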

mysql> select title,idx from wikipedia,wikipedia2 where match('@title will') order by idx asc limit 40;
+--------------------------------------------+------+
| title                                      | idx  |
+--------------------------------------------+------+
| Willful blindness                          |    1 |
| Ready, Willing, and Able                   |    1 |
| G. Willing Pepper                          |    1 |
| John Willes (cricketer)                    |    1 |
| William S. S. Willes                       |    1 |
| Christine Willes                           |    1 |
| Ready, Willing, and Able (film)            |    1 |
| Mary Willing Byrd                          |    1 |
| Ann Willing Bingham                        |    1 |
| Stout Hearts and Willing Hands             |    1 |
| The Willing Well IV: The Final Cut         |    1 |
| Charles Willing                            |    1 |
| Thomas Willing                             |    1 |
| Clare Wille                                |    1 |
| Edward Willes                              |    1 |
| George M. Willing                          |    1 |
| Oscar F. Willing                           |    1 |
| Willing Suspension Productions             |    1 |
| John Willes (judge)                        |    1 |
| The Wild, the Willing and the Innocent     |    1 |
| Coalition of the Willing (Jericho episode) |    1 |
| Jodi Wille                                 |    1 |
| Coalition of the willing                   |    1 |
| Take the Willing                           |    1 |
| Ready an' Willing                          |    1 |
| Hitler's Willing Executioners              |    1 |
| Hilda M. Willing (skipjack)                |    1 |
| Nick Willing                               |    1 |
| Ava Lowle Willing                          |    1 |
| The Coalition of the Willing (album)       |    1 |
| Will the Shill                             |    2 |
| Hope Will                                  |    2 |
| Helen Wills                                |    2 |
| For You I Will                             |    2 |
| Donald Wills Douglas                       |    2 |
| Will Skinner                               |    2 |
| Will and Perception (album)                |    2 |
| Brush Creek (Wills Creek)                  |    2 |
| Wills Creek                                |    2 |
| Will (sociology)                           |    2 |
+--------------------------------------------+------+
40 rows in set (0.01 sec)

So, with the stemmed index, 'Willing' (which is an adjective as well as a name), 'Wille', and 'Willes' (both
names that have nothing to do with 'will') were returned as results. Note that even the lemmatized index
returns "Helen Wills" and "Wills Creek", which isn't ideal; this is because the lemmatizer produced 'will' as
the lemma of 'wills' (which makes sense).

Another failure of the stemmer: when we search for 'willing' (as in "Coalition of the Willing", "Ready an'
Willing", etc.), the stemmer tokenizes it as 'will', so we also get all the results for 'will'. Although the words
are related, 'will' and 'willing' have different meanings: when we search for 'willing' we usually aren't thinking
of anything related to 'will'. When Schopenhauer talks about 'will', he means something different from
'the willing' (those who have given consent). With the lemmatizer, 'willing' is tokenized separately from 'will'.
Problem solved.

This can be seen clearly when we use a class of words that differ only by their ending.

For example:

operate/operated/operator/operating/operation/operational :

Keyword       Stemmer   Lemmatizer

operate       oper      operate
operated      oper      operate
operator      oper      operator
operating     oper      operate
operation     oper      operation
operational   oper      operational
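To make the contrast concrete, here is a toy sketch of the two approaches. This is not Sphinx's implementation; the suffix list and the lemma dictionary are hand-made assumptions for illustration. The point is the mechanism: a rule-based stemmer strips suffixes, while a lemmatizer looks each surface form up in a dictionary.

```python
# Toy contrast of suffix-stripping (stemming) vs. dictionary lookup
# (lemmatization), reproducing the 'operate' family from the table above.
# NOT Sphinx's code: the suffix list and lemma dictionary are invented
# for this example; real stemmers (e.g. Porter) have many more rules.

SUFFIXES = ["ational", "ation", "ating", "ator", "ated", "ate"]  # longest first

def toy_stem(word: str) -> str:
    """Strip the longest matching suffix, as a rule-based stemmer would."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A lemmatizer consults a dictionary of known word forms instead.
LEMMA_DICT = {
    "operate": "operate",
    "operated": "operate",
    "operating": "operate",
    "operator": "operator",
    "operation": "operation",
    "operational": "operational",
}

def toy_lemmatize(word: str) -> str:
    """Dictionary lookup; fall back to the word itself if unknown."""
    return LEMMA_DICT.get(word, word)

words = ["operate", "operated", "operator", "operating",
         "operation", "operational"]
for w in words:
    # prints a three-column table matching the one above
    print(f"{w:12} {toy_stem(w):8} {toy_lemmatize(w)}")
```

The stemmer column collapses to a single token because every member of the family matches a suffix rule, while the lemmatizer keeps 'operator', 'operation', and 'operational' distinct because its dictionary treats them as separate lemmas.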

Lemmatize_en_all
What about lemmatize_en_all? Without lemmatize_en_all, the lemmatizer picks only one root form of the word,
the one closest to the original. However, sometimes a word form has multiple corresponding root words. For
instance, looking at "dove" alone, it is not possible to tell whether it is the past tense of the verb "dive" (as in
"He dove into a pool.") or the noun "dove" (as in "white dove"). In this case, the lemmatizer can generate all
the possible root forms with lemmatize_en_all. This can increase the number of returned results: with en_all,
'dove' will match both 'dive' and 'dove'. In short, lemmatize_en_all brings more results because it allows
searching across all possible forms.
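Switching this on is a one-line change to the index definition (a sketch; the index name and paths are placeholders):

```
index wikipedia2
{
    source     = wikipedia2_src
    path       = /var/data/wikipedia2
    # keep all candidate root forms, not just the closest one:
    # 'dove' will then match both 'dove' and 'dive'
    morphology = lemmatize_en_all
}
```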
Indexing speed
On the data we used, we got the following times:

Stemmer       Lemmatizer     Lemmatize all

908.088 sec   1212.385 sec   1384.710 sec

Lemmatizing is a bit slower, which is expected, since it needs to do a lookup in a dictionary, while stemming
simply performs some basic string operations. For this test we used lemmatizer_cache = 32M, which is enough
for the English lemmatizer, as its current unpacked size is less than 20M. For other lemmatizers, the cache size
should be adjusted accordingly: the Russian lemmatizer dictionary, for example, is about 110M, so for best
performance you should set a cache large enough to fit the whole dictionary in memory.
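In sphinx.conf, this cache lives in the indexer section; per the Sphinx manual the directive is spelled lemmatizer_cache (the values below are examples, sized as discussed above):

```
indexer
{
    # RAM cache for the unpacked lemmatizer dictionary;
    # 32M fits the English dictionary, use more (e.g. 128M+)
    # if you index with the Russian lemmatizer
    lemmatizer_cache = 32M
}
```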

Search speed
In terms of search speed, there is no difference between stemming and lemmatizing; it really just depends on
how many documents match the tokenized keyword. For example, with stemming, the word 'running' is
tokenized as 'run', which hits 267,196 documents in 16ms. With the lemmatizer, 'running' stays as it is, hitting
only 94,532 documents in 6ms. With lemmatize_en_all, it hits 322,623 documents in 24ms. Choose what works
best for you!

Conclusion
The lemmatizer is one step ahead of stemming. It not only improves relevance, but it can also reduce search
times. Check it out. Download Sphinx 2.2.1-beta now.

Happy Sphinxing!


This entry was posted on Thursday, December 5th, 2013 at 16:58 and is filed under General.

2 Responses to “Working with the English Lemmatizer”

1. conchis says:
December 9, 2013 at 18:58

Happy to see some basic benchmark results for a new feature.

2. Marc says:
December 12, 2013 at 14:14

Can I hard-code some common acronyms that searchers might use to be replaced with terms that are
indexed? For instance, users might commonly type "D2" or "Div 2", but it needs to actually be "Division
II", which is what is in the database. Can I custom-map common phrases to their actual meaning?


Copyright © 2001-2014, Sphinx Technologies Inc.

