Getting Started with LucidWorks Enterprise
LucidWorks Enterprise
A Lucid Imagination
Technical White Paper
© 2010 by Lucid Imagination, Inc. under the terms of Creative Commons license, as detailed at
http://www.lucidimagination.com/Copyrights-and-Disclaimers/. Version 1.5, published 7 October 2010. Solr,
Lucene, and their logos are trademarks of the Apache Software Foundation.
You will learn about installation, indexing content from local files, web sites, and databases,
and searching, as well as improving the user experience using features such as user alerts,
auto-complete and spell-check.
Installation
At the time of this writing, LucidWorks Enterprise is available as a limited distribution
Developer Access Release, from http://www.lucidimagination.com/lucidworks-enterprise.
Once you download the software, you’re ready to run the installer.
LucidWorks Enterprise provides both a graphical and a command-line option for
installation. Both provide the same options, but the command-line version can be used for
systems where a graphical user interface isn’t an option.
Note that this document is meant to be a guide, and not a comprehensive look at
installation. If you need more details, please search the downloaded documentation for
“Installation”.
The default is to install and activate all components on the local server using ports
8888 and 8989, which is fine for getting started. If this configuration conflicts with
existing applications, feel free to change the port numbers. Keep in mind, however,
that when they are installed during the same session, the SearchUI, AdminUI, and
Alerts need to use the same port.
If you are installing components on different machines or ports, be sure to alter the
addresses to point to the appropriate server and port.
1) Make sure that you have Java 1.6 (JDK or JRE) installed on your machine. You can
download Java from
http://www.oracle.com/technetwork/java/javase/downloads/index.html.
2) In the directory in which the *.jar file is located, execute the following command:
java -jar lucidworks-enterprise-installer.jar --console
If everything installed and started properly, you will see the generic search page:
Of course, at this point, you don’t actually have any data indexed, so there’s no point
searching. To index data, you will need to log into the admin UI, so click the “login” link at
the top of the page and log in using:
Username: admin
Password: admin
(This user is pre-installed; before going to production, you will want to use either the User
API or an LDAP directory to manage your users.)
Logging in will bring you to the Quick Start page, where you can index content. Just to have
something to search against, click Local Filesystem and enter the path for a small-ish
directory of documents.
Finally, we have data to search! Click the Search tab to go to the search page.
Basic searching
On the surface, searching is pretty simple: enter a keyword, press “search”, and get results.
It really is that simple, but it’s also powerful, because you can do much more than a plain
keyword search. In this section, we’ll discuss how to get more out of your searches.
In fact, LucidWorks Enterprise provides a default AND operator unless you specify
something else, such as
indexing OR delete
I’d get no results, because that information isn’t stored in the default search field. Instead, I
could search for
mimeType:application/pdf
This tells LucidWorks Enterprise to search for documents that have a value of
“application/pdf” in the mimeType field.
Range queries
You also have the option to search for a range of values. For example, I can find all of the
documents in my index that have 50 pages or fewer with:
pageCount:[0 TO 50]
Or if I wanted to be even more specific, I could find all PDF files in that range:
mimeType:application/pdf AND pageCount:[0 TO 50]
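If you assemble queries like these in application code, a few small helpers keep the syntax consistent. The following is a minimal sketch rather than part of LucidWorks Enterprise: the function names are hypothetical, and it assumes the values contain no characters that would need escaping in the query syntax.

```python
def field_query(field, value):
    """Build a field-qualified term such as mimeType:application/pdf."""
    return f"{field}:{value}"


def range_query(field, low, high):
    """Build an inclusive range clause such as pageCount:[0 TO 50]."""
    return f"{field}:[{low} TO {high}]"


def all_of(*clauses):
    """Join clauses with an explicit AND so that every clause must match."""
    return " AND ".join(clauses)


# Rebuild the combined query shown above from its two parts.
query = all_of(
    field_query("mimeType", "application/pdf"),
    range_query("pageCount", 0, 50),
)
print(query)
```

The resulting string is the same text you would type into the search box.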
One place you often find this kind of query is in faceted searches.
Faceted Searching
The use of facets is one of the great recent advances in searching. Search facets enable
users to “narrow down” their search by a variety of factors. For example, if we go back to
the original keyword search for “lucid”, you can see a number of different options down the
right-hand side of the page:
Understanding Fields
The first thing we’ll need to do before doing any indexing is understand just how
LucidWorks Enterprise looks at the data we’re putting into it.
Each document is made up of one or more fields. You can see a list of existing fields by
clicking the Index tab, and the Fields subtab in the administration user interface.
Facet: This property determines whether this field shows up as an available filter
on the search page. Note that documents without this field won’t show up in the
counts for this facet.
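To see why documents without the field are missing from the counts, it can help to picture facet counting as a simple tally. The sketch below is only an illustration of the idea, not how LucidWorks Enterprise computes facets internally; the sample documents and the facet_counts helper are made up for the example.

```python
from collections import Counter


def facet_counts(documents, field):
    """Tally how many documents carry each value of the given field."""
    counts = Counter()
    for doc in documents:
        if field not in doc:
            # Documents without the field contribute to no bucket at all.
            continue
        values = doc[field]
        if not isinstance(values, list):
            values = [values]
        counts.update(values)
    return counts


# Hypothetical documents; the last one has no mimeType field.
docs = [
    {"id": "1", "mimeType": "application/pdf"},
    {"id": "2", "mimeType": "text/html"},
    {"id": "3", "mimeType": "application/pdf"},
    {"id": "4"},
]
counts = facet_counts(docs, "mimeType")
for value, count in counts.most_common():
    print(value, count)
```

Four documents are in the sample, but the counts only add up to three, because the fourth has nothing to be counted under.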
Enter a user-friendly name for your new data source and the full directory path in which
the documents are stored. You have the option to drill down into subdirectories or not, as
well as to follow symbolic links or not.
(Note that for security reasons, data indexed from the local filesystem will not
automatically be available via a link from the search results. Search the product
documentation for “linking” for information on how to configure LucidWorks Enterprise to
activate those links.)
Click Create to create the data source.
The Name should be something you’ll recognize later, and the URL is the value at which you
want the crawler to start.
The Allow Paths and Disallow Paths values enable you to control where the crawler goes.
For example, if I were to index my own site, as I’m doing here, I might want only my own
content, so I’ve specified only paths that start with my URL, using a regular expression to
specify the rest of the path. Similarly, I might not want to index my administration pages.
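The interplay between the two lists can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the crawler's actual logic: it assumes a pattern must match from the start of the URL, that a disallow match always wins, and it uses the placeholder site www.example.com.

```python
import re


def should_crawl(url, allow_patterns, disallow_patterns):
    """Return True if the URL matches an allow pattern and no disallow pattern."""
    # Assumption: a disallow match takes precedence over any allow match.
    if any(re.match(pattern, url) for pattern in disallow_patterns):
        return False
    return any(re.match(pattern, url) for pattern in allow_patterns)


# Hypothetical patterns: keep to one site, but skip its administration pages.
allow = [r"http://www\.example\.com/.*"]
disallow = [r"http://www\.example\.com/admin/.*"]

for url in (
    "http://www.example.com/blog/post-1",
    "http://www.example.com/admin/settings",
    "http://www.other-site.com/page",
):
    print(url, should_crawl(url, allow, disallow))
```

Testing your patterns against a handful of known URLs like this, before starting a crawl, is a cheap way to catch an unescaped dot or a missing trailing wildcard.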
Once you’ve accomplished these two steps, you’re ready to create the new data source.
Click the Index tab, then the Sources subtab and the DB sub-subtab.
Enter the JDBC driver name. This is the actual class name you would use in a Java
application, such as com.mysql.jdbc.Driver. Enter the username and password for the
database.
Finally, enter the query used to extract the data from the database, mapping each column to
a LucidWorks Enterprise field. For example:
select id as id, prod_name as name, price_pt as price from products
In all likelihood, you will have information about a single item, or “document”, in several
tables; to add it to your index, create the appropriate joins to add all of your data as a single
query.
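One way to check such a join before pasting it into the data source form is to run it against a scratch database. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the makers table, the maker_id column, and the manufacturer field are hypothetical, invented only to show a second table being folded into one row per document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by their aliased names

# Hypothetical schema: product details split across two tables.
conn.executescript("""
    CREATE TABLE makers (id INTEGER PRIMARY KEY, maker_name TEXT);
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        prod_name TEXT,
        price_pt REAL,
        maker_id INTEGER REFERENCES makers(id)
    );
    INSERT INTO makers VALUES (1, 'Dokad'), (2, 'Kinok');
    INSERT INTO products VALUES (3, 'SPE 3299 Printer', 99, 1),
                                (4, 'UltraCam II', 550, 2);
""")

# Each column alias is the name of the field the value should land in.
query = """
    SELECT p.id AS id,
           p.prod_name AS name,
           p.price_pt AS price,
           m.maker_name AS manufacturer
    FROM products p
    JOIN makers m ON m.id = p.maker_id
    ORDER BY p.id
"""
rows = [dict(row) for row in conn.execute(query)]
for row in rows:
    print(row)
```

Each printed dictionary corresponds to one document: its keys are the aliases from the query, which is the mapping the data source relies on.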
One final method of indexing data involves adding it directly using Solr’s native document
format. To do that, you will need a Solr document, which is just an XML document, such as:
<add>
<doc>
<field name='id'>prod3_0</field>
<field name='data_source'>Auxiliary Data</field>
<field name='itemId_s'>3</field>
<field name='itemType_s'>product</field>
<field name='cat'>Printers</field>
<field name='name_t'>Dokad SPE 3299 Printer</field>
<field name='price_td'>99</field>
<field name='blurb_t'>A great printer that doesn't use a lot of ink.</field>
<field name='description_t'>Sed ut perspiciatis ...</field>
<field name='text'>Dokad SPE 3299 Printer99A great printer that doesn't use a
lot of ink.Sed ut ...</field>
</doc>
<doc>
<field name='id'>prod4_0</field>
<field name='data_source'>Auxiliary Data</field>
<field name='itemId_s'>4</field>
<field name='itemType_s'>product</field>
<field name='cat'>Cameras</field>
<field name='cat'>Accessories</field>
<field name='name_t'>Kinok UltraCam II</field>
<field name='price_td'>550</field>
<field name='blurb_t'>Boy, the Kinok UltraCam II is a great camera, and it hooks
up to your printer terrifically.</field>
<field name='description_t'>Lorem ipsum dolor sit amet, consectetur adipisicing
elit, ...</field>
<field name='text'>Kinok UltraCam II550Boy, the Kinok UltraCam II is a great
camera, and it hooks up to your printer terrifically.Lorem ipsum dolor sit amet,
...</field>
</doc>
</add>
To index this type of document, click the Index tab, then the Sources subtab and the Solr
sub-subtab.
Enter a recognizable name, as well as the path to the actual document. Note that we’ve
added a data_source field that matches what we’ve named the data source. This is because
Solr documents don’t automatically have this field populated, as the other types do, so if we
want that information to show up in the search facets, we need to provide it ourselves.
Click Create to create the data source.
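If your data lives in another system, you will probably generate this XML rather than write it by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the build_add_xml helper is hypothetical, and it assumes each document is a dictionary whose list values become repeated fields. It adds the data_source field to every document for the reason described above.

```python
import xml.etree.ElementTree as ET


def build_add_xml(documents, data_source):
    """Serialize dictionaries into a Solr <add> document."""
    add = ET.Element("add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        # Solr documents do not get this field automatically, so set it here.
        ET.SubElement(doc, "field", name="data_source").text = data_source
        for name, value in fields.items():
            # A list value becomes several <field> elements with the same name.
            values = value if isinstance(value, list) else [value]
            for item in values:
                ET.SubElement(doc, "field", name=name).text = str(item)
    return ET.tostring(add, encoding="unicode")


xml_text = build_add_xml(
    [
        {
            "id": "prod4_0",
            "cat": ["Cameras", "Accessories"],
            "name_t": "Kinok UltraCam II",
            "price_td": 550,
        }
    ],
    data_source="Auxiliary Data",
)
print(xml_text)
```

Building the tree with a library instead of string concatenation means special characters such as & and < in your data are escaped for you.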
Note that creating a data source does not start the indexer; we’ll look at that next, under
Scheduling Tasks.
Scheduling Tasks
We’re finally ready to look at scheduling the indexing of our data sources. Under the Index
tab, click the Schedules subtab.
This process is the opposite of the Delete button: it gets rid of the indexed data but not the
data sources, so you’re ready to either reindex them or delete them and start over.
Assuming that you haven’t started over, we’re ready to look at enhancing your users’
search experience.
User Alerts
When you’re dealing with huge amounts of data, one of the biggest challenges is keeping up
with it as it grows. One way to do that is to use user alerts, which notify you when new
content matching your queries has been added to the system.
To set up a user alert for a query, click the “Add this query as alert” link under the search
box.
This link takes you to a page where you can specify the details of where you’d like to
receive the alert.
Auto-complete
One of the best ways to make sure that users don’t wind up with “no results” is to guide
them towards terms that actually exist within the index. And one of the best ways to do
that is using auto-complete functionality.
Auto-complete looks at the characters the user has already entered and offers terms that
start with those characters.
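The underlying idea can be sketched as a prefix lookup over a sorted list of indexed terms. This is an illustration of the concept only, not the LucidWorks Enterprise implementation; the suggest function and the sample terms are assumptions made for the example.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms from a sorted list that start with `prefix`."""
    # In a sorted list, every term sharing the prefix sits in one contiguous
    # run, and bisect_left finds where that run would begin.
    start = bisect.bisect_left(sorted_terms, prefix)
    suggestions = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        suggestions.append(term)
        if len(suggestions) == limit:
            break
    return suggestions


# Hypothetical terms taken from an index.
terms = sorted(["index", "indexing", "install", "lucene", "lucid", "solr"])
print(suggest(terms, "ind"))
print(suggest(terms, "luc"))
```

Because the suggestions come from the index itself, every term offered is one that will return at least one result.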
Specifying Fields
To specify one or more fields for these functions, click the Index tab and the Fields subtab.
Highlight the relevant field.
Indexing
Finally, make sure that the spell-checking and/or auto-complete information has been
indexed. To do that, click the Index tab and the Schedules subtab. Click the icon next to
Activities to expand it and highlight spelling or auto-complete. Schedule these indexes just
as you would schedule your data sources.
Now that we’ve got good queries, it’s time to make sure they return good results.
Improving Relevance
One of the advantages of using a search platform such as LucidWorks Enterprise is that
results can be ranked by relevance, with the results most likely to be what the user is
looking for at the top of the list. Out of the box, LucidWorks Enterprise does a pretty good
Synonyms
One way to provide better results is to provide LucidWorks Enterprise with groupings of
words that have the same (or at least similar) meanings. For example, a search for “lawyer”
should probably also find documents that only contain “attorney”. Most industries and
subject areas have their own set of jargon and synonyms, and you can configure them
directly from within the administration user interface.
Click the Queries tab, and then the Settings subtab. Highlight Synonyms and Stopwords,
and then expand the Synonyms entry.
From here, you can add new entries or remove existing entries. Each line is considered a
group; you can add as many comma-delimited terms as you like.
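To make the "one group per line" format concrete, here is a small sketch of how such a list could be read and applied. It is not the product's own code: the function names are hypothetical, matching is assumed to be case-insensitive, and the second group is invented sample data.

```python
def parse_synonym_groups(text):
    """Turn comma-delimited lines into lists of equivalent terms."""
    groups = []
    for line in text.splitlines():
        terms = [term.strip().lower() for term in line.split(",") if term.strip()]
        if terms:
            groups.append(terms)
    return groups


def expand(term, groups):
    """Return every term in the same group, or just the term itself."""
    term = term.lower()
    for group in groups:
        if term in group:
            return list(group)
    return [term]


# Each line is one group, as in the Synonyms entry.
groups = parse_synonym_groups("lawyer, attorney\ncar, automobile, auto\n")
print(expand("lawyer", groups))
print(expand("printer", groups))
```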
Stopwords
In search parlance, a “stopword” is a word that’s so common that adding it to a query rarely
increases the quality of results, and frequently decreases it. For example, if you did a
search for “the City of Chicago”, “Chicago” would certainly provide good results. “City”
might as well. But how many billions of documents that have nothing to do with Chicago
contain the words “the” and “of”?
Fortunately, LucidWorks Enterprise understands the concept of stopwords, and in most
cases, will eliminate them from your query. It also understands how to handle stopwords
on the back end so that they help improve relevance (for example, by judging the proximity
of two words) rather than hinder it.
LucidWorks Enterprise starts with a list of several dozen stopwords, such as “a”, “and”,
“for”, and so on. You may find, however, that you need to add your own. To do that, click
the Queries tab and the Settings subtab. Highlight Synonyms and Stopwords, and expand
the Stopwords entry.
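The effect of a stopword list on a query can be sketched as a simple filter. This is a simplification for illustration: as noted above, LucidWorks Enterprise does more with stopwords than drop them, and the short list and strip_stopwords helper here are assumptions, not the shipped defaults.

```python
# A tiny sample list; the real default list contains several dozen entries.
STOPWORDS = {"a", "and", "for", "of", "the"}


def strip_stopwords(query, stopwords=STOPWORDS):
    """Remove stopwords from a whitespace-separated query, ignoring case."""
    kept = [word for word in query.split() if word.lower() not in stopwords]
    return " ".join(kept)


print(strip_stopwords("the City of Chicago"))
```

Only the words that actually distinguish one document from another are left to drive the search.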
Click Scoring
Perhaps the best way for LucidWorks Enterprise to know whether a result is really
relevant for a particular search query is to keep track of whether a human thinks it is. Click
scoring makes that happen.
When you enable click scoring, LucidWorks Enterprise tracks which results are most often
clicked for a particular query, and “boosts” their relevance scores accordingly. It will then
be more likely to present those results higher in the list for that query.
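The idea can be sketched as a click log that multiplies each result's score by a per-query boost. This is purely illustrative and is not the Click Scoring Relevance Framework: the ClickLog class, the linear boost formula, and the 0.1 weight are all assumptions chosen to keep the example short.

```python
from collections import defaultdict


class ClickLog:
    """Counts how often each document was clicked for each query."""

    def __init__(self):
        self._clicks = defaultdict(lambda: defaultdict(int))

    def record(self, query, doc_id):
        self._clicks[query.lower()][doc_id] += 1

    def boost(self, query, doc_id, weight=0.1):
        # Assumed formula: each recorded click adds `weight` to a base of 1.0.
        clicks = self._clicks.get(query.lower(), {}).get(doc_id, 0)
        return 1.0 + weight * clicks


def rerank(query, scored_results, log):
    """Apply click boosts to (doc_id, score) pairs and sort best-first."""
    boosted = [
        (doc_id, score * log.boost(query, doc_id))
        for doc_id, score in scored_results
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


log = ClickLog()
for _ in range(3):
    log.record("printer", "doc_b")  # users keep choosing doc_b for "printer"

results = [("doc_a", 1.0), ("doc_b", 0.9)]
print(rerank("printer", results, log))
print(rerank("camera", results, log))
```

In this sketch the boost is tied to the query text, so clicks recorded for "printer" leave the ordering for "camera" untouched.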
Using click scoring requires manual configuration of LucidWorks Enterprise. For
information on how to set it up, search for “Click Scoring Relevance Framework” in the
LucidWorks Enterprise documentation.
Next Steps
For more information on how Lucid Imagination can help search application developers,
employees, customers, and partners find the information they need, please visit
www.lucidimagination.com to access blog posts, articles, and reviews of dozens of
successful implementations.
Please e-mail specific questions to:
Support and Service: support@lucidimagination.com
Sales and Commercial: sales@lucidimagination.com
Consulting: consulting@lucidimagination.com
Or call: 1.650.353.4057