9ib TEXT
Lexer
Unlike English and most European languages, Asian languages such as Chinese, Japanese, and
Korean have no explicit word boundaries, so they need separate lexers. Previous versions of
Oracle Text provide the Japanese V-Gram Lexer, which uses the V-Gram algorithm to generate
variable-length Japanese tokens for indexing and querying. It works well for Japanese full-text
retrieval. However, the tokens it generates are not natural words (morphemes, in morphological
terms), so it cannot support theme queries and other advanced Oracle Text features. The
Japanese Lexer uses a different algorithm to parse Japanese text and generate word-based
Japanese tokens. This reduces index size and improves query performance. It can also serve as
the foundation for advanced Oracle Text features.
Functionality
The Japanese Lexer parses a Japanese text column and generates Japanese word tokens for
indexing and querying. It enhances Oracle Text’s Japanese full-text retrieval capability and lays
the foundation for advanced Japanese features.
Ambiguity is one of the major problems in generating word tokens for Asian languages such as
Chinese, Japanese, and Korean (CJK). Because CJK text has no explicit word boundaries and
one CJK word may be a substring of another, it is easy to generate a wrong token unless the
text is handled carefully. The Chinese Lexer was created to solve this problem.
Because the architecture of the Chinese Lexer is extensible, the Japanese Lexer is implemented
on top of it. In addition, the Japanese Lexer can handle Japanese-specific characters
(Hiragana and Katakana).
The current Japanese V-Gram Lexer supports the Japanese EUC, Shift JIS, and UTF8 character
sets. The Japanese Lexer supports the same character sets.
• Example:
BEGIN
  CTX_DDL.CREATE_PREFERENCE (
    'JAPANESE_LEXER',
    'JAPANESE_LEXER');
END;
• Improves performance:
– Smaller index size improves query performance
– Index build time longer than current lexer
• Migration
– Drop index with Japanese V-gram lexer
– Create index with Japanese lexer
PL/SQL Statements
A new lexer, JAPANESE_LEXER, is introduced. There is no new attribute introduced for Japanese
Lexer.
Performance Attributes
Performance is an important consideration for this new feature. Indexing time, query response
time, and index size are three of the most important performance attributes to minimize in
Oracle Text.
• The Japanese Lexer generates a smaller index than the Japanese V-Gram Lexer.
• Compared with the Japanese V-Gram Lexer, the Japanese Lexer shortens query response
time.
• The Japanese Lexer uses a more complicated algorithm to generate tokens, so indexing
takes longer than with the Japanese V-Gram Lexer.
• The Japanese Lexer generates real word tokens, so compared with the Japanese V-Gram
Lexer it gives more precise tokens and query results.
Performance Tuning
Systems that need shorter indexing time, and for which index size and query time are not
critical, should use the Japanese V-Gram Lexer instead of the Japanese Lexer.
Migration
The Japanese Lexer is not compatible with the Japanese V-Gram Lexer. An index created with the
Japanese V-Gram Lexer cannot be used with the Japanese Lexer. If you have used the Japanese
V-Gram Lexer and want to upgrade to the Japanese Lexer, do the following:
1. Drop the index created with the Japanese V-Gram Lexer.
2. Create an index with the Japanese Lexer.
Oracle Text New Features 1-6
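The two migration steps might look like the following sketch. The table docs, its column text, the index name docs_idx, and the preference name my_japanese_lexer are hypothetical names chosen for illustration; only the JAPANESE_LEXER object comes from the text above.

```sql
-- Step 1: drop the index that was built with the Japanese V-Gram Lexer.
-- (docs_idx is a hypothetical index name.)
DROP INDEX docs_idx;

-- Step 2: create a JAPANESE_LEXER preference and rebuild the index with it.
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_japanese_lexer', 'JAPANESE_LEXER');
END;
/

CREATE INDEX docs_idx
  ON docs(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer my_japanese_lexer');
```

Because the index is rebuilt from scratch, expect the longer indexing time described under Performance Attributes.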
Korean Morphological Lexer
Dictionary File
System $ORACLE_HOME/ctx/data/kolx/drk2sdic.dat
Grammar $ORACLE_HOME/ctx/data/kolx/drk2gram.dat
Stopword $ORACLE_HOME/ctx/data/kolx/drk2xdic.dat
User-defined $ORACLE_HOME/ctx/data/kolx/drk2udic.dat
Text Format
The grammar, user-defined, and stopword dictionaries are text files encoded in KSC 5601. You
can modify these dictionaries using the defined rules. The system dictionary must not be
modified. You can add unregistered words to the user-defined dictionary file. The rules for
specifying new words are in the file.
Composite Indexing
For your language, you can create a user dictionary to customize how words are decomposed.
Create the user dictionary in the $ORACLE_HOME/ctx/data/<language id> directory. The
user dictionary must have the suffix .dct.
The format for the user dictionary is as follows:
input term <tab> output term
The individual parts of the decomposed word must be separated by the # character. If composite
indexing is enabled, then the composite word and its components are all indexed.
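As an illustration only, the sketch below shows one made-up German dictionary entry as a comment, followed by a lexer preference with composite indexing switched on. It assumes that composite indexing is enabled through the BASIC_LEXER composite attribute; the preference name my_german_lexer and the dictionary entry are hypothetical.

```sql
-- A user dictionary line has the form: input term <tab> output term, e.g.
--   Hauptbahnhof<tab>Haupt#Bahnhof      (illustrative entry, not from the source)

-- Assumption: composite indexing is enabled with the BASIC_LEXER
-- "composite" attribute. my_german_lexer is a hypothetical preference name.
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_german_lexer', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('my_german_lexer', 'composite', 'GERMAN');
END;
/
```

With such an entry, the composite word and both of its components would be indexed, as described above.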
Document Classification
A document classification application is one that classifies an incoming stream of documents
based on their content. Such applications are also known as document routing or filtering
applications.
Consider the following scenarios:
• An online news agency might need to classify its incoming stream of articles as they
arrive into categories such as sports, crime, and technology.
• A brokerage firm receives earnings reports from a news wire service. It would like to
email these reports to its traders as they arrive. Each trader is interested only in certain
companies or sectors, so the reports must be routed by the textual content. For example,
if a report about Oracle arrives, it should be emailed to the software analysts, whereas a
report about PG&E might go to the CPG analysts.
• The technical support representatives of a company support several different products.
The support center has a single e-mail address for ease of use. Each e-mail message
must be classified as it arrives and forwarded to the specific support group which has
the expertise for that product.
All three scenarios can use content-based classification.
Unlike a traditional document retrieval system, which works on a large corpus of documents,
classification operates on a stream of documents, analyzing and classifying each document in
turn. The classification is generally done using the customer’s set of rules or queries.
CTXRULE Indexes
Oracle Text enables you to build these applications with the CTXRULE index type. This index
type essentially indexes the rules (queries) that define each class.
When documents arrive, the MATCHES operator can be used to match each document with the
rules that select it.
Oracle Text supports document classification for only plain text, XML, and HTML documents.
Documents in binary formats are not supported.
CREATE TABLE ad_routing (
  rule_id NUMBER PRIMARY KEY,
  mail_id VARCHAR2(30),
  rule    VARCHAR2(2000) );
CREATE INDEX ad_routing_idx
  ON ad_routing(rule)
  INDEXTYPE IS ctxsys.ctxrule
  PARAMETERS (
    'lexer my_basic_lexer
     wordlist my_wordlist' );
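The statements above refer to two preferences and assume that the rule table already holds rules. A minimal sketch of that setup could look like this; the preference types and the sample rules are assumptions made for illustration and do not come from the slide.

```sql
-- Hypothetical definitions for the preferences named in the PARAMETERS
-- clause; create them before running CREATE INDEX.
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_basic_lexer', 'BASIC_LEXER');
  CTX_DDL.CREATE_PREFERENCE('my_wordlist', 'BASIC_WORDLIST');
END;
/

-- Illustrative rules: each row maps a query to the mail ID to notify.
INSERT INTO ad_routing VALUES (1, 'BERNST', 'printer AND laser');
INSERT INTO ad_routing VALUES (2, 'BERNST', 'monitor OR display');
COMMIT;
```

Loading the rules before the index is created means they are indexed when the CTXRULE index is built.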
Class Description
SQL> SELECT DISTINCT ar.mail_id
       FROM ad_routing ar,
            print_media pm
      WHERE pm.ad_id = 1
        AND MATCHES(ar.rule,
                    pm.ad_finaltext) > 0;
MAIL_ID
-------------
BERNST
MATCHES Function
Use this function to find all rows in a table that match a given document. This function:
• Requires that document be plain text, HTML, or XML
• Requires a CTXRULE index for the column being used in the function
• Returns a number that indicates whether the document matches the query rule: It is 0 for
FALSE or 1 for TRUE
MATCHES Syntax
MATCHES (
[table.]column VARCHAR2,
document VARCHAR2 or CLOB)
RETURN NUMBER;
where
[table.]column is the column containing the indexed query set
document is the document to be classified
Slide Example
The table PRINT_MEDIA has the columns:
• AD_ID is the key that determines which advertisement to match
• AD_FINALTEXT contains the advertisement
The query on the slide lists the employee mail IDs that match the advertisement in the row with
AD_ID = 1. The DISTINCT clause removes duplicate mail IDs. In the example, the
advertisement matches one or more of the rules associated with the user BERNST.
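The document argument does not have to come from a table column. As a sketch, the same rule table can classify a document passed as a string; the document text here is made up for illustration.

```sql
-- Classify an ad-hoc document supplied as a string literal.
-- A row is returned for every rule that matches the document text.
SELECT DISTINCT mail_id
  FROM ad_routing
 WHERE MATCHES(rule, 'New laser printer on sale this week') > 0;
```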
Guidelines
INPATH Description
Use this operator to do path searching in XML documents. This operator is like the WITHIN
operator, except that the right-hand side is a path enclosed in parentheses rather than a
single section name.
Tags and attribute names in path searching are case-sensitive.
INPATH Description
The components of the INPATH operation include the following:
• term is the search string, for example: Mozart
• INPATH is the operator name; it is a constant
• path is the section path. The syntax allows the developer to specify the following path
search properties:
– A document level that indicates whether the path starts at the document level or at
any level
– A wildcard for a single level in a path
– A wildcard for multiple levels in a path
– Boolean operations to negate and combine path properties
• search is an optional clause that provides additional search criteria. It can have one of
three formats:
– path is a section path, as just described
– tag="value" tests for a tag equal to value
– @attribute="value" tests for an attribute equal to value
The brackets ([]) are required when the search attribute clause is included.
The equality operator (=) can also be an inequality operator (!=). These are the only two
valid operations.
The tag or attribute name must be the left operand, and the literal must be the right
operand.
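To show where an INPATH expression sits in a full query, here is a minimal sketch. The table music_docs, the index, and the section group name are hypothetical, and the sketch assumes that path searching requires an index built with the PATH_SECTION_GROUP section group type.

```sql
-- Hypothetical table holding XML documents.
CREATE TABLE music_docs (
  id  NUMBER PRIMARY KEY,
  doc CLOB
);

-- Assumption: INPATH needs a PATH_SECTION_GROUP section group.
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('my_path_group', 'PATH_SECTION_GROUP');
END;
/

CREATE INDEX music_docs_idx
  ON music_docs(doc)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('section group my_path_group');

-- term = Mozart, path = //music, search = [@composer = "Bach"]
SELECT id
  FROM music_docs
 WHERE CONTAINS(doc, 'Mozart INPATH (//music[@composer = "Bach"])') > 0;
```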
dog INPATH (A[@B = "pot of gold"])
<A B="POT OF GOLD">dog</A>
<A B="pot of gold">dog</A>
<A B="POT BLACK GOLD">dog</A>
<A B="POT_OF_GOLD">dog</A>
Attribute Searching
To find all documents where Mozart appears in the composer attribute of a music element
at any level, use the clause from the first code box on the slide:
Mozart INPATH (//music/@composer)
Attributes must be bound to a direct parent.
To perform a similar search with the music element at the top level, use one of the following
clauses:
Mozart INPATH (music/@composer)
Mozart INPATH (/music/@composer)
Attribute Existence Testing
To find all documents where Mozart appears in a top-level music element which has a
composer attribute, use the following clause from the second code box on the slide:
Mozart INPATH (music[@composer])
Attribute Value Testing
To find all documents where Mozart appears in a top-level music element which has a
composer attribute whose value is Bach, use the following clause from the last code box on
the slide:
Mozart INPATH (music[@composer = "Bach"])
You can also search for composers that are not Bach, using this clause:
Mozart INPATH (music[@composer != "Bach"])
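Each of these clauses is passed as the query string of a CONTAINS call. The sketch below assumes a hypothetical table music_docs with an XML column doc that has a path-enabled Oracle Text index.

```sql
-- Attribute search: Mozart in the composer attribute of music at any level.
SELECT id
  FROM music_docs
 WHERE CONTAINS(doc, 'Mozart INPATH (//music/@composer)') > 0;

-- Existence test: Mozart in a top-level music element that has a
-- composer attribute.
SELECT id
  FROM music_docs
 WHERE CONTAINS(doc, 'Mozart INPATH (music[@composer])') > 0;

-- Value test: the composer attribute must not be equal to Bach.
SELECT id
  FROM music_docs
 WHERE CONTAINS(doc, 'Mozart INPATH (music[@composer != "Bach"])') > 0;
```

The double quotes around the attribute value sit inside the single-quoted SQL string, so they need no escaping.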
• Mozart in music//opera/composer
(Mozart INPATH (//opera/composer/)) INPATH (music)
• Nested INPATH operators are independent of each other
Complex Searches
The various types of INPATH clauses can be combined to form more complex searches.
Combining Path and Node Tests
The first clause in the slide returns documents where:
• Mozart appears in the composer element with a music parent
• The type attribute of the music parent is equal to opera
Nested INPATH
You can nest the entire INPATH expression in another INPATH expression as follows:
(Mozart INPATH (//opera/composer/)) INPATH (music)
is equivalent to:
Mozart INPATH (music//opera/composer/)
When you nest INPATH operators, the two paths are completely independent. The outer
INPATH path does not change the context node of the inner INPATH path. For example:
(Mozart INPATH (music)) INPATH (composer)
never finds any documents, because the inner INPATH looks for Mozart within the top-level
tag music, and the outer INPATH restricts the result to documents with the top-level tag
composer. A document can have only one top-level tag, so the two conditions cannot both
hold.
… HASPATH(music/opera/composer)…
… HASPATH(music="Mozart")…
HASPATH Syntax
HASPATH(path) searches an XML document set and returns a score of 100 for all documents
where path exists. Separate the parent and child paths with the / character. For example, you
can specify music/opera/composer.
HASPATH(element="value") searches an XML document set and returns a score of 100
for all documents that have an element with content equal to value.
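Both forms are used as the query string of a CONTAINS call. The following sketch assumes a hypothetical table music_docs with an XML column doc that has a path-enabled Oracle Text index.

```sql
-- Path existence: documents that contain music/opera/composer.
-- SCORE(1) returns the score of the CONTAINS call labeled 1.
SELECT id, SCORE(1)
  FROM music_docs
 WHERE CONTAINS(doc, 'HASPATH(music/opera/composer)', 1) > 0;

-- Element equality: documents whose music element content equals Mozart.
SELECT id
  FROM music_docs
 WHERE CONTAINS(doc, 'HASPATH(music="Mozart")') > 0;
```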
HASPATH Examples
The query
HASPATH(music/opera/composer)
finds and returns a score of 100 for the document
<music><opera><composer>Mozart</composer></opera></music>
without the query having to reference Mozart at all.
The query
Mozart INPATH music
finds
<music>Mozart</music>
but it also finds
<music>Mozart's Magic Flute</music>