9ib TEXT


Lexer, Filter, and Theme Enhancements

Oracle Text New Features 1-1


Objectives

After this lesson, you should be able to:


• Create themes in multiple languages
• Use the new Japanese morphological lexer
• Use the Korean morphological lexer
• Detect big-endian or little-endian data for the UTF-16
character set
• Use a composite indexing user dictionary in different
languages



Themes in Multiple Languages

• Theme-based queries:
– Use a knowledge base to determine the theme
– Rely on a knowledge base that is built from a thesaurus
– Are written with the ABOUT operator
– Run against a theme index
• In Oracle9i, Oracle Text supports themes in any
single-byte, whitespace-delimited language

Theme Based Queries


Prior to Oracle9i, Oracle Text supported theme-based functionality only in English and French,
the languages for which built-in knowledge bases are available. Theme-based features include
ABOUT queries against a theme index; the gist, theme, and theme highlighting document services;
and hierarchical query feedback.
Extensive knowledge bases are essential for extracting themes from a document. This feature
enables users to compile a thesaurus in any single-byte, whitespace-delimited language into an
Oracle Text knowledge base so that themes can be generated in that language.
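As a brief illustration of a theme-based query, the sketch below builds a theme-enabled index and runs an ABOUT query against it. The table doc, its id and text columns, the index name doc_theme_idx, and the preference name my_theme_lexer are assumptions made for this example; index_themes is the BASIC_LEXER attribute that controls theme indexing.

```sql
-- Hypothetical names: doc, id, text, doc_theme_idx, my_theme_lexer.
BEGIN
  ctx_ddl.create_preference('my_theme_lexer', 'BASIC_LEXER');
  -- Index themes in addition to word tokens.
  ctx_ddl.set_attribute('my_theme_lexer', 'index_themes', 'YES');
END;
/

CREATE INDEX doc_theme_idx ON doc (text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer my_theme_lexer');

-- Return documents whose themes relate to the concept "politics".
SELECT id
FROM doc
WHERE CONTAINS(text, 'ABOUT(politics)') > 0;
```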



Themes in Multiple Languages

• You can create your own knowledge base using the
ctxkbtc compiler
– NLS_LANG must be set to the appropriate
language and character set before running ctxkbtc
– The thesaurus must also be loaded in the
same language
• If no knowledge base exists for the language, then
one is derived from the thesaurus and stoplist for
the language
– Themes are derived even when the knowledge
base is incomplete
– If a stop word is present in the thesaurus, then
the entry is stopped

Text Knowledge Base


Oracle Text is shipped with compiled knowledge bases for English and French. Users can
extend these knowledge bases by compiling a thesaurus into them with the ctxkbtc compiler. In
Oracle9i, Oracle Text is enhanced to compile a thesaurus and extract themes without an original
knowledge base.
Good-quality theme extraction requires an extensive thesaurus that covers a large subset of
the words in a language. A stoplist can be compiled into the new knowledge bases to mark
words that should be ignored.



Japanese Lexer

• Japanese lexer issues:
– No explicit word boundaries
– Two character sets: Hiragana and Katakana
– The V-gram lexer does not support themes and other
advanced features
• The new Japanese lexer provides additional functions:
– Supports theme queries and other Oracle Text
advanced features
– Parses any English included with the Japanese
characters

Lexer
Unlike English and most European languages, Asian languages such as Chinese, Japanese, and
Korean have no explicit word boundaries, so they need separate lexers. Previous versions of
Oracle Text provided the Japanese V-Gram Lexer, which uses a V-gram algorithm to generate
variable-length Japanese tokens for indexing and querying. It works well for Japanese full-text
retrieval. However, the tokens it generates are not natural words (morphemes), so it cannot
support theme queries and other Oracle Text advanced features. The Japanese Lexer uses a
different algorithm to parse Japanese text and generates word-based tokens. This reduces index
size and improves query performance. It also serves as the foundation for Oracle Text advanced
features in Japanese.
Functionality
The Japanese Lexer parses a Japanese text column and generates Japanese word tokens for
indexing and querying. It enhances Oracle Text's Japanese full-text retrieval capability and
lays the foundation for Japanese advanced features.
Ambiguity is one of the major problems in generating word tokens for Asian languages such as
Chinese, Japanese, and Korean (CJK). Because CJK text has no explicit word boundaries and one
CJK word may be a substring of another CJK word, it is easy to generate a wrong token. The
Chinese Lexer was created to solve this problem.
Because the architecture of the Chinese Lexer is extensible, the Japanese Lexer is built on
it. In addition, the Japanese Lexer handles Japanese-specific characters (Hiragana and
Katakana).
The Japanese V-Gram Lexer supports the Japanese EUC, Shift JIS, and UTF8 character sets. The
Japanese Lexer supports the same character sets.



Japanese Lexer

• Example:
BEGIN
  CTX_DDL.CREATE_PREFERENCE (
    'JAPANESE_LEXER',
    'JAPANESE_LEXER');
END;
• Performance:
– Smaller index size improves query performance
– Index build time is longer than with the V-gram lexer
• Migration
– Drop index with Japanese V-gram lexer
– Create index with Japanese lexer

PL/SQL Statements
A new lexer type, JAPANESE_LEXER, is introduced. No new attributes are introduced for the
Japanese Lexer.
Performance Attributes
Performance is an important consideration for this feature. Indexing time, query response time,
and index size are the three most important performance attributes to minimize in Oracle
Text.
• The Japanese Lexer generates a smaller index than the Japanese V-Gram Lexer.
• Compared with the Japanese V-Gram Lexer, the Japanese Lexer shortens query response
time.
• The Japanese Lexer uses a more complicated algorithm to generate tokens, so indexing
takes longer than with the Japanese V-Gram Lexer.
• The Japanese Lexer generates real word tokens, so compared with the Japanese V-Gram
Lexer, its tokens and query results are more precise.
Performance Tuning
Systems that need shorter indexing time and can tolerate a larger index and slower queries
should use the Japanese V-Gram Lexer instead of the Japanese Lexer.
Migration
The Japanese Lexer is not compatible with the Japanese V-Gram Lexer. An index created with the
Japanese V-Gram Lexer cannot be used with the Japanese Lexer. If you have used the Japanese
V-Gram Lexer and want to upgrade to the Japanese Lexer, do the following:
1. Drop the index created with the Japanese V-Gram Lexer.
2. Create an index with the Japanese Lexer.
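The two migration steps can be sketched as follows. The index name doc_ja_idx, the table doc, its text column, and the preference name my_ja_lexer are placeholders chosen for this example, not names from the course material.

```sql
-- Hypothetical names: doc_ja_idx, doc, text, my_ja_lexer.
-- 1. Drop the index that was built with the Japanese V-Gram Lexer.
DROP INDEX doc_ja_idx;

-- 2. Define a JAPANESE_LEXER preference and create the index with it.
BEGIN
  ctx_ddl.create_preference('my_ja_lexer', 'JAPANESE_LEXER');
END;
/

CREATE INDEX doc_ja_idx ON doc (text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer my_ja_lexer');
```

Because the two lexers produce different tokens, the index is rebuilt from scratch rather than altered in place.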
Korean Morphological Lexer

• The KOREAN_MORP_LEXER lexer offers the
following benefits over KOREAN_LEXER:
– Better morphological analysis of Korean
text
– Faster indexing
– Smaller indexes
– More accurate query searching
• You can use KOREAN_MORP_LEXER if your
database character set is one of the following:
– KO16KSC5601
– UTF8

The LEXER Preference


Use the LEXER preference to specify the language of the text to be indexed. The
KOREAN_MORP_LEXER type identifies tokens in Korean text for creating Oracle Text
indexes.
Limitation
Sentence and paragraph sections are not supported with the Korean lexer.



Supplied Dictionaries for
the KOREAN_MORP_LEXER

Dictionary     File
System         $ORACLE_HOME/ctx/data/kolx/drk2sdic.dat
Grammar        $ORACLE_HOME/ctx/data/kolx/drk2gram.dat
Stopword       $ORACLE_HOME/ctx/data/kolx/drk2xdic.dat
User-defined   $ORACLE_HOME/ctx/data/kolx/drk2udic.dat

Text Format
The grammar, user-defined, and stop word dictionaries are text files in KSC 5601 format. You can
modify these dictionaries using the defined rules. The system dictionary must not be modified.
You can add unregistered words to the user-defined dictionary file. The rules for specifying
new words are in the file.



KOREAN_MORP_LEXER Example

• Specify COMPONENT_WORD indexing as follows:


BEGIN
  ctx_ddl.create_preference (
    'korean_lexer', 'korean_morp_lexer');
  ctx_ddl.set_attribute (
    'korean_lexer','composite','component_word');
END;

• Create the index as follows:


CREATE INDEX doc_x ON doc (text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer korean_lexer');

Specify COMPONENT_WORD Indexing


In the first code box on the slide:
• The Korean lexer preference is set to KOREAN_MORP_LEXER.
• The composite attribute of the lexer is set to COMPONENT_WORD.
Create the Index
The index created in the second code box on the slide has the following properties:
• The TEXT column in the DOC table is being indexed.
• The index type being created is an Oracle Text context index.
• The lexer used to create the index is the Korean lexer specified in the previous code box.



Attributes of the Korean Morphological Lexer
Attribute       Description
verb_adjective  Specify TRUE or FALSE to index verbs and adjectives.
one_char_word   Specify TRUE or FALSE to index one-syllable words.
number          Specify TRUE or FALSE to index numbers.
user_dic        Specify TRUE or FALSE to use the user-defined dictionary.
stop_dic        Specify TRUE or FALSE to apply the stop word dictionary
                that belongs to KOREAN_MORP_LEXER.
composite       Specify the indexing style for composite nouns
                (COMPOSITE_ONLY, NGRAM, or COMPONENT_WORD).
morpheme        Specify TRUE or FALSE for morphological analysis. If set to
                FALSE, tokens are created from the words that are divided
                by delimiters such as white space in the document.
to_upper        Specify TRUE or FALSE to convert English to uppercase.
hanja           Specify TRUE to index hanja characters. If set to FALSE,
                hanja characters are converted to hangul characters.
long_word       Specify TRUE to index long words that have more than 16
                syllables in Korean. Default is FALSE.
japanese        Specify TRUE to index Japanese characters in KSC5601 code.
                Default is FALSE.
english         Specify TRUE to index alphanumeric strings.
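The attributes above are set with ctx_ddl.set_attribute, in the same way as the composite attribute on the earlier slide. The following is a minimal sketch; the preference name my_korean_lexer is hypothetical, and the values are examples rather than recommendations.

```sql
-- Hypothetical preference name: my_korean_lexer.
BEGIN
  ctx_ddl.create_preference('my_korean_lexer', 'KOREAN_MORP_LEXER');
  -- Also index verbs and adjectives.
  ctx_ddl.set_attribute('my_korean_lexer', 'verb_adjective', 'TRUE');
  -- Index numbers as tokens.
  ctx_ddl.set_attribute('my_korean_lexer', 'number', 'TRUE');
  -- Index composite nouns as n-grams.
  ctx_ddl.set_attribute('my_korean_lexer', 'composite', 'NGRAM');
END;
/
```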



UTF-16 Auto-Detection of Big-Endian or
Little-Endian

• Use the CHARSET_FILTER to convert
documents from a UTF-16 character set to the
database character set.
• Set the CHARSET attribute to UTF16AUTO to
automatically detect big- or little-endian data.
• The first two bytes in the document determine
the byte order:
• 0xFE 0xFF: big-endian
• 0xFF 0xFE: little-endian
• Anything else: big-endian

UTF-16 Big- and Little-Endian Detection


If your character set is UTF-16, you can specify UTF16AUTO to automatically detect big- or
little-endian data. Oracle does so by examining the first two bytes of the document (the byte
order mark) and using the following logic to determine the byte order:
• If the first two bytes are 0xFE, 0xFF, the document is recognized as big-endian and the
remainder of the document minus those two bytes is passed on for indexing.
• If the first two bytes are 0xFF, 0xFE, the document is recognized as little-endian and the
remainder of the document minus those two bytes is passed on for indexing.
• If the first two bytes are anything else, the document is assumed to be big-endian and the
whole document including the first two bytes is passed on for indexing.
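A minimal sketch of a filter preference that requests this auto-detection is shown below. The preference name my_utf16_filter, the index name doc_utf16_idx, and the doc table with its text column are assumptions made for illustration.

```sql
-- Hypothetical names: my_utf16_filter, doc_utf16_idx, doc, text.
BEGIN
  ctx_ddl.create_preference('my_utf16_filter', 'CHARSET_FILTER');
  -- UTF16AUTO asks the filter to detect the byte order of each document.
  ctx_ddl.set_attribute('my_utf16_filter', 'charset', 'UTF16AUTO');
END;
/

CREATE INDEX doc_utf16_idx ON doc (text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter my_utf16_filter');
```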



Composite Indexing User-Dictionary in
Different Languages

• For example, the supplied user dictionary file
for German is:
$ORACLE_HOME/ctx/data/del/drde.dct

• The following example entries are for the
German word Hauptbahnhof:
Hauptbahnhof<tab>Haupt#Bahnhof
Hauptbahnhofe<tab>Haupt#Bahnhof
Hauptbahnhöfe<tab>Haupt#Bahnhof
Hauptbahnhoefe<tab>Haupt#Bahnhof

Composite Indexing
You can create a user dictionary for your language to customize how words are decomposed. You
create the user dictionary in the $ORACLE_HOME/ctx/data/<language id> directory. The
user dictionary must have the suffix .dct.
The format for the user dictionary is as follows:
input term <tab> output term
The individual parts of the decomposed word must be separated by the # character. If composite
indexing is enabled, then the composite word and its components are all indexed.
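Composite indexing is enabled through the composite attribute of the BASIC_LEXER. The sketch below shows one way this might look for German; the preference name my_german_lexer, the index name doc_de_idx, and the doc table with its text column are hypothetical.

```sql
-- Hypothetical names: my_german_lexer, doc_de_idx, doc, text.
BEGIN
  ctx_ddl.create_preference('my_german_lexer', 'BASIC_LEXER');
  -- Turn on German composite word indexing.
  ctx_ddl.set_attribute('my_german_lexer', 'composite', 'GERMAN');
END;
/

CREATE INDEX doc_de_idx ON doc (text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer my_german_lexer');
```

With such an index, and a dictionary entry like the ones shown on the slide, a document that contains Hauptbahnhof would be expected to be found by a query for Bahnhof as well.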



Summary

In this lesson, you should have learned how to:


• Create themes in multiple languages
• Use the new Japanese lexer
• Understand the benefits of Korean
morphological lexer
• Use different attributes for the Korean
morphological lexer
• Detect big-endian or little-endian data for the UTF-16
character set
• Use a composite indexing user dictionary in different
languages



Classifying Documents by Content



Objectives

After this lesson, you should be able to:


• Describe document classification
• Create a document classification application



What is Document Classification?

• Classifies documents based on content, such
as sports, crime, or technology
• Used to route documents, depending on their
classification
• Traditional systems answer the question
“Which documents match this query?”
• Classification answers the question “Which
queries match this document?”

Document Classification
A document classification application is one that classifies an incoming stream of documents
based on their content. Such applications are also known as document routing or filtering
applications.
Consider the following scenarios:
• An online news agency might need to classify its incoming stream of articles as they
arrive into categories such as sports, crime, and technology.
• A brokerage firm receives earnings reports from a news wire service. It would like to
email these reports to its traders as they arrive. Each trader is interested only in certain
companies or sectors, so the reports must be routed by the textual content. For example,
if a report about Oracle arrives, it should be emailed to the software analysts, whereas a
report about PG&E might go to the CPG analysts.
• The technical support representatives of a company support several different products.
The support center has a single e-mail address for ease of use. Each e-mail message
must be classified as it arrives and forwarded to the specific support group which has
the expertise for that product.
All three scenarios can use content-based classification.
Unlike a traditional document retrieval system, which works on a large corpus of documents,
classification operates on a stream of documents, analyzing and classifying each document in
turn. The classification is generally done using the customer’s set of rules or queries.



What is Document Classification?

• Defines classification rules as Oracle Text
queries, which are indexed
• The CTXRULE index type indexes the rules
(queries) that define each class
• Uses the MATCHES operator to classify
documents
• Supports only plain text, XML, and HTML
documents

CTXRULE Indexes
Oracle Text enables you to build these applications with the CTXRULE index type. This index
type essentially indexes the rules (queries) that define each class.
When documents arrive, the MATCHES operator can be used to match each document with the
rules that select it.
Oracle Text supports document classification for only plain text, XML, and HTML documents.
Documents in binary formats are not supported.



How to Create a
Document Classification Application
• To classify a document:
1. Create a table of queries that defines your
classifications
2. Populate the table with the classifications
and the queries that define each
3. Create the CTXRULE index on the queries
table
4. Use the MATCHES operator to classify each
document
• In the example application, the document is an
advertisement that is routed to an employee

The Example Application


The application on the following slides demonstrates how you can use CTXRULE indexes to
route a document. The document is an advertisement that is routed to the appropriate
employee, depending on the content of the advertisement.



How to Create a
Document Classification Application

1. Create a table of queries that define your
classifications:

CREATE TABLE ad_routing (
  rule_id NUMBER PRIMARY KEY,
  mail_id VARCHAR2(30),
  rule    VARCHAR2(2000) );

The AD_ROUTING Table


This table is used to determine which employee should receive the advertisement. It includes the
following columns:
• RULE_ID is the primary key.
• MAIL_ID is used to route the document to the appropriate person. In this application, the
MAIL_ID classifies the document by indicating the person who receives the advertisement.
For example, all queries that match advertisements with information about modems are
routed to user BERNST.
• RULE contains the Oracle Text predicate used to classify the document. It is the column that
is used in the Oracle Text MATCHES function to determine how the document is classified so
that it can be routed to the appropriate person defined in the previous column.
How Text Uses These Columns
The only column in this table that is used by Text is the RULE column. This column defines the
criteria for selecting the document. The other columns are included to meet the requirements of the
application, not to meet the requirements of Text. In this example, the other columns are needed to
identify the rule with a key and to route the advertisement to the appropriate employee.



How to Create a
Document Classification Application

2. Populate the table with the routing e-mail IDs
and the queries that classify the documents:

INSERT INTO ad_routing
  VALUES (1, 'BERNST', 'modems or monitors');
INSERT INTO ad_routing
  VALUES (2, 'BERNST', 'ABOUT(modems)');
INSERT INTO ad_routing
  VALUES (3, 'DAUSTIN', 'monitors');

The Slide Example


The last column inserted is the column that is used with the MATCHES operator to determine the
classification of a document.
Multiple rows are inserted with the same MAIL_ID. So, a document might be selected based on
more than one rule in a category. For example, an advertisement that discusses modems and
monitors could be selected for all three rows.
You can also insert multiple rows with the same query. This would allow a single advertisement to
be routed to more than one person.
You could also have created a single row to replace the first two rows. Its query column value
would be:
modems or monitors or ABOUT(modems)



How to Create a
Document Classification Application

3. Create the CTXRULE index on the queries table:

CREATE INDEX ad_routing_idx
  ON ad_routing(rule)
  INDEXTYPE IS ctxsys.ctxrule
  PARAMETERS (
    'lexer my_basic_lexer
     wordlist my_wordlist' );

Create the CTXRULE Index


The SQL statement on the slide:
• Creates an index called AD_ROUTING_IDX.
• Populates the index from the RULE column from the AD_ROUTING table. This is the table
that was created and populated in the previous slides. The indexed column contains the
queries that are used with the MATCHES function to classify documents.
• Creates an index containing classification rules, because the index type is
CTXSYS.CTXRULE.
• Overrides the default lexer and word list with the objects specified in the PARAMETERS
clause.
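The CREATE INDEX statement refers to the preferences my_basic_lexer and my_wordlist without showing how they are defined. One plausible definition is sketched below; the attribute choice is an assumption, not taken from the course material, and the preferences must exist before the index is created.

```sql
-- Assumed definitions for the preferences named in the PARAMETERS clause.
BEGIN
  -- Lexer for the rules and the documents; default attributes are kept.
  ctx_ddl.create_preference('my_basic_lexer', 'BASIC_LEXER');

  ctx_ddl.create_preference('my_wordlist', 'BASIC_WORDLIST');
  -- Enable English stemming for rules that use the stem operator.
  ctx_ddl.set_attribute('my_wordlist', 'stemmer', 'ENGLISH');
END;
/
```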



Supported Preferences
for CTXRULE Indexes

Class           Description
Lexer           Language of rules and documents
Wordlist        Whether stem queries are enabled
Storage         How the index data is stored
Stoplist        Words and themes not applicable in queries
Section Group   How document sections are defined
Memory          Memory used when building the index
[No]Populate    Whether the index is populated

Preferences Not Used With CTXRULE Indexes


The DATASTORE preference is not used, because the document is passed to the MATCHES
function, so the index does not need to know the location of the document.
The FILTER preference is not used, because only plain text, HTML, or XML documents are
valid for a CTXRULE index. However, it is possible to use the CTX_DOC.FILTER procedure
to generate either a plain text or HTML version of a document that can be used as input to the
MATCHES function.
LEXER Preferences and CTXRULE Indexes
Use the LEXER preference to specify the language of the text to be indexed. Only the
BASIC_LEXER is supported for indexing your query set.



How to Create a
Document Classification Application
4. Use the MATCHES operator to classify documents

SQL> SELECT DISTINCT ar.mail_id
     FROM ad_routing ar,
          print_media pm
     WHERE pm.ad_id = 1
     AND MATCHES(ar.rule,
                 pm.ad_finaltext) > 0;

MAIL_ID
-------------
BERNST

MATCHES Function
Use this function to find all rows in a table that match a given document. This function:
• Requires that document be plain text, HTML, or XML
• Requires a CTXRULE index for the column being used in the function
• Returns a number that indicates whether the document matches the query rule: It is 0 for
FALSE or 1 for TRUE
MATCHES Syntax
MATCHES (
[table.]column VARCHAR2,
document VARCHAR2 or CLOB)
RETURN NUMBER;
where
[table.]column is the column containing the indexed query set
document is the document to be classified
Slide Example
The table PRINT_MEDIA has the columns:
• AD_ID is the key that determines which advertisement to match
• AD_FINALTEXT contains the advertisement
The query on the slide lists the employee mail IDs that match the advertisement in the row with
AD_ID = 1. The DISTINCT clause removes duplicate mail IDs. In the example, the
advertisement matches one or more of the rules associated with the user BERNST.
Guidelines

• DML updates the index asynchronously


• The document:
– Is not indexed
– Is parsed when MATCHES is called
– Does not need to be stored in the database
• The MATCHES function:
– Is valid only with a CTXRULE index
– Allows a subset of Text query operators
– Is callable from a trigger
• Japanese documents may not be classified
properly, because wildcards are not supported

The Indexed Column


As with a CONTEXT index, any DML performed on the indexed column is reflected asynchronously in
the CTXRULE index. The indexed column is the column that contains the Text predicates used to
classify the document. This means that, after modifying the rules in the indexed column,
you update the index by synchronizing it. For example, to synchronize the index
PRINT_MEDIA.AD_FINAL_TX, execute one of the following commands:
• SQL: ALTER INDEX print_media.ad_final_tx REBUILD
PARAMETERS('sync');
• PL/SQL: ctx_ddl.sync_index('print_media.ad_final_tx');
There could be thousands or millions of rows stored in the indexed table. Each row represents a
different classification rule for documents.
Parsing the Document
Text parses the document at the time that the MATCHES function is called. The document is not
indexed; the Text predicates that classify the document are indexed. Because the document is passed
to the MATCHES function, it does not need to be stored in the database and does not need to be
indexed.
The CTXRULE index is used in those situations where a document only needs to be parsed once to
determine how the document is to be handled. If you intend to repeatedly execute Text queries
against the document, then you may want to index the document to avoid the overhead of parsing
the document each time it is queried.
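Because the document is only passed to MATCHES, it can be supplied directly as a string. The query below is a minimal sketch that reuses the AD_ROUTING table and its CTXRULE index from the earlier slides; the advertisement text is a made-up literal.

```sql
-- The document is a literal; it is neither stored nor indexed.
SELECT DISTINCT ar.mail_id
FROM ad_routing ar
WHERE MATCHES(ar.rule,
              'Clearance sale: modems at half price') > 0;
```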



Valid Query Operators in MATCHES
The Text operators used with the MATCHES function are limited. Valid operators include ABOUT,
AND, BT*, EQUIVALENCE, NEAR, NOT, NT*, OR, PHRASE, PT, RT, STEM, SYN, TR, TRSYS,
TT, and WITHIN. Invalid operators include ACCUM, EQUIV, FUZZY, MINUS, SOUNDEX,
THRESHOLD, WEIGHT, and WILDCARD.
The thesaurus operators (BT*, NT*, PT, RT, SYN, TR, TRSYS, TT) are supported. However, these
operators are expanded using a snapshot of the thesaurus at index time, not when the MATCHES
function is issued. This means that if you change your thesaurus after you index, you must re-index
your query set.
Calling the MATCHES Function
BEFORE INSERT triggers are a natural place from which to invoke the MATCHES function. The values
returned from a query with the MATCHES function can be used to populate a column that classifies
the document. This column is in the same table as the document.
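A trigger of this kind might look like the following sketch. It assumes that PRINT_MEDIA has an extra column, here called routed_to, to hold the classification; that column and the trigger name are hypothetical and are not part of the example schema described in this lesson.

```sql
-- Hypothetical: print_media.routed_to VARCHAR2(30) and the trigger name.
CREATE OR REPLACE TRIGGER print_media_route_trg
  BEFORE INSERT ON print_media
  FOR EACH ROW
BEGIN
  -- Keep one matching mail ID; MIN returns NULL when no rule matches.
  SELECT MIN(ar.mail_id)
    INTO :NEW.routed_to
    FROM ad_routing ar
   WHERE MATCHES(ar.rule, :NEW.ad_finaltext) > 0;
END;
/
```

MIN is used here only to reduce several matching rules to a single value; an application that routes one advertisement to several people would need a separate routing table instead of a single column.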
Japanese Documents
Japanese documents may not be classified properly because the JAPANESE_VGRAM_LEXER
sometimes inserts wildcards, and wildcards are not supported in classification.



Summary

In this lesson, you should have learned how to:


• Describe document classification
• Create a document classification application



New Oracle Text XPath Operators



Objectives

After this lesson, you should be able to:


• Write a predicate clause using the INPATH
operator
• Write a predicate clause using the HASPATH
operator



New Oracle Text Operators

• XPATH specification for XML document queries


• Two new XPATH operators in Oracle Text:
– INPATH
– HASPATH
• Both operators:
– Select XML documents based on section
paths or section path contents
– Are used with Oracle Text CONTAINS function
– Operate only on documents indexed using
PATH_SECTION_GROUP
• The section paths look like directory paths

New XPATH Operators


The XPATH specification is a W3C recommendation that defines queries on XML documents.
The two new XPATH operators, INPATH and HASPATH, are used in WHERE clauses to select
documents based on section paths or section path contents.
These XPATH operators are implemented as query operators in the Oracle Text CONTAINS function.
The index being searched must be created with the PATH_SECTION_GROUP for these
operators to work.
Section Paths
The section path is the sequence of nested sections that contain the text. For example, in the
following document, Mozart is in the section path /music/composer.
<music>
<composer>
Mozart
</composer>
<opera>
The Magic Flute
</opera>
</music>
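To make documents like this one searchable by path, the index needs a PATH_SECTION_GROUP section group. The following is a minimal sketch; the section group name my_path_group and the index name doc_path_idx are hypothetical, and the doc table with its id and text columns is assumed from the query examples in this lesson.

```sql
-- Hypothetical names: my_path_group, doc_path_idx.
BEGIN
  ctx_ddl.create_section_group('my_path_group', 'PATH_SECTION_GROUP');
END;
/

CREATE INDEX doc_path_idx ON doc (text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('section group my_path_group');

-- Find documents where Mozart appears in the path /music/composer.
SELECT id
FROM doc
WHERE CONTAINS(text, 'Mozart INPATH (/music/composer)') > 0;
```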



INPATH Operator Searches

• Similar to the WITHIN operator, except that INPATH
uses a path instead of a single section name
• Can be used to search for:
– Text within a section
– Text within a section path
– Text within a section path that is specified
using wildcards
– Text within a section with a specific
attribute value

INPATH Description
Use this operator to do path searching in XML documents. This operator is like the WITHIN
operator except that the right-hand side is a path enclosed in parentheses, rather than a single
section name.
Tags and attribute names in path searching are case-sensitive.



INPATH Syntax

• Syntax is: term INPATH (path[search])
where:
– term is the search string
– INPATH is the operator
– path is a section path
– search is additional search criteria
• Additional search criteria syntax is:
– path
– tag="value"
– @attribute = "value"

INPATH Description
The components of the INPATH operation include the following:
• term is the search string, for example: Mozart
• INPATH is the operator name, which is a constant
• path is the section path. The syntax allows the developer to specify the following path
search properties:
– A document level that indicates whether the path starts at the document level or at
any level
– A wildcard for a single level in a path
– A wildcard for multiple levels in a path
– Boolean operations to negate and combine path properties
• search is an optional clause that provides additional search criteria. It can have one of
three formats:
– path is a section path, as just described
– tag="value" tests for a tag equal to value
– @attribute="value" tests for an attribute equal to value
The brackets ([]) are required when the search clause is included.
The equality operator (=) can be replaced by the inequality operator (!=). These are the only
two valid operators.
The tag or attribute name must be the left operand, and the literal must be the right
operand.



Basic INPATH Examples
• Mozart within the top-level section music:
SELECT id, doc_name
FROM doc
WHERE CONTAINS(text,
  'Mozart INPATH (music)') > 0;
• Mozart in section path music/opera:
Mozart INPATH (music/opera)
• Mozart in music with opera = Magic Flute:
Mozart INPATH (music[opera="Magic Flute"])

Top-Level Tag Searching


To return documents that have Mozart within the top-level tags <music> and </music>, use
one of the following clauses:
Mozart INPATH (/music)
Mozart INPATH (music)
The music tag must be a top-level tag, which is the document-type tag. These clauses are used in
the first two examples on the screen.
Direct Parentage Path Searching
To return documents where Mozart appears in an opera element which is a direct child of a top-
level music element, use one of the following clauses:
Mozart INPATH (music/opera)
Mozart INPATH (/music/opera)
For example, this clause selects documents containing
<music><opera>Mozart's Magic Flute</opera></music>
Tag Value Searching
To return documents where Mozart appears in the top-level music element which has an opera
tag equal to Magic Flute, use one of the following clauses:
Mozart INPATH (music[opera="Magic Flute"])
Mozart INPATH (/music[opera="Magic Flute"])
For example, this clause selects documents containing
<music>Mozart<opera>Magic Flute</opera></music>



INPATH Lexer Dependencies
• This query:

dog INPATH (A[@B = "pot of gold"])

• Matches these sections:

<A B="POT OF GOLD">dog</A>
<A B="pot of gold">dog</A>

<A B="POT BLACK GOLD">dog</A>

<A B="POT_OF_GOLD">dog</A>

INPATH Lexer Dependencies


The test for equality or inequality depends on your lexer settings. With the default settings, the
query
dog INPATH (A[@B = "pot of gold"])
matches the following sections:
<A B="POT OF GOLD">dog</A>
<A B="pot of gold">dog</A>
because the lexer is case-insensitive by default.
It also matches
<A B="POT BLACK GOLD">dog</A>
because OF is a default stop word in English and the query matches any word in that position.
It also matches
<A B="POT_OF_GOLD">dog</A>
because the underscore character is not a join character by default, so POT_OF_GOLD is split
into the tokens POT, OF, and GOLD.
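The lexer settings that drive this behavior can be changed with a lexer preference. The sketch below shows two BASIC_LEXER attributes that would be expected to tighten the matching; the preference name my_strict_lexer is hypothetical, and the index would have to be created with this lexer for the settings to take effect.

```sql
-- Hypothetical preference name: my_strict_lexer.
BEGIN
  ctx_ddl.create_preference('my_strict_lexer', 'BASIC_LEXER');
  -- Preserve case, so "pot of gold" and "POT OF GOLD" become different tokens.
  ctx_ddl.set_attribute('my_strict_lexer', 'mixed_case', 'YES');
  -- Treat the underscore as part of a token, so POT_OF_GOLD stays one token.
  ctx_ddl.set_attribute('my_strict_lexer', 'printjoins', '_');
END;
/
```

Stop word behavior is controlled separately, through the stoplist used by the index.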



INPATH Examples with Wildcards
• Mozart in section music at any level
Mozart INPATH (//music)
• Mozart in section opera with an ancestor music
Mozart INPATH (music//opera)
• Mozart in section opera with grandparent music
Mozart INPATH (music/*/opera)
• Mozart in opera with various ancestors
Mozart INPATH (music/*/classical/*/*/opera)

Any-Level Tag Searching


To returns documents that have Mozart in the <music> tag at any level, use the following
syntax:
Mozart INPATH (//music)
This query is the same as:
Mozart WITHIN music
Any-Level Descendant Searching
To return documents where Mozart appears in a opera element which is some descendant, at
any level, of a top-level music element, use the following syntax:
Mozart INPATH(music//opera)
Single-Level Wildcard Searching
To return documents where Mozart appears in an opera element that is a grandchild of (two
levels down from) a top-level music element, use the following syntax:
Mozart INPATH(music/*/opera)
Multiple-Level Wildcard Searching
Use the following clause to return documents where:
• Mozart appears in an opera element which is three levels down from a classical
element
• The classical element is two levels down from the top element, music
The syntax is:
Mozart INPATH(music/*/classical/*/*/opera)
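
The wildcard forms above can be tried out with a small Python model built on the standard xml.etree.ElementTree module. This is a sketch for building intuition, not how Oracle Text evaluates INPATH: the inpath function, the synthetic wrapper element, and the word-splitting rule are all assumptions of the example.

```python
import re
import xml.etree.ElementTree as ET

def inpath(word, xml, path):
    """Return True if `word` occurs in a section selected by `path`."""
    # Hang the document under a synthetic parent so that a path without a
    # leading // is anchored at the document's top-level element.
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(xml))
    etree_path = "." + path if path.startswith("//") else "./" + path.lstrip("/")
    for section in wrapper.findall(etree_path):
        text = "".join(section.itertext()).lower()
        if word.lower() in re.findall(r"\w+", text):
            return True
    return False

doc = "<music><classical><opera>Mozart</opera></classical></music>"
print(inpath("Mozart", doc, "//music"))        # music at any level
print(inpath("Mozart", doc, "music//opera"))   # opera anywhere below top-level music
print(inpath("Mozart", doc, "music/*/opera"))  # opera exactly two levels down
print(inpath("Mozart", doc, "music/opera"))    # opera as a direct child of music
```

For this sample document the first three calls report a match and the last does not, because opera is a grandchild of music rather than a child.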
INPATH Examples with Attribute Searches

• Mozart in composer attribute in music section

Mozart INPATH (//music/@composer)

• Mozart in music section with composer attribute

Mozart INPATH (//music[@composer])

• Mozart in section music where the attribute composer is equal to Bach

Mozart INPATH (music[@composer = "BACH"])

Attribute Searching
To find all documents where Mozart appears in the composer attribute of a music element
at any level, use the clause from the first code box on the slide:
Mozart INPATH (//music/@composer)
Attributes must be bound to a direct parent.
To perform a similar search with the music element at the top level, use one of the following
clauses:
Mozart INPATH (music/@composer)
Mozart INPATH (/music/@composer)
Attribute Existence Testing
To find all documents where Mozart appears in a top-level music element which has a
composer attribute, use the following clause (the second code box on the slide shows the
any-level form, //music[@composer]):
Mozart INPATH (music[@composer])
Attribute Value Testing
To find all documents where Mozart appears in a top-level music element which has a
composer attribute whose value is Bach, use the following clause from the last code box on
the slide:
Mozart INPATH (music[@composer = "Bach"])
You can also search for composers that are not Bach, using this clause:
Mozart INPATH (music[@composer != "Bach"])
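
The difference between searching inside an attribute, testing that it exists, and testing its value can be illustrated with a short Python sketch. It only mimics the behavior described above and is not Oracle Text code; the helper names and the sample document are assumptions, and values are compared case-insensitively to mirror the default lexer, which is why "BACH" on the slide and "Bach" in these notes behave the same.

```python
import re
import xml.etree.ElementTree as ET

def words(text):
    """Lowercased word list, standing in for a case-insensitive lexer."""
    return re.findall(r"\w+", text.lower())

def has_word(section, word):
    return word.lower() in words("".join(section.itertext()))

doc = ET.fromstring('<music composer="Bach">Mozart variations</music>')

# Mozart INPATH (music/@composer): the word must occur in the attribute value.
in_attribute = doc.tag == "music" and "mozart" in words(doc.get("composer", ""))

# Mozart INPATH (music[@composer]): the attribute only has to exist.
exists = doc.tag == "music" and "composer" in doc.attrib and has_word(doc, "Mozart")

# Mozart INPATH (music[@composer = "BACH"]): the attribute value must match.
equals = exists and words(doc.get("composer")) == words("BACH")

# Mozart INPATH (music[@composer != "Bach"]): the attribute value must differ.
not_equals = exists and words(doc.get("composer")) != words("Bach")

print(in_attribute, exists, equals, not_equals)
```

For the sample document only the existence test and the equality test succeed: Mozart is in the element text, not in the composer attribute, and the attribute value is Bach.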



INPATH Examples Using Descendants

• Mozart in top-level music with child composer

Mozart INPATH (music[composer])

• Mozart in any level music with child composer

Mozart INPATH (//music[composer])

• Mozart in music with descendant composer

Mozart INPATH (music[.//composer])

Direct Descendant Searching


To find all documents where Mozart appears in the top-level music element which has a
composer element as a direct descendant, use the clause from the first code box on the slide:
Mozart INPATH (music[composer])
To perform a similar search, except the music element can be at any level, use the clause from
the second code box on the slide:
Mozart INPATH (//music[composer])
Indirect Descendant Searching
To find all documents where Mozart appears in the top-level music element which has a
composer element as a descendant at any level, use the clause from the third code box on
the slide:
Mozart INPATH (music[.//composer])
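
The distinction between a direct child test and an any-level descendant test can be seen in a brief Python sketch using xml.etree.ElementTree. The helper names and the two sample documents are assumptions for illustration; the code models the predicates only and is not Oracle Text's implementation.

```python
import xml.etree.ElementTree as ET

def mozart_in(section):
    """True if the word Mozart occurs anywhere in the section's text."""
    return "Mozart" in "".join(section.itertext()).split()

def has_child(section, tag):
    # Models music[composer]: composer must be an immediate child.
    return section.find(tag) is not None

def has_descendant(section, tag):
    # Models music[.//composer]: composer may sit at any depth below music.
    return section.find(".//" + tag) is not None

direct = ET.fromstring("<music><composer>Bach</composer> Mozart</music>")
nested = ET.fromstring(
    "<music><opera><composer>Bach</composer></opera> Mozart</music>")

for doc in (direct, nested):
    print(mozart_in(doc) and has_child(doc, "composer"),
          mozart_in(doc) and has_descendant(doc, "composer"))
```

Both documents pass the descendant test, but only the first passes the direct child test.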



INPATH Examples with Boolean Operators
• Mozart in music without jazz child
Mozart INPATH (music[NOT(jazz)])

• Mozart in music with flute and piano children

Mozart INPATH (music[flute AND piano])
• Mozart in music with flute or piano children
Mozart INPATH (music[flute OR piano])

• Mozart in music with composer Bach and opera

Mozart INPATH (music[opera AND @composer = "Bach"])

Searching with NOT


To return documents that have Mozart within the section music, which does not have jazz as
an immediate child, use the following clause from the first code box on the slide:
Mozart INPATH (music[NOT(jazz)])
Searching with AND and OR
The AND and OR Boolean operators can be used to combine predicates within the additional
search criteria of an INPATH search. The additional search criteria are specified within brackets.
To return documents that have Mozart in the top-level music section with flute and piano
as children, use the following clause from the second code box on the slide:
Mozart INPATH (music[flute AND piano])
To return documents that have Mozart in the top-level music section with flute or piano
as children, use the following clause from the third code box on the slide:
Mozart INPATH (music[flute OR piano])
Searching Attributes using Booleans
To return documents where Mozart appears in a top-level music element which has a direct
child of opera and a composer attribute equal to Bach, use the following clause from the
last code box on the slide:
Mozart INPATH (music[opera AND @composer = "Bach"])
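
The bracketed predicates combine like ordinary Boolean conditions on the section, which a few lines of Python can mirror. This is a sketch under assumed names (child, the sample document), not Oracle Text code, and it checks only the predicates, not the presence of the word Mozart.

```python
import xml.etree.ElementTree as ET

def child(section, tag):
    """True if the section has an immediate child element named `tag`."""
    return section.find(tag) is not None

doc = ET.fromstring('<music composer="Bach"><flute/><opera/>Mozart</music>')

not_jazz = not child(doc, "jazz")                             # music[NOT(jazz)]
flute_and_piano = child(doc, "flute") and child(doc, "piano")  # music[flute AND piano]
flute_or_piano = child(doc, "flute") or child(doc, "piano")    # music[flute OR piano]
opera_and_bach = (child(doc, "opera")                          # music[opera AND
                  and doc.get("composer", "").lower() == "bach")  # @composer = "Bach"]

print(not_jazz, flute_and_piano, flute_or_piano, opera_and_bach)
```

The sample document has flute and opera children but no piano or jazz child, so every predicate holds except the flute AND piano test.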



Complex INPATH Examples

• Mozart in composer with music with type = opera

Mozart INPATH (music[@type = "opera"]/composer)

• Mozart in music//opera/composer

(Mozart INPATH (//opera/composer)) INPATH (music)

• Nested INPATH operators are independent of each other

Complex Searches
The various types of INPATH clauses can be combined to form more complex searches.
Combining Path and Node Tests
The first clause in the slide returns documents where:
• Mozart appears in the composer element with a music parent
• The type attribute of the music parent is equal to opera
Nested INPATH
You can nest an entire INPATH expression in another INPATH expression. For example:
(Mozart INPATH (//opera/composer)) INPATH (music)
is equivalent to:
Mozart INPATH (music//opera/composer)
When you nest INPATH operators, the two paths are completely independent. The outer
INPATH path does not change the context node of the inner INPATH path. For example:
(Mozart INPATH (music)) INPATH (composer)
never finds any documents. The inner INPATH looks for Mozart within the top-level tag music,
and the outer INPATH restricts the match to documents whose top-level tag is composer. A
document can have only one top-level tag, so both conditions cannot hold at once.
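
The independence of nested paths can be modeled in Python by resolving both paths from the document root and then requiring the inner section to lie inside the outer one. This is a simplified sketch, not Oracle Text's evaluation strategy; nested_inpath, sections, and the wrapper element are assumed names.

```python
import re
import xml.etree.ElementTree as ET

def sections(wrapper, path):
    """Resolve a path from the document root, never from another section."""
    etree_path = "." + path if path.startswith("//") else "./" + path.lstrip("/")
    return wrapper.findall(etree_path)

def nested_inpath(word, xml, inner, outer):
    """Model of (word INPATH (inner)) INPATH (outer)."""
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(xml))
    for outer_section in sections(wrapper, outer):
        enclosed = set(outer_section.iter())  # the outer section and its descendants
        for inner_section in sections(wrapper, inner):
            text = "".join(inner_section.itertext()).lower()
            if inner_section in enclosed and word.lower() in re.findall(r"\w+", text):
                return True
    return False

doc = "<music><opera><composer>Mozart</composer></opera></music>"
print(nested_inpath("Mozart", doc, "//opera/composer", "music"))
print(nested_inpath("Mozart", doc, "music", "composer"))
```

The first call matches. The second cannot, because music and composer would both have to be the top-level tag of the same document.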



HASPATH Operator

• Locate all XML documents that contain the specified section path:

… HASPATH(music/opera/composer)…

• Locate all XML documents that contain a specific value in the specified section path:

… HASPATH(music="Mozart")…

• Requires an index with PATH_SECTION_GROUP
• Uses the // and * wild card operators
• Empty paths may return false matches

HASPATH Syntax
HASPATH(path) searches an XML document set and returns a score of 100 for all documents
where path exists. Separate the parent and child paths with the / character. For example, you
can specify music/opera/composer.
HASPATH(element="value") searches an XML document set and returns a score of 100
for all documents that have an element with content equal to value.
HASPATH Examples
The query
HASPATH(music/opera/composer)
finds and returns a score of 100 for the document
<music><opera><composer>Mozart</composer></opera></music>
without the query having to reference Mozart at all.
The query
Mozart INPATH (music)
finds
<music>Mozart</music>
but it also finds
<music>Mozart's Magic Flute</music>

To limit the query to the term Mozart and nothing else, you can use a section equality test with
the HASPATH operator. For example,
HASPATH(music="Mozart")
finds and returns a score of 100 only for the first document, and not the second.
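
The contrast between path existence, section equality, and word containment can be sketched in Python. The haspath function below is a hypothetical model, not the Oracle Text operator: it returns a Boolean rather than a score, and it compares lowercased word lists as a stand-in for the lexer.

```python
import re
import xml.etree.ElementTree as ET

def words(text):
    return re.findall(r"\w+", text.lower())

def haspath(xml, path, value=None):
    """Path existence test, or section equality test when `value` is given."""
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(xml))
    etree_path = "." + path if path.startswith("//") else "./" + path.lstrip("/")
    found = wrapper.findall(etree_path)
    if value is None:
        return len(found) > 0
    # Equality: the whole section content must match, not just contain, the value.
    return any(words("".join(s.itertext())) == words(value) for s in found)

print(haspath("<music><opera><composer>Mozart</composer></opera></music>",
              "music/opera/composer"))
print(haspath("<music>Mozart</music>", "music", "Mozart"))
print(haspath("<music>Mozart's Magic Flute</music>", "music", "Mozart"))
```

Only the last call fails: the section contains Mozart but is not equal to it.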
Guidelines
HASPATH uses the // and * wild card operators, the same as INPATH.
Because of how XML section data is recorded, false matches might occur with XML sections
that are completely empty as follows:
<A><B><C></C></B><D><E></E></D></A>
A query of HASPATH(A/B/E) or HASPATH(A/D/C) falsely matches this document. This
type of false matching can be avoided by inserting text between empty tags.



Summary

In this lesson, you should have learned how to:

• Write a predicate clause using the INPATH operator
• Write a predicate clause using the HASPATH operator
