1.regular Expressions

Session id: 40105
Introducing
Oracle Regular Expressions
Jonathan Gennick, O'Reilly & Associates
Peter Linsley, Oracle Corporation
What are Regular Expressions?
A language, or syntax, you can use to describe patterns in text
Example: [0-9]{3}-[0-9]{4}
That which you can describe, you can find and manipulate
Unix ed, grep, perl, and now everywhere!
What are Regular Expressions?
Follow this script to build the database and table:
– CREATE DATABASE RE;
– CREATE TABLE RE (DESCRIPTION VARCHAR2(6));
– INSERT INTO RE VALUES ('652');
– INSERT INTO RE VALUES ('217');
– INSERT INTO RE VALUES ('113');
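To make the example pattern [0-9]{3}-[0-9]{4} concrete, here is a minimal sketch that tests two strings against it with REGEXP_LIKE, one of the Oracle Database 10g functions introduced later in this deck. The sample strings and the column aliases are made up for illustration; only the pattern comes from the slide.

```sql
-- Hypothetical sample strings; only the pattern is from the slide.
SELECT s,
       CASE
         WHEN REGEXP_LIKE(s, '[0-9]{3}-[0-9]{4}') THEN 'match'
         ELSE 'no match'
       END AS result
FROM (SELECT 'Call 555-1234 today' AS s FROM dual
      UNION ALL
      SELECT 'Call 5551234 today' FROM dual);
-- 'Call 555-1234 today' -> match    (three digits, hyphen, four digits)
-- 'Call 5551234 today'  -> no match (no hyphen between the digit groups)
```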
Why Describe Patterns?
Humans have long worked with patterns:
– Postal and email addresses
– URLs
– Phone numbers
Often it’s not the data that’s important, but the pattern:
– Bioinformatics
– Validate format of URLs and email addresses
– Correct formatting of phone numbers
Pre-Oracle Database 10g
Find parks with acreage in their descriptions:
SELECT *
FROM park
WHERE description LIKE '%acre%';
Finds '217-acre' and '27 acres', but also 'few acres', 'more acres than all other parks', 'the location of a massacre', etc.
Pre-Oracle Database 10g cont.
Pattern matching with LIKE
– Limited to only two operators: % and _
OWA_PATTERN
– No support for alternation, ASCII only, relatively
poor performance
Non-native solutions
– External Procedures
– Difficult to deploy, maintain, and support
Client based solutions
– Pull all that data down across the network
Oracle Database 10g
Four regular expression functions
– REGEXP_LIKE: does the pattern match?
– REGEXP_INSTR: where does it match?
– REGEXP_SUBSTR: what does it match?
– REGEXP_REPLACE: replace what matched.
POSIX Extended Regular Expressions
– UNIX Regular Expressions
– Backreference support added
– Longest match not supported
REGEXP_LIKE
Determine whether a pattern exists in a string

Revisiting the acreage problem:
SELECT *
FROM park
WHERE REGEXP_LIKE(description,
'[0-9]+(-| )acre');
Finds '217-acre' and '27 acres'
Rejects 'few acres', 'more acres than all other parks', 'the location of a massacre', etc.
Useful for Constraints
Filter allowable data with check constraint

Only allow alphabetical characters:
CREATE TABLE t1 (c1 VARCHAR2(20),
CHECK (REGEXP_LIKE(c1,
'^[[:alpha:]]+$')));
INSERT INTO t1 VALUES ('newuser');
1 row created.
INSERT INTO t1 VALUES ('newuser1');
ORA-02290: check constraint violated
Metacharacters
Operator    Description
.           match any character
a?          match 'a' zero or one time
a*          match 'a' zero or more times
a+          match 'a' one or more times
a|b         match either 'a' or 'b'
a{m,n}      match 'a' between m and n times
[abc]       match either 'a' or 'b' or 'c'
(abc)       match group 'abc'
\n          match nth group
[:cc:]      match character class
[.ce.]      match collation element
[=ec=]      match equivalence class
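A few of these operators can be tried directly against literals. The query below is a small sketch; the sample strings and column aliases are invented for illustration, and the results in the comments assume default NLS settings.

```sql
-- Hypothetical sample strings exercising several metacharacters.
SELECT REGEXP_SUBSTR('colour color', 'colou?r')             AS optional_u,  -- colour
       REGEXP_SUBSTR('aaaa', 'a{2,3}')                      AS bounded,     -- aaa
       REGEXP_SUBSTR('cat or dog', 'dog|cat')               AS alternation, -- cat (leftmost match)
       REGEXP_SUBSTR('the the cat', '([[:alpha:]]+) \1')    AS backref,     -- the the
       REGEXP_SUBSTR('Room 42', '[[:digit:]]+')             AS char_class   -- 42
FROM dual;
```

The backreference \1 in the fourth column matches whatever text the first parenthesized group captured, which is why it finds the doubled word.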
REGEXP_INSTR
Find out where a match occurs:
SELECT REGEXP_INSTR(description,
'[0-9]+(-| )acre')
FROM park;
REGEXP_INSTR(DESCRIPTION,'[0-9]+…
---------------------------------
6
20
0
…
REGEXP_SUBSTR
Determine what text matched:
SELECT REGEXP_SUBSTR(description,
'[0-9]+(-| )acre')
FROM park;
REGEXP_SUBSTR(DESCRIPT
----------------------
217-acre
27 acre
…
REGEXP_SUBSTR Cont.
To extract just the acreage value:
SELECT REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+')
FROM park;
REGEXP_SUBSTR(REGEXP
--------------------
217
27
REGEXP_REPLACE
Convert acres to hectares:

UPDATE park
SET description = REGEXP_REPLACE(
description,'([0-9]+)(-| )acre',
TO_CHAR(0.4047 * TO_NUMBER(
REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+')))
|| '\2' || 'hectare');
REGEXP_REPLACE Cont.
Convert acres to hectares, step by step:
Source text: This 217-acre park is wonderful.
Pattern match: 217-acre (group 1 = 217, group 2 = -)
Nested REGEXP_SUBSTR extracts the number: 217
217 * 0.4047 = 87.8199
Replacement string: 87.8199\2hectare
\2 is replaced by group 2: 87.8199-hectare
Result: This 87.8199-hectare park is wonderful.
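To isolate how the backreferences in the replacement string behave, the sketch below swaps only the unit and leaves the number unconverted. The literal source string is the sample sentence from the walk-through; the column alias is an assumption.

```sql
-- \1 is the captured number, \2 is the captured separator (hyphen or space).
-- This only renames the unit; it does not convert acres to hectares.
SELECT REGEXP_REPLACE(
         'This 217-acre park is wonderful.',
         '([0-9]+)(-| )acre',
         '\1\2hectare') AS swapped_unit
FROM dual;
-- swapped_unit: This 217-hectare park is wonderful.
```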
DEMONSTRATION
Oracle Regular Expressions
Performance
Pattern matching can be complex

– Need to compile to state machine
– Lex and parse
– Examine all possible branches until match found
Compiled once per statement
– Can be faster than LIKE for complex scenarios
– Usually faster than PL/SQL equivalent
ZIP code checking 5 times faster
Performance Cont.
Some poorly-performing expressions:

– 'a{2}' will be slower than 'aa'
– '.*b' on input that doesn't contain a 'b' can
also be quite time-consuming
See Mastering Regular Expressions by Jeffrey Friedl, Chapter 6, Crafting an Efficient Expression
Using with Indexes
Use function-based indexes:

CREATE INDEX acre_ind
ON park (REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+'));
To support regular expression queries:
SELECT * FROM park
WHERE REGEXP_SUBSTR(REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+') = '217';
Using with Views
Hide the complexity from users:

CREATE VIEW park_acreage as
SELECT park_name,
REGEXP_SUBSTR(
REGEXP_SUBSTR(
description,
'[0-9]+(-| )acre'),
'[0-9]+') acreage
FROM park;
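Once the view exists, users can query acreage without writing any regular expressions. The query below is a hedged sketch that assumes the park_acreage view has been created as shown; the column alias acres is made up. Because REGEXP_SUBSTR returns text, the value is converted with TO_NUMBER before sorting.

```sql
-- Assumes the park_acreage view from this slide exists.
-- acreage is a character value, so convert it for numeric ordering.
SELECT park_name,
       TO_NUMBER(acreage) AS acres
FROM park_acreage
WHERE acreage IS NOT NULL        -- skip parks with no acreage in the description
ORDER BY TO_NUMBER(acreage) DESC;
```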
Using with PL/SQL
REGEXP_LIKE acts as a Boolean function in PL/SQL:
IF REGEXP_LIKE(description,
'[0-9]+(-| )acre') THEN
acres := REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+');
...
All other functions act identically in PL/SQL and SQL.
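The fragment on this slide can be placed in a complete anonymous block along the following lines. This is a sketch only: the variable declarations, the sample sentence, and the DBMS_OUTPUT call are assumptions added so that the block runs on its own.

```sql
-- In SQL*Plus, run SET SERVEROUTPUT ON first to see the output.
DECLARE
  -- Hypothetical local variables standing in for a column value.
  description VARCHAR2(100) := 'This 217-acre park is wonderful.';
  acres       VARCHAR2(20);
BEGIN
  IF REGEXP_LIKE(description, '[0-9]+(-| )acre') THEN
    -- The inner call finds '217-acre'; the outer call keeps only the digits.
    acres := REGEXP_SUBSTR(
               REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),
               '[0-9]+');
    DBMS_OUTPUT.PUT_LINE('Acres: ' || acres);  -- Acres: 217
  END IF;
END;
/
```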
Longest Match vs Greediness
Greediness = each element matches as much as possible. For example:
SELECT REGEXP_SUBSTR(
'In the beginning','.+[[:space:]]')
FROM dual;
In the
Longest Match vs Greediness
Longest match = find the variation resulting in the greatest number of matching characters. Oracle does not support longest match, so the order of the alternatives matters:
SELECT REGEXP_SUBSTR('bbb','b|bb') FROM dual;
b
SELECT REGEXP_SUBSTR('bbb','bb|b') FROM dual;
bb
Optional Parameters
All but REGEXP_LIKE take optional parameters for starting position and occurrence:
REGEXP_INSTR (source, pattern, start, occurrence, return_option, match)
REGEXP_SUBSTR (source, pattern, start, occurrence, match)
REGEXP_REPLACE(source, pattern, replace, start, occurrence, match)
For example:
REGEXP_SUBSTR(description,'[^[:space:]]+',1,10)
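The effect of the start and occurrence arguments is easiest to see on a short literal. The sketch below uses an invented sample string and aliases. Note that in REGEXP_INSTR the fifth argument is a return option (0 returns the position where the match starts, 1 the position just after it), so its match parameter comes sixth.

```sql
-- Sample string positions: I=1 n=2 space=3 t=4 h=5 e=6 space=7 ...
SELECT REGEXP_SUBSTR ('In the beginning', '[^[:space:]]+', 1, 2)      AS second_word, -- the
       REGEXP_INSTR  ('In the beginning', '[^[:space:]]+', 1, 2)      AS starts_at,   -- 4
       REGEXP_INSTR  ('In the beginning', '[^[:space:]]+', 1, 2, 1)   AS ends_after,  -- 7
       REGEXP_REPLACE('In the beginning', '[^[:space:]]+', 'X', 1, 2) AS replaced     -- In X beginning
FROM dual;
```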
Match Parameter
All functions take an optional match parameter:
– Is matching case sensitive? ('i' or 'c')
– Does period (.) match newlines? ('n')
– Is the source string one line or many? ('m')
The match parameter comes last
Case-sensitivity
Case-insensitive search:
SELECT *
FROM park
WHERE REGEXP_LIKE(
description,
'[0-9]+(-| )acre',
'i');
Newline matching
INSERT INTO park VALUES ('Park 6',
'640' || CHR(10) || 'ACRE');
SELECT *
FROM park
WHERE REGEXP_LIKE(
description,
'[0-9]+.acre',
'in');
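String anchors
INSERT INTO employee (surname)
VALUES ('Ellison' || CHR(10) ||
'Gennick');
SELECT * FROM employee
WHERE REGEXP_LIKE(
surname,'^Ellison');
Yes! (^ matches at the start of the string)
String anchors
SELECT * FROM employee
WHERE REGEXP_LIKE(
surname,'^Gennick');
No! (by default the source is one line, so ^ does not match after the newline)
String anchors
SELECT * FROM employee
WHERE REGEXP_LIKE(
surname,'^Gennick','m');
Yes! ('m' treats the source as multiple lines, so ^ matches at the start of each line)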
Locale Support
Full Locale Support

– All character sets
– All languages
Case and accent insensitive searching
Linguistic range
Character classes
Collation elements
Equivalence classes
Character Sets and Languages
For example, you can search for Ukrainian names beginning with Ґ and ending with к:
SELECT *
FROM employee
WHERE REGEXP_LIKE(
surname,
'^Ґ[[:alpha:]]*к$','n');
Case- and Accent-Insensitive Searching
Respect for NLS settings:
ALTER SESSION
SET NLS_SORT = GENERIC_BASELETTER;
With this sort, case won't matter and an
expression such as:
REGEXP_INSTR(x,'resume')
will find "resume", "résumé", "Résume", etc.
Linguistic Range
Ranges respect NLS_SORT settings. The range [a-z] matches:
NLS_SORT=GERMAN a,b,c…z
NLS_SORT=GERMAN_CI a,A,b,B,c,C…z,Z
Character Classes
Character classes such as [:alpha:] and [:digit:] encompass more than just Latin characters.
For example, [:digit:] matches:
– Latin 0 through 9
– Arabic-Indic ٠ through ٩
– And more
Collation Elements
ALTER SESSION SET NLS_SORT=XSPANISH;
SELECT REGEXP_SUBSTR(
'El caballo, Chico come la tortilla.',
'[[:alpha:]]*[ch][[:alpha:]]*',
1,1,'i')
FROM dual;
caballo
Collation Elements
ALTER SESSION SET NLS_SORT=XSPANISH;
SELECT REGEXP_SUBSTR(
'El caballo, Chico come la tortilla.',
'[[:alpha:]]*[[.ch.]][[:alpha:]]*',
1,1,'i')
FROM dual;
Chico
Equivalence Classes
Ignore case and accents without changing NLS_SORT:
REGEXP_INSTR(x,'r[[=e=]]sum[[=e=]]')
Finds 'resume', 'résumé', and 'rEsumE'
Conclusion
String searching and manipulation is at the heart of a great many applications
Oracle Regular Expressions provide versatile string manipulation in the database instead of externalized in middle tier logic
They are locale sensitive and support character large objects
Available in both SQL and PL/SQL
Next Steps…
Recommended sessions
– Session #40088 New SQL Capabilities
– Session #40202 Oracle HTML DB
Recommended demos and/or hands-on labs
– Database Globalization Pod R
See Your Business in Our Software
– Visit the DEMOgrounds for a customized architectural review, see a customized demo with Solutions Factory, or receive a personalized proposal. Visit the DEMOgrounds for more information.
Relevant web sites to visit for more information
– http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html
Shameless Plug
Oracle Regular Expressions Pocket Reference
Jonathan Gennick & Peter Linsley
Free! At the O'Reilly & Associates Booth
