
Session id: 40105

Introducing
Oracle Regular Expressions
Jonathan Gennick, O'Reilly & Associates
Peter Linsley, Oracle Corporation
What are Regular
Expressions?
• A language, or syntax, you can use to describe patterns in text
• Example: [0-9]{3}-[0-9]{4}
• That which you can describe, you can find and manipulate
• Unix ed, grep, perl, and now everywhere!
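As a minimal sketch of the example pattern in use, the query below tests a made-up string (not from the presentation) for three digits, a hyphen, and four digits. It uses REGEXP_LIKE, one of the Oracle Database 10g functions covered later in this deck.

```sql
-- Hypothetical sample text; the pattern is the one from the slide.
-- Returns one row because '555-1234' fits [0-9]{3}-[0-9]{4}.
SELECT 'matches' AS result
FROM dual
WHERE REGEXP_LIKE('Call 555-1234 today', '[0-9]{3}-[0-9]{4}');
```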
What are Regular
Expressions?
• Follow this script to build the database and table
– CREATE DATABASE RE;
– CREATE TABLE RE (DESCRIPTION VARCHAR2(6));
– INSERT INTO RE VALUES ('652');
– INSERT INTO RE VALUES ('217');
– INSERT INTO RE VALUES ('113');
Why Describe Patterns?

Humans have long worked with patterns:


– Postal and email addresses
– URLs
– Phone numbers
Often it’s not the data that’s important, but the
pattern:
– Bioinformatics
– Validate format of URLs and email addresses
– Correct formatting of phone numbers
Pre-Oracle Database 10g

Find parks with acreage in their descriptions:

SELECT *
FROM park
WHERE description LIKE '%acre%';

Finds '217-acre' and '27 acres', but also 'few acres',
'more acres than all other parks', 'the location of a
massacre', etc.
Pre-Oracle Database 10g cont.
Pattern matching with LIKE
– Limited to only two operators: % and _
OWA_PATTERN
– No support for alternation, ASCII only, relatively
poor performance
Non-native solutions
– External Procedures
– Difficult to deploy, maintain, and support
Client based solutions
– Pull all that data down across the network
Oracle Database 10g

Four regular expression functions:
– REGEXP_LIKE: does the pattern match?
– REGEXP_INSTR: where does it match?
– REGEXP_SUBSTR: what does it match?
– REGEXP_REPLACE: replace what matched
POSIX Extended Regular Expressions
– UNIX Regular Expressions
– Backreference support added
– Longest match not supported
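To see the four functions side by side, here is a small sketch that applies each of them to one literal string. The string and the replacement text 'large' are illustrative assumptions; the pattern is the acreage pattern used throughout these slides.

```sql
-- All four functions against the same hypothetical input.
SELECT REGEXP_INSTR('This 217-acre park', '[0-9]+(-| )acre')            AS match_position, -- 6
       REGEXP_SUBSTR('This 217-acre park', '[0-9]+(-| )acre')           AS matched_text,   -- 217-acre
       REGEXP_REPLACE('This 217-acre park', '[0-9]+(-| )acre', 'large') AS replaced_text   -- This large park
FROM dual
WHERE REGEXP_LIKE('This 217-acre park', '[0-9]+(-| )acre');
```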
REGEXP_LIKE

Determine whether a pattern exists in a string


Revisiting the acreage problem:
SELECT *
FROM park
WHERE REGEXP_LIKE(description,
'[0-9]+(-| )acre');
Finds '217-acre' and '27 acres'
Rejects 'few acres', 'more acres than all
other parks', 'the location of a massacre', etc.
Useful for Constraints

Filter allowable data with check constraint


Only allow alphabetical characters:
CREATE TABLE t1 (c1 VARCHAR2(20),
CHECK (REGEXP_LIKE(c1,
'^[[:alpha:]]+$')));

INSERT INTO t1 VALUES ('newuser');
• 1 row created.

INSERT INTO t1 VALUES ('newuser1');
• ORA-02290: check constraint violated
Metacharacters
Operator   Description
.          match any character
a?         match 'a' zero or one time
a*         match 'a' zero or more times
a+         match 'a' one or more times
a|b        match either 'a' or 'b'
a{m,n}     match 'a' between m and n times
[abc]      match either 'a' or 'b' or 'c'
(abc)      match group 'abc'
\n         match nth group
[:cc:]     match character class
[.ce.]     match collation element
[=ec=]     match equivalence class
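A few of these operators in action, as a rough sketch. The input strings are invented for illustration; each result shown in a comment follows from the operator descriptions above.

```sql
SELECT REGEXP_SUBSTR('aaaa', 'a{2,3}')                         AS interval_match, -- aaa   (between 2 and 3 times, as many as possible)
       REGEXP_SUBSTR('grey or gray', 'gr(a|e)y')               AS alternation,    -- grey  (first match from the left)
       REGEXP_SUBSTR('It is is a test', '([[:alpha:]]+) \1')   AS doubled_word    -- is is (\1 repeats what group 1 matched)
FROM dual;
```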
REGEXP_INSTR
Find out where a match occurs:

SELECT REGEXP_INSTR(description,
'[0-9]+(-| )acre')
FROM park;

REGEXP_INSTR(DESCRIPTION,'[0-9]+…
---------------------------------
6
20
0

REGEXP_SUBSTR
Determine what text matched:

SELECT REGEXP_SUBSTR(description,
'[0-9]+(-| )acre')
FROM park;

REGEXP_SUBSTR(DESCRIPT
----------------------
217-acre
27 acre

REGEXP_SUBSTR Cont
 To extract just the acreage value:

SELECT REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+')
FROM park;

REGEXP_SUBSTR(REGEXP
--------------------
217
27
REGEXP_REPLACE

Convert acres to hectares:


UPDATE park
SET description = REGEXP_REPLACE(
description,'([0-9]+)(-| )acre',
TO_CHAR(0.4047 * TO_NUMBER(
REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+')))
|| '\2' || 'hectare');
REGEXP_REPLACE Cont.
This 217-acre park is wonderful.
217-acre
217
217 * 0.4047 = 87.8199
87.8199\2hectare
87.8199-hectare
This 87.8199-hectare park is wonderful.

UPDATE park
SET description = REGEXP_REPLACE(
description,'([0-9]+)(-| )acre',
TO_CHAR(0.4047 * TO_NUMBER(
REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+')))
|| '\2' || 'hectare');
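One point the walkthrough relies on: backreferences such as \1 and \2 are filled in by REGEXP_REPLACE itself, after the replacement string has already been built as an ordinary SQL expression. That is presumably why the UPDATE extracts the number with nested REGEXP_SUBSTR calls rather than doing arithmetic on '\1'. The sketch below, on a literal string, shows backreferences alone; it relabels the unit without converting the value.

```sql
-- \1 is the digits, \2 is the separator (hyphen or space).
SELECT REGEXP_REPLACE('This 217-acre park is wonderful.',
                      '([0-9]+)(-| )acre',
                      '\1\2hectare') AS relabeled
FROM dual;
-- This 217-hectare park is wonderful.
```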
D E M O N S T R A T I O N

Oracle Regular
Expressions
Performance

Pattern matching can be complex


– Need to compile to state machine
– Lex and parse
– Examine all possible branches until match found
Compiled once per statement
– Can be faster than LIKE for complex scenarios
– Usually faster than PL/SQL equivalent
– ZIP code checking: 5 times faster
Performance Cont.

Some poorly-performing expressions:


– 'a{2}' will be slower than 'aa'
– '.*b' on input that doesn't contain a 'b' can
also be quite time-consuming

Further reading: Mastering Regular Expressions
by Jeffrey Friedl,
Chapter 6, "Crafting an Efficient Expression"


Using with Indexes

Use function-based indexes:


CREATE INDEX acre_ind
ON park (REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+'));
To support regular expression queries:
SELECT * FROM park
WHERE REGEXP_SUBSTR(REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+') = '217';
Using with Views

Hide the complexity from users:


CREATE VIEW park_acreage as
SELECT park_name,
REGEXP_SUBSTR(
REGEXP_SUBSTR(
description,
'[0-9]+(-| )acre'),
'[0-9]+') acreage
FROM park;
Using with PL/SQL

REGEXP_LIKE acts as a Boolean function in PL/SQL:
IF REGEXP_LIKE(description,
'[0-9]+(-| )acre') THEN
acres := REGEXP_SUBSTR(
REGEXP_SUBSTR(description,
'[0-9]+(-| )acre'),'[0-9]+');
...
All other functions act identically in PL/SQL
and SQL.
Longest Match vs Greediness

Greediness = each element matches as much as possible.
For example:

SELECT REGEXP_SUBSTR(
'In the beginning','.+[[:space:]]')
FROM dual;
 In the
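To stop at the first space instead, one common approach is to replace the greedy .+ with a negated character class, so the element cannot run past a space in the first place. A small sketch comparing the two:

```sql
SELECT REGEXP_SUBSTR('In the beginning', '.+[[:space:]]')           AS greedy,     -- 'In the ' (runs to the last space)
       REGEXP_SUBSTR('In the beginning', '[^[:space:]]+[[:space:]]') AS first_word -- 'In '     (cannot cross a space)
FROM dual;
```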
Longest Match vs Greediness

Longest match = find the variation resulting
in the greatest number of matching characters.
Oracle does not do this; it takes the first
alternative that matches:
• SELECT REGEXP_SUBSTR('bbb','b|bb') FROM dual;
• b
• SELECT REGEXP_SUBSTR('bbb','bb|b') FROM dual;
• bb
Optional Parameters

All but REGEXP_LIKE take optional parameters
for starting position and occurrence:
REGEXP_INSTR (source, pattern, start, occurrence, return_option, match)
REGEXP_SUBSTR (source, pattern, start, occurrence, match)
REGEXP_REPLACE(source, pattern, replace, start, occurrence, match)

For example, to return the tenth word:
REGEXP_SUBSTR(description,'[^[:space:]]+',1,10)
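A short sketch of the start and occurrence parameters on a literal string (the sample sentence is the one used in the REGEXP_REPLACE walkthrough). Note that REGEXP_INSTR reports positions counted from the beginning of the source, even when the search starts later.

```sql
SELECT REGEXP_SUBSTR('This 217-acre park is wonderful.', '[^[:space:]]+', 1, 3) AS third_word, -- park
       REGEXP_INSTR('This 217-acre park is wonderful.', '[^[:space:]]+', 6, 2)  AS word_pos    -- 15 (second word found from position 6)
FROM dual;
```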
Match Parameter

All functions take an optional match parameter:
– Is matching case sensitive?
– Does period (.) match newlines?
– Is the source string one line or many?
The match parameter comes last
Case-sensitivity

Case-insensitive search:
SELECT *
FROM park
WHERE REGEXP_LIKE(
description,
'[0-9]+(-| )acre',
'i');
Newline matching

INSERT INTO park VALUES ('Park 6',


'640' || CHR(10) || 'ACRE');

SELECT *
FROM park
WHERE REGEXP_LIKE(
description,
'[0-9]+.acre',
'in');
String anchors

INSERT INTO employee (surname)


VALUES ('Ellison' || CHR(10) ||
'Gennick');

SELECT * FROM
EMPLOYEE
WHERE REGEXP_LIKE(
surname,'^Ellison');
Yes!
String anchors

INSERT INTO employee (surname)


VALUES ('Ellison' || CHR(10) ||
'Gennick');

SELECT * FROM
EMPLOYEE
WHERE REGEXP_LIKE(
surname,'^Gennick');
No!
String anchors

INSERT INTO employee (surname)


VALUES ('Ellison' || CHR(10) ||
'Gennick');

SELECT * FROM
EMPLOYEE
WHERE REGEXP_LIKE(
surname,'^Gennick','m');
Yes!
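The same anchor behavior can be checked without an employee table. This sketch uses REGEXP_INSTR on the two-line literal from the slides; a result of 0 means no match.

```sql
-- 'Ellison' is 7 characters, the newline is character 8, so 'Gennick' starts at 9.
SELECT REGEXP_INSTR('Ellison' || CHR(10) || 'Gennick', '^Gennick')                AS single_line, -- 0 (^ anchors only to the start of the string)
       REGEXP_INSTR('Ellison' || CHR(10) || 'Gennick', '^Gennick', 1, 1, 0, 'm') AS multi_line   -- 9 (^ also anchors after each newline)
FROM dual;
```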
Locale Support

Full Locale Support


– All character sets
– All languages
Case and accent insensitive searching
Linguistic range
Character classes
Collation elements
Equivalence classes
Character Sets and Languages

For example, you can search for Ukrainian
names beginning with Ґ and ending with к:
SELECT *
FROM employee
WHERE REGEXP_LIKE(
surname,
'^Ґ[[:alpha:]]*к$','n');
Case- and Accent-Insensitive
Searching
Respect for NLS settings:
ALTER SESSION
SET NLS_SORT = GENERIC_BASELETTER;
With this sort, case won't matter and an
expression such as:
REGEXP_INSTR(x,'resume')
will find "resume", "résumé", "Résume", etc.
Linguistic Range

Ranges respect NLS_SORT settings. What [a-z] covers:

NLS_SORT=GERMAN      a,b,c…z
NLS_SORT=GERMAN_CI   a,A,b,B,c,C…z,Z
Character Classes

Character classes such as [:alpha:] and
[:digit:] encompass more than just Latin
characters.
For example, [:digit:] matches:
– Latin 0 through 9
– Arabic-Indic ٠ through ٩
– And more
Collation Elements

ALTER SESSION SET NLS_SORT=XSPANISH;


SELECT REGEXP_SUBSTR(
'El caballo, Chico come la tortilla.',
'[[:alpha:]]*[ch][[:alpha:]]*',
1,1,'i')
FROM dual;

caballo
Collation Elements

ALTER SESSION SET NLS_SORT=XSPANISH;


SELECT REGEXP_SUBSTR(
'El caballo, Chico come la tortilla.',
'[[:alpha:]]*[[.ch.]][[:alpha:]]*',
1,1,'i')
FROM dual;

Chico
Equivalence Classes

Ignore case and accents without changing
NLS_SORT:
REGEXP_INSTR(x,'r[[=e=]]sum[[=e=]]')
Finds 'resume', 'résumé', and 'rEsumE'
Conclusion

String searching and manipulation is at the
heart of a great many applications
Oracle Regular Expressions provide versatile
string manipulation in the database instead of
externalized in middle-tier logic
They are locale-sensitive and support
character large objects
Available in both SQL and PL/SQL
Next Steps….
 Recommended sessions
– Session #40088 New SQL Capabilities
– Session #40202 Oracle HTML DB
 Recommended demos and/or hands-on labs
– Database Globalization Pod R
 See Your Business in Our Software
– Visit the DEMOgrounds for a customized architectural review, see
a customized demo with Solutions Factory, or receive a
personalized proposal. Visit the DEMOgrounds for more
information.
• Relevant web sites to visit for more information
– http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html
Shameless Plug

Oracle Regular Expressions


Pocket Reference

Jonathan Gennick
& Peter Linsley

Free! At the O'Reilly &
Associates Booth
