Java Regular Expression Final


Regular Expression

Navaneethan S
IT VAC Team
Email Validation

• Let’s say you want to verify an email address in the form given below without
regular expressions:

firstname_lastname@somewhere.org

• Check for an “@” sign
• Check whether the input string ends with “.org”
• Check for an underscore with letters before and after it

This becomes very complex very quickly using String methods.
Email Validation – Without RegEx
Email Validation – With RegEx

Program seems
less complex.

Regular expressions allow you to exactly define matching patterns


Patterns describe text rather than specifying it
Objective

• To know what a regular expression is.
• To be able to write regular expressions all by yourself to solve
everyday problems.
What is Regular Expression ?

• What is it?
• “Regular Expressions are a way to search for patterns within data sets”.

Data sets are everywhere (look around you…):
• Sales data
• Telephone directory
• Database tables
• HTML source code of a web page
• A set of computer programs
Once you have your data set, what’s next ?

• Once a data set is available, the next task is to extract the specific
data that you need from it.
• Example:

Telephone Directory (data set) → Extraction process (looking for specific data) → Specific Data

The extracted specific data is what you actually need!


Email Validation – With RegEx

The program seems less complex: it is looking for a specific pattern in the given email.

Regular expressions allow you to exactly define matching patterns


Patterns describe text rather than specifying it
Is this process easy or complicated?

It’s EASY, but only with the power of RegEx.

Telephone Directory (data set) → Looking for specific data (match and extract if needed) → Specific Data

The extracted specific data is what you actually need!


Need for Regular Expression

• Looking for specific data in a huge data set is a tedious process.
• The traditional “if” control statement will not serve you well.
• You need to be equipped with better tools to wade through the data in an effective
and efficient way.
• This is where regular expressions can be very useful.

A regular expression is not a programming language in itself; rather,
it is a feature supported by all major programming languages.
What is ‘regex’ ?

• RegEx is an abbreviation for Regular Expression.


• Places to find regex ?

[Linux Users]
“The true power of the Linux command line is unleashed only if you
supplement it with regex.”
Regular Expression

“A Regular Expression, regex or regexp is a sequence of characters that
define a search pattern” - Wikipedia
Use Cases For Regular Expressions – Google SignUp
Regex Engine

Something validates what we have typed in against a pattern,
which might look something like this:

^\w+@[A-Za-z_]+?\.[A-Za-z]{2,3}$
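As a minimal sketch of how such a check could look with the Java regex engine, the pattern above can be compiled once and applied to the whole input. The class name EmailCheck and the method isValid are made up for illustration; only the pattern comes from the slide.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // Pattern copied from the slide. Backslashes are doubled because
    // this is a Java string literal.
    private static final Pattern EMAIL =
            Pattern.compile("^\\w+@[A-Za-z_]+?\\.[A-Za-z]{2,3}$");

    // matches() succeeds only if the entire input fits the pattern.
    static boolean isValid(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("firstname_lastname@somewhere.org")); // true
        System.out.println(isValid("no-at-sign.org"));                   // false
    }
}
```

Note that this pattern only accepts a top-level domain of 2 or 3 letters, so it is a teaching example rather than a complete email validator.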
Use Case for Regex
Not only programming languages but also some text editors (e.g. Notepad++) come with a
regex engine.
Open the CSV file below in Notepad++ and do the following.
(Note: Notepad doesn’t support regex, but Notepad++ does.)

VideogameSalesData2019.csv file
Drive Location : https://drive.google.com/drive/folders/0ACCIIBaRrWySUk9PVA

Select all game records which were developed by PUBG Corporation. - 7 Hits
Search for all the records containing the year 2019. - 768 Hits
Search Keyword : PUBG, 2019
(Search Mode : Normal)
Use Case for Regex
Now take a complex scenario, one that is hard to handle without regex.

I want all the game records which were developed by “PUBG” or “Nintendo”, and the year must be either “2018” or
“2019”.

How do I frame my search term? I cannot come up with a search term using normal mode.

Let’s unleash the power of regex.
Change the search mode to regular expression mode.
Now use the regular expression below as the search keyword in the Ctrl + F window.

.*(PUBG|Nintendo).*201(8|9).*
Total Hits : 63 Hits
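The same search keyword can be tried in Java. The sketch below assumes each record is one line of text with the developer appearing before the year; the sample records are invented for illustration and are not taken from the CSV file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class GameRecordFilter {
    // Search keyword from the slide, unchanged.
    private static final Pattern KEYWORD =
            Pattern.compile(".*(PUBG|Nintendo).*201(8|9).*");

    // Keeps only the lines that the whole pattern matches.
    static List<String> filter(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (KEYWORD.matcher(line).matches()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical records; the real file's columns may differ.
        List<String> lines = Arrays.asList(
                "Game A,PUBG Corporation,2018",
                "Game B,Nintendo,2019",
                "Game C,Nintendo,2017",
                "Game D,Other Studio,2019");
        System.out.println(filter(lines)); // keeps Game A and Game B
    }
}
```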
A Simple RegEx

Input strings:
fooaaaabar
fooabar
foobar
fooaabar
fooxxxbar
fooxbar

Matches:
fooaaaabar (foo aaaa bar)
fooabar (foo a bar)
foobar (foo bar)
fooaabar (foo aa bar)

RegEx pattern: fooa*bar
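A quick way to try this pattern in Java is the static helper Pattern.matches, which tests the whole input against a regex. This is only a sketch; the class name StarDemo is made up.

```java
import java.util.regex.Pattern;

class StarDemo {
    // a* means zero or more 'a' characters between "foo" and "bar".
    static boolean matches(String input) {
        return Pattern.matches("fooa*bar", input);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "fooaaaabar", "fooabar", "foobar", "fooaabar", "fooxxxbar", "fooxbar"
        };
        for (String s : inputs) {
            // The first four print true, the last two print false.
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```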
Sample Exercises

ExerciseInputFiles (1).zip JavaClasses.zip


Generic solution to any generic problem
Steps to build Regular Expression

Step 1: Understand the requirement: What needs to be included? What needs to be excluded?
Step 2: Identify the patterns in the inclusions list or the exclusions list.
Step 3: Represent the patterns using regular expressions.
Step 4: Use a regex engine like grep, Python or Java to apply the regex pattern on the input.
Hands-on with Java regex engine
FYIP

• All major programming languages like Java, Python and JavaScript, as well as
Linux commands like grep, sed, awk etc., come with a regex engine.

• Though regex engines share the same fundamental philosophy, the
complete feature set of each regex engine may vary.

• Almost all major regex engines support the core symbols defined by POSIX.

• POSIX stands for Portable Operating System Interface, and is an IEEE
standard.
Building a Foundation – Quiz 1
Test your knowledge on the fundamental concepts

What is the common abbreviation for Regular Expressions?


A) RE B) Rex C) Regex

What are some use cases of regular expressions?


A. Searching patterns through text files.
B. User input validation on web pages.
C. Searching through data sets .
D. All of the above.

What does the pattern a* stand for?


E. Zero or more occurrences of ‘a’
F. One or more occurrences of ‘a’
G. ‘a’ followed by anything
POSIX Standards

• Under POSIX standards, regular expressions are divided into 2 sets:
the BASIC set and the Extended set.


Regular Expressions – POSIX
BASIC SET

• The BASIC set comes with a set of symbols, each one having a
specific meaning and interpretation.

Symbol      What does it represent?
*           Zero or more occurrences of the character that precedes this asterisk
.           A wildcard that represents any single character
\s          Represents a whitespace character
[pqr]       A single character which can be either ‘p’, ‘q’ or ‘r’
[a-d]       A single character that falls in the range ‘a-d’, i.e. one of ‘a’, ‘b’, ‘c’ or ‘d’
[^pq]       A single character that is neither ‘p’ nor ‘q’
^pattern    ^ is an anchor that represents the beginning of the line
pattern$    $ is an anchor that represents the end of the line


BASIC Set – Wildcard Symbol

Input strings:
fooabar
fooxbar
baryfoo
foobar
fooxybar
foocbar

Matches:
fooabar (foo a bar)
fooxbar (foo x bar)
foocbar (foo c bar)

RegEx pattern: foo.bar

. is the wildcard symbol. It represents any single character.
Wildcard Asterisk Combo

Input strings:
foobar
barfoo
fooabcbar
foobxcbar
barcbyfoo
foozbar
barafoo
barabfoo

Matches:
foobar (foo bar)
fooabcbar (foo abc bar)
foobxcbar (foo bxc bar)
foozbar (foo z bar)

The number of letters and the letters themselves are unpredictable.

RegEx pattern: foo.*bar
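The difference between a single wildcard and the wildcard asterisk combo can be sketched in Java with String.matches, which tests the whole string. The class and method names below are made up for illustration.

```java
class WildcardDemo {
    // Exactly one character (any character) between "foo" and "bar".
    static boolean singleWildcard(String s) {
        return s.matches("foo.bar");
    }

    // Any number of characters, including none, between "foo" and "bar".
    static boolean wildcardStar(String s) {
        return s.matches("foo.*bar");
    }

    public static void main(String[] args) {
        System.out.println(singleWildcard("fooxbar"));   // true
        System.out.println(singleWildcard("foobar"));    // false, nothing between
        System.out.println(wildcardStar("foobar"));      // true
        System.out.println(wildcardStar("fooabcbar"));   // true
        System.out.println(wildcardStar("barfoo"));      // false
    }
}
```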
Representing WhiteSpaces

Input strings:
fooxxxbar
foo   bar (3 spaces)
fooxbar
fooxxbar
foo bar (1 space)
foo      bar (6 spaces)
foobar
fooyyybar

Matches:
foo <3 spaces> bar
foo <1 space> bar
foo <6 spaces> bar
foo <0 spaces> bar (foobar)

RegEx pattern: foo\s*bar

\s represents a single whitespace character
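In Java source code the backslash itself must be doubled inside a string literal, so \s is written as "\\s". A small sketch, with a made-up class name:

```java
class WhitespaceDemo {
    // "foo\\s*bar" in Java source is the regex foo\s*bar.
    static boolean matches(String s) {
        return s.matches("foo\\s*bar");
    }

    public static void main(String[] args) {
        System.out.println(matches("foo   bar")); // true, 3 spaces
        System.out.println(matches("foobar"));    // true, 0 spaces
        System.out.println(matches("foo\tbar"));  // true, a tab is whitespace too
        System.out.println(matches("fooxbar"));   // false
    }
}
```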


Character Class
Input strings:
foo
moo
coo
doo
poo
loo
boo
hoo

Matches:
foo (f oo)
coo (c oo)
loo (l oo)

RegEx pattern: [fcl]oo
(No spaces or commas between the letters.)

CHARACTER CLASS
Please note: it represents only one character.

Character classes are represented using square brackets.
Example: [abc] is a character class. It matches one of the characters inside the square brackets: a, b or c.
A character class is not a wildcard. You can’t just put anything in there; the character has to be either a, b or c.
Character Class - Quiz
Input strings:
foo
moo
coo
doo
poo
loo
boo
hoo

RegEx pattern: [fcdplb]oo
(No spaces or commas between the letters.)

If there are too many entries inside a character class, it starts to get
unmanageable.
In this example we have 6 valid cases.
Is there a better regex pattern that we can come up with, that is not as
lengthy?
Caret Symbol

Character Class - Quiz

^ is the caret symbol, also called the exponent operator.

Input strings:
foo
moo
coo
doo
poo
loo
boo
hoo

RegEx pattern: [^mh]oo

The caret symbol negates the class.
Example: [^abc] – what does this mean?
It represents any character other than ‘a’, ‘b’ or ‘c’.
Please note that it represents only a single character position.

So far we have come up with regular expressions using an inclusion list.
When the regex gets lengthy, we can also build it from the exclusion list to avoid a
lengthy regex pattern.
Character classes with ranges
Input strings:
joo
boo
koo
loo
woo
moo
zoo
coo

Matches:
joo (j oo)
koo (k oo)
loo (l oo)
moo (m oo)

RegEx pattern: [j-m]oo
Character classes with ranges
Input strings:
joo
boo
koo
loo
woo
moo
zoo
coo

Matches:
joo (j oo)
koo (k oo)
loo (l oo)
moo (m oo)
zoo (z oo)

RegEx pattern: [j-mz]oo

Example : [a-cx] – represents one of the characters falling in the range OR any of the other
choices given in the square brackets – a, b, c, x
Character classes with ranges
Input strings:
joo
boo
Koo
Loo
woo
moo
zoo
coo

Even though they look like they are in sequence, they are a
mix of lowercase and uppercase letters.

Matches:
joo (j oo)
Koo (K oo)
Loo (L oo)
moo (m oo)
zoo (z oo)

RegEx pattern: [j-mJ-Mz]oo (2 ranges)

Example : [a-cA-Cx] – represents one of the characters falling in one of the ranges OR any of the
other choices given in the square brackets – a, b, c, A, B, C, x
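The three kinds of character class (inclusion list, negation, ranges) can be tried side by side in Java. The helper below is a sketch with made-up names; it returns the words that a regex matches in full.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns the words that the regex matches in full.
    static List<String> keep(String regex, String... words) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (p.matcher(w).matches()) {
                kept.add(w);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] first = {"foo", "moo", "coo", "doo", "poo", "loo", "boo", "hoo"};
        String[] second = {"joo", "boo", "Koo", "Loo", "woo", "moo", "zoo", "coo"};
        System.out.println(keep("[fcl]oo", first));      // [foo, coo, loo]
        System.out.println(keep("[^mh]oo", first));      // [foo, coo, doo, poo, loo, boo]
        System.out.println(keep("[j-mJ-Mz]oo", second)); // [joo, Koo, Loo, moo, zoo]
    }
}
```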
Escaping with Backslash
Input strings:
xxx.yy
xx.yyyy
x.yy
xy
xxyy
yyxx
yx
yxxx

Matches:
xxx.yy (xxx . yy)
xx.yyyy (xx . yyyy)
x.yy (x . yy)

Requirement: occurrences of x, then a period, then occurrences of y. The period does not repeat.

FYIP:
The letters a, b, c or x, y, z are all literals. They don’t mean anything special to the
regex engine. But certain symbols like period, star, square brackets, etc. mean
something special to the regex engine.
What if these special symbols become part of our input string?
In this case the period symbol (.) is part of our input string.
Our regex engine treats the period symbol as a wildcard, which is not
what we want.
We need to escape this character with a backslash symbol.

Special symbols: ^ $ * . [ ( ) \
Escaping with Backslash

Whenever the regex engine sees a backslash symbol, it treats the special symbol that
immediately follows the backslash as a literal.

In other words, the backslash symbol is a way of escaping the symbol from being
interpreted as a special symbol.

The following characters should be escaped with a backslash, as these characters have
a special meaning otherwise:

^ $ * . [ ( ) \
Escaping with Backslash
Input strings:
xxx.yy
xx.yyyy
x.yy
xy
xxyy
yyxx
yx
yxxx

Matches:
xxx.yy (xxx . yy)
xx.yyyy (xx . yyyy)
x.yy (x . yy)

RegEx pattern: x*\.y* (x* then \. then y*)
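In Java the escaped period \. is written as "\\." inside a string literal. The sketch below, with a made-up class name, shows the escaped pattern next to the unescaped one to make the difference visible.

```java
class EscapeDemo {
    // Escaped: the period must literally appear between the x's and the y's.
    static boolean escaped(String s) {
        return s.matches("x*\\.y*");
    }

    // Unescaped: the period is a wildcard and accepts any single character.
    static boolean unescaped(String s) {
        return s.matches("x*.y*");
    }

    public static void main(String[] args) {
        System.out.println(escaped("xxx.yy"));   // true
        System.out.println(escaped("xy"));       // false, no period
        System.out.println(unescaped("xy"));     // true, which is not what we want
    }
}
```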
Escaping with Backslash

Input strings:
x#y
x:y
x.y
x&y
x%y

Matches:
x#y (x # y)
x:y (x : y)
x.y (x . y)

RegEx pattern: x[.:#]y

# - The pound symbol does not have a special meaning in the regex engine.
Likewise, the colon symbol (:) does not have a special meaning either.
But the period symbol (.) does.
Note: A period outside the square brackets represents any single character, but it does not have
any special meaning inside the square brackets. It is simply treated as a literal. (This is the reason why we have
not put a backslash before the period symbol.)
Escaping with Backslash

Input strings:
x#y
x:y
x^y
x&y
x%y

Matches:
x#y (x # y)
x:y (x : y)
x^y (x ^ y)

RegEx pattern: x[#:\^]y

# - The pound symbol does not have a special meaning in the regex engine.
Likewise, the colon symbol (:) does not have a special meaning either.
Note: A caret symbol (^) inside the square brackets
has a special meaning. It is used for negating a
character class. (This is the reason why we have
put a backslash before the caret symbol.)
Important Points

• If a character has a special meaning inside the
character class [ ] (square brackets), it is mandatory to escape it
with a backslash.
• Example 1:
The ^ caret symbol inside [ ] brackets has the special meaning of negating
the character class, so we should escape it with a backslash.
• Example 2:
The . period symbol inside [ ] brackets has no special meaning, and
hence there is no need to escape it with a
backslash.
Regex Pattern - Quiz 2

Input strings:
x#y
x\y
x^y
x&y
x%y

Matches:
x#y (x # y)
x\y (x \ y)
x^y (x ^ y)

RegEx pattern: x[#\\\^]y
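This pattern needs extra care in Java, because every backslash of the regex must itself be doubled inside a Java string literal. The regex x[#\\\^]y has three backslashes, so the Java literal has six. A sketch with a made-up class name:

```java
class BackslashDemo {
    // Regex:        x[#\\\^]y
    // Java literal: "x[#\\\\\\^]y"  (each regex backslash is doubled)
    static boolean matches(String s) {
        return s.matches("x[#\\\\\\^]y");
    }

    public static void main(String[] args) {
        System.out.println(matches("x#y"));   // true
        System.out.println(matches("x\\y"));  // true, the input is x\y
        System.out.println(matches("x^y"));   // true
        System.out.println(matches("x&y"));   // false
        System.out.println(matches("x%y"));   // false
    }
}
```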
Anchors

Input strings:
foo bar baz
bar foo baz
baz foo bar
bar baz foo
foo baz bar
baz bar foo

Matches:
foo bar baz
foo baz bar

RegEx pattern: ^foo.*

If you are using the caret as a placeholder, then it should always be the
first thing in your regular expression string.

^ is a placeholder that signifies the beginning of a line. The interpretation of ^ differs
within square brackets and outside of them. Inside square brackets, ^ stands for negation.
Outside, it is a placeholder for the beginning of a line.
Anchors

Input strings:
foo bar baz
bar foo baz
baz foo bar
bar baz foo
foo baz bar
baz bar foo

Matches:
baz foo bar
foo baz bar

RegEx pattern: .*bar$

The $ placeholder always matches only the end of the line and not anywhere else.

$ is a placeholder that signifies the end of the line.


Anchors

The pattern we are looking for should be “foo”, where:
There should be nothing in front of foo.
There should be nothing after foo.

Input strings:
foo
foo bar
baz foo
foo bar baz
baz bar foo

Matches:
foo

RegEx pattern: ^foo$

If you observe the matching line, you’ll find that it contains only foo and nothing else.
The non-matching lines also contain “foo”, but not on its own. Some have something after “foo”; some have something before it.
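The effect of the anchors is easiest to see in Java with Matcher.find(), which searches anywhere in the input unless an anchor pins the match down. A small sketch with made-up names:

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // find() looks for the pattern anywhere in the line;
    // the anchors ^ and $ restrict where it may start or end.
    static boolean found(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(found("foo", "baz foo"));      // true, no anchors
        System.out.println(found("^foo", "foo bar baz")); // true
        System.out.println(found("^foo", "bar foo baz")); // false
        System.out.println(found("bar$", "baz foo bar")); // true
        System.out.println(found("^foo$", "foo"));        // true
        System.out.println(found("^foo$", "foo bar"));    // false
    }
}
```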
Quiz on Basic Set

• Which of the following regexes can be used to represent both the
strings ‘grey’ and ‘gray’?
A. gr[ae]y most specific answer
B. gr.y
C. gr[a-z]y
D. All of the above
Quiz on Basic Set
• Which of the following represents a two-digit even number?
A. ^[0-9][02468]$
B. ^[2468][2468]$
C. ^[0-9][0-9]$
D. All of the above

• Which of the following represents three digit numbers that are multiples of 5?
A. ^[0-9][0-9][05]$
B. ^[0-9][0-9][0-9]$
C. ^[0-9]*$
D. ^[0-9].[0-9]$
POSIX Standard – Extended Set

Just like the basic set, the extended set also comes with a set of symbols,
each one having a specific meaning and interpretation.

Symbol          What does it represent?
+               One or more occurrences of the character that precedes this + symbol
?               Zero or one occurrence of the character that precedes this question mark
pat1|pat2       Matches either pattern 1 or pattern 2
()              Divides patterns into groups
{m}             Exactly m occurrences of whatever precedes
{m,n}           At least m and at most n occurrences of whatever precedes


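Only one of m, n is mandatory. The other can be left blank. (Note: Java accepts a blank n, as in {m,}, but not a blank m; the form {,n} works in some other tools such as GNU grep. In Java, write {0,n} instead.)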
Curly Braces Repeater

Input strings:
834
519
4874
5
89
45687
25
645

Matches:
834
519
645

RegEx pattern: ^[0-9][0-9][0-9]$

A digit can be any character from 0 to 9.
So a digit can be represented by a character class with the range 0 to 9.
We also have a line-beginning anchor and a line-end anchor at the left and right.
Why do we need these?
The reason is that we do not want matches done against a part of the string,
i.e. we do not want to match against substrings. Let’s discuss this with an example.
Curly Braces Repeater
Let’s take the above number 45687.
Take the middle 3 characters as a substring: 5-6-8.
They form a 3-digit number.

Just because the string contains a three-digit number somewhere in between, we don’t want to identify it as a positive
match.
We are only interested in matching the whole string. In order to ensure this, we put an anchor at both ends.
This way the match is run only against the whole string and not against its substrings.

We do so because the situation requires it.
Not every situation requires anchors at both ends.
Example:
We might want to search for a pattern anywhere in the string. In that case, the anchors should not be
provided.
Curly Braces Repeater

Input strings:
834
519
4874
5
89
45687
25
645

Matches:
834
519
645

RegEx pattern: ^[0-9][0-9][0-9]$ becomes ^[0-9]{3}$

For the problem above, what would the pattern look like if we wanted to represent a 10-digit number?
It would be cumbersome to write the character class 10 times over.
We need a more compact way to represent this.
This leads us to the next regex symbol, the repeater.
It is represented by opening and closing curly braces with a number in between.
This number signifies the number of repetitions.

a{m} represents exactly ‘m’ repetitions of whatever immediately precedes it, i.e. ‘a’.
You might think you can do this with the asterisk symbol ‘*’. Beware: the limitation of the asterisk is
that you can’t represent an exact number of repetitions with it.
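The role of the anchors together with the {3} repeater can be sketched in Java as follows. Without anchors, find() is satisfied by three digits anywhere inside the string; with anchors, the whole string must be exactly three digits. The class and method names are made up.

```java
import java.util.regex.Pattern;

class ThreeDigits {
    private static final Pattern UNANCHORED = Pattern.compile("[0-9]{3}");
    private static final Pattern ANCHORED = Pattern.compile("^[0-9]{3}$");

    // True if three consecutive digits occur anywhere in the input.
    static boolean containsThreeDigits(String s) {
        return UNANCHORED.matcher(s).find();
    }

    // True only if the input as a whole is a three-digit number.
    static boolean isThreeDigits(String s) {
        return ANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsThreeDigits("45687")); // true, e.g. 456 inside
        System.out.println(isThreeDigits("45687"));       // false
        System.out.println(isThreeDigits("834"));         // true
        System.out.println(isThreeDigits("89"));          // false
    }
}
```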
Curly braces Repeater

Input strings:
lion
tiger
leopard
fox
kangaroo
bat
mouse
cuckoo
deer

Matches:
lion (4 letters, ^[a-z]{4}$)
tiger (5 letters, ^[a-z]{5}$)
mouse (5 letters, ^[a-z]{5}$)
cuckoo (6 letters, ^[a-z]{6}$)
deer (4 letters, ^[a-z]{4}$)

RegEx pattern: ^[a-z]{4,6}$

a{m,n} represents at least ‘m’ and at most ‘n’ repetitions of whatever immediately
precedes it, i.e. ‘a’.
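A short Java sketch of the {m,n} repeater on the same animal names (the class name is made up):

```java
class LengthRangeDemo {
    // Accepts lowercase words that are 4 to 6 letters long.
    static boolean fits(String word) {
        return word.matches("^[a-z]{4,6}$");
    }

    public static void main(String[] args) {
        String[] words = {
            "lion", "tiger", "leopard", "fox", "kangaroo",
            "bat", "mouse", "cuckoo", "deer"
        };
        for (String w : words) {
            // true for lion, tiger, mouse, cuckoo and deer; false for the rest
            System.out.println(w + " -> " + fits(w));
        }
    }
}
```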
Single Ended Curly Braces Repeater

Input strings:
ha
hahahahaha
hahaha
hahahaha
haha
hahahahahaha
hahahahahahahaha
hahahahahahahahaha

Matches:
hahahahaha (ha){5}
hahahaha (ha){4}
hahahahahaha (ha){6}
hahahahahahahaha (ha){8}
hahahahahahahahaha (ha){9}

RegEx pattern: (ha){4,}

If we had not used the () parentheses, then the repetition count inside the curly braces
would be applied only to ‘a’.

Parentheses are used to group characters and treat them as a single entity.
{m,} represents at least ‘m’ repetitions of whatever immediately
precedes it.
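The effect of the parentheses can be checked in Java by comparing the grouped and the ungrouped pattern. This is a sketch with made-up names.

```java
class GroupRepeatDemo {
    // The group (ha) is repeated at least 4 times.
    static boolean grouped(String s) {
        return s.matches("(ha){4,}");
    }

    // Without parentheses only the 'a' is repeated: h followed by 4 or more a's.
    static boolean ungrouped(String s) {
        return s.matches("ha{4,}");
    }

    public static void main(String[] args) {
        System.out.println(grouped("hahahaha"));   // true, 4 times "ha"
        System.out.println(grouped("hahaha"));     // false, only 3 times
        System.out.println(ungrouped("hahahaha")); // false
        System.out.println(ungrouped("haaaa"));    // true
    }
}
```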
Single Ended Curly Braces Repeater

Input strings:
ha
haha
hahaha
hahahahaha
hahahaha
hahahahahahaha
hahahahahaha

Matches:
ha (ha){1}
haha (ha){2}

Non-matches:
hahaha ^(ha){3}$
hahahaha ^(ha){4}$
hahahahaha ^(ha){5}$
hahahahahaha ^(ha){6}$
hahahahahahaha ^(ha){7}$

RegEx pattern: (ha){1,2}, written with anchors as ^(ha){,2}$

Why anchor at both ends here?
Single Ended Curly Braces Repeater
The reason why the anchors at both ends are here…

RegEx pattern: ^(ha){,2}$

Let’s take an example from above:

hahahaha

The above string can be split into two equal parts, each part having 2 ha’s.
If we had not used the anchors, the match would be considered positive for any string that contains
2 or fewer ha combinations somewhere inside it, even though the original string has 4 ha combinations.
That’s why we place an anchor at both ends, so that the engine cannot take bits and pieces from the middle of the
string and run its match against that piece.
The Plus Repeater

Input strings:
fooaaaabar
fooabar
foobar
fooaabar
fooxxxbar
fooxbar

Matches:
fooaaaabar (foo aaaa bar)
fooabar (foo a bar)
fooaabar (foo aa bar)

RegEx pattern: fooa+bar (not fooa*bar)

Note: + denotes one or more occurrences.

Since * means zero or more occurrences of ‘a’, we can’t use it here: zero
occurrences of ‘a’ would match “foobar”, which shouldn’t be matched.
The Question Mark Binary

Input strings:
https://website
http://website
httpss://website
httpx://website
httpxx://website

Matches:
https://website (http s ://website)
http://website (http ://website)

RegEx pattern: https?://website

The number of ‘s’ can either be zero or one.
Only zero occurrences or a single occurrence should qualify.
This brings us to the next regex symbol, the question mark (?),
which represents only two possibilities: either 0 or 1 repetition.

a? Zero or one occurrence of ‘a’ (the character just preceding the question mark)
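To see how the question mark differs from the other two repeaters, the Java sketch below applies s?, s* and s+ to the same inputs. The class and method names are made up for illustration.

```java
class RepeaterDemo {
    // Tests the whole URL against the given regex.
    static boolean matches(String regex, String url) {
        return url.matches(regex);
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://website", "http://website", "httpss://website", "httpx://website"
        };
        String[] patterns = {"https?://website", "https*://website", "https+://website"};
        for (String p : patterns) {
            for (String u : urls) {
                System.out.println(p + " on " + u + " -> " + matches(p, u));
            }
        }
        // s? accepts https and http only.
        // s* additionally accepts httpss.
        // s+ accepts https and httpss, but not http.
    }
}
```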
Making Choices With Pipe

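Input strings:
sapwood
rosewood
logwood
teakwood
plywood
redwood

Matches:
logwood (log wood)
plywood (ply wood)

RegEx pattern: (log|ply)wood
(No spaces around the pipe: a space inside the pattern would be matched literally.)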
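A short Java sketch of the pipe inside a group, using the same words (the class name is made up):

```java
class PipeDemo {
    // Either "log" or "ply", followed by "wood".
    static boolean matches(String word) {
        return word.matches("(log|ply)wood");
    }

    public static void main(String[] args) {
        String[] words = {"sapwood", "rosewood", "logwood", "teakwood", "plywood", "redwood"};
        for (String w : words) {
            // true only for logwood and plywood
            System.out.println(w + " -> " + matches(w));
        }
    }
}
```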
Extended Set - Quiz

Which one of the following regular expressions can represent the words
‘colour’ as well as ‘color’?
A. colou*r
B. colou?r
C. colo.r
Which of the following regular expressions can represent the words
‘ascending’ as well as ‘descending’?
D. (asc/desc)ending
E. [asc|desc]ending
F. (asc|desc)ending
Extended Set - Quiz

Which of the following regular expressions can represent all of the
strings ‘a’, ‘aa’ and ‘aaa’, AND should exclude empty strings?
A. a+
B. a*
C. a.
Regex – Group Capture, find and replace
Steps to do group capture, find and replace
Here we are going to replace some parts of the string or maybe the whole string itself.

Step 1: Understand the requirement: What needs to be replaced? What should be the replacement?
Step 2: Represent the search patterns using regex. Enclose the pattern(s) that need to be replaced in parentheses to segregate them into capture groups.
Step 3: Come up with the substitution string by using the captured pattern groups.
Step 4: Use a regex-enabled find and replace engine to do the replacement.
The Monitor Resolutions Problem

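Input → Output:
1280x720 → 1280 pix by 720 pix
1920x1080 → 1920 pix by 1080 pix
1600x900 → 1600 pix by 900 pix
1280x1024 → 1280 pix by 1024 pix
800x600 → 800 pix by 600 pix
1024x768 → 1024 pix by 768 pix

RegEx pattern: ([0-9]+)x([0-9]+)
The first pair of parentheses is Group \1, the second pair is Group \2.

Replacement String : \1 pix by \2 pix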
The Monitor Resolutions Problem

To do find and replace we use another Linux command: sed.

The sed command can search for a pattern and
replace it with the substitution string, all in a
single-line command invocation.
Yes, it takes one line in Linux rather than a whole
class file in Java.

sed -r 's/pattern/replacement/g' inputfile

RegEx pattern: ([0-9]+)x([0-9]+) (Group \1 and Group \2)
Replacement String : \1 pix by \2 pix
The Monitor Resolutions Problem

sed -r 's/pattern/replacement/g' inputfile

-r            enables the POSIX extended set (similar to -E in the Linux grep command)
s             the letter ‘s’ indicates substitution
/             forward slash separator
pattern       the search pattern regular expression
replacement   the replacement string
g             ‘g’ stands for global. By default the sed command substitutes only the first match on each line.
              Adding ‘g’ substitutes all occurrences, which is what we want.
inputfile     the input file name
The Monitor Resolutions Problem in Java Regex Engine

Java uses the group API on the Matcher class to get the substitution string.

\1 in Linux → m.group(1) in Java
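A minimal sketch of the monitor resolutions problem in Java follows. It shows two routes: reading the groups through m.group(n), and Matcher.replaceAll, where Java spells the group references as $1 and $2 instead of \1 and \2. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResolutionRewriter {
    private static final Pattern RESOLUTION = Pattern.compile("([0-9]+)x([0-9]+)");

    // Builds the output by hand from the captured groups.
    static String describe(String input) {
        Matcher m = RESOLUTION.matcher(input);
        if (!m.matches()) {
            return input; // leave anything that is not a resolution untouched
        }
        return m.group(1) + " pix by " + m.group(2) + " pix";
    }

    // Replaces every resolution found in the text, like the 'g' flag in sed.
    static String describeAll(String text) {
        return RESOLUTION.matcher(text).replaceAll("$1 pix by $2 pix");
    }

    public static void main(String[] args) {
        System.out.println(describe("1280x720"));
        // 1280 pix by 720 pix
        System.out.println(describeAll("1920x1080 and 800x600"));
        // 1920 pix by 1080 pix and 800 pix by 600 pix
    }
}
```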


First Name and Last Name Problem
Input → Output:
• John Wallace → Wallace,John
• Steve King → King,Steve
• Martin Cook → Cook,Martin
• Adam Smith → Smith,Adam
• Irene Peter → Peter,Irene
• Alice Johnson → Johnson,Alice

RegEx pattern: ([a-zA-Z]+)\s([a-zA-Z]+) (Group \1 and Group \2)
Replacement String : \2,\1

sed -r 's/([a-zA-Z]+)\s([a-zA-Z]+)/\2,\1/g' inputfile.txt
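In Java the same swap can be sketched with String.replaceAll, using $2 and $1 in place of \2 and \1. The class name is made up.

```java
class NameSwapDemo {
    // Group 1 is the first name, group 2 the last name.
    static String swap(String fullName) {
        return fullName.replaceAll("([a-zA-Z]+)\\s([a-zA-Z]+)", "$2,$1");
    }

    public static void main(String[] args) {
        System.out.println(swap("John Wallace"));  // Wallace,John
        System.out.println(swap("Alice Johnson")); // Johnson,Alice
    }
}
```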


Clock Time Problem
Input → Output:
• 7:32 → 32 mins past 7
• 6:12 → 12 mins past 6
• 12:23 → 23 mins past 12
• 1:23 → 23 mins past 1
• 11:33 → 33 mins past 11
• 4:21 → 21 mins past 4

RegEx pattern: ([0-9]{1,2}):([0-9]{1,2}) (Group \1 and Group \2)
Replacement String : \2 mins past \1

sed -r 's/([0-9]{1,2}):([0-9]{1,2})/\2 mins past \1/g' inputfile.txt


Phone Number Problem

Input → Output:
• 914.582.3013 → xxx.xxx.3013
• 873.334.2589 → xxx.xxx.2589
• 521.589.3147 → xxx.xxx.3147
• 625.235.3698 → xxx.xxx.3698
• 895.568.2145 → xxx.xxx.2145
• 745.256.3369 → xxx.xxx.3369

RegEx pattern: [0-9]{3}\.[0-9]{3}\.([0-9]{4}) (Group \1 is the last four digits)
Replacement String : xxx.xxx.\1

sed -r 's/[0-9]{3}\.[0-9]{3}\.([0-9]{4})/xxx.xxx.\1/g' inputfile.txt
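A Java sketch of the same masking follows. Only the last four digits are captured; the first two blocks are matched but not captured, so they are simply overwritten. The class name is made up.

```java
class PhoneMaskDemo {
    // Only the last block is in parentheses, so it is the only capture group.
    static String mask(String phone) {
        return phone.replaceAll("[0-9]{3}\\.[0-9]{3}\\.([0-9]{4})", "xxx.xxx.$1");
    }

    public static void main(String[] args) {
        System.out.println(mask("914.582.3013")); // xxx.xxx.3013
        System.out.println(mask("745.256.3369")); // xxx.xxx.3369
    }
}
```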


The Date Problem

Input → Output:
• Jan 5th 1987 → 5-Jan-87
• Dec 26th 2010 → 26-Dec-10
• Mar 2nd 1923 → 2-Mar-23
• Oct 1st 2008 → 1-Oct-08
• Aug 3rd 2009 → 3-Aug-09
• Jun 10th 2001 → 10-Jun-01

RegEx pattern: ([a-zA-Z]{3})\s([0-9]{1,2})[a-z]{2}\s[0-9]{2}([0-9]{2}) (Group \1, Group \2 and Group \3)
Replacement String : \2-\1-\3

sed -r 's/([a-zA-Z]{3})\s([0-9]{1,2})[a-z]{2}\s[0-9]{2}([0-9]{2})/\2-\1-\3/g' inputfile.txt
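The date problem uses three capture groups and skips two uncaptured parts (the ordinal suffix and the century). A Java sketch with a made-up class name:

```java
class DateRewriteDemo {
    // Group 1: month, group 2: day, group 3: last two digits of the year.
    // The suffix (st, nd, rd, th) and the century are matched but not captured.
    private static final String PATTERN =
            "([a-zA-Z]{3})\\s([0-9]{1,2})[a-z]{2}\\s[0-9]{2}([0-9]{2})";

    static String rewrite(String date) {
        return date.replaceAll(PATTERN, "$2-$1-$3");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("Jan 5th 1987"));  // 5-Jan-87
        System.out.println(rewrite("Dec 26th 2010")); // 26-Dec-10
        System.out.println(rewrite("Oct 1st 2008"));  // 1-Oct-08
    }
}
```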


Find and Replace - Quiz
Which one of the following represents a capture group inside a replacement string?
A. (1)
B. \1
C. /1

Given a US state code and a US zip code , separated by a space, (E.g. NY 10520) which of the following regular
expression would capture the state code into capture group 1 and the zip code into capture group 2?

D. ([A-Z]+)\s([0-9])+
E. ([A-Z]+)\s([0-9]+)
F. ([A-Z])+\s([0-9])+

The dollar price tag of a product (e.g. $21.44) is captured using the regex: \$([0-9]+)\.([0-9]+). Which of the below
substitution strings can you use to transform it to a string of the format: '44 cents and 21 dollars'?

G. \2 cents and \1 dollars


H. \1 cents and \2 dollars
I. (2) cents and (1) dollars
Thank you all!!!
[a-zA-Z]+\s[a-zA-Z]+\s[a-zA-Z]+[!]{3}
