Regular Expressions Guide and Practice


Regular Expressions

Guide and Practice


Accelirate, Inc.
July xx, 2019
Revision 1.2
Contents
Document Revision
Introduction
  1.1 What is RegEx?
  1.2 How does it work?
  1.3 When do you usually use it?
  1.4 RegEx in UiPath Studio
Basic RegEx
  Basic Functionality
    2.1 Escape Characters
    2.2 Character Classes
    2.3 Anchors
    2.4 Grouping Constructs
    2.5 Quantifiers
  Basic Examples
    3.1 Matching names starting with S
    3.2 Matching variable text using static text
Advanced RegEx
  Advanced Functionality
    4.1 Backreference Constructs
    4.2 Alternation Constructs
    4.3 Substitutions
    4.4 Regular Expression Options
  Advanced Examples
    5.1 Matching Phone Numbers
    5.2 Matching Email RegEx
    5.3 Matching Header Information inside an Email
Assignment ~30 mins
Quick References
Document Revision
Date        Contributor                       Version Number  Document Changes
07/12/2019  Joseph Medina, Michael Fung Quee  1.0             Initial Draft.
07/17/2019  Joseph Medina                     1.1             Initial Assignment Added
07/18/2019  Joseph Medina, Michael Fung Quee  1.2             Reformatting & Advanced Section

Introduction
1.1 What is RegEx?
Regular expressions (RegEx for short) are an efficient and flexible way to process text. They use a
pattern-matching notation to parse through large amounts of data. In many situations, they are
a better alternative to basic string manipulation functions. They might even become your
best friend when dealing with invoices.

1.2 How does it work?


Regular expressions in UiPath use a regex-directed engine (as opposed to a text-directed
engine). The engine defines how match operations are carried out. A regex-directed
engine attempts to match the first token (character, group, construct, etc.) in the
regular expression against the first character in the input string. If that succeeds, the
engine continues through the regular expression and the input string. If a token fails to match, the
engine backtracks to a previous position in the regular expression and input string and
attempts a different path through the regular expression.

1.3 When do you usually use it?


• Parse through data to find specific character patterns
• Validate text to ensure it matches a predefined pattern (such as an email or URL)
• Extract, edit, replace, and delete substrings

1.4 RegEx in UiPath Studio


More recent versions of UiPath have activities like “Matches”, “Is Match”, and “Replace” that
make it easier to create and test your regular expressions on data. They come with a few
expressions that match basic text input such as email, digits, characters, spaces, etc.
Basic RegEx
Basic Functionality
In order to start processing text with regular expressions, you should first familiarize yourself
with the basic elements used in these expressions: escape characters, character classes,
anchors, grouping constructs, and quantifiers.
2.1 Escape Characters
The backslash character (\) indicates that the following character is either a special character
(such as \t for a tab) or a character that should be interpreted literally (such as punctuation marks and
symbols that the RegEx engine would otherwise treat as operators).
This is not an exhaustive list of escaped characters, but these are the ones you will
likely use most often.
Escaped Character   Description and Example

\b   Matches a word boundary.
     Text: “Education for Ed is important.”
     Regex: \bEd\b.*
     Match: “Ed is important.”

\t   Matches a tab character.
     Text: “<tab>After tab. After all.” (<tab> stands for a tab character)
     Regex: \tAfter\s.{3}
     Match: “<tab>After tab”

\s   Matches a whitespace character (space, tab, new line, etc.).
     Text: “racecars race cars on the track.”
     Regex: \scar.+
     Match: “ cars on the track.”

\n   Matches a new-line character.
     Text: “Before new line.
     After new line.”
     Regex: \n.+
     Match: “
     After new line.”

\r   Matches a carriage return character (carriage returns vary in use from system to
     system, but you will usually see them accompanying a new-line character).

\\   Matches a literal backslash character.
     Text: “Name is \ls;execute –kill sys32”
     Regex: \\.*
     Match: “\ls;execute –kill sys32”

\!   Matches a literal exclamation mark.
     Text: “Stop exploding you cowards!”
     Regex: \s.+\!
     Match: “ exploding you cowards!”

\?   Matches a literal question mark.
     Text: “Who am I? I am you!”
     Regex: .*\?
     Match: “Who am I?”

\$   Matches a literal dollar sign.
     Text: “$50 and #50”
     Regex: \$..
     Match: “$50”

\^   Matches a literal caret symbol.
     Text: “X-Y^Z+A”
     Regex: .\^.
     Match: “Y^Z”

\*   Matches a literal asterisk symbol.
     Text: “Our favorite A* algorithm.”
     Regex: .\*.+
     Match: “A* algorithm.”

\+   Matches a literal plus sign.
     Text: “1+1=fish”
     Regex: .\+.
     Match: “1+1”

\(   Matches a literal parenthesis.
     Text: “70+1+(80-5)=146”
     Regex: \(.*\)
     Match: “(80-5)”

\{   Matches a literal curly brace.
     Text: “else {print}”
     Regex: \{.*\}
     Match: “{print}”

\[   Matches a literal square bracket.
     Text: “Only a Sith deals in [absolutes].”
     Regex: \[.*\]
     Match: “[absolutes]”

\.   Matches a literal period punctuation mark.
     Text: “Sent. Received!”
     Regex: \w+\.
     Match: “Sent.”

\,   Matches a literal comma.
     Text: “First, second and third”
     Regex: \w+\,\s
     Match: “First, ”

\|   Matches a literal vertical bar character.
     Text: “value<10 | boolExists”
     Regex: \|\s\w+
     Match: “| boolExists”
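A few of the rows above can be checked outside UiPath. The following is a minimal sketch in Python, which is not part of the original guide; it assumes Python's built-in re module, which treats these particular escapes the same way, although its regex flavor differs from the .NET flavor used by UiPath in other respects.

```python
import re

# Patterns and texts taken from the table above.
word_boundary = re.search(r"\bEd\b.*", "Education for Ed is important.")
dollar = re.search(r"\$..", "$50 and #50")
parens = re.search(r"\(.*\)", "70+1+(80-5)=146")

print(word_boundary.group())  # Ed is important.
print(dollar.group())         # $50
print(parens.group())         # (80-5)

# When the text to find is entirely literal, re.escape() adds the
# backslashes for you instead of escaping each symbol by hand.
literal = re.escape("A* algorithm.")
print(re.search(literal, "Our favorite A* algorithm.").group())  # A* algorithm.
```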

2.2 Character Classes


Character classes are used to match any single character within a specified character group.
Character Class   Description and Example

[ character_group ]   Matches a single character in the brackets.
     Text: “Inside”
     Regex: [aeiou]
     Matches: “i”, “e”

[^ character_group ]   Matches any single character NOT in the character group.
     Text: “Inside”
     Regex: [^aeiou]
     Matches: “I”, “n”, “s”, “d”

[ first - last ]   Matches any single character in the character range starting from the
     first character and ending at the last character (first and last are included in the range).
     Text: “Password is Sa9.”
     Regex: [A-Z][a-z][0-9]
     Match: “Sa9”

.    Matches any single character (except a new line, unless the Singleline option is set).
     Text: “The boy.”
     Regex: .b
     Match: “ b”

\w   Matches any single word character.
     Text: “The-fox.”
     Regex: \w
     Matches: “T”, “h”, “e”, “f”, “o”, “x”

\W   Matches any single character that is not a word character.
     Text: “The-fox”
     Regex: \W
     Match: “-”

\s   Matches any single white-space character.
     Text: “A wild string appears!”
     Regex: A\s
     Match: “A ”

\S   Matches any single character that is not a white-space character.
     Text: “Something and nothing”
     Regex: \S+
     Matches: “Something”, “and”, “nothing”

\d   Matches any decimal digit.
     Text: “Somebody call 911!”
     Regex: \d\d\d
     Match: “911”

\D   Matches any single character that is not a decimal digit.
     Text: “Call 911”
     Regex: \D
     Matches: “C”, “a”, “l”, “l”, “ ”
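The character classes above can be tried with a short script. This is a minimal sketch in Python (an assumption of this example; the guide itself targets UiPath), using re.findall to list every match and re.search to return the first one.

```python
import re

# Lowercase vowels only: the capital "I" is not in the group.
vowels = re.findall(r"[aeiou]", "Inside")
# The negated group matches everything the first pattern skipped.
non_vowels = re.findall(r"[^aeiou]", "Inside")
# Three ranges in a row: uppercase letter, lowercase letter, digit.
password = re.search(r"[A-Z][a-z][0-9]", "Password is Sa9.").group()
# Shorthand classes.
digits = re.search(r"\d\d\d", "Somebody call 911!").group()
non_word = re.findall(r"\W", "The-fox")

print(vowels)      # ['i', 'e']
print(non_vowels)  # ['I', 'n', 's', 'd']
print(password)    # Sa9
print(digits)      # 911
print(non_word)    # ['-']
```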

2.3 Anchors
Anchors are zero-width assertions: they do not consume any characters, but they require the match to
occur at a particular position in the text.
Anchor   Description and Example

^    Match must occur at the beginning of the string. In multiline mode, match must occur
     at the beginning of the line.
     Text: “I lied.
     And I cried.”
     Regex: ^I\s\w+
     Match: “I lied”

$    Match must occur at the end of the string or before \n at the end of the string. In
     multiline mode, match must occur at the end of the line or before \n at the end of the line.
     Text: “Oh ho... You’re coming right at me?
     Instead of running away...”
     Regex: .*\.$
     Match: “Instead of running away...”

\A   Match must occur at the beginning of the string.
     Text: “I lied.
     And I cried.”
     Regex: \AI\s\w+
     Match: “I lied”

\Z   Match must occur at the end of the string or before \n at the end of the string.
     Text: “Oh ho... You’re coming right at me?
     Instead of running away...”
     Regex: .*\.\Z
     Match: “Instead of running away...”

\z   The match must occur at the end of the string only.
     Text: “I like pie A. You like pie B.
     She likes pie C.”
     Regex: pie\s.\.\z
     Match: “pie C.”

\G   The match must start at the position where the previous match ended.
     This anchor enforces a continuous chain of matches. After the first successful match,
     each subsequent match must be preceded by a successful match. Matching stops upon
     the first failed match in the chain.
     Text: “Cheese, bread, milk, ...Oops! Dropped my soda! Coconut, apple, carrot, salt.”
     Regex: \G(\w+\,?\s?)
     Matches: “Cheese, ”, “bread, ”, “milk, ”

\b   The match must occur on a word boundary.
     Refer to section 2.1 on Escape Characters for the example.

\B   The match must not occur on a word boundary.
     Text: “Added fan made content as well.”
     Regex: \Ba.
     Matches: “an”, “ad”
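The effect of multiline mode on ^ is easiest to see side by side. The following is a minimal sketch in Python rather than UiPath (an assumption of this example). Note that Python's re module has no \G anchor, and its \Z corresponds to the .NET \z, so only ^, \A, and \B are shown.

```python
import re

text = "I lied.\nAnd I cried."

# Default mode: ^ only matches at the very start of the string.
first_words_default = re.findall(r"^\w+", text)
# Multiline mode: ^ also matches right after each new-line character.
first_words_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
# \A always means "start of the string", even in multiline mode.
start_only = re.findall(r"\A\w+", text, flags=re.MULTILINE)
# \B: an "a" that is NOT at a word boundary, plus the character after it.
inside_words = re.findall(r"\Ba.", "Added fan made content as well.")

print(first_words_default)    # ['I']
print(first_words_multiline)  # ['I', 'And']
print(start_only)             # ['I']
print(inside_words)           # ['an', 'ad']
```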

2.4 Grouping Constructs


Grouping constructs are typically used to capture subexpressions from a string.
Grouping Construct   Description and Example

( subexpression )   Matches the subexpression and captures the matched text.
     Text: “That’s a wrap!”
     Regex: (wrap)
     Matches: “wrap”

(?: subexpression )   Does not capture a substring matched by the subexpression.
     Typical use: When a quantifier is applied to a group, but the substrings captured by
     the group are of no interest.
     Text: “Parking in a park around Park avenue.”
     Regex: Park(?:ing)?
     Matches: “Parking”, “Park”

(?= subexpression )   Zero-width positive lookahead assertion.
     Case 1 (Typical use): A regular expression followed by this grouping construct matches
     a pattern that must be followed by the indicated subexpression. Subexpression is NOT
     included in match.
       Text: “I like apple pie but I hate blueberry pie.”
       Regex: .*(?= apple\spie)
       Match: “I like”
     Case 2: If a regular expression begins with this grouping construct, it matches a
     pattern that must begin with the subexpression. Because the lookahead consumes no
     characters, the rest of the expression matches the subexpression text itself, so the
     subexpression is included in match.
       Text: “I like apple pie but I hate blueberry pie.”
       Regex: (?=apple\spie).*
       Match: “apple pie but I hate blueberry pie.”

(?! subexpression )   Zero-width negative lookahead assertion.
     Case 1 (Typical use): A regular expression followed by this grouping construct matches
     a pattern that must not be followed by the subexpression.
       Text: “surely we rely on reily”
       Regex: \b(?!re)\w+\b
       Matches: “surely”, “we”, and “on”
     Case 2: If a regular expression begins with this grouping construct, it defines a
     pattern that must not be matched when the expression matches a similar/more general
     pattern. Note that [sure] in this example is a character class, so the match starts at the
     first character that is not “s”, “u”, “r”, or “e”.
       Text: “surely we rely on reily”
       Regex: (?![sure]).*
       Match: “ly we rely on reily”

(?<= subexpression )   Zero-width positive lookbehind assertion.
     Case 1 (Typical use): A regular expression that begins with this grouping construct
     matches a pattern that must be preceded by the subexpression. Subexpression is NOT
     included in the match.
       Text: “Quick brown fox jumps”
       Regex: (?<=brown)(.+)
       Match: “ fox jumps”
     Case 2: A regular expression that is followed by this grouping construct matches a
     pattern that must end with the subexpression. Subexpression is included in match.
       Text: “Quick brown fox jumps”
       Regex: (.+)(?<=brown)
       Matches: “Quick brown”

(?<! subexpression )   Zero-width negative lookbehind assertion.
     Case 1 (Typical use): A regular expression that begins with this grouping construct
     matches a pattern that must not be preceded by the subexpression.
       Text: “USD250 GBP100”
       Regex: (?<!USD)\d{3}
       Matches: “100”

(?> subexpression )   Nonbacktracking subexpression.
     Typical use: Performance tuning a regular expression by disabling backtracking.
     This grouping construct will attempt to match as many characters as possible in the
     subexpression. Once the subexpression has matched, the engine will not backtrack into
     it to attempt alternate matches if the rest of the expression fails.
     *Only recommended if you know backtracking will not succeed.
     Text: “5XXY 9XYY 6FMP ”
     Regex: [0-9](?>X+Y+)
     Matches: “5XXY”, “9XYY”
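The lookaround cases above can be confirmed with a few one-liners. This is a minimal sketch in Python, not UiPath (an assumption of this example). Python's re module supports the same lookahead and lookbehind syntax, with the restriction that a lookbehind must have a fixed width; the nonbacktracking group is left out because older Python versions do not support it.

```python
import re

# Non-capturing group: "ing" is optional and is not captured separately.
parking = re.findall(r"Park(?:ing)?", "Parking in a park around Park avenue.")
# Positive lookahead: stop before " apple pie" without including it.
before_pie = re.search(r".*(?= apple\spie)",
                       "I like apple pie but I hate blueberry pie.").group()
# Negative lookahead: whole words that do not start with "re".
not_re = re.findall(r"\b(?!re)\w+\b", "surely we rely on reily")
# Positive lookbehind: everything after "brown", excluding "brown" itself.
after_brown = re.search(r"(?<=brown)(.+)", "Quick brown fox jumps").group()
# Negative lookbehind: three digits not directly preceded by "USD".
not_usd = re.findall(r"(?<!USD)\d{3}", "USD250 GBP100")

print(parking)      # ['Parking', 'Park']
print(before_pie)   # I like
print(not_re)       # ['surely', 'we', 'on']
print(after_brown)  # " fox jumps" (with the leading space)
print(not_usd)      # ['100']
```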

2.5 Quantifiers
Quantifiers specify how many instances of a character, group, or character class must be present
in a given input for a match to be successful.
Quantifier   Description and Example

*    Match zero or more times.
     Text: “A1, B96, C, D901, E=?”
     Regex: \w\d*
     Matches: “A1”, “B96”, “C”, “D901”, “E”

+    Match one or more times.
     Text: “A1, B96, C, D901, E=?”
     Regex: \w\d+
     Matches: “A1”, “B96”, “D901”

?    Match zero or one time.
     Text: “B Ba Baa Baaa Bab”
     Regex: Ba?(?=\s)
     Matches: “B”, “Ba”

{n}  Match exactly n times, where n is a positive integer.
     Text: “10, 220, 3450, 98759, 1987”
     Regex: \d\d{3}
     Matches: “3450”, “9875”, “1987” (“9875” is the first four digits of 98759)

{n,} Match at least n times, where n is a positive integer.
     Text: “Now this is podracing!”
     Regex: \w{4,}
     Matches: “this”, “podracing”

{n,m} Match from n to m times, where n and m are positive integers.
     Text: “1, 10, 101, 1010, 10101, 101010, 1010101”
     Regex: \d{2,4}
     Matches: “10”, “101”, “1010”, then “1010” from 10101, “1010” and “10” from 101010,
     and “1010” and “101” from 1010101 (longer runs of digits are split into pieces of at
     most 4 digits, and a leftover single digit does not match)
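The difference between *, +, and ? is easiest to see by running the patterns against the same text. The following is a minimal sketch in Python (an assumption of this example, since the guide itself uses UiPath); these quantifiers behave the same way in both flavors.

```python
import re

codes = "A1, B96, C, D901, E=?"

# *: the digits are optional, so bare letters match too.
zero_or_more = re.findall(r"\w\d*", codes)
# +: at least one digit is required, so "C" and "E" drop out.
one_or_more = re.findall(r"\w\d+", codes)
# ?: at most one "a", and the lookahead demands a space right after.
optional = re.findall(r"Ba?(?=\s)", "B Ba Baa Baaa Bab")
# {4,}: words of four or more word characters.
at_least_four = re.findall(r"\w{4,}", "Now this is podracing!")

print(zero_or_more)   # ['A1', 'B96', 'C', 'D901', 'E']
print(one_or_more)    # ['A1', 'B96', 'D901']
print(optional)       # ['B', 'Ba']
print(at_least_four)  # ['this', 'podracing']
```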
Basic Examples
3.1 Matching names starting with S
Let's examine this list of names:
• Michael, Alex, Sarah, Joseph, Stephanie, Zack, Krystal, Sean, Batman, Jack, George, Jess
We would like to get all the names from this list that start with the letter “S”. Why? Because we have a
strong distaste for names that start with “S” and we want to send a strongly worded letter to each
of those individuals. Since this is a small list, it would be relatively easy to pick out all the names by hand.
However, if the list had one hundred, one thousand, or even one million names, picking them out by
hand would become extremely cumbersome.
So instead, let’s extract the names with a regular expression.
We already know that the name must start with S and there are only names in this list. As such, we can
start out our regular expression with the uppercase character “S”.
From this small list, we can see that there are 3 names that start with S: Sarah, Stephanie, and Sean.
Each of these names not only has a different arrangement of characters, but the lengths also vary. If we
have an excessively long list, it would be difficult to keep track of all the different arrangements of
letters in a given name. To simplify this example, let’s also assume the format for all the names in the list
is always an uppercase “S” followed by a variable amount of lowercase letters.
First let’s tackle the problem of varying arrangements. For any given name, each position after the “S”
character contains a lowercase letter from “a” to “z”. We can match each single character easily with a
character class. The expression to match a single lowercase letter is:
• [a-z]
Now that we can match any letter in any position, we must account for the variable number of
characters in each name. Let’s assume there is at least one character after the uppercase “S”.
The simplest quantifier to account for this case is the “+” quantifier.
All together, we have a regular expression that is the following:
• S[a-z]+
Congratulations! We can now send out all of our strongly worded letters.
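To see the expression at work, here is a minimal sketch in Python (an assumption of this example; in UiPath you would use the “Matches” activity instead). It runs S[a-z]+ over the list of names from this section.

```python
import re

names = ("Michael, Alex, Sarah, Joseph, Stephanie, Zack, "
         "Krystal, Sean, Batman, Jack, George, Jess")

# An uppercase "S" followed by one or more lowercase letters.
s_names = re.findall(r"S[a-z]+", names)
print(s_names)  # ['Sarah', 'Stephanie', 'Sean']
```

The lowercase “s” at the end of “Jess” is not picked up, because the expression is case-sensitive by default.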
3.2 Matching variable text using static text
Consider the following text snippet from the Accelirate website:
Do you need to get your automation initiatives in shape? Are you struggling to identify the right
processes to show the immediate return of man-hours back to the business? RPA90X was built
for just these reasons. Accelirate was born from the Enterprise Consulting world and
understands the challenges Fortune 1000 companies face in their automation journeys, from
platform selection, to setting up infrastructure with IT, to building an automation process
pipeline, to development and testing; RPA90X covers it all!
When you are first embarking on your automation journey, there will be a lot of questions, a lot
of uncertainty about which product is best suited, and usually it is all being done by someone
that doesn’t even have an automation background. Regardless of the person’s background,
finding the right information to select the best platform and the right resources to perform a
proper pilot program can be overwhelming. Usually this person still has their full-time job to
perform, which is why we have a whole article HERE outlining the importance of a Head of CoE
right from the beginning.
The RPA90X program was designed to get in and get real results quickly through our E^3
strategy
Given this text as input, we would like to find every instance of “RPA90X” and only display the first three
words that follow each instance.
Let’s start by breaking down each part of the text we’re looking for.
First, we know that in each string, there is a reference to “RPA90X”. Since we constantly reference this
specific text, we can use this exact text in our regular expression.
Second, we know that the text we are looking for comes after “RPA90X”. Conversely, this means that
“RPA90X” is behind the text we’re trying to extract. This fact is very important.
It should be obvious that we can use a grouping construct here but how do we choose the right one? If
we look back at the table on grouping constructs in section 2.4, we’ll notice that each grouping construct
has a typical use case and may also have an alternate use case. We should first check if our situation falls
within any of the typical use cases since these cases are usually more straight-forward when it comes to
the use of the grouping construct. If our situation doesn’t apply to any of the typical cases, then we
should check the alternate cases.
Luckily, our situation matches the typical use case of the zero-width positive lookbehind assertion:
• (?<=subexpression)
In the typical use case for this assertion, the text that we’re looking for must be preceded by the text
matched by the subexpression, but our final output will not include the subexpression itself. If we make
the subexpression in this assertion “RPA90X” followed by a space, and add another regex to capture the
first three words after this assertion, we will get our desired result.
We can capture three consecutive words with the regex:
• ([A-Za-z]+\s){3}
This regex matches one or more letters ( [A-Za-z]+ ) followed by a whitespace character ( \s ),
exactly 3 times ( {3} ). The space between “RPA90X” and the first word is handled inside the lookbehind.
A word may also carry punctuation (for example “all!”), so in the final expression the character class is
widened with \S, which matches any non-whitespace character.
Our final regular expression is:
• (?<=RPA90X\s)([A-Za-z\S?]+\s){3}
The matches found by this regular expression are:
• “was built for ”
• “covers it all! ”
• “program was designed ”
Each match ends with the whitespace character matched by the final \s.
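The final expression can be tried on a shortened input. This is a minimal sketch in Python (an assumption of this example), and the sample string is a hypothetical abbreviation of the website snippet rather than the full text. The surrounding whitespace is stripped from each match before printing.

```python
import re

# Shortened, hypothetical sample; the full snippet is processed the same way.
sample = ("RPA90X was built for just these reasons. "
          "The RPA90X program was designed to get real results quickly.")

# Lookbehind for "RPA90X" plus one whitespace character, then three
# runs of non-whitespace characters, each followed by whitespace.
pattern = r"(?<=RPA90X\s)([A-Za-z\S?]+\s){3}"

following_words = [m.group(0).strip() for m in re.finditer(pattern, sample)]
print(following_words)  # ['was built for', 'program was designed']
```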
Advanced RegEx
Advanced Functionality
Regex has additional functionality that allows you to create more powerful expressions. Backreference
constructs, alternation constructs, substitutions, and options can all prove useful: they allow
you to refer back to a specific capturing group within a regular expression, add
conditional logic to your expressions, or even change how the engine processes the
text or the regular expression itself.
4.1 Backreference Constructs
Backreference constructs provide a couple of ways to refer back to a capturing group within a
regular expression. This can make it easier to identify repeated occurrences of a character or
substring.
Backreference Construct   Description and Example

\number   Numbered Backreference (where number is a positive integer):
     The number in this construct represents the position of a capturing group within a
     regular expression. It’s like an index given to each capturing group.
     Using this backreference has the same effect as writing the captured group again in
     the regular expression.
     *Non-capturing groups can’t be referred to by this construct.
     Text: “Save 15% on car insurance and 15% on home insurance.”
     Regex: (\d{2}%)(.+)\s\1
     Match: “15% on car insurance and 15%”

\k< name > or \k' name '   Named Backreference:
     First assign the name: (?< name > subexpression )
     Then make the backreference: \k< name > or \k' name '
     The name in this construct is like a variable name in a programming language.
     After naming the subexpression, you can repeat it again in the regular expression by
     using the backreference.
     Text: “The British are coming! The British are coming! Hide your kids and hide your wives.”
     Regex: (?<alert>(.{22}\!\s))\k<alert>
     Match: “The British are coming! The British are coming! ”

\k< number >   Named Numeric Backreference:
     Assign the name (not mandatory): (?< number > subexpression )
     Make the backreference: \k< number >
     If a subexpression is named with the number, this construct works the same as
     \k<name>. Otherwise, it functions like \number, where the number denotes the position
     of the captured group.
     Text: “The British are coming! The British are coming! Hide your kids and hide your wives.”
     Regex: (.{22}\!\s)\k<1>
     Match: “The British are coming! The British are coming! ”
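Both kinds of backreference can be tried with the examples from the table. The following is a minimal sketch in Python (an assumption of this example). Numbered backreferences are written the same way as in .NET, but Python spells a named group (?P<name>...) and a named backreference (?P=name) instead of (?<name>...) and \k<name>.

```python
import re

ad = "Save 15% on car insurance and 15% on home insurance."
# \1 repeats whatever the first group, (\d{2}%), captured.
numbered = re.search(r"(\d{2}%)(.+)\s\1", ad).group()

warning = ("The British are coming! The British are coming! "
           "Hide your kids and hide your wives.")
# The group named "alert" is matched once and then required again.
named = re.search(r"(?P<alert>.{22}!\s)(?P=alert)", warning).group()

print(numbered)  # 15% on car insurance and 15%
print(named)     # The British are coming! The British are coming!
```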

4.2 Alternation Constructs


Alternation constructs allow you to make conditional statements inside of a regular expression.
Alternation Construct   Description and Example

subexpression1 | subexpression2   Pattern matching:
     Matches either the first subexpression or the second subexpression in a string.
     Text: “I tank, you dps, and they heal.”
     Regex: \w+\s(tank|dps)
     Matches: “I tank”, “you dps”

(?(expression) then | else )
(?(name) then | else )
(?(number) then | else )   Conditional matching:
     If the expression is found in the string, it is evaluated as true and the engine moves
     on to and matches the ‘then’ expression. Otherwise, the ‘else’ expression is matched.
     The vertical bar and else expression are optional in this construct.
     Text: “01-9999999 020-333333 777-88-9999”
     Regex: \b(?<n2>\d{2}-)?(?(n2)\d{7}|\d{3}-\d{2}-\d{4})\b
     Matches: “01-9999999”, “777-88-9999”
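Both constructs can be tried with the examples from the table. This is a minimal sketch in Python (an assumption of this example); the only change to the conditional pattern is that the named group is written (?P<n2>...), as Python requires.

```python
import re

# Simple alternation: a word followed by either "tank" or "dps".
roles = [m.group() for m in
         re.finditer(r"\w+\s(tank|dps)", "I tank, you dps, and they heal.")]

# Conditional: if the optional group n2 (two digits and a hyphen) matched,
# expect 7 more digits; otherwise expect the ddd-dd-dddd form.
ids = "01-9999999 020-333333 777-88-9999"
conditional = r"\b(?P<n2>\d{2}-)?(?(n2)\d{7}|\d{3}-\d{2}-\d{4})\b"
matched_ids = [m.group() for m in re.finditer(conditional, ids)]

print(roles)        # ['I tank', 'you dps']
print(matched_ids)  # ['01-9999999', '777-88-9999']
```

The middle value, 020-333333, fits neither branch, so it is skipped.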

4.3 Substitutions
Substitutions are elements used in replacement functions of regular expressions. These
elements are ONLY valid in replacement patterns (they won’t work in regular expression
patterns).
Substitution   Description and Example
All examples use the text “Pepperidge Farm remembers.”, in which the regex matches “Farm”.

$ number   Indicates that the capture group referenced by “number” (a positive integer) is
     inserted into the replacement string.
     Regex: (F)(ar)(m)
     Substitution: $1$3
     New Text: “Pepperidge Fm remembers.”

${name}   Indicates that the capture group referenced by “name” is inserted into the
     replacement string.
     Regex: (F)(?<rifle>ar)(m)
     Substitution: ${rifle}
     New Text: “Pepperidge ar remembers.”

$$   Inserts a literal dollar sign into the replacement string.
     Regex: (F)(ar)(m)
     Substitution: $$
     New Text: “Pepperidge $ remembers.”

$&   Inserts the entire regex match into the replacement string.
     Regex: (F)(ar)(m)
     Substitution: $&$&
     New Text: “Pepperidge FarmFarm remembers.”

$`   Inserts all of the text BEFORE the match into the replacement string.
     Regex: (F)(ar)(m)
     Substitution: $`
     New Text: “Pepperidge Pepperidge  remembers.” (the inserted text “Pepperidge ” keeps its
     trailing space, so two spaces precede “remembers.”)

$'   Inserts all of the text AFTER the match into the replacement string.
     Regex: (F)(ar)(m)
     Substitution: $'
     New Text: “Pepperidge  remembers. remembers.” (the inserted text “ remembers.” keeps its
     leading space)

$+   Inserts the last group captured into the replacement string.
     Regex: (F)(ar)(m)
     Substitution: $+
     New Text: “Pepperidge m remembers.”

$_   Inserts the entire input string into the replacement string.
     Regex: (F)(ar)(m)
     Substitution: $_
     New Text: “Pepperidge Pepperidge Farm remembers. remembers.”
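Replacement patterns are one of the areas where regex flavors differ most. The following is a minimal sketch in Python (an assumption of this example; the $ syntax in the table is the .NET syntax used by UiPath). In Python's re.sub, \1 plays the role of $1, \g<name> the role of ${name}, and \g<0> the role of $&.

```python
import re

text = "Pepperidge Farm remembers."

# Keep groups 1 and 3, dropping "ar" (.NET: $1$3).
numbered = re.sub(r"(F)(ar)(m)", r"\1\3", text)
# Insert only the named group (.NET: ${rifle}).
named = re.sub(r"(F)(?P<rifle>ar)(m)", r"\g<rifle>", text)
# Insert the whole match twice (.NET: $&$&).
doubled = re.sub(r"(F)(ar)(m)", r"\g<0>\g<0>", text)

print(numbered)  # Pepperidge Fm remembers.
print(named)     # Pepperidge ar remembers.
print(doubled)   # Pepperidge FarmFarm remembers.
```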

4.4 Regular Expression Options


The options for regular expressions allow you to change the behavior of a regular expression. They
can be included inline (as part of the regular expression). However, UiPath already provides an
easy way to select these options through the Input section of the Properties tab.

Option (inline character)   Description

None (not available inline)
     Uses the default behavior.

IgnoreCase (i)
     Matching is case-insensitive (case of letters is ignored).

Multiline (m)
     The regex engine will handle an input string that contains multiple lines.
     ^ and $ will match the beginning and end of each line in the string rather than the
     beginning and end of the whole input string.

Singleline (s)
     The regex engine treats an input string as a single line.
     The period (.) wildcard matches every character, rather than every character except
     the newline character (\n).

ExplicitCapture (n)
     Capture groups are valid only if they are explicitly named or numbered; unnamed
     parentheses no longer capture.
     Naming a capture group: (?<name>subexpression)

Compiled (not available inline)
     The regular expression is parsed into a set of custom opcodes which are used to
     perform the regular expression operations. Doing so increases the initialization time
     but it also improves the run-time performance.

IgnorePatternWhitespace (x)
     Whitespace in the pattern must be escaped with “\s” or “\ ”, otherwise it is ignored.
     # is interpreted as the beginning of a comment rather than a literal symbol.
     *Note: Whitespace isn’t ignored in character classes, bracketed quantifiers, and
     character sequences that introduce a language element.*

RightToLeft (not available inline)
     Changes the direction in which the regex engine processes the text from the default
     setting (left-to-right) to right-to-left.

ECMAScript (not available inline)
     Use ECMAScript matching behavior instead of the default canonical regex behavior.
     ECMAScript doesn’t support Unicode (which means culture-specific symbols aren’t
     supported).
     A capturing group with a backreference to itself must be updated with each capture
     iteration. This feature allows ECMAScript to capture certain repeated expressions that
     default canonical regex can’t.
     Numbered backreferences are interpreted as literals if the capture group doesn’t exist.

CultureInvariant (not available inline)
     Ignores cultural differences in language.
     Without this option, symbols specific to a language are processed and matched using
     the conventions of the current culture (English, Turkish, French, etc.).
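Several of these options have close counterparts in other regex libraries, which makes them easy to experiment with. This is a minimal sketch in Python (an assumption of this example; the option names above are the .NET names). Python calls Singleline DOTALL and IgnorePatternWhitespace VERBOSE, and it accepts the inline form (?i) at the start of a pattern.

```python
import re

# IgnoreCase, once as a flag and once as the inline option (?i).
flag_form = re.findall(r"ed", "Education for Ed", flags=re.IGNORECASE)
inline_form = re.findall(r"(?i)ed", "Education for Ed")

# Singleline (DOTALL in Python): "." then also matches "\n".
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", flags=re.DOTALL)

# IgnorePatternWhitespace (VERBOSE in Python): unescaped spaces and
# "#" comments inside the pattern are ignored.
phone_part = re.compile(r"""
    \d{3}   # three digits
    -       # a hyphen
    \d{4}   # four digits
""", re.VERBOSE)
verbose_match = phone_part.search("call 555-0196").group()

print(flag_form, inline_form)        # ['Ed', 'Ed'] ['Ed', 'Ed']
print(without_dotall)                # None
print(with_dotall is not None)       # True
print(verbose_match)                 # 555-0196
```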
Advanced Examples
Each advanced example gives you a group of strings to match and walks through every step, so that
you understand the complete RegEx pattern by the end.
5.1 Matching Phone Numbers
Capture phone numbers in the following different configurations.

808-555-0196 409-889-4566 (429)966-8125

577 699 4359 6909708770 1 358 566 0922

Let’s break down each part of the RegEx that we need to capture these phone numbers.
Step 1: How do we handle the possibility of the country code “1” appearing in a phone number? We
make it optional with the ? quantifier.
Try the Pattern: 1? on the string “123 321”. Both 1’s will be captured by this pattern.
Step 2: Next, notice that the number starting with the country code is followed by a space, but not
every number’s first digit is followed by a space. We just used an optional match on 1; how
can we do the same with blank spaces?
Try the Pattern: \s? on the same string as before. It captures the blank
space between the numbers, and if we combine Steps 1 and 2 we get all the ones and the blank
space.
Step 3: The next thing we need to handle is the 3 digits of the area code, which may be preceded by
an open parenthesis “(” and followed by a close parenthesis “)”.
Try it on “(429)966-8125”: the optional open parenthesis can be captured using \(?,
the 3 digits using (\d{3}), and the optional closing parenthesis using \)?. Putting together
everything we have done so far gives the Pattern: 1?\s?\(?(\d{3})\)?
Step 4: Next we handle the separator between the area code and the rest of the phone number. Before,
we handled a space with \s?; the other possible character is the hyphen “-”, so for this we can use [\s-]?
Step 5: The rest of the phone number is straightforward: 3 digits followed by nothing, a space, or a
hyphen, followed by 4 digits. Translated to RegEx, that is \d{3}[\s-]?\d{4}
Our final Pattern is 1?\s?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}
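The final pattern can be checked against all six sample numbers. The following is a minimal sketch in Python (an assumption of this example; in UiPath the “Is Match” activity serves the same purpose). It uses fullmatch so that each number is tested on its own, and reads the area code from the capturing group.

```python
import re

phone = re.compile(r"1?\s?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}")

numbers = ["808-555-0196", "409-889-4566", "(429)966-8125",
           "577 699 4359", "6909708770", "1 358 566 0922"]

# Every sample must match in full, from first to last character.
all_match = all(phone.fullmatch(n) for n in numbers)
# Group 1 is the (\d{3}) area code.
area_codes = [phone.fullmatch(n).group(1) for n in numbers]

print(all_match)   # True
print(area_codes)  # ['808', '409', '429', '577', '690', '358']
```

Because each parenthesis is optional on its own, the pattern also accepts an unbalanced input such as “(429 966-8125”. That is acceptable for extraction, but worth tightening if the pattern is used for validation.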

5.2 Matching Email RegEx


Capture the following emails

aje+vegimm-9633@yopmail.com 8crazy1.n1h@zeusrisky07.ml mmokaana937@rifkian.tk

xawais.s.khan.39@decox.ru 8dspeak97v@geraldlover.org fshaher.alagates@kad03.ml

In this example we want to capture each email address whole, so we will build a simple email-matching
regex. Let’s start by assuming the base case of an email in the format “foo@bar.com”.
We can quickly handle this base case using:
[a-z]+@[a-z]+\.[a-z]{2,3}
Breaking this up into parts:
[a-z]+ : We want at least one letter in front of the @ for the alias, user, group, or department.
@ : The at sign is required in any email.
[a-z]+ : Grabs the beginning of the domain; again we want a length of at least one.
\. : The literal dot (.) that we need in the domain.
[a-z]{2,3} : The shortest domain suffixes have length 2; this pattern accepts suffixes of 2 or 3 letters.
From this we can expand to cover simple emails that contain numbers:
[a-z0-9]+@[a-z0-9]+\.[a-z]{2,3}
Special characters can be added to the character class as well:
[a-z0-9+\.-]+@[a-z0-9]+\.[a-z]{2,3}
Now suppose we add another email to match, such as “tea+biscuits@verybritish.co.uk”.
Our current pattern does not capture the “.uk”.
How can we capture this email while still getting the ones from before?
We can add an optional group to the end of our current pattern:
[a-z0-9+\.-]+@[a-z0-9]+\.[a-z]{2,3}(\.[a-z]{2,3})?
While this matches all our emails, there is still much room for improvement for emails that must meet
different standards.
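The final pattern can be checked against every sample address, including the “.co.uk” one. This is a minimal sketch in Python (an assumption of this example), using fullmatch so that each address must match from its first to its last character.

```python
import re

email = re.compile(r"[a-z0-9+\.-]+@[a-z0-9]+\.[a-z]{2,3}(\.[a-z]{2,3})?")

addresses = [
    "aje+vegimm-9633@yopmail.com", "8crazy1.n1h@zeusrisky07.ml",
    "mmokaana937@rifkian.tk", "xawais.s.khan.39@decox.ru",
    "8dspeak97v@geraldlover.org", "fshaher.alagates@kad03.ml",
    "tea+biscuits@verybritish.co.uk",
]

# Collect any address the pattern fails to match in full.
unmatched = [a for a in addresses if not email.fullmatch(a)]
print(unmatched)  # []

# The pattern only lists lowercase letters, so capitals are rejected
# unless the IgnoreCase option is switched on.
print(email.fullmatch("Foo@bar.com"))  # None
```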
5.3 Matching Header Information inside an Email
The goal of this example is, given an email, to grab all the header information without capturing the
body of the email. The Cc and Subject lines are optional in the regex.
From: Happy Bot
Sent: Wednesday, March 20, 2019 3 :21PM
To: Happy Bot
Cc: Happy Bot
Subject: Hello, Happy Bot!

Greetings,
He do subjects prepared bachelor juvenile ye oh. He feelings removing informed he as ignorant we
prepared. Evening do forming observe spirits is in. Country hearted be of justice sending. On so they as
with room cold ye. Be call four my went mean. Celebrated if remarkably especially an. Going eat set she
books found met aware.
Thanks,
Happy Bot
Bot of Happiness

There are different ways to tackle this problem, but we will use the Multiline option to develop
our solution. We will break down the following RegEx, which solves our stated problem:
^(From.+)$\n^(Sent.+)$\n^(To.+)$\n(^(Cc.+)$\n)?(^(Subject.+)$\n)?
Because the header is laid out line by line and we are running our regex across multiple lines, it is
best to use the anchors (^ and $) together with the newline character (\n).
^(From.+)$ Our first group is the first line we need. We use both anchors,
at the start and the end, and we capture “From” together with everything that follows it on that line.
\n is required here to move on to the next line of the email, and gives us a simple way to move down the
text.
^(Sent.+)$\n^(To.+)$\n We create two more groups, one for each required line, using the same
pattern.
The next two steps are the optional parts of the email heading. We repeat the same grouping
with anchors for each line, but we make these last two lines
optional by using our friendly ? quantifier.
(^(Cc.+)$\n)? This gets the Cc line if it exists, and the same goes for
(^(Subject.+)$\n)? for the subject line.
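The header pattern can be tried on the sample email. The following is a minimal sketch in Python (an assumption of this example), with the email body shortened and lines separated by \n only. The Multiline option corresponds to re.MULTILINE, and the header lines end up in groups 1, 2, 3, 5, and 7, because groups 4 and 6 are the optional wrappers that also include the line break.

```python
import re

message = (
    "From: Happy Bot\n"
    "Sent: Wednesday, March 20, 2019 3 :21PM\n"
    "To: Happy Bot\n"
    "Cc: Happy Bot\n"
    "Subject: Hello, Happy Bot!\n"
    "\n"
    "Greetings,\n"
    "Thanks,\n"
    "Happy Bot\n"
)

header = re.compile(
    r"^(From.+)$\n^(Sent.+)$\n^(To.+)$\n(^(Cc.+)$\n)?(^(Subject.+)$\n)?",
    re.MULTILINE,
)

m = header.search(message)
fields = [m.group(1), m.group(2), m.group(3), m.group(5), m.group(7)]
for field in fields:
    print(field)

# Without a Cc line the pattern still matches; group 5 is simply empty.
no_cc = header.search(message.replace("Cc: Happy Bot\n", ""))
print(no_cc.group(5))  # None
print(no_cc.group(7))  # Subject: Hello, Happy Bot!
```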
Assignment ~30 mins
The following assignment covers some basic RegEx questions to give you some hands-on
experience. There are four (4) questions, which need to be answered using UiPath Studio. The
program requires two inputs from you, the user: the question to select and the RegEx pattern
for the selected question. Provided along with the program is the PDF (Article_1.pdf). The scraping has
already been done; you can find the extracted text in the Text Examples folder.
Change “Input” to match the question you are working on.
Questions:
1. Count the number of vowels in the first Headline (SuperEasy Ways To Learn Everything).
   Note: Caps Insensitive, reference is Headline.txt

2. What are the names of the first and second article authors?
   Note: Caps Sensitive, reference is ShortText.txt

3. Rewrite the date format to be Month Day, Year. (RegEx needed to grab the “dd MONTH yyyy”)
   Note: Caps Sensitive, reference is ShortText.txt

4. Find all the words that begin with a capital letter.
   Note: Caps Sensitive, reference is FullText.txt

For each question, provide a screenshot of the Input Number, the RegEx pattern, and the correct popup.

Example Image:
Download the Program using this Link: Accelirate RegEx Assignment
Quick References
Microsoft .NET Documentation: .NET Regular Expressions
Video link that goes over the basics: Regular Expressions (Regex) Tutorial: How to Match Any Pattern of
Text
Details about the activity that uses Regex: UiPath Activity (Matches)
Regex Online Tester: regexstorm.net/tester
Regex Online Tester: regex101
regular-expressions.info:
• regular-expressions.info/lookaround
• regular-expressions.info/engine
