Regular Expressions Guide and Practice
Introduction
1.1 What is RegEx?
RegEx (short for regular expressions) is an efficient and flexible way to process text. It uses a
pattern-matching notation to parse large amounts of data. In many situations, it is a better
alternative to relying on basic string manipulation functions. It might even be your
best friend when dealing with invoices.
2.3 Anchors
Anchors are assertions about position: they specify where in a string of text a match must (or must
not) occur, without consuming any characters themselves.
Anchor: ^
Description: Match must occur at the beginning of the string. In multiline mode, match must occur at the beginning of the line.
Example:
  Text: “I lied.
  And I cried.”
  Regex: ^I\s\w+
  Match: “I lied”

Anchor: $
Description: Match must occur at the end of the string or before \n at the end of the string. In multiline mode, match must occur at the end of the line or before \n at the end of the line.
Example:
  Text: “Oh ho... You’re coming right at me?
  Instead of running away...”
  Regex: .*\.$
  Match: “Instead of running away...”

Anchor: \A
Description: Match must occur at the beginning of the string.
Example:
  Text: “I lied.
  And I cried.”
  Regex: \AI\s\w+
  Match: “I lied”

Anchor: \Z
Description: Match must occur at the end of the string or before \n at the end of the string.
Example:
  Text: “Oh ho... You’re coming right at me?
  Instead of running away...”
  Regex: .*\.\Z
  Match: “Instead of running away...”

Anchor: \z
Description: The match must occur at the end of a string only.
Example:
  Text: “I like pie A. You like pie B.
  She likes pie C.”
  Regex: pie\s.\.\z
  Match: “pie C.”

Anchor: \G
Description: The match must start at the position where the previous match ended. This anchor enforces a continuous chain of matches. After the first successful match, each subsequent match must be preceded by a successful match. Matching stops upon the first failed match in the chain.
Example:
  Text: “Cheese, bread, milk, ...Oops! Dropped my soda! Coconut, apple, carrot, salt.”
  Regex: \G(\w+\,?\s?)
  Matches: “Cheese, “, “bread, “, “milk, “

Anchor: \b
Description: The match must occur on a word boundary.
Example: Refer to section 2.1 on Escape characters for the example.

Anchor: \B
Description: The match must not occur on a word boundary.
Example:
  Text: “Added fan made content as well.”
  Regex: \Ba.
  Matches: “an”, “ad”
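The anchors above can be tried out in a few lines of Python. This is only a sketch: it assumes Python's re module rather than the .NET engine this guide describes, and the variable names are made up for illustration. Two differences are worth knowing: Python's \Z matches only at the very end of the string (the behaviour the table lists for \z), and the re module has no \G anchor.

```python
import re

text = "I lied.\nAnd I cried."

# Without MULTILINE, ^ matches only at the very start of the string.
start_only = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches at the start of every line.
each_line = re.findall(r"^\w+", text, re.MULTILINE)
# \A means "start of string" regardless of MULTILINE.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)

quote = "Oh ho... You're coming right at me?\nInstead of running away..."
# $ without MULTILINE: only the final line can end the match.
last_line = re.findall(r".*\.$", quote)

# \B: the "a" must NOT sit on a word boundary, so "as" is skipped.
inside_words = re.findall(r"\Ba.", "Added fan made content as well.")

print(start_only)    # ['I']
print(each_line)     # ['I', 'And']
print(string_start)  # ['I']
print(last_line)     # ['Instead of running away...']
print(inside_words)  # ['an', 'ad']
```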
2.5 Quantifiers
Quantifiers specify how many instances of a character, group, or character class must be present
in a given input for a match to be successful.
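Quantifier: *
Description: Match zero or more times.
Example:
  Text: “A1, B96, C, D901, E=?”
  Regex: \w\d*
  Matches: “A1”, “B96”, “C”, “D901”, “E”

Quantifier: +
Description: Match one or more times.
Example:
  Text: “A1, B96, C, D901, E=?”
  Regex: \w\d+
  Matches: “A1”, “B96”, “D901”

Quantifier: ?
Description: Match zero or one time.
Example:
  Text: “B Ba Baa Baaa Bab”
  Regex: Ba?(?=\s)
  Matches: “B”, “Ba”

Quantifier: {n}
Description: Match exactly n times, where n is a positive integer.
Example:
  Text: “10, 220, 3450, 98759, 1987”
  Regex: \d\d{3}
  Matches: “3450”, “9875”, “1987” (“9875” is the first four digits of “98759”; the pattern has no word boundary to rule it out)

Quantifier: {n,}
Description: Match at least n times, where n is a positive integer.
Example:
  Text: “Now this is podracing!”
  Regex: \w{4,}
  Matches: “this”, “podracing”

Quantifier: {n,m}
Description: Match from n to m times, where n and m are positive integers.
Example:
  Text: “1, 10, 101, 1010, 10101, 101010, 1010101”
  Regex: \d{2,4}
  Matches: “10”, “101”, “1010”, then “1010” from “10101”, “1010” and “10” from “101010”, and “1010” and “101” from “1010101”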
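The quantifier examples can be checked with a short Python sketch. Python's re module is assumed here in place of .NET (these quantifiers behave the same way in both), and the variable names are illustrative. Note that \d\d{3} also finds “9875” inside “98759”, and \d{2,4} keeps matching inside longer runs of digits; the \b version is an added variation showing how word boundaries restrict matches to whole numbers.

```python
import re

codes = "A1, B96, C, D901, E=?"
zero_or_more = re.findall(r"\w\d*", codes)
one_or_more = re.findall(r"\w\d+", codes)

zero_or_one = re.findall(r"Ba?(?=\s)", "B Ba Baa Baaa Bab")

numbers = "10, 220, 3450, 98759, 1987"
exactly_four = re.findall(r"\d\d{3}", numbers)
# Variation (not in the table): word boundaries keep only whole 4-digit numbers.
whole_four = re.findall(r"\b\d{4}\b", numbers)

at_least_four = re.findall(r"\w{4,}", "Now this is podracing!")

binary = "1, 10, 101, 1010, 10101, 101010, 1010101"
two_to_four = re.findall(r"\d{2,4}", binary)

print(zero_or_more)   # ['A1', 'B96', 'C', 'D901', 'E']
print(one_or_more)    # ['A1', 'B96', 'D901']
print(zero_or_one)    # ['B', 'Ba']
print(exactly_four)   # ['3450', '9875', '1987']
print(whole_four)     # ['3450', '1987']
print(at_least_four)  # ['this', 'podracing']
print(two_to_four)    # ['10', '101', '1010', '1010', '1010', '10', '1010', '101']
```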
Basic Examples
3.1 Matching names starting with S
Let's examine this list of names:
• Michael, Alex, Sarah, Joseph, Stephanie, Zack, Krystal, Sean, Batman, Jack, George, Jess
We would like to get all the names from this list that start with the letter “S”. Why? Because we have a
strong distaste for names that start with “S” and we want to send a strongly worded letter to each
of those individuals. Because this is a small list, it would be relatively easy to pick out the names by hand.
However, if the list had one hundred, one thousand, or even one million names, picking them out by
hand would become extremely cumbersome.
So instead, let’s extract the names with a regular expression.
We already know that the name must start with S and there are only names in this list. As such, we can
start out our regular expression with the uppercase character “S”.
From this small list, we can see that there are 3 names that start with S: Sarah, Stephanie, and Sean.
Each of these names not only has a different arrangement of characters, but the lengths also vary. If we
have an excessively long list, it would be difficult to keep track of all the different arrangements of
letters in a given name. To simplify this example, let’s also assume the format for all the names in the list
is always an uppercase “S” followed by a variable amount of lowercase letters.
First let’s tackle the problem of varying arrangements. For any given name, each position after the “S”
contains a lowercase letter from “a” to “z”. We can capture each single character easily with a
character class. The expression to match a single lowercase letter is:
• [a-z]
Now that we can capture any letter in any position, we must account for the variable number of
characters in each name. Let’s assume there is at least one character after the uppercase “S”.
The simplest quantifier to account for this case is the “+” quantifier.
All together, we have a regular expression that is the following:
• S[a-z]+
Congratulations! We can now send out all of our strongly worded letters.
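As a quick check, the pattern can be run against the list of names. This is a minimal sketch that assumes Python's re module (the guide itself targets .NET, where S[a-z]+ behaves the same way); the variable names are illustrative.

```python
import re

names = ("Michael, Alex, Sarah, Joseph, Stephanie, Zack, "
         "Krystal, Sean, Batman, Jack, George, Jess")

# An uppercase "S" followed by one or more lowercase letters.
s_names = re.findall(r"S[a-z]+", names)
print(s_names)  # ['Sarah', 'Stephanie', 'Sean']
```

Names such as “Joseph” and “Jess” are left alone because the pattern is case sensitive and their “s” is lowercase.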
3.2 Matching variable text using static text
Consider the following text snippet from the Accelirate website:
Do you need to get your automation initiatives in shape? Are you struggling to identify the right
processes to show the immediate return of man-hours back to the business? RPA90X was built
for just these reasons. Accelirate was born from the Enterprise Consulting world and
understands the challenges Fortune 1000 companies face in their automation journeys, from
platform selection, to setting up infrastructure with IT, to building an automation process
pipeline, to development and testing; RPA90X covers it all!
When you are first embarking on your automation journey, there will be a lot of questions, a lot
of uncertainty about which product is best suited, and usually it is all being done by someone
that doesn’t even have an automation background. Regardless of the person’s background,
finding the right information to select the best platform and the right resources to perform a
proper pilot program can be overwhelming. Usually this person still has their full-time job to
perform, which is why we have a whole article HERE outlining the importance of a Head of CoE
right from the beginning.
The RPA90X program was designed to get in and get real results quickly through our E^3
strategy
Given this text as input, we would like to find every instance of “RPA90X” and only display the first three
words that follow each instance.
Let’s start by breaking down each part of the text we’re looking for.
First, we know that in each string, there is a reference to “RPA90X”. Since we constantly reference this
specific text, we can use this exact text in our regular expression.
Second, we know that the text we are looking for comes after “RPA90X”. Conversely, this means that
“RPA90X” is behind the text we’re trying to extract. This fact is very important.
It should be obvious that we can use a grouping construct here but how do we choose the right one? If
we look back at the table on grouping constructs in section 2.4, we’ll notice that each grouping construct
has a typical use case and may also have an alternate use case. We should first check if our situation falls
within any of the typical use cases since these cases are usually more straight-forward when it comes to
the use of the grouping construct. If our situation doesn’t apply to any of the typical cases, then we
should check the alternate cases.
Luckily, our situation matches the typical use case of the zero-width positive lookbehind assertion:
• (?<=subexpression)
In the typical use case for this assertion, the text that we’re looking for must be immediately preceded
by the text matched by the subexpression, but our final output will not include the subexpression itself.
If we make the subexpression in this assertion “RPA90X” and add another regex to capture the first three
words after this assertion, we will get our desired result.
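We can capture the first 3 words after “RPA90X” with the regex:
• (\S+\s){3}
This regex captures one or more non-whitespace characters ( \S+ ), followed by a whitespace character
( \s ), exactly 3 times ( {3} ). Using \S rather than [A-Za-z] lets a word keep attached punctuation, such
as the “!” in “all!”.
Our final regular expression, with the space that follows “RPA90X” included in the lookbehind, is:
• (?<=RPA90X\s)(\S+\s){3}
The matches found by this regular expression are:
• “was built for ”
• “covers it all! ”
• “program was designed ”
Each match ends with the whitespace that follows the third word; where the source text wraps, that
whitespace is a line break rather than a space.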
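The lookbehind approach can be sketched in Python as follows. Assumptions: Python's re module stands in for .NET (Python only allows fixed-width lookbehinds, which RPA90X\s satisfies), the input is a shortened excerpt of the passage with line breaks replaced by spaces, and the word pattern is written as \S+ (any run of non-whitespace characters) inside a non-capturing group.

```python
import re

# Shortened excerpt of the passage above; line breaks replaced by spaces.
text = (
    "RPA90X was built for just these reasons. "
    "to development and testing; RPA90X covers it all! "
    "The RPA90X program was designed to get in and get real results quickly"
)

# The lookbehind requires "RPA90X" plus one whitespace character just before
# the match, without including them in the result.
pattern = re.compile(r"(?<=RPA90X\s)(?:\S+\s){3}")
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
# ['was built for ', 'covers it all! ', 'program was designed ']
```

Calling .strip() on each match removes the trailing whitespace if only the words are wanted.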
Advanced RegEx
Advanced Functionality
Regex has additional functionality that lets you create more powerful expressions. Backreference
constructs, alternation constructs, substitutions, and options can all prove useful: they
allow you to refer back to a specific capturing group within a regular expression, add
conditionality to your expressions, or even change how the regular expression processes the
text or the regular expression itself.
4.1 Backreference Constructs
Backreference constructs provide a couple of ways to refer back to a capturing group within a
regular expression. This may make it easier to identify multiple occurrences of a character or
substring.
Backreference construct: \number
  Where number is a positive integer.
Description: Numbered backreference. The number in this construct represents the position of a capturing group within a regular expression. It’s like an index given to each capturing group. The backreference matches the same text that the group captured (not merely the same pattern).
*Non-capturing groups can’t be referred to by this construct.
Example:
  Text: “Save 15% on car insurance and 15% on home insurance.”
  Regex: (\d{2}%)(.+)\s\1
  Match: “15% on car insurance and 15%”

Backreference construct:
  First assign the name: (?<name>subexpression)
  Then make the backreference: \k<name> or \k'name'
Description: Named backreference. The name in this construct is like a variable name in a programming language. After naming the subexpression, you can require the text it captured to appear again in the regular expression by using the backreference.
Example:
  Text: “The British are coming! The British are coming! Hide your kids and hide your wives.”
  Regex: (?<alert>(.{22}\!\s))\k<alert>
  Match: “The British are coming! The British are coming! ”

Backreference construct:
  Assign the name (not mandatory): (?<number>subexpression)
  Make the backreference: \k<number>
Description: Named numeric backreference. If a subexpression is named with the number, this construct works the same as \k<name>. Otherwise, it functions like \number, where the number denotes the position of the captured group.
Example:
  Text: “The British are coming! The British are coming! Hide your kids and hide your wives.”
  Regex: (.{22}\!\s)\k<1>
  Match: “The British are coming! The British are coming! ”
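Here is a small sketch of numbered and named backreferences. It assumes Python's re module, whose syntax differs from the .NET constructs in the table: a named group is written (?P<name>...) and its backreference (?P=name), while \1 works the same way in both. The variable names are illustrative.

```python
import re

ad = "Save 15% on car insurance and 15% on home insurance."
# \1 must match the exact text that group 1 captured ("15%").
numbered = re.search(r"(\d{2}%)(.+)\s\1", ad)
print(numbered.group(0))  # 15% on car insurance and 15%

alarm = ("The British are coming! The British are coming! "
         "Hide your kids and hide your wives.")
# Python spelling of (?<alert>...)\k<alert>.
named = re.search(r"(?P<alert>.{22}!\s)(?P=alert)", alarm)
print(repr(named.group(0)))
# 'The British are coming! The British are coming! '
print(repr(named.group("alert")))
# 'The British are coming! '
```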
4.3 Substitutions
Substitutions are elements used in replacement functions of regular expressions. These
elements are ONLY valid in replacement patterns (they won’t work in regular expression
patterns).
All examples below use the text “Pepperidge Farm remembers.”, in which the regex matches “Farm”.

Substitution: $number
  Where number is a positive integer.
Description: Indicates that the capture group referenced by “number” is inserted into the replacement string.
Example:
  Regex: (F)(ar)(m)
  Substitution: $1$3
  New Text: “Pepperidge Fm remembers.”

Substitution: ${name}
Description: Indicates that the capture group referenced by “name” is inserted into the replacement string.
Example:
  Regex: (F)(?<rifle>ar)(m)
  Substitution: ${rifle}
  New Text: “Pepperidge ar remembers.”

Substitution: $$
Description: Inserts a literal dollar sign into the replacement string.
Example:
  Regex: (F)(ar)(m)
  Substitution: $$
  New Text: “Pepperidge $ remembers.”

Substitution: $&
Description: Inserts the entire regex match into the replacement string.
Example:
  Regex: (F)(ar)(m)
  Substitution: $&$&
  New Text: “Pepperidge FarmFarm remembers.”

Substitution: $`
Description: Inserts all of the text BEFORE the match into the replacement string.
Example:
  Regex: (F)(ar)(m)
  Substitution: $`
  New Text: “Pepperidge Pepperidge  remembers.” (the inserted text “Pepperidge ” includes its trailing space)

Substitution: $'
Description: Inserts all of the text AFTER the match into the replacement string.
Example:
  Regex: (F)(ar)(m)
  Substitution: $'
  New Text: “Pepperidge  remembers. remembers.” (the inserted text “ remembers.” includes its leading space)

Substitution: $+
Description: Inserts the last group captured into the replacement string.
Example:
  Regex: (F)(ar)(m)
  Substitution: $+
  New Text: “Pepperidge m remembers.”

Substitution: $_
Description: Inserts the entire input string into the replacement string.
Example:
  Regex: (F)(ar)(m)
  Substitution: $_
  New Text: “Pepperidge Pepperidge Farm remembers. remembers.”
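A few of these substitutions can be sketched in Python. This assumes Python's re.sub, whose replacement syntax differs from .NET: group references are written \1 or \g<name> instead of $1 or ${name}, and \g<0> plays the role of $&. Python has no direct spelling for $` or $', so the last example rebuilds $' with a replacement function (text_after is a made-up helper name).

```python
import re

text = "Pepperidge Farm remembers."

# $1$3 in .NET: keep groups 1 and 3, drop group 2.
first_and_third = re.sub(r"(F)(ar)(m)", r"\1\3", text)

# ${rifle} in .NET: insert the group named "rifle".
named = re.sub(r"(F)(?P<rifle>ar)(m)", r"\g<rifle>", text)

# $&$& in .NET: insert the whole match twice.
doubled = re.sub(r"(F)(ar)(m)", r"\g<0>\g<0>", text)

def text_after(match):
    """Return everything after the match, emulating .NET's $' substitution."""
    return match.string[match.end():]

after = re.sub(r"(F)(ar)(m)", text_after, text)

print(first_and_third)  # Pepperidge Fm remembers.
print(named)            # Pepperidge ar remembers.
print(doubled)          # Pepperidge FarmFarm remembers.
print(repr(after))      # 'Pepperidge  remembers. remembers.'
```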
Let’s break down each part of the RegEx that we need to capture these phone numbers.
Step 1: How do we handle the possibility of the country code “1” appearing in a phone number? We
make it optional.
Try the pattern 1? on the string “123 321”: both 1’s are captured by this pattern.
Step 2: Next, we notice that the country code is followed by a space, but not every number has that
space. We just used an optional match on 1; how can we do the same with blank spaces?
Try the pattern \s? on the same string as before: it captures the blank space between the
numbers. If we combine Steps 1 and 2 (1?\s?), we get all the ones and the blank space.
Step 3: The next thing we need to handle is the 3 digits of the area code, which may be wrapped in an
opening parenthesis “(” and a closing parenthesis “)”.
Try it on “(429)966-8125”: the optional opening parenthesis can be captured with \(?, the 3 digits
with (\d{3}), and the optional closing parenthesis with \)?. Putting together everything we have
done so far gives the pattern: 1?\s?\(?(\d{3})\)?
Step 4: Next we handle the separator between the area code and the rest of the phone number. Earlier
we handled a space with \s?; the other possible character here is the hyphen “-”, so we use [\s-]? to
match a space, a hyphen, or nothing.
Step 5: The rest of the phone number is straightforward: 3 digits followed by nothing, a space, or a
hyphen, followed by 4 digits. Translated to RegEx, that is \d{3}[\s-]?\d{4}
Our final pattern is: 1?\s?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}
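The final pattern can be exercised with a short Python sketch. Assumptions: Python's re module stands in for .NET, and apart from “(429)966-8125” the sample numbers are hypothetical variations made up to cover each optional part. fullmatch is used so that a sample only passes if the whole string fits the pattern.

```python
import re

phone = re.compile(r"1?\s?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}")

samples = [
    "(429)966-8125",   # from the walkthrough above
    "1 429 966 8125",  # hypothetical: country code, spaces as separators
    "429-966-8125",    # hypothetical: hyphens only
    "4299668125",      # hypothetical: no separators at all
    "966-8125",        # hypothetical: no area code, should not match
]

# Map each sample to its captured area code, or None when it does not match.
area_codes = {}
for sample in samples:
    m = phone.fullmatch(sample)
    area_codes[sample] = m.group(1) if m else None
    print(sample, "->", area_codes[sample])
```

Group 1 holds the area code because (\d{3}) is the only capturing group in the pattern.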
In this example we want to capture each email address whole, so we will build a simple email-matching
regex. Let’s start by assuming the base case of an email in the format “foo@bar.com”.
We can quickly handle this base case using this:
[a-z]+@[a-z]+\.[a-z]{2,3}
Breaking this up into parts:
[a-z]+ : We want at least one letter in front of the @ for the alias, user, group, or department.
@ : The at sign is required in any email.
[a-z]+ : Grabs the beginning of the domain; again we want a length of at least one.
\. : The dot (.) we need in the domain.
[a-z]{2,3} : The domain suffix; here we allow two or three letters.
From this we can expand to cover simple emails that contain numbers:
[a-z0-9]+@[a-z0-9]+\.[a-z]{2,3}
Special characters can be added to the part before the @ as well:
[a-z0-9+\.-]+@[a-z0-9]+\.[a-z]{2,3}
Now suppose we add another email to match, such as “tea+biscuits@verybritish.co.uk”.
Our current pattern stops after “.co” and does not capture the “.uk”.
How can we capture this email while still getting the ones from before?
We can add an optional group at the end of our current pattern:
[a-z0-9+\.-]+@[a-z0-9]+\.[a-z]{2,3}(\.[a-z]{2,3})?
While this matches all our emails, there is still much room for improvement for emails that must meet
different standards.
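The effect of the optional group can be seen by running both versions of the pattern. This is a sketch assuming Python's re module in place of .NET; the sentence wrapped around the two addresses is made up for the demonstration.

```python
import re

text = "Write to foo@bar.com or tea+biscuits@verybritish.co.uk today."

# Pattern before adding the optional second suffix.
base = re.compile(r"[a-z0-9+\.-]+@[a-z0-9]+\.[a-z]{2,3}")
# Pattern with the optional group for a second suffix such as ".uk".
extended = re.compile(r"[a-z0-9+\.-]+@[a-z0-9]+\.[a-z]{2,3}(\.[a-z]{2,3})?")

base_found = base.findall(text)
# finditer + group(0) returns whole matches even though the pattern has a group.
extended_found = [m.group(0) for m in extended.finditer(text)]

print(base_found)      # ['foo@bar.com', 'tea+biscuits@verybritish.co']
print(extended_found)  # ['foo@bar.com', 'tea+biscuits@verybritish.co.uk']
```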
5.3 Matching Header Information inside an Email
The goal of this example is, given an email, to grab all the header information without capturing the
body of the email. The Cc and Subject lines will be optional in the regex.
From: Happy Bot
Sent: Wednesday, March 20, 2019 3:21PM
To: Happy Bot
Cc: Happy Bot
Subject: Hello, Happy Bot!
Greetings,
He do subjects prepared bachelor juvenile ye oh. He feelings removing informed he as ignorant we
prepared. Evening do forming observe spirits is in. Country hearted be of justice sending. On so they as
with room cold ye. Be call four my went mean. Celebrated if remarkably especially an. Going eat set she
books found met aware.
Thanks,
Happy Bot
Bot of Happiness
There are different ways to tackle this problem, but we will use the multiline option for regex to develop
our solution. We will break down the following RegEx, which solves our stated problem.
^(From.+)$\n^(Sent.+)$\n^(To.+)$\n(^(Cc.+)$\n)?(^(Subject.+)$\n)?
Because the header is laid out one field per line, the anchors (^ and $) together with the newline
character (\n) are a natural fit for this regex.
^(From.+)$ Our first group is the first line we need. We use both anchors, at the start and the end,
and we capture “From” plus everything that follows it on that line.
\n is required here to move the match on to the next line of the email; it gives us a simple way to
step down through the text.
^(Sent.+)$\n^(To.+)$\n We create two more groups, one for each required line, using the same
pattern.
The next two parts are the optional lines of the email heading. We repeat the same grouping with
anchors for each line, but wrap each one in an outer group that is made optional by our friendly “?”
quantifier.
(^(Cc.+)$\n)? This will get the Cc line if it exists, and the same goes for
(^(Subject.+)$\n)? for the subject line.
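The header regex can be tried with the sketch below. Assumptions: Python's re module with re.MULTILINE stands in for the .NET multiline option, the message uses plain \n line endings, and the body is shortened. Groups 5 and 7 are the inner Cc and Subject groups, because the optional wrappers take numbers 4 and 6.

```python
import re

message = (
    "From: Happy Bot\n"
    "Sent: Wednesday, March 20, 2019 3:21PM\n"
    "To: Happy Bot\n"
    "Cc: Happy Bot\n"
    "Subject: Hello, Happy Bot!\n"
    "Greetings,\n"
    "He do subjects prepared bachelor juvenile ye oh.\n"
)

header = re.compile(
    r"^(From.+)$\n^(Sent.+)$\n^(To.+)$\n(^(Cc.+)$\n)?(^(Subject.+)$\n)?",
    re.MULTILINE,
)

full = header.search(message)
print(full.group(1))  # From: Happy Bot
print(full.group(5))  # Cc: Happy Bot
print(full.group(7))  # Subject: Hello, Happy Bot!

# Same message without the optional Cc line: group 5 is simply None.
no_cc = header.search(message.replace("Cc: Happy Bot\n", ""))
print(no_cc.group(5))  # None
print(no_cc.group(7))  # Subject: Hello, Happy Bot!
```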
Assignment ~30 mins
The following assignment covers some basic RegEx questions so you can gain some hands-on
experience. There are four (4) questions, which need to be answered using UiPath Studio. The
program requires two inputs from you, the user: the question to select and the RegEx pattern
for the selected question. Provided along with the program is the PDF (Article_1.pdf). The scraping has
already been done; you can find the extracted text in the Text Examples folder.
Change “Input” to match the question you are working on.
Questions:
1. Count the number of vowels in the first Headline (SuperEasy Ways To Learn Everything).
Note: Caps Insensitive, reference is Headline.txt
3. Rewrite the date format to be Month Day, Year. (RegEx needed to grab the “dd MONTH yyyy”)
Note: Caps Sensitive, reference is ShortText.txt
For each question provide a screenshot of the Input Number, the RegEx pattern, and the correct popup
Example Image:
Download the Program using this Link: Accelirate RegEx Assignment
Quick References
Microsoft .Net Documentation: .NET Regular Expressions
Video Link that goes over the basics: Regular Expressions (Regex) Tutorial: How to Match Any Pattern of
Text
Details about the Activity that uses Regex: UiPath Activity (matches)
Regex Online Tester: regexstorm.net/tester
Regex Online Tester: regex101
regular-expressions.info:
• regular-expressions.info/lookaround
• regular-expressions.info/engine