
Metacharacters

1. The 12 punctuation characters that make regular expressions work their magic are: $ ( ) * + . ? [ \ ^ { |
2. Notably absent from the list are ], -, and }. The first two become metacharacters only after an unescaped [, and } only after an unescaped {.
3. If you want your regex to match a metacharacter literally, escape it by placing a backslash in front of it.
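As a rough illustration of point 3, the JavaScript sketch below escapes the 12 metacharacters so that arbitrary text is matched literally. The helper name escapeRegex is made up for this example; it is not a built-in function.

```javascript
// Hypothetical helper (not a built-in): puts a backslash in front of each
// of the 12 metacharacters $ ( ) * + . ? [ \ ^ { |
function escapeRegex(text) {
  return text.replace(/[$()*+.?[\\^{|]/g, "\\$&");
}

const literalPattern = new RegExp("^" + escapeRegex("1+1=2?") + "$");

console.log(escapeRegex("1+1=2?"));          // 1\+1=2\?
console.log(literalPattern.test("1+1=2?"));  // true
console.log(literalPattern.test("11=2"));    // false
```

Without the escaping step, 1+1=2? would be read as a regex and would match 11=2 instead of the literal text.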

Matching literal string


Any regular expression that does not include any of the dozen characters $()*+.?[\^{| simply matches itself.
By default, regular expressions are case sensitive: regex matches regex but not Regex, REGEX, or ReGeX.
Turn on case insensitivity:
- .NET: use the (?i) mode modifier, as in (?i)regex, or local mode modifiers, as in sensitive(?i)caseless(?-i)sensitive
- JavaScript: set the /i flag when creating the regular expression

Matching non printable characters


Representation  Meaning          Hex   Flavors
\a              bell             0x07  .NET
\e              escape           0x1B  .NET
\f              form feed        0x0C  .NET, JScript
\n              new line         0x0A  .NET, JScript
\r              carriage return  0x0D  .NET, JScript
\t              horizontal tab   0x09  .NET, JScript
\v              vertical tab     0x0B  .NET, JScript

Variations:
- Using \cA through \cZ, you can match one of the 26 control characters that occupy positions 1 through 26 in the ASCII table.
- A lowercase \x followed by two hexadecimal digits (for example, \x0A) matches a single character in the ASCII set.

Matching [$"'\n\d/\\] :

C# - "[$\"'\n\\d/\\\\]"
- In a normal string, double quotes and backslashes must be escaped with a backslash.
Note: "\n" is a string with a literal line break, which free-spacing mode ignores as whitespace outside a character class. "\\n" is a string with the regex token \n, which matches a newline.

C# - @"[$""'\n\d/\\]"
- In a verbatim string, backslashes are not escaped; to include a double quote, double it up.
Note: @"\n" is always the regex token \n, which matches a newline; verbatim strings do not support \n at the string level.

JavaScript - /[$"'\n\d\/\\]/
- Simply place your regular expression between two forward slashes.
- If any forward slashes occur within the regular expression itself, escape them with a backslash.
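For comparison, here is a small JavaScript sketch that builds the same character class twice: once as a regex literal and once from a string passed to the RegExp constructor. In the string form the backslashes are doubled, much like in a normal C# string, and the forward slash needs no escape. The variable names are illustrative only.

```javascript
// The same character class written two ways.
const fromLiteral = /[$"'\n\d\/\\]/;
const fromString = new RegExp("[$\"'\\n\\d/\\\\]");

// One sample per member of the class: $ " ' newline digit / backslash
const samples = ["$", "\"", "'", "\n", "7", "/", "\\"];
const allMatch = samples.every(
  (ch) => fromLiteral.test(ch) && fromString.test(ch)
);

console.log(allMatch);               // true
console.log(fromLiteral.test("a"));  // false
console.log(fromString.test("a"));   // false
```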

Creating Regular Expression Objects


C#:
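try {
    // userInput is a string variable holding the pattern
    Regex regexObj = new Regex(userInput, RegexOptions.Compiled);
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

Note: a regex constructed with RegexOptions.Compiled can run up to 10 times faster than one constructed without this option, because the regular expression is compiled down to CIL; constructing it takes longer.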

JavaScript:
var myregexp = /regex pattern/;
or
var myregexp = new RegExp(userinput);
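The C# snippet guards against invalid patterns with a try/catch. A comparable JavaScript sketch is shown below, assuming the pattern arrives as a string; the RegExp constructor throws a SyntaxError for a malformed pattern. The helper name compileUserRegex is hypothetical.

```javascript
// Hypothetical helper: returns a RegExp, or null when the pattern
// has a syntax error.
function compileUserRegex(userinput) {
  try {
    return new RegExp(userinput);
  } catch (err) {
    if (err instanceof SyntaxError) {
      return null; // syntax error in the regular expression
    }
    throw err; // anything else is unexpected, so pass it on
  }
}

console.log(compileUserRegex("a(") === null);      // true
console.log(compileUserRegex("a+").test("caat"));  // true
```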

Match One of Many Characters


[ ] - a character class matches a single character out of a list of possible characters
^ (caret) - negates the character class if you place it immediately after the opening bracket
- (hyphen) - creates a range when it is placed between two characters (order given by the ASCII or Unicode character table)
Examples:
o Hexadecimal character: [a-fA-F0-9]
o Nonhexadecimal character: [^a-fA-F0-9]
o Character group: [aeiou]
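The three example classes can be tried out directly in JavaScript. This is only a quick sketch; the variable names are not part of the original material.

```javascript
const hexChar = /[a-fA-F0-9]/;      // one hexadecimal character
const nonHexChar = /[^a-fA-F0-9]/;  // negated class
const vowel = /[aeiou]/;            // one character out of a list

console.log(hexChar.test("F"));        // true
console.log(hexChar.test("g"));        // false
console.log(nonHexChar.test("g"));     // true
console.log("regex".match(vowel)[0]);  // e
```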

Shorthands
Six regex tokens that consist of a backslash and a letter form shorthand character classes. Each lowercase shorthand has an associated uppercase shorthand with the opposite meaning.

Token  Matches                                                                  Opposite
\d     a single digit                                                           \D (same as [^\d])
\w     a single word character                                                  \W
\s     any whitespace character (this includes spaces, tabs, and line breaks)   \S

Note - In JavaScript \w is always identical to [a-zA-Z0-9_]. In .NET it includes letters and digits from all other scripts (Cyrillic, Thai, etc.).

Matching any character


Solution  Matches                               Flavor         Notes
.         any character, except line breaks     .NET, JScript  .NET: the dot matches line breaks option must not be set
.         any character, including line breaks  .NET           the dot matches line breaks option must be set [1] (RegexOptions.Singleline)
[\s\S]    any character, including line breaks  JScript [2]

[1] You can also place a mode modifier at the start of the regular expression: (?s) is the mode modifier for dot matches line breaks mode in .NET.
[2] An alternative solution is needed for classic JavaScript, which doesn't have a dot matches line breaks option ([\d\D] and [\w\W] have the same effect). ES2018 added the /s flag for this purpose.
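A short JavaScript sketch of the difference between the dot and [\s\S] on a subject that contains a line break; the sample text is made up for illustration.

```javascript
const subject = "first line\nsecond line";

// The dot stops at the line break.
const withDot = subject.match(/.+/)[0];

// [\s\S] matches any character, including the line break.
const withClass = subject.match(/[\s\S]+/)[0];

console.log(withDot);                // first line
console.log(withClass === subject);  // true
```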

Match Something at the Start and/or the End of a Line (1)


Anchors - ^, $, \A, \Z, and \z - match at certain positions, effectively anchoring the regular expression match at those positions:

\A (.NET)
  Matches at the very start of the subject text, before the first character (to test whether the subject text begins with the text you want to match).
  Note: the A must be uppercase.

^ (.NET, JScript)
  Equivalent to \A, as long as you do not turn on the ^ and $ match at line breaks option (RegexOptions.Multiline in .NET); otherwise it matches at the very start of each line.

\Z, \z (.NET)
  Match at the very end of the subject text, after the last character (to test whether the subject text ends with the text you want to match).
  The difference between \Z and \z shows when the last character in your subject text is a line break. In that case, \Z can match at the very end of the subject text, after the final line break, as well as immediately before that line break; \z matches only at the very end.

$ (.NET, JScript)
  Equivalent to \Z, as long as you do not turn on the ^ and $ match at line breaks option (RegexOptions.Multiline in .NET); otherwise it also matches at the very end of each line.

Match Something at the Start and/or the End of a Line (2)


Examples:
- ^alpha (.NET, JavaScript) matches alpha at the start of the subject text if ^ and $ match at line breaks is not set, or at the start of each line otherwise
- \Aalpha (.NET) matches alpha at the start of the subject text
- omega$ (.NET, JavaScript) matches omega at the end of the subject text if ^ and $ match at line breaks is not set, or at the end of each line otherwise
- omega\Z (.NET) matches omega at the end of the subject text

Match Something at the Start and/or the End of a Line (3)


Combining two anchors:
- \A\Z matches the empty string, as well as the string that consists of a single newline
- \A\z matches only the empty string
- ^$ matches each empty line in the subject text (in ^ and $ match at line breaks mode)
Note - In .NET, if you cannot turn on ^ and $ match at line breaks mode outside the regular expression, you can place the (?m) mode modifier at the start of the regular expression.
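The effect of the ^ and $ match at line breaks option can be sketched in JavaScript with the /m flag. The sample text below is invented for the example and contains one empty line.

```javascript
// Four lines: "alpha", "", "alpha omega", "omega"
const text = "alpha\n\nalpha omega\nomega";

const startOnly = text.match(/^alpha/g);    // ^ only at the start of the text
const everyLine = text.match(/^alpha/gm);   // ^ at the start of each line
const endOnly = text.match(/omega$/g);      // $ only at the end of the text
const everyEnd = text.match(/omega$/gm);    // $ at the end of each line
const emptyLines = text.match(/^$/gm);      // each empty line

console.log(startOnly.length);   // 1
console.log(everyLine.length);   // 2
console.log(endOnly.length);     // 1
console.log(everyEnd.length);    // 2
console.log(emptyLines.length);  // 1
```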

Regular Expression Options (C#)


None - Specifies that no options are set.
IgnoreCase - Specifies case-insensitive matching.
Multiline - Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string. (Caret and dollar match at line breaks)
ExplicitCapture - Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>...). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:...).
Compiled - Specifies that the regular expression is compiled to an assembly. This yields faster execution but increases startup time.
Singleline - Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n). (Dot matches line breaks)
IgnorePatternWhitespace - Eliminates unescaped white space from the pattern and enables comments marked with #. (Free-spacing)
RightToLeft - Specifies that the search will be from right to left instead of from left to right.
ECMAScript - Enables ECMAScript-compliant behavior for the expression. This value can be used only in conjunction with the IgnoreCase, Multiline, and Compiled values; using it with any other value results in an exception. (JavaScript flavor) The most important effect is that with this option, \w and \d are restricted to ASCII characters, as they are in JavaScript.
CultureInvariant - Specifies that cultural differences in language are ignored.

Setting Regular Expression Options


C#
Regex regexObj = new Regex("regex pattern",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase |
    RegexOptions.Singleline | RegexOptions.Multiline);

JavaScript
var myregexp = /regex pattern/im;

Regex Options
1. Free-spacing: Not supported by JavaScript.
2. Case insensitive: /i
3. Dot matches line breaks: Not supported by JavaScript.
4. Caret and dollar match at line breaks: /m
5. Additional language-specific options: apply a regular expression repeatedly to the same string: /g

Test Whether a Match Can Be Found Within a Subject String


C#:
bool foundMatch = false;
try {
    foundMatch = Regex.IsMatch(subjectString, UserInput);
} catch (ArgumentNullException ex) {
    // Cannot pass null as the regular expression or subject string
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

or

bool foundMatch = Regex.IsMatch(subjectString, "regex pattern");

Note: @"\Aregex pattern\Z" - the regex must match the subject string entirely

JavaScript:
if (/regex pattern/.test(subjectString)) {
    // Successful match
} else {
    // Match attempt failed
}

Note: /^regex pattern$/.test(subjectString) - the regex must match the subject string entirely

Retrieve the Matched Text


C#:
Regex regexObj = new Regex(@"\d+");
string resultString = regexObj.Match(subjectString).Value;

Note:
1. regexObj.Match("123456", 3, 2) tries to find a match in "45"
2. regexObj.Match(subjectString).Index - position of the match in the subject string
3. regexObj.Match(subjectString).Length - length of the match

JavaScript:
var result = subject.match(/\d+/);
if (result) {
    result = result[0];
} else {
    result = '';
}

var matchstart = -1;
var matchlength = -1;
var match = /\d+/.exec(subject);
if (match) {
    matchstart = match.index;
    matchlength = match[0].length;
}

Match Whole Words


\b - word boundary - matches at the start or the end of a word, in three positions:
- before the first character in the subject, if the first character is a word character
- after the last character in the subject, if the last character is a word character
- between two characters in the subject, where one is a word character and the other is not
Example: \bdog\b - The first \b requires the d to occur at the very start of the string, or after a nonword character. The second \b requires the g to occur at the very end of the string, or before a nonword character (line break characters are nonword characters). It matches dog in My dog is stupid, but not in I will build a doghouse.
\B matches at every position in the subject text where \b does not match, that is, at every position that is not at the start or end of a word.
Example: \Bcat\B matches cat in scatter, but not in My cat is lazy, category, or bobcat.
Note: to match cat everywhere except as a whole word, use alternation to combine \Bcat and cat\B into \Bcat|cat\B.
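The word boundary examples above can be checked with a few lines of JavaScript. This is just a sketch that reuses the sample sentences from the text.

```javascript
const wholeWord = /\bdog\b/;
console.log(wholeWord.test("My dog is stupid"));         // true
console.log(wholeWord.test("I will build a doghouse"));  // false

const insideWord = /\Bcat\B/;
console.log(insideWord.test("scatter"));         // true
console.log(insideWord.test("My cat is lazy"));  // false
console.log(insideWord.test("category"));        // false
console.log(insideWord.test("bobcat"));          // false

// cat anywhere except as a whole word
const notWholeWord = /\Bcat|cat\B/;
const partial = ["scatter", "category", "bobcat"].every(
  (word) => notWholeWord.test(word)
);
console.log(partial);                              // true
console.log(notWholeWord.test("My cat is lazy"));  // false
```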

Unicode Code Points, Properties, Blocks, and Scripts (1)


\u2122 - Unicode code point (.NET, JScript)
  A code point is one entry in the Unicode character database (\u2122 is the trademark sign). The \u syntax requires exactly four hexadecimal digits (U+0000 through U+FFFF).

\p{Sc} - Unicode property or category (.NET)
  \p{L} - Any kind of letter from any language
  \p{M} - A character intended to be combined with another character (accents etc.)
  \p{Z} - Any kind of whitespace or invisible separator
  \p{S} - Math symbols, currency signs etc.
  \p{N} - Any kind of numeric character in any script
  \p{P} - Any kind of punctuation character
  \p{C} - Invisible control characters and unused code points

\p{IsGreekExtended} - Unicode block (.NET)
  Other flavors write blocks as \p{InBasic_Latin}, \p{InGreek_and_Coptic}, \p{InCyrillic}, \p{InKatakana} etc.; .NET uses the Is prefix without underscores, as in \p{IsBasicLatin}, \p{IsCyrillic}, and \p{IsKatakana}.

\P{M}\p{M}* - Unicode grapheme (.NET)
  Matches a base character together with its combining marks: à can be encoded as the single code point \u00E0 or as \u0061\u0300 (a followed by a combining grave accent).

Unicode Code Points, Properties, Blocks, and Scripts (2)


The uppercase \P is the negated variant of the lowercase \p.
Example: \P{Sc} matches any character that does not have the Currency Symbol Unicode property.
The classic JavaScript flavor does not support Unicode categories, blocks, or scripts (ES2018 added \p{...} for regexes that set the /u flag); instead, you can list the characters that are in the category, block, or script in a character class.
Alternative versions for:
- Blocks: [\u1F00-\u1FFF] in place of \p{IsGreekExtended}
- Categories: create a character class with all the code points from the specific category
See also: http://www.unicode.org/

Character class subtractions in .NET


General form: [class-[subtract]]
Examples:
1. [a-zA-Z0-9-[g-zG-Z]] matches a single hexadecimal character (a to f, A to F, or 0 to 9)
2. [\p{IsThai}-[\P{N}]] matches any of the 10 Thai digits:
   \p{IsThai} matches any character in the Thai block
   \P{N} matches any character that doesn't have the Number property

Match One of Several Alternatives


The vertical bar, or pipe symbol, splits the regular expression into multiple alternatives.
Example: apply Mary|Jane|Sue to Mary, Jane, and Sue went to Mary's house - the match Mary is immediately found at the start of the string.
The order of the alternatives in the regex matters only when two of them can match at the same position in the string. For example, Jane|Janet finds only Jane in Janet, because the first alternative that matches wins. The solution is to put the most general alternative last (Janet|Jane).
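A brief JavaScript sketch of how the order of alternatives affects the result. The Jane/Janet subject is an invented example; the first subject is the sentence from the text.

```javascript
const sentence = "Mary, Jane, and Sue went to Mary's house";
const names = sentence.match(/Mary|Jane|Sue/g);
console.log(names.join(" "));  // Mary Jane Sue Mary

// Both alternatives can match at the same position, so order matters.
const subject = "Her name is Janet";
const generalFirst = subject.match(/Jane|Janet/)[0];
const generalLast = subject.match(/Janet|Jane/)[0];
console.log(generalFirst);  // Jane
console.log(generalLast);   // Janet
```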

Group and Capture Parts of the Match


A capturing group is a pair of parentheses that captures the part of the subject text matched by the part of the regular expression inside it.
Example: \b(\d\d\d\d)-(\d\d)-(\d\d)\b
1. Has three capturing groups: (\d\d\d\d), (\d\d), and (\d\d)
2. During the matching process, the regular expression engine stores the part of the text matched by each capturing group
Applied to the subject string 2012-10-02, the groups hold 2012, 10, and 02.

Noncapturing groups: (?: opens a noncapturing group.
You can specify mode modifiers on it, for example (?i: ) for a case-insensitive noncapturing group (mode modifiers are not available in the JScript flavor).
Benefits:
- You can add them to an existing regex without upsetting the references to numbered capturing groups
- Performance: a capturing group adds unnecessary overhead that you can eliminate by using a noncapturing group

Note: parts of the match can be named: \b(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)\b or \b(?'year'\d\d\d\d)-(?'month'\d\d)-(?'day'\d\d)\b (.NET; JavaScript accepts only the first form, and only since ES2018).
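To make the group numbering concrete, the sketch below runs the date regex in JavaScript and reads the numbered groups from the result of exec(). The subject string is an assumption chosen to fit the pattern; the second regex shows that a noncapturing group does not take up a group number.

```javascript
const dateRegex = /\b(\d\d\d\d)-(\d\d)-(\d\d)\b/;
const match = dateRegex.exec("Released on 2012-10-02.");

console.log(match[0]);  // 2012-10-02 (overall match)
console.log(match[1]);  // 2012
console.log(match[2]);  // 10
console.log(match[3]);  // 02

// With a noncapturing group for the year, the month becomes group 1.
const monthDay = /\b(?:\d\d\d\d)-(\d\d)-(\d\d)\b/.exec("Released on 2012-10-02.");
console.log(monthDay[1]);      // 10
console.log(monthDay.length);  // 3
```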

Match Previously Matched Text Again


Steps:
1. Capture text in a group
2. Match the same text again anywhere later in the regex using a backreference (a backslash followed by the group number)
Example: \b\d\d(\d\d)-\1-\1\b matches 2010-10-10, 2011-11-11, 2012-12-12 etc.
Note: you can use a named backreference: \b\d\d(?<magic>\d\d)-\k<magic>-\k<magic>\b
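A quick JavaScript check of the backreference regex: the month and day must repeat the last two digits of the year, because \1 matches exactly the text captured by the group. The sample dates are chosen for illustration.

```javascript
const magicDate = /\b\d\d(\d\d)-\1-\1\b/;

console.log(magicDate.test("2012-12-12"));  // true
console.log(magicDate.test("2010-10-10"));  // true
console.log(magicDate.test("2012-10-10"));  // false (group 1 holds 12, not 10)

console.log(magicDate.exec("2011-11-11")[1]);  // 11
```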

Retrieve Part of the Matched Text


C#:
string resultString = Regex.Match(subjectString, "http://([a-z0-9.-]+)").Groups[1].Value;

With a named group:
string resultString = Regex.Match(subjectString, "http://(?<domain>[a-z0-9.-]+)").Groups["domain"].Value;

JavaScript:
var result = "";
var match = /http:\/\/([a-z0-9.-]+)/.exec(subject);
if (match) {
    result = match[1];
} else {
    result = '';
}

Retrieve a List of All Matches


C#:
Regex regexObj = new Regex(@"\d+");
MatchCollection matchlist = regexObj.Matches(subjectString);

JavaScript:
var list = subject.match(/\d+/g);

Note:
- the /g flag tells the match() function to iterate over all matches in the string and put them into an array
- with the /g flag, string.match() returns only the matched text and does not provide any further details about each match

Iterate over All Matches

C#:
Match matchResult = Regex.Match(subjectString, @"\d+");
while (matchResult.Success) {
    // Here you can process the match stored in matchResult
    matchResult = matchResult.NextMatch();
}

JavaScript:
var regex = /\d+/g;
var match = null;
while (match = regex.exec(subject)) {
    // Don't let browsers such as Firefox get stuck in an infinite loop
    if (match.index == regex.lastIndex) regex.lastIndex++;
    // Here you can process the match stored in the match variable
}

Note: exec() should set lastIndex to the first character after the match. If the match is zero characters long, the next match attempt will begin at the position of the match just found, resulting in an infinite loop.

Repeat Part of the Regex a Certain Number of Times


Token   Result
{n}     repeats the preceding regex token exactly n times
{n,m}   variable repetition (between n and m times)
{n,}    infinite repetition (n or more times)

Notes:
- \d{1,} matches one or more digits, same as \d+
- \d{0,} matches zero or more digits, same as \d*
- \d{0,1} matches zero or one digit, same as \d?
- +, *, and ? are greedy quantifiers

Examples:
- \b\d{100}\b - a decimal number with 100 digits
- \b[a-f0-9]{1,8}\b - a 32-bit hexadecimal number
- \b[a-f0-9]{1,8}h?\b - a 32-bit hexadecimal number with an optional h suffix
- \b\d*\.\d+(e\d+)? - a floating-point number with an optional integer part, a mandatory fractional part, and an optional exponent
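The hexadecimal examples can be exercised in JavaScript as follows. This is a sketch with made-up sample values; it also shows that {1,} behaves like +.

```javascript
const hex32 = /\b[a-f0-9]{1,8}\b/;
console.log(hex32.test("ff00"));       // true
console.log(hex32.test("123456789"));  // false (nine digits is one too many)

const hex32h = /\b[a-f0-9]{1,8}h?\b/;
console.log(hex32h.exec("1fh")[0]);    // 1fh

// {1,} and + are two spellings of the same quantifier.
const braces = "abc 2012".match(/\d{1,}/)[0];
const plus = "abc 2012".match(/\d+/)[0];
console.log(braces === plus);          // true
```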

Choose Minimal or Maximal Repetition (1)


Lazy quantifiers
- A lazy quantifier repeats as few times as it has to, stores one backtracking position, and allows the regex to continue; the regex goes ahead only one character at a time, each time checking whether the following text can be matched
- You can make any quantifier lazy by placing a question mark after it: *?, +?, ??, and {7,42}?
Example:
<p>The very <em>first</em> task is to find the beginning of a paragraph. </p> <p>Then you have to find the end of the paragraph </p>

<p>.*</p> vs <p>.*?</p>
The greedy <p>.*</p> matches from the first <p> to the last </p>, spanning both paragraphs; the lazy <p>.*?</p> stops at the first </p> and matches the first paragraph only.
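The difference between the greedy and the lazy version can be observed with a small JavaScript sketch. The HTML snippet is a shortened, single-line stand-in for the example text.

```javascript
const html = "<p>The very <em>first</em> task.</p> <p>Then the second.</p>";

// Greedy: .* runs to the end and backtracks to the LAST </p>.
const greedy = html.match(/<p>.*<\/p>/)[0];

// Lazy: .*? expands one character at a time and stops at the FIRST </p>.
const lazy = html.match(/<p>.*?<\/p>/)[0];

console.log(greedy === html);  // true
console.log(lazy);             // <p>The very <em>first</em> task.</p>
```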

Choose Minimal or Maximal Repetition (2)


Possessive quantifiers
- A possessive quantifier tries to repeat as many times as possible and will never give back, not even when giving back is the only way that the remainder of the regular expression could match
- Possessive quantifiers do not keep backtracking positions
- You can make any quantifier possessive by placing a plus sign after it: *+, ++, ?+, and {7,42}+
Atomic group (not available in JScript)
- A noncapturing group, with the extra job of refusing to backtrack
- The opening bracket simply consists of the three characters (?>

Possessive quantifier   Atomic group
\b\d++\b                \b(?>\d+)\b
\w++\d++                (?>\w+)(?>\d+)

Note: neither .NET nor JavaScript supports possessive quantifiers; in .NET, use the equivalent atomic group.

Test for a Match Without Adding It to the Overall Match


Lookaround - checks whether certain text can be matched without actually matching it:
1. Lookbehind
   positive: (?<=a)b matches the "b" (and only the "b") that is preceded by an "a"
   negative: (?<!a)b matches a "b" that is not preceded by an "a"
2. Lookahead
   positive: q(?=u) matches a "q" that is followed by a "u"
   negative: q(?!u) matches a "q" that is not followed by a "u"
Note: classic JavaScript supports only lookahead (lookbehind was added to the language in ES2018).
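The lookahead examples can be tried in JavaScript as sketched below; the subject string is invented. Note that the u is tested but is not part of the match.

```javascript
const subject = "Iraq quit";

// Positive lookahead: q followed by u (the q in "quit", index 5).
const followed = /q(?=u)/.exec(subject);
console.log(followed[0]);     // q
console.log(followed.index);  // 5

// Negative lookahead: q not followed by u (the q in "Iraq", index 3).
const notFollowed = /q(?!u)/.exec(subject);
console.log(notFollowed[0]);     // q
console.log(notFollowed.index);  // 3
```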

Match One of Two Alternatives Based on a Condition


(?(1)then|else) - checks whether the first capturing group has already matched something; if so, the regex engine attempts the then part, otherwise the else part (.NET; not supported by JavaScript)
Examples:
1. \b(?:(?:(one)|(two)|(three))(?:,|\b)){3,}(?(1)|(?!))(?(2)|(?!))(?(3)|(?!))
   (?(1)|(?!)) - if group 1 has matched, then the empty regex (always passes), else the empty negative lookahead (?!) (always fails)
2. (a)?b(?(1)c|d) matches abc or bd, like abc|bd

Insert Literal Text into the Replacement Text (1)


Key characters:
\ - a literal character; it does not need to be escaped
$ - needs to be escaped only when it is followed by a digit, &, `, ', _, +, or $; to escape a dollar sign, precede it with another dollar sign
Example: $%\*$1\1 => $%\*$$1\1
Note: $1 is a backreference to a capturing group and $& refers to the whole regex match; \1 is literal text in .NET and JavaScript replacement strings.

Insert Literal Text into the Replacement Text (2)


Examples:
1. Regular expression: http:\S+
   Replacement: <a href="$&">$&</a>
2. Regular expression: \b(\d{3})(\d{3})(\d{4})\b
   Replacement: ($1) $2-$3
3. Regular expression: \b(?<g1>\d{3})(?<g2>\d{3})(?<g3>\d{4})\b
   Replacement: (${g1}) ${g2}-${g3}
Note: .NET and JavaScript leave backreferences to groups that don't exist as literal text in the replacement.
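The replacement syntax can be sketched in JavaScript with String.prototype.replace. The subject strings are invented; the phone number pattern assumes a 3-3-4 digit grouping.

```javascript
// $& inserts the whole regex match.
const linked = "Visit http://example.com now".replace(
  /http:\S+/,
  '<a href="$&">$&</a>'
);
console.log(linked);  // Visit <a href="http://example.com">http://example.com</a> now

// $1, $2, $3 insert the text of the numbered capturing groups.
const phone = "1234567890".replace(/\b(\d{3})(\d{3})(\d{4})\b/, "($1) $2-$3");
console.log(phone);   // (123) 456-7890

// $$ inserts one literal dollar sign; here it is followed by $& (the match).
const price = "cost: 10".replace(/\d+/, "$$$&");
console.log(price);   // cost: $10

// A backslash is literal text in the replacement.
const literal = "x".replace(/x/, "\\1");
console.log(literal); // \1
```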

Replace All Matches


C#:
Regex regexObj = new Regex("pattern");
string resultString = regexObj.Replace(subjectString, replacement, count);
Example: regexObj.Replace(subject, replacement, 3) replaces only the first three regular expression matches; further matches are left untouched.

JavaScript:
result = subject.replace(/before/g, "after");

Note: if you want to replace all regex matches in the string, set the /g flag when creating your regular expression object; if you don't use the /g flag, only the first match will be replaced.

Replace Matches Reusing Parts of the Match

C#:
string resultString = Regex.Replace(subjectString, @"(\w+)=(\w+)", "$2=$1");

or

Regex regexObj = new Regex(@"(\w+)=(\w+)");
string resultString = regexObj.Replace(subjectString, "$2=$1");

With named groups:
Regex regexObj = new Regex(@"(?<left>\w+)=(?<right>\w+)");
string resultString = regexObj.Replace(subjectString, "${right}=${left}");

JavaScript:
result = subject.replace(/(\w+)=(\w+)/g, "$2=$1");

Replace Matches with Replacements Generated in Code

C#:
Regex regexObj = new Regex(@"\d+");
string resultString = regexObj.Replace(subjectString, new MatchEvaluator(ComputeReplacement));

public String ComputeReplacement(Match matchResult) {
    int t = int.Parse(matchResult.Value) * 2;
    return t.ToString();
}

JavaScript:
var result = subject.replace(/\d+/g, function(match) {
    return match * 2;
});

Note: the replacement function may accept one or more parameters. The first parameter will be set to the text matched by the regular expression. If the regular expression has capturing groups, the second parameter will hold the text matched by the first capturing group, the third parameter gives you the text of the second capturing group, and so on.

Split a string
C#:
string[] splitArray = Regex.Split(subjectString, "<[^<>]*>");

JavaScript:
var list = [];
var regex = /<[^<>]*>/g;
var match = null;
var lastIndex = 0;
while (match = regex.exec(subject)) {
    // Don't let browsers such as Firefox get stuck in an infinite loop
    if (match.index == regex.lastIndex) regex.lastIndex++;
    // Add the text before the match
    list.push(subject.substring(lastIndex, match.index));
    lastIndex = match.index + match[0].length;
}
// Add the remainder after the last match
list.push(subject.substring(lastIndex));

Search Line by Line

C#:
string[] lines = Regex.Split(subjectString, "\r?\n");
Regex regexObj = new Regex("regex pattern");
for (int i = 0; i < lines.Length; i++) {
    if (regexObj.IsMatch(lines[i])) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}

JavaScript:
var lines = subject.split(/\r?\n/);
var regexp = /regex pattern/;
for (var i = 0; i < lines.length; i++) {
    if (lines[i].match(regexp)) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}

Validation and Formatting (1)


Email address
^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$
(use with the case insensitive option)

International Phone Numbers


^\+(?:[0-9]\x20?){6,14}[0-9]$

Validate Traditional Date Formats


^(?:(0?2)/([12][0-9]|0?[1-9])|(0?[469]|11)/(30|[12][0-9]|0?[1-9])|(0?[13578]|1[02])/(3[01]|[12][0-9]|0?[1-9]))/((?:[0-9]{2})?[0-9]{2})$

Limit the Number of Lines in Text


^(?:(?:\r\n?|\n)?[^\r\n]*){0,5}$

Validate Affirmative Responses


^(?:1|t(?:rue)?|y(?:es)?|ok(?:ay)?)$
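As a sketch of how this pattern might be used in JavaScript, the example below applies it with the /i flag. Case insensitivity is an assumption here, so that answers such as Y or OK are accepted; the sample answers are invented.

```javascript
// The /i flag is an assumption: it lets "Y", "OK", "True" etc. pass.
const affirmative = /^(?:1|t(?:rue)?|y(?:es)?|ok(?:ay)?)$/i;

const accepted = ["1", "t", "true", "Y", "yes", "OK", "okay"].every(
  (answer) => affirmative.test(answer)
);
console.log(accepted);                 // true
console.log(affirmative.test("yep"));  // false
console.log(affirmative.test("no"));   // false
```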

Validation and Formatting (2)


Find Words Near Each Other
\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b

Remove Duplicate Lines


^(.*)(?:(?:\r?\n|\r)\1)+$ replaced with $1
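A possible JavaScript rendering of this search-and-replace, assuming the duplicate lines are adjacent (for example, after sorting). The /g flag replaces every run of duplicates and the /m flag makes ^ and $ match at line breaks. The sample text is invented.

```javascript
const text = "a\na\nb\nb\nb\nc";

// Group 1 captures a line; \1 requires the following lines to be identical.
const deduped = text.replace(/^(.*)(?:(?:\r?\n|\r)\1)+$/gm, "$1");

console.log(deduped.split("\n").join(","));  // a,b,c
```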

Validating URL
^((https?|ftp)://|(www|ftp)\.)[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$

Extracting the Query from a URL


^[^?#]+\?([^#]+)
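In JavaScript, the query ends up in the first capturing group. The sketch below uses a made-up URL and a hypothetical helper name, extractQuery.

```javascript
// Hypothetical helper: returns the query string, or "" if there is none.
function extractQuery(url) {
  const match = /^[^?#]+\?([^#]+)/.exec(url);
  return match ? match[1] : "";
}

console.log(extractQuery("http://www.example.com/path?x=1&y=2#top"));  // x=1&y=2
console.log(extractQuery("http://www.example.com/path#top") === "");   // true
```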

Validate Windows Paths


^(?:[a-z]:|\\\\[a-z0-9_.$]+\\[a-z0-9_.$]+)\\(?:[^\\/:*?"<>|\r\n]+\\)*[^\\/:*?"<>|\r\n]*$
