
3.

4
2017
Text Processing and Pattern Matching

Department of CSE,Coimbatore
Objectives
• To explore how computers can be used to create, process, and
reason about textual information.
• To understand strings and string operations.
• To learn to process textual information.
• To learn to recognize patterns.
• To understand that string literals can be rewritten as patterns.

Strings
✓Although much of the information stored on computers involves
numbers, the majority of data is textual rather than numeric.
• Your name, social security number, address, and
Facebook status are all textual in nature.
• Computer programmers use the term string when
referring to textual data.
• A string is simply a piece of text, or more formally, an
ordered sequence of individual characters.

Character
✓You can think of a single character as the result of any
one key that you press on a keyboard.
• A character is usually a letter of the alphabet,
• It might also be a punctuation symbol such as a
comma, semicolon, or question mark.
• It may even be a nonprintable character such as
a tab or a linefeed.

String Length
• The length of a string is the number of characters contained in
the string.

• No string has a negative length since it is not possible to have a
sequence that contains a negative number of characters.

String Literal
• In most programming languages, string data is denoted
by using double quotes to surround the text.
• For example, “Hello” is a string having five characters.
• Any sequence of characters that is enclosed by
double-quotes is known as a string literal.
✓The double quotes are not part of the string itself.
✓They simply serve to notify the computer that the enclosed
text is a string literal.

• Since digits appear on the keyboard, strings may contain
digits in addition to alphabetic characters.

• Consider, for example, the string literal “04/13/65”.

• This string has a length of 8, where the characters at
indices 2 and 5 are both forward slashes (/) and the
remaining characters are digits.

String Indices
• The characters in a string are indexed such that the first character
has an index of 0, the second character has an index of 1, and so on.
• In the figure below, each character of the string “Hello” is given an index.
• The top row shows the indices of each character, while the
characters themselves occur in the second row.
• We note that the character at index 0 is H, the character at index 1 is
e, and the character at index 4 is o.

Indices:    0 1 2 3 4
Characters: H e l l o
Figure: Indexing in the string “Hello”

Observe
• The length of the string “Hello” is 5, while the largest index is 4.

• This observation suggests that for any string of length n, the
largest valid index is n–1.

• Since the smallest possible index is always 0, we note that for
any string of length n, the only valid indices are in the interval 0
to n–1.

Indexing operation
• Given a string literal we can access the character at a
particular index by using a bracket notation.
• The expression stringliteral[index of character]
produces a string containing one character.
• This is referred to as an indexing operation.
Indices:    0 1 2 3 4 5 6
Characters: p o p c o r n

• Eg. Access the fourth letter of the string literal
“popcorn”.
• The fourth character has an index of 3.
• We write the number 3 inside the brackets
following the string literal: “popcorn”[3].
• This expression produces a string containing a
lowercase c.
• Eg. “popcorn”[15] is an invalid indexing operation.
• The valid indices are in the interval 0 to 6.
• The expression produces an error.
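
The slides use a language-neutral notation. As a rough sketch, the same indexing operation can be tried in Python (chosen here only for illustration; the variable names are hypothetical). Python happens to report an invalid index by raising IndexError.

```python
word = "popcorn"

# Index 3 holds the fourth character of the string.
fourth = word[3]
print(fourth)  # c

# Index 15 lies outside the valid interval 0..6, so Python raises IndexError.
try:
    word[15]
    out_of_range = False
except IndexError:
    out_of_range = True
print(out_of_range)  # True
```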
Length operation

• The expression stringliteral.length produces the
length of the string.
• Eg. “popcorn”.length produces the number 7 (the
length of the string).
• Note that “popcorn”.length is interchangeable with
the value produced by the expression (the number 7).



Concatenation operation

• It takes two strings and splices them to form a third
string as output.
• String concatenation is usually expressed with a plus
symbol (+).
• Although we usually think of the plus symbol as referring to
the mathematical addition of two numbers, the plus symbol is
also employed to concatenate two strings.

Eg. “mother” + “land”
concatenates the two strings “mother” and “land”
to produce the string “motherland”.

String variables
• Recall that variables are bound to data through a name
binding operation that we denote using the left-arrow
symbol (←).
• On the left of this symbol must be a variable name and a
value must occur on the right of the arrow.
x ← “pop”
y ← “corn”
z ← x + y
Figure: String variables
The figure shows how we might use string variables to
refer to string literals.
• In this sequence of actions we tell the computer to
(1) bind the name x to the string literal “pop”
(2) bind the name y to the string literal “corn”
(3) bind the name z to the string “popcorn”

• z is produced by concatenating the strings referred to by
the variables x and y.
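
As a small sketch of the three bindings, here is one possible Python rendering. Python writes the name binding with = instead of the left arrow, and obtains the length with len(z) rather than z.length; both are properties of Python, not of the slide notation.

```python
x = "pop"    # bind the name x to the string literal "pop"
y = "corn"   # bind the name y to the string literal "corn"
z = x + y    # concatenate the strings referred to by x and y

print(z)       # popcorn
print(len(z))  # 7
```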

Substring operation

• Although indexing allows us to obtain a single character of a string,
we often want to obtain a longer subsequence of a string.
• The expression stringvariable.substring(a,b)
produces the sequence of characters spanning indices a
to b-1 of the string variable.
• The substring function allows us to obtain part of a
string if we know the index of the first character that we
want to extract (a) and the index just past the last one (b).


• Eg.
x ← “computational thinking”
y ← x.substring(3, 6)

The function substring is applied to the x variable,
which is bound to the string “computational thinking”.

• Why use index 6?
• It denotes the index of the first character that is not
included in the output, i.e. the half-open interval [3,6).
• We are telling the computer to give us the
sequence of characters starting from the
character at index 3 and ending with the
character at index 5 of the variable x.
• The variable y is bound to the string “put”.
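
The half-open behaviour can be checked with a short Python sketch. Python has no substring method on strings, so the helper below (a hypothetical name, written only for this illustration) wraps Python's slice notation, which follows the same a to b-1 convention.

```python
def substring(text, a, b):
    """Return the characters of text at indices a through b-1."""
    return text[a:b]

x = "computational thinking"
y = substring(x, 3, 6)
print(y)  # put
```

The number of characters returned is b - a, here 6 - 3 = 3.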

IndexOf operation - Handy for searching!

• Sometimes it is useful to find the index of some
character in a string literal.
• The expression
stringliteral.indexOf(character) searches a
string literal for a character and returns the index
of the first occurrence of the character.

• Consider, for example, the e-mail address
“elvispresley@heartbreak.hotel.com”.

• We might want to know where the character ‘@’ occurs
in the string so that we can split the string into two
parts: the user name and the name of the e-mail service.

• x ← “elvispresley@heartbreak.hotel.com”.indexOf(“@”)

x ← “popcorn”.indexOf(“c”)
Figure: Obtain index of first occurrence of “c”

• The value produced by this expression is the number 3, since
the first lowercase c occurs at index 3 in the string literal
“popcorn”.

• Note that “popcorn”.indexOf(“c”) is interchangeable with the
number 3 (the data that the expression produces).

• If the character does not occur in the string, the expression
produces the number –1, which is never a valid index.
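
A brief Python sketch of the search operation follows. Python's counterpart to indexOf is the string method find, which likewise returns the index of the first occurrence and returns -1 when the character is absent.

```python
at_index = "elvispresley@heartbreak.hotel.com".find("@")
c_index = "popcorn".find("c")
missing = "popcorn".find("z")   # "z" does not occur in "popcorn"

print(at_index)  # 12
print(c_index)   # 3
print(missing)   # -1
```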

Processing e-Mail Addresses

We will consider how e-mail addresses, a very
common piece of textual data, can be
automatically processed and analyzed for use
in a business setting.

• Perhaps we are creating a company to sell T-shirts
to college students.
• We establish a policy that requires users to register
prior to browsing our catalog and ordering
products.
• We require that each user provide an e-mail
address and a password.
We know that an e-mail (Figure: e-Mail addresses)
consists of two general parts: a user name (also referred
to as the local part) and a host site (also referred to as
the domain part). These two parts are separated by the
ampersat (@) symbol.

Figure: e-Mail addresses


To verify that the user is a college student:
• We establish a policy requiring that the provided e-mail
address terminate with the characters edu.
• This part of an e-mail address is referred to as the
domain extension.
• By convention, any e-mail address having edu as the
domain extension is understood to belong to an educational
institution.
To track the number of users associated with each
educational institution:
• Separate the user name from the e-mail host site

• We realize that the web registration system must
accept any e-mail address typed in by the user.

• We must extract three vital subsequences: the
user name, the host site, and the domain
extension.

• Consider the e-mail address
bob.dylan@love.and.theft.edu

• This e-mail address has a length of 28.
• The user name is bob.dylan.
• The host site is love.and.theft.edu.
• The final three characters of the host site are edu.

• Since the ampersat occurs at index 9, we know that the
username consists of the first 9 characters of the
address and hence we use the substring statement to
extract the corresponding character sequence.

• Also, since the ampersat is at index 9, we can extract
the host site by taking the characters starting at index
10 and moving up until the end of the string.

• The final three characters are those characters whose
indices are given as 28 – 3, 28 – 2, and 28 – 1, where
28 is the length of the address.
The registration web page has a text-entry field from
which we obtain the text entered by the user:

address ← readAddressFromUser( )

Use the text-processing commands to extract the three
relevant strings:

username ← address.substring(0, 9)
hostsite ← address.substring(10, 28)
extension ← address.substring(25, 28)

Generalizing for any e-mail address
• Although the discussed strategy extracts username, host
site, and extension from the e-mail address
bob.dylan@love.and.theft.edu,
it will not work for most other e-mail addresses, say,
elvis.presley@heartbreak.hotel.com
• We find that:
• the username is elvis.pre,
• the host site is ley@heartbreak.hot, and
• the extension is hot.
Observe that we have made two assumptions
that are not generally true of all e-mails:
(1) the ampersat occurs at index 9.
(2) the length of the e-mail address is 28.

• The first assumption is encoded in “address.substring(0, 9)”.
• We used the number 9 as a result of assuming that the index of
the ampersat is 9.

• The first assumption is also encoded in “address.substring(10,28)”.
• We understood the number 10 to be the index of the first
character following the ampersat.

• The second assumption is encoded in “address.substring(10,28)”
and “address.substring(25,28)”.
• We understood the number 28 to be the length of the address,
and 25 to be that length minus 3.

Removing the assumptions
1. First find the index of the first occurrence of the ampersat in
any address
2. Then find the length of the address.
We can then make use of the values to extract the username, host
site, and extension from any e-mail address the user chooses to
type.

Extracting information from an e-mail address

address ← readAddressFromUser()

ampersatIndex ← address.indexOf(“@”)

length ← address.length

username ← address.substring(0, ampersatIndex)

hostsite ← address.substring(ampersatIndex+1, length)

extension ← address.substring(length-3, length)
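
The generalized steps can be sketched in Python as follows. The function name split_address is hypothetical, and the sketch assumes the address contains an ampersat. It takes the host site to run from just after the ampersat to the end of the address, which is how the host site love.and.theft.edu was defined in the earlier worked example.

```python
def split_address(address):
    """Return (username, hostsite, extension) for an e-mail address."""
    ampersat_index = address.find("@")        # index of the first "@"
    length = len(address)
    username = address[0:ampersat_index]
    hostsite = address[ampersat_index + 1:length]
    extension = address[length - 3:length]    # final three characters
    return username, hostsite, extension

print(split_address("bob.dylan@love.and.theft.edu"))
# ('bob.dylan', 'love.and.theft.edu', 'edu')
print(split_address("elvis.presley@heartbreak.hotel.com"))
# ('elvis.presley', 'heartbreak.hotel.com', 'com')
```

Because no index or length is hard-coded, the second address is now split correctly as well.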


Processing dates
• The European date format is essentially the reverse of the American date
format.
• Most Americans write a date by putting the month before the day while
Europeans put the day ahead of the month.
• The date April 13, 1965, would be written as 04/13/1965 by an American
and as 13/04/1965 by a European.
• Consider writing a website that requires users to enter a date.
• Perhaps we require the user to enter their birthdate or the date that their
driver’s license was granted.
• We might want to allow a European to enter a date using the European
format but then convert the date to an American format so that it can be
stored in our server’s database using the same format as American users.
✓We can use text-processing operations to perform this conversion.
✓Can you try it?
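
One possible answer, as a minimal Python sketch: it uses only the search, substring, length, and concatenation operations covered so far. The function name is hypothetical, and the sketch assumes the input has the form day/month/year with forward slashes as separators.

```python
def european_to_american(date):
    """Convert a dd/mm/yyyy string to mm/dd/yyyy."""
    first_slash = date.find("/")
    # Start the second search just past the first slash.
    second_slash = date.find("/", first_slash + 1)
    day = date[0:first_slash]
    month = date[first_slash + 1:second_slash]
    year = date[second_slash + 1:len(date)]
    return month + "/" + day + "/" + year

print(european_to_american("13/04/1965"))  # 04/13/1965
```

Searching for the slashes, instead of assuming they sit at indices 2 and 5, lets the same code handle a date such as 3/4/1965.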
Patterns

• Patterns are a very useful technique for processing
textual data.
• A pattern defines a set of properties that some
strings will possess and other strings will not.

• It is a way of determining whether a particular string is a
member of the family defined by the pattern or whether a
particular string is not a member of the family.

Recognizing patterns
• In the Social Security Number “123-45-6789”, the observed
pattern is:
• any 3 digits followed by a dash (-)
• followed by any 2 digits
• followed by a dash
• followed by any 4 digits
• A string literal that matches this pattern can be reasonably
understood as a member of the Social Security number family,
whereas a string literal that does not match this pattern is not a
Social Security number.

• A regular expression defines a pattern such
that a particular string will either match the pattern
or will not match the pattern.
• Regular expressions are extremely powerful
techniques for processing textual data.
• We know that a string is a sequence of characters.
• A regular expression (regex) is a description of one or
more strings to match when you search a body of text.
• A regular expression is a sequence of characters
that represents a search pattern.
• It serves as a pattern to compare with the text being
searched.
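
As a first taste, the Social Security number pattern described earlier can be written as a regular expression. The sketch below uses Python's standard re module; the names SSN_PATTERN and looks_like_ssn are hypothetical, and the bracket and brace notation it relies on is explained in the slides that follow.

```python
import re

# Three digits, a dash, two digits, a dash, four digits.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

def looks_like_ssn(text):
    """Return True if the whole string fits the Social Security number pattern."""
    return SSN_PATTERN.fullmatch(text) is not None

print(looks_like_ssn("123-45-6789"))  # True
print(looks_like_ssn("123-456-789"))  # False
```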
String matching applications
Writing Expressions

.       Match any character.
[ ]     Match a range of characters contained within the square brackets.
[^ ]    Match a character which is not one of those contained within the square brackets.
*       Match zero or more of the preceding item.
+       Match one or more of the preceding item.
?       Match zero or one of the preceding item.
{n}     Match exactly n of the preceding item.
{n,m}   Match between n and m of the preceding item.
{n,}    Match n or more of the preceding item.
\       Escape, or remove the special meaning of, the next character.
An exact string (or sequence) of characters

• It is the most basic pattern.

Eg: A search for the characters th, i.e. the character t followed
directly by the character h

th

There is no theory of evolution. Only a list of animals
Chuck Norris allows to live.

• Here the only match is the th in theory.

• You may be wondering why the Th in There was not picked up as a match.
• The reason is that There contains a capital T as opposed to the
lowercase t that the regular expression was searching for.

• We know that they are the same character, just in a different form.
• Regular expressions, however, do not.

• Regular expressions do not interpret any meaning from
the search pattern.

• All they do is look for exact matches to specifically
what the pattern describes.

Note: It is possible to make a regular expression look for
matches in a case-insensitive way.
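
To illustrate, here is a short sketch using Python's re module (one regex implementation among many). It records the index at which each match of th starts, first with the default case-sensitive search and then with the re.IGNORECASE flag.

```python
import re

text = "There is no theory of evolution."

# Default search: only the lowercase "th" in "theory" is found.
sensitive = [m.start() for m in re.finditer(r"th", text)]

# Case-insensitive search: the "Th" in "There" is found as well.
insensitive = [m.start() for m in re.finditer(r"th", text, flags=re.IGNORECASE)]

print(sensitive)    # [12]
print(insensitive)  # [0, 12]
```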
• A very basic expression like this is really no different from a search
you may do in a search engine or in your favourite word
processor.
• It's not really that exciting.

• Metacharacters are characters which have a special
meaning.
• They help us to create more interesting patterns than just a
string of specific characters.
The dot - any character

• The dot ( . ) (or full stop) character is what we refer to as a
metacharacter.

• The dot ( . ) represents any character.

Eg: To search for character a followed by any single
character and followed by c
a.c
• Look for character b followed by any character, followed by
character g

b.g

The big bag of bits was bugged.


Note: the . matches only a single character. We may get it to match
more than a single character using multipliers.

• To match an l, followed by any two characters, followed by e

l..e

You can live like a king but make sure it isn't a lie.
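
Both dot examples can be checked with a small Python sketch using re.findall, which lists every match in a text. The second pattern is written l..e (lowercase L), since the sample sentence contains no i followed by two characters and then e; with l..e the matches are live and like, while lie is skipped because only one character separates its l and e.

```python
import re

bg_matches = re.findall(r"b.g", "The big bag of bits was bugged.")
le_matches = re.findall(r"l..e", "You can live like a king but make sure it isn't a lie.")

print(bg_matches)  # ['big', 'bag', 'bug']
print(le_matches)  # ['live', 'like']
```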
Ranges of Characters
• Specify a range of characters by enclosing them within square brackets []
• Look for character t followed by either character e or
o, followed by character d
t[eo]d

When today is over Ted will have a tedious time tidying up.
• There is no limit to how many characters you may
place inside the square brackets.

• You could place a single character, [y] (which
would be a bit silly but nevertheless it is legal).

• You could have many, [grf4s2#lknx].


Shortcut for characters in a row

• Look for a digit between 1 and 8: [12345678]

• There is a shortcut: [1-8]

Room Allocations: G4 G9 F2 H1 L0 K7 M9
Expression     Matching Definition

[469]          Matches the single digit 4, 6 or 9
[0-9]          Matches any single digit from 0 - 9
[A-Za-z0-9]    Matches any single character that is either an uppercase
               letter or a lowercase letter or a digit
Combine a set of characters along with other characters

• Search for the digits 1, 2, 3, 4 or 9

[1-49]

Room Allocations: G4 G9 F2 H1 L0 K7 M9
Combine multiple sets

• look for 1, 2, 3, 4, 5, a, b, c, d, e, f, x

[1-5a-fx]
Negating - Find characters that aren't

• We can look for the presence of a character which is not in a range of characters.
• We do this by placing a caret ( ^ ) at the beginning of the range.

• Look for character t followed by a character which is
neither e nor o, followed by the character d

t[^eo]d
When today is over Ted will have a tedious time
tidying up.
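
The range examples from the last few slides can be tried together in a short Python sketch (variable names are hypothetical). re.findall returns every match in the order it is found.

```python
import re

sentence = "When today is over Ted will have a tedious time tidying up."
rooms = "Room Allocations: G4 G9 F2 H1 L0 K7 M9"

either = re.findall(r"t[eo]d", sentence)    # t, then e or o, then d
neither = re.findall(r"t[^eo]d", sentence)  # t, then anything but e or o, then d
one_to_eight = re.findall(r"[1-8]", rooms)
one_to_four_or_nine = re.findall(r"[1-49]", rooms)

print(either)               # ['tod', 'ted']
print(neither)              # ['tid']
print(one_to_eight)         # ['4', '2', '1', '7']
print(one_to_four_or_nine)  # ['4', '9', '2', '1', '9']
```

Ted is not matched by t[eo]d because its T is uppercase.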
Multipliers

Multipliers allow us to increase the number of times
an item may occur in our regular expression.

*       item occurs zero or more times
+       item occurs one or more times
?       item occurs zero or one times
{n}     item occurs exactly n times
{m,n}   item occurs between m and n times
{n,}    item occurs at least n times
Escaping Metacharacters

• Sometimes we want to search for a character which is itself a
metacharacter.
• Place a backslash ( \ ) in front of the metacharacter.
• This removes its special meaning.
OR: previous or next character/group
|  ( )

(Mon|Tues)day

matches "Monday" or "Tuesday"


^
• Beginning of a string
  match strings that begin with http: ^http
• Within a character range [], negation
  match any character not 0-9: [^0-9]
$
• End of a string
  match "exciting" but not "ingenious": ing$
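
A short Python sketch of alternation and the two anchors follows. The alternation is written (Mon|Tues)day, with both alternatives inside one group, so that day is required after either of them; written as (Mon)|(Tues)day, the alternatives would be Mon and Tuesday. The URL is a made-up example.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"(Mon|Tues)day", "Monday"))     # True
print(whole(r"(Mon|Tues)day", "Tuesday"))    # True
print(whole(r"(Mon)|(Tues)day", "Monday"))   # False: the first alternative is only "Mon"

starts_http = re.search(r"^http", "http://example.com") is not None
ends_ing = re.search(r"ing$", "exciting") is not None
ingenious = re.search(r"ing$", "ingenious") is not None
print(starts_http, ends_ing, ingenious)      # True True False
```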
Write a regular expression for each of the following:

1. ab followed by exactly three c’s and followed by d

2. ab followed by zero or more c’s and followed by d

3. ab followed by one or more c’s and followed by d

4. ab followed by an optional c and followed by d

5. Character l followed by the character o zero or more
times: lo*
Are you looking at the lock or the silk?
The l in silk is also matched. Why?

6. The given regex is: this. How will you rewrite the
expression so that it matches the literal text this. (including the full stop)?
Solution

1. abc{3}d

2. abc*d

3. abc+d

4. abc?d

5. Zero occurrences of o are allowed, so an l on its own (as in silk) also matches lo*

6. this\.
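
These answers can be checked with a small Python sketch. The helper whole is a hypothetical name that tests whether a pattern matches an entire string; the test strings are made up for illustration.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"abc{3}d", "abcccd"), whole(r"abc{3}d", "abccd"))  # True False
print(whole(r"abc*d", "abd"), whole(r"abc*d", "abccccd"))       # True True
print(whole(r"abc+d", "abd"), whole(r"abc+d", "abcd"))          # False True
print(whole(r"abc?d", "abcd"), whole(r"abc?d", "abccd"))        # True False

# lo* also matches a lone l, because zero o's are allowed.
lo_matches = re.findall(r"lo*", "Are you looking at the lock or the silk?")
print(lo_matches)  # ['loo', 'lo', 'l']

# Without the backslash the dot matches any character.
print(re.findall(r"this.", "this. thisX"))   # ['this.', 'thisX']
print(re.findall(r"this\.", "this. thisX"))  # ['this.']
```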
Regular expression        Meaning of quantifier               Matches

Chapter [1-9][0-9]*       Matches [0-9] zero or more times.   -------------------
Chapter [0-9]{1,2}        Matches [0-9] one or two times.     -------------------
Chapter [1-9][0-9]{0,1}   Matches [0-9] zero or one time.     -------------------
Solution
Regular expression        Meaning of quantifier               Matches

Chapter [1-9][0-9]*       Matches [0-9] zero or more times.   "Chapter 1", "Chapter 25", "Chapter 40", "Chapter 401"
Chapter [0-9]{1,2}        Matches [0-9] one or two times.     "Chapter 0", "Chapter 03", "Chapter 1", "Chapter 25", "Chapter 40"
Chapter [1-9][0-9]{0,1}   Matches [0-9] zero or one time.     "Chapter 1", "Chapter 25", "Chapter 40"
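
The table can be reproduced with a short Python sketch that tests each expression against every chapter title listed in the table. The helper name matching is hypothetical.

```python
import re

candidates = ["Chapter 0", "Chapter 03", "Chapter 1",
              "Chapter 25", "Chapter 40", "Chapter 401"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(matching(r"Chapter [1-9][0-9]*"))
# ['Chapter 1', 'Chapter 25', 'Chapter 40', 'Chapter 401']
print(matching(r"Chapter [0-9]{1,2}"))
# ['Chapter 0', 'Chapter 03', 'Chapter 1', 'Chapter 25', 'Chapter 40']
print(matching(r"Chapter [1-9][0-9]{0,1}"))
# ['Chapter 1', 'Chapter 25', 'Chapter 40']
```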
Set pattern Matches

[xyz]
[a-z]
[a-zA-Z0-9_]
[^a-z]

[\s\S]
Solution
Set pattern     Representation

[xyz]           x, y or z
[a-z]           anything between a to z
[a-zA-Z0-9_]    a to z, A to Z, 0 to 9 or an underscore. Equivalent to \w
[^a-z]          anything except a to z
[\s\S]          any whitespace character or any non-whitespace character.
                Essentially, anything!
---------------   matches "Ahhhhh" or "A"
---------------   matches "Ah", or the "A" in "Al"
---------------   matches "Ah" or "Ahhh" but not "A"
---------------   matches "Hungry?"
---------------   matches "dog", "door", "dot", etc.
---------------   matches "315" but not "31"
---------------   matches "12", "123", and "1234"
---------------   matches "1234567..."
---------------   matches strings that begin with http, such as a url
Solution
Symbol   Meaning                                      Example

*        Match zero or more of the previous           Ah* matches "Ahhhhh" or "A"
?        Match zero or one of the previous            Ah? matches "Ah", or the "A" in "Al"
+        Match one or more of the previous            Ah+ matches "Ah" or "Ahhh" but not "A"
\        Used to escape a special character           Hungry\? matches "Hungry?"
.        Wildcard character, matches any character    do.* matches "dog", "door", "dot", etc.
[ ]      Matches a range of characters                [cbf]ar matches "car", "bar", or "far"
                                                      [0-9]+ matches any sequence of one or more digits
                                                      [a-zA-Z] matches ASCII letters a-z (uppercase and lowercase)
                                                      [^0-9] matches any character not 0-9
{ }      Matches a specified number of                [0-9]{3} matches "315" but not "31"
         occurrences of the previous                  [0-9]{2,4} matches "12", "123", and "1234"
                                                      [0-9]{2,} matches "1234567..."
Email matching
To check whether a string is a valid email address

We need an expression that specifies the following:

1. One or more characters, each of which is a letter, a digit,
or one of the symbols . _ % + - : [A-Z0-9._%+-]+
2. The @ symbol: @
3. One or more characters, each of which is a letter or a digit: [A-Z0-9]+
4. A literal dot: \.
5. Letters only, min length 2 and max length 4: [A-Z]{2,4}

Putting the parts together:

[A-Z0-9._%+-]+@[A-Z0-9]+\.[A-Z]{2,4}

Note: the character classes list only uppercase letters, so the expression
must be applied in a case-insensitive way to accept lowercase addresses.
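
A minimal Python sketch of this check is shown below, with the five parts joined without spaces and the re.IGNORECASE flag added so that lowercase letters are accepted (the classes above list uppercase letters only). The names EMAIL_PATTERN and is_email are hypothetical, and the test addresses are made up.

```python
import re

EMAIL_PATTERN = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9]+\.[A-Z]{2,4}", re.IGNORECASE)

def is_email(text):
    """Return True if the whole string fits the e-mail pattern."""
    return EMAIL_PATTERN.fullmatch(text) is not None

print(is_email("bob.dylan@example.edu"))         # True
print(is_email("bob.dylan.example.edu"))         # False: no @
print(is_email("bob.dylan@love.and.theft.edu"))  # False: see note below
```

The last result shows a limitation of the expression as given: the class after the @ contains no dot, so a host site with more than one dot is rejected. Real-world e-mail validation is considerably more permissive than this teaching pattern.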


Date matching

To check whether a string is a date (dd-mm-yyyy format)

Solution

^(0?[1-9]|[12][0-9]|3[01])-(0?[1-9]|1[012])-\d{4}$
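
The date expression can be tried in Python as sketched below. It is written on one line with each dash placed after the closing parenthesis of its group, so that the dash is required whichever alternative inside the group matched. The names DATE_PATTERN and is_date are hypothetical.

```python
import re

DATE_PATTERN = re.compile(r"^(0?[1-9]|[12][0-9]|3[01])-(0?[1-9]|1[012])-\d{4}$")

def is_date(text):
    """Return True if text has the form dd-mm-yyyy with day 1-31 and month 1-12."""
    return DATE_PATTERN.search(text) is not None

print(is_date("13-04-1965"))  # True
print(is_date("3-4-1965"))    # True: the leading zeros are optional
print(is_date("32-01-2000"))  # False: no day 32
print(is_date("13-13-2000"))  # False: no month 13
print(is_date("31-02-2000"))  # True: the pattern does not know month lengths
```

As the last line shows, the pattern checks only the shape of the string, not whether the date exists on the calendar.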
What has been described?
• How operations can be performed on strings.
• How to process textual information.
• How computers can be used to create, process, and
reason about textual information.
• How to recognize patterns.
• How to rewrite string literals as patterns.

Credits
▪Google images

