LN#18 CTPS 2018
2017
Text Processing and Pattern Matching
Department of CSE,Coimbatore
Objectives
• To explore how computers can be used to create, process, and
reason about textual information.
• To understand strings and string operations.
• To learn to process textual information.
• To learn to recognize patterns.
• To understand that string literals can be rewritten as patterns.
Strings
✓Although much of the information stored on computers involves
numbers, the majority of data is textual rather than numeric.
• Your name, social security number, address, and
Facebook status are all textual in nature.
• Computer programmers use the term string when
referring to textual data.
• A string is simply a piece of text, or more formally, an
ordered sequence of individual characters.
Character
✓You can think of a single character as the result of any
one key that you press on a keyboard.
• A character is usually a letter of the alphabet.
• It might also be a punctuation symbol such as a
comma, semicolon, or question mark.
• It may even be a nonprintable character such as
a tab or a linefeed.
String Length
• The length of a string is the number of characters contained in
the string.
String Literal
• In most programming languages, string data is denoted
by using double quotes to surround the text.
• For example, “Hello” is a string having five characters.
• Any sequence of characters that is enclosed by
double-quotes is known as a string literal.
✓The double quotes are not part of the string itself.
✓They simply serve to notify the computer that the enclosed
text is a string literal.
• Since digits appear on the keyboard, strings may contain
digits in addition to alphabetic characters.
String Indices
• The characters in a string are indexed such that the first character
has an index of 0, the second character has an index of 1, and so on.
• In the Figure: Indexing in the string “Hello” , each character of the
string “Hello” is given an index.
• The top row shows the indices of each character, while the
characters themselves occur in the second row.
• We note that the character at index 0 is H, the character at index 1 is
e, and the character at index 4 is o.
Indices:    0 1 2 3 4
Characters: H e l l o
Observe
• The length of the string “Hello” is 5, while the largest index is
4.
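The observation can be checked with a short sketch. Python is used here only for illustration (the slides themselves are language neutral); its built-in len function returns the length of a string.

```python
word = "Hello"

length = len(word)          # number of characters in the string
largest_index = length - 1  # indices run from 0 to length - 1

print(length)               # 5
print(largest_index)        # 4
print(word[largest_index])  # o
```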
Indexing operation
• Given a string literal we can access the character at a
particular index by using a bracket notation.
• The expression stringliteral[index of character]
produces a string containing the one character at that index.
• This is referred to as an indexing operation.
0 1 2 3 4 5 6
p o p c o r n
• Eg. Access the fourth letter of the string literal
“popcorn”.
• The fourth character has an index of 3.
• We write the number 3 inside of the brackets
following the string literal as “popcorn” [3].
• This expression produces a string containing a
lowercase c.
• Eg. “popcorn” [15] is an invalid indexing operation.
• The valid indices are in the interval 0 to 6.
• The expression produces an error.
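A minimal Python sketch of both cases follows. Python happens to use the same bracket notation; the kind of error raised for an invalid index (IndexError here) is specific to Python, and other languages report it differently.

```python
snack = "popcorn"

fourth = snack[3]   # the fourth character has index 3
print(fourth)       # c

# Index 15 lies outside the interval 0 to 6, so the indexing fails.
try:
    snack[15]
    out_of_range = False
except IndexError:
    out_of_range = True

print(out_of_range)  # True
```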
Length operation
✓Length is 9
✓Length is 5
Concatenation operation
Eg. “mother” + “land”
concatenates the two strings “mother” and “land”
to produce the string “motherland”.
String variables
• Recall that variables are bound to data through a name
binding operation that we denote using the left-arrow
symbol (←).
• On the left of this symbol must be a variable name and a
value must occur on the right of the arrow.
x ← “pop”
y ← “corn”
z ← x + y
Figure: String variables
The figure shows how we might use string variables to refer to string literals.
• In this sequence of actions we tell the computer to
(1) bind the name x to the string literal “pop”
(2) bind the name y to the string literal “corn”
(3) bind the name z to the string “popcorn”
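The same three bindings can be sketched in Python, where = plays the role of the left arrow and + concatenates two strings.

```python
x = "pop"    # bind the name x to the string literal "pop"
y = "corn"   # bind the name y to the string literal "corn"
z = x + y    # bind the name z to the concatenation of x and y

print(z)                  # popcorn
print("mother" + "land")  # motherland
```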
Substring operation
• Eg.
x ← “computational thinking”
y ← x.substring(3, 6)
• Why use index 6?
• It denotes the index of the first character that is not
included in the output, i.e. the half-open interval [3, 6).
• We are telling the computer to give us the
sequence of characters starting from the
character at index 3 and ending with the
character at index 5 of the variable x.
• The variable y is bound to the string “put”.
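Python has no substring method, but its slice notation x[start:end] follows the same convention: the start index is included and the end index is excluded. A short sketch under that substitution:

```python
x = "computational thinking"

# Equivalent of x.substring(3, 6): characters at indices 3, 4, and 5.
y = x[3:6]

print(y)       # put
print(len(y))  # 3, which is 6 - 3
```

Because the end index is excluded, the length of the result is simply end minus start.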
IndexOf operation: handy for searching!
• Consider, for example, the e-mail address
“elvispresley@heartbreak.hotel.com”.
• x ← “elvispresley@heartbreak.hotel.com”.indexOf(“@”)
x ← “popcorn”.indexOf(“c”)
Figure: Obtain index of first occurrence of “c”
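In Python, the closest counterpart of indexOf is the string method find, which returns the index of the first occurrence. The sketch below assumes that substitution; the -1 result for a missing character is Python's behavior and is not something the slides state.

```python
x = "popcorn".find("c")
print(x)  # 3

at_index = "elvispresley@heartbreak.hotel.com".find("@")
print(at_index)  # 12

# Python's find returns -1 when the character does not occur at all.
missing = "popcorn".find("z")
print(missing)  # -1
```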
Processing e-Mail Addresses
• Perhaps we are creating a company to sell T-shirts
to college students.
• We establish a policy that requires users to register
prior to browsing our catalog and ordering
products.
• We require that each user provide an e-mail
address and a password.
We know that an e-mail (Figure: e-Mail addresses)
consists of two general parts: a user name (also referred
to as the local part) and a host site (also referred to as
the domain part). These two parts are separated by the
ampersat (@) symbol.
• We realize that the web registration system must
accept any e-mail address typed in by the user.
• Consider the e-mail address
bob.dylan@love.and.theft.edu
• Since the ampersat occurs at index 9, we know that the
username consists of the first 9 characters of the
address and hence we use the substring statement to
extract the corresponding character sequence.
address ← readAddressFromUser( )
username ← address.substring(0, 9)
hostsite ← address.substring(10,28)
extension ← address.substring(25,28)
Generalizing for any e-mail address
• Although the discussed strategy extracts the username, host
site, and extension from the e-mail address
bob.dylan@love.and.theft.edu,
it will not work for most other e-mail addresses, say,
elvis.presley@heartbreak.hotel.com
• Applying the same indices to this address, we find that:
• the username is elvis.pre,
• the host site is ley@heartbreak.hot, and
• the extension is hot.
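The failure is easy to reproduce. The sketch below hard-codes the indices 0, 9, 10, 25, and 28 from the slides, using Python slices in place of substring; the function name split_fixed is made up for this illustration.

```python
def split_fixed(address):
    """Extract the parts using the hard-coded indices from the slides."""
    username = address[0:9]
    hostsite = address[10:28]
    extension = address[25:28]
    return username, hostsite, extension


bob = split_fixed("bob.dylan@love.and.theft.edu")
elvis = split_fixed("elvis.presley@heartbreak.hotel.com")

print(bob)    # ('bob.dylan', 'love.and.theft.edu', 'edu')
print(elvis)  # ('elvis.pre', 'ley@heartbreak.hot', 'hot')
```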
Observe that we have made two assumptions
that are not generally true of all e-mails:
(1) the ampersat occurs at index 9.
(2) the length of the e-mail address is 28.
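• The first assumption is encoded in “address.substring(0, 9)”.
• We used the number 9 as a result of assuming that the index of
the ampersat is 9.
• The second assumption is encoded in the number 28 used in
“address.substring(10,28)” and “address.substring(25,28)”.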
Removing the assumptions
1. First find the index of the first occurrence of the ampersat in
any address
2. Then find the length of the address.
We can then make use of the values to extract the username, host
site, and extension from any e-mail address the user chooses to
type.
Extracting information from an e-mail address
address ← readAddressFromUser()
ampersatIndex ← address.indexOf(“@”)
length ← address.length
hostsite ← address.substring(ampersatIndex+1, length-4)
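A possible Python version of the generalized extraction is sketched below. Only the hostsite line comes from the slide; the username and extension lines are assumptions added to complete the picture, and the whole sketch assumes (as length-4 implies) that the address ends in a dot followed by a three-letter extension. The function name split_address is also made up here.

```python
def split_address(address):
    """Split an e-mail address without hard-coded positions."""
    at_index = address.find("@")   # first occurrence of the ampersat
    length = len(address)

    username = address[0:at_index]                # assumed: everything before "@"
    hostsite = address[at_index + 1:length - 4]   # as on the slide
    extension = address[length - 3:length]        # assumed: last three characters
    return username, hostsite, extension


bob = split_address("bob.dylan@love.and.theft.edu")
elvis = split_address("elvis.presley@heartbreak.hotel.com")

print(bob)    # ('bob.dylan', 'love.and.theft', 'edu')
print(elvis)  # ('elvis.presley', 'heartbreak.hotel', 'com')
```

Note that an address with a two- or four-letter extension would still be split incorrectly, since the number 4 is itself an assumption about the address.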
Patterns
• Patterns are a very useful technique for processing
textual data.
• A pattern defines a set of properties that some
strings will possess and other strings will not.
Recognizing patterns
• In the Social Security Number “123-45-6789”, the observed
pattern is:
• any 3 digits followed by a dash (-)
• followed by any 2 digits
• followed by a dash
• followed by any 4 digits
• A string literal that matches this pattern can be reasonably
understood as a member of the Social Security number family,
whereas a string literal that does not match this pattern is not a
Social Security number.
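As a sketch of how such a pattern might be checked by a program, the four steps above can be written as a regular expression and tested with Python's re module. The function name looks_like_ssn is an assumption for this example, and the check concerns only the shape of the string, not whether the number was ever issued.

```python
import re

# 3 digits, a dash, 2 digits, a dash, 4 digits.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def looks_like_ssn(text):
    """Return True if the whole string has the Social Security number shape."""
    return SSN_PATTERN.fullmatch(text) is not None


print(looks_like_ssn("123-45-6789"))  # True
print(looks_like_ssn("123456789"))    # False
print(looks_like_ssn("12-345-6789"))  # False
```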
• A regular expression defines a pattern such
that a particular string will either match the pattern
or will not match the pattern.
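• Consider the regular expression th applied to the following text:
There is no theory of evolution. Only a list of animals
Chuck Norris allows to live.
• The expression matches the “th” in “theory” but not the “Th” in “There”.
• We know that “T” and “t” are the same character, just in a different form.
• Regular expressions, however, do not: matching is case sensitive by default.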
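The case sensitivity can be seen with Python's re module. The re.IGNORECASE flag shown at the end is a Python feature, not something the slides cover; it is included only to show that the default behavior can be switched off.

```python
import re

text = ("There is no theory of evolution. Only a list of animals "
        "Chuck Norris allows to live.")

# By default only the lowercase "th" in "theory" matches.
default_matches = re.findall(r"th", text)
first_position = re.search(r"th", text).start()

# With the IGNORECASE flag, the "Th" in "There" matches as well.
relaxed_matches = re.findall(r"th", text, flags=re.IGNORECASE)

print(default_matches)  # ['th']
print(first_position)   # 12
print(relaxed_matches)  # ['Th', 'th']
```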
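• The dot (.) matches any single character, as in the expressions
b.g
i..e
• A set in square brackets matches any one of the listed characters;
the set below matches the digits 1 to 4 and the digit 9.
[1-49]
Room Allocations: G4 G9 F2 H1 L0 K7 M9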
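A short Python sketch of these expressions follows. The room allocation text is taken from the slide; the word list used for b.g and the word "idle" used for i..e are made up for this illustration.

```python
import re

rooms = "Room Allocations: G4 G9 F2 H1 L0 K7 M9"

# [1-49] is the range 1-4 plus the single digit 9, not "1 to 49".
digits = re.findall(r"[1-49]", rooms)
print(digits)  # ['4', '9', '2', '1', '9']

# The dot stands for exactly one character of any kind.
words = re.findall(r"b.g", "big bag bug bg")
print(words)   # ['big', 'bag', 'bug']

four_letters = re.fullmatch(r"i..e", "idle") is not None
print(four_letters)  # True
```

The rooms L0 and K7 contribute nothing because 0 and 7 are not in the set.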
Combine multiple sets
• look for 1, 2, 3, 4, 5, a, b, c, d, e, f, x
[1-5a-fx]
Negating: find characters that aren't in the set
t[^eo]d
when today is over Ted will have a tedious time tidying up.
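Both the combined set and the negated set can be tried out in Python. The sentence is the one from the slide; the short string used for [1-5a-fx] is an assumption made up for this sketch.

```python
import re

# Combined set: digits 1-5, letters a-f, and the letter x.
combined = re.findall(r"[1-5a-fx]", "7 x 3 z b")
print(combined)  # ['x', '3', 'b']

sentence = "when today is over Ted will have a tedious time tidying up."

# t, then any one character that is neither e nor o, then d.
negated = re.findall(r"t[^eo]d", sentence)
print(negated)   # ['tid']
```

Here "today" and "tedious" are rejected by the negated set, and "Ted" is rejected because the expression asks for a lowercase t.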
Multipliers
(Mon|Tues)day
Solution
1. abc{3}d
2. abc*d
3. abc+d
4. abc?d
5. it is an l followed by zero o
6. this\.
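The difference between the four multipliers in solutions 1 to 4 can be seen by running each expression against the same candidates. The candidate strings and the helper name accepted_by are assumptions for this sketch; fullmatch is used so that the whole string must fit the expression.

```python
import re

candidates = ["abd", "abcd", "abccd", "abcccd"]


def accepted_by(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


exactly_three = accepted_by(r"abc{3}d")  # c exactly 3 times
zero_or_more = accepted_by(r"abc*d")     # c zero or more times
one_or_more = accepted_by(r"abc+d")      # c one or more times
zero_or_one = accepted_by(r"abc?d")      # c zero or one time

print(exactly_three)  # ['abcccd']
print(zero_or_more)   # ['abd', 'abcd', 'abccd', 'abcccd']
print(one_or_more)    # ['abcd', 'abccd', 'abcccd']
print(zero_or_one)    # ['abd', 'abcd']

# In solution 6 the backslash makes the dot literal instead of a wildcard.
literal_dot = re.fullmatch(r"this\.", "this.") is not None
wildcard_blocked = re.fullmatch(r"this\.", "thisx") is None
print(literal_dot, wildcard_blocked)  # True True
```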
Regular expression    Meaning of quantifier    Matches
[xyz]
[a-z]
[a-zA-Z0-9_]
[^a-z]
[\s\S]
Solution
Set pattern       Representation
[xyz]             x, y, or z
[a-z]             any character from a to z
[a-zA-Z0-9_]      a to z, A to Z, 0 to 9, or an underscore; equivalent to \w
[^a-z]            any character except a to z
[\s\S]            any character, whitespace or not
---------------  matches "Al" or "Ah"
---------------  matches "Ah" or "Ahhh" but not "A"
---------------  matches "Hungry?"
---------------  matches "1234567..."
{}  Matches a specified number of occurrences of the previous element:
    [0-9]{2,4} matches "12", "123", and "1234"
    [0-9]{2,} matches "1234567..."
Email matching
To check if a string is a valid email address or not
1. One or more letters, digits, or any of the symbols . _ % + - :
[A-Z0-9._%+-]+
2. @ : @
3. One or more letters or digits : [A-Z0-9]+
4. . : \.
5. Letters, minimum length 2 and maximum length 4 : [A-Z]{2,4}
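Joining the five parts gives a single expression, sketched below in Python. A + is written after parts 1 and 3 to express "one or more occurrences", and the re.IGNORECASE flag is an assumption added because the sets list only uppercase letters. The function name is_valid_email is made up here, and this simplified pattern is far from a complete e-mail validator.

```python
import re

EMAIL_PATTERN = re.compile(
    r"[A-Z0-9._%+-]+"   # 1. local part
    r"@"                # 2. the ampersat
    r"[A-Z0-9]+"        # 3. host name
    r"\."               # 4. a literal dot
    r"[A-Z]{2,4}",      # 5. extension of 2 to 4 letters
    re.IGNORECASE,
)


def is_valid_email(text):
    """Return True if the whole string fits the five-part pattern."""
    return EMAIL_PATTERN.fullmatch(text) is not None


print(is_valid_email("bob.dylan@theft.edu"))           # True
print(is_valid_email("bob.dylan.theft.edu"))           # False
print(is_valid_email("bob.dylan@love.and.theft.edu"))  # False
```

The last address is rejected because part 3, as given, allows no dots inside the host name; a more permissive set such as [A-Z0-9.-] would be needed to accept it.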
Date matching (dd-mm-yyyy)
^(0?[1-9]|[12][0-9]|3[01])\-(0?[1-9]|1[012])\-\d{4}$
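One reading of this expression is a date in day-month-year form, assuming each group is meant to close before its dash. The sketch below tests that reading in Python; the function name looks_like_date is an assumption, and the pattern checks only the ranges 1 to 31 and 1 to 12, not whether the day exists in the given month.

```python
import re

DATE_PATTERN = re.compile(
    r"^(0?[1-9]|[12][0-9]|3[01])"  # day: 1-31, optional leading zero
    r"-(0?[1-9]|1[012])"           # month: 1-12, optional leading zero
    r"-\d{4}$"                     # year: exactly four digits
)


def looks_like_date(text):
    """Return True if the string has the dd-mm-yyyy shape."""
    return DATE_PATTERN.match(text) is not None


print(looks_like_date("25-12-2018"))  # True
print(looks_like_date("5-1-2018"))    # True
print(looks_like_date("32-01-2018"))  # False
print(looks_like_date("25-13-2018"))  # False
print(looks_like_date("31-02-2018"))  # True, although no such day exists
```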
What has been described?
• How operations can be performed on strings.
• How to process textual information.
• How computers can be used to create, process, and
reason about textual information.
• How to recognize patterns.
• How to rewrite string literals as patterns.
Credits
▪Google images