Professional Documents
Culture Documents
HKU - 7001 - 3.1 Managing Data I
HKU - 7001 - 3.1 Managing Data I
HKU - 7001 - 3.1 Managing Data I
Regular Expression
Managing NumPy
Data
Pandas
StatsModels
Tableau
Data
Visualization Matplotlib
Agenda
• Sources of Data
• Reading from and Writing to Files
TEXT
CSV
JSON
• Modules and Packages
• Regular Expressions (Regex)
Sources of Data
Where do we find and store data?
text
csv
In Files JSON
Excel
Database
HTML
On Web APIs
Text
Text File Structure
• A text file can be thought of as a sequence of lines
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
• Or, you can also use the absolute path, by using the full path
of the file such as “/home/MSBA7001/data/mbox.txt”
When Files are Missing
• Use try and except
handle = open(r'mbox.txt')
count = 0
for line in handle:
count = count + 1
print('Line Count:', count)
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
...
From: stephen.marquard@uct.ac.za\n
\n
From: louis@media.berkeley.edu\n
\n
From: zqian@umich.edu\n
\n
From: rjlowe@iupui.edu\n
\n
...
Searching Through a File (fixed)
• We can strip the whitespace from the right-hand side of the
string using rstrip() from the string library
• The newline is considered “white space” and is stripped
handle = open(r'mbox_short.txt')
for line in handle:
line = line.rstrip()
if line.startswith('From:'):
print(line)
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
....
Skipping with continue
• We can conveniently skip a line by using the continue
statement
handle = open(r'mbox_short.txt')
for line in handle:
line = line.rstrip()
if not line.startswith('From:'):
continue
print(line)
Using in to Select Lines
• We can look for a string anywhere in a line as our selection
criteria using in or not in
handle = open(r'mbox_short.txt')
for line in handle:
line = line.rstrip()
if not '@uct.ac.za' in line:
continue
print(line)
handle = open(r'mbox.txt','r')
do something with handle
handle.close()
Writing to a Text File
• Let us open a text file then write some random text into it
by using the write() method.
import csv
import csv
[‘name', 'Age']
['Peter', '22']
['Jasmine', '21']
['Sam', '24']
JSON
What is JSON
• JSON: JavaScript Object Notation.
• JSON is a syntax for storing and exchanging data.
• It is still text, treat it as strings. It’s called JSON strings.
• It is easy for humans to read and write.
• It is easy for machines to parse and generate.
{"employees":[
{ "firstName":"John", "lastName":"Doe" },
{ "firstName":"Anna", "lastName":"Smith" },
{ "firstName":"Peter", "lastName":"Jones" }
]}
JSON Syntax Rules
• Data is in key/value pairs, like a dictionary in Python
• Pairs is separated by commas
• keys must be strings, written with double quote
• values must be one of the following data types:
a string { "name":"John" }
a number { "age":30 }
an object (JSON object)
an array
a boolean { "sale":true }
null { "middlename":null }
JSON Syntax Rules
• Curly braces hold objects, like dictionaries in a dictionary in
Python
{
"employee":{ "name":"John", "age":
30, "city":"New York" }
}
import json
x = '''{"Course Code":"MSBA7001",
"Instructor":"Chao Ding", "Capacity":90}'''
y = json.loads(x)
print(y)
print(y['Capacity'])
90
JSON & Python – loads()
• Read a json file
import json
dict
JSON & Python – loads()
json_dict
{'quiz': {'sport': {'q1': {'question': 'Which one is correct team name in NBA?',
'options': ['New York Bulls',
'Los Angeles Kings',
'Golden State Warriros',
'Huston Rocket'],
'answer': 'Huston Rocket'}},
'maths': {'q1': {'question': '5 + 7 = ?',
'options': ['10', '11', '12', '13'],
'answer': '12'},
'q2': {'question': '12 - 8 = ?',
'options': ['1', '2', '3', '4'],
'answer': '4'}}}}
json_dict['quiz']['maths']['q2']['options'][1]
'2'
JSON & Python – dumps()
• dumps() converts a Python object to a JSON string
import json
x = {
'name': 'John',
'age': 30,
'city': 'New York'
}
print(json.dumps(x))
{
"name": "John",
"age": 30,
"city": "New York"
}
dumps()
JSON Python
object dict
array list
string str
number (int) int
number (real) float
true True
false False
null None
Summary
• File structure
• Opening a file
• Reading a file
• Searching for lines
• Reading file names
• Dealing with bad files
• Writing a File
• Reading a CSV file
• JSON and Python
Exercise 1
• Read file “mbox.txt”. Find and print all the domain names of
email addresses from lines that start with ‘From’.
• For example:
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
• uct.ac.za is the domain name
import package.module
from package.module import method
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of “marker characters” - programming with
characters
• It is kind of an “old school” language - compact
Regular Expression Quick Guide
handle = open('mbox_short.txt')
for line in handle:
line = line.rstrip()
if re.search('From:', line):
print(line)
Returns True/None
Using re.search() like startswith()
handle = open('mbox_short.txt')
for line in handle:
line = line.rstrip()
if line.startswith('From:'):
print(line)
Returns True/False
import re
handle = open('mbox_short.txt')
for line in handle:
line = line.rstrip()
if re.search('^From:', line):
print(line)
There is a hat ^
Compiling a pattern
• We can use the compile() method to compile a regular
expression pattern into a regular expression object, which
can be used in other methods such as search
import re
pattern = re.compile('SPAM’)
if pattern.search('X-DSPAM-Result'):
print('yes')
yes
Wild-Card Characters
• The dot character matches any one character
• If you add the asterisk character, the character is “repeating
zero-or more times”
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
^X.*:
Match any one character
Wild-Card Characters
• What if the text is as follows
^X-\S+:
Match any one non-
whitespace character
Matching and Extracting Data
• If we actually want the matching strings to be extracted, we
use re.findall(). The return is a list.
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[AEIOU]+',x)
>>> print(y)
[]
[AEIOU]+
>>> import re
>>> x = 'From: Using Fun : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using Fun :']
One or more
characters
^F.+:
>>> import re
>>> x = 'From: Using Fun : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']
^F.+?:
One or more characters
but not greedy
Fine-Tuning String Extraction
• Further fine-tune our match
>>> y = re.findall('\S+@\S+',x)
>>> print(y)
['stephen.marquard@uct.ac.za’]
\S+@\S+
<(\S+@\S+)>
Parsing using Regex
re.findall('@([^ ]*)',line)
Look through the string
until you find an at sign Match many of them
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall('\$[0-9.]+',x)
>>> print(y[0])
$10.00
At least one
or more
\$[0-9.]+
A real dollar sign A digit or period
More Escape
\b Matches the specified characters that are at the beginning or at
the end of a word. (use with raw string pattern)
\B Matches the specified characters that are not at the beginning or
at the end of a word. (use with raw string pattern)
\d Matches any decimal digit; equivalent to the set [0-9].
\D Matches any non-digit character; equivalent to the set [ˆ0-9].
\n Newline
\t Matches with a tab
\. Matches with a dot
\\ Matches with a backslash
\/ Matches with a slash
\[ Matches with a left square bracket
https://www.w3schools.com/python/python_regex.asp
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall('\$\d+’, x)
>>> print(y[0])
$10
Summary
• What is regular expression
• Import re
• Using re.search()
• Compiling a pattern
• Using re.findall()
• Fine tuning string extraction
• Non-greedy matching
• Using escape
Exercise 2
• Extract all the revision numbers from “mbox_short.txt”. A
sample line in the file is
New Revision: 39756
• The revision number in this line is 39756
Exercise 3
• Extract all the senders’ IP addresses from “mbox_short.txt”.
A sample line in the file is
Received: from murder (mail.umich.edu [141.211.14.90])
• The IP address in this line is 141.211.14.90
Exercise 4
• Extract all the url that starts with ‘http’ or ‘https’ from
“mbox_short.txt”. Examples of a url are
https://collab.sakaiproject.org/portal
http://source.sakaiproject.org/viewsvn/?view=rev&rev=39766