HKU - 7001 - 3.1 Managing Data I

Managing Data I
MSBA7001 Business Intelligence and Analytics

HKU Business School
The University of Hong Kong
Instructor: Dr. DING Chao

About this course
Text, CSV, JSON
Regular Expression
Managing NumPy
Data
Pandas
StatsModels
Web Beautiful Soup

Scraping
Tableau
Data
Visualization Matplotlib
Agenda
• Sources of Data
• Reading from and Writing to Files
TEXT
CSV
JSON
• Modules and Packages
• Regular Expressions (Regex)
Sources of Data
Where do we find and store data?
text
csv
In Files JSON
Excel
Database
HTML
On Web APIs
Text
Text File Structure
• A text file can be thought of as a sequence of lines
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Return-Path: <postmaster@collab.sakaiproject.org>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: source@collab.sakaiproject.org
From: stephen.marquard@uct.ac.za
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
This is an excerpt from mbox_short.txt

Text File Structure
• A text file has newlines at the end of each line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n

Return-Path: <postmaster@collab.sakaiproject.org>\n
Date: Sat, 5 Jan 2008 09:12:18 -0500\n
To: source@collab.sakaiproject.org\n
From: stephen.marquard@uct.ac.za\n
Subject: [sakai] svn commit: r39772 - content/branches/\n
\n
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n
Opening a File
• Before we can read the contents of the file (txt or csv or any
Python supported files), we must tell Python which file we
are going to work with and what we will be doing with the
file
• This is done with the open() function
• open() returns a “file handle” - a variable used to perform
operations on the file
• Similar to “File -> Open” in a Word Processor
Opening a File
handle = open(filepath/filename, mode)
handle = open('mbox.txt', 'r')
• It returns a handle which we use to manipulate the file

• filename is a string
• mode is optional and should be 'r' if we are planning to read
the file only and 'w' if we are going to write to the file only
More about mode
What is a Handle?
>>> handle = open('mbox.txt', 'r')

>>> print(handle)
<_io.TextIOWrapper name='mbox.txt' mode='r' encoding='UTF-8'>
Filenames and paths
• If the file is in the same directory with your code, simply
type the file name
• If it’s in the parallel directory, say the mbox.txt is stored in a

folder named “data”, use “..” to go up one level in the
directory, and then use “/” to locate the “data” folder. This is
called relative path
handle = open('../data/mbox.txt', 'r')
• Or, you can also use the absolute path, by using the full path
of the file such as “/home/MSBA7001/data/mbox.txt”
When Files are Missing
• Use try and except
>>> handle = open('stuff.txt')

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or
directory: 'stuff.txt'
Prompt for File Name
fname = input('Enter the file path: ')
handle = open(fname)
What if the file name/path is

incorrect, or file missing?
Use try and except
fpath = input('Enter the file path: ')

try:
handle = open(fpath)
except:
print('File cannot be opened:', fpath)
File Handle as a Sequence
• A file handle opens for read can be treated as a sequence of
strings where each line in the file is a string in the sequence
• We can use the for statement to iterate through a
sequence
• Remember - a sequence is an ordered set

for line in handle:
print(line)
Counting Lines in a File
• Open a file read-only
• Use a for loop to read each line
• Count the lines and print out the number of lines
handle = open(r'mbox.txt')
count = 0
for line in handle:
count = count + 1
print('Line Count:', count)
Line Count: 132045

Reading the *Whole* File
• We can read the whole file (newlines and all) into a single
string by using the read() method on the file handle.
>>> handle = open(r'mbox_short.txt')

>>> inp = handle.read()
>>> print(len(inp))
94626
>>> print(inp[:20])
From stephen.marquar
Reading Lines in the File
• Use readline() to read one line each time from the file.
• To read all the lines in a list we use readlines() method.
>>> handle = open(r'mbox_short.txt')

>>> line1 = handle.readline()
>>> lines = handle.readlines()
Searching Through a File
• We can put an if statement in the for loop to only print
lines that meet some criteria
handle = open(r'mbox_short.txt')
for line in handle :
if line.startswith('From:'):
print(line)
From: stephen.marquard@uct.ac.za Returns True/False
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
...
What are all these blank lines doing here?

Searching Through a File
• Each line from the file has a newline at the end
• The print statement adds a newline to each line
From: stephen.marquard@uct.ac.za\n
\n
From: louis@media.berkeley.edu\n
\n
From: zqian@umich.edu\n
\n
From: rjlowe@iupui.edu\n
\n
...
Searching Through a File (fixed)
• We can strip the whitespace from the right-hand side of the
string using rstrip() from the string library
• The newline is considered “white space” and is stripped
for line in handle:
line = line.rstrip()
print(line)
From: zqian@umich.edu
From: rjlowe@iupui.edu
....
Skipping with continue
• We can conveniently skip a line by using the continue
statement
for line in handle:
if not line.startswith('From:'):
continue
print(line)
Using in to Select Lines
• We can look for a string anywhere in a line as our selection
criteria using in or not in
for line in handle:
if not '@uct.ac.za' in line:
continue
print(line)

X-Authentication-Warning: set sender to stephen.marquard@uct.ac.za using –f
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan 4 07:02:32 2008
X-Authentication-Warning: set sender to david.horwitz@uct.ac.za using -f...
Closing a File
• After opening a file, we should always close it with the
close() method.
>>> handle = open(r'mbox.txt')

>>> handle.close()
Closing a File – An Alternative
• In real life scenarios we should try to use the with
statement. It will take care of closing the file for us.
with open(r'mbox.txt','r') as handle:

do something with handle
handle = open(r'mbox.txt','r')
do something with handle
handle.close()
Writing to a Text File
• Let us open a text file then write some random text into it
by using the write() method.
The return value is the

number of characters that
>>> fout = open(r'output.txt', 'w', newline = '')
were written.
>>> line1 = "This here's the wattle,\n"
>>> fout.write(line1)
24
>>> line2 = "the emblem of our land.\n"
>>> fout.write(line2)
24
>>> fout.close()
When you are done writing,

you should close the file. It must be a string
CSV
Reading from/Writing to a CSV File
• A CSV (comma separated values) file is a readable text file
where each line has different fields separated by a comma.
Opened in text editor Opened in Excel
SN, Name, City SN Name City

1, Michael, New Jersey
1 Michael New Jersey
2, Jack, California
3, Donald, Texas 2 Jack California
3 Donald Texas
Reading from/Writing to a CSV File
• The csv module/library implements classes to read and
write tabular data in csv format.
• writer() build a csv writer object.
• We can use the writerow(), or the writerows()
method to write a list or tuple or a list of lists to a csv file.
import csv
csvdata = [['Peter', '22'], ['Jasmine', '21'], ['Sam', '24']]

with open('person.csv', 'w') as outfile:
to_write = csv.writer(outfile)
to_write.writerow(['name', 'age'])
to_write.writerows(csvdata)
Reading/Writing to a CSV File
• The csv module’s reader() build a csv reader object.
• It reads each row in a list.
import csv
with open('person.csv', 'r') as infile:

to_read = csv.reader(infile)
for row in to_read:
print(row)
[‘name', 'Age']
['Peter', '22']
['Jasmine', '21']
['Sam', '24']
JSON
What is JSON
• JSON: JavaScript Object Notation.
• JSON is a syntax for storing and exchanging data.
• It is still text, treat it as strings. It’s called JSON strings.
• It is easy for humans to read and write.
• It is easy for machines to parse and generate.
{"employees":[
{ "firstName":"John", "lastName":"Doe" },
{ "firstName":"Anna", "lastName":"Smith" },
{ "firstName":"Peter", "lastName":"Jones" }
]}
JSON Syntax Rules
• Data is in key/value pairs, like a dictionary in Python
• Pairs is separated by commas
• keys must be strings, written with double quote
• values must be one of the following data types:
 a string { "name":"John" }
 a number { "age":30 }
 an object (JSON object)
 an array
 a boolean { "sale":true }
 null { "middlename":null }
JSON Syntax Rules
• Curly braces hold objects, like dictionaries in a dictionary in
Python
{
"employee":{ "name":"John", "age":
30, "city":"New York" }
}
• Square brackets hold arrays, like lists in a dictionary in

Python
{
"employees":[ "John", "Anna", "Peter" ]
}
JSON & Python – loads()
• First import the json module
• loads() converts a JSON string to a Python dictionary
import json
x = '''{"Course Code":"MSBA7001",
"Instructor":"Chao Ding", "Capacity":90}'''
y = json.loads(x)
print(y)
{'Course Code': 'MSBA7001', 'Instructor': 'Chao Ding', 'Capacity': 90}
print(y['Capacity'])
90
• Read a json file
import json
json_handle = open(r'../data/json_example.json', 'r')

json_string = json_handle.read()
json_dict = json.loads(json_string)
type(json_dict)
dict
json_dict
{'quiz': {'sport': {'q1': {'question': 'Which one is correct team name in NBA?',
'options': ['New York Bulls',
'Los Angeles Kings',
'Golden State Warriros',
'Huston Rocket'],
'answer': 'Huston Rocket'}},
'maths': {'q1': {'question': '5 + 7 = ?',
'options': ['10', '11', '12', '13'],
'answer': '12'},
'q2': {'question': '12 - 8 = ?',
'options': ['1', '2', '3', '4'],
'answer': '4'}}}}
json_dict['quiz']['maths']['q2']['options'][1]
'2'
JSON & Python – dumps()
• dumps() converts a Python object to a JSON string
import json
x = {
'name': 'John',
'age': 30,
'city': 'New York'
}
print(json.dumps(x))
{"name": "John", "age": 30, "city": "New York"}

JSON & Python – dumps()
• We can add two more arguments to the method
print(json.dumps(x, indent = 2))
{
"name": "John",
"age": 30,
"city": "New York"
}
print(json.dumps(x, indent = 4, sort_keys = True))

{
"age": 30,
"city": "New York",
"name": "John"
}
JSON & Python
loads()
dumps()
JSON Python
object dict
array list
string str
number (int) int
number (real) float
true True
false False
null None
Summary
• File structure
• Opening a file
• Reading a file
• Searching for lines
• Reading file names
• Dealing with bad files
• Writing a File
• Reading a CSV file
• JSON and Python
Exercise 1
• Read file “mbox.txt”. Find and print all the domain names of
email addresses from lines that start with ‘From’.
• For example:
• uct.ac.za is the domain name
• Then, write the domain names to a file named

“domain_name.txt”
Modules and Packages
Modules
• A module is a Python file whose name ends with .py.
• Objects (names) defined in a module can be imported and
used elsewhere.
import math
from math import pi, sin
from random import random as rd
from random import *
• When calling the methods/functions, the names will be

different depending on how they are imported
Packages
• Python allows you to group
related modules together method1
as a package. method2
method3
• The structure of a Python
package with subpackages
• A package is a directory on
the Python search path.
Importing anything from method1
method2
the package will execute
__init__.py and insert the
resulting module into
sys.modules under the
package name.
Packages
• We can import a module from a package using dot syntax.
import package.module
from package.module import method
• Packages themselves may contain packages; these are

subpackages. To access subpackages, you just need to use a
few more dots.
import package.subpackage.module
from package.subpackage import module
Regular Expressions
(Regex)
Regular Expressions
• In computing, a regular expression, also referred to as
“regex” or “regexp”, provides a concise and flexible means
for matching strings of text, such as particular characters,
words, or patterns of characters.
• A regular expression is written in a formal language that can
be interpreted by a regular expression processor.
• Really clever “wild card” expressions for matching and
parsing strings
Regular Expressions
• Our data file may include millions of lines, we want to
extract a specific section of data, e.g., the date and time the
email was sent, or the email addresses. Regular expressions
provide a great way to do it.

Return-Path: <postmaster@collab.sakaiproject.org>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: source@collab.sakaiproject.org
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of “marker characters” - programming with
characters
• It is kind of an “old school” language - compact
Regular Expression Quick Guide
^ Matches the beginning of a line

$ Matches the end of the line
. Matches any one character (except for newline)
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
The Regular Expression Module
• Before you can use regular expressions in your program, you
must import the module using import re
• You can use re.search() to see if a string matches a
regular expression, similar to using the find() method for
strings
• You can use re.findall() to extract portions of a string
that match your regular expression, similar to a combination
of find() and string slicing
Using re.search() like find()
• re.search() returns a True/None depending on whether
the string matches the regular expression
handle = open('mbox_short.txt') Returns an integer (position)
for line in handle: -1 if not found
if line.find('From:') >= 0:
print(line) From: stephen.marquard@uct.ac.za
import re From: zqian@umich.edu
handle = open('mbox_short.txt')
for line in handle:
if re.search('From:', line):
print(line)
Returns True/None
Using re.search() like startswith()
for line in handle:
print(line)
Returns True/False
import re
for line in handle:
if re.search('^From:', line):
print(line)
There is a hat ^
Compiling a pattern
• We can use the compile() method to compile a regular
expression pattern into a regular expression object, which
can be used in other methods such as search
import re
pattern = re.compile('SPAM’)
if pattern.search('X-DSPAM-Result'):
print('yes')
yes
Wild-Card Characters
• The dot character matches any one character
• If you add the asterisk character, the character is “repeating
zero-or more times”
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
Match the start

Match the previous
of the line
character zero or more times
^X.*:
Match any one character
Wild-Card Characters
• What if the text is as follows

X-Plane is behind schedule: two weeks
X-: Very short
• You only want to keep X-some letter

Fine-Tuning Your Match
• Depending on how “clean” your data is and the purpose of
your application, you may want to narrow your match down
a bit
X-: Very Short
X-Plane is behind schedule: two weeks
Match the previous character

Match the start
one or more times
of the line
^X-\S+:
Match any one non-
whitespace character
Matching and Extracting Data
• If we actually want the matching strings to be extracted, we
use re.findall(). The return is a list.
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']
[0-9]+ What happens without the + ?
One digit One or more

between 0 and 9 times
Matching and Extracting Data
• When we use re.findall(), it returns an empty list if
there is no match
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[AEIOU]+',x)
>>> print(y)
[]
[AEIOU]+
Any one letter in

the brackets
Warning: Greedy Matching
• The repeat characters (* and +) push outward in both
directions (greedy) to match the largest possible string
>>> import re
>>> x = 'From: Using Fun : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using Fun :']
One or more
characters
^F.+:
First character in the

Last character in the
match is an F
match is a :
Non-Greedy Matching
• Not all regular expression repeat codes are greedy! If you
add a ? character, the + and * chill out a bit...
>>> import re
>>> x = 'From: Using Fun : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']
^F.+?:
One or more characters
but not greedy
Fine-Tuning String Extraction
• Further fine-tune our match
x = ‘From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008’
>>> y = re.findall('\S+@\S+',x)
>>> print(y)
['stephen.marquard@uct.ac.za’]
\S+@\S+
At least one non-

whitespace character
what if x = ‘From <source@collab.sakaiproject.org> Sat Jan…’

Fine-Tuning String Extraction
• We can refine the match for re.findall() and separately
determine which portion of the match is to be extracted by
using parentheses
• Parentheses are not part of the match - but they tell where
to start and stop what string to extract
• Again, it returns a list
x = ‘From <stephen.marquard@uct.ac.za> Sat Jan 5 09:14:16 2008’
>>> pattern = re.compile('<(\S+@\S+)>')

>>> y = pattern.findall(x)
>>> print(y)
['stephen.marquard@uct.ac.za']
<(\S+@\S+)>
Parsing using Regex
line = ‘From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008’
Extract the non-blank

characters
re.findall('@([^ ]*)',line)
Look through the string
until you find an at sign Match many of them
Match non-blank character

Parsing using Regex
line = ‘From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008’
re.findall('^From .*@([^ ]*)',line)
Starting at the beginning of the

line, look for the string 'From ' Skip a bunch of characters,
looking for an at sign
Escape Character
• If you want a special regular expression character to just
behave normally (most of the time) you prefix it with '\'
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall('\$[0-9.]+',x)
>>> print(y[0])
$10.00
At least one
or more
\$[0-9.]+
A real dollar sign A digit or period
More Escape
\b Matches the specified characters that are at the beginning or at
the end of a word. (use with raw string pattern)
\B Matches the specified characters that are not at the beginning or
at the end of a word. (use with raw string pattern)
\d Matches any decimal digit; equivalent to the set [0-9].
\D Matches any non-digit character; equivalent to the set [ˆ0-9].
\n Newline
\t Matches with a tab
\. Matches with a dot
\\ Matches with a backslash
\/ Matches with a slash
\[ Matches with a left square bracket
https://www.w3schools.com/python/python_regex.asp
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall('\$\d+’, x)
>>> print(y[0])
$10
Summary
• What is regular expression
• Import re
• Using re.search()
• Compiling a pattern
• Using re.findall()
• Fine tuning string extraction
• Non-greedy matching
• Using escape
Exercise 2
• Extract all the revision numbers from “mbox_short.txt”. A
sample line in the file is
New Revision: 39756
• The revision number in this line is 39756
Exercise 3
• Extract all the senders’ IP addresses from “mbox_short.txt”.
A sample line in the file is
Received: from murder (mail.umich.edu [141.211.14.90])
• The IP address in this line is 141.211.14.90
Exercise 4
• Extract all the url that starts with ‘http’ or ‘https’ from
“mbox_short.txt”. Examples of a url are
https://collab.sakaiproject.org/portal
http://source.sakaiproject.org/viewsvn/?view=rev&rev=39766

HKU - 7001 - 3.1 Managing Data I

Uploaded by

Copyright:

Available Formats

You might also like

HKU - 7001 - 3.1 Managing Data I

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

HKU - 7001 - 3.1 Managing Data I

Uploaded by

Copyright:

Available Formats

Managing Data I

MSBA7001 Business Intelligence and Analytics

Instructor: Dr. DING Chao

Web Beautiful Soup

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

This is an excerpt from mbox_short.txt

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n

handle = open(filepath/filename, mode)

handle = open('mbox.txt', 'r')

• It returns a handle which we use to manipulate the file

>>> handle = open('mbox.txt', 'r')

• If it’s in the parallel directory, say the mbox.txt is stored in a

>>> handle = open('stuff.txt')

What if the file name/path is

Use try and except

fpath = input('Enter the file path: ')

handle = open('mbox.txt', 'r')

Line Count: 132045

>>> handle = open(r'mbox_short.txt')

>>> handle = open(r'mbox_short.txt')

From: stephen.marquard@uct.ac.za Returns True/False

What are all these blank lines doing here?

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

>>> handle = open(r'mbox.txt')

with open(r'mbox.txt','r') as handle:

The return value is the

When you are done writing,

Opened in text editor Opened in Excel

SN, Name, City SN Name City

csvdata = [['Peter', '22'], ['Jasmine', '21'], ['Sam', '24']]

with open('person.csv', 'r') as infile:

• Square brackets hold arrays, like lists in a dictionary in

{'Course Code': 'MSBA7001', 'Instructor': 'Chao Ding', 'Capacity': 90}

json_handle = open(r'../data/json_example.json', 'r')

{"name": "John", "age": 30, "city": "New York"}

print(json.dumps(x, indent = 2))

print(json.dumps(x, indent = 4, sort_keys = True))

• Then, write the domain names to a file named

• When calling the methods/functions, the names will be

• Packages themselves may contain packages; these are

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

^ Matches the beginning of a line

Match the start

X-Sieve: CMU Sieve 2.3

• You only want to keep X-some letter

Match the previous character

[0-9]+ What happens without the + ?

One digit One or more

Any one letter in

First character in the

x = ‘From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008’

At least one non-

what if x = ‘From <source@collab.sakaiproject.org> Sat Jan…’

>>> pattern = re.compile('<(\S+@\S+)>')

line = ‘From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008’

Extract the non-blank

Match non-blank character

line = ‘From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008’

re.findall('^From .@([^ ])',line)