HKU - 7001 - 3.1 Managing Data I


Managing Data I

MSBA7001 Business Intelligence and Analytics


HKU Business School
The University of Hong Kong

Instructor: Dr. DING Chao


About this course
• Managing Data: Text, CSV, JSON; Regular Expression; NumPy; Pandas; StatsModels
• Web Scraping: Beautiful Soup
• Data Visualization: Tableau; Matplotlib
Agenda
• Sources of Data
• Reading from and Writing to Files
TEXT
CSV
JSON
• Modules and Packages
• Regular Expressions (Regex)
Sources of Data
Where do we find and store data?
• In files: text, CSV, JSON, Excel, database
• On the web: HTML, APIs
Text
Text File Structure
• A text file can be thought of as a sequence of lines

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008


Return-Path: <postmaster@collab.sakaiproject.org>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: source@collab.sakaiproject.org
From: stephen.marquard@uct.ac.za
Subject: [sakai] svn commit: r39772 - content/branches/

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

This is an excerpt from mbox_short.txt


Text File Structure
• A text file has newlines at the end of each line

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n


Return-Path: <postmaster@collab.sakaiproject.org>\n
Date: Sat, 5 Jan 2008 09:12:18 -0500\n
To: source@collab.sakaiproject.org\n
From: stephen.marquard@uct.ac.za\n
Subject: [sakai] svn commit: r39772 - content/branches/\n
\n
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n
Opening a File
• Before we can read the contents of a file (txt, csv, or any other
file type Python can handle), we must tell Python which file we
are going to work with and what we will be doing with it
• This is done with the open() function
• open() returns a “file handle” - a variable used to perform
operations on the file
• Similar to “File -> Open” in a Word Processor
Opening a File

handle = open(filepath/filename, mode)

handle = open('mbox.txt', 'r')

• It returns a handle which we use to manipulate the file


• filename is a string
• mode is optional and defaults to 'r'. Use 'r' if we are planning to
only read the file and 'w' if we are going to only write to it
More about mode
What is a Handle?

>>> handle = open('mbox.txt', 'r')


>>> print(handle)
<_io.TextIOWrapper name='mbox.txt' mode='r' encoding='UTF-8'>
Filenames and paths
• If the file is in the same directory as your code, simply
type the file name
handle = open('mbox.txt', 'r')

• If it is in a sibling directory, say mbox.txt is stored in a
folder named “data” next to the folder that holds your code, use “..”
to go up one level in the directory tree, and then use “/” to locate
the “data” folder. This is called a relative path
handle = open('../data/mbox.txt', 'r')

• Or, you can use the absolute path, i.e., the full path
of the file, such as “/home/MSBA7001/data/mbox.txt”
When Files are Missing
• Use try and except

>>> handle = open('stuff.txt')


Traceback (most recent call last):
File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or
directory: 'stuff.txt'
Prompt for File Name
fname = input('Enter the file path: ')
handle = open(fname)

What if the file name/path is incorrect, or the file is missing?

Use try and except

fpath = input('Enter the file path: ')
try:
    handle = open(fpath)
except OSError:
    print('File cannot be opened:', fpath)
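A minimal sketch of the same pattern wrapped in a function, so it can be tried without typing input. The helper name open_or_none and the path used below are hypothetical; catching OSError is one reasonable choice here because FileNotFoundError is one of its subclasses.

```python
def open_or_none(fpath):
    """Return an open file handle, or None if the file cannot be opened."""
    try:
        return open(fpath)
    except OSError:
        # FileNotFoundError and PermissionError are both subclasses of OSError
        print('File cannot be opened:', fpath)
        return None


# Hypothetical path that is assumed not to exist
handle = open_or_none('no_such_dir/no_such_file.txt')
```

The caller can then test "if handle is None" before trying to read.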
File Handle as a Sequence
• A file handle opened for reading can be treated as a sequence of
strings, where each line in the file is a string in the sequence
• We can use the for statement to iterate through a
sequence
• Remember - a sequence is an ordered set

handle = open('mbox.txt', 'r')
for line in handle:
    print(line)
Counting Lines in a File
• Open a file read-only
• Use a for loop to read each line
• Count the lines and print out the number of lines

handle = open(r'mbox.txt')
count = 0
for line in handle:
    count = count + 1
print('Line Count:', count)

Line Count: 132045
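To try the counting loop without the real mbox.txt, here is a self-contained sketch that first writes a small stand-in file to a temporary directory. The file name sample.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small stand-in file (hypothetical content, not the real mbox.txt)
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write('From: a@example.com\nSubject: hello\n\nBody text\n')
fout.close()

# Count the lines, one loop iteration per line (blank lines count too)
handle = open(path)
count = 0
for line in handle:
    count = count + 1
handle.close()
print('Line Count:', count)
```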


Reading the *Whole* File
• We can read the whole file (newlines and all) into a single
string by using the read() method on the file handle.

>>> handle = open(r'mbox_short.txt')


>>> inp = handle.read()
>>> print(len(inp))
94626
>>> print(inp[:20])
From stephen.marquar
Reading Lines in the File
• Use readline() to read one line at a time from the file.
• To read all the remaining lines into a list, use the readlines() method.
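A minimal sketch of how read(), readline() and readlines() relate, using a small temporary file with made-up content rather than mbox_short.txt. Note that readlines() only returns the lines that have not been read yet.

```python
import os
import tempfile

# Write a three-line stand-in file (hypothetical content)
path = os.path.join(tempfile.mkdtemp(), 'three_lines.txt')
fout = open(path, 'w')
fout.write('first\nsecond\nthird\n')
fout.close()

handle = open(path)
line1 = handle.readline()    # reads only the first line, newline included
rest = handle.readlines()    # reads what is left, as a list of strings
handle.close()

handle = open(path)
whole = handle.read()        # reads everything into one string
handle.close()

print(repr(line1))
print(rest)
print(len(whole))
```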

>>> handle = open(r'mbox_short.txt')


>>> line1 = handle.readline()
>>> lines = handle.readlines()
Searching Through a File
• We can put an if statement in the for loop to only print
lines that meet some criteria
handle = open(r'mbox_short.txt')
for line in handle:
    if line.startswith('From:'):    # startswith() returns True/False
        print(line)

From: stephen.marquard@uct.ac.za

From: louis@media.berkeley.edu

From: zqian@umich.edu

From: rjlowe@iupui.edu
...

What are all these blank lines doing here?


Searching Through a File
• Each line from the file has a newline at the end
• The print() function adds another newline after each line

From: stephen.marquard@uct.ac.za\n
\n
From: louis@media.berkeley.edu\n
\n
From: zqian@umich.edu\n
\n
From: rjlowe@iupui.edu\n
\n
...
Searching Through a File (fixed)
• We can strip the whitespace from the right-hand side of the
string using the rstrip() string method
• The newline is considered “white space” and is stripped

handle = open(r'mbox_short.txt')
for line in handle:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
....
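The effect of rstrip() is easier to see with repr(), which shows the newline character explicitly. This is only an illustrative sketch on a single hard-coded line; the end='' argument of print() is shown as an alternative way to avoid the doubled newline.

```python
line = 'From: zqian@umich.edu\n'

# repr() makes the trailing newline visible
print(repr(line))

stripped = line.rstrip()
print(repr(stripped))

# Alternative: keep the line as it is and tell print() not to add its own newline
print(line, end='')
```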
Skipping with continue
• We can conveniently skip a line by using the continue
statement

handle = open(r'mbox_short.txt')
for line in handle:
    line = line.rstrip()
    if not line.startswith('From:'):
        continue
    print(line)
Using in to Select Lines
• We can look for a string anywhere in a line as our selection
criteria using in or not in
handle = open(r'mbox_short.txt')
for line in handle:
    line = line.rstrip()
    if '@uct.ac.za' not in line:
        continue
    print(line)

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
X-Authentication-Warning: set sender to stephen.marquard@uct.ac.za using -f
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan 4 07:02:32 2008
X-Authentication-Warning: set sender to david.horwitz@uct.ac.za using -f
...
Closing a File
• After opening a file, we should always close it with the
close() method.

>>> handle = open(r'mbox.txt')


>>> handle.close()
Closing a File – An Alternative
• In real-life scenarios we should try to use the with
statement. It takes care of closing the file for us, even if an
error occurs inside the block.

with open(r'mbox.txt','r') as handle:
    ...  # do something with handle

instead of

handle = open(r'mbox.txt','r')
...  # do something with handle
handle.close()
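A small sketch that checks the claim: the closed attribute of a file handle is False inside the with block and True right after it. The temporary file and its content are made up for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'hello.txt')

# Write a one-line file; the with statement closes it when the block ends
with open(path, 'w') as fout:
    fout.write('hello\n')

with open(path, 'r') as handle:
    first = handle.readline()
    inside = handle.closed    # still open inside the block

after = handle.closed         # closed automatically once the block ends
print(inside, after)
```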
Writing to a Text File
• Let us open a text file then write some random text into it
by using the write() method.

>>> fout = open(r'output.txt', 'w', newline = '')
>>> line1 = "This here's the wattle,\n"
>>> fout.write(line1)
24
>>> line2 = "the emblem of our land.\n"
>>> fout.write(line2)
24
>>> fout.close()

• The argument to write() must be a string.
• The return value is the number of characters that were written.
• When you are done writing, you should close the file.
CSV
Reading from/Writing to a CSV File
• A CSV (comma-separated values) file is a readable text file
where each line has fields separated by commas.

Opened in a text editor:
SN, Name, City
1, Michael, New Jersey
2, Jack, California
3, Donald, Texas

Opened in Excel:
SN    Name       City
1     Michael    New Jersey
2     Jack       California
3     Donald     Texas
Reading from/Writing to a CSV File
• The csv module implements classes to read and
write tabular data in CSV format.
• writer() builds a csv writer object.
• We can use the writerow() method to write one row (a list or tuple),
or the writerows() method to write several rows (a list of lists),
to a csv file.

import csv

csvdata = [['Peter', '22'], ['Jasmine', '21'], ['Sam', '24']]

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('person.csv', 'w', newline='') as outfile:
    to_write = csv.writer(outfile)
    to_write.writerow(['name', 'age'])
    to_write.writerows(csvdata)
Reading/Writing to a CSV File
• The csv module’s reader() builds a csv reader object.
• It reads each row as a list of strings.

import csv

with open('person.csv', 'r', newline='') as infile:
    to_read = csv.reader(infile)
    for row in to_read:
        print(row)

['name', 'age']
['Peter', '22']
['Jasmine', '21']
['Sam', '24']
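Because reader() returns every field as a string, numeric columns usually need converting. The following is a sketch of a full write-then-read round trip in a temporary directory; the use of next() to pull off the header row and the dictionary of ages are illustrative choices, not part of the slides.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'person.csv')
csvdata = [['Peter', '22'], ['Jasmine', '21'], ['Sam', '24']]

# Write the header and the data rows
with open(path, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['name', 'age'])
    writer.writerows(csvdata)

# Read them back: take the header first, then convert each age to int
with open(path, 'r', newline='') as infile:
    reader = csv.reader(infile)
    header = next(reader)
    ages = {name: int(age) for name, age in reader}

print(header)
print(ages)
```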
JSON
What is JSON
• JSON: JavaScript Object Notation.
• JSON is a syntax for storing and exchanging data.
• It is still text, so treat it as a string. Such text is called a JSON string.
• It is easy for humans to read and write.
• It is easy for machines to parse and generate.
{"employees":[
{ "firstName":"John", "lastName":"Doe" },
{ "firstName":"Anna", "lastName":"Smith" },
{ "firstName":"Peter", "lastName":"Jones" }
]}
JSON Syntax Rules
• Data is in key/value pairs, like a dictionary in Python
• Pairs are separated by commas
• Keys must be strings, written with double quotes
• Values must be one of the following data types:
◦ a string { "name":"John" }
◦ a number { "age":30 }
◦ an object (JSON object)
◦ an array
◦ a boolean { "sale":true }
◦ null { "middlename":null }
JSON Syntax Rules
• Curly braces hold objects, like a dictionary nested in a dictionary in
Python
{
"employee":{ "name":"John", "age":30, "city":"New York" }
}

• Square brackets hold arrays, like a list in a dictionary in
Python
{
"employees":[ "John", "Anna", "Peter" ]
}
JSON & Python – loads()
• First import the json module
• loads() converts a JSON string to a Python object (a JSON object becomes a dictionary)

import json
x = '''{"Course Code":"MSBA7001",
"Instructor":"Chao Ding", "Capacity":90}'''
y = json.loads(x)
print(y)

{'Course Code': 'MSBA7001', 'Instructor': 'Chao Ding', 'Capacity': 90}

print(y['Capacity'])
90
JSON & Python – loads()
• Read a json file

import json

json_handle = open(r'../data/json_example.json', 'r')


json_string = json_handle.read()
json_dict = json.loads(json_string)
type(json_dict)

dict
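As an alternative to read() followed by loads(), the json module also provides load() and dump(), which work directly on a file handle. A minimal sketch, assuming a temporary file and a small made-up dictionary instead of json_example.json:

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'course.json')
course = {"Course Code": "MSBA7001", "Capacity": 90}

# dump() writes a Python object to an open file as JSON text
with open(path, 'w') as json_handle:
    json.dump(course, json_handle)

# load() parses JSON straight from an open file
with open(path, 'r') as json_handle:
    loaded = json.load(json_handle)

print(type(loaded), loaded['Capacity'])
```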
JSON & Python – loads()
json_dict

{'quiz': {'sport': {'q1': {'question': 'Which one is correct team name in NBA?',
'options': ['New York Bulls',
'Los Angeles Kings',
'Golden State Warriros',
'Huston Rocket'],
'answer': 'Huston Rocket'}},
'maths': {'q1': {'question': '5 + 7 = ?',
'options': ['10', '11', '12', '13'],
'answer': '12'},
'q2': {'question': '12 - 8 = ?',
'options': ['1', '2', '3', '4'],
'answer': '4'}}}}

json_dict['quiz']['maths']['q2']['options'][1]

'2'
JSON & Python – dumps()
• dumps() converts a Python object to a JSON string

import json

x = {
'name': 'John',
'age': 30,
'city': 'New York'
}

print(json.dumps(x))

{"name": "John", "age": 30, "city": "New York"}


JSON & Python – dumps()
• We can pass two more arguments, indent and sort_keys, to dumps()

print(json.dumps(x, indent = 2))

{
  "name": "John",
  "age": 30,
  "city": "New York"
}

print(json.dumps(x, indent = 4, sort_keys = True))

{
    "age": 30,
    "city": "New York",
    "name": "John"
}
JSON & Python
• loads(): JSON → Python
• dumps(): Python → JSON

JSON             Python
object           dict
array            list
string           str
number (int)     int
number (real)    float
true             True
false            False
null             None
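The type mapping can be checked directly. This sketch parses a made-up JSON string that contains one value of each kind and records the Python type that loads() produces for it.

```python
import json

# Hypothetical JSON text with one value of each kind
text = '''{"name": "John", "age": 30, "height": 1.8,
           "tags": ["a", "b"], "address": {"city": "New York"},
           "sale": true, "middlename": null}'''

data = json.loads(text)

# Record the Python type name produced for each key
types = {key: type(value).__name__ for key, value in data.items()}
print(types)

# dumps() maps the Python values back to JSON (True -> true, None -> null)
back = json.dumps(data)
print(back)
```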
Summary
• File structure
• Opening a file
• Reading a file
• Searching for lines
• Reading file names
• Dealing with bad files
• Writing a File
• Reading a CSV file
• JSON and Python
Exercise 1
• Read file “mbox.txt”. Find and print all the domain names of
email addresses from lines that start with ‘From’.
• For example:
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
• uct.ac.za is the domain name

• Then, write the domain names to a file named
“domain_name.txt”
Modules and Packages
Modules
• A module is a Python file whose name ends with .py.
• Objects (names) defined in a module can be imported and
used elsewhere.
import math
from math import pi, sin
from random import random as rd
from random import *

• When calling the methods/functions, the names will be
different depending on how they are imported
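A short sketch of how the call changes with the import style, using only the standard math and random modules:

```python
import math                        # names are reached through the module: math.sqrt
from math import pi, sin           # names are used directly: pi, sin
from random import random as rd    # the function is used under the alias rd

a = math.sqrt(16)
b = sin(pi / 2)
c = rd()                           # a random float in [0.0, 1.0)

print(a, b, c)
```

The form "from random import *" also makes names available directly, but it hides where each name came from, so it is usually avoided in larger programs.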
Packages
• Python allows you to group related modules together as a package.
• A package is a directory on the Python search path. Importing
anything from the package will execute __init__.py and insert the
resulting module into sys.modules under the package name.
[Figure: the structure of a Python package with subpackages; each module holds its own methods (method1, method2, method3)]
Packages
• We can import a module from a package using dot syntax.

import package.module
from package.module import method

• Packages themselves may contain packages; these are
subpackages. To access subpackages, you just need to use a
few more dots.
import package.subpackage.module
from package.subpackage import module
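To see the dot syntax at work, the sketch below builds a tiny throwaway package in a temporary directory and imports from it. The names demo_pkg, sub, tools and double are all hypothetical, and adding the temporary directory to sys.path is done only so the example is self-contained.

```python
import importlib
import os
import sys
import tempfile

# Build demo_pkg/sub/tools.py inside a temporary directory
root = tempfile.mkdtemp()
sub_dir = os.path.join(root, 'demo_pkg', 'sub')
os.makedirs(sub_dir)

files = {
    os.path.join(root, 'demo_pkg', '__init__.py'): '',
    os.path.join(sub_dir, '__init__.py'): '',
    os.path.join(sub_dir, 'tools.py'): 'def double(n):\n    return 2 * n\n',
}
for file_path, source in files.items():
    with open(file_path, 'w') as f:
        f.write(source)

# Put the directory that contains the package on the search path
sys.path.insert(0, root)
importlib.invalidate_caches()

from demo_pkg.sub import tools          # import a module from a subpackage
from demo_pkg.sub.tools import double   # import one function from that module

result = tools.double(21)
print(result, double(2))
```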
Regular Expressions
(Regex)
Regular Expressions
• In computing, a regular expression, also referred to as
“regex” or “regexp”, provides a concise and flexible means
for matching strings of text, such as particular characters,
words, or patterns of characters.
• A regular expression is written in a formal language that can
be interpreted by a regular expression processor.
• Really clever “wild card” expressions for matching and
parsing strings
Regular Expressions
• Our data file may include millions of lines, and we may want to
extract a specific piece of data, e.g., the date and time the
email was sent, or the email addresses. Regular expressions
provide a great way to do it.

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008


Return-Path: <postmaster@collab.sakaiproject.org>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: source@collab.sakaiproject.org
From: stephen.marquard@uct.ac.za
Subject: [sakai] svn commit: r39772 - content/branches/

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of “marker characters” - programming with
characters
• It is kind of an “old school” language - compact
Regular Expression Quick Guide

^ Matches the beginning of a line


$ Matches the end of the line
. Matches any one character (except for newline)
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
The Regular Expression Module
• Before you can use regular expressions in your program, you
must import the module using import re
• You can use re.search() to see if a string matches a
regular expression, similar to using the find() method for
strings
• You can use re.findall() to extract portions of a string
that match your regular expression, similar to a combination
of find() and string slicing
Using re.search() like find()
• re.search() returns a match object (which is treated as True) if the
pattern is found anywhere in the string, and None otherwise
handle = open('mbox_short.txt')
for line in handle:
    line = line.rstrip()
    if line.find('From:') >= 0:    # find() returns an integer position, or -1 if not found
        print(line)

import re

handle = open('mbox_short.txt')
for line in handle:
    line = line.rstrip()
    if re.search('From:', line):   # search() returns a match object or None
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
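A quick sketch of what re.search() actually hands back, using two hard-coded sample strings. The match object offers group() for the matched text and start() for its position.

```python
import re

hit = re.search('From:', 'From: zqian@umich.edu')
miss = re.search('From:', 'Subject: hello')

print(hit)     # a re.Match object, which counts as true in an if statement
print(miss)    # None, which counts as false

if hit:
    # group() is the matched text; start() is where the match begins
    print(hit.group(), hit.start())
```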
Using re.search() like startswith()

handle = open('mbox_short.txt')
for line in handle:
    line = line.rstrip()
    if line.startswith('From:'):   # startswith() returns True/False
        print(line)

import re

handle = open('mbox_short.txt')
for line in handle:
    line = line.rstrip()
    if re.search('^From:', line):  # the hat ^ anchors the match to the start of the line
        print(line)
Compiling a pattern
• We can use the compile() method to compile a regular
expression pattern into a regular expression object, which
can be used in other methods such as search

import re

pattern = re.compile('SPAM')
if pattern.search('X-DSPAM-Result'):
    print('yes')

yes
Wild-Card Characters
• The dot character matches any one character
• If you add the asterisk character, the preceding character is repeated
zero or more times
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain

^X.*:
^    Match the start of the line
.    Match any one character
*    Match the previous character zero or more times
Wild-Card Characters
• What if the text is as follows?

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
X-: Very short

• You only want to keep lines of the form X-, then some
non-whitespace characters, then a colon

Fine-Tuning Your Match
• Depending on how “clean” your data is and the purpose of
your application, you may want to narrow your match down
a bit
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-: Very Short
X-Plane is behind schedule: two weeks

^X-\S+:
^     Match the start of the line
\S    Match any one non-whitespace character
+     Match the previous character one or more times
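The difference between the loose and the fine-tuned pattern can be checked on the four sample lines. This is a sketch using list comprehensions; the variable names loose and strict are just labels for this example.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-: Very Short',
    'X-Plane is behind schedule: two weeks',
]

# ^X.*:  accepts anything between the X and a later colon, spaces included
loose = [s for s in lines if re.search(r'^X.*:', s)]

# ^X-\S+:  needs at least one non-whitespace character between X- and the colon
strict = [s for s in lines if re.search(r'^X-\S+:', s)]

print(len(loose))
print(strict)
```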
Matching and Extracting Data
• If we actually want the matching strings to be extracted, we
use re.findall(). The return is a list.

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']

[0-9]+
[0-9]    One digit between 0 and 9
+        One or more times

What happens without the + ?
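One way to explore the effect of the + is to run the same search with and without it. A minimal sketch on the same sentence:

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

with_plus = re.findall('[0-9]+', x)     # each run of digits is one match
without_plus = re.findall('[0-9]', x)   # each single digit is its own match

print(with_plus)
print(without_plus)
```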
Matching and Extracting Data
• When we use re.findall(), it returns an empty list if
there is no match

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[AEIOU]+',x)
>>> print(y)
[]

[AEIOU]+
[AEIOU]    Any one letter in the brackets (uppercase only, so nothing in x matches)
Warning: Greedy Matching
• The repeat characters (* and +) push outward in both
directions (greedy) to match the largest possible string

>>> import re
>>> x = 'From: Using Fun : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using Fun :']

^F.+:
^F    First character in the match is an F
.+    One or more characters
:     Last character in the match is a :
Non-Greedy Matching
• Not all regular expression repeat codes are greedy! If you
add a ? character, the + and * chill out a bit...

>>> import re
>>> x = 'From: Using Fun : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']

^F.+?:
One or more characters
but not greedy
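Greedy versus non-greedy matching matters most when a pattern can end in several places. The sketch below uses a made-up piece of HTML-like text, where the greedy pattern swallows everything up to the last closing bracket.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall('<.+>', html)    # stretches to the last '>' in the string
lazy = re.findall('<.+?>', html)     # stops at the first '>' each time

print(greedy)
print(lazy)
```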
Fine-Tuning String Extraction
• Further fine-tune our match

>>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['stephen.marquard@uct.ac.za']

\S+@\S+
\S+    At least one non-whitespace character

What if x = 'From <source@collab.sakaiproject.org> Sat Jan…'


Fine-Tuning String Extraction
• We can refine the match for re.findall() and separately
determine which portion of the match is to be extracted by
using parentheses
• Parentheses are not part of the match - but they tell where
to start and stop what string to extract
• Again, it returns a list
>>> x = 'From <stephen.marquard@uct.ac.za> Sat Jan 5 09:14:16 2008'
>>> pattern = re.compile(r'<(\S+@\S+)>')
>>> y = pattern.findall(x)
>>> print(y)
['stephen.marquard@uct.ac.za']

<(\S+@\S+)>
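A sketch comparing the pattern with and without parentheses on an address wrapped in angle brackets. Without a group, findall() returns the whole match, brackets included; with a group, it returns only the part inside the parentheses.

```python
import re

x = 'From <source@collab.sakaiproject.org> Sat Jan 5 09:14:16 2008'

# No parentheses: the angle brackets are non-whitespace, so they are kept
whole = re.findall(r'\S+@\S+', x)

# Parentheses mark the part to extract; the brackets must match but are dropped
inner = re.findall(r'<(\S+@\S+)>', x)

print(whole)
print(inner)
```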
Parsing using Regex

line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

re.findall('@([^ ]*)', line)
@       Look through the string until you find an at sign
[^ ]    Match a non-blank character
*       Match many of them
( )     Extract the non-blank characters

Parsing using Regex

line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

re.findall('^From .*@([^ ]*)', line)
^From     Starting at the beginning of the line, look for the string 'From '
.*@       Skip a bunch of characters, looking for an at sign
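The two parsing patterns give the same answer on a 'From ' line but behave differently on other lines that contain an at sign. A sketch with one extra sample line taken from the mbox excerpt:

```python
import re

line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
other = 'Return-Path: <postmaster@collab.sakaiproject.org>'

# Unanchored: matches after any at sign, on any kind of line
print(re.findall('@([^ ]*)', line))
print(re.findall('@([^ ]*)', other))

# Anchored: only lines that begin with 'From ' can match
domain = re.findall('^From .*@([^ ]*)', line)
skipped = re.findall('^From .*@([^ ]*)', other)
print(domain)
print(skipped)
```

Note that the unanchored pattern also keeps the trailing '>' on the Return-Path line, which is one reason to anchor the search.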
Escape Character
• If you want a special regular expression character to just
behave normally (most of the time), you prefix it with a backslash '\'

>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$[0-9.]+', x)
>>> print(y[0])
$10.00

\$[0-9.]+
\$        A real dollar sign
[0-9.]    A digit or period
+         One or more times
More Escape
\b Matches the specified characters that are at the beginning or at
the end of a word. (use with a raw string pattern)
\B Matches the specified characters that are not at the beginning or
at the end of a word. (use with a raw string pattern)
\d Matches any decimal digit; equivalent to the set [0-9].
\D Matches any non-digit character; equivalent to the set [^0-9].
\n Matches a newline
\t Matches a tab
\. Matches a dot
\\ Matches a backslash
\/ Matches a slash
\[ Matches a left square bracket

https://www.w3schools.com/python/python_regex.asp

>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$\d+', x)
>>> print(y[0])
$10
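A sketch of why \b calls for a raw string, plus a pattern that escapes the dot to capture the full amount. The sentence with the word 'cat' is made up for illustration.

```python
import re

x = 'We just received $10.00 for cookies.'

# \. is a literal dot, so the cents are included in the match
amount = re.findall(r'\$\d+\.\d+', x)

text = 'cat concat cat.'

# Raw string: \b reaches the regex engine as a word boundary
words = re.findall(r'\bcat\b', text)

# Ordinary string: Python turns '\b' into a backspace character first,
# so the pattern looks for backspaces and finds nothing
no_raw = re.findall('\bcat\b', text)

print(amount, words, no_raw)
```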
Summary
• What is regular expression
• Import re
• Using re.search()
• Compiling a pattern
• Using re.findall()
• Fine tuning string extraction
• Non-greedy matching
• Using escape
Exercise 2
• Extract all the revision numbers from “mbox_short.txt”. A
sample line in the file is
New Revision: 39756
• The revision number in this line is 39756
Exercise 3
• Extract all the senders’ IP addresses from “mbox_short.txt”.
A sample line in the file is
Received: from murder (mail.umich.edu [141.211.14.90])
• The IP address in this line is 141.211.14.90
Exercise 4
• Extract all the URLs that start with ‘http’ or ‘https’ from
“mbox_short.txt”. Examples of URLs are
https://collab.sakaiproject.org/portal
http://source.sakaiproject.org/viewsvn/?view=rev&rev=39766
