SSRN Id3943966


Using Python for Textual Accounting Research
(Version 2021.10.16.9)

Ken H. Guo
Ph.D., CPA/CMA (Canada)

WANTED
* MUST KNOW I.T. *

THE ACCOUNTANT
WITH NO FEAR

REWARD CAREER$
© Ken H. Guo 2021

First published online 2021


This version 2021.10.16.9

This book may be cited as:


Guo, K.H. (2021). Using Python for Textual Accounting Research. Available at SSRN:
https://ssrn.com/abstract=3943966.

Ken Guo is currently an associate professor of accounting at California State University, Fullerton.

For comments, suggestions, or requests for program code/datasets, please write to ken.guo@gmail.com.

Contents

Introduction

I Python Programming Language

1 Python Basics
1.1 Installation (Ubuntu)
1.2 Installation (Windows)
1.3 Installation (Mac)
1.4 Running Python (Ubuntu)
1.5 Running Python (Windows)
1.6 Tools for Writing Python Code

2 Python File Operations
2.1 Creating New File
2.2 Reading Existing File
2.3 Copying File

3 Data Structure: List
3.1 Defining and Using List
3.2 List of Lists

4 Data Structure: String
4.1 Using String to Build Table
4.2 Using String to Build HTML Table

5 Data Structure: String/List (2)
5.1 List Operations
5.2 Using List to Build HTML Table

6 Text Search/Count
6.1 Search Basics
6.2 Searching for Keywords

7 Data Structure: Dictionary
7.1 Dictionary
7.2 Saving Data in CSV

8 Functions
8.1 Condition Basics
8.2 Complex Conditions
8.3 Multiple Conditions
8.4 Defining and Using Functions

9 Display Progress
9.1 Enumerating Lists
9.2 Date/Time
9.3 Monitoring Progress

10 Downloading Files from the Internet

11 Cleaning HTML Code
11.1 Using Regular Expression
11.2 Using the lxml Package

II Analyzing Corporate Filings

12 Processing Filing Indices
12.1 Downloading Index Files
12.2 Processing Index Files

13 Downloading Filings

14 Cleaning HTML Code

15 Extracting Text Sections

16 Simple Analysis and Quality Check

III Additional Topics

17 Merging Data
17.1 Using Master Index Data
17.2 Using Audit Analytics Data
17.3 Extracting Data from 10-K

18 Sentiment Analysis
18.1 VADER Basics
18.2 Using VADER

19 Other Learning Materials

Introduction

Inspired by early works such as Smith and Taffler (1992) and Li (2008), accounting researchers
have paid more and more attention to the textual quality of corporate financial reports. A
significant challenge for interested researchers is that textual analysis often requires specific
computing tools. Researchers generally have two options: using commercial software packages
(e.g., Bonsall IV et al., 2017; Cazier et al., 2021) or developing computer programs (e.g., Li,
2008). While software packages can often offer textual measurements that are otherwise not
available, they are generally intended for individual use and hence lack the capability for bulk
processing (see, e.g., Bonsall IV et al., 2017). Developing computer programs, on the other hand,
can give researchers more flexibility in terms of what to measure and how many reports to process.
Computer programming, however, is not easy: most accounting researchers do not have any related
formal training and, as a result, have to teach themselves new programming languages before they
can do any meaningful textual research.[1]
In this book I aim to help accounting researchers who have little or no programming knowledge
jump-start their self-learning by introducing the Python language from a non-programmer's
perspective, and hopefully to make their initial learning curve less overwhelming.

[1] They can of course hire research assistants to develop computer programs. This is often not an
option for reasons such as budget constraints and limits on the hours research assistants are
allowed to work.


General Approaches and Book Structure

Toward that end, I take the following approaches to developing this book:

Top-down (pull) The overall approach of this book is top-down or pull. It begins with a
clear “accounting research objective”: bulk-downloading filings from SEC’s EDGAR
platform, conducting textual analysis (as simple as counting words), and then saving
the result in a portable format such as CSV. From this objective, we then try to figure out
how to write some Python code to do just that.
Narrow coverage Related to the top-down approach, this book covers only what is needed
to accomplish the research objective. In other words, it gives accounting researchers
just enough and is thus not intended for developing full-fledged programming packages.
For this reason, it omits many concepts, such as class and object, that are
nevertheless important in introductory computer-science textbooks.
Effectiveness focus Throughout this book I will focus on effectiveness and pay less attention
to efficiency. In other words, at the beginning it is more important to make a program
work than to make it faster. Solving efficiency issues such as algorithm optimization
requires advanced knowledge in computing and is thus beyond the scope of this book.
That being said, we will consider some simple efficiency issues without getting into
too many technical details.
Accounting examples Whenever possible, I try to use examples that accounting researchers
are likely familiar with. In particular, many examples are based on a real accounting
research project and can be readily “copy-pasted” by interested readers.
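
As a rough illustration of the research objective described above (count words, then save the result as CSV), here is a minimal sketch. The sample text, the file name wordcount.csv, and the column names are made up for illustration only; the book's own programs are developed step by step in later chapters and may look different.

```python
import csv

# Hypothetical stand-in for the text of one filing.
text = 'We believe that our results may be affected by risk factors.'

# The simplest possible "textual analysis": count the words.
n_words = len(text.split())

# Save the result in CSV format (hypothetical file and column names).
with open('wordcount.csv', 'w', newline='') as fw:
    writer = csv.writer(fw)
    writer.writerow(['filing', 'n_words'])
    writer.writerow(['sample', n_words])

print(n_words)
```

The resulting file can be opened in a spreadsheet program or in any statistical package.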

Given the above general approaches, this book is structured as follows:

Part I Python Programming Language The first part of the book covers some basic con-
cepts of Python. Although I take a top-down (or pull) approach, it is still necessary for
anyone to have some initial sense of the language in order to understand Python programs,
big or small. Any readers who have sufficient knowledge in Python can of course safely skip
this part.

Part II Analyzing Corporate Filings The second part of the book explains the general
steps in downloading, processing, and analyzing corporate financial reports. I will use 10-
K filings (available on SEC’s EDGAR platform) as an example to demonstrate step-by-step
procedures. The ultimate output of these procedures includes:

• Individual 10-K filings;


• Text extracted from 10-K filings (one text file per filing); and
• A data file for the result of textual analysis (in CSV format).

Part III Additional Topics The third part of the book covers several somewhat advanced topics
such as merging textual analysis results with other data and simple statistical analysis.

Testing Environment

A somewhat technical matter is related to the many different versions of the Python language.
For example, programs that worked perfectly in Python 2.0 may not work in Python
3.0; and there are some subtle differences among different operating systems. In other
words, in some cases code you copy-paste from this book may not work on your computer.
All programs included in this book have been tested on the following system:

• Ubuntu 20.04 with Python 3.8.10 (installed from the Ubuntu repository)

Note that Python is a collection of individual packages, each of which has its own version
number.
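
If a program from the book behaves differently on your computer, a first step is to check which Python version and operating system you are actually running. The following is a minimal sketch using only the standard library; the error message is just an illustration.

```python
import platform
import sys

# Report which Python version and operating system are running this code.
print('Python version:', platform.python_version())
print('Operating system:', platform.system())

# Stop early with a clear message if this is not Python 3.
if sys.version_info.major < 3:
    raise SystemExit('This code needs Python 3.')
```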
Part I

Python Programming Language

Chapter 1

Python Basics

1.1 Installation (Ubuntu)

If you use Ubuntu or any other Linux distro, the base Python packages should have been
installed automatically along with the operating system. To install additional packages, you can
first search for packages in the terminal by running something like:

apt search 'python3-.*xlsx.*'

This will give you:

python3-xlsxwriter

Note that python3-xlsxwriter is a package for creating XLSX files. The search command
uses regular expressions. Just in case you don’t already know, . is a placeholder for any
single character; * means zero or more instances of whatever comes before it. Basically, .* means
’anything’. Note that packages named python-.+ are for an older version (Python 2).
Once you find the package you need, you can install it by running something like:

sudo apt install python3-xlsxwriter

If you don’t know the name of any particular package, you can of course use GUI-based
package managers such as Synaptic.
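
To get a feel for how the search pattern works, you can try the same pattern in Python's own re module. This is only a sketch: the list of package names is made up, and apt and Python do not share exactly the same regular-expression engine, although simple patterns like this one behave the same way in both.

```python
import re

# Same pattern as in the apt search example above.
pattern = re.compile(r'python3-.*xlsx.*')

# Made-up list of package names to test the pattern against.
names = ['python3-xlsxwriter', 'python-xlsxwriter', 'python3-lxml']

matches = [name for name in names if pattern.search(name)]
print(matches)
```

Only names that contain python3- followed (anywhere later) by xlsx are kept.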


1.2 Installation (Windows)

If you use Windows, you can download Python from . Usually you should get the most
recent version (aka “release”). When you run the installer:

1. Select Install launcher for all users and Add Python 3.9 to PATH. Then select Customize
Installation. (You may need “administrative” rights to have full control of your computer.)

2. On the Optional Features window, select all features and click on Next.

3. On the Advanced Options window, select Install for all users, then specify where to
install (the shorter the path, the better), and then click on Install.

4. The setup should start. Close it when it is finished.



Windows Configuration

You need to configure Python so that it properly handles UTF-8 encoding/decoding.
In a completely non-technical sense, if it is not configured, you may have problems
dealing with non-English files that contain characters beyond what you can type on
an English keyboard. To make sure Python always uses UTF-8, add a new “environment
variable” to your Windows system. To do this, go to:
Start
Settings
About (at bottom left)
Advanced system settings (on the right)
Advanced (tab)
Environment Variables (at the bottom)

1. Under System variables (the lower part), click on New... .

2. On the next window, enter PYTHONUTF8 as the variable name and 1 as the variable value,
then click OK to complete the configuration. (Note that there are multiple windows to close.)
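
To check whether the setting has taken effect, you can run a short script like the sketch below in a newly opened Command Prompt window (environment variables only apply to programs started after the change). It relies on sys.flags.utf8_mode, which is available in Python 3.7 and later.

```python
import locale
import sys

# 1 means UTF-8 mode is on; 0 means it is off.
print('UTF-8 mode:', sys.flags.utf8_mode)

# The encoding that open() uses when none is given explicitly.
print('Default file encoding:', locale.getpreferredencoding(False))
```

If UTF-8 mode is on, the second line should report UTF-8 regardless of the Windows language settings.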

Information
You do not need things like Anaconda, which is a package man-
agement system and will make things more complicated.

1.3 Installation (Mac)

Sorry, I don’t use Mac. You are on your own... But you can find help on many websites such as:

• https://docs.python.org/3.9/using/mac.html

1.4 Running Python (Ubuntu)

There are two basic ways to run Python: interactive and script.

Interactive Mode The interactive mode works like a terminal or command shell. You enter
a command and wait for result. It works very much like a calculator.
In Ubuntu, simply open a terminal and run python3 to enter the Python environment:

1 usr@com:~$ python3
2 Python 3.8.10 (default, Jun 2 2021, 10:49:15)
3 [GCC 9.4.0] on linux
4 Type "help", "copyright", "credits" or "license" for more information.
5 >>> 1+2
6 3
7 >>> quit()
8 usr@com:~$

Some details about the above example line by line:

1. usr is the name of the user and com is the name of the computer. ~$ is something
like the command prompt C:\> in Windows. python3 is the command to enter the
Python environment.
2. It shows the version of the Python program installed on the computer and the date/time
when that version was built (not when this example was run).
3. Additional information about the program and the type of operating system.
4. This is some information about some commands that you can run to get additional
information.
5. The >>> prompt indicates you are in the Python environment and it is ready to do
things for you. 1+2 simply asks Python to do a calculation.
6. The result 3 is shown and it will go back to the Python prompt.
7. You use quit() to go back to the terminal prompt.

Script In the script mode, all programs are saved in files, which can then be run by the
python3 command. Below is an example.

Assume a Python program file has been saved as calculate.py as shown below (.py is the suffix
for naming Python files).

Listing 1.1: calculate.py


1 x = 1 + 2
2 print('x=', x)

The two lines of code in this simple program work as follows:



1. A variable named x is defined and assigned the result of the formula 1+2.


2. The print() command tells the program to “print” a label x= and then the calculated
result to the terminal. (Note it does not mean sending the text to a real printer.)

You can then run the program in terminal (after you cd to the folder where the file is
located):

python3 calculate.py

The result will be displayed in the terminal:

x= 3
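
You may wonder why there is a space between x= and 3 in the output even though the label in the code has none. By default, print() puts one space between its inputs. The short sketch below shows this, together with the optional sep parameter, which the example program does not use.

```python
x = 1 + 2

# print() puts one space between its inputs by default.
print('x=', x)

# sep='' removes that space.
print('x=', x, sep='')
```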

1.5 Running Python (Windows)

There isn’t much difference between Ubuntu (Linux) and Windows.

Interactive Mode To use interactive mode in Windows, you can go to the Start menu:

• Open a Command Prompt window and run py -3 (see below for an example), or

• Run IDLE (Python 3.x) (it will start a “shell”), or

• Run Python 3.x

1 C:\> py -3
2 Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] on win32
3 Type "help", "copyright", "credits" or "license" for more information.
4 >>> 1+2
5 3
6 >>>

Some explanation about the above code:

1. py is the command to launch Python; -3 here refers to the version number. (You can
use py -2 to launch version 2. But this is not something you need to do.)
2. Some information about the Python system installed in your computer.
3. Some additional information about how to get more information.
4. >>> indicates that you are now in the Python environment. 1+2 is an example of how
to use Python as a calculator. Press Enter to continue.
5. The result 3 is displayed.
6. After the calculation, it goes back to the Python command prompt. To exit, run quit() .

Script To use the script mode, all programs must be saved in files, which can then be run
by the py -3 command. Everything else is the same as in Ubuntu. Just in case you skipped
the whole Ubuntu section (I’m sure you did if you don’t use Linux), below is the example
again.
Assume a Python program file has been saved as calculate.py as shown below (.py is the suffix
for naming Python files).

Listing 1.2: calculate.py


1 x = 1 + 2
2 print('x=', x)

If you want to test it, you can just use Notepad or any other text editor to create a file by
copying the code. The two lines of code in this simple program work as follows:

1. A variable named x is defined and assigned the result of the formula 1+2.


2. The print() command tells the program to “print” a label x= and then the calculated
result to the terminal. (Note it does not mean sending the text to a real printer.)

You can then run the program in terminal (after you cd to the folder where the file is
located):

1 C:\pytar> py -3 calculate.py
2 x= 3
3
4 C:\pytar>

Here in this example:

1. pytar is the folder where the code file is located. py -3 is the command to run
calculate.py .

2. The result 3 is displayed.


3. (An empty line)
4. After the run, it goes back to the command prompt.

Another option is to use any IDE software. Here is an example of how to use IDLE.

1. Go to Start → IDLE (Python 3.x) (it will start a “shell”).

2. Go to File → Open and find your file to open. (If you don’t have your code file ready,
you can click on New File to create one.)
3. On the window with your code file open, go to Run → Run Module.
4. The result will appear in the other window (the shell) that you opened first.

[Figure: two IDLE windows side by side. (1) The window on the left shows the code file.
(2) The window on the right shows the shell with results. The result of each run is
separated by a double line with the code file name.]

1.6 Tools for Writing Python Code

To write Python code, you only need a text editor; any editor will do. Here are a few examples:

Ubuntu gedit, mousepad


Windows Notepad

There are of course many specialized tools available. Such tools are often called “inte-
grated development environment” or “IDE”. They have some useful features such as high-
lighting keywords or reserved words and pairing parentheses. You can simply search the
Internet for “Python IDE” and you will find plenty. Here are a few examples:

Ubuntu eric, PyCharm


Windows IDLE (see the previous section for an example), Microsoft Visual Studio (Community
Edition) or Visual Studio Code

Information
We will use the script mode throughout the book. The interactive
mode is useless.
Chapter 2

Python File Operations

In the first chapter we’ve seen a very simple Python program:

Listing 2.1: calculate.py


1 x = 1 + 2
2 print('x=', x)

When you run this program, it will do some calculation and display the result on the
terminal (computer screen).

x is the scalar variable whose value is assigned by the formula.


x = 1 + 2 does the calculation and assigns the result to x . 1 and 2 are of course integers,
and = and + are reserved symbols for assignment and math operations (and others).

The display is handled by the command print() . The command is often referred to as
a function or method, which requires the pair of parentheses to handle input. Here in this
example, the print() function takes two inputs, also called parameters, which are separated
by a comma , :

'x=' is a string object (more about objects later). Anything enclosed by a pair of quotation
marks is taken as a string.
x is the scalar variable whose value has been assigned by the formula.
Note that quotation marks turn something into a different thing:
x is not the same as 'x' , and 1 is not the same as '1' .
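
The difference between a name or number and the same characters in quotation marks can be seen in a small sketch like the following. The type() function, which reports what kind of thing a value is, has not been introduced in the text and is used here only for illustration.

```python
x = 1 + 2

print(x)          # the value stored in the variable x
print('x')        # the one-letter string x

print(1 + 2)      # adding two integers
print('1' + '2')  # "adding" two strings joins them together

print(type(1), type('1'))
```

Here 1 + 2 gives the number 3, while '1' + '2' gives the string 12.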

While it is useful, displaying something on the terminal is often not what we need. In
most cases we need to save the result to a file in the hard drive. So this is what we will do
next.


2.1 Creating New File

Here is a “file” version of the above calculation:

Listing 2.2: calculate-to-file.py


1 x=1+2
2 #print('x=',x)
3 with open('calculate-result.txt','w') as fw:
4     fw.write('x=1+2=')
5     fw.write(str(x))

If you run this program, there will be no display on the terminal, but a file will be created.
The file has only one line:

Listing 2.3: calculate-result.txt


x=1+2=3

Here is some explanation about the Python program:

1. This is the same calculation we did earlier: x=1+2 .


2. We still have the line for displaying #print('x=',x) . But the difference here is that we
have an extra hashtag # at the beginning of the line. The hashtag is used for comment-
ing. So the whole line is just “comment”, not real code. Commenting is often used to
explain the purpose of the code. Here we use it to disable the code but put it there for
reference.
3. The with open('calculate-result.txt','w') as fw: line opens a file and keeps it open
ready for input. Additional information about this line:
• The function open() opens a file named calculate-result.txt (note that the quotation
marks are in the code but not in the actual file name). The flag 'w' tells it to open
the file in the write mode, i.e., the file can be changed. If there is an existing file
with the same name, the old file will be replaced.
So the open() function requires at least two parameters: the first is file name and
the second is mode, separated by a comma.
• The keyword with is used to “keep” the file open. The strange sign ␣ is used to
typeset a white space in this book so that you know there is a white space there;
it is not in the code.
• When the function open() opens a file, the system will create a file object in the
computer memory. The keyword as is used to assign a name to the object so
that we can refer to it later. fw here is the variable name to be assigned to the file
object.
• The colon : indicates what’s coming next is to be done until the file is closed
(more about “close” later).
4. The line fw.write('x=1+2=') tells the file object to “self-write” something. write() is
a method of the fw object. So the syntax is “object.method()” (the parentheses are
always required). The thing to write is a string, as indicated by the quotation marks.

• Notice that there are four white spaces at the beginning of the line. These four
white spaces are very important in Python. They serve as the indentation that (along
with the colon : ) Python uses to define program scope. So here it means the
fw.write('x=1+2=') operation is within the scope of the file opening operation specified by
with open('calculate-result.txt','w') as fw: .

• The write() method works like a typewriter (technically “cursor”): it does not
do anything other than what is explicitly stated. So the code fw.write('x=1+2=') will
write x=1+2= in the line where the cursor currently is. The cursor then stops at
the end of x=1+2= . Just like typewriting, it will stay at the end of the line unless
you explicitly hard-press the carriage return.
5. As indicated by the four white spaces, the line fw.write(str(x)) is aligned with the
previous line and also within the scope of the file open operation. So the system will
continue to write something at the current cursor position (i.e., at the end of x=1+2= ).
Here because x is a number, we have to convert it into a string by using the str()
function. Like print() and open() , str() is a system function and does not need an
object (as in “object.method()”).

There is nothing else left in the code file. So the scope of the file open operation ends.
This is when Python will automatically close the file. This is important because if you
open a file and do not close it, the file may be damaged or you may waste the computer
memory and slow it down.
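
Two of the points above can be checked directly: write() only accepts a string, and the file is closed automatically once the scope ends. The sketch below uses a made-up file name (closed-demo.txt), the try/except statement, and the closed attribute of the file object, none of which have been introduced in the text.

```python
x = 1 + 2

# Hypothetical file name used only for this demonstration.
with open('closed-demo.txt', 'w') as fw:
    try:
        fw.write(x)            # x is an integer, not a string
    except TypeError as err:
        print('write() failed:', err)
    fw.write(str(x))           # converting to a string works
    print('Inside the scope, closed?', fw.closed)

print('After the scope, closed?', fw.closed)
```

Inside the scope the file is still open; after the scope Python has closed it for you.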

If you want to put the result =3 in a new line, then you have to specifically write a new
line, which is '\n' in Python. The backslash (\) is an escape character. In '\n' it “escapes”
the character 'n' , so it is not a literal n anymore; it is a new line character (which is not
printable or visible).
The code will be like this:

Listing 2.4: calculate-to-file-2.py


1 x=1+2
2 #print('x=',x)
3 with open('calculate-result-2.txt', 'w') as fw:
4     fw.write('x=1+2\n=')
5     fw.write(str(x))

If you run the code, a file named calculate-result-2.txt will be created as:

Listing 2.5: calculate-result-2.txt


x=1+2
=3
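
To see that '\n' really is a single (invisible) character, you can try a sketch like the following. The functions len() and repr() are not covered in the text; they are used here only to make the new line character visible.

```python
text = 'x=1+2\n='

print(len('\n'))     # the backslash and the n together count as one character
print(repr(text))    # repr() shows the escape instead of acting on it
print(text + '3')    # print() acts on it and breaks the line
```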

2.2 Reading Existing File

Now you know how to create a new file and write something in it. Other than creating new
files, in most cases we need to open/read existing files. We are of course not talking about
how to double-click to open.
Let’s look at an example:

Listing 2.6: open-display-file.py


1 with open('masteridx.txt','r') as fr:
2     Text = fr.read()
3     print(Text)

Here the with open('masteridx.txt','r') as fr: line is similar to what we’ve seen earlier.
The only difference is that we are now opening the file for reading, not writing, so the flag
is 'r' . The method read() returns the content as a string, which is then assigned to Text .
Assuming the file masteridx.txt exists in the current folder, the program will display
the content in the terminal:

Listing 2.7: masteridx.txt


Description: Master Index of EDGAR Dissemination Feed
Last Data Received: March 31, 2020
Comments: webmaster@sec.gov
Anonymous FTP: ftp://ftp.sec.gov/edgar/
Cloud HTTP: https://www.sec.gov/Archives/




CIK|Company Name|Form Type|Date Filed|Filename
-----------------------------------------------
1000229|CORE LABORATORIES N V|10-K|2020-02-10|edgar/data/1000229/0001564590-20-004075.txt
1000230|OPTICAL CABLE CORP|10-K|2020-01-27|edgar/data/1000230/0001437749-20-001224.txt

You may have seen such text before. It is a snippet of a master index file downloaded
from SEC’s EDGAR platform. Note that it is not a good idea to use this method if the file is
too big.
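
For big files, one common alternative is to read one line at a time instead of loading everything with read(). The following is a minimal sketch, not necessarily the approach used later in the book. It first creates a tiny stand-in file (masteridx-demo.txt, a made-up name) so that it runs on its own, and it uses a for loop, which is explained in the next chapter.

```python
# Create a tiny stand-in for the master index file (hypothetical file name),
# so that this example runs on its own.
sample = (
    'CIK|Company Name|Form Type|Date Filed|Filename\n'
    '1000229|CORE LABORATORIES N V|10-K|2020-02-10|'
    'edgar/data/1000229/0001564590-20-004075.txt\n'
    '1000230|OPTICAL CABLE CORP|10-K|2020-01-27|'
    'edgar/data/1000230/0001437749-20-001224.txt\n'
)
with open('masteridx-demo.txt', 'w') as fw:
    fw.write(sample)

# Read one line at a time instead of loading the whole file with read().
n_lines = 0
with open('masteridx-demo.txt', 'r') as fr:
    for line in fr:
        n_lines = n_lines + 1
        print(n_lines, line.rstrip('\n'))
```

Only one line is held in memory at any moment, so the size of the file matters much less.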

2.3 Copying File

So far we’ve looked at how to create a new file and how to read an existing one. But how do
we copy the content of the existing file into a new one? It can be done by copying the file
object at the OS (operating system) level. But we will use what we’ve just learned: read the
content of the existing file and then write it to a new one. Here is the code:

Listing 2.8: open-copy-file.py


1 with open('masteridx.txt','r') as fr:
2     Text = fr.read()
3     with open('masteridx-copy.txt', 'w') as fw:
4         fw.write(Text)
5         fw.write('\n\n')
6         fw.write(Text)

Some explanation about the above code:

1. The first line with open('masteridx.txt','r') as fr: opens the source file in the
read mode ( 'r' ) and assigns it to a file object fr .
2. The second line Text = fr.read() takes the content of the file object as a string and assigns
it to Text . Notice the indentation here (four white spaces), which means that anything
indented below is within the scope of the file-reading operation.
3. The third line with open('masteridx-copy.txt', 'w') as fw: opens a new file in the
write mode ( 'w' ) and assigns it to another file object fw , which is ready to be written
with input. Notice the indentation (one indentation with four white spaces), which is
the same as the Text = fr.read() line. So this file-writing operation is also within the
scope of the file-reading operation. It means that the source file stays open until
the file-writing operation is completed.
4. The line fw.write(Text) writes the content of Text into the new file. Notice the line
is indented twice (eight white spaces). So it is within the file-writing scope.
5. The line fw.write('\n\n') writes two blank lines.
6. The line fw.write(Text) writes the content of Text into the new file...again. So it
means the new file has repeated content copied from the source file. See below for an
example.

Listing 2.9: masteridx-copy.txt


Description: Master Index of EDGAR Dissemination Feed
Last Data Received: March 31, 2020
Comments: webmaster@sec.gov
Anonymous FTP: ftp://ftp.sec.gov/edgar/
Cloud HTTP: https://www.sec.gov/Archives/




CIK|Company Name|Form Type|Date Filed|Filename
----------------------------------------------
1000229|CORE LABORATORIES N V|10-K|2020-02-10|edgar/data/1000229/0001564590-20-004075.txt
1000230|OPTICAL CABLE CORP|10-K|2020-01-27|edgar/data/1000230/0001437749-20-001224.txt


Description: Master Index of EDGAR Dissemination Feed
Last Data Received: March 31, 2020
Comments: webmaster@sec.gov
Anonymous FTP: ftp://ftp.sec.gov/edgar/
Cloud HTTP: https://www.sec.gov/Archives/




CIK|Company Name|Form Type|Date Filed|Filename
----------------------------------------------
1000229|CORE LABORATORIES N V|10-K|2020-02-10|edgar/data/1000229/0001564590-20-004075.txt
1000230|OPTICAL CABLE CORP|10-K|2020-01-27|edgar/data/1000230/0001437749-20-001224.txt

Information
We will look at other file operations in later chapters.
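
For completeness, copying a file at the operating-system level, which this section mentions but does not use, can be done with the standard shutil module. This is a minimal sketch with made-up file names; it first creates a small source file so that it runs on its own.

```python
import shutil

# Create a small source file so that the example runs on its own
# (hypothetical file names).
with open('copy-source.txt', 'w') as fw:
    fw.write('some text\n')

# Copy the file in one step; the content is not changed.
shutil.copyfile('copy-source.txt', 'copy-target.txt')

with open('copy-target.txt', 'r') as fr:
    print(fr.read())
```

Unlike the read-then-write approach, this gives you no chance to change the content along the way, which is why the read-then-write pattern is the more useful one for textual analysis.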
Chapter 3

Data Structure: List

Consider the following table:

× |  1   2   3   4   5   6   7   8   9
--+-----------------------------------
1 |  1   2   3   4   5   6   7   8   9
2 |  2   4   6   8  10  12  14  16  18
3 |  3   6   9  12  15  18  21  24  27
4 |  4   8  12  16  20  24  28  32  36
5 |  5  10  15  20  25  30  35  40  45
6 |  6  12  18  24  30  36  42  48  54
7 |  7  14  21  28  35  42  49  56  63
8 |  8  16  24  32  40  48  56  64  72
9 |  9  18  27  36  45  54  63  72  81

I’m sure you know what it is...a 9x9 multiplication table kids use to help them learn
math. Of course we are not here to learn multiplication! Instead, we want to learn how to
use Python to build such a simple table.
We will look at a couple of simple solutions first. But just in case you think they are too
simple, later we will look at a somewhat more complex way, something like the following:
   
1 1 2 ··· 9
   
   
2 [ ] 2 ··· 18
   4 
 × 1 ··· 9 =  (3.1)
. 2 . .. .. 
 ..   .. ..
.
   . . 
   
   
9 9 18 ··· 81
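
As a preview, the idea behind equation (3.1) can be sketched with two loops that build the table row by row as a list of lists. This is only one possible way to do it, written with things (lists, for loops, and the append() method) that are explained later in this part of the book; the book's own solution may differ.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

table = []                 # will become a list of rows
for y in numbers:
    row = []               # one row of the table
    for x in numbers:
        row.append(y * x)
    table.append(row)

for row in table:
    print(row)
```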


3.1 Defining and Using List

How do we define things when there are many of them? In Python (and other programming
languages) we can use the data type called list , which of course means a “list” of things
(anything!). Below are some examples:

lstx = [1,2,3,4,5,6,7,8,9]
lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lstCik = ['51143', '117134', '78901']
lstFy = ['2021']
lstFyn = [2021, 2022]
lstFirm = ['IBM', 'Starbucks', 'TikTok']

Here we have two lists of integers, lstx and lsty . Note that the white spaces don’t
matter, so the two lists are identical. A list is defined by enclosing things within square
brackets [ ] . In this book the things (“elements”) in a list are of the same type. (Python
itself does not require this; a list can mix types. There is also a related data type called
tuple , which we will leave for another day.)
Other lists are very straightforward. We have a list of CIKs, fiscal years, and firms. Note
the two fiscal year lists:

• lstFy has only one element but you still need the square brackets. Here the year
is a string (so you can’t use it to do math calculation like '2021' - '2020' ).
• lstFyn takes fiscal years as numbers.
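
The practical difference between the two fiscal-year lists can be seen in a short sketch. It uses index notation such as lstFyn[0] (the first element of the list) and the int() function, which converts a string of digits into a number; neither has been introduced in the text yet.

```python
lstFy = ['2021']          # fiscal year stored as a string
lstFyn = [2021, 2022]     # fiscal years stored as numbers

# Numbers can be used in calculations directly.
print(lstFyn[1] - lstFyn[0])

# A string has to be converted with int() first.
print(int(lstFy[0]) - 2020)
```

Both lines print 1; without int(), the second calculation would fail with an error.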

You can “print” a list to display its content. For example:

Listing 3.1: list-one.py


1 lstx = [1,2,3,4,5,6,7,8,9]
2 print(lstx)

will give you:

[1, 2, 3, 4, 5, 6, 7, 8, 9]

A typical use case of lists is when we have to repeat the same operation for each individual
element one by one. So instead of printing the list as a whole, we can print the elements one
by one and do other things if we want:

Listing 3.2: list-one.py


1  lstx = [1,2,3,4,5,6,7,8,9]
2  print(lstx)
3  for x in lstx:
4      print('x= ', x)

Running the above code will give you:


[1, 2, 3, 4, 5, 6, 7, 8, 9]
x=  1
x=  2
x=  3
x=  4
x=  5
x=  6
x=  7
x=  8
x=  9

Let’s go through the above code to see how it works:

1. The first line defines the list.
2. The second line displays the list (as a whole) in the list notation with square brackets.
3. The third line for x in lstx: is a looping statement. Like the with open()...as...:
line we’ve previously seen, it ends with a colon and controls the indented lines below
it. Its meaning is very straightforward: for each element (which is given a variable
name x ) in the list lstx , the system is going to do what is stated after the colon : .
4. The scope of what will be done is determined by indentation. So here print('x= ', x)
will be done for each element x . Notice the space in 'x= ' . Also note that the function
print() automatically starts a new line. That is why all the 'x= ...' results are on different lines.
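The points about indentation and the automatic new line can be tried out with a small sketch. The list lengths and the loop variable firm are made up for this illustration; end is a standard optional input of print() that replaces the default newline.

```python
lstFirm = ['IBM', 'Starbucks', 'TikTok']

# Each pass through the loop runs the indented lines once per element.
lengths = []
for firm in lstFirm:
    print('firm=', firm)          # print() ends with a newline by default
    lengths.append(len(firm))     # "other things" done for each element

# end=' ' replaces the default newline, so everything stays on one line.
for firm in lstFirm:
    print(firm, end=' ')
print()                           # finish the line
```

Both indented lines in the first loop belong to the loop, so each firm is printed and measured before the loop moves on to the next one.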

Just like how we read an existing file and write a new one at the same time, we can of
course handle two lists at the same time as well. Here is an example:

Listing 3.3: list-two.py


1  lstx = [1,2,3,4,5,6,7,8,9]
2  lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
3  print('y', '*', 'x', '=', 'y*x')
4  for y in lsty:
5      for x in lstx:
6          print(y, '*', x, '=', y*x)

Let’s look at the code in more detail, line by line:

1. The first line defines a list lstx , nothing special here.
2. The second line defines lsty .
3. This line prints a label to make sure we know what we are getting. (This line
mimics the other print line below. But if you want, you can put everything together by
removing all commas and extra quotation marks.)
4. This line loops through lsty . So for each element y , everything that follows the :
will be done.
5. This line loops through lstx . So for each element x , everything that follows the :
will be done. Because it is indented, it falls within the scope of the lsty loop. The
two for loops together mean: for each y , handle all x one by one.
6. This line is further indented and falls within the scope of the lstx loop. It displays in
one line: y , multiplication sign, x , equal sign, and then the product of y and x .

The above code will give the following result (partial):

y * x = y*x
1 * 1 = 1
1 * 2 = 2
1 * 3 = 3
...
9 * 7 = 63
9 * 8 = 72
9 * 9 = 81

Now we are very close, but this is not exactly what we wanted: a 9x9 Multiplication Table.
We want a table like a matrix, not a long list of individual calculations. So we have to
modify the code a little bit. This time we will write the result (hopefully it’s a table) to a
file.

Listing 3.4: 9x9-math-table.py


1  lstx = [1,2,3,4,5,6,7,8,9]
2  lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
3  with open('9x9-math-table.txt', 'w') as Fw:
4      for y in lsty:
5          for x in lstx:
6              Fw.write(str(y*x))
7          Fw.write('\n')

Let’s go through the above code line by line:

1. The first line defines a list lstx , nothing special here.
2. The second line defines lsty .
3. The with open('9x9-math-table.txt', 'w') as Fw: line opens a new file in the write
mode.
4. The for y in lsty: line loops through the list. So for each y in lsty it will repeat what
follows after the colon : . Notice the indentation: the loop sits inside the with block,
so all the looping is done while the file is open.
5. The next line, for x in lstx: , loops through the list lstx . So for each x , it will repeat
what follows the colon : .
6. The Fw.write(str(y*x)) line calculates y*x by using the current values of y and x ,
converts the result into a string, then writes the string to the file. Notice that the line is
indented and falls within the scope of the looping of lstx .
7. The Fw.write('\n') line writes a newline symbol at the end of the line (in the output file) so
that additional text will start a new line. Notice how the line is indented: it is one
indentation after for y in lsty: , so the symbol is written once for each y after looping
through all x .

The above code will give the following output:

Listing 3.5: 9x9-math-table.txt


123456789
24681012141618
369121518212427
4812162024283236
51015202530354045
61218243036424854
71421283542495663
81624324048566472
91827364554637281

The result looks terrible...it’s not even close to a table! So we have to format it a little bit
by aligning numbers properly. Here is one solution:

Listing 3.6: 9x9-math-table-2.py


1  lstx = [1,2,3,4,5,6,7,8,9]
2  lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
3  with open('9x9-math-table-2.txt', 'w') as Fw:
4      for y in lsty:
5          for x in lstx:
6              Fw.write(str(y*x).rjust(5, ' '))
7          Fw.write('\n')

Here the code is almost identical to the previous one. The only difference is how the result
of y*x is written. We use the rjust() method of string to pad the result with white space.
The syntax is string.rjust(number, character) :

• string is any string you want to format. A string is an object in the computer
programming sense and has methods (functions or behaviors that belong to it).
• rjust() is the method to align the string to the right within the allowed width. (To
align to the left, use ljust() ; to center it, use center() .)
• rjust() takes two inputs separated by a comma. number is the total width in
characters. If the string is shorter than number , the extra positions
will be filled with the character specified in the second input character .

So in the above code, str(y*x).rjust(5, ' ') will pad the result with white spaces to
make everything five characters long. For example, '91' will become '   91' .
The revised code will result in something like:

Listing 3.7: 9x9-math-table-2.txt


    1    2    3    4    5    6    7    8    9
    2    4    6    8   10   12   14   16   18
    3    6    9   12   15   18   21   24   27
    4    8   12   16   20   24   28   32   36
    5   10   15   20   25   30   35   40   45
    6   12   18   24   30   36   42   48   54
    7   14   21   28   35   42   49   56   63
    8   16   24   32   40   48   56   64   72
    9   18   27   36   45   54   63   72   81
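The three padding methods mentioned earlier ( rjust() , ljust() , and center() ) can also be tried on their own. This is a small sketch; the variable names are made up for the illustration, and repr() is used only so that the padding is visible when printed.

```python
s = str(9 * 9)                           # '81'
right = s.rjust(5, ' ')                  # pad on the left
left = s.ljust(5, '.')                   # pad on the right with dots
middle = s.center(6, '*')                # pad on both sides
unchanged = 'accounting'.rjust(5, ' ')   # already longer than 5: returned as is
print(repr(right), repr(left), repr(middle), repr(unchanged))
```

Note the last case: padding never cuts a string, so a value that is wider than the column will push the rest of the row to the right.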

Now we have the essence of the 9x9 Multiplication Table. But it is still not quite good
enough. For example, there are no column and row headings. So it is hard to tell what it
is. Even if you can figure out it’s a multiplication table, it is somewhat incomplete, because
there is no indication of something like 1 * x = x . So let’s do some plastic surgery by adding
row/column headings. Below is how to do it:

Listing 3.8: 9x9-math-table-3.py


 1  lstx = [1,2,3,4,5,6,7,8,9]
 2  lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
 3  with open('9x9-math-table-3.txt', 'w') as Fw:
 4      Fw.write(' *')
 5      for x in lstx:
 6          Fw.write(str(x).rjust(4, ' '))
 7      Fw.write('\n')
 8      Fw.write('-'*40)
 9      Fw.write('\n')
10      for y in lsty:
11          Fw.write(str(y))
12          Fw.write(' |')
13          for x in lstx:
14              Fw.write(str(y*x).rjust(4, ' '))
15          Fw.write('\n')

The above code will generate the following result:

Listing 3.9: 9x9-math-table-3.txt


 *   1   2   3   4   5   6   7   8   9
----------------------------------------
1 |   1   2   3   4   5   6   7   8   9
2 |   2   4   6   8  10  12  14  16  18
3 |   3   6   9  12  15  18  21  24  27
4 |   4   8  12  16  20  24  28  32  36
5 |   5  10  15  20  25  30  35  40  45
6 |   6  12  18  24  30  36  42  48  54
7 |   7  14  21  28  35  42  49  56  63
8 |   8  16  24  32  40  48  56  64  72
9 |   9  18  27  36  45  54  63  72  81

Now let’s go through the code in more detail. It’s somewhat repetitive after the previous
one, but just to make it more complete:

1. The first line defines a list lstx , nothing special here.
2. The second line defines lsty .
3. The with open('9x9-math-table-3.txt', 'w') as Fw: line opens a new file in the write
mode.
4. Fw.write(' *') writes a marker at the top-left corner of the table. Notice the
indentation.
5. for x in lstx: loops through each element (designated as x ) in the list lstx . Notice
it’s aligned with the previous line with one indentation.
6. Fw.write(str(x).rjust(4, ' ')) writes the value of x , with proper formatting.
Together with the previous line, the whole list lstx will be written in one line.
7. Fw.write('\n') adds a newline mark at the end of the line.
8. Fw.write('-'*40) writes a whole bunch of ‘-’ to make a horizontal line. Note that the
asterisk * means multiplication in the math mode. Here it has been re-assigned to mean
“repeat”. So the hyphen will be repeated 40 times. (In computer programming it’s
called “operator overloading”.)
9. Fw.write('\n') adds a newline mark at the end of the horizontal line.
10. The for y in lsty: line loops through the list. So for each y in lsty it will repeat what
follows after the colon : . Notice the indentation: the loop sits inside the with block,
so all the looping is done while the file is open.
11. Fw.write(str(y)) writes the value of y as a string. Notice the indentation.
It is within the scope of the looping of lsty .
12. Fw.write(' |') writes a vertical bar to be used as a vertical line.
13. The next line, for x in lstx: , loops through the list lstx . So for each x , it will repeat
what follows the colon : . Notice the indentation. It’s within the scope of the looping
of lsty .
14. The Fw.write(str(y*x).rjust(4, ' ')) line calculates y*x by using the current values of
y and x , converts the result into a string, pads it to four characters, then writes the string
to the file. Notice that the line is indented and falls within the scope of the looping of lstx .
15. The Fw.write('\n') line writes a newline symbol at the end of the line (in the output file) so
that additional text will start a new line. Notice how the line is indented: it is one
indentation after for y in lsty: , so the symbol is written once for each y after looping
through all x .
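The “operator overloading” remark about the asterisk can be seen in a tiny sketch: the same * does different things depending on what sits on its left. The variable names are made up for this illustration.

```python
rule = '-' * 40      # string * integer repeats the string
zeros = [0] * 3      # list * integer repeats the list
product = 4 * 10     # integer * integer is ordinary multiplication
print(len(rule), zeros, product)
```

In all three cases Python decides what * means by looking at the types of the values involved.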

3.2 List of Lists

Another way to build the 9x9 Multiplication Table is to use a list of lists (or array ). We can
simply (well, not that simply) use arrays to do matrix multiplication. Here is the array
version of building the table:

Listing 3.10: 9x9-math-table-matrix.py


1  import numpy as np
2  aryx = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])
3  aryy = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9]])
4  mtx99 = np.matmul(aryy, aryx)
5  with open('9x9-math-table-matrix.txt', 'w') as Fw:
6      for row in mtx99:
7          for n in row:
8              Fw.write(str(n).rjust(4, ' '))
9          Fw.write('\n')

Here is how it works:

1. The first line import numpy as np imports a package called numpy , which specializes
in matrix (sort of) manipulation and calculation. We assign a short name by using the
keyword as . So from this point on, we can refer to numpy as np . If we want to use any
functions from the numpy package, we have to use np as a qualifier, in the format of
np.something() .
2. The second line aryx = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]]) defines an array by
using the function called np.array() . Don’t forget the parentheses. The input is a list of
lists, so there are two pairs of square brackets. To be more specific, aryx can be seen as
a list of one list of nine elements. In numpy it’s called a “2-D” array. You can think
of aryx as a 1x9 matrix.
3. The next line aryy = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9]]) also defines an
array. In this case, the input is a list of nine lists, each of which has a single element.
So it is basically a 9x1 matrix.
4. The next line mtx99 = np.matmul(aryy, aryx) does the matrix multiplication by using
a function called np.matmul() and assigns the result to a variable. There are two inputs
( aryy and aryx ) separated by a comma. Note that the sequence is important, as
required by matrix multiplication. If you reverse the sequence, the result is of course
different. Note that the result, mtx99 , is also an array (you can think of it as a list of lists).
5. The line with open('9x9-math-table-matrix.txt', 'w') as Fw: opens a file in the write
mode. Nothing new here.
6. The next indented line for row in mtx99: loops through the array mtx99 . Because the
array works like a list of lists, row here works like a list. For each row we will do something.
7. The next line for n in row: loops through row . Now n is a number.
8. The next line Fw.write(str(n).rjust(4, ' ')) writes the number n in a proper format.
9. Fw.write('\n') puts a newline symbol at the end of each row. Notice the indentation!

If you run the code, you will get the following result:

Listing 3.11: 9x9-math-table-matrix.txt


   1   2   3   4   5   6   7   8   9
   2   4   6   8  10  12  14  16  18
   3   6   9  12  15  18  21  24  27
   4   8  12  16  20  24  28  32  36
   5  10  15  20  25  30  35  40  45
   6  12  18  24  30  36  42  48  54
   7  14  21  28  35  42  49  56  63
   8  16  24  32  40  48  56  64  72
   9  18  27  36  45  54  63  72  81

You may have noticed that there are no row and column headings here. Adding column
headings is relatively easy but adding row headings is not. So we can accept this imperfect
table as the solution. Let’s stop right here...matrix stuff is so mind-boggling! Feel like I’m in
The Matrix already...
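For comparison, a list of lists can also be built in plain Python, without numpy . The sketch below is not the array method shown above; it is an alternative under the assumption that ordinary lists are good enough for a table this small. The names table , lines , and cells are made up, and zip() is a built-in function that pairs up the elements of two lists.

```python
lstx = [1,2,3,4,5,6,7,8,9]
lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# One inner list per y; each inner list holds y*x for every x.
table = [[y * x for x in lstx] for y in lsty]

lines = []
for y, row in zip(lsty, table):           # pair each y with its row
    cells = ''
    for n in row:
        cells += str(n).rjust(4, ' ')
    lines.append(str(y) + ' |' + cells)   # row heading, then the row itself

print(lines[0])
print(lines[8])
```

Because each row is paired with its y , the row heading comes along for free in this version.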
Chapter 4

Data Structure: String

Consider the following table...WHAT? AGAIN?!:

       1   2   3   4   5   6   7   8   9

1      1   2   3   4   5   6   7   8   9
2      2   4   6   8  10  12  14  16  18
3      3   6   9  12  15  18  21  24  27
4      4   8  12  16  20  24  28  32  36
5      5  10  15  20  25  30  35  40  45
6      6  12  18  24  30  36  42  48  54
7      7  14  21  28  35  42  49  56  63
8      8  16  24  32  40  48  56  64  72
9      9  18  27  36  45  54  63  72  81

Well, yes. But rest assured...we’re not going to repeat the thing all over again. Instead,
we will use different ways to build the table that you’re already familiar with.
In the previous chapter we have successfully built the 9x9 Multiplication Table (think
about the idea of effectiveness I mentioned at the beginning of the book). But the solutions
have an efficiency problem. Although I said that effectiveness is more important than effi-
ciency, we still have to improve efficiency whenever we can.
The efficiency problem is that the programs write to the hard drive more than 99 times,
which is way too many. You may not have noticed anything because the programs are very small.
But when you try to download and process one year of SEC 10-K reports (20,000 10-K reports
per year, each of which may have 500 pages), such programs will make you wait forever.
So let’s try to improve the programs by limiting the number of write operations. First
thing first, let’s take a look at how Python (and other languages) does the writing (Figure
4.1).
As mentioned in the previous chapter, Python can only write strings to the file. This
is why previously we had to convert numbers into strings first. Imagine a file as a piece
of paper. When a file is opened with with open() as... and written to, the system works
very much like a typewriter: starting from the top-left corner, moving from left to right,
then, if a carriage return is pressed, starting a new line. This process repeats until there
is nothing else to write.

Figure 4.1: How Python Writes File
We know that Python will start a new line if there is a '\n' signal. In the previous chapter,
the solution we had writes more than 99 strings (converted from numbers) to the file.
But everything together is just one long string, from the beginning to the end. So instead of
writing more than 99 strings one at a time, we can build the long string first and write it all at
once. In this way we can drastically reduce the number of disk writing operations. So let’s
look at strings in more detail.
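To get a feel for the numbers, here is a rough sketch that counts how many write() calls the plain table (the one without headings) would make, while building the same content as one long string. The counter many_writes and the file name table.txt are made up for this illustration, and the file goes to a temporary folder so nothing important is overwritten.

```python
import os
import tempfile

lstx = [1,2,3,4,5,6,7,8,9]
lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Count the write() calls of the one-number-at-a-time approach.
many_writes = 0
Text = ''
for y in lsty:
    for x in lstx:
        many_writes += 1          # one Fw.write() per number
        Text += str(y * x).rjust(4, ' ')
    many_writes += 1              # one Fw.write('\n') per row
    Text += '\n'

# Write the finished string with a single call (to a temporary folder).
path = os.path.join(tempfile.mkdtemp(), 'table.txt')
with open(path, 'w') as Fw:
    Fw.write(Text)

print(many_writes, 'write calls replaced by 1')
```

The plain table needs 90 calls; the version with headings needs more. Note that Python normally buffers file output, so a write() call does not necessarily reach the disk by itself. The count is about calls, not about physical disk operations, but fewer calls still means less work.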

4.1 Using String to Build Table

String Basics

A string or str is just a sequence of characters such as letters, digits, punctuation
marks, etc. What is commonly referred to as “textual data” is basically strings. So when we
do textual analysis, we actually analyze “strings” instead.
As we have already seen in the previous chapter, strings are defined by enclosing
something with quotation marks (either single or double). For example:

Listing 4.1: string-example.py


1  a = 'accounting research'
2  b = "It's often said that 'everyday is a new day'"
3  c = '\u4e5d\u4e5d\u516b\u5341\u4e00'
4  print(b, c)

Some explanation about the above code:

1. A string is specified by single or double quotation marks.
2. You can embed single quotation marks within double quotation marks (or the other
way around).
3. The variable c is assigned a Unicode string, in which each character is designated
by \u followed by the corresponding Unicode number (“code point”) in hexadecimal. The
string is literally “nine nine eighty-one” in Chinese, meaning “9 times 9 equals 81”.

The above code will give something like:

It's often said that 'everyday is a new day' 九九八十一

Information
If your result is not the same, it means your Python installation
does not support Unicode. Check the Python installation guide
to make UTF-8 the default encoding. If not, you may encounter
problems with some 10-K reports.
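The Unicode string from the example can be inspected with a short sketch. ord() is the built-in function that returns a character's code point and encode() is a standard string method; the variable names are made up for this illustration.

```python
c = '\u4e5d\u4e5d\u516b\u5341\u4e00'

same = (c == '九九八十一')                # escapes and literal characters are equal
code_points = [hex(ord(ch)) for ch in c]  # the numbers used after \u
n_chars = len(c)                          # number of characters
n_bytes = len(c.encode('utf-8'))          # number of bytes when stored as UTF-8
print(same, code_points, n_chars, n_bytes)
```

The string has 5 characters but takes 15 bytes in UTF-8, which is why the encoding matters when files are read or written. When in doubt, the encoding can be stated explicitly, as in open(filename, 'w', encoding='utf-8') .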

String Operations

We’ve already seen a few string operations in the previous chapter:

• '-'*40 — makes a string with 40 hyphens. The operator * here means repeat rather
than multiplication (if you recall, this is referred to as “operator overloading”). 'accounting'*2
will give accountingaccounting
• '81'.rjust(5, ' ') — it pads the number string with white spaces to the left and
makes it five characters long and right-aligned. (There is also a left-align version:
ljust() .)

Another often-used string operation is concatenation, i.e., putting two or more strings
together to make one. To do this, we can use the overloaded operator + . For example,
'a' + 'b' will give you 'ab' . (Note that if there are too many strings to concatenate, the
+ operator is not efficient, hence not a good choice. It’s better to use the join() method.
We will save it for another day.)
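As a small preview only, here is roughly what the join() method looks like next to the + operator. This is a sketch, not the full treatment; the names parts , joined , plus , and csv_like are made up for the illustration.

```python
parts = []
for x in [1, 2, 3]:
    parts.append(str(x).rjust(4, ' '))   # collect the small strings in a list

joined = ''.join(parts)                  # glue them with nothing in between
plus = parts[0] + parts[1] + parts[2]    # same result with the + operator
csv_like = ','.join(['51143', '117134', '78901'])   # glue with commas instead
print(repr(joined), joined == plus, csv_like)
```

The string in front of .join() is the “glue” placed between the elements, so an empty string simply sticks them together.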
Some additional string methods that may be useful for textual analysis:

• capitalize() : capitalizes the first character of the string and makes the rest
lowercase. E.g., 'accounting research'.capitalize() will give Accounting research .
• title() : puts the string in title case, i.e., the first character of every word in
the string is capitalized and the rest is lowercase. E.g., 'accounting research'.title()
will give Accounting Research .
• upper() : capitalizes everything in the string. So 'accounting research'.upper()
will give ACCOUNTING RESEARCH .
• lower() : makes everything lowercase.

Rebuilding the 9x9 Multiplication Table

Now we have enough tools to rebuild the 9x9 Multiplication Table. We can simply make a
copy of the previous solution and revise it. Here is an example:

Listing 4.2: 9x9-math-table-4.py


 1  lstx = [1,2,3,4,5,6,7,8,9]
 2  lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
 3  Text = "do no research.".upper() + ' Do math!\n\n'
 4  Text += ' *'
 5  for x in lstx:
 6      Text += str(x).rjust(4, ' ')
 7  Text += '\n'
 8  Text += '-'*40
 9  Text += '\n'
10  for y in lsty:
11      Text += str(y)
12      Text += ' |'
13      for x in lstx:
14          Text += str(y*x).rjust(4, ' ')
15      Text += '\n'
16  with open('9x9-math-table-4.txt', 'w') as Fw:
17      Fw.write(Text)

Here is some explanation about the code:

1. The first line defines a list of digits 1-9.
2. Another list of digits.
3. In this line we create a new string variable Text . Its initial value is the concatenation
of two strings: the upper-case version of 'do no research.' and ' Do math!\n\n' .
Notice the white space and the two newline symbols. The first \n ends the line and the
second essentially makes an empty line (as shown below in Line 2 of the output).
4. The line Text += ' *' will append the string ' *' to Text . += means adding
something to self (roughly speaking). It’s equivalent to Text = Text + ' *' (although they
are technically different in some ways, particularly for mutable objects such as lists,
which += changes in place).
5. The line for x in lstx: loops through the list lstx .
6. The next indented line, Text += str(x).rjust(4, ' ') , formats and adds each x to
Text . So this essentially builds the column heading.
7. Text += '\n' appends a newline character to Text ; so it ends the line for the column
heading.
8. Text += '-'*40 appends 40 hyphens to make a horizontal line.
9. Text += '\n' appends a newline character again to end the horizontal line.
10. Now we need to loop through the other list lsty and do the following indented lines
for each element y .
11. Text += str(y) converts y into a string by using the str() system function and then
appends it to Text .
12. Text += ' |' appends a vertical bar to make the vertical line. Notice the white space.
13. for x in lstx: loops through the list lstx . Note that this loop is within the scope of
the lsty loop. So the whole of lstx will be looped through once for each y .
14. The line Text += str(y*x).rjust(4, ' ') formats and appends the product to
Text .
15. Text += '\n' appends a newline character. Notice the indentation here. It means there
will be a newline for each y .
16. Finally, we can now open a new file to save.
17. Fw.write(Text) writes the whole, long string Text to the file only once.

Running the above code will give us the same table:

Listing 4.3: 9x9-math-table-4.txt


 1  DO NO RESEARCH. Do math!
 2
 3   *   1   2   3   4   5   6   7   8   9
 4  ----------------------------------------
 5  1 |   1   2   3   4   5   6   7   8   9
 6  2 |   2   4   6   8  10  12  14  16  18
 7  3 |   3   6   9  12  15  18  21  24  27
 8  4 |   4   8  12  16  20  24  28  32  36
 9  5 |   5  10  15  20  25  30  35  40  45
10  6 |   6  12  18  24  30  36  42  48  54
11  7 |   7  14  21  28  35  42  49  56  63
12  8 |   8  16  24  32  40  48  56  64  72
13  9 |   9  18  27  36  45  54  63  72  81
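The remark in the explanation that += and the longer form are “technically different in some ways” can be made concrete with a sketch. The names alias , other , lst2 , and other2 are made up for this illustration; the point is what happens to a second name that refers to the same object.

```python
Text = 'a'
alias = Text
Text += 'b'          # strings cannot be changed: Text now names a new string
print(Text, alias)   # ab a

lst = [1]
other = lst
lst += [2]           # lists can be changed: the existing list is extended in place
print(lst, other)    # [1, 2] [1, 2]

lst2 = [1]
other2 = lst2
lst2 = lst2 + [2]    # + builds a new list; other2 still names the old one
print(lst2, other2)  # [1, 2] [1]
```

For strings, as in the table program, the two forms give the same result, so the difference only matters once lists (or other changeable objects) are involved.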

4.2 Using String to Build HTML Table

So far the 9x9 Multiplication Tables we have built are all in plain text, i.e., there is no
formatting such as color. The only thing we did was to use hyphens and vertical bars to mimic
horizontal and vertical lines. To add formatting, we can write an HTML file instead of a
plain text one.

Information
Knowing how HTML works is critical because 10-K reports use the
HTML format. (Pre-2002 reports use plain text.)

HTML Basics

Let’s take a look at HTML first (assuming you don’t know anything about it).
HTML stands for HyperText Markup Language (there is another popular name, by the
way), which in a nutshell is basically a whole bunch of predefined tags for “marking up” or
formatting text. Below are some characteristics of HTML tags:

• Tags are (almost) always in pairs: an opening tag and a closing tag. Anything
in-between is the content to be displayed in web browsers.
• Paired tags use the same name and are designated by angle brackets < > ; but the
closing tag is indicated by an extra forward slash / .
• A tag pair can be wholly embedded within another pair; cross-over is invalid.
• Some tags are related and the embedding of one pair within another has to follow
a predefined hierarchy.
• An HTML file can be named as '.html' or '.htm'.

Below is a simple HTML file with some common tags:

Listing 4.4: html-table-example.htm


 1  <html>
 2  <body>
 3  <table>
 4  <tr>
 5  <td>a
 6  </td>
 7  <td>1
 8  </td>
 9  </tr>
10  <tr><td>b</td><td>2</td></tr>
11  </table>
12  </body>
13  </html>

In the above example, all tags are paired and some are embedded within others.

• <html> </html> is the outermost pair that specifies the whole HTML page. So usually
if you find such tags in a file, it indicates it’s an HTML page.
• <body> </body> refers to the main “body” section of the page. (Usually before the
body, there is a “head” section marked by <head> </head> , which is not shown here.)
• <table> </table> marks a table, which has rows.
• <tr> </tr> marks a row. This table has two rows, for which the tags are parallel to
each other. They don’t embed one another.
• <td> </td> marks a cell. By definition, the pair is embedded within a row (marked by
<tr> </tr> ). There are two cells in each row in this example; and the cells within the
same row are parallel to each other.
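The way rows and cells nest can also be checked by letting Python read the tags. The sketch below uses HTMLParser from Python's standard html.parser module; the class name CellCollector and its attributes rows and in_cell are made up for this illustration, and the sketch assumes every <td> sits inside a <tr> , as in the example above.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])        # a new row starts
        elif tag == 'td':
            self.in_cell = True
            self.rows[-1].append('')    # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:                # ignore text outside cells
            self.rows[-1][-1] += data.strip()

page = """<html><body><table>
<tr>
<td>a
</td>
<td>1
</td>
</tr>
<tr><td>b</td><td>2</td></tr>
</table></body></html>"""

parser = CellCollector()
parser.feed(page)
parser.close()
print(parser.rows)
```

Both rows come out the same way even though one is spread over several lines of code and the other is packed into one line.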

Note that there is no such thing as “column” in HTML. What appears to be columns
depends on how cells (“td”) are laid out in rows. In addition, when you open the HTML
page in a web browser, the layout of the page is controlled by those tags; how the code is
laid out is irrelevant. Below is a screenshot of the page rendered in a browser (Figure 4.2).

Figure 4.2: Rendered HTML Table

As you can see, the rendered table looks like plain text without any formatting such
as colors and lines. We will not dig into too much detail here, but let’s look at a simple
example:

Listing 4.5: html-table-example-2.htm


 1  <html>
 2  <head>
 3  <style type="text/css">
 4  td {background-color:#87CEEB;}
 5  </style>
 6  </head>
 7  <body>
 8  <table>
 9  <tr><td>a</td><td>1</td></tr>
10  <tr><td>b</td><td>2</td></tr>
11  </table>
12  </body>
13  </html>

The following are new things in comparison to the previous HTML page:

• There is a new block <head> </head> (lines 2 and 6). It is placed before <body> .
• Within the head block, there is a style block for formatting as specified by
<style> </style> (lines 3 and 5). The opening tag has an attribute called type . text/css
is the typical type of style we use in HTML pages.
• Within the style tags, td {background-color:#87CEEB;} (line 4) gives all cells a
background color. Style settings are separated by semi-colons ; and enclosed
in curly brackets {} .

Color is often specified as a combination of red, green, and blue in the RGB scheme.
Each component is given a value from 0 to 255. For example, white is RGB (255, 255,
255) and black is RGB (0, 0, 0). In HTML, each value is expressed in hexadecimal
in the form of #RRGGBB (first two characters for red, next two for green, then blue). For
example, white is #FFFFFF and black is #000000 . In the above example, #87CEEB is
sky blue.
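The link between the RGB numbers and the #RRGGBB form can be worked out with string formatting. The two helper functions below, rgb_to_hex() and hex_to_rgb() , are made up for this illustration; '{:02X}' is standard Python format syntax for a two-digit upper-case hexadecimal number, and int(text, 16) reads a hexadecimal string.

```python
def rgb_to_hex(red, green, blue):
    """Return '#RRGGBB' for three components in the range 0-255."""
    return '#{:02X}{:02X}{:02X}'.format(red, green, blue)

def hex_to_rgb(code):
    """Return (red, green, blue) for a string such as '#87CEEB'."""
    return (int(code[1:3], 16), int(code[3:5], 16), int(code[5:7], 16))

print(rgb_to_hex(255, 255, 255))
print(rgb_to_hex(0, 0, 0))
print(hex_to_rgb('#87CEEB'))
```

So sky blue #87CEEB is red 135, green 206, and blue 235.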

If you open the file in a web browser, you will see something like Figure 4.3.

Figure 4.3: Rendered HTML Table (2)

Now we have enough tools. Let’s rebuild the 9x9 Multiplication Table in HTML.

Rebuilding the 9x9 Multiplication Table in HTML

Note that everything in the code of an HTML page is still a string, although sometimes the
string can be very long (think about a 10-K report of 500 pages). So what we can do is to
copy and revise the previous string solution. Here is an example:

Listing 4.6: 9x9-math-table-5.py


 1  lstx = [1,2,3,4,5,6,7,8,9]
 2  lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
 3  Text = """
 4  <html>
 5  <head>
 6  <style type="text/css">
 7  td {background-color:#87CEEB;}
 8  </style>
 9  </head>
10  <body>
11  <table>
12  """
13  Text += '<tr><td>*</td>'
14  for x in lstx:
15      Text += ('<td>' + str(x) + '</td>')
16  Text += '</tr>'
17  for y in lsty:
18      Text += '<tr>'
19      Text += ('<td>' + str(y) + '</td>')
20      for x in lstx:
21          Text += ('<td>' + str(y*x).rjust(4, ' ') + '</td>')
22      Text += '</tr>'
23  Text += """
24  </table>
25  </body>
26  </html>
27  """
28  with open('9x9-math-table-5.htm', 'w') as Fw:
29      Fw.write(Text)

Some explanation about the above code line by line:

1. The first line defines a list variable lstx .
2. Defines a list variable lsty .
3. Defines a string variable by assigning a string block. The opening of the string block is
marked by three consecutive quotation marks. Everything that follows (until another
three quotation marks) is considered part of the string. Line breaks (what you see
in the code) are considered as '\n' . For example, there is a line break right after
the three opening quotation marks! (Note that you will get error messages if you
use one opening and one closing quotation mark to enclose multiple lines.)
...
12. These three quotation marks close the string block.
13. The line Text += '<tr><td>*</td>' starts a row with <tr> and builds a complete cell
with <td> </td> . * here is the content of the cell.
14. for x in lstx: loops through the list lstx to add labels (“column heading”) to the
first row.
15. The next indented line Text += ('<td>' + str(x) + '</td>') (within the lstx loop)
adds one cell at a time. Each cell here includes the opening tag, x converted to a string,
and the closing tag. Note that it’s always a good idea to use parentheses to make sure the
operation sequence is correct. Here the code means: first concatenate the three small
strings and then append the result to Text .
16. The next line Text += '</tr>' appends a tag to close the first row.
17. Now we start to build other rows. The line for y in lsty: loops through the list lsty .
18. The indented line (within the scope of the lsty loop) Text += '<tr>' adds an opening
row tag. So for each y there will be a separate row.
19. The next line Text += ('<td>' + str(y) + '</td>') adds a cell for “row heading” with
a converted y as its content.
20. Now comes the calculation. The indented line for x in lstx: loops through the list
lstx for each y .
21. The next line (further indented) Text += ('<td>' + str(y*x).rjust(4, ' ') + '</td>')
adds a cell with formatted y*x as its content.
22. Text += '</tr>' adds a tag to close the row for each y (notice the indentation).
23. Once all calculations are done, we “close” the table, body, then the whole HTML page
by adding relevant closing tags. The additional string block starts with three quotation
marks.
...
27. These three quotation marks indicate the end of the string block.
28. Finally, we can save it to a file. The line with open('9x9-math-table-5.htm', 'w') as Fw:
opens a file in the write mode.
29. The last indented line Fw.write(Text) writes the long string to the file.
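The pattern of wrapping each value in <td> </td> and each row in <tr> </tr> repeats, so it can be packaged once and reused. The helper make_row() below is made up for this illustration (it is not part of the program above), and the sketch uses a smaller 2x3 table to keep the output short.

```python
def make_row(cells):
    """Wrap each item in <td></td> and the whole row in <tr></tr>."""
    row = '<tr>'
    for cell in cells:
        row += ('<td>' + str(cell) + '</td>')
    row += '</tr>'
    return row

lstx = [1, 2, 3]
Text = '<table>'
Text += make_row(['*'] + lstx)                      # column headings
for y in [1, 2]:
    Text += make_row([y] + [y * x for x in lstx])   # row heading, then products
Text += '</table>'
print(Text)
```

The design choice here is simply to keep the tag-building in one place, so that a change to the cell format only has to be made once.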

Now we have an HTML file ready (9x9-math-table-5.htm). If you open the HTML file in
a text editor (or display page source in the browser), you will see the following:

Listing 4.7: 9x9-math-table-5.htm (Code)


1
2 < html >
3 < head >
4 < style type = " text / css " >
5 td { background - color :#87 CEEB ;}
6 </ style >
7 </ head >
8 < body >
9 < table >
10 < tr > < td >* </ td >< td >1 </ td > < td >2 </ td >< td >3 </ td > <td >4 </ td >< td >5 </ td > <
td >6 </ td >< td >7 </ td > < td >8 </ td >< td >9 </ td > </ tr > < tr > <td >1 </ td > < td >
1 </ td > < td > 2 </ td >< td > 3 </ td > <td > 4 </ td > < td > 5 </ td >< td
> 6 </ td >< td > 7 </ td >< td > 8 </ td > < td > 9 </ td ></ tr >< tr > < td >2
</ td > < td > 2 </ td >< td > 4 </ td > <td > 6 </ td > < td > 8 </ td >< td >
10 </ td >< td > 12 </ td > <td > 14 </ td > < td > 16 </ td >< td > 18 </ td > </ tr
>< tr > < td >3 </ td >< td > 3 </ td > <td > 6 </ td >< td > 9 </ td >< td > 12 <
/ td >< td > 15 </ td > <td > 18 </ td > < td > 21 </ td >< td > 24 </ td > <td >
27 </ td ></ tr >< tr > < td >4 </ td >< td > 4 </ td > < td > 8 </ td >< td > 12 </
td > <td > 16 </ td > < td > 20 </ td >< td > 24 </ td > <td > 28 </ td >< td > 32
</ td > < td > 36 </ td ></ tr >< tr > < td >5 </ td >< td > 5 </ td > < td > 10 </ td >
<td > 15 </ td >< td > 20 </ td >< td > 25 </ td > < td > 30 </ td >< td > 35 </
td > <td > 40 </ td > < td > 45 </ td ></ tr > <tr >< td >6 </ td > <td > 6 </ td > <
td > 12 </ td >< td > 18 </ td > <td > 24 </ td >< td > 30 </ td > <td > 36 </ td
>< td > 42 </ td > <td > 48 </ td > < td > 54 </ td ></ tr > <tr >< td >7 </ td > <td >
7 </ td >< td > 14 </ td > < td > 21 </ td >< td > 28 </ td > <td > 35 </ td > <
td > 42 </ td >< td > 49 </ td > <td > 56 </ td >< td > 63 </ td > </ tr > <tr >< td
>8 </ td >< td > 8 </ td > <td > 16 </ td > < td > 24 </ td >< td > 32 </ td > <td >
40 </ td >< td > 48 </ td > < td > 56 </ td >< td > 64 </ td > <td > 72 </ td > </
tr > <tr >< td >9 </ td > <td > 9 </ td > < td > 18 </ td >< td > 27 </ td > <td >
36 </ td >< td > 45 </ td > <td > 54 </ td > < td > 63 </ td >< td > 72 </ td > <td >
81 </ td ></ tr >
11 </ table >
12 </ body >
13 </ html >

Notice:

• The first line in the HTML code is empty. This is the effect of the Text = """ line ( 3 )
in the Python program.
• Line 10 in the HTML code is very long. There are no newline marks '\n' . As we
mentioned earlier, the layout of an HTML page is controlled by tags.

If you open it in a web browser, it will be something like Figure 4.4.

Figure 4.4: 9x9 Multiplication Table (HTML Rendered)


Information
Some 10-K reports put everything, literally everything, in a
single line in the HTML code. Your computer may crash
if you open such reports in text editors (longer lines take
more memory/CPU to process). For example, Starbucks’ 2018
10-K report has more than 100 pages but the main HTML
code is in a single line (with a few additional tags making
up 18 other lines). The single line has more than 3.6 million
characters!
(https://www.sec.gov/Archives/edgar/data/829224/000082922418000052/sbux-9302018x10xk.htm)
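
If you want to know whether a file has such a monster line before opening it in an editor, you can let Python measure it. The following is only a minimal sketch: the function name longest_line_length and the small demo file are made up for illustration, and the file is assumed to be UTF-8 text. It reads the file in chunks, so even a line with millions of characters never has to be handled in one piece.

```python
import os
import tempfile


def longest_line_length(path, chunk_size=1024 * 1024):
    """Return the number of characters in the longest line of a text file."""
    longest = 0   # longest complete line seen so far
    current = 0   # length of the line we are currently inside
    with open(path, 'r', encoding='utf-8', errors='replace') as fr:
        while True:
            chunk = fr.read(chunk_size)
            if not chunk:
                break
            parts = chunk.split('\n')
            # The first part continues the line carried over from the last chunk.
            current += len(parts[0])
            # Every further part means the previous line has ended.
            for part in parts[1:]:
                longest = max(longest, current)
                current = len(part)
    return max(longest, current)


# Demo with a small made-up file (not a real 10-K report).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.htm')
with open(demo_path, 'w', encoding='utf-8') as fw:
    fw.write('<html>\n' + '<td>x</td>' * 500 + '\n</html>\n')

demo_longest = longest_line_length(demo_path)
print('Longest line:', demo_longest, 'characters')
```

If the number printed for a real report is in the millions, it is probably safer to process the file with Python than to open it in a text editor.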
Chapter 5

Data Structure: String/List (2)

Consider the following 9x9 Multiplication Table...for a third time:

1 2 3 4 5 6 7 8 9

1 1 2 3 4 5 6 7 8 9
2 2 4 6 8 10 12 14 16 18
3 3 6 9 12 15 18 21 24 27
4 4 8 12 16 20 24 28 32 36
5 5 10 15 20 25 30 35 40 45
6 6 12 18 24 30 36 42 48 54
7 7 14 21 28 35 42 49 56 63
8 8 16 24 32 40 48 56 64 72
9 9 18 27 36 45 54 63 72 81

So far we have used two general methods to build the table: (1) writing small strings
many times to the output file, and (2) concatenating small strings into a long one first and
then writing it to the output file all at once. So far so good.
But there is another efficiency problem: building a long string by concatenating small
strings one by one is not efficient. In fact, this method is a no-no in computer programming [1].
But don’t blame us...we are accountants!
So let’s look at another way to build the long string that avoids the concatenation method.
Before doing that, we will take another look at the operations of list and string .

[1] See, e.g., (1) https://stackoverflow.com/questions/44487537/ (2) https://docs.python.org/3/library/stdtypes.html

5.1 List Operations

Up to now we have looked at how to define and loop through lists. But there are many other
things lists can do (or rather, things we can do with lists), such as changing elements,
extending lists, or dropping elements, among others. Let’s look at some examples.
Below are the two lists we’ve used so far:

lstx = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]

Copying List

In the above code we define two lists separately, even though they are identical. This can
become annoying if the lists are long. But it would be wrong to try to save time by
assigning one to the other in the form of lsty = lstx . This code does not give
you two separate lists; rather, it gives you two labels that point to the same list. (In Python,
a variable is a name that refers to an object, much like a pointer in other programming
languages.) In this case, if you change lstx , you will be changing lsty at the same time
because they are one list, not two.

To make a new, identical list, we can use the copy() method:

lstx = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lstxa = lstx         # lstxa and lstx are the same thing.
lsty = lstx.copy()   # lsty is a new, separate list.

(Recall that # makes the rest of the line a “comment”.)
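
The difference between a second label and a real copy is easy to check for yourself. The short sketch below uses small made-up lists; the names nested, shallow, and deep are only for illustration. It also shows one caveat worth knowing: copy() copies only the outer list, so for a list of lists the inner lists are still shared. For that case the standard copy module offers deepcopy().

```python
import copy

lstx = [1, 2, 3]
lstxa = lstx           # a second label for the same list
lsty = lstx.copy()     # a new, separate list

lstx.append(4)         # change the list through the label lstx
print(lstxa)           # [1, 2, 3, 4]  (changed as well)
print(lsty)            # [1, 2, 3]     (not affected)
print(lstxa is lstx)   # True: one list, two labels
print(lsty is lstx)    # False: two different lists

# copy() is "shallow": the inner lists of a list of lists are shared.
nested = [[1, 2], [3, 4]]
shallow = nested.copy()
deep = copy.deepcopy(nested)
nested[0].append(99)
print(shallow[0])      # [1, 2, 99]
print(deep[0])         # [1, 2]
```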

Adding Elements

A list has two methods for adding new elements to itself: append() and extend() . These
two methods work slightly differently:

• lstx.append(x) will add x to lstx at its end (to the right, of course). Note that here
x is added as a whole, no matter what it is. So in the end the new lstx will have exactly
one more element. The input can be a scalar value (i.e., a single value) or an iterable
object (i.e., something that is made up of other elements, such as a list).
• lstx.extend(x) will also add x to the end of lstx . But the difference is that x must be
a list (or a similar iterable data type), and each of its elements will be added to lstx
one by one. So the length of the new lstx depends on the length of x ; it’s not just one
more.

Here are some examples:

Listing 5.1: list-operations.py


1 lstx = [1, 2, 3, 4, 5, 6, 7, 8, 9]
2 lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
3 lstx.append(10)
4 lsty.extend([10])
5 print('new lstx 1:', lstx)
6 print('new lsty 1:', lsty)
7 lstx.append([11, 12, 13])
8 lsty.extend([11, 12, 13])
9 print('new lstx 2:', lstx)
10 print('new lsty 2:', lsty)

Some explanation about the above code:

1. Define a list lstx .
2. Define a list lsty .
3. “Append” a number to lstx (technically, it is the list lstx that appends the number to
itself).
4. “Extend” the list lsty with the same number. Again, technically it is lsty that extends
itself. Note that extend() requires an iterable as input, so a scalar value such as a
bare number is invalid. In this line, the number has to be put in a list with one
element, [10] . Keep in mind that 10 is not the same as [10] .
5. Display the new lstx .
6. Display the new lsty .
7. Append a list to lstx .
8. Extend lsty with the same list.
9. Display the new lstx again.
10. Display the new lsty again.

Running the above code will give the following result:

1 new lstx 1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
2 new lsty 1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
3 new lstx 2: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, [11, 12, 13]]
4 new lsty 2: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

Notice that:

1. The list lstx now has 10 elements.


2. Similarly, the list lsty has 10 elements as well. These two new lists are equivalent.
3. lstx has 11 elements after “appending” the list [11,12,13] , which is added as a
whole.
4. lsty has 13 elements after “extending” itself with the list [11,12,13] , whose elements
are added to lsty one by one. Note that the two new lists 3 and 4 are not equivalent.
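
Two small pitfalls follow from the rule that extend() needs an iterable. The sketch below uses made-up lists (lst, letters, and words are illustrative names) to show what happens when extend() gets a bare number, and when it gets a string. A string is itself an iterable of characters, so extend() adds the characters one by one, which is rarely what you want.

```python
lst = [1, 2, 3]
try:
    lst.extend(10)              # a bare number is not iterable
except TypeError as err:
    print('extend(10) failed:', err)
print(lst)                      # [1, 2, 3]  (unchanged)

letters = ['x']
letters.extend('yz')            # the string is taken apart into characters
print(letters)                  # ['x', 'y', 'z']

words = ['x']
words.append('yz')              # the string is added as a whole
print(words)                    # ['x', 'yz']
```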
List Comprehension

List comprehension is a method of making a new list by using each of the elements of another
list. Often some operations are applied to the elements and only the results are used in the
new list. Below is an example.

Listing 5.2: list-comprehension.py


1 lstx = [1, 2, 3, 4, 5, 6, 7, 8, 9]
2 newx = [x**2 for x in lstx]
3 newxt = [str(x**2).rjust(3, '.') for x in lstx]
4 print('lstx:', lstx)
5 print('newx:', newx)
6 print('newxt:', newxt)

Here is how the code works:

1. The first line defines a list, nothing new.
2. The next line newx = [x**2 for x in lstx] makes a new list named newx by using the
squared value of each element from lstx . Just as when we define a list manually, the
new list newx is marked by square brackets. for x in lstx is the same as what we
used in previous chapters to loop through the list, with each element assigned to a
variable x in turn.
3. The next line does something similar to create a new list newxt . Here we convert the
square of each element into a string and pad it on the left with dots to a width of
three characters.
4. The next three lines display the three lists.
...

The code will give the following result (shown in terminal):

lstx: [1, 2, 3, 4, 5, 6, 7, 8, 9]
newx: [1, 4, 9, 16, 25, 36, 49, 64, 81]
newxt: ['..1', '..4', '..9', '.16', '.25', '.36', '.49', '.64', '.81']
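
If the one-line form looks too compact, it may help to see that a list comprehension is just a shorthand for a loop with append() . The sketch below builds the same list both ways. It also shows an optional if part, which this chapter does not use, that keeps only some of the elements; the variable names newx_loop and even_squares are made up for illustration.

```python
lstx = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# The long way: an empty list, a loop, and append().
newx_loop = []
for x in lstx:
    newx_loop.append(x**2)

# The short way: a list comprehension.
newx = [x**2 for x in lstx]
print(newx == newx_loop)    # True

# An optional "if" keeps only the elements that pass a test.
even_squares = [x**2 for x in lstx if x % 2 == 0]
print(even_squares)         # [4, 16, 36, 64]
```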

Converting String to List

It is often necessary to convert a string into a list or to make a string from a list. These
operations are similar to the “text to column” and “concatenate” functions in Microsoft
Excel. For example, in the master index files you can download from the SEC’s EDGAR
platform, you can find lines like:

1000230|OPTICAL CABLE CORP|10-K|2020-01-27|edgar/data/1000230/0001437749-20-001224.txt

The above line has several fields separated by vertical bars ( | ), which are called
delimiters. In Microsoft Excel you can use the “text to column” function to convert the line
into several columns. We can do something similar in Python: convert it into a list so that
we can process the elements one by one. Here is an example:

Listing 5.3: string-to-list.py


1 idx = '1000230|OPTICAL CABLE CORP|10-K|2020-01-27|edgar/data/1000230/0001437749-20-001224.txt'
2 lidx = idx.split('|')
3 print(lidx)

Some explanation about the above code:

1. Define a string variable called idx .
2. Use the method split() to “split” the string idx into a new list lidx by using | as
the delimiter. Technically, it is the string that calls the method to split itself into a
new list.
3. Display the new list in terminal.

Here is what the new list looks like:

['1000230', 'OPTICAL CABLE CORP', '10-K', '2020-01-27', 'edgar/data/1000230/0001437749-20-001224.txt']
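
Once the line is a list, each field can be given a name of its own. The sketch below does this in one step by assigning the five elements to five variables. The variable names (cik, company, form_type, filing_date, path) are labels chosen here for illustration; they are not names defined by EDGAR or by the code above.

```python
idx = '1000230|OPTICAL CABLE CORP|10-K|2020-01-27|edgar/data/1000230/0001437749-20-001224.txt'

# Five fields on the right, five names on the left.
cik, company, form_type, filing_date, path = idx.split('|')

print(company)       # OPTICAL CABLE CORP
print(form_type)     # 10-K

# A field is still a string, so it can be split again.
year = filing_date.split('-')[0]
print(year)          # 2020
```

Note that this one-step assignment only works when the number of names matches the number of fields exactly; otherwise Python raises an error.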

Making String From List

We can also convert the list into a string. For example, we can re-make the index line by
using anything other than the vertical bar | . Below is an example:

Listing 5.4: list-to-string.py


1 idx = '1000230|OPTICAL CABLE CORP|10-K|2020-01-27|edgar/data/1000230/0001437749-20-001224.txt'
2 lidx = idx.split('|')
3 sidx = '*'.join(lidx)
4 print(sidx)

Here is how the code works:

1. The first line defines a string, nothing new here.


2. Same as the previous example, the second line lidx = idx.split('|') splits the string
into a new list lidx by using | as the delimiter.
3. The next line sidx = '*'.join(lidx) uses the asterisk * to “join” or concatenate all
elements of the list lidx . So the * here works like glue. Note that technically and
somewhat confusingly, the “leading actor” here is the asterisk * , which is a string. It
calls its own method join() . Also note that the input to the join() method has to be
a list (or another type of iterable) whose elements are all strings.

The code will give the following result (displayed in terminal):

1000230*OPTICAL CABLE CORP*10-K*2020-01-27*edgar/data/1000230/0001437749-20-001224.txt
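
One thing to watch for: join() only accepts strings. If the list contains numbers, they have to be converted first. The following sketch uses a made-up list named row to show the error and one common fix, which converts every element with str() on the fly.

```python
row = [1000230, 'OPTICAL CABLE CORP', '10-K', 2020]   # two numbers, two strings

try:
    line = '|'.join(row)            # fails because of the numbers
except TypeError as err:
    print('join() failed:', err)

# Convert each element to a string first, then join.
line = '|'.join(str(item) for item in row)
print(line)                         # 1000230|OPTICAL CABLE CORP|10-K|2020
```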
5.2 Using List to Build HTML Table

Now we have enough tools to rebuild the 9x9 Multiplication Table again. This time we use
lists to improve efficiency. Below is one solution:

Listing 5.5: 9x9-math-table-6.py


1 lstx = [1, 2, 3, 4, 5, 6, 7, 8, 9]
2 lsty = [1, 2, 3, 4, 5, 6, 7, 8, 9]
3 lstText = ["""
4 <html>
5 <head>
6 <style type="text/css">
7 td {background-color:#87CEEB;}
8 </style>
9 </head>
10 <body>
11 <table>
12 """]
13 lstText.append('<tr><td>*</td>')
14 for x in lstx:
15     lstText.extend(['<td>', str(x), '</td>'])
16 lstText.append('</tr>')
17 for y in lsty:
18     lstText.append('<tr>')
19     lstText.extend(['<td>', str(y), '</td>'])
20     for x in lstx:
21         lstText.extend(['<td>', str(y*x).rjust(4, ' '), '</td>'])
22     lstText.append('</tr>')
23 lstText.append("""
24 </table>
25 </body>
26 </html>
27 """)
28 with open('9x9-math-table-6.htm', 'w') as Fw:
29     Fw.write(' '.join(lstText))

The code here is quite similar to 9x9-math-table-5.py on page 45. But for the sake of
completeness, the code here is explained line by line in full, even if some of it is
repetitive:

1. The first line defines a list variable lstx .
2. Defines a list variable lsty .
3. Defines a list lstText with a string block as the first element. The opening of the
string block is marked by three consecutive quotation marks. Everything that follows
(until another three quotation marks) is considered part of the string. Line breaks (what
you see in the code) are stored as '\n' . For example, there is a line break right after
the three opening quotation marks! (Note that you will get error messages
if you use one opening and one closing quotation mark to enclose multiple lines.)

...
12. These three quotation marks close the string block. Note that the whole string block is
put within the square brackets.
13. The line lstText.append('<tr><td>*</td>') appends a string to the list. The string starts
a row with <tr> and builds a complete cell with <td> </td> . * here is the content of
the cell.
14. for x in lstx: loops through the list lstx to add labels (“column heading”) to the
first row.
15. The next indented line lstText.extend(['<td>', str(x), '</td>']) (within the lstx
loop) “extends” lstText with a list of three elements. As usual, the elements are
separated by commas and enclosed in square brackets. This line adds one cell at a time.
Each cell here includes the opening tag, x converted to a string, and the closing tag.
16. The next line lstText.append('</tr>') appends a tag to close the first row.
17. Now we start to build other rows. The line for y in lsty: loops through the list lsty .
18. The indented line (within the scope of the lsty loop) lstText.append('<tr>') adds
an opening row tag. So for each y there will be a separate row.
19. The next line lstText.extend(['<td>', str(y), '</td>']) adds a cell for “row heading”
with a converted y as its content. Again, the list lstText extends itself with a list of
three elements.
20. Now comes the calculation. The indented line for x in lstx: loops through the list
lstx for each y .

21. The next line (further indented) lstText.extend(['<td>', str(y*x).rjust(4, ' '), '</td>'])
adds a cell with formatted x*y as its content. So again, technically, the list lstText
extends itself with a list of three elements.
22. lstText.append('</tr>') adds a tag to close the row for each y (notice the indenta-
tion).
23. Once all calculations are done, we “close” the table, body, then the whole HTML page
by adding relevant closing tags. The additional string block starts with three quotation
marks and is appended to the list lstText as an element.
...
27. These three quotation marks indicate the end of the string block.
28. Finally, we can save it to a file. The line with open('9x9-math-table-6.htm', 'w') as Fw:
opens a file in the write mode.
29. The last indented line Fw.write(' '.join(lstText)) uses a white space to glue all
elements of the list lstText together and then writes the long string to the file.

The above code will generate an HTML file named 9x9-math-table-6.htm . If you open it
in a text editor, the code will be something like the following:

Listing 5.6: 9x9-math-table-6.htm (Code)


1
2 < html >
3 < head >


4 < style type =" text / css " >
5 td { background - color :#87 CEEB ;}
6 </ style >
7 </ head >
8 < body >
9 < table >
10 <tr >< td >* </ td > <td > 1 </td > <td > 2 </ td > <td > 3 </ td > <td > 4 </ td
> <td > 5 </ td > <td > 6 </ td > <td > 7 </ td > <td > 8 </ td > <td > 9 </
td > </ tr > <tr > <td > 1 </ td > <td > 1 </ td > <td > 2 </ td > <td
> 3 </td > <td > 4 </ td > <td > 5 </ td > <td > 6 </ td > <
td > 7 </ td > <td > 8 </ td > <td > 9 </ td > </ tr > <tr > <td >
2 </td > <td > 2 </ td > <td > 4 </ td > <td > 6 </ td > <td >
8 </ td > <td > 10 </ td > <td > 12 </ td > <td > 14 </ td > <td >
16 </ td > <td > 18 </ td > </ tr > <tr > <td > 3 </ td > <td > 3 </
td > <td > 6 </ td > <td > 9 </ td > <td > 12 </ td > <td > 15
</ td > <td > 18 </ td > <td > 21 </ td > <td > 24 </ td > <td > 27
</ td > </ tr > <tr > <td > 4 </ td > <td > 4 </ td > <td > 8 </ td >
<td > 12 </ td > <td > 16 </ td > <td > 20 </ td > <td > 24 </ td >
<td > 28 </ td > <td > 32 </ td > <td > 36 </ td > </ tr > <tr > <td
> 5 </ td > <td > 5 </ td > <td > 10 </ td > <td > 15 </ td > <td >
20 </ td > <td > 25 </ td > <td > 30 </ td > <td > 35 </ td > <td >
40 </ td > <td > 45 </ td > </ tr > <tr > <td > 6 </ td > <td > 6
</ td > <td > 12 </ td > <td > 18 </ td > <td > 24 </ td > <td > 30
</ td > <td > 36 </ td > <td > 42 </ td > <td > 48 </ td > <td >
54 </ td > </ tr > <tr > <td > 7 </ td > <td > 7 </ td > <td > 14 </ td
> <td > 21 </ td > <td > 28 </ td > <td > 35 </ td > <td > 42 </
td > <td > 49 </ td > <td > 56 </ td > <td > 63 </ td > </ tr > <tr >
<td > 8 </ td > <td > 8 </ td > <td > 16 </ td > <td > 24 </ td > <
td > 32 </ td > <td > 40 </ td > <td > 48 </ td > <td > 56 </ td >
<td > 64 </ td > <td > 72 </ td > </ tr > <tr > <td > 9 </ td > <td >
9 </ td > <td > 18 </ td > <td > 27 </ td > <td > 36 </ td > <td >
45 </ td > <td > 54 </ td > <td > 63 </ td > <td > 72 </ td > <td >
81 </ td > </ tr >
11 </ table >
12 </ body >
13 </ html >

If you open it in a web browser, it will be something like Figure 5.1.


Figure 5.1: 9x9 Multiplication Table HTML Rendered (2)
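
For readers who want to go one step further, the same table can be built more compactly by combining the tools from this chapter: list comprehension and join() . The following is only a sketch of one possible variant, not a replacement for 9x9-math-table-6.py. The helper function make_row , the variable names rows and html , and the f'...' strings (a shorthand for putting values into a string) are choices made here for illustration. The style block is left out to keep it short.

```python
lstx = list(range(1, 10))   # the same values as [1, 2, ..., 9]
lsty = list(range(1, 10))


def make_row(cells):
    """Build one <tr> row from a list of cell contents."""
    cell_text = ''.join(f'<td>{c}</td>' for c in cells)
    return f'<tr>{cell_text}</tr>'


# First row: the column headings.
rows = [make_row(['*'] + lstx)]
# Other rows: the row heading y followed by the nine products.
rows.extend(make_row([y] + [y * x for x in lstx]) for y in lsty)

# Glue everything together once, one row per line.
html = '\n'.join(['<html>', '<body>', '<table>'] + rows
                 + ['</table>', '</body>', '</html>'])

print(len(rows), 'rows built')
print(rows[1])
```

The string html could then be written to a file with Fw.write(html) in the same way as before. Because each row is glued with a newline, the resulting HTML code no longer sits in one very long line.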


Chapter 6

Text Search/Count

We’ve had enough numbers and multiplications...so no more 9x9 Multiplication Table! Let’s
talk about words. A very basic task in textual analysis is searching for and counting words.
So in this chapter we will look at how to write Python code to perform this task. We will use
the following text as an example, which is copied from a Starbucks 10-K report:

Listing 6.1: starbucks-10k-one-paragraph.txt


1 Starbucks is the premier roaster, marketer and retailer of
2 specialty coffee in the world, operating in 83 markets. Formed
3 in 1985, Starbucks Corporation’s common stock trades on the
4 NASDAQ Global Select Market (“NASDAQ”) under the symbol
5 “SBUX.” ... We also sell a variety of coffee and tea products
6 and license our trademarks through other channels such as
7 licensed stores, as well as grocery and foodservice through our
8 Global Coffee Alliance with Nestlé S. A. (“Nestlé”).

Some basic information about the above text:

• 8 lines (which were manually created to fit the text box);
• 75 words (counted by LibreOffice Writer, which is similar to Microsoft Word);
• 474 characters (again, according to LibreOffice Writer);
• 2 numbers; and
• some punctuation marks.

(You may have noticed something strange: depending on the font, the ’s’ after “Starbucks
Corporation” in Line 3 can appear way off after the apostrophe, as if there were a white
space (there isn’t). We will come back to this.)

6.1 Search Basics

Text search functions are provided by a Python module called re (short for regular
expression). The basic function of re can be summarized as follows:

1. It scans the text you give it.
2. During the scan, it tries to match the characters in the text with patterns you specify.
Patterns are somewhat like formulas in Microsoft Excel, although they are totally
different things. Patterns are specified by using some special characters.
3. If it finds a match, the string of the matched characters is put into a list.
4. It continues the search until the end of the text.
5. It returns a list of strings. If there is no match, the list is empty.

Now let’s do some searching by using the Starbucks text. (Maybe some “soul searching” as
well?)
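
Before turning to the Starbucks file, the five steps above can be tried on a tiny made-up sentence. This is only a warm-up sketch; the sentence and the variable names are invented for illustration. It uses re.findall() , the function this chapter relies on throughout.

```python
import re

text = 'Tea, coffee, and more coffee.'

found = re.findall(r'coffee', text)     # two matches
print(found)                            # ['coffee', 'coffee']

nothing = re.findall(r'cocoa', text)    # no match at all
print(nothing)                          # []

vowels = re.findall(r'[aeiou]', 'tea')  # any one of the listed characters
print(vowels)                           # ['e', 'a']
```

In each case the result is a list of strings, and an empty list means nothing matched.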

Counting All Characters

Let’s first check how many characters there are in the text. To do this, we can simply use a
system function called len() , which does not require the re module.

Listing 6.2: search-1.py


1 Text = ''
2 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
3     Text = Fr.read()
4 print('All characters:', len(Text))

Some explanation about the above code:

1. The first line defines a string variable by assigning an empty string to it. Note that
such an initial definition is not required in Python, but it is often a good idea because
it helps you keep track of what variables you have. This is not a problem if your code is
short. But when it gets longer, it becomes harder to keep track. This is because variables
are used within scopes. If you inadvertently reuse variable names, you may not get what
you want.
2. The second line with open() as opens the text file in the read mode (as a file object).
3. The line Text = Fr.read() uses the read() method of the file object to get its string
and assigns the string to the variable Text .
4. Finally, we use the len() system function to get the length of the string, i.e., the
number of characters.

If you run the above code, the result will be (in terminal):

All characters: 482

It means there are 482 characters in the file. But how come LibreOffice Writer gave us 474?
This is because word processing software such as LibreOffice Writer and Microsoft Word only
counts visible or printable characters; it doesn’t count non-printable ones such as line
breaks or newline symbols \n (which are called paragraph marks in word processors). So in
this example, the line breaks make up the difference in character count (482 - 474 = 8).
Let’s check if there are indeed 8 line breaks in the file. This time we need the re module.

Counting Specific Characters (1)

The following is an example of how to do the counting:

Listing 6.3: search-special.py


1 import re
2 Text = ''
3 Count = 0
4 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
5     Text = Fr.read()
6 Pattern = r'\n'
7 lstResult = re.findall(Pattern, Text)
8 Count = len(lstResult)
9 print('Newline characters:', Count)

Some explanation about the above code:

1. The first line import re loads/imports the re module by using the import keyword.
This is needed because Python does not load everything automatically, to avoid slowing
down your computer with unused things. (Just as you don’t buy things you don’t use.)
2. The second line defines an empty string variable Text .
3. The next line defines an integer variable Count with an initial value of 0.
4. The line with open() as opens the input file as a file object in the read mode.
5. The read() method takes the string of the file object and assigns it to Text .
6. The line Pattern = r'\n' builds a pattern by using the r'' mark. The letter r means
“raw”: Python leaves any backslash in the string as it is instead of interpreting it, so
the pattern reaches the re module exactly as typed. The letter is not always required,
but it’s a good idea to use it whenever the pattern contains special characters. This
line of code is an example: we have a backslash here.
7. The next line, lstResult = re.findall(Pattern, Text) , uses the findall() function from
the re module to do the search (the syntax is module.function ). The findall() function
requires two inputs separated by a comma: the first is the pattern we want to use,
and the second is the string we want to search.
Note that findall() returns the search result as a list, in which each element is a
string that matches the pattern. Here we assign the result to a new variable lstResult ,
which we didn’t pre-define.
8. The line Count = len(lstResult) uses the system function len() (without any dot) to get
the length of the list (number of elements) and assigns it to the variable Count .
9. The last line prints out the count.

If you run the above code, you will see the following:

Newline characters: 8

There are indeed 8 newline characters in the 10-K file.
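
As a cross-check, the same count can be obtained without the re module, because every string has a count() method. The sketch below uses a short made-up string instead of the Starbucks file, so the numbers are for this sample only. It also shows how the “visible” character count of a word processor relates to len() . When reading a real file that contains non-English characters, it may be safer to name the encoding explicitly, for example open(filename, 'r', encoding='utf-8') ; this assumes the file was saved as UTF-8.

```python
import re

sample = 'line one\nline two\nline three\n'   # made-up text with 3 line breaks

n_regex = len(re.findall(r'\n', sample))      # the method used above
n_count = sample.count('\n')                  # the string's own method
print(n_regex, n_count)                       # 3 3

all_chars = len(sample)                       # every character, visible or not
visible = all_chars - n_count                 # what a word processor would report
print(all_chars, visible)                     # 29 26
```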

Counting Specific Characters (2)

Now let’s try to count how many punctuation marks are in the file. Let’s start with
quotation marks. We know that in Python code quotation marks are special characters used to
“mark” strings, not to quote someone as in our everyday language.
Here is an example of how to search for double quotation marks:

Listing 6.4: search-special-2.py


1 import re
2 Text = ''
3 Count = 0
4 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
5     Text = Fr.read()
6 Pattern = r'"'
7 lstResult = re.findall(Pattern, Text)
8 Count = len(lstResult)
9 print('Double quotation marks:', Count)

This code is almost identical to the previous one. Here we search for the double quotation
mark, so the search pattern is r'"' . If you run this code, you will get:

Double quotation marks: 0

How come it is zero? We know there are double quotation marks in the file. This
has something to do with the “strange things” we talked about at the beginning.
More specifically, the double quotation marks in the 10-K report are different from
the one we use in the code:

In the code The double quotation mark is the one in the ASCII table. It can be entered
directly by using the related key on your keyboard (typically an English keyboard). There
is only one key for the double quotation mark.
In the 10-K report The double quotation marks are typographic (“curly”) marks, which are
not part of the ASCII table. There are two different marks: left and right. To deal with
such non-ASCII marks, we have to use Unicode.

Table 6.1 lists the quotation marks and their Unicode numbers:


Table 6.1: Unicode Quotation Marks

Display   Unicode Name (All Uppercase)    Python    Notes

"         QUOTATION MARK                  \u0022    ASCII Double
'         APOSTROPHE                      \u0027    ASCII Single
“         LEFT DOUBLE QUOTATION MARK      \u201C
”         RIGHT DOUBLE QUOTATION MARK     \u201D
‘         LEFT SINGLE QUOTATION MARK      \u2018
’         RIGHT SINGLE QUOTATION MARK     \u2019

Depending on the font, the ASCII double and single marks in the table may look different
from those in the Python programs we have. They are the same characters; any difference
in appearance is only because different font types are used.
Now we can revise the code and redo the search:

Listing 6.5: search-special-3.py


1 import re
2 Text = ''
3 Count = 0
4 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
5     Text = Fr.read()
6 Pattern = r'[\u201C\u201D]'
7 lstResult = re.findall(Pattern, Text)
8 Count = len(lstResult)
9 print('Double quotation marks count:', Count)
10 print('Double quotation marks list:', lstResult)

The code here is largely the same as the previous one, with one difference:

...
6. The new pattern is r'[\u201C\u201D]' . Here we use square brackets to enclose two
marks (left and right double quotation). The square bracket means “any of”, i.e., a
character will be included if it matches any of the characters in the bracket.
...
10. Print out the result (list).

The code will give the following result:

Double quotation marks count: 6
Double quotation marks list: ['“', '”', '“', '”', '“', '”']
Now we’re talking! There are indeed 6 double quotation marks in the file.
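
When you are not sure which “strange” characters a file contains, Python can tell you. The sketch below is one possible way to do it, not part of the programs above: it lists every non-ASCII character in a short sample together with its official Unicode name (using the standard unicodedata module), and then replaces the curly marks with their ASCII cousins. The sample string and the names non_ascii , to_ascii , and plain are made up for illustration.

```python
import unicodedata

# A short sample with curly quotation marks and a curly apostrophe.
text = 'Market (“NASDAQ”) under the symbol “SBUX.” Corporation’s'

# Collect each non-ASCII character once and show its Unicode name.
non_ascii = sorted({ch for ch in text if ord(ch) > 127})
for ch in non_ascii:
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

# Map the four curly marks to the two ASCII marks.
to_ascii = str.maketrans({
    '\u201C': '"', '\u201D': '"',   # left/right double quotation marks
    '\u2018': "'", '\u2019': "'",   # left/right single quotation marks
})
plain = text.translate(to_ascii)
print(plain)
print(plain.count('"'))             # 4
```

After such a replacement, a simple search for the ASCII double quotation mark finds all of them. Whether replacing characters is acceptable depends on your research design, since it changes the original text.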

Counting Words

Now let’s count some words. The problem is that there is no such thing as a “word” in
Python; there are only strings and characters. And there is no perfect way to find words
with 100 percent accuracy. So here we only aim for rough counts, i.e., good enough counts.
In the following example, we assume that words are bounded by white spaces (including
non-printable characters such as line breaks); so anything in between two white spaces is a
word. So here we only need to “split” the whole text into a list and then check its length.

Listing 6.6: search-count-words.py


1 Text = ''
2 Count = 0
3 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
4     Text = Fr.read()
5 lstResult = Text.split()
6 Count = len(lstResult)
7 print('Number of words:', Count)
8 print('List of words:', lstResult)

In comparison with the previous code (search-special-3.py, p. 63), the differences are:

• We don’t do any real search here. So the regular expression module ( re ) is not
needed.
• For the same reason, we don’t need any search patterns.
• The new line here (line 5), lstResult = Text.split() , simply uses the split() method
of the string Text to split it into a list of (smaller) strings. Because we don’t give
any input, the method uses the default separator (which is any run of white spaces).

The code will generate the following result (shown in terminal):

Number of words: 75
List of words: ['Starbucks', 'is', 'the', 'premier', 'roaster,', 'marketer', 'and',
'retailer', 'of', 'specialty', 'coffee', 'in', 'the', 'world,', 'operating', 'in', '83',
'markets.', 'Formed', 'in', '1985,', 'Starbucks', 'Corporation’s', 'common', 'stock',
'trades', 'on', 'the', 'NASDAQ', 'Global', 'Select', 'Market', '(“NASDAQ”)', 'under',
'the', 'symbol', '“SBUX.”', '...', 'We', 'also', 'sell', 'a', 'variety', 'of', 'coffee',
'and', 'tea', 'products', 'and', 'license', 'our', 'trademarks', 'through', 'other',
'channels', 'such', 'as', 'licensed', 'stores,', 'as', 'well', 'as', 'grocery', 'and',
'foodservice', 'through', 'our', 'Global', 'Coffee', 'Alliance', 'with', 'Nestlé', 'S.',
'A.', '(“Nestlé”).']

As you can see, we got exactly the same number as LibreOffice Writer counted. The solution
is not perfect because:

• Punctuation marks are included in words, e.g., (“NASDAQ”) , S. , and ... .
• Numbers are counted as words, e.g., 83 and 1985, .
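
Both problems can be reduced with a little extra cleaning. The following is only a rough sketch of one possible approach, in line with the “good enough” spirit above: it strips punctuation marks from both ends of every piece and then drops pieces that are empty or consist of digits only. The sample sentence is a shortened, made-up variant of the Starbucks text, and the names strip_chars , tokens , and words are chosen here for illustration.

```python
import string

text = ('Formed in 1985, Starbucks Corporation’s stock trades on the '
        'NASDAQ (“NASDAQ”) ... in 83 markets.')

pieces = text.split()
print(len(pieces))        # 15 pieces, including '1985,', '...' and '83'

# ASCII punctuation plus the four curly quotation marks.
strip_chars = string.punctuation + '“”‘’'

# Remove punctuation from both ends of each piece (not from the middle).
tokens = [p.strip(strip_chars) for p in pieces]

# Keep a piece only if something is left and it is not a pure number.
words = [t for t in tokens if t and not t.isdigit()]
print(len(words))         # 12
print(words)
```

Note that the apostrophe inside Corporation’s survives because strip() only works on the two ends of a string. Whether Corporation’s should count as one word or two is exactly the kind of judgment call that makes word counting imperfect.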

Counting Numbers

We know there are two numbers in the sample 10-K report. Let’s write a Python program
to find out.

Listing 6.7: search-count-numbers.py


1 import re
2 Text = ''
3 Count = 0
4 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
5     Text = Fr.read()
6 Pattern = r'\d+'
7 lstResult = re.findall(Pattern, Text)
8 Count = len(lstResult)
9 print('Number of numbers:', Count)
10 print('List of numbers:', lstResult)

This program is very similar to the previous code (search-special-3.py, p. 63). For
the purpose of completeness, let’s go through the lines one by one:

1. The first line import re loads/imports the re module by using the import keyword.
2. The second line defines an empty string variable Text .
3. The next line defines an integer variable Count with an initial value of 0.
4. The line with open() as opens the input file as a file object in the read mode.
5. The read() method takes the string of the file object and assigns it to Text .
6. The line Pattern = r'\d+' builds a pattern by using the r'' mark. The letter r means
“raw”: Python leaves the backslash in the string as it is, so the pattern reaches the
re module exactly as typed.

• \d represents any of the digits 0-9.
• + means to match one or more digits, until it reaches a non-digit character.

7. The next line, lstResult = re.findall(Pattern, Text) , uses the findall() function from
the re module to do the search (the syntax is module.function ). The findall() function
requires two inputs separated by a comma: the first is the pattern we want to use,
and the second is the string we want to search.
Note that findall() returns the search result as a list, in which each element is a
string that matches the pattern. Here we assign the result to a new variable lstResult ,
which we didn’t pre-define.
8. The line Count = len(lstResult) uses the system function len() (without any dot) to get
the length of the list (number of elements) and assigns it to the variable Count .
9. This line prints out the count.
10. The last line prints out the list of matched numbers.
The code will generate the following result (shown in terminal):

Number of numbers: 2
List of numbers: ['83', '1985']
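
The simple pattern \d+ works for this paragraph, but financial text is full of numbers with thousands separators and decimals, and \d+ will cut those into pieces. The sketch below shows the problem and one possible refinement. The sample sentence, including the figure 24,719.5, is made up for illustration, and the refined pattern is only one of many ways to write it.

```python
import re

text = 'operating in 83 markets. Formed in 1985, with revenues of 24,719.5 million'

# The simple pattern breaks 24,719.5 into three "numbers".
simple = re.findall(r'\d+', text)
print(simple)     # ['83', '1985', '24', '719', '5']

# Digits, then optional groups of ",ddd", then an optional decimal part.
# (?: ... ) groups things without changing what findall() returns.
Pattern = r'\d+(?:,\d{3})*(?:\.\d+)?'
better = re.findall(Pattern, text)
print(better)     # ['83', '1985', '24,719.5']
```

Note that the comma after 1985 is not swallowed, because the pattern only accepts a comma when exactly three digits follow it.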
6.2 Searching for Keywords

Now let’s do some keyword searching. We are interested in how many times Starbucks used
the word “coffee” in the first paragraph of its annual report. Here is an example:

Listing 6.8: search-keyword.py


1 import re
2 Text = ''
3 Count = 0
4 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
5     Text = Fr.read()
6 ### a
7 Pattern = r'coffee'
8 lstResult = re.findall(Pattern, Text)
9 Count = len(lstResult)
10 print('coffee (a):', Count, lstResult)
11 ### b
12 Pattern = r'coffee'
13 lstResult = re.findall(Pattern, Text, flags=re.I)
14 Count = len(lstResult)
15 print('coffee (b):', Count, lstResult)

The code will give the following result:

coffee (a): 2 ['coffee', 'coffee']
coffee (b): 3 ['coffee', 'coffee', 'Coffee']

There are two different counts here. You may have already figured out why: the first search is
case-sensitive and matches only the lowercase “coffee”, while the second ignores case and also
matches “Coffee”. Let’s take a look at how the code works:

1. The first five lines are exactly the same as what we had earlier: import the re module,
define variables, and read the text from the input file.
...
6. This line is just a comment to make the code more readable. (You may argue that the
so-called “readability” in the accounting literature is really a misnomer.)
7. The line Pattern = r'coffee' defines the pattern we want to use. Here we simply
spell out the keyword “coffee”. Note that here the code does not care where the word
is located; it treats the keyword not as a word but as a sequence of six characters.
8. The line lstResult = re.findall(Pattern, Text) uses the findall() function to scan
the text to find matches.
9. Count = len(lstResult) takes the length of the result list.
10. Nothing special here: this line prints out the result in the terminal.
11. The next few lines repeat the above procedure, with only one difference.
...
13. In this line, we give one more input flags=re.I to the re.findall() function. re.I
means “ignore case”. This is necessary because pattern matching in Python is case-sensitive by
default. flags is the name of the input. Naming it explicitly is a good habit: in some other
functions of the re module, such as re.split(), the third positional input means something
else (maxsplit), so an unnamed flag there would silently give a different result, which is
often not what you wanted.
....

Let’s look at another example. Here we want to know how many times Starbucks used the
word “market”. So we can simply replace “coffee” with “market” in the previous example.
Here is the complete code:

Listing 6.9: search-keyword-2.py


 1 import re
 2 Text = ''
 3 Count = 0
 4 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
 5     Text = Fr.read()
 6 ### a
 7 Pattern = r'market'
 8 lstResult = re.findall(Pattern, Text, flags=re.I)
 9 Count = len(lstResult)
10 print('market (a):', Count, lstResult)

The code will give the following result:

market (a): 3 ['market', 'market', 'Market']

Everything appears to be correct: we found three matches here. But if you examine the
source report, the word “market” (or “markets”) appears only twice. The other match is part of
“marketer”, which is certainly not what we are looking for. So we have to revise the code to
exclude such mismatches. Other potential mismatches include:

• marketing
• supermarket

So basically, we are looking for only two possible variants of the word: “market” and
“markets”, and they must be whole words, not part of anything else. To accomplish this objective,
we can modify the code as follows:

Listing 6.10: search-keyword-3.py


 1 import re
 2 Text = ''
 3 Count = 0
 4 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
 5     Text = Fr.read()
 6 ### a
 7 Pattern = r'\bmarkets?\b'
 8 lstResult = re.findall(Pattern, Text, flags=re.I)
 9 Count = len(lstResult)
10 print('market (a):', Count, lstResult)

The code will give the following result, which is what we want:

market (a): 2 ['markets', 'Market']

The code is almost identical to the previous one, with only one difference in how the
pattern is specified. The pattern r'\bmarkets?\b' works as follows:

• The first \b marks a “word boundary” before the search string. A word boundary is the
position between a word character (a letter, digit, or underscore) and anything else,
including the start or end of the text. So this will exclude something like
“supermarket”.
• The question mark ? means zero or one. So s? means to match one letter “s” or none.
This will match “market” and “markets”.
• The other \b also marks a “word boundary”, but it’s after the search string. So this will
exclude something like “marketing”.
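
The effect of the word boundaries can be checked on a short sentence. The sketch below is only an illustration: the sample string is made up for this purpose and is not taken from the Starbucks report.

```python
import re

# Hypothetical sample text (not from the 10-K report)
Sample = ('The Market is big. Supermarket chains, marketing teams '
          'and marketers serve many markets.')

# Without word boundaries: also matches "market" inside longer words
lstLoose = re.findall(r'market', Sample, flags=re.I)

# With word boundaries: only the whole words "market" and "markets"
lstStrict = re.findall(r'\bmarkets?\b', Sample, flags=re.I)

print('loose :', len(lstLoose), lstLoose)
print('strict:', len(lstStrict), lstStrict)
```

The loose pattern finds five matches, while the strict pattern keeps only “Market” and “markets”.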

What if we want to search for words that match either “coffee” or “market” in the same
search? We can modify the code as follows:

Listing 6.11: search-keyword-4.py


 1 import re
 2 Text = ''
 3 Count = 0
 4 with open('starbucks-10k-one-paragraph.txt', 'r') as Fr:
 5     Text = Fr.read()
 6 ### b
 7 Pattern = r'coffee|\bmarkets?\b'
 8 lstResult = re.findall(Pattern, Text, flags=re.I)
 9 Count = len(lstResult)
10 print('market or coffee:', Count, lstResult)

The code will give the following result, which is what we want:

market or coffee: 5 ['coffee', 'markets', 'Market', 'coffee', 'Coffee']

Note that the code is still not perfect. Consider the following:

• The result “Market” is probably not what we want, because it is in a label used by
NASDAQ; it has nothing to do with the general meaning of “market” in the sense of
market competition. You can revise the code to exclude any words with a capital “M”,
but this will cause another problem: what if the word is at the beginning of a sentence?
• The same problem applies to “Coffee”, which comes from an organization’s name. It does
not refer to any products related to coffee.
• If we search for “coffee”, do we have to include “cafe” as well? But if we include “cafe”,
how do we deal with the same term meaning a coffee shop?
Chapter 7

Data Structure: Dictionary

When we do textual accounting research, it is often the case that we have to process many
reports, search for keywords, and then save the result in a structured data format such as CSV
(comma-separated values). Take the searches we did in the previous chapter as an example. In
the end, we should have a CSV file like the following:

FileName,Coffee,Market
starbucks-10k-one-paragraph.txt,3,2

In the CSV file, the first row is column heading (or “variables” in statistical software such
as SAS) and the rest is actual data.
So in this chapter we will take a look at how to save search results in a CSV file. To do
this, we need a new data structure: dictionary.


7.1 Dictionary

The dictionary data type in Python is like a real dictionary: a list of words and their meanings.
In programming terminology, it’s a list of key and value pairs. Below is an example
that looks like a real dictionary:

Listing 7.1: dictionary.py


1 dctTar = {'accounting': 'n. a continuous act of counting sheep.',
2           'research': 'v. repeatedly search for nothing.'
3          }
4 print('Accounting:', dctTar.get('accounting'))

Running the above code will show the definition of “accounting”:

Accounting: n. a continuous act of counting sheep.

Here is how the code works:

1. The first three lines define a dictionary variable called dctTar. The variable definition
can be put in one line. For the purpose of readability, it’s better to break it into multiple
ones (think again about the term “readability” in accounting research). But be careful
with breaking and padding your code with white spaces, which sometimes may cause
problems.

The dictionary is enclosed in curly brackets {} . Within the curly brackets is a list of
key: value pairs, which are separated by commas , . Within each pair, the key and
the value are separated by a colon : . In the dictionary, we have two words (keys) and
their definitions (values).
...
4. The last line prints out the definition of the word “accounting” in terminal. Here we
use a method of the dictionary dctTar.get() to “get” the value of the key ( 'accounting' ).
The method requires one input (key).

Some requirements for defining a dictionary:

• All keys must be unique.
• Keys can be numbers or strings in the same dictionary. But it is often a good idea to
make sure that all keys are of the same data type. The same applies to values.
• String keys are case-sensitive. So “Accounting” is not the same as “accounting”.

Let’s look at another example:

Listing 7.2: dictionary-2.py


1 dctTar = {'accounting': 'n. a continuous act of counting sheep.',
2           'research': 'v. repeatedly search for nothing.'
3          }
4 dctTar['Accounting'] = 'n. creating t-accounts.'
5 dctTar[8] = 'n. a lucky number.'
6 print('dctTar:', dctTar)
7 print('accounting:', dctTar.get('accounting'))
8 print('Accounting:', dctTar.get('Accounting'))
9 print('8:', dctTar.get(8))

Some explanation about the code:

1. The first three lines define a dictionary variable called dctTar (same as the previous
code).
...
4. In the line dctTar['Accounting'] = 'n. creating t-accounts.' we try to change the definition
of “accounting”. We use the square brackets to reference any key in the dictionary.
So dctTar['Accounting'] points to the key 'Accounting'. But in the dictionary
there is no such key, because Python is case-sensitive. So the system will add a new
key instead and assign the string 'n. creating t-accounts.' as its value.
5. The line dctTar[8] = 'n. a lucky number.' defines the number 8 in the dictionary.
Because there is no such key in the existing dictionary, a new one will be added. Note
that the key 8 here is an integer, not a string as '8'.
6. The next four lines print out the dictionary and some individual values.
...

The above code will generate the following result:

1 dctTar: {'accounting': 'n. a continuous act of counting sheep.', 'research': 'v. repeatedly search for nothing.', 'Accounting': 'n. creating t-accounts.', 8: 'n. a lucky number.'}
2 accounting: n. a continuous act of counting sheep.
3 Accounting: n. creating t-accounts.
4 8: n. a lucky number.

As you can see in the above print-out, when you “print” the dictionary variable, ev-
erything is displayed in one line ( 1 ), which is not human-readable (another example of
“readability”). To solve this problem, we can loop through the dictionary and display items
one by one:

Listing 7.3: dictionary-3.py


1 dctTar = {'accounting': 'n. a continuous act of counting sheep.',
2           'research': 'v. repeatedly search for nothing.'
3          }
4 dctTar['Accounting'] = 'n. creating t-accounts.'
5 dctTar['8'] = 'n. a lucky number.'
6 for k, v in dctTar.items():
7     print(k.ljust(20), ':', v)

Here is how the code works:

1. The first five lines define a dictionary variable called dctTar and add two more key/value
pairs (same as the previous code).
...
5. In this line we changed the key from an integer 8 to its string form '8' to make the
data type consistent with other keys.
6. In the line for k, v in dctTar.items() we get the key/value pairs by using the
method called items() and then loop through them. Because each item is a
key/value pair, we have to use two temporary variables to designate the key and
the corresponding value of each pair. Note the variables k and v are separated by a
comma.
7. The last indented line prints out each pair. Here we use the ljust() method to pad
white spaces to the right of a string and make the print-out 20 characters wide.

The code will generate the following result:

accounting           : n. a continuous act of counting sheep.
research             : v. repeatedly search for nothing.
Accounting           : n. creating t-accounts.
8                    : n. a lucky number.
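
The requirements listed earlier in this section can be tried out in a few lines. The sketch below uses a made-up dictionary (the name dctDemo and its entries are only for illustration) to show that assigning to an existing key replaces its value, that keys differing only in case are separate keys, and that get() returns None for a key that does not exist.

```python
# Hypothetical dictionary for illustration
dctDemo = {'asset': 'n. something owned.'}

# Assigning to an existing key replaces its value (keys stay unique)
dctDemo['asset'] = 'n. a resource controlled by the entity.'

# 'Asset' is not the same key as 'asset', so this adds a second key
dctDemo['Asset'] = 'n. same word, different key.'

print(len(dctDemo))               # number of key/value pairs
print(dctDemo.get('asset'))
print(dctDemo.get('liability'))   # missing key: get() returns None
```

Returning None for a missing key is the main difference from the square-bracket form: dctDemo['liability'] would stop the program with an error instead.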

7.2 Saving Data in CSV

Now we have the tool to save search results in the CSV format. To do this, we will use a Python
module called csv, which has some standard methods for writing structured data.
Let’s look at an example by using the one-paragraph 10-K report from the previous chapter:

Listing 7.4: search-save-csv.py


 1 import re
 2 import csv
 3 SourceFiles = ['starbucks-10k-one-paragraph.txt',
 4                'starbucks-10k-one-paragraph.txt',
 5                'starbucks-10k-one-paragraph.txt']
 6 SaveFile = 'search-result.csv'
 7 VarList = ['FILENAME', 'ALLWORDS', 'COFFEEWORDS', 'MARKETWORDS']
 8 VarValues = dict()
 9 for Var in VarList:
10     VarValues[Var] = -1
11 with open(SaveFile, 'w', newline='') as Sf:
12     Writer = csv.DictWriter(Sf, fieldnames=VarList, lineterminator='\n')
13     Writer.writeheader()
14     for File in SourceFiles:
15         VarValues['FILENAME'] = File
16         with open(File, 'r') as Fr:
17             Text = Fr.read()
18         VarValues['ALLWORDS'] = len(Text.split())
19         VarValues['COFFEEWORDS'] = len(re.findall(r'\bcoffee\b', Text, flags=re.I))
20         VarValues['MARKETWORDS'] = len(re.findall(r'\bmarkets?\b', Text, flags=re.I))
21         Writer.writerow(VarValues)

As you can see, we are having longer and longer code over time. Let’s go through the
code line by line:

1. The first line import re “imports” the regular expression module re.
2. The second line import csv imports the csv module for handling CSV data.
3. The next three lines define a list of files (i.e., 10-K reports) to process. Here we repeat
the same file three times, pretending that they are different. The purpose is to show
how to loop through many files and count words for each of them one by one.
...
6. The line SaveFile = 'search-result.csv' specifies a CSV file to save.
7. This line defines a list of variables to be used in the CSV file. (“Variables” are also
called “fields” in CSV.)
8. The line VarValues = dict() defines an empty dictionary to hold values for the vari-
ables in the previous line.
9. The next two lines loop through the variable list and add each variable to the dictionary
with a value of -1. (If there is any value of -1 in the result, it means something is
wrong.)
...
11. Now comes the processing part. We first open the CSV file in the w (write) mode. The
option newline='' is used to handle different specifications across different operating
systems. But we don’t need to worry about the technical details here.
12. The indented line Writer = csv.DictWriter(Sf, fieldnames = VarList, lineterminator='\n')
creates an object (which we call Writer) by using the class called DictWriter (from the
csv module). This class takes three inputs (separated by commas):

• The first one Sf is the CSV file we just opened. So this means the Writer object
will write data to this file.
• The second input fieldnames = VarList specifies the field names to be used in the
CSV file. Here we take field names from the variable list we defined earlier.
• The third input lineterminator='\n' makes each row end with a single newline character
instead of the csv module’s default '\r\n', which helps avoid stray empty rows in
the CSV file.
13. The line Writer.writeheader() uses the method writeheader() of the Writer object
to write the first line of the CSV file as column headers. There is no input required,
because we have already specified where to get the field names in the previous line.
14. The line for File in SourceFiles: loops through the list of source files we want to
process. So for each file, called File , we will open, search, and write a data line in the
CSV file.
15. First, the line VarValues['FILENAME'] = File assigns File (which is a string) to the
FILENAME key in the dictionary called VarValues .
16. Now we open the file in the r (read) mode.
17. The line Text = Fr.read() takes the string content of the file using the read() method
and assigns it to the variable Text. Note that we didn’t define this variable earlier. So
here we are creating the variable on the go. But this is not good programming practice.
18. The next three lines count all words and search for the two keywords (coffee and market)
and assign the results to different keys in the dictionary. Note that here we use
condensed forms of the searches.
...
21. Lastly, we write a data row (for each source file) in the CSV file. Here we use the
writerow() method of the Writer object. The method takes a dictionary as input. It
will match the keys of the dictionary with the field names we wrote in the first row,
and then write the values of those keys accordingly.

If you run the code, a file called “search-result.csv” will be created in the same folder
where the code file is located (assuming you run the code from that folder). Below is an example:

Listing 7.5: search-result.csv


FILENAME,ALLWORDS,COFFEEWORDS,MARKETWORDS
starbucks-10k-one-paragraph.txt,75,3,2
starbucks-10k-one-paragraph.txt,75,3,2
starbucks-10k-one-paragraph.txt,75,3,2

It will look like a table if you open it in LibreOffice Calc or Microsoft Excel.
(Note that here we searched the same file three times, so the data lines look exactly
the same. Such duplicate rows are problematic if you import the file into a relational database system.)
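
One way to check a saved CSV file is to read it back with csv.DictReader, which turns each data row into a dictionary keyed by the column headers. The sketch below is self-contained: it writes to an in-memory buffer (io.StringIO) instead of a real file, and the file names and counts are made up for illustration.

```python
import csv
import io

VarList = ['FILENAME', 'COFFEEWORDS', 'MARKETWORDS']
# Hypothetical rows for illustration
Rows = [{'FILENAME': 'a.txt', 'COFFEEWORDS': 3, 'MARKETWORDS': 2},
        {'FILENAME': 'b.txt', 'COFFEEWORDS': 0, 'MARKETWORDS': 5}]

# Write to an in-memory buffer that behaves like an open file
Buffer = io.StringIO()
Writer = csv.DictWriter(Buffer, fieldnames=VarList, lineterminator='\n')
Writer.writeheader()
for Row in Rows:
    Writer.writerow(Row)

# Go back to the beginning and read the same data back
Buffer.seek(0)
lstRead = list(csv.DictReader(Buffer))
for Row in lstRead:
    print(Row['FILENAME'], Row['COFFEEWORDS'], Row['MARKETWORDS'])
```

Note that every value read back from a CSV file is a string, so a count such as '3' has to be converted with int() before doing any arithmetic.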
Chapter 8

Functions

So far we have used some system functions and module methods to do things such as building
a 9x9 multiplication table and searching for keywords. Functions and methods can be used
again and again to perform the same tasks. In many cases, we may need to build our own
functions for at least two reasons:

• to reuse code (so that we don’t need to write the same code again in different places);
and
• to make code more structured and, therefore, more readable (think about “readability”
again?)

In addition, it is often the case that we perform different tasks in different situations.
Sometimes the tasks may have only minor differences. So functions will make our lives
easier. In Python, different situations are treated as “conditions”. So we will look at how to
handle conditions first and then how to write and use functions.


8.1 Condition Basics

An example of performing different tasks/actions in different situations is letter grading in
student performance evaluation: converting points or percentages into letter grades, e.g.:

• If percentage is below 60, give Fail; and
• If percentage is 60 or above, give Pass.

To put the above conditions and actions in Python code, we will use the if...else...
statement. Below is an example:

Listing 8.1: condition.py


1 Grade = 88
2 if Grade < 60:
3     print('Grade:', Grade, '= Fail')
4 else:
5     print('Grade:', Grade, '= Pass')

Here is how the code works:

1. Define a variable and assign an integer value.


2. The second line if Grade < 60: specifies the condition by using the if . The condition
here is just like any math conditions using math comparison operators. The line ends
with a colon : , which has the meaning of “then do the following”.
Additional comparison operators can be found in Table 8.1.

Table 8.1: Comparison Operators

Operator   Meaning                    Example
==         Equal to                   x == 0
!=         Not equal to               x != 0
>          Greater than               x > 0
<          Less than                  x < 0
>=         Greater than or equal to   x >= 0
<=         Less than or equal to      x <= 0

Technically, the comparison Grade < 60 will return a value True or False. These two
values are case-sensitive and are equivalent to integers 1 and 0, respectively.
3. The indented line prints out something if the condition is met.
4. The else: specifies the opposite (i.e., what to do if the condition Grade < 60 is not met). Note the
colon : here; it’s required to specify what action to take.
5. The indented line prints out something “else” if the condition Grade < 60 is not met.

If you run the above code in a terminal, you will get the following result:

Grade: 88 = Pass

Note that the else block is not syntactically required. If you remove the block, the above
example will not print out anything. In other words, if the condition Grade < 60 is not met,
do nothing.
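
Both points (a comparison returns True or False, and an if without else does nothing when the condition is not met) can be seen in a minimal sketch. The variable name IsFail is made up for illustration.

```python
Grade = 88
IsFail = Grade < 60      # the comparison itself produces True or False
print(IsFail)            # False
print(IsFail == 0)       # True, because False is equivalent to 0
print(True + True)       # 2, because True is equivalent to 1

# An "if" without "else": nothing is printed when the condition is not met
if IsFail:
    print('Grade:', Grade, '= Fail')
```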

8.2 Complex Conditions

Sometimes a condition may involve multiple comparisons. In this case we have to write
(somewhat) complex conditions by using logical operators:

Table 8.2: Logical Operators

Operator   Meaning                                    Example
and        Returns True if both conditions are true   (x > 0) and (x < 2)
or         Returns True if either condition is true   (x > 0) or (x < 2)
not        Returns the opposite                       not ((x > 0) and (x < 2))

Note that the parentheses are not always required. But it’s always a good idea to use
them to avoid any confusion about how multiple conditions are evaluated one by one in the
correct order.
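
Without parentheses, Python evaluates not first, then and, then or. The sketch below, with made-up values (the variable Bonus is hypothetical), shows how the same three comparisons give different answers depending on where the parentheses are placed.

```python
Grade = 88
Exam = 50
Bonus = 10   # hypothetical extra variable for illustration

# Without parentheses, "and" is evaluated before "or", so this is the same as
# (Grade >= 60) or ((Exam >= 60) and (Bonus >= 20))
a = Grade >= 60 or Exam >= 60 and Bonus >= 20

# Parentheses force the "or" to be evaluated first
b = (Grade >= 60 or Exam >= 60) and Bonus >= 20

print(a, b)
```

Here a is True but b is False, which is why spelling out the grouping with parentheses is the safer habit.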
The code below shows an example of how to write multiple conditions:

Listing 8.2: condition-2.py


1 Grade = 88
2 Exam = 50
3 if Grade >= 60 and Exam >= 60:
4     print('Grade', Grade, 'Exam', Exam, '= P')
5 else:
6     print('Grade', Grade, 'Exam', Exam, '= F')

The above example assumes that students can only pass the course with a minimum
total grade of 60 and a minimum exam score of 60. In other words, they have to meet both
conditions to pass and everything else is considered a fail.
The above code will generate the following result:

Grade 88 Exam 50 = F

8.3 Multiple Conditions

In many cases we may have more than two conditions. An example is the following grading
scheme:

• If percentage is below 60, give F;
• If percentage is at least 60 but below 80, give C;
• If percentage is at least 80 but below 90, give B; and
• If percentage is 90 or above, give A.

To put all these conditions in Python code we can use elif (which means “else if”):

Listing 8.3: condition-3.py


1 Grade = 88
2 if Grade < 60:
3     print('Grade', Grade, '= F')
4 elif 60 <= Grade < 80:
5     print('Grade', Grade, '= C')
6 elif 80 <= Grade < 90:
7     print('Grade', Grade, '= B')
8 else:
9     print('Grade', Grade, '= A')

Here is how the code works:

...
4. The next three lines of elif and else further break down those cases that do not
meet the first condition ( Grade < 60 ).

Note that the condition 60 <= Grade < 80 is a compressed version of (60 <= Grade) and (Grade < 80) .
...
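
The equivalence between the compressed condition and the spelled-out one can be checked directly. A minimal sketch, using a few made-up grades around the cut-off points:

```python
lstCheck = []
for Grade in [59, 60, 79, 80]:
    Chained = 60 <= Grade < 80
    Explicit = (60 <= Grade) and (Grade < 80)
    lstCheck.append((Grade, Chained, Explicit))
    print(Grade, Chained, Explicit)
```

The two forms always agree. Also note that an elif is only reached when the conditions before it were not met, so in the listing above elif Grade < 80: alone would give the same result; the longer form simply makes the range explicit.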

If you run the above code in a terminal, you will get the following result:

Grade 88 = B

But as you can see, the above programs are not optimal if we have many grades to check.
This is when functions can help.

8.4 Defining and Using Functions

By now we have used many system functions or methods. For example:

• str(n) converts the number n into a string.
• re.findall(Ptn, Text) finds all strings in Text that match the pattern Ptn.
• print(s) displays the string s in a terminal.
• Fr.read() gets the string content of the file object called Fr.

These functions (actually all functions) require zero, one, or more inputs, which are also
called arguments or parameters in the computer programming universe. Parentheses are
always required after the function name.
The syntax for defining a function is like the following:

1 def function_name(Input1, Input2):
2     ...
3     return Something

1. In the first line, the keyword def starts the definition. function_name is any name you
want to give and it’s immediately followed by the left parenthesis (note that there is no
white space in between). In the parentheses are required inputs. As usual, the colon
specifies the beginning of the function’s scope.
2. ... Any code for the function will be included here, indented.
3. The keyword return ends the function by “returning” something. The Something here
can be nothing ( None in Python) or any valid data type. If there is nothing to return,
you can simply use return instead of return Something .

If there is nothing to return, the keyword is not necessary. But it is always a good idea
to use the keyword to explicitly end the function, so that you know where the function
ends (another readability issue).
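
The difference between returning nothing and returning a value can be seen in a minimal sketch. The two function names below are made up for illustration.

```python
def show_total(a, b):
    print('Total:', a + b)
    return              # nothing to return; the function just ends here

def get_total(a, b):
    return a + b        # hands the sum back to the caller

r1 = show_total(1, 2)   # prints the total, but r1 receives None
r2 = get_total(1, 2)    # prints nothing, but r2 receives 3
print(r1, r2)
```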

Let’s look at one example. Here we want to define a function to calculate the total of all
integers between two numbers:

Listing 8.4: function.py


1 def sum_up(m, n):
2     rngNum = range(m, n+1)
3     t = sum(rngNum)
4     return t
5 #####
6 print('1+...+10=', sum_up(1, 10))
7 print('1+...+100=', sum_up(1, 100))

Here is how the code works:



1. The first line def sum_up(m, n): starts the definition of a function called sum_up() ,
which requires two inputs in the parentheses. Here we will use m and n as the be-
ginning and ending, respectively, of the range. The colon indicates the beginning of
the scope.
2. The second line rngNum = range(m, n+1) creates a new variable, whose value is given
by the system function called range() . The range() function here uses two inputs:
• The first input m is the beginning of the range. So the range includes m .
• The second input n+1 specifies the ending of the range. Note that the range excludes
n+1 but includes n. (The range() can have another input: step. Here we
don’t have it. So the range will start from m and increase by one at each step,
stopping at n, just before it reaches n+1.)
3. The line t = sum(rngNum) uses the system function sum() to get the total of all numbers
in the range.
4. Finally, the function returns the value of t.
5. Those lines after the comment # show how the function is used.
6. The first use calculates the total of the numbers from 1 to 10.
7. The second use calculates the total of the numbers from 1 to 100.

If you run the above code, the following result will appear:

1+...+10= 55
1+...+100= 5050

(Are you sure the totals are correct?)
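
One way to answer this question is the closed-form formula for an arithmetic series: the integers from m to n add up to (m + n) × (n − m + 1) / 2. The sketch below compares the formula with the built-in sum(); the helper name sum_formula is made up for illustration.

```python
def sum_formula(m, n):
    # (first + last) * number of terms / 2; // keeps the result an integer
    return (m + n) * (n - m + 1) // 2

for m, n in [(1, 10), (1, 100), (5, 8)]:
    print(m, n, sum(range(m, n + 1)), sum_formula(m, n))
```

The last two numbers on each printed line are the same, which confirms the totals 55 and 5050.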


Now let’s define a function to convert numerical grades into letter ones. Below is an
example:

Listing 8.5: function-2.py


 1 def get_letter(Grade):
 2     if Grade < 60:
 3         return 'F'
 4     elif 60 <= Grade < 80:
 5         return 'C'
 6     elif 80 <= Grade < 90:
 7         return 'B'
 8     else:
 9         return 'A'
10 #####
11 Grades = [59, 60, 79, 80, 89, 90]
12 for Grade in Grades:
13     print(Grade, '=', get_letter(Grade))

The code will result in the following output:

59 = F
60 = C
79 = C
80 = B
89 = B
90 = A

Can you explain how the code works? Note that there are multiple return statements. Each return
effectively ends the execution of the function, so any code lines after it are ignored.
Chapter 9

Display Progress

When doing textual analysis of 10-K reports, we often have to process many of them one
by one, and each of them may take quite some time depending on the complexity and length
of the reports and how good our Python programs are. So it is a good idea to show the
progress when we run the programs. Otherwise, we may have no clue what is happening if
the programs are taking too long and have seemingly stopped.
So in this chapter we will look at another way to loop through lists and then write code to
keep track of time and show progress.


9.1 Enumerating Lists

When we have a list and want to loop through its items one by one, we can use the for
statement such as for x in lstx . This may not be enough if we need another piece of infor-
mation: the position of an item in the list. It is also called index or count in Python.
Lists are indexed from zero: the first item is indexed as 0 , second item 1 , and so on.
Such index is often referred to as zero-based and is used in most computer programming
languages.
To get the indices and items at the same time, we can use the system function enumerate().
It requires a list (or other types of iterables) as input and returns, one at a time, pairs in the
form of tuples (similar to lists), each containing the index and the item.
Here is an example:

Listing 9.1: enumerate-1.py


1 Weekdays = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
2 for i, d in enumerate(Weekdays):
3     print(i, d)

1. The first line defines a list of weekdays Weekdays.
2. The second line enumerates and loops through the list by using the for...in statement:
• i, d assigns temporary variables to the index and value of each item. The first
variable i is the index and the second d is the item value, separated by the comma , .
• enumerate() is the function to get both index and item.
• As usual, the for...in... statement requires a colon : to mark the beginning
of the scope.
3. The indented line prints out index and item.

The code will result in the following output:

0 Sun
1 Mon
2 Tue
3 Wed
4 Thu
5 Fri
6 Sat

(Zero-based indexing does make sense here in the weekday example. Monday is Day 1,
Saturday is Day 6. This is much clearer in Chinese: Monday is 星期一, literally Weekday 1;
Saturday is 星期六, literally Weekday 6. But there is no equivalent to Weekday 7. Sunday
is 星期日, literally Weekday Sun. But be careful: some systems internally assign Day 0 as
Monday and Day 6 as Sunday.)
We can also show one-based counting in the form of “one of seven”:

Listing 9.2: enumerate-2.py


1 Weekdays = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
2 t = len(Weekdays)
3 for i, d in enumerate(Weekdays):
4     print(i, d, 'is Day', i+1, 'of', t, 'days')

The code will give the following output:

0 Sun is Day 1 of 7 days
1 Mon is Day 2 of 7 days
2 Tue is Day 3 of 7 days
3 Wed is Day 4 of 7 days
4 Thu is Day 5 of 7 days
5 Fri is Day 6 of 7 days
6 Sat is Day 7 of 7 days
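
enumerate() also accepts an optional second input, start, which sets the number given to the first item. A minimal sketch of the same one-based counting without writing i+1 (the names lstLines and Line are made up for illustration):

```python
Weekdays = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
t = len(Weekdays)
lstLines = []
# start=1 makes the counting begin at 1 instead of 0
for n, d in enumerate(Weekdays, start=1):
    Line = d + ' is Day ' + str(n) + ' of ' + str(t) + ' days'
    lstLines.append(Line)
    print(Line)
```

Keep in mind that n here is a count, not an index: Weekdays[n] would point to the next item, not the current one.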

We can also use two lists to build a translation table:

Listing 9.3: enumerate-3.py


1 Weekdays = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
2 WeekdaysCn = ['\u65e5', '\u4e00', '\u4e8c', '\u4e09', '\u56db', '\u4e94', '\u516d']
3 for i, d in enumerate(Weekdays):
4     print(i, d, '\u661f\u671f', WeekdaysCn[i])

1. The first line defines a list of weekdays.


2. The second line is a list of weekdays in Chinese, expressed in Unicode code points.
3. The for...in... line enumerates and loops through the first list (weekdays in En-
glish).
4. The last indented line prints out something.
• The third input in Unicode means “week” in Chinese.
• The fourth input WeekdaysCn[i] uses the index i to get the corresponding item
in the Chinese version of weekdays.

The code will give the following output (if your system supports Unicode):

0 Sun 星期 日
1 Mon 星期 一
2 Tue 星期 二
3 Wed 星期 三
4 Thu 星期 四
5 Fri 星期 五
6 Sat 星期 六
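
When two lists line up item by item, the built-in function zip() is an alternative to indexing the second list with i. The sketch below pairs the two lists directly (the name lstPairs is made up for illustration); note that zip() stops at the end of the shorter list.

```python
Weekdays = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
WeekdaysCn = ['\u65e5', '\u4e00', '\u4e8c', '\u4e09', '\u56db', '\u4e94', '\u516d']

# Pair up the items of the two lists position by position
lstPairs = list(zip(Weekdays, WeekdaysCn))
for i, (d, c) in enumerate(lstPairs):
    print(i, d, '\u661f\u671f' + c)
```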

9.2 Date/Time

Date/time in any computer system can be very confusing because of things like time zones
and daylight saving time. There may be multiple local times in the same country. For example, at the
same moment there are four different local times across the contiguous U.S. and up to five in
Australia when daylight saving is in effect.
Different operating systems and software also handle date/time differently, causing trouble
if you transfer data from one system to another. For example, you may get erroneous date
information when you transfer data from Microsoft Excel format to SAS if any date column
is not formatted properly.
The good thing here is that, when we want to keep track of progress, we only need to handle
the elapsed time between two date/time points; we don’t need to worry about the exact
date/time of any specific point. Elapsed time has nothing to do with time zones and daylight
saving time. In Python, elapsed time is called timedelta.
Some information about how date/time is handled in Python:

• The base unit of measurement is second .


• The epoch time (where the time starts) in Python is January 1, 1970, 00:00:00 (UTC).
(Microsoft Excel and SAS have different epoch times.)
• Time is kept in seconds from the epoch time. There is no such thing as day or hour, which
are converted from seconds.
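
The epoch can be seen by converting zero seconds into a date/time. The sketch below uses datetime.fromtimestamp() with the UTC time zone, so that the result does not depend on the local settings of your computer. The variable names are made up for illustration.

```python
from datetime import datetime, timezone

# Zero seconds after the epoch is the epoch itself
Epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(Epoch)

# One day is 24 * 60 * 60 seconds after the epoch
OneDay = datetime.fromtimestamp(24 * 60 * 60, tz=timezone.utc)
print(OneDay)
```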

Let’s look at an example:

Listing 9.4: date-time-1.py


1 from datetime import datetime, timedelta
2 StartTime = datetime.now()
3 EndTime = datetime.now()
4 TimeUsed = EndTime - StartTime
5 print('Time used:', str(TimeUsed).rjust(20, ' '))

(Do NOT name your Python file “datetime.py” if you want to try out this code. It
causes problems because the built-in module in Python is called “datetime” and is stored as
“datetime.py”.)

1. The first line imports two classes datetime and timedelta from the module called
datetime . Don’t be confused by the two datetime :

• from datetime specifies the module “from” which we are going to import some-
thing.
• import datetime specifies the class called datetime that we are importing. So
throughout the rest of the code the term datetime refers to this class, not the
module.
• We have two classes to import. So we need a comma , to separate them.

2. The second line uses the now() method of the datetime class to get the date/time of
the very moment this line is executed when you run the code. The method returns a
datetime object (which is not the same as seconds, minutes, or hours). We take this
date/time as the start time.
3. This line is similar to the previous one. We use it as the end time.
4. The line TimeUsed = EndTime - StartTime calculates the elapsed time between the two
points.
5. The last line prints out the elapsed time with proper format. Note that the variable
TimeUsed here is a timedelta object, which is not a string. So we have to use the str()
function to convert it into a string, and then use the rjust() to pad it with white spaces.

If you run the above, you will get something similar to the following:

Time used : 0:00:00.000003

It takes a tiny fraction of a second to run the code.
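If you need the elapsed time as a plain number (for example, to work out an average time per file), a timedelta object can be converted to seconds with its total_seconds() method. The following is only a minimal sketch: the helper name elapsed_seconds and the fixed example dates are made up for illustration so that the result is reproducible.

```python
from datetime import datetime, timedelta

def elapsed_seconds(start, end):
    """Return the elapsed time between two datetime objects in seconds."""
    delta = end - start           # subtracting two datetimes gives a timedelta
    return delta.total_seconds()  # a float, including the microsecond part

# Fixed example values (hypothetical), so the result does not depend on the clock
start = datetime(2021, 10, 16, 9, 0, 0)
end = datetime(2021, 10, 16, 9, 1, 30)

print(end - start)                    # 0:01:30
print(elapsed_seconds(start, end))    # 90.0
# A timedelta can also be built directly and added to a datetime
print(start + timedelta(minutes=90))  # 2021-10-16 10:30:00
```

In a real program you would replace the two fixed values with calls to datetime.now(), as in Listing 9.4.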


9.3 Monitoring Progress

Now we have enough tools to show the progress of code execution. Below is an example:

Listing 9.5: progress.py


1 from datetime import datetime, timedelta
2 from time import sleep
3 ###
4 def print_progress(lbl, ith, cnt, tim):
5     lblFmt = str(lbl).ljust(15, ' ')
6     ithFmt = str(ith).rjust(5, ' ')
7     cntFmt = str(cnt).rjust(5, ' ')
8     timFmt = str(tim).rjust(20, ' ')
9     print('\r', lblFmt, ':', ithFmt, 'of', cntFmt, timFmt, end='', flush=True)
10 ###
11 StartTime = datetime.now()
12 EndTime = datetime.now()
13 Tics = ['IBM', 'MSFT', 'BABA', 'MMM', 'ABB', 'ACN', 'ACE', 'AXP']
14 n = len(Tics)
15 for i, tic in enumerate(Tics):
16     EndTime = datetime.now()
17     t = EndTime - StartTime
18     print_progress(tic, i+1, n, t)
19     sleep(0.5)
20 print(' ')

If you run the above, you will see something like the following in the terminal:

AXP : 8 of 8 0:00:03.504740

(Note that here it is the static end result. You should see a slow-motion timer in the actual
terminal.)
Let’s look at the code in more detail:

1. The first line imports two classes datetime and timedelta from the module called
datetime .

2. This line imports the sleep() function from the time module. This is used to pause
the execution of the code and hence slow it down. It’s included in this program for
the purpose of showing the progress. You should not use it in your real accounting
research projects.
3. This is just a comment line to make the code more readable.
4. This line starts the definition of a function for displaying (printing) the progress in the
terminal. It takes four inputs:
• lbl is used to show a label.
• ith is entered as the ith item in a list.
• cnt is entered as the total count of items in the list.
• tim is entered as the time elapsed.
...
9. The print() line displays something in the terminal:

• The '\r' is the carriage-return character. It shows nothing visible; it tells the terminal to
move the cursor back to the very beginning of the current line, so that the next output
overwrites what was there.
• The next few items are displayed one by one.
• The end='' option tells the print command to stay on the current line so that it
does not make a new line. Without this option, the print command always makes
a new line (equivalent to end='\n' ).
• The flush=True option tells print to write its output to the terminal immediately
instead of holding it in a buffer (this is somewhat technical). The real effect is that
the progress line is displayed while the code is executing. Otherwise, the display may
be delayed, possibly until the code execution is finished.
10. Another comment line to make the code more readable.
11. Get the start time StartTime .
12. Get the end time. The code execution has not ended yet. It’s used here to initialize the
variable EndTime to avoid any confusion.
13. Define a list of firms (tickers).
14. Get the total count of the list.
15. Now we loop through the enumerated list by assigning two temporary variables i
and tic , which refer to the index and corresponding ticker.
16. Now we get the end time of the looping of the ith item.
17. Calculate the elapsed time.
18. Use the function we defined to print out the progress by specifying the four inputs:
• tic : we use the ticker as a label, so we know which firm we are processing.
• i+1 : the ith item we are processing. Because the index is zero-base, we use i+1
to get the real counting of “how many so far”.
• n : the total count of the list.
• t : the elapsed time we just calculated.
19. The line sleep(0.5) asks the program to slow down by taking a nap of 0.5 seconds.
It’s included in this program for the purpose of showing the progress. You should
not use it in your real accounting research projects.
20. The last line (unindented) prints nothing visible but moves the cursor to a new line.
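One possible variation, sketched here under our own assumptions, is to build the progress line as a string first and print it afterwards. This keeps the formatting separate from the printing, which makes the line easy to check or to write to a log file. The function name format_progress is hypothetical and is not part of Listing 9.5.

```python
from datetime import timedelta

def format_progress(lbl, ith, cnt, tim):
    """Build the progress line as a string instead of printing it."""
    parts = [
        str(lbl).ljust(15, ' '),   # label, left-aligned in 15 characters
        ':',
        str(ith).rjust(5, ' '),    # current item number
        'of',
        str(cnt).rjust(5, ' '),    # total number of items
        str(tim).rjust(20, ' '),   # elapsed time
    ]
    return ' '.join(parts)

# A fixed elapsed time (hypothetical) so that the line is reproducible
line = format_progress('AXP', 8, 8, timedelta(seconds=3, microseconds=504740))
# '\r' returns to the start of the line; end='' suppresses the line break
print('\r' + line, end='', flush=True)
print()  # finish the progress line
```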
Chapter 10

Downloading Files from the Internet

In this chapter we will look at how to download files from the internet. You may have heard
fancier terms such as “web scraping”. But what we do in terms of accounting textual analysis
is simply to download some files first. Any extraction or analysis will be done afterwards.
Note that we will only deal with static web pages, not anything that requires JavaScript
(code that runs in the browser to generate the content). If a web page uses JavaScript, then
there is a problem of “what you download is not what you see”. The good news is that all
10-K reports on SEC EDGAR are static web pages.
There is a Python package that is specifically built for web downloading: requests . More
information about this package can be found at:
https://docs.python-requests.org/en/latest/
Let’s look at an example of using requests to download a Wikipedia entry about the
Python programming language:
https://en.wikipedia.org/wiki/Python_(programming_language)

Listing 10.1: download.py


1 import requests
2 url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
3 SaveFile = 'python-wiki.htm'
4 with open(SaveFile, 'wb') as Sf:
5     Sf.write(requests.get(url).content)

Here is how the code works:

1. The first line imports the package called requests .
2. Define the url of the wiki page.
3. Specify the file name to save. It will be located in the same folder as the code file.
4. Open the file in the write mode wb . Here w means write, b means binary. The binary
mode can help avoid any decoding/encoding issues. It may not matter much if the file
we download is all text. In other cases such as images, binary mode is required.
5. The indented line writes something to the opened file:
• requests.get(url) sends a request to the Wikipedia website and gets an object
called Response , which basically includes what the website sends back. The at-
tribute content refers to the content of the Response, in bytes (not text).
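Listing 10.1 saves whatever the website sends back, even if that is an error page. A slightly more defensive version is sketched below. The function name download, the timeout value, and the get parameter are our own assumptions added for illustration; get lets you pass in any function that behaves like requests.get , and requests.get itself is used when it is left out.

```python
def download(url, save_file, get=None, timeout=30):
    """Download url and save the raw bytes to save_file.

    get is an assumption added for illustration: any function that behaves
    like requests.get. When it is None, requests.get is used.
    """
    if get is None:
        import requests           # imported here so the sketch loads without it
        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()   # raise an error for 4xx/5xx status codes
    with open(save_file, 'wb') as sf:
        sf.write(response.content)
    return len(response.content)  # number of bytes saved
```

A typical call would be download(url, 'python-wiki.htm') with the url from Listing 10.1. If the server answers with an error such as 404, raise_for_status() stops the program instead of silently saving the error page.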

If you open the downloaded file python-wiki.htm in a web browser, it is something like
the following (partial):

It’s not exactly what you see on the Wikipedia website:

The cause for the different looks is that some linked files are not downloaded, most no-
tably stylesheet files (.css) and images. (In other cases JavaScript is the other reason.)
Chapter 11

Cleaning HTML Code

In Chapter 4 we tried to build the 9x9 Multiplication Table in the HTML format by using
various tags. When we do textual analysis, however, we have to reverse the process: instead
of adding tags, we have to remove tags to get plain text. This is what we are going to look at
in this chapter.


11.1 Using Regular Expression

The most straightforward way to clean up HTML code (tags) is to use regular expressions
( re ). Let’s look at an example by using the 9x9 Multiplication Table we built in Chapter 4.
For your convenience, the code and rendered page are repeated here:

Figure 11.1: 9x9 Multiplication Table (Code and Rendered Page)

As you can see in the left part of the figure, although any web browser can properly format
and render the page, there is no obvious structure in the HTML code. The whole table is just one
line of code. As we’ve seen in the Starbucks example, the worst scenario is when the whole 10-K
report is just one or a few lines. This creates huge problems for textual analysis. Nevertheless,
let’s give it a try by using regular expressions.
There are two things to do in this example:

1. Get rid of the style section, which is useless when we deal with cleaned text.
2. Get rid of all HTML tags such as <html>, <table> etc.

Listing 11.1: clean-html-1.py


1 import re
2 HtmlFile = '9x9-math-table-5.htm'
3 with open(HtmlFile, 'r') as Fr:
4     Text = Fr.read()
5 Pattern = r'<style.*?/style>'
6 Text = re.sub(Pattern, '', Text, flags=re.I | re.DOTALL)
7 Pattern = r'\<.*?\>'
8 Text = re.sub(Pattern, '', Text, flags=re.I | re.DOTALL)
9 print(Text)

Here is how the code works:

1. Import the regular expression module re .
2. Define a variable to set the name of the HTML file.
3. Open the HTML file in the read mode.
4. Use the read() method of the file object to extract and assign its text content to the
variable Text .
5. Define the pattern for matching the style section of the HTML code. Here the dot .
means any character, * means matching zero or more of the preceding character (i.e.,
the . in this case), and the question mark ? makes the * non-greedy (“lazy”). Greedy or
not matters if there are multiple /style> in the text. Non-greedy means the match will
stop at the first instance of /style> ; greedy (i.e., a * without the question mark),
on the other hand, will run on to the very last instance.
6. Use the re.sub() method to replace anything that matches the pattern with an empty
string (i.e., nothing). Note that we use two flags (options):
• re.I makes it case insensitive (“Ignore case”).
• re.DOTALL (all uppercase) makes the dot . include line breaks \n as well.
Otherwise, the dot matches any character except a line break, so a match cannot
run across more than one line.
• The vertical bar | is the operator meaning “OR” (uppercase). It’s bitwise, not
exactly the same as the lowercase version or . (You don’t need to worry about
the technical details about bitwise operators, unless you aim to study computer
science. This is the only place where we use | in this book.)
7. We define a new pattern (using the same variable name) for matching HTML tags.
Note that angle brackets <> are not special characters in regular expressions, so the
backslash \ in front of each of them is optional; escaping them is harmless.
8. Do the replacement again with the new pattern.
9. Print out the final result in the terminal.
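To see what the question mark and the re.DOTALL flag actually change, here is a small experiment on a made-up HTML string (the string Html below is our own example, not the multiplication table file). It runs the style pattern three ways: greedy, non-greedy, and non-greedy without re.DOTALL .

```python
import re

# A made-up string with two style sections; the second spans several lines
Html = '<style>a{}</style><p>Hello</p>\n<style>\nb{}\n</style><p>World</p>'

# Greedy: .* runs to the LAST /style>, swallowing the first paragraph too
greedy = re.sub(r'<style.*/style>', '', Html, flags=re.I | re.DOTALL)
# Non-greedy: .*? stops at the FIRST /style> after each <style
lazy = re.sub(r'<style.*?/style>', '', Html, flags=re.I | re.DOTALL)
# Without re.DOTALL the dot cannot cross '\n', so the second section survives
no_dotall = re.sub(r'<style.*?/style>', '', Html, flags=re.I)

print(repr(greedy))     # '<p>World</p>'
print(repr(lazy))       # '<p>Hello</p>\n<p>World</p>'
print(repr(no_dotall))  # '<p>Hello</p>\n<style>\nb{}\n</style><p>World</p>'
```

Only the non-greedy pattern with re.DOTALL removes both style sections and nothing else.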

Running the code will generate the following result:

1 *1234567891 1 2 3 4 5 6 7 8 92 2 4 6 8
10 12 14 16 183 3 6 9 12 15 18 21 24 274 4
8 12 16 20 24 28 32 365 5 10 15 20 25 30 35 40
456 6 12 18 24 30 36 42 48 547 7 14 21 28 35
42 49 56 638 8 16 24 32 40 48 56 64 729 9 18
27 36 45 54 63 72 81

As you can see, everything is in the same line and the first few numbers are lumped
together! This is certainly not what we are looking for. To improve the above cleaning
procedure, we can:

1. Replace each </td> with a white space to make sure cells are separated.
2. Replace each </tr> with a line break \n to make sure each row is a separate line.

Listing 11.2: clean-html-2.py


1 import re
2 HtmlFile = '9x9-math-table-5.htm'
3 with open(HtmlFile, 'r') as Fr:
4     Text = Fr.read()
5 Pattern = r'<style.*?/style>'
6 Text = re.sub(Pattern, '', Text, flags=re.I | re.DOTALL)
7 Pattern = r'</tr>'
8 Text = re.sub(Pattern, '\\n', Text, flags=re.I | re.DOTALL)
9 Pattern = r'</td>'
10 Text = re.sub(Pattern, ' ', Text, flags=re.I | re.DOTALL)
11 Pattern = r'\<.*?\>'
12 Text = re.sub(Pattern, '', Text, flags=re.I | re.DOTALL)
13 print(Text)

Note that in the code:

...
8. The replacement string has two backslashes; the first “escapes” the second. Alterna-
tively, you can use r'\n' to make it a “raw” string.
...
10. The replacement string here is a white space, not empty.
...

Running the code will generate the following result (empty lines omitted):

1 * 1 2 3 4 5 6 7 8 9
2 1 1 2 3 4 5 6 7 8 9
3 2 2 4 6 8 10 12 14 16 18
4 3 3 6 9 12 15 18 21 24 27
5 4 4 8 12 16 20 24 28 32 36
6 5 5 10 15 20 25 30 35 40 45
7 6 6 12 18 24 30 36 42 48 54
8 7 7 14 21 28 35 42 49 56 63
9 8 8 16 24 32 40 48 56 64 72
10 9 9 18 27 36 45 54 63 72 81

This one looks much better!
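The same steps can be wrapped in a function and tried on a tiny table typed directly into the code, which is handy for testing before you touch real files. This is only a sketch: the function name table_to_text and the sample table are made up for illustration, and the pattern for cell endings also covers header cells </th> , which is our own assumption (Listing 11.2 handles </td> only).

```python
import re

def table_to_text(html):
    """Turn a simple HTML table into plain text, one row per line."""
    flags = re.I | re.DOTALL
    text = re.sub(r'<style.*?/style>', '', html, flags=flags)  # drop styles
    text = re.sub(r'</tr>', '\n', text, flags=flags)     # row end -> line break
    text = re.sub(r'</t[dh]>', ' ', text, flags=flags)   # cell end -> space
    text = re.sub(r'<.*?>', '', text, flags=flags)       # drop remaining tags
    # Tidy up: strip spaces at both ends of each line and skip empty lines
    return '\n'.join(ln.strip() for ln in text.splitlines() if ln.strip())

# A made-up 2x2 multiplication table
Sample = ('<table><tr><th>*</th><th>1</th><th>2</th></tr>'
          '<tr><td>1</td><td>1</td><td>2</td></tr>'
          '<tr><td>2</td><td>2</td><td>4</td></tr></table>')
print(table_to_text(Sample))
```

Running it prints three lines: the header row `* 1 2` followed by `1 1 2` and `2 2 4`.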


11.2 Using the lxml Package

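For complex web pages, we can use a Python package called lxml , which has some modules
for handling HTML code. (In lxml version 5.2 and later, the lxml.html.clean module used
below is distributed as a separate package, lxml_html_clean , which may need to be installed
in addition to lxml .) Here is an example of using lxml to process the Wikipedia page
about the Python programming language. We downloaded it in Chapter 10 and here is a
screenshot of the page source (partial):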

Here is a Python program to clean up the page:

Listing 11.3: clean-html-3.py


1 import re
2 import lxml.html.clean as lxmlclean
3 # Function
4 def get_clean_text(HtmlText):
5     Cleaner = lxmlclean.Cleaner(
6         style=True,
7         safe_attrs=[''],
8         allow_tags=[''],
9         remove_unknown_tags=False
10     )
11     return Cleaner.clean_html(HtmlText)
12 # Function
13 def deep_clean(Text):
14     Pattern = r'\<.*?\>'
15     Text = re.sub(Pattern, '', Text, flags=re.DOTALL)
16     Pattern = r'\s{5,}'
17     Text = re.sub(Pattern, r'\n\n', Text)
18     return Text
19 # Processing
20 HtmlFile = 'python-wiki.htm'
21 TextFile = 'python-wiki.txt'
22 with open(HtmlFile, 'r') as Fr:
23     Text = re.sub(r'>', r'> ', Fr.read())
24 Text = get_clean_text(Text)
25 Text = deep_clean(Text)
26 with open(TextFile, 'w') as Fw:
27     Fw.write(Text)

The general idea of the above code is:

• Define a function to use the lxml package to remove HTML tags.
• Define a function to do additional cleaning.
• Use the two functions to process the wiki page.
• Save the cleaned text in a file.

Here is how the code works:

1. Import the re module.


2. Import the lxml.html.clean module and assign a shorter name to use in this program.
3. Comment line.
4. The next few lines define a function, which takes some HTML text as input.
5. This line creates an instance of the Cleaner class of the lxml module we imported. We
reuse “Cleaner” as the variable name. We specify some parameters of the class (all
within the parentheses and separated by commas).
6. The style=True line tells the cleaner to remove all styles in the HTML code.
7. The line safe_attrs=[''] assigns a list of one empty string to the parameter safe_attrs .
So it means there will be no safe attributes...everything inside < > will be removed.
8. The line allow_tags=[''] assigns a list of one empty string to the parameter allow_tags .
So it means no tags are allowed or all tags will be deleted.
9. Because of the allow_tags parameter, remove_unknown_tags must be set to be False ,
otherwise the code doesn’t run.
10. A right parenthesis to end the definition of the Cleaner instance.
11. The function uses a method of the Cleaner instance to clean up the HTML code and
returns the cleaned text.
12. Another comment line to make the code more readable.
13. Starting from this line we define another function to do some additional cleaning. This
is because the Cleaner from the lxml will add an extra pair of tags <div> </div> in
the cleaned text...making it less clean.
14. Define a pattern for tags.
15. Replace each tag with an empty string.
16. Define another pattern for 5 or more consecutive white-space characters \s , including
spaces, tabs \t and line breaks \n . This will make the resulting text less sparse. The general
syntax is {min,max} (min and max separated by a comma; leaving out max means there is
no upper limit).
17. This line replaces any 5 or more consecutive white-space characters with two line
breaks. This is to avoid lumping everything together.
18. The function will return the further cleaned text.
19. Yet another comment line.
20. Now comes the processing. This line specifies the HTML file we want to clean.
21. Where we want to save the cleaned text.
22. Open the HTML file in the read mode.
23. Use the read() of the file object to get its text content, then replace all '>' with '> '
by adding a white space. This can help avoid two words being lumped together to
become a non-word. (A side effect is that there will be extra white spaces but this is a
trivial issue.)
24. Use the function defined above to clean the text, and then assign the result back to the
variable Text .
25. Do some more cleaning of Text .
26. Open a file in the write mode.
27. Write the cleaned text Text to the text file.

Running the above code will result in a text file saved in the same folder as the Python
code file. Below is a partial screenshot:
Part II

Analyzing Corporate Filings

Chapter 12

Processing Filing Indices

We are ready to do some real textual analysis! The first step is to figure out where to down-
load 10-K files. To do this, we need the index files published by the SEC. From the index
files we can get the exact link to any specific filing.


12.1 Downloading Index Files

The SEC publishes two types of index files on its website: daily and quarterly. Usually quar-
terly index files are a better choice for accounting researchers (we don’t do textual analysis
on a daily basis).
For the same index file, there are three different copies: (1) sorted by company name (file
name company.idx ), (2) by form type (file name form.idx ), and (3) by CIK ( master.idx ),
respectively. It doesn’t really matter as to which one you should use.
All index files can be found at:
https://www.sec.gov/Archives/edgar/full-index/
Because there are only a limited number of index files, you can download them manually. Note
that for each file there is a zip version. So if your Internet connection is slow, you can
download the zip version and then unzip it.
You can also write a Python program to automate the process, if there are too many index
files to download. Index files are organized by year and quarter (see footnote 1). For example,
the master index file for the first quarter of 2021 is:
https://www.sec.gov/Archives/edgar/full-index/2021/QTR1/master.idx

Before you write Python code to automatically download anything from the SEC’s
website, make sure to review its requirements at:
https://www.sec.gov/os/accessing-edgar-data

Important
The SEC wants to know who you are. So your code
must have a “user-agent”, something like “Your School Name,
Your.Name@YourSchool.edu”. Without it, you will get garbage.

Below is an example of how to automate the downloading process. The general idea of
the downloading:

• Specify the index files of which year(s) and quarter(s) to download.


• For each year each quarter, get the URL of the SEC index file.
• Download and save the index file as year-quarter.txt . For example, for 2021 Quarter
1, the index file will be named as 2021-QTR1.txt .

Here is the code:

Listing 12.1: edgar-download-index-files.py


1 # Your organization and your email. DO NOT USE MINE!
2 Headers = {'User-Agent': 'xxx University, name@whateveruniversity.edu'}
3 # Years and Quarters as decimals: year.quarter
4 BeginYQ = 2021.1
5 EndYQ = 2021.1
6 # Required packages/modules
7 import requests
8 # Function
9 def get_url(Y, Q):
10     SecPath = 'https://www.sec.gov/Archives/edgar/full-index/'
11     SecFileName = 'master.idx'
12     Slash = '/'
13     Steps = [SecPath, str(Y), Slash, 'QTR', str(Q), Slash, SecFileName]
14     return ''.join(Steps)
15 # Another function
16 def set_file_name(Y, Q):
17     return ''.join([str(Y), '-', 'QTR', str(Q), '.txt'])
18 # Another function
19 def download(url, SaveFile):
20     with open(SaveFile, 'wb') as Sf:
21         Sf.write(requests.get(url, headers=Headers).content)
22 # Processing
23 Count = 0
24 if BeginYQ > EndYQ:
25     print('BeginYQ > EndYQ???')
26 else:
27     Years = range(int(BeginYQ), int(EndYQ)+1)
28     Quarters = range(1, 5)
29     for Y in Years:
30         for Q in Quarters:
31             if BeginYQ <= (Y + Q/10) <= EndYQ:
32                 download(get_url(Y, Q), set_file_name(Y, Q))
33                 Count += 1
34 print(Count, 'index files downloaded')
35 # End

Footnote 1: The “year” and “quarter” here refer to filing dates and they have nothing to do with fiscal year and quarter.

If you are going to use the above code in your own Python program:

...
2. Define a user-agent required by the SEC. It is a dictionary data type.

Use your own organization and name. Do NOT use mine! I didn’t anonymize
the code because I had to test it from time to time to make sure everything worked as
expected, and I don’t want to risk having my IP address blocked by the SEC by using
a fake name and email.
...
21. The Headers dictionary is used here in this line as a parameter for the requests.get()
method.
...
24. Check if year/quarter specification is wrong.
...
27. If year/quarter specification is okay, get a range of years.


28. Get a range of quarters (1...4).
29. For each year and each quarter that is within the desired timeframe, download the
index file.
...
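Listing 12.1 writes year and quarter as a decimal such as 2021.1 and compares decimals. Comparisons of decimal (floating-point) numbers can be fragile, so an alternative is to keep year and quarter as a pair of whole numbers. The sketch below is one way to do that, not the method used in Listing 12.1; the function name index_urls and its parameters are made up for illustration, and nothing is downloaded here.

```python
def index_urls(begin, end, file_name='master.idx'):
    """Return (year, quarter, url) for every quarter from begin to end.

    begin and end are (year, quarter) tuples, e.g. (2020, 3). Tuples compare
    element by element, so no decimal arithmetic is needed.
    """
    base = 'https://www.sec.gov/Archives/edgar/full-index/'
    result = []
    for year in range(begin[0], end[0] + 1):
        for quarter in range(1, 5):
            if begin <= (year, quarter) <= end:
                url = f'{base}{year}/QTR{quarter}/{file_name}'
                result.append((year, quarter, url))
    return result

# From the fourth quarter of 2020 to the first quarter of 2021
for year, quarter, url in index_urls((2020, 4), (2021, 1)):
    print(f'{year}-QTR{quarter}.txt', url)
```

Each printed line pairs a local file name such as 2021-QTR1.txt with the URL of the corresponding index file; these are the same two pieces of information that get_url() and set_file_name() produce in Listing 12.1.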
12.2 Processing Index Files

Once the index files have been downloaded, we can then process them to get the links to specific
filings. We will use the following mini version of a downloaded index file as an example
(don’t process too many filings in test runs). We’ve already seen it before. It’s included
here so that you don’t need to flip through the pages.

Listing 12.2: masteridx.txt


1 Description: Master Index of EDGAR Dissemination Feed
2 Last Data Received: March 31, 2020
3 Comments: webmaster@sec.gov
4 Anonymous FTP: ftp://ftp.sec.gov/edgar/
5 Cloud HTTP: https://www.sec.gov/Archives/
6
7
8
9
10 CIK|Company Name|Form Type|Date Filed|Filename
11 -----------------------------------------------
12 1000229|CORE LABORATORIES N V|10-K|2020-02-10|edgar/data/1000229/0001564590-20-004075.txt
13 1000230|OPTICAL CABLE CORP|10-K|2020-01-27|edgar/data/1000230/0001437749-20-001224.txt

As you can see, the first few lines are not data. Line 5 ( https://www.sec.gov/Archives/ ) tells
you the path you need to form a complete URL. Line 10 is the column heading, which tells you
what data each column has (Line 11 is just a separator). The real data starts from Line 12.
Each line is a separate record, whose data fields are separated by vertical bars | . So for the
second firm in the index file, the URL of its filing is:
https://www.sec.gov/Archives/edgar/data/1000230/0001437749-20-001224.txt
In the above URL the long string of digits and dashes 0001437749-20-001224 is referred
to as “accession” in SEC’s terminology.
Based on the file name, you can also check the landing page (or “homepage”) for all
related files. There are actually multiple files for each filing; the .txt file is just one of them.
For example, the URL of the landing page of the above firm’s filing is something like the
following:
https://www.sec.gov/Archives/edgar/data/1000230/0001437749-20-001224-index.html
(just replace .txt with -index.html in the above link.)
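The three pieces discussed above (the filing URL, the landing page URL, and the accession number) can all be derived from the Filename field of the index file. A minimal sketch follows; the function names filing_url , landing_page_url , and accession are hypothetical helpers written for illustration.

```python
SecPath = 'https://www.sec.gov/Archives/'

def filing_url(file_name):
    """Full URL of the complete .txt filing named in the index file."""
    return SecPath + file_name.strip()

def landing_page_url(file_name):
    """URL of the filing's landing page: .txt becomes -index.html."""
    url = filing_url(file_name)
    if not url.endswith('.txt'):
        raise ValueError('expected a file name ending in .txt')
    return url[:-len('.txt')] + '-index.html'

def accession(file_name):
    """Accession number: the last path component without the extension."""
    return file_name.strip().rsplit('/', 1)[-1][:-len('.txt')]

# The Filename field of the second firm in the mini index file
Name = 'edgar/data/1000230/0001437749-20-001224.txt'
print(filing_url(Name))
print(landing_page_url(Name))
print(accession(Name))
```

The three printed lines are the filing URL, the landing page URL, and the accession number 0001437749-20-001224 shown in the text above.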
The landing page is shown in Figure 12.1.

1 is the accession number, which is used in the file name.


2 is the file specified in the index file. Note that it’s a complete file that includes all related
files of the same filing, more than what is listed on the landing page.

Figure 12.1: Filing Landing Page

We are somewhat off-track now...let’s go back to the filing URL. The objective of this
chapter is to process the index file, build correct URLs, and save them in a separate file, which
will then be used to download filings. Note that processing index files and downloading
filings can be combined into one step in the code, but for the sake of clarity, we separate
them into different steps/chapters.
The general steps to process the index file include:

• Identify data rows. In our example each data row has several vertical bars, which
separate data fields.
• Identify non-data rows. Using vertical bars to identify data rows will include the field
label CIK|Company Name|... . Such a label row is usually required for CSV files. But for
our purpose it is not necessary.
To exclude the label row, we can use the string .txt as another criterion: a real data
row must have the string but a label row does not.
• Identify 10-K rows. Each index file downloaded from SEC EDGAR includes all form
types. So we have to use the Form Type field to keep 10-K reports but ignore all others.
(This will exclude the label line.)
• Identify data fields. Based on the label row (Line 10), we know the data of each field:
– Field 1: CIK
– Field 2: Company Name
– Field 3: Form Type
– Field 4: Date Filed
– Field 5: Filename

Below is an implementation of the above procedures:

Listing 12.3: edgar-process-index-files.py


1 # Initial Settings
2 IndexFile = 'masteridx.txt'
3 SaveFile = 'master-url.txt'
4 FormTypes = ['10-K']
5 SecPath = 'https://www.sec.gov/Archives/'
6 # Function
7 def check_form_type(Field, FormTypes):
8     if Field in FormTypes:
9         return True
10     else:
11         return False
12 # Processing
13 with open(SaveFile, 'w') as Fs:
14     with open(IndexFile, 'r') as Fr:
15         for Line in Fr:
16             if Line.count('|') > 2:
17                 Fields = Line.split('|')
18                 if Fields[2] in FormTypes:
19                     url = ''.join([SecPath, Fields[4]])
20                     Fs.write(url)
21 # End

Here is how the code works:

...
2. This is the mini index file we use as an example. We will process only one index file
here. If you have many, you can use a loop statement.
...
4. We use a list to specify form types. If you want to include amendments, you can use
['10-K', '10-K/A'] .
...
15. The index file opened in the read mode Fr is treated as a sequence of lines and we loop
through it line by line.
16. A line is valid only if it has more than two vertical bars.
17. Make a list of fields by using the vertical bar as the separator.
18. Check the third field to see if the form is what we want. Remember that lists are zero-
based.
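19. Make a URL for each data row by joining the SEC path and the file name (with nothing
in between).
20. Save the URL in the new file opened in the write mode. The last field still ends with the
line break of the original line, so each URL is written on its own line.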

Running the above code will generate something like the following:

Listing 12.4: master-url.txt


1 https://www.sec.gov/Archives/edgar/data/1000229/0001564590-20-004075.txt
2 https://www.sec.gov/Archives/edgar/data/1000230/0001437749-20-001224.txt

You can manually check if the URLs work.
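If you want to keep more than the URL (for example the CIK and the filing date), each data row can be turned into a dictionary keyed by the field labels. The sketch below also applies the .txt criterion mentioned earlier to skip the label row. The names FIELD_NAMES and parse_index_line are made up for illustration, and the rule that a data row has exactly five fields is our own assumption based on the mini index file.

```python
FIELD_NAMES = ['CIK', 'Company Name', 'Form Type', 'Date Filed', 'Filename']

def parse_index_line(line):
    """Return a dict for a data row, or None for any other line."""
    fields = [f.strip() for f in line.split('|')]
    # A data row has exactly five fields and a file name ending in .txt
    if len(fields) != 5 or not fields[4].endswith('.txt'):
        return None
    return dict(zip(FIELD_NAMES, fields))

# Three lines taken from the mini index file: a comment, the label row, a data row
SampleLines = [
    'Description: Master Index of EDGAR Dissemination Feed',
    'CIK|Company Name|Form Type|Date Filed|Filename',
    '1000230|OPTICAL CABLE CORP|10-K|2020-01-27|'
    'edgar/data/1000230/0001437749-20-001224.txt\n',
]
Rows = [r for r in map(parse_index_line, SampleLines) if r is not None]
TenKs = [r for r in Rows if r['Form Type'] in ['10-K']]
print(TenKs[0]['Company Name'], TenKs[0]['Filename'])
```

Only the third line survives both checks, so the program prints the company name and file name of Optical Cable Corp.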


Chapter 13

Downloading Filings

We are now ready to download 10-K filings by using the URLs we created in the previous
chapter. Just to make it more convenient, the file is repeated here:

Listing 13.1: master-url.txt


1 https://www.sec.gov/Archives/edgar/data/1000229/0001564590-20-004075.txt
2 https://www.sec.gov/Archives/edgar/data/1000230/0001437749-20-001224.txt

The general procedures to download the filings:

• Open the URL file and loop through it line by line where each line is a separate URL.
• Use the requests package to download and save each filing.

Important...Again!
The SEC wants to know who you are. So your code
must have a “user-agent”, something like “Your School Name,
Your.Name@YourSchool.edu”. Without it, you will get garbage.

Below is an implementation of the above procedures:

Listing 13.2: edgar-download-files.py


1 # Your organization and your email. DO NOT USE MINE!
2 Headers = {'User-Agent': 'xxx University, name@whateveruniversity.edu'}
3 # Required packages/modules
4 import re
5 import requests
6 from datetime import datetime, timedelta
7 # Initial Settings
8 UrlFile = 'master-url.txt'
9 # Function
10 def set_file_name(url):
11     return re.findall(r'.{20}\.txt', url)[0]
12 # Another function
13 def download(url, SaveFile):
14     with open(SaveFile, 'wb') as Sf:
15         Sf.write(requests.get(url, headers=Headers).content)
16 # Another function
17 def print_progress(lbl, ith, cnt, tim):
18     lblFmt = str(lbl).ljust(15, ' ')
19     ithFmt = str(ith).rjust(5, ' ')
20     cntFmt = str(cnt).rjust(5, ' ')
21     timFmt = str(tim).rjust(20, ' ')
22     print('\r', lblFmt, ':', ithFmt, 'of', cntFmt, timFmt, end='', flush=True)
23 # Processing
24 StartTime = datetime.now()
25 EndTime = datetime.now()
26 with open(UrlFile, 'r') as Fr:
27     Lines = Fr.read().splitlines()  # one URL per line, without the line break
28     TotalCount = len(Lines)
29     for i, Line in enumerate(Lines):
30         EndTime = datetime.now()
31         t = EndTime - StartTime
32         print_progress(set_file_name(Line), i+1, TotalCount, t)
33         download(Line, set_file_name(Line))
34         EndTime = datetime.now()
35         t = EndTime - StartTime
36         print_progress(set_file_name(Line), i+1, TotalCount, t)
37 print(' ')
38 # End

Here is how the code works:

...
2. Define a user-agent required by the SEC. It is a dictionary data type.

Use your own organization and name. Do NOT use mine! I didn’t anonymize
the code because I had to test it from time to time to make sure everything worked as
expected, and I don’t want to risk having my IP address blocked by the SEC by using
a fake name and email.
...
10. Define a function to find the file name in the URL. The accession number has 20 char-
acters, including two hyphens.
...
13. Define a function for downloading files.
...
15. The above header with user-agent information is used in the requests.get() method
for downloading files.
...
... Define a function to show progress. It is helpful when you have many files to down-
load.
...
27. We make a new list of lines read from the file. This is needed to get the total count of
files.
...

Running the above code will save the files in the same folder as the code file and display
something like the following in the terminal:

0001437749 -20 -001224. txt : 2 of 2 0:00:01.091356

Here it took a little over one second to download two 10-K filings (total size 26MB).
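When you download thousands of filings, the program may be interrupted and restarted. A simple safeguard, sketched below under our own assumptions, is to skip any filing that has already been saved. The function names file_name_from_url and pending_urls are hypothetical; the sketch only decides what still needs downloading and leaves the actual download to a function like the one in Listing 13.2.

```python
import os

def file_name_from_url(url):
    """File name for saving: the last path component of the URL."""
    return url.strip().rsplit('/', 1)[-1]

def pending_urls(lines, folder='.'):
    """Yield (url, file_name) pairs for filings not yet saved in folder."""
    for line in lines:
        url = line.strip()          # remove the trailing line break
        if not url:
            continue                # skip empty lines
        name = file_name_from_url(url)
        if not os.path.exists(os.path.join(folder, name)):
            yield url, name

# Two lines as they would be read from master-url.txt (the second is empty)
Lines = ['https://www.sec.gov/Archives/edgar/data/1000229/0001564590-20-004075.txt\n',
         '\n']
for url, name in pending_urls(Lines, folder='no-such-folder'):
    print(name)
```

Here the folder name no-such-folder is deliberately a folder that does not exist, so nothing counts as already downloaded and the file name of the first filing is printed.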
Chapter 14

Cleaning HTML Code

The structure of the HTML source largely matches what is shown on the filing landing page.
We have seen an example in Figure 12.1 on Page 112. It is repeated here for convenience.

The HTML source of the corresponding complete file 0001437749-20-001224.txt is
structured as follows (partial):

Listing 14.1: 0001437749-20-001224.txt


1 <SEC-DOCUMENT>
2 <SEC-HEADER>
3 </SEC-HEADER>
4 <DOCUMENT>
5 <TYPE>10-K
6 <SEQUENCE>1
7 <FILENAME>occ20191031_10k.htm
8 <DESCRIPTION>FORM 10-K
9 <TEXT>
10 <XBRL>
11 ....... The whole 10-K would be here .......
12 </XBRL>
13 </TEXT>
14 </DOCUMENT>
15 </DOCUMENT>
16 <DOCUMENT>
17 ...
18 </DOCUMENT>
19 <DOCUMENT>
20 ...
21 </DOCUMENT>
22 </SEC-DOCUMENT>

1. The whole thing is enclosed in the tag pair <SEC-DOCUMENT> and </SEC-DOCUMENT> .
2. The <SEC-HEADER>...</SEC-HEADER> block holds some general information about the
filer.
...
4. The <DOCUMENT>...</DOCUMENT> block holds the whole 10-K report (shown on the land-
ing page as a separate file.)
5. The <TYPE> single tag shows the document type. Here it shows that it’s the 10-K report.
Note that this tag stands on its own and does not have a closing one.
6. The <SEQUENCE> single tag corresponds to the Seq column on the landing page.
7. The <FILENAME> single tag corresponds to the Document column on the landing page.
It’s the name of the separate file for the 10-K report.
8. The <DESCRIPTION> single tag describes the nature of the corresponding file.
9. The <TEXT>...</TEXT> block holds the text part of the 10-K report.
10. The <XBRL>...</XBRL> block indicates that the text part has inline XBRL elements.
Older filings do not have such a block.
...
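
The structure described above can be explored with a few lines of code. The following is only a minimal sketch: the sample text is made up to mimic Listing 14.1, and the function name list_documents is hypothetical. It relies on the fact that the single tags (<TYPE>, <SEQUENCE>, <FILENAME>, <DESCRIPTION>) have no closing tag, so each value simply runs to the end of its line.

```python
import re

# A tiny synthetic "complete file" that mimics the layout in Listing 14.1.
# The contents are made up for illustration only.
sample = """<SEC-DOCUMENT>
<SEC-HEADER>
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>example_10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
The whole 10-K would be here.
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21.1
<SEQUENCE>2
<FILENAME>example_ex21.htm
<DESCRIPTION>EXHIBIT 21.1
<TEXT>
An exhibit would be here.
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
"""

def list_documents(raw):
    """Return one dict per <DOCUMENT> block with its single-tag fields."""
    docs = []
    for block in re.findall(r'<DOCUMENT>.+?</DOCUMENT>', raw, flags=re.DOTALL):
        info = {}
        for tag in ('TYPE', 'SEQUENCE', 'FILENAME', 'DESCRIPTION'):
            # Single tags have no closing tag, so the value runs to end of line.
            m = re.search(r'<' + tag + r'>([^\n]*)', block)
            info[tag] = m.group(1).strip() if m else ''
        docs.append(info)
    return docs

documents = list_documents(sample)
for d in documents:
    print(d['SEQUENCE'], d['TYPE'], d['FILENAME'])
```

Listing the blocks this way is a quick check that the first block really is the 10-K before you extract it.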

The sequence of the <DOCUMENT>...</DOCUMENT> blocks follows the Seq column
on the landing page. So the 10-K is listed first (Seq = 1) and the Extracted XBRL Instance
Document is listed last (Seq = 37). Note that there are many other files not shown on the
landing page. That is why the numbers in the Seq column have gaps and appear
to be non-sequential.
To extract the 10-K part, we only need what is between the first pair of <DOCUMENT> and
</DOCUMENT>.
The general procedures to extract the 10-K report are:

• Open the file in the read mode.
• Extract the 10-K report.
• Remove HTML tags.
• Save the extracted 10-K report as a new file. (We save it in a new file just in case we have
to redo the extraction. If disk space is an issue, you can manually delete the old file.)

Let’s look at an example:

Listing 14.2: edgar-clean-html.py


 1 # Required modules
 2 import re
 3 import lxml.html.clean as lxmlclean
 4 from datetime import datetime, timedelta
 5 # Initial Settings
 6 HtmlFiles = ['0001564590-20-004075.txt', '0001437749-20-001224.txt']
 7 # Function
 8 def get_clean_text(HtmlText):
 9     Cleaner = lxmlclean.Cleaner(
10         style=True,
11         safe_attrs=[''],
12         allow_tags=[''],
13         remove_unknown_tags=False
14     )
15     return Cleaner.clean_html(HtmlText)
16 # Function
17 def deep_clean(Text):
18     Pattern = r'\<.*?\>'
19     Text = re.sub(Pattern, '', Text, flags=re.DOTALL)
20     Pattern = r'\s{5,}'
21     Text = re.sub(Pattern, r'\n\n', Text)
22     return Text
23 # Another function
24 def print_progress(lbl, ith, cnt, tim):
25     lblFmt = str(lbl).ljust(15, ' ')
26     ithFmt = str(ith).rjust(5, ' ')
27     cntFmt = str(cnt).rjust(5, ' ')
28     timFmt = str(tim).rjust(20, ' ')
29     print('\r', lblFmt, ':', ithFmt, ' of ', cntFmt, timFmt, end='', flush=True)
30 # Processing
31 Pattern10k = r'<DOCUMENT>.+?</DOCUMENT>'
32 Patternix = r'<ix:header>.+?</ix:header>'
33 PatternTable = r'<table.*?>.+?</table>'
34 StartTime = datetime.now()
35 EndTime = datetime.now()
36 for i, HtmlFile in enumerate(HtmlFiles):
37     TotalCount = len(HtmlFiles)
38     EndTime = datetime.now()
39     t = EndTime - StartTime
40     print_progress(HtmlFile, i+1, TotalCount, t)
41     TextFile = ''.join([HtmlFile, '.clean'])
42     with open(HtmlFile, 'r') as Fr:
43         Text = re.findall(Pattern10k, Fr.read(), flags=re.DOTALL)[0]
44         Text = re.sub(Patternix, '\n', Text, flags=re.DOTALL)
45         Text = re.sub(PatternTable, '\n', Text, flags=re.I | re.DOTALL)
46         Text = re.sub(r'>', r'> ', Text)
47         Text = get_clean_text(Text)
48         Text = deep_clean(Text)
49         with open(TextFile, 'w') as Fw:
50             Fw.write(Text)
51     EndTime = datetime.now()
52     t = EndTime - StartTime
53     print_progress(HtmlFile, i+1, TotalCount, t)
54 print('')
55 # End

Here is how the code works:

...
2. Import required packages/modules.
...
6. A list of HTML files to clean up. Note that here we manually enter the two files we
downloaded earlier. In your real 10-K research projects you may scan the folder to get
a list of files.
...
8. Define a function to clean up HTML code. This is the same as what we have seen in a
previous chapter.
...
17. Define a function to do additional cleaning. This is the same as what we have seen in
a previous chapter.
...
24. Define a function to show progress.
...
31. Define the pattern to search for <DOCUMENT> </DOCUMENT> blocks.
32. Define the pattern to search for <ix:header> </ix:header> blocks. These blocks are
used for inline XBRL but useless for textual analysis. So any such blocks will be
removed.
33. Define the pattern to search for tables. This is described in Li (2008). Note that such
removal can be invasive in some cases where firms may use tables to structure text
(such as headings) but not tabular content.
...
43. We search for all <DOCUMENT> </DOCUMENT> blocks but only take the first one. Note that
the re.findall() method returns a list of matched strings.
44. Replace all inline XBRL blocks with an empty line.
45. Replace all table blocks with an empty line.
46. Add a white space at the end of every > to avoid words being lumped together.
47. Now clean up HTML code.
48. Additional cleaning.
49. Open a new file to write the cleaned content. The new file name is the old name plus
the suffix .clean .
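
The walkthrough mentions that in a real project you may scan the folder instead of typing the file names by hand. One possible way to do that is sketched below. The function name list_filing_files is hypothetical, and the sketch assumes that the downloaded complete files are named after their accession numbers (10 digits, 2 digits, 6 digits, then .txt), as the two example files are.

```python
import re
import tempfile
from pathlib import Path

# Accession-number file names look like 0001437749-20-001224.txt
ACCESSION_FILE = re.compile(r'\d{10}-\d{2}-\d{6}\.txt')

def list_filing_files(folder):
    """Return the sorted names of downloaded complete files in a folder."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and ACCESSION_FILE.fullmatch(p.name)
    )

# Demonstration in a temporary folder so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('0001564590-20-004075.txt',
                 '0001437749-20-001224.txt',
                 '0001437749-20-001224.txt.clean',
                 'master-url.txt'):
        (Path(tmp) / name).write_text('placeholder')
    HtmlFiles = list_filing_files(tmp)

print(HtmlFiles)
```

Matching the full file name keeps other text files in the same folder (such as the URL list or already cleaned files) out of the result.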

Running the above code will save two new files in the same folder as the code file and
the HTML source files. It will display something like the following in the terminal:

0001437749-20-001224.txt : 2 of 2 0:00:01.383134

Here it took about 1.4 seconds to clean up the two 10-K filings. The size
of the two cleaned files is about 250KB, which is less than one percent of the original 26MB.
Chapter 15

Extracting Text Sections

Once the HTML files are cleaned, we are now ready to extract specific sections of 10-K filings.
Up to now we have already done some extraction such as the <DOCUMENT> </DOCUMENT> blocks.
So we can simply reuse the code we have. The hard part is how to define patterns.
Let’s look at how to extract the MDA (Management’s Discussion and Analysis) sections
of 10-K reports. Some general characteristics of the MDA section:

• It usually begins with a heading like “Item 7 Management’s Discussion and Analysis
of Financial Condition and Results of Operations”. But there can be many variations
(you can never imagine all possibilities):
– “Item 7.” instead of “Item 7”
– “Management's” instead of “Management’s” (the apostrophes are different!)
– “Item 7 - Management’s” instead of “Item 7 Management’s”
• There is no standard ending. But usually we can use “Item 8” as the cutoff point.
The heading is something like “Item 8 Financial Statements and Supplementary Data”.
Again, we have to consider possible variations.
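
To get a feel for these variations, it helps to test a heading pattern against a handful of sample headings before running it on real filings. The sketch below is only an illustration: the pattern HEADING is a hypothetical variant written for this test (it is not the pattern used in the listing that follows), and the sample headings are made up.

```python
import re

# Hypothetical pattern for the Item 7 heading only.
# [.:\-]? allows an optional period, colon, or hyphen after the 7;
# the dot in Management.s accepts any kind of apostrophe.
HEADING = re.compile(
    r'Item\s*7\s*[.:\-]?\s*Management.s\s+Discussion\s+and\s+Analysis',
    flags=re.IGNORECASE)

variants = [
    "Item 7 Management's Discussion and Analysis",
    "Item 7. Management\u2019s Discussion and Analysis",  # curly apostrophe
    "Item 7 - Management's Discussion and Analysis",
    "ITEM 7.    MANAGEMENT'S DISCUSSION AND ANALYSIS",
    "Item 7A. Quantitative and Qualitative Disclosures",  # should not match
]

results = [bool(HEADING.search(v)) for v in variants]
for v, r in zip(variants, results):
    print(r, '|', v)
```

Keeping such a list of test headings and extending it whenever a filing fails to match is a simple way to refine the pattern over time.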

Here is an example of how to extract the MDA sections of the two 10-K reports we
processed in the previous chapter.

Listing 15.1: edgar-extract-section.py


 1 # Required modules
 2 import re
 3 from datetime import datetime, timedelta
 4 # Initial Settings
 5 TextFiles = ['0001564590-20-004075.txt.clean', '0001437749-20-001224.txt.clean']
 6 # Function
 7 def print_progress(lbl, ith, cnt, tim):
 8     lblFmt = str(lbl).ljust(15, ' ')
 9     ithFmt = str(ith).rjust(5, ' ')
10     cntFmt = str(cnt).rjust(5, ' ')
11     timFmt = str(tim).rjust(20, ' ')
12     print('\r', lblFmt, ':', ithFmt, ' of ', cntFmt, timFmt, end='', flush=True)
13 # Processing
14 PatternMda = r'\n\s*Item\s*7\.?\s*Management.s\s*Discussion\s*And\s*Analysis.+\n\s*Item\s*8\.?\s*Financial\s*Statements\s*and\s*Supplementary\s*Data\s*\n'
15 StartTime = datetime.now()
16 EndTime = datetime.now()
17 for i, TextFile in enumerate(TextFiles):
18     TotalCount = len(TextFiles)
19     EndTime = datetime.now()
20     t = EndTime - StartTime
21     print_progress(TextFile, i+1, TotalCount, t)
22     SaveFile = ''.join([TextFile, '.mda'])
23     with open(TextFile, 'r') as Fr:
24         Texts = re.findall(PatternMda, Fr.read(), flags=re.I | re.DOTALL)
25     with open(SaveFile, 'w') as Fw:
26         Fw.write('\n'.join(Texts))
27     EndTime = datetime.now()
28     t = EndTime - StartTime
29     print_progress(TextFile, i+1, TotalCount, t)
30 print('')
31 # End

The code is very similar to what we had in the previous chapter. Some additional expla-
nation about the pattern definition:

...
14. The pattern ( PatternMda ) takes whatever is between “Item 7” and “Item 8” (inclusive).
More specifically:

• In \n\s*Item the \n makes sure the headings start on new lines; \s deals with
potential non-printing white space such as tabs; * means zero or more.
• In 7\.?\s*Management the \.? accepts an optional literal period (the backslash
escapes the dot). So this handles both 7 and 7. but not 7: . To accept any single
character after the 7, use an unescaped .? instead. So between 7 and Management ,
there can be an optional period plus any white space.
• In Management.s , the dot . accepts one character of anything. So this deals with
any non-standard apostrophes.
• .+ accepts one or more of any characters. So this is basically the whole MDA
section. It requires the re.DOTALL flag in the re.findall() method so that the dot
also matches newlines. Note that .+ is greedy: the match runs from the first
matching “Item 7” heading to the last matching “Item 8” heading.

...
24. The re.findall() method returns a list of all non-overlapping matches. For example,
the result may include the items in the table of contents if the pattern is
matched there. Although such things are not what we need, they are usually trivial as far as
textual analysis is concerned.
...
26. The result list Texts cannot be written into the new file directly. It must be converted
into a string by using the join() method. In this case, a newline will join all matched
blocks.
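
The effect of the pattern is easier to see on a small made-up text. The sketch below splits the pattern into a start and an end part (the variable names are hypothetical) and compares the greedy .+ with the non-greedy .+? on a synthetic string that contains the two headings twice. It also shows why the list returned by re.findall() is joined into one string before writing.

```python
import re

# Synthetic text with two "Item 7 ... Item 8" stretches (made up for illustration).
text = (
    "\nItem 7. Management's Discussion and Analysis\n"
    "First discussion.\n"
    "\nItem 8. Financial Statements and Supplementary Data\n"
    "Statements.\n"
    "\nItem 7. Management's Discussion and Analysis\n"
    "Second discussion.\n"
    "\nItem 8. Financial Statements and Supplementary Data\n"
)

start = r'\n\s*Item\s*7\.?\s*Management.s\s*Discussion\s*and\s*Analysis'
end = r'\n\s*Item\s*8\.?\s*Financial\s*Statements\s*and\s*Supplementary\s*Data\s*\n'

# Greedy: .+ grabs as much as possible, so everything from the first
# Item 7 heading to the last Item 8 heading becomes a single match.
greedy = re.findall(start + r'.+' + end, text, flags=re.I | re.DOTALL)

# Non-greedy: .+? stops at the nearest Item 8 heading, giving two matches.
lazy = re.findall(start + r'.+?' + end, text, flags=re.I | re.DOTALL)

# findall() returns a list, so join the pieces before writing to a file.
joined = '\n'.join(lazy)

print(len(greedy), len(lazy))
```

Which behaviour is preferable depends on the filings at hand, so it is worth checking both on a few sample files.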

Below is the extracted MDA section of one of the filings:

Listing 15.2: 0001437749-20-001224.txt.clean.mda


Item 7.    MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS

The information contained under the caption “Management’s Discussion and Analysis of Financial Condition and Results of Operations” of our Annual Report for the fiscal year ended October 31, 2019, filed as Exhibit 13.1 to this report on Form 10-K, is incorporated herein by reference.

Item 7A.    QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK

The Company did not engage in transactions in derivative financial instruments or derivative commodity instruments. As of October 31, 2019, the Company’s financial instruments were not exposed to significant market risk due to interest rate risk, foreign currency exchange risk, commodity price risk or equity price risk.

Item 8.   FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA

In this example, the MDA section has been successfully extracted, but there is another
problem: the MDA section merely refers to another file (Exhibit 13.1) for its content.
Following that reference cannot be done here and requires manual processing.
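
Although the referenced exhibit has to be handled manually, such cases can at least be flagged automatically. The following is a minimal sketch; the function name incorporates_by_reference and the wording of the pattern are assumptions, and real filings may phrase the reference differently.

```python
import re

def incorporates_by_reference(section_text):
    """Return True if the section says its content is incorporated by reference."""
    pattern = r'incorporated\s+(herein\s+)?by\s+reference'
    return re.search(pattern, section_text, flags=re.I) is not None

# Shortened versions of an extracted section, for illustration only.
short_mda = ("The information contained under the caption ... of our Annual "
             "Report ... is incorporated herein by reference.")
normal_mda = "Net sales increased during the year compared to the prior year."

print(incorporates_by_reference(short_mda))
print(incorporates_by_reference(normal_mda))
```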
Chapter 16

Simple Analysis and Quality Check

In this chapter we will do some simple counting (e.g., total number of words, selected
keywords) and build URLs for manual checking (comparing the result with corresponding
sections on the SEC’s EDGAR website).
Here is an example:

Listing 16.1: edgar-quality-check.py

 1 # Required modules
 2 import re
 3 import csv
 4 # Initial setting
 5 UrlFile = 'master-url.txt'
 6 KeywordPatterns = [r'\brisk\b', r'\brisks\b', r'\brisky\b']
 7 SaveData = 'edgar-quality-check-mda.csv'
 8 # Function
 9 def get_landing_page(url):
10     return re.sub(r'\.txt', '-index.htm', url)
11 # Function
12 def get_file_name(url):
13     File = re.findall(r'.{20}\.txt', url)[0]
14     return ''.join([File, '.clean.mda'])
15 # Function
16 def search_keywords(KeywordPatterns, Text):
17     Cnt = 0
18     for Kp in KeywordPatterns:
19         Cnt = Cnt + len(re.findall(Kp, Text, flags=re.I))
20     return Cnt
21 # Function
22 def count_words(Text):
23     return len(re.findall(r'[^\W\d]+', Text))
24 # ########################################################## Processing
25 VarList = ['SECTIONFILE', 'WORDS', 'KEYWORDS', 'URL']
26 VarValues = dict()
27 for Var in VarList:
28     VarValues[Var] = -1
29 # Get files
30 Urls = []
31 Files = []
32 with open(UrlFile, 'r') as Fr:
33     Urls = Fr.read().split()
34 with open(SaveData, 'w', newline='') as Df:
35     Writer = csv.DictWriter(Df, fieldnames=VarList, lineterminator='\n')
36     Writer.writeheader()
37     for url in Urls:
38         VarValues['SECTIONFILE'] = get_file_name(url)
39         VarValues['URL'] = get_landing_page(url)
40         with open(get_file_name(url), 'r') as Fr:
41             Text = Fr.read()
42         VarValues['WORDS'] = count_words(Text)
43         VarValues['KEYWORDS'] = search_keywords(KeywordPatterns, Text)
44         ### Write to CSV data file
45         Writer.writerow(VarValues)
46 print('')
47 # End

Let’s look at how the code works:

...
5. The URL file is what we built in the previous chapter. It has a list of URLs pointing to
“complete” files on SEC EDGAR website.
6. A list of keywords we will search for.
7. A data file (CSV) to save the search result.
...
10. A function to convert the URLs (pointing to complete files) into corresponding URLs
that point to landing pages (see the previous chapter for more information).
...
12. A function to get the file names we used in the previous chapter for saving extracted
10-K sections.
...
16. A function to search for keywords.
...
22. A function to count words.
...
26. The rest of the code loops through the URLs, does the searching and counting, and then
saves the result in a CSV file.

Running the above code will create a CSV file in the same folder as the code file. It looks
like the following:

Listing 16.2: edgar-quality-check-mda.csv


SECTIONFILE,WORDS,KEYWORDS,URL
0001564590-20-004075.txt.clean.mda,7155,17,https://www.sec.gov/Archives/edgar/data/1000229/0001564590-20-004075-index.htm
0001437749-20-001224.txt.clean.mda,116,6,https://www.sec.gov/Archives/edgar/data/1000230/0001437749-20-001224-index.htm

It is perhaps better to open the CSV file in a spreadsheet application to visually check the
data:

Based on the above result, we can tell that:

1. The first 10-K appears to have been correctly extracted. The MDA section has more
than 7,000 words.
2. The second one, on the other hand, appears to be problematic. The whole MDA section
has only 116 words. Manual checking is needed in this case.
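
With many filings, checking the spreadsheet by eye becomes tedious, so it may help to let the code list the landing pages that need a manual look. The sketch below reads the two rows of Listing 16.2 from a string (so it runs without any files) and flags sections below a word-count cutoff. The cutoff MIN_WORDS is a hypothetical value, not a recommendation.

```python
import csv
import io

MIN_WORDS = 500  # hypothetical cutoff; choose one that suits your sample

# The content of Listing 16.2, held in a string for this illustration.
csv_text = """SECTIONFILE,WORDS,KEYWORDS,URL
0001564590-20-004075.txt.clean.mda,7155,17,https://www.sec.gov/Archives/edgar/data/1000229/0001564590-20-004075-index.htm
0001437749-20-001224.txt.clean.mda,116,6,https://www.sec.gov/Archives/edgar/data/1000230/0001437749-20-001224-index.htm
"""

# Collect the landing-page URLs of sections that look too short.
to_check = [
    row['URL']
    for row in csv.DictReader(io.StringIO(csv_text))
    if int(row['WORDS']) < MIN_WORDS
]

for url in to_check:
    print('Check manually:', url)
```

In a real project you would open the CSV file itself instead of the string, for example with open('edgar-quality-check-mda.csv', newline='').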
Part III

Additional Topics

Chapter 17

Merging Data

A final step in most accounting research is to merge textual analysis results with data from
other sources such as Compustat. The key problem is that Compustat data is based on fiscal
year and has a data field called datadate (fiscal year end date). The textual analysis we’ve
done so far is based on calendar year (file date). The two dates usually don’t match.
There are at least three approaches we can take to deal with the problem:

1. Using master index data;
2. Using Audit Analytics data; and
3. Extracting additional data from 10-K filings.

We will look at these three approaches in more detail.


17.1 Using Master Index Data

We can reuse the master index data to merge textual analysis result with Compustat by
loosely matching file date with fiscal year end date. The idea is that for any particular fiscal
year n, the file date of the 10-K must be after the end date of fiscal year n but before the
end date of fiscal year n+1.
The overall matching is shown in the following figure:

[Diagram: Compustat is matched to the Master Index on DataDate < FileDate < DataDate + 365; the Master Index is matched to the Textual Analysis Result.]

Figure 17.1: Merging Data (1)
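
The date condition in the figure can be written down directly in Python. The following is only a sketch of the matching rule for a single firm: the function name match_fiscal_year_end is hypothetical, and the dates are made-up examples rather than actual Compustat or master index values.

```python
from datetime import date, timedelta

def match_fiscal_year_end(file_date, data_dates):
    """Return the datadate d with d < file_date < d + 365 days, or None."""
    for d in data_dates:
        if d < file_date < d + timedelta(days=365):
            return d
    return None

# Hypothetical fiscal year end dates for one firm (Compustat datadate).
data_dates = [date(2018, 10, 31), date(2019, 10, 31)]

# Hypothetical file date taken from the master index.
file_date = date(2020, 1, 14)

matched = match_fiscal_year_end(file_date, data_dates)
print(matched)
```

In practice the same condition would be applied firm by firm (for example after linking the two datasets on CIK), typically in a statistical package or with a data-frame library.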



17.2 Using Audit Analytics Data

The second, perhaps better, approach is to use Audit Analytics data, which has both fiscal
year end date and file date. Although file date here refers to the date of filing the audit opinion,
it is generally the same as the 10-K filing date. The matching process is shown in the following
figure:

[Diagram: Compustat is matched to Audit Analytics, Audit Analytics to the Master Index, and the Master Index to the Textual Analysis Result.]

Figure 17.2: Merging Data (2)



17.3 Extracting Data from 10-K

If in some rare cases you don’t have Audit Analytics data, you can extract relevant filing
information from the SEC-HEADER sections of downloaded 10-K filings. Each filing has
a <SEC-HEADER> </SEC-HEADER> block at the very beginning. For the overall structure of 10-K
filings, see Listing 14.1 0001437749-20-001224.txt on Page 120.
The figure below shows a specific example:

[Screenshot of a sample SEC-HEADER block with callouts marking the File Name, Fiscal Year End Date, File Date, and CIK.]

Figure 17.3: Merging Data (3)

Here is a code example for extracting filing information:

Listing 17.1: edgar-extract-filing-info.py


 1 #### Required modules
 2 import re
 3 import csv
 4 #### Initial setting
 5 HtmlFiles = ['0001564590-20-004075.txt', '0001437749-20-001224.txt']
 6 SaveData = 'edgar-filing-info.csv'
 7 #### Function
 8 def get_file_name(File):
 9     return ''.join([File, '.clean.mda'])
10 ###################### Processing
11 PatternHeader = r'<SEC-HEADER>.+?</SEC-HEADER>'
12 ####
13 VarList = ['FILENAME', 'ACCESSION', 'FYENDDATE', 'FILEDATE', 'CIK']
14 VarValues = dict()
15 for Var in VarList:
16     VarValues[Var] = ''
17 ####
18 with open(SaveData, 'w', newline='') as Df:
19     Writer = csv.DictWriter(Df, fieldnames=VarList, lineterminator='\n', quoting=csv.QUOTE_ALL)
20     Writer.writeheader()
21     for HtmlFile in HtmlFiles:
22         VarValues['FILENAME'] = get_file_name(HtmlFile)
23         with open(HtmlFile, 'r') as Fr:
24             Header = re.findall(PatternHeader, Fr.read(), flags=re.DOTALL)[0]
25         Lines = Header.split('\n')
26         DictData = dict()
27         for Line in Lines:
28             if re.findall(':', Line):
29                 key, val = re.split(':', Line, maxsplit=1)
30                 DictData[''.join(re.split(r'\W+', key))] = ''.join(re.findall(r'[-\w]+', val))
31         # print(DictData)
32         VarValues['ACCESSION'] = DictData['ACCESSIONNUMBER']
33         VarValues['FYENDDATE'] = DictData['CONFORMEDPERIODOFREPORT']
34         VarValues['FILEDATE'] = DictData['FILEDASOFDATE']
35         VarValues['CIK'] = DictData['CENTRALINDEXKEY']
36         ### Write to CSV data file
37         Writer.writerow(VarValues)
38 #### End

Here is how the code works:

1. Import required modules


...
5. A list of HTML files to extract.
6. Where to save the result.
...
8. Define a function to append file suffix. This depends on how you name your files in
previous steps.
...
11. Define the pattern of the SEC-HEADER block.
...
19. quoting=csv.QUOTE_ALL here means all data values will be put within quotation marks.
When you import the data into statistical software such as SAS, they will be recognized as text.
...
28. In the SEC-HEADER block, label and value are separated by a colon : .
29. The code here will split each line in the header block on the colon : and assign the
two elements to the two variables key and val respectively.
30. We add the key and val pair to the dictionary variable DictData . Note that in the
source file there are many non-printing white spaces such as tabs. They have to be
removed.
...
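
The splitting and cleaning steps described above can be tried out on a few header lines without opening any file. The sketch below uses made-up header values (the accession number, dates, and CIK are placeholders, not real data), and the function name parse_header is hypothetical.

```python
import re

# Synthetic header lines; the values are placeholders for illustration.
header = (
    "ACCESSION NUMBER:\t\t0000000000-20-000001\n"
    "CONFORMED PERIOD OF REPORT:\t20191031\n"
    "FILED AS OF DATE:\t\t20200101\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t0000000001\n"
)

def parse_header(header_text):
    """Turn 'LABEL:<tabs>value' lines into a dictionary with compact keys."""
    data = {}
    for line in header_text.split('\n'):
        if ':' in line:
            # Split on the first colon only, in case the value contains one.
            key, val = line.split(':', 1)
            # Drop spaces and tabs from the label, e.g. 'FILED AS OF DATE'
            # becomes 'FILEDASOFDATE'.
            key = ''.join(re.split(r'\W+', key))
            # Keep only word characters and hyphens in the value.
            val = ''.join(re.findall(r'[-\w]+', val))
            data[key] = val
    return data

info = parse_header(header)
print(info)
```

Printing the whole dictionary once for a real filing is a convenient way to see which labels are available before picking the ones you need.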

The code will generate the following result, which can be used to match Compustat data:
Chapter 18

Sentiment Analysis

In this chapter we use the Natural Language Toolkit (NLTK) (https://www.nltk.org/) to do
a simple sentiment analysis. There are many different methods or algorithms for analyzing
sentiment (for recent reviews see e.g. Zhang et al., 2019). We will look at one of them:
VADER (Valence Aware Dictionary and sEntiment Reasoner, https://github.com/cjhutto/vaderSentiment).


18.1 VADER Basics

VADER was originally developed to analyze social media content, although it is generally
applicable to other domains (Hutto and Gilbert, 2014). It uses a lexicon of more than 7,500
words, each of which has a sentiment score between −4 and 4. The sign of a score indicates
polarity (negative/positive) and the magnitude indicates intensity. Below are three examples:

risk         -1.1  0.7     [-1, -1, 0, -2, -1, 0, -2, -2, -1, -1]
brisk         0.6  0.8     [0, 0, 0, 0, 1, 1, 0, 2, 0, 2]
liabilities  -0.8  0.9798  [-1, 2, -1, -1, -1, -1, -1, -1, -2, -1]

The word “risk” is rated by 10 human subjects (those numbers in square brackets) as
negative (mean score: −1.1, standard deviation: 0.7); “brisk” slightly positive (mean 0.6);
and “liabilities” slightly negative (mean −0.8).
Researchers usually use the compound score as a uni-dimensional measure of the sentiment
of any given text. It is calculated by summing the scores of all words, adjusted according to
some enhancement rules, and then normalized (−1 most extreme negative, +1 most extreme
positive). A normalized score of 0.05 or above is considered positive sentiment, between
−0.05 and 0.05 neutral, and −0.05 or lower negative. For example, the sentence “The book
was good” is given a compound score of 0.4404, so it has a positive sentiment.
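
To make the normalization and the cutoffs concrete, here is a small sketch. It assumes that the normalization takes the form x / sqrt(x² + α) with α = 15 and that the word “good” carries a lexicon score of 1.9; both are assumptions about VADER’s defaults rather than facts stated in this chapter, but together they reproduce the 0.4404 quoted above. The function names normalize and label are hypothetical.

```python
import math

ALPHA = 15  # normalization constant (assumed to match VADER's default)

def normalize(raw_score, alpha=ALPHA):
    """Squash a summed score into the range (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

def label(compound):
    """Apply the usual cutoffs of 0.05 and -0.05 to a compound score."""
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

# 1.9 is the assumed lexicon score of "good", the only scored word
# in "The book was good".
compound = round(normalize(1.9), 4)
print(compound, label(compound))
```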

18.2 Using VADER

To use VADER in your textual analysis, install NLTK first.

Ubuntu

sudo apt install python3-nltk

Windows and Mac

Follow the instructions on the NLTK website (https://www.nltk.org/install.html).

Now let’s do some sentiment analysis by using the two 10-K reports we’ve downloaded,
cleaned, and extracted in the previous chapters. Our objective is to calculate a compound
score for each report and export the result to a CSV file.

Listing 18.1: edgar-sentiment.py


 1 #### Required modules
 2 import re
 3 import csv
 4 import nltk
 5 # nltk.download('vader_lexicon')
 6 from nltk.sentiment import SentimentIntensityAnalyzer
 7 #### Initial setting
 8 HtmlFiles = ['0001564590-20-004075.txt.clean.mda', '0001437749-20-001224.txt.clean.mda']
 9 KeywordPatterns = [r'\brisk\b', r'\brisks\b', r'\brisky\b']
10 SaveData = 'edgar-sentiment.csv'
11 #### Function
12 def search_keywords(KeywordPatterns, Text):
13     Cnt = 0
14     for Kp in KeywordPatterns:
15         Cnt = Cnt + len(re.findall(Kp, Text, flags=re.I))
16     return Cnt
17 #### Function
18 def count_words(Text):
19     return len(re.findall(r'[^\W\d]+', Text))
20 ####
21 VarList = ['FILENAME', 'WORDS', 'KEYWORDS', 'SENTIMENT']
22 VarValues = dict()
23 for Var in VarList:
24     VarValues[Var] = ''
25 ####
26 with open(SaveData, 'w', newline='') as Df:
27     Writer = csv.DictWriter(Df, fieldnames=VarList, lineterminator='\n', quoting=csv.QUOTE_NONNUMERIC)
28     Writer.writeheader()
29     for HtmlFile in HtmlFiles:
30         VarValues['FILENAME'] = HtmlFile
31         with open(HtmlFile, 'r') as Fr:
32             Text = Fr.read()
33         Analyzer = SentimentIntensityAnalyzer()
34         Scores = Analyzer.polarity_scores(Text)
35         VarValues['SENTIMENT'] = Scores['compound']
36         VarValues['WORDS'] = count_words(Text)
37         VarValues['KEYWORDS'] = search_keywords(KeywordPatterns, Text)
38         ### Write to CSV data file
39         Writer.writerow(VarValues)
40 #### End

Here is how the code works:

1. Import required modules


...
4. Import the nltk package.
5. Download the required VADER lexicon. Note that you only need to download it once.
You can comment it out afterwards.
6. Import the module for VADER sentiment analysis.
...
8. Specify the two files we want to analyze.
9. Define the patterns to search for “risk” and related words. Note that the boundary \b
is important, otherwise you will also match words like “brisk”.
...
33. This line creates an instance of the SentimentIntensityAnalyzer class and assigns it to
the variable Analyzer .
34. We use the polarity_scores() method of the Analyzer object. It returns a dictionary
of scores.
35. We use the key compound to get the related score.
...
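
The effect of the word boundary \b mentioned above is easy to verify on a made-up sentence. The sketch below counts matches with and without the boundary; the sample sentence is invented for this purpose.

```python
import re

text = "A brisk market reduces risk. Risky assets carry risks."

# With boundaries, only the stand-alone word "risk" is matched.
with_boundary = re.findall(r'\brisk\b', text, flags=re.I)

# Without boundaries, "brisk", "Risky", and "risks" are matched as well.
without_boundary = re.findall(r'risk', text, flags=re.I)

# The three keyword patterns from the listing, counted together.
patterns = [r'\brisk\b', r'\brisks\b', r'\brisky\b']
total = sum(len(re.findall(p, text, flags=re.I)) for p in patterns)

print(len(with_boundary), len(without_boundary), total)
```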

Running the code will generate the following result:

Does the result make sense? Recall that the word “risk” has a negative score. The first
report (0001564590-20-004075) has an extreme positive sentiment despite the fact that it
mentions the risk keywords 17 times, many more than the second report (0001437749-20-001224),
which has a strong negative sentiment. A plausible explanation is that the second report is
very short and hence has a higher risk-to-total ratio.
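
The ratio can be worked out from the keyword and word counts reported in Listing 16.2 (17 of 7,155 words for the first report, 6 of 116 for the second). The short sketch below simply does this arithmetic; the variable names are hypothetical.

```python
# (keyword count, total word count) taken from Listing 16.2
reports = {
    '0001564590-20-004075': (17, 7155),
    '0001437749-20-001224': (6, 116),
}

# Share of risk keywords among all words, per report.
ratios = {name: kw / words for name, (kw, words) in reports.items()}

for name, ratio in ratios.items():
    print(name, round(ratio * 100, 2), 'percent')
```

The risk keywords make up roughly a quarter of one percent of the first report but about five percent of the second, which is consistent with the explanation above.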
Chapter 19

Other Learning Materials

NLTK

The book Natural Language Processing with Python (Bird et al., 2019) is a must-read for
anyone who is serious about conducting textual analysis. It introduces many fundamental
concepts in textual analysis and provides a detailed guide on how to use the Natural Language
Toolkit (NLTK) (https://www.nltk.org/).

Python for Social Science

The online book Python for Social Science by Jean Mark Gawron
(https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/index.html) covers some basics
of Python and advanced topics such as visualization and social networks.

Using Python for Text Analysis in Accounting Research

The online book Using Python for Text Analysis in Accounting Research by Vic Anand and
colleagues (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3576098) offers accounting
researchers a slightly different approach to learning and using Python.

Bibliography

Bird, S., Klein, E., and Loper, E. (2019). Natural Language Processing with Python – Analyzing
Text with the Natural Language Toolkit. https://www.nltk.org/book/.

Bonsall IV, S. B., Leone, A. J., Miller, B. P., and Rennekamp, K. (2017). A plain English
measure of financial reporting readability. Journal of Accounting and Economics, 63(2-3):329–357.

Cazier, R. A., McMullin, J. L., and Treu, J. S. (2021). Are lengthy and boilerplate risk factor
disclosures inadequate? An examination of judicial and regulatory assessments of risk
factor language. The Accounting Review, 96(4):131–155.

Hutto, C. J. and Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment
analysis of social media text. In The Eighth International AAAI Conference on Weblogs and
Social Media (ICWSM-14), Ann Arbor, MI.

Li, F. (2008). Annual report readability, current earnings, and earnings persistence. Journal
of Accounting and Economics, 45(2-3):221–247.

Smith, M. and Taffler, R. (1992). Readability and understandability: Different measures
of the textual complexity of accounting narrative. Accounting, Auditing & Accountability
Journal, 5(4):84–98.

Zhang, M. C., Stone, D. N., and Xie, H. (2019). Text data sources in archival accounting
research: Insights and strategies for accounting systems’ scholars. Journal of Information
Systems, 33(1):145–180.
