Fun With Pandas 1

General Comments
Pandas is a general-purpose Swiss Army knife for 2-D tables (think spreadsheets).
Although pandas is built on top of NumPy, the standard Python array-handling library,
pandas was designed with specific use cases in mind, and its semantics are different
enough from generic array-manipulating software that it can be somewhat confusing at
first.
In particular, pandas was designed with time-series data in mind. It is definitely not
limited to that use case, but much of its basic design and many of its defaults make more
sense if you keep that in mind.
For example, in pandas, the most basic data structure is called a Series. This is just an
indexed 1-D array. By default, the index is a zero-based sequence of integers. Here's a
very simple example of using the Series constructor to create a Series from a list:
import pandas as pd
pd.Series(['a', 'b', 'c'])
Out[90]:
0    a
1    b
2    c
dtype: object
The numbers in the left-hand column are the default index values, but the index could just as
well be a list of times (or any other hashable Python type), e.g.
Out[108]:
time
2022-01-01 17:36:37    a
2022-01-04 19:39:25    b
2022-01-02 11:22:15    c
Name: data, Length: 288, dtype: object
In this case, "data" is the name of the Series object and "time" is the name of its index.
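The code that produced the output above is not shown, so here is a minimal sketch of how a named Series with a named time index could be built. It reuses the three timestamps visible in the output; the variable names `times`, `index`, and `ser` are assumptions, and the original 288-row Series is not reproduced.

```python
import pandas as pd

# The three timestamps visible in the output above; everything else here
# (variable names, a 3-row Series) is an illustrative assumption.
times = pd.to_datetime([
    "2022-01-01 17:36:37",
    "2022-01-04 19:39:25",
    "2022-01-02 11:22:15",
])
index = pd.Index(times, name="time")   # the index gets the name "time"
ser = pd.Series(["a", "b", "c"], index=index, name="data")

print(ser)

# Values are looked up by index label, not by position.
print(ser.loc[pd.Timestamp("2022-01-04 19:39:25")])
```

Note that the index does not have to be sorted: the rows stay in the order they were given.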
In a pandas DataFrame, the basic spreadsheet or table object, each column is a Series
object, and all the columns share the same index.
Here's an example, which also demonstrates a handy way to create a test dataframe
from a NumPy array of random floats:
import numpy as np
import pandas as pd
import string
letters = list(string.ascii_lowercase)  # handy way to get the alphabet as a list
data_array = np.random.random(size=(5, 6))  # create 5x6 array of random floats
df = pd.DataFrame(data_array, columns=letters[:6])
df
Out[109]:
          a         b         c         d         e         f
0  0.242051  0.190953  0.426463  0.666517  0.344178  0.104935
1  0.696263  0.278143  0.749982  0.031070  0.314006  0.941106
2  0.809044  0.213278  0.407564  0.193904  0.236370  0.705355
3  0.654276  0.038093  0.736514  0.704652  0.964255  0.767741
4  0.707839  0.076659  0.650944  0.834884  0.819356  0.334707
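The statement that each column is a Series and that all columns share the dataframe's index can be checked directly. The following is a small sketch that rebuilds a random 5x6 dataframe the same way as above (the values will differ on every run) and inspects one column; the name `col` is an assumption.

```python
import string

import numpy as np
import pandas as pd

letters = list(string.ascii_lowercase)
df = pd.DataFrame(np.random.random(size=(5, 6)), columns=letters[:6])

col = df["a"]

# A single column comes back as a Series ...
print(type(col).__name__)

# ... and its index is the same as the index of the whole dataframe.
print(col.index.equals(df.index))

# The same holds for every column.
print(all(df[c].index.equals(df.index) for c in df.columns))
```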
To reference a column, for example column 'b' in the above dataframe, the simplest
options are:
df.b
Out[110]:
0    0.190953
1    0.278143
2    0.213278
3    0.038093
4    0.076659
Name: b, dtype: float64
df['b']
Out[111]:
0    0.190953
1    0.278143
2    0.213278
3    0.038093
4    0.076659
Name: b, dtype: float64
Although the first requires less typing, it is also generally less useful, for the following reasons:
1. It doesn't work if the column name contains a space.
2. It doesn't work if you want to refer to the column using a string variable that contains the column name.
3. It doesn't work for referring to multiple columns.
You can refer to multiple columns as follows:
df[['a', 'c', 'f']]
Out[112]:
          a         c         f
0  0.242051  0.426463  0.104935
1  0.696263  0.749982  0.941106
2  0.809044  0.407564  0.705355
3  0.654276  0.736514  0.767741
4  0.707839  0.650944  0.334707
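The three limitations of attribute-style access listed earlier can be sketched with a small example. The dataframe `df2` and its column names (including `"my col"`) are hypothetical, chosen only to trigger each case.

```python
import pandas as pd

# Hypothetical frame; the column names are chosen to show each limitation.
df2 = pd.DataFrame({"a": [1, 2], "my col": [3, 4], "c": [5, 6]})

# 1. A name containing a space is not a valid attribute name, so use brackets.
spaced = df2["my col"]

# 2. A column name held in a string variable also needs brackets:
#    df2.col_name would look for a column literally called "col_name".
col_name = "c"
by_variable = df2[col_name]
try:
    df2.col_name
    attr_lookup_failed = False
except AttributeError:
    attr_lookup_failed = True

# 3. A list of names selects several columns and returns a DataFrame.
subset = df2[["a", "c"]]

print(spaced.tolist(), by_variable.tolist(), attr_lookup_failed)
print(subset.columns.tolist())
```

Bracket access works in all three cases, which is why it is usually the safer default.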
Here's a much weirder example that actually came up while parsing Qualys Log4j
vulnerability data in our titan table. In that case, there was one row in the titan table for
each host but, if a host contained multiple vulnerable files, those were all encoded as a
single string in a "details" column. The code below takes an example of one such string
and parses it into a dataframe:
# string containing a table, where rows are pipe-delimited
# and columns are tab-delimited
s = (
    'PATH\tVERSION\tJNDI_CLASS_STATUS\tBASE_DIR'
    '|/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net'
    '/vds/appserver/glassfish/domains/domain1/lib/log4j-core-2.7.jar'
    '\t2.7\tJNDI CLASS NOT FOUND\t/app/radiantone/backup'
    '|/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net'
    '/vds/lib/log4j-core-2.7.jar'
    '\t2.7\tJNDI CLASS NOT FOUND\t/app/radiantone/backup'
)
# parse the string using a Python list comprehension, no pandas here.
# this creates a list of rows (where each row is itself a list of strings)
l = [ss.split("\t") for ss in s.split("|")]
l
Out[114]:
[['PATH', 'VERSION', 'JNDI_CLASS_STATUS', 'BASE_DIR'],
 ['/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net/vds/appserver/glassfish/domains/domain1/lib/log4j-core-2.7.jar',
  '2.7',
  'JNDI CLASS NOT FOUND',
  '/app/radiantone/backup'],
 ['/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net/vds/lib/log4j-core-2.7.jar',
  '2.7',
  'JNDI CLASS NOT FOUND',
  '/app/radiantone/backup']]
# Here's the pandas part
# row zero contains the column names, row 1 onward contains the data
pd.DataFrame(l[1:], columns=l[0])
Out[113]:
                                                PATH VERSION     JNDI_CLASS_STATUS                BASE_DIR
0  /app/radiantone/backup/20220122_1340_ngvds-wc-...     2.7  JNDI CLASS NOT FOUND  /app/radiantone/backup
1  /app/radiantone/backup/20220122_1340_ngvds-wc-...     2.7  JNDI CLASS NOT FOUND  /app/radiantone/backup
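Since the source table had one such string per host, the same parsing step could be applied to every row and the results stacked into one long table. The following is only a sketch under assumptions: the `hosts` frame, its `host` column, the shortened two-column strings, and the `parse_details` helper are all hypothetical stand-ins, not the real titan table or its schema.

```python
import pandas as pd


def parse_details(details: str) -> pd.DataFrame:
    """Parse a pipe/tab-delimited string whose first row holds the column names."""
    rows = [row.split("\t") for row in details.split("|")]
    return pd.DataFrame(rows[1:], columns=rows[0])


# Hypothetical stand-in for the source table: one row per host,
# with made-up paths in the same pipe/tab-delimited layout.
hosts = pd.DataFrame({
    "host": ["host-1", "host-2"],
    "details": [
        "PATH\tVERSION|/opt/x/log4j-core-2.7.jar\t2.7|/opt/y/log4j-core-2.7.jar\t2.7",
        "PATH\tVERSION|/srv/z/log4j-core-2.8.jar\t2.8",
    ],
})

# Parse each host's string, tag the parsed rows with the host name,
# and stack everything into a single dataframe with one row per file.
parts = [
    parse_details(d).assign(host=h)
    for h, d in zip(hosts["host"], hosts["details"])
]
files = pd.concat(parts, ignore_index=True)
print(files)
```

The result has one row per vulnerable file rather than one row per host, which is usually easier to filter and count.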
