
Fun with pandas #1

General Comments
Pandas is a general-purpose Swiss Army knife for 2-D tables (think spreadsheets).
Although pandas is built on top of NumPy, the standard Python array-handling library,
pandas was designed with specific use cases in mind, and its semantics are different
enough from generic array-manipulating software that it can be somewhat confusing at
first.
In particular, pandas was designed with time-series data in mind. It is definitely not
limited to that use case, but much of its basic design and defaults make more sense if
you keep that in mind.
For example, in pandas, the most basic data structure is called a Series. This is just an
indexed 1-D array. By default, the index is a zero-based sequence of integers. Here's a
very simple example of using the Series constructor to create a Series from a list:
import pandas as pd

pd.Series(['a', 'b', 'c'])

Out[90]:
0    a
1    b
2    c
dtype: object

The numbers in the left-hand column are the default index values, but the index could just as
well be a list of times (or values of any other hashable Python type), e.g. (only three of the
series' 288 rows are shown):
Out[108]:
time
2022-01-01 17:36:37    a
2022-01-04 19:39:25    b
2022-01-02 11:22:15    c
Name: data, Length: 288, dtype: object

In this case, "data" is the name of the Series object and "time" is the name of its index.
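The code that produced the output above is not shown. As a minimal sketch of how a Series with a named time index could be built, assuming the three timestamps displayed above (the remaining rows of the original 288 are not reproduced, and the variable names `times` and `ts` are hypothetical):

```python
import pandas as pd

# Timestamps copied from the three rows displayed above; the original
# series had 288 rows, which are not reproduced here.
times = pd.to_datetime([
    "2022-01-01 17:36:37",
    "2022-01-04 19:39:25",
    "2022-01-02 11:22:15",
])

# name= labels the Series itself; the index carries its own, separate name.
ts = pd.Series(["a", "b", "c"], index=pd.Index(times, name="time"), name="data")

print(ts.name)        # data
print(ts.index.name)  # time

# Values can be looked up by index label rather than by position.
print(ts.loc[pd.Timestamp("2022-01-04 19:39:25")])  # b
```

Note that the index does not have to be sorted, as the out-of-order timestamps show.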
In a pandas DataFrame, the basic spreadsheet or table object, each column is a Series
object, and all the columns share the same index.
Here's an example, which also demonstrates a handy way to create a test DataFrame
from a NumPy array of random floats:
import numpy as np
import pandas as pd
import string

letters = list(string.ascii_lowercase)  # handy way to get the alphabet as a list
data_array = np.random.random(size=(5, 6))  # create 5x6 array of random floats
df = pd.DataFrame(data_array, columns=letters[:6])
df

Out[109]:
          a         b         c         d         e         f
0  0.242051  0.190953  0.426463  0.666517  0.344178  0.104935
1  0.696263  0.278143  0.749982  0.031070  0.314006  0.941106
2  0.809044  0.213278  0.407564  0.193904  0.236370  0.705355
3  0.654276  0.038093  0.736514  0.704652  0.964255  0.767741
4  0.707839  0.076659  0.650944  0.834884  0.819356  0.334707
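The claim that each column is a Series sharing the frame's index can be checked directly. A short sketch, assuming a frame built the same way as the one above (the values are random, so they will differ from the printed table):

```python
import numpy as np
import pandas as pd

# Rebuild a 5x6 frame of random floats with columns 'a' through 'f'.
df = pd.DataFrame(np.random.random(size=(5, 6)), columns=list("abcdef"))

col = df["a"]
print(type(col).__name__)          # Series
print(col.name)                    # a
print(col.index.equals(df.index))  # True

# Every column carries the same index as the frame itself.
print(all(df[c].index.equals(df.index) for c in df.columns))  # True
```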

To reference a column, for example column 'b' in the above dataframe, the simplest
options are:
df.b

Out[110]:
0    0.190953
1    0.278143
2    0.213278
3    0.038093
4    0.076659
Name: b, dtype: float64

df['b']

Out[111]:
0    0.190953
1    0.278143
2    0.213278
3    0.038093
4    0.076659
Name: b, dtype: float64

Although the first requires less typing, it is also generally less useful, for the following reasons:
1. It doesn't work if the column name contains a space.
2. It doesn't work if you want to refer to the column using a string variable that contains the
column name.
3. It doesn't work for referring to multiple columns.
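The first two limitations can be seen in a small sketch. The frame and its column names here are hypothetical, chosen only to trigger each case; the last part shows one further caveat not listed above, namely a column whose name clashes with an existing DataFrame method:

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, and another
# ("count") collides with an existing DataFrame method.
df = pd.DataFrame({"b": [1, 2], "host name": ["x", "y"], "count": [10, 20]})

# 1. A name with a space is not a valid attribute, but brackets work.
print(df["host name"].tolist())  # ['x', 'y']

# 2. A string variable holding the column name works with brackets only.
col = "b"
print(df[col].tolist())          # [1, 2]
print(hasattr(df, "col"))        # False: df.col looks for a column literally named "col"

# Further caveat: attribute access returns the method, not the column,
# when the name clashes with a DataFrame attribute.
print(callable(df.count))        # True (this is the DataFrame.count method)
print(df["count"].tolist())      # [10, 20]
```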
You can refer to multiple columns as follows:
df[['a', 'c', 'f']]

Out[112]:
          a         c         f
0  0.242051  0.426463  0.104935
1  0.696263  0.749982  0.941106
2  0.809044  0.407564  0.705355
3  0.654276  0.736514  0.767741
4  0.707839  0.650944  0.334707
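One detail worth noting about the list form: the list of names can itself be held in a variable, and a one-element list still returns a DataFrame rather than a Series. A brief sketch, assuming a frame like the one above (the variable names `wanted` and `subset` are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(5, 6)), columns=list("abcdef"))

# The list of column names can live in a variable.
wanted = ["a", "c", "f"]
subset = df[wanted]
print(subset.shape)              # (5, 3)

# A single name gives a Series; a one-element list gives a DataFrame.
print(type(df["b"]).__name__)    # Series
print(type(df[["b"]]).__name__)  # DataFrame
```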

Here's a much weirder example that actually came up while parsing Qualys Log4j
vulnerability data in our titan table. In that case, there was one row in the titan table for
each host but, if a host contained multiple vulnerable files, those were all encoded as a
single string in a "details" column. The code below takes an example of one such string
and parses it into a DataFrame:
# string containing a table: rows are pipe-delimited, columns are tab-delimited
# (adjacent string literals are joined into one string)
s = (
    'PATH\tVERSION\tJNDI_CLASS_STATUS\tBASE_DIR'
    '|/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net'
    '/vds/appserver/glassfish/domains/domain1/lib/log4j-core-2.7.jar'
    '\t2.7\tJNDI CLASS NOT FOUND\t/app/radiantone/backup'
    '|/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net'
    '/vds/lib/log4j-core-2.7.jar'
    '\t2.7\tJNDI CLASS NOT FOUND\t/app/radiantone/backup'
)

# parse the string using a python list comprehension, no pandas here.
# this creates a list of rows (where each row is itself a list of strings)
l = [ss.split("\t") for ss in s.split("|")]
l

Out[114]:
[['PATH', 'VERSION', 'JNDI_CLASS_STATUS', 'BASE_DIR'],
 ['/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net/vds/appserver/glassfish/domains/domain1/lib/log4j-core-2.7.jar',
  '2.7',
  'JNDI CLASS NOT FOUND',
  '/app/radiantone/backup'],
 ['/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net/vds/lib/log4j-core-2.7.jar',
  '2.7',
  'JNDI CLASS NOT FOUND',
  '/app/radiantone/backup']]

# Here's the pandas part
# row zero contains the column names, row 1 onward contains the data
pd.DataFrame(l[1:], columns=l[0])

Out[113]:
                                                PATH VERSION     JNDI_CLASS_STATUS                BASE_DIR
0  /app/radiantone/backup/20220122_1340_ngvds-wc-...     2.7  JNDI CLASS NOT FOUND  /app/radiantone/backup
1  /app/radiantone/backup/20220122_1340_ngvds-wc-...     2.7  JNDI CLASS NOT FOUND  /app/radiantone/backup
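Since the source table had one such string per host, the same parsing could be applied to every row and the results stacked into one table. The following is only a sketch under stated assumptions: the `hosts` frame, its "host" and "details" column names, the host names, the shortened paths, and the `parse_details` helper are all hypothetical stand-ins, not the real titan table.

```python
import pandas as pd

def parse_details(details: str) -> pd.DataFrame:
    """Parse one pipe/tab-delimited details string into a DataFrame."""
    rows = [row.split("\t") for row in details.split("|")]
    # Row zero holds the column names; the remaining rows hold the data.
    return pd.DataFrame(rows[1:], columns=rows[0])

# Hypothetical stand-in for the real table: column names, host names,
# and paths are made up for illustration.
header = "PATH\tVERSION\tJNDI_CLASS_STATUS\tBASE_DIR"
hosts = pd.DataFrame({
    "host": ["host1", "host2"],
    "details": [
        header + "|/opt/a/log4j-core-2.7.jar\t2.7\tJNDI CLASS NOT FOUND\t/opt/a"
               + "|/opt/b/log4j-core-2.7.jar\t2.7\tJNDI CLASS NOT FOUND\t/opt/b",
        header + "|/srv/c/log4j-core-2.7.jar\t2.7\tJNDI CLASS NOT FOUND\t/srv/c",
    ],
})

# Parse each host's string, tag the rows with the host, and stack the results.
parts = [
    parse_details(details).assign(host=host)
    for host, details in zip(hosts["host"], hosts["details"])
]
files = pd.concat(parts, ignore_index=True)

print(files.shape)             # (3, 5)
print(files["host"].tolist())  # ['host1', 'host1', 'host2']
```

Every parsed value stays a string here (for example, VERSION is '2.7', not a float), because the values come straight from `str.split`.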
