Fun With Pandas 1
General Comments
Pandas is a general-purpose Swiss Army knife for 2-D tables (think spreadsheets).
Although pandas is built on top of NumPy, the standard Python array-handling library,
pandas was designed with specific use cases in mind, and its semantics are different
enough from generic array-manipulating software that it can be somewhat confusing at
first.
In particular, pandas was designed with time-series data in mind. It is definitely not
limited to that use case, but much of its basic design and defaults make more sense if
you keep that in mind.
For example, in pandas, the most basic data structure is called a Series. This is just an
indexed 1-d array. By default, the index is a zero-based sequence of integers. Here's a
very simple example of using the Series constructor to create a series from a list:
import pandas as pd
pd.Series(['a', 'b', 'c'])
Out[90]:
0 a
1 b
2 c
dtype: object
The numbers in the left-hand column are the default index values, but the index could just as
well be a list of times (or values of any other hashable Python type), e.g.
Out[108]:
time
2022-01-01 17:36:37 a
2022-01-04 19:39:25 b
2022-01-02 11:22:15 c
In this case, "data" is the name of the Series object and "time" is the name of its index.
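The statement that produced the output above is not shown, so the following is only a minimal sketch of one way to build such a Series. It assumes the timestamps are given as strings; the variable names `index` and `data` are illustrative.

```python
import pandas as pd

# Assumed input: the three timestamps shown in the output above, as strings.
index = pd.DatetimeIndex(
    ["2022-01-01 17:36:37", "2022-01-04 19:39:25", "2022-01-02 11:22:15"],
    name="time",  # the name of the index
)

# name= sets the name of the Series itself.
data = pd.Series(["a", "b", "c"], index=index, name="data")
print(data)
```

Note that the index does not have to be sorted: the timestamps are kept in the order given.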
In a pandas DataFrame, the basic spreadsheet or table object, each column is a Series
object, and all the columns share the same index.
Here's an example, which also demonstrates a handy way to create a test dataframe
from a numpy array of random floats:
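import numpy as np
import pandas as pd
import string
letters = list(string.ascii_lowercase)
data_array = np.random.rand(5, 6)  # assumed shape: 5 rows x 6 columns of random floats
df = pd.DataFrame(data_array, columns=letters[:6])
df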
Out[109]:
a b c d e f
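To make the "each column is a Series sharing the same index" point concrete, here is a small sketch. The frame below is a hypothetical stand-in with three columns, not the exact one used above.

```python
import numpy as np
import pandas as pd

# Hypothetical test frame: 5 rows x 3 columns of random floats.
df = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

col = df["a"]
print(type(col).__name__)          # Series
print(col.name)                    # a
print(col.index.equals(df.index))  # True
```

The column's name is the column label, and its index is the dataframe's index.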
To reference a column, for example column 'b' in the above dataframe, the simplest
options are:
df.b
Out[110]:
0 0.190953
1 0.278143
2 0.213278
3 0.038093
4 0.076659
df['b']
Out[111]:
0 0.190953
1 0.278143
2 0.213278
3 0.038093
4 0.076659
Although the first requires less typing, it is also generally less useful for the following reasons:
1. It doesn't work if the column name contains a space
2. It doesn't work if you want to refer to the column using a string variable that contains the
column name
3. It doesn't work for referring to multiple columns
You can refer to multiple columns as follows:
df[['a', 'c', 'f']]
Out[112]:
a c f
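The three limitations of attribute-style access listed above can be sketched as follows. The frame and its "host name" column are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one column name that contains a space.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "host name": ["x", "y"]})

# 1. A name with a space cannot be written as an attribute; brackets work.
hosts = df["host name"]

# 2. A column name held in a string variable.
col = "b"
b_values = df[col]

# 3. Several columns at once: pass a list of names and get a DataFrame back.
subset = df[["a", "b"]]
print(subset)
```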
Here's a much weirder example that actually came up while parsing Qualys Log4j
vulnerability data in our titan table. In that case, there was one row in the titan table for
each host but, if a host contained multiple vulnerable files, those were all encoded as a
single string in a "details" column. The code below takes an example of one such string
and parses it into a dataframe:
# string containing a table, where rows are pipe-delimited
s = (
    'PATH\tVERSION\tJNDI_CLASS_STATUS\tBASE_DIR|'
    '/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net/vds/appserver/glassfish/domains/domain1/lib/log4j-core-2.7.jar'
    '\t2.7\tJNDI CLASS NOT FOUND\t/app/radiantone/backup|'
    '/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net/vds/lib/log4j-core-2.7.jar'
    '\t2.7\tJNDI CLASS NOT FOUND\t/app/radiantone/backup'
)
# this creates a list of rows (where each row is itself a list of strings)
l = [row.split('\t') for row in s.split('|')]
l
Out[114]:
[['PATH', 'VERSION', 'JNDI_CLASS_STATUS', 'BASE_DIR'],
 ['/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net/vds/appserver/glassfish/domains/domain1/lib/log4j-core-2.7.jar',
  '2.7',
  'JNDI CLASS NOT FOUND',
  '/app/radiantone/backup'],
 ['/app/radiantone/backup/20220122_1340_ngvds-wc-a2p.sys.comcast.net/vds/lib/log4j-core-2.7.jar',
  '2.7',
  'JNDI CLASS NOT FOUND',
  '/app/radiantone/backup']]
# row zero contains the column names, row 1 onward contains the data
pd.DataFrame(l[1:], columns=l[0])
Out[113]:
PATH VERSION JNDI_CLASS_STATUS BASE_DIR
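Since the original table had one such string per host, the same parsing step can be wrapped in a function and applied to every row. The following is a rough sketch under stated assumptions: the function name `parse_details`, the `hosts` frame, and its shortened sample strings are all hypothetical, not the real titan data.

```python
import pandas as pd


def parse_details(details: str) -> pd.DataFrame:
    """Parse a pipe-delimited table whose fields are tab-separated."""
    rows = [row.split("\t") for row in details.split("|")]
    # Row zero holds the column names; the remaining rows hold the data.
    return pd.DataFrame(rows[1:], columns=rows[0])


# Hypothetical stand-in for the host table: one "details" string per host.
hosts = pd.DataFrame(
    {
        "host": ["host1", "host2"],
        "details": [
            "PATH\tVERSION|/opt/a/log4j-core-2.7.jar\t2.7|/opt/b/log4j-core-2.8.jar\t2.8",
            "PATH\tVERSION|/srv/log4j-core-2.7.jar\t2.7",
        ],
    }
)

# Parse each string, then stack the results, keyed by host name.
parsed = {h: parse_details(d) for h, d in zip(hosts["host"], hosts["details"])}
files = pd.concat(parsed, names=["host", "row"])
print(files)
```

The result has one row per vulnerable file, with the host name as the outer level of the index, so `files.loc["host2"]` returns just that host's files.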