Plotting with Apache Spark and Python


In this video, we introduce you to matplotlib and to a set of diagram types that are very useful for exploratory data analysis. Matplotlib is the de facto standard plotting library for Python: it is open source and under active development in the Python community. Throughout this course, we will use matplotlib and Python for plotting.
One important aspect of plotting is data size. Plotting libraries run on a single machine and expect a rather small input data set, in the form of vectors or matrices, in order to render the plot. So once you have too much data, you run into main-memory or performance problems. The solution is sampling. Sampling takes only a subset of your original data, but because the values are selected at random, the sample preserves most of the statistical properties of the original data. Sampling also reduces cost, because downstream data processing steps only have to consider a fraction of the data.

So let's start with our first diagram type: box plots. A box plot shows many statistical properties of your data at the same time: the median, the interquartile range, the skew, and outliers. Basically, it looks a bit like the histogram introduced in the last module, viewed from a vertical perspective, and it tells you about the distribution of your data. So let's actually create such a box plot using Apache Spark.
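To see why sampling works, here is a small stand-alone sketch in plain Python (rather than Spark, so it runs anywhere): it draws a roughly 10% random sample from a synthetic data set and checks that the sample mean stays close to the full mean. The data set and the seed are illustrative assumptions.

```python
import random

random.seed(42)  # fixed seed so the result is reproducible

# A synthetic "large" data set: 100,000 voltage-like readings around 230.
population = [random.gauss(230, 5) for _ in range(100_000)]

# Keep each element with probability 0.1 -- the same idea as Spark's sample().
sample = [x for x in population if random.random() < 0.1]

full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(len(sample))                   # roughly 10,000 elements
print(abs(full_mean - sample_mean))  # the two means are very close
```

Even though only a tenth of the data survives, the sample mean lands very close to the population mean, which is exactly the property that makes sampling safe for plotting.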
Let's start with the data frame we already created in week two. Assume we want to gain some insight into how the voltage behaves. Using Spark SQL, we issue a SQL query to get the voltage values. Note that this virtual table also contains data from other sensors, so some rows might not contain a value for voltage. Therefore, we select only the rows which contain a value for voltage. Now we have obtained a data frame containing the voltage values, so let's see what exactly it contains. It seems to be a list of values of type bigint. But wait: the values are somehow wrapped in Row objects, so let's get rid of those. As previously mentioned, data frames are wrappers around RDDs. So we now access the wrapped RDD and use the RDD API in order to extract the values contained in the Row wrapper objects. Again, we use a lambda function to obtain the individual Row wrapper objects, and inside the lambda function we can directly access the wrapped value. Let's check whether this works by looking at the first ten results.

Now we apply the most important function of this week, so please make sure you understand why we are doing this. We use the sample function in order to obtain a random fraction of the original data. This is not strictly necessary here, but imagine that this data frame could potentially contain trillions of rows on petabytes of data. There is no way to pass such an amount of data to a plotting library, since the plotting code is always executed on a single machine. In this case, we take a random fraction of 10%, but if you really have a lot of data, then 0.01 or 0.001 or even less would be appropriate. As a rule of thumb, no more than about a hundred data points should be plotted. Now, of course, we can call collect on the RDD without any problems, because the sample function returns only a subset of the data. Let's store the resulting array in a variable called result_array in order to show that it is a plain old Python data type instead of an RDD. Then we print the contents to the screen by running the notebook.
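The steps above can be sketched in a few lines. In a real notebook this would go through a SparkSession, spark.sql(...), and the data frame's .rdd; here, as a stand-in that runs without a cluster, the Row objects and the RDD operations are mimicked with plain Python. The Row class, the generated readings, and the 10% fraction are illustrative assumptions, not the course's actual data.

```python
import random
from collections import namedtuple

# Stand-in for pyspark.sql.Row: in Spark, the rows returned by a query such as
# "SELECT voltage FROM washing" would be Row objects like these.
Row = namedtuple("Row", ["voltage"])

random.seed(7)
rows = [Row(voltage=random.choice([None, 220, 230, 235, 240]))
        for _ in range(1000)]

# "WHERE voltage IS NOT NULL" -- keep only rows that contain a voltage value.
rows_with_voltage = [r for r in rows if r.voltage is not None]

# rdd.map(lambda row: row.voltage) -- unwrap the Row objects with a lambda.
voltages = list(map(lambda row: row.voltage, rows_with_voltage))
print(voltages[:10])  # like take(10): peek at the first ten plain integers

# rdd.sample(...) with fraction 0.1, then "collect" into a local list.
result_array = [v for v in voltages if random.random() < 0.1]
print(type(result_array))  # a plain Python list, not an RDD
```

The final result_array is an ordinary Python list, which is exactly what a single-machine plotting library expects.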
This looks fine. This is a subset of all the voltage values coming from the Cloudant NoSQL database, which is based on CouchDB. Now we have a meaningful array of integer values, reflecting the voltage of the power source of a washing machine at different points in time, accessible to the local Spark driver as a plain Python array. So let's plot it. But first, we have to configure the Jupyter notebook to display images generated by matplotlib directly below the code, using the %matplotlib inline magic. This is not Python code but an instruction sent to the IPython kernel running inside the Jupyter notebook. Now we import the plotting library, matplotlib.
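Putting it together, a minimal plotting cell might look like the following. The hard-coded result_array is an illustrative stand-in for the collected sample; in the notebook, %matplotlib inline at the top of the cell would take the place of the explicit backend selection used here to run headlessly.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; in a notebook, %matplotlib inline does this
import matplotlib.pyplot as plt

# Illustrative stand-in for the collected sample of voltage values.
result_array = [221, 222, 223, 223, 224, 225, 225, 226, 228, 230, 245]

fig, ax = plt.subplots()
boxes = ax.boxplot(result_array)  # median line, quartile box, whiskers, outliers
ax.set_ylabel("voltage")

median = boxes["medians"][0].get_ydata()[0]
print(median)  # the median voltage: 225.0 for this data
```

The dictionary returned by boxplot gives programmatic access to the drawn artists, which is a handy way to double-check that the plot shows the statistics you expect.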
