
Assignment 5

Introduction to Data Science

Subhasis Ray∗

November 19, 2023

Note
This is an individual assignment. You may not consult your peers or AI
tools to do these tasks except where explicitly asked to. Any misconduct
will result in a score of 0 for the entire assignment, and will be noted and
reported to the Academic Integrity Committee.

Submit to CodePost: (1) all code in a single Python script, with a comment
indicating the problem number before the solution to each problem, and (2)
a Word document containing the plots under headings indicating the problem
numbers.
General rules for plots:
• Always label your axes. When multiple axes share the same
axis, you can label it once. Your label should indicate the
unit where applicable.
• If there are multiple axes in the same figure, provide a
meaningful title for each axes.
• If there are multiple plots in the same axes, provide a legend.
• Provide a colorbar when making plots with color-coded values
(colormap).
• Ensure legibility of your plots (e.g., adjust marker sizes so
that they do not overlap, control spacing between axes so
the ticklabels are visible, customize and rotate ticklabels,
etc.).
• Title the figure.
• Briefly describe your observations from the plots.

∗ subhasis.ray@plaksha.edu.in

1 Introduction
This dataset is from Loghub, also available on Zenodo, and the related article
is:

Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R.
Lyu. Loghub: A Large Collection of System Log Datasets for
AI-driven Log Analytics. IEEE International Symposium on
Software Reliability Engineering (ISSRE), 2023. URL:
https://arxiv.org/abs/2008.06448.

Specifically, you are going to use the Android Application Framework
log for this assignment.

2 Loading the data


1. Logs generated by computer programs follow some format. Inspect
the log file and the description in the README.md file and summarize
what the columns in the log represent. Then write code to parse the
log into a pandas DataFrame. Hint: You can use a regular expression
as the separator in pandas.read_csv, or load every line as a single row
and then use pandas.Series.str.extract with a regular expression;
https://regex101.com/ is a possible resource for testing your regular
expression. You can also extract the columns using plain Python string
processing, which may be more readable. Do not worry about the corner
cases with Chinese or Japanese characters. 7 marks
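The second approach in the hint (one row per line, then pandas.Series.str.extract) can be sketched as follows. This is only an illustration under assumptions, not a reference solution: the line layout in the comment, the column names, the placeholder year, the helper names parse_log_lines and load_log, and the sample lines are all hypothetical and must be checked against README.md and the actual log.

```python
import pandas as pd

# ASSUMED line layout (verify against README.md and the log itself):
#   MM-DD HH:MM:SS.mmm  PID  TID LEVEL Component: Content
LOG_PATTERN = (
    r"^(?P<date>\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<tid>\d+)\s+"
    r"(?P<level>[A-Z])\s+"
    r"(?P<component>[^:]+?):\s?"
    r"(?P<content>.*)$"
)

# Hypothetical sample lines, made up for illustration only.
SAMPLE_LINES = [
    "03-17 16:13:38.811  1702  2395 D WindowManager: relayout visible=true",
    "not a log line",
    "03-17 16:13:39.005  1702  8671 I ActivityManager: Start proc 123",
]


def parse_log_lines(lines, year=2000):
    """Parse raw log lines into a DataFrame; lines that do not match are dropped."""
    raw = pd.Series(list(lines), dtype=object)
    # Each named group in the pattern becomes one column.
    df = raw.str.extract(LOG_PATTERN)
    df = df.dropna(subset=["date", "time"]).reset_index(drop=True)
    # The assumed layout has no year, so an arbitrary placeholder year is used.
    df["timestamp"] = pd.to_datetime(
        str(year) + "-" + df["date"] + " " + df["time"],
        format="%Y-%m-%d %H:%M:%S.%f",
    )
    df["pid"] = df["pid"].astype(int)
    df["tid"] = df["tid"].astype(int)
    return df


def load_log(path, year=2000):
    """Read a log file line by line and parse it (path is supplied by the caller)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return parse_log_lines((line.rstrip("\n") for line in fh), year=year)
```

Calling parse_log_lines(SAMPLE_LINES) gives a two-row DataFrame, because the line that does not fit the assumed layout is dropped. Counting how many lines are dropped on the real log is a quick way to see whether the pattern needs adjusting.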

3 Explore the data


1. Plot a histogram of the events by the hour of day. Make sure the tick
labels are clearly visible; rotate the labels if necessary. 5 marks

2. Which are the top five programs in these log messages? Plot the share
of each of these, with all remaining programs grouped together as the
category other, in a pie chart. 5 marks
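The aggregation behind both plots in this section can be sketched as below; the plotting itself is left out. This is an illustration under assumptions: the helper names events_per_hour and top_n_with_other are hypothetical, and which column identifies a "program" in the parsed log is for you to decide.

```python
import pandas as pd


def events_per_hour(timestamps):
    """Count events for each hour of day (0-23), keeping hours with no events."""
    hours = pd.Series(pd.to_datetime(timestamps)).dt.hour
    return hours.value_counts().reindex(range(24), fill_value=0)


def top_n_with_other(labels, n=5):
    """Counts of the n most frequent labels, with the remainder summed as 'other'."""
    counts = pd.Series(labels).value_counts()  # sorted, most frequent first
    top = counts.iloc[:n]
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"other": rest})])
    return top
```

The first result has one entry per hour and can be passed to a bar plot; the second can be passed to a pie chart, which normalizes the counts to shares. Keeping empty hours in the first result means the hour axis stays complete even when the log covers only part of the day.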

4 Hypothesis testing
1. Do you notice a difference in the number of events between two time
windows in this data? Form a testable hypothesis about this and conduct
a hypothesis test. Summarize your results and make boxplots to compare
the event rates / counts in these time windows. 5 marks

2. If you plot the event times of the top 10 processes, 'SendBroadcastPermission'
and 'ActivityManager' often seem to occur close together in time.
Compute the event rates of these two processes in 5 minute intervals, and
then make a scatter plot of the former against the latter. What do
you think of their relationship? Fit a linear regression model, print the
summary of the model, and plot the residuals. What does the residual
plot suggest? 8 marks
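The computations behind the two problems above can be sketched as follows, under stated assumptions. For problem 1 a permutation test on the difference in means is used as one possible choice; the problem does not prescribe a test, and others may suit your hypothesis better. For problem 2 the fit uses numpy.polyfit, which returns only the coefficients; a printed model summary needs a statistics library such as statsmodels. All function names here are hypothetical.

```python
import numpy as np
import pandas as pd


def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means of two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign observations to the two groups at random
        diff = pooled[: a.size].mean() - pooled[a.size:].mean()
        hits += abs(diff) >= abs(observed)
    # The add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)


def counts_per_interval(timestamps, freq="5min"):
    """Number of events in consecutive fixed-width time bins."""
    ts = pd.DatetimeIndex(pd.to_datetime(timestamps))
    return pd.Series(1, index=ts).sort_index().resample(freq).sum()


def paired_counts(ts_x, ts_y, freq="5min"):
    """Binned counts of two event streams aligned on shared bin start times.

    Bins that fall in a gap between two non-overlapping streams are not added.
    """
    both = pd.concat(
        {"x": counts_per_interval(ts_x, freq), "y": counts_per_interval(ts_y, freq)},
        axis=1,
    )
    return both.fillna(0)


def fit_line(x, y):
    """Least-squares fit y = slope * x + intercept; also returns the residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return slope, intercept, residuals
```

For problem 1, the two samples could be per-minute counts from counts_per_interval(..., freq="1min") in each window. For problem 2, plotting the residuals from fit_line against x is what shows whether any structure remains after the fit.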
