
Copy of Summary of the Chapter "Working with Missing Values"

Metrics for Evaluating Traffic Sources
Theory
There are many kinds of traffic sources:
Search results (organic traffic)
Contextual ads
Email newsletters
Social media
Click-throughs from other websites
The reason for introducing performance metrics for a traffic source is to
evaluate and compare traffic sources to identify which of them is best.
Knowing how traffic sources perform helps to quickly manage your marketing
strategy.
There are two methods for calculating website conversion. The first involves
calculating the percentage of website visitors who perform target actions.
The second calculates the percentage of visits in which a target action is
performed.
A visit is a sequence of actions a visitor takes, starting when they get to a
website and ending when they’ve been inactive for 30 minutes.
Another important metric for evaluating a traffic source is the repeat
purchase rate. This metric calculates the ratio of users who make at least two
purchases to the number of users who make at least one purchase.
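The metrics above can be sketched in pandas. This is a minimal illustration, not a prescribed method: the visit log, its column names (user_id, purchases), and the reading of the second conversion method as "visits that contain a target action" are all assumptions.

```python
import pandas as pd

# Hypothetical visit log: one row per visit; 'purchases' is the number
# of target actions (purchases) made during that visit.
visits = pd.DataFrame({
    'user_id':   [1, 1, 2, 3, 3, 3, 4],
    'purchases': [1, 1, 0, 1, 0, 0, 0],
})

# Total purchases made by each visitor.
purchases_per_user = visits.groupby('user_id')['purchases'].sum()

# Method 1: share of visitors who performed at least one target action.
buyers = (purchases_per_user >= 1).sum()
visitor_conversion = buyers / visits['user_id'].nunique()

# Method 2 (assumed reading): share of visits that contain a target action.
visit_conversion = (visits['purchases'] > 0).mean()

# Repeat purchase rate: users with at least two purchases divided by
# users with at least one purchase.
repeat_buyers = (purchases_per_user >= 2).sum()
repeat_purchase_rate = repeat_buyers / buyers

print(visitor_conversion, visit_conversion, repeat_purchase_rate)
```

In this made-up log, two of the four visitors buy something, and one of those two buys twice.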
Practice
In pandas, you can perform arithmetic operations on columns: addition,
subtraction, multiplication, and division. For example:
data['column1'] = data['column2'] + data['column3']
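As a small runnable illustration of column arithmetic, the sketch below divides one column by another to get a conversion rate per traffic source. The table and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical per-source totals; the column names are illustrative.
data = pd.DataFrame({
    'source':    ['organic', 'email', 'social'],
    'visits':    [200, 50, 100],
    'purchases': [10, 5, 2],
})

# Dividing one column by another works element by element
# and the result can be stored as a new column.
data['conversion'] = data['purchases'] / data['visits']

print(data)
```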



User IDs and cookies
Theory
A counter collects data about user behavior on web pages and sends it to a
web analytics system, such as Yandex.Metrica. This counter is several lines of
the website’s code. The counter collects general information about users:
which traffic source they come from, which pages users view, and purchases
they make. Counters assign a unique number to each user that differentiates
them from other users. This is called a user ID.
Data from these counters is called raw data. Web analytics systems convert
this raw data into reports on audience, traffic, and sources. Reports allow you
to combine different metrics and visualize results.
Cookies are used to identify a user who returns to a website. Cookies are text
files left in your device’s memory after your first visit to a website and sent to
the server when you visit again.
But if a user visits a website from different browsers, the same user is
assigned different user IDs because each browser has its own cookies. That’s
why we collect additional information, like email addresses. The user ID and
email address are encrypted to protect personal data.
Text files that contain information about website visits are called logs.
When new information is added to data that has already been collected in a
particular way, this is called data enrichment. A dataset, however, may
contain missing information. Sometimes these missing values can be ignored,
but other times they need to be processed and filled in for analysis.
Practice
The unique() method is used to find the unique values in a column:
data['column'].unique()

To delete rows with missing values, call the dropna() method. To renumber
the remaining rows, call the reset_index() method with the argument drop=True.
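A minimal sketch of these three methods on a made-up table (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing traffic source.
data = pd.DataFrame({
    'source': ['email', 'organic', np.nan, 'email'],
    'visits': [3, 5, 2, 1],
})

# unique() lists each distinct value once; NaN is reported as a value too.
unique_sources = data['source'].unique()

# dropna() removes rows that contain missing values. The surviving rows
# keep their old labels (0, 1, 3), so reset_index(drop=True) renumbers
# them from 0 and discards the old index instead of keeping it as a column.
clean = data.dropna().reset_index(drop=True)

print(unique_sources)
print(clean)
```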

You discovered NaN and None


Theory
NaN and None indicate that there is no value in a cell. NaN means a number is
missing from a cell. NaN is a float data type, which means that you can
perform mathematical operations on it. None is a NoneType object, and you
can’t perform mathematical operations on it. NaN values can lead to incorrect
results when grouping data. Don’t rush to delete rows with these values:
missing values can often be restored.
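The difference between NaN and None, and one way NaN can distort grouped results, can be sketched as follows. The small table is made up; the grouping behaviour shown is the pandas default, in which rows whose group key is NaN are left out.

```python
import math

import numpy as np
import pandas as pd

# NaN is a float, so arithmetic works, but the result is still NaN.
nan = float('nan')
result = nan + 1
print(type(nan), math.isnan(result))

# None is a NoneType object, so arithmetic raises a TypeError.
try:
    None + 1
except TypeError as err:
    error_message = str(err)
print(error_message)

# Hypothetical table: by default groupby() skips rows whose key is NaN,
# so the grouped total is smaller than the column total.
df = pd.DataFrame({'source': ['email', np.nan, 'email'],
                   'visits': [1, 2, 3]})
grouped = df.groupby('source')['visits'].sum()
print(grouped.sum(), df['visits'].sum())
```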
Practice
In pandas, the value_counts() method returns unique values and their counts.
The isnull() method returns a Boolean Series in which True means that a value
is missing in the column.
To substitute a value for a missing one, use the fillna() method with the
value argument.
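A short sketch of the three methods together, using a made-up column and the placeholder 'unknown' as the replacement value (both are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
data = pd.DataFrame({'source': ['email', 'organic', np.nan, 'email', np.nan]})

# value_counts() counts each unique value; NaN is excluded by default.
counts = data['source'].value_counts()

# isnull() marks missing cells with True; summing the mask counts them.
missing_mask = data['source'].isnull()
n_missing = missing_mask.sum()

# fillna() substitutes the given value for every missing cell.
data['source'] = data['source'].fillna(value='unknown')

print(counts)
print(n_missing)
print(data)
```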

Categorical and quantitative variables


There are two types of variables: categorical and quantitative. A categorical
variable takes one value from a limited set of values, while a quantitative
variable takes any numeric value from a range. Unlike categorical variables,
quantitative variables can be compared by magnitude.
Variables can also be logical (Boolean), meaning they indicate whether a
statement is true or false. If a statement is true, the variable takes the value 1.
If a statement is false, it takes the value 0.

Working with missing values in categorical variables
Theory
Before processing missing values, you have to determine whether there is a
pattern in the missing values. That is, you have to decide whether their
appearance in the data set is random.
There are three types of missing values:
—Missing completely at random (MCAR) means that the likelihood of missing
values doesn’t depend on any other values. In a questionnaire, for example,
whether an answer is missing doesn’t depend on the nature of the question or
on other questions in the questionnaire. The missing value can be restored by name.
—Missing at random (MAR) means the likelihood of missing values depends
on other values in the data set, not on values in the column itself. A missing
value can be due to a category not existing, for example.
—Missing not at random (MNAR) means the likelihood of missing values
depends on other values, including values in the column itself. A missing value
depends on the nature of the question as well as the variable’s value in another
column.
Practice
There are several ways to replace missing categorical values. One is to
replace them with a default value. This option works well for values that are
missing at random. That’s what the pandas fillna() method is for.
Not all empty values can be filled with the fillna() method, though. The method
only recognizes true missing values: NaN and the Python None object. It can’t
be used when missing data is recorded as text, such as the string 'None'. To
replace such values, use the loc indexer. Boolean indexing allows you to select
all rows containing 'None' in the desired column and replace them with a new
value.
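A sketch of the loc approach, assuming the None values are stored as the text 'None' rather than as true missing values. The table and the replacement value 'unknown' are made up for illustration.

```python
import pandas as pd

# Hypothetical column in which missing sources were recorded as text.
data = pd.DataFrame({'source': ['email', 'None', 'organic', 'None']})

# fillna() leaves the text 'None' alone: to pandas it is an ordinary string.
untouched = data['source'].fillna('unknown')

# Boolean indexing: the condition selects the rows, the second argument
# names the column, and the assignment overwrites only those cells.
data.loc[data['source'] == 'None', 'source'] = 'unknown'

print(untouched.tolist())
print(data['source'].tolist())
```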
The agg() method is used to apply one or more functions to particular columns.
The column names and the functions are passed in a data structure
called a dictionary. A dictionary is made up of key-value pairs. The key is
the name of the column the functions must be applied to, while the value is the
list of function names.
{'column':['function1','function2']}

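After you use the agg() method with a list of functions, column names become
two-level: the first level is the original column name and the second is the
function name. To refer to the result of applying 'function1' to 'column',
simply write them one after the other:
data['column']['function1']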
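A minimal sketch of agg() with a dictionary on grouped data. The table, the grouping column, and the choice of 'count' and 'sum' as functions are assumptions for the example.

```python
import pandas as pd

# Hypothetical visit counts per traffic source.
data = pd.DataFrame({
    'source': ['email', 'email', 'organic', 'organic'],
    'visits': [10, 20, 5, 15],
})

# Key: the column to aggregate. Value: the list of function names.
stats = data.groupby('source').agg({'visits': ['count', 'sum']})

# The resulting columns have two levels: ('visits', 'count') and
# ('visits', 'sum'). Index by column name first, then by function name.
visit_sums = stats['visits']['sum']

print(stats)
print(visit_sums)
```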



Working with missing values in quantitative variables
Theory
Missing values in quantitative variables are filled with representative values.
These values characterize the data set selected for analysis. To
estimate typical values in the data set, use the mean or the median.
The mean is the sum of all values divided by the number of values.
The median is the midpoint of a sorted data set, meaning half the elements
fall above it and the other half fall below it. If there is an even number of
values, the median is the mean of the two middle values.
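A small worked example of both definitions, using Python's standard statistics module and made-up numbers. It also shows why the median is often the safer choice when a data set contains an extreme value.

```python
from statistics import mean, median

# Made-up samples: one with an odd and one with an even number of values.
odd = [1, 3, 100]
even = [1, 3, 5, 100]

# Mean: sum of the values divided by their count. The outlier 100 pulls it up.
mean_odd = mean(odd)        # (1 + 3 + 100) / 3

# Median with an odd count: the single middle value of the sorted data.
median_odd = median(odd)    # 3

# Median with an even count: the mean of the two middle values.
median_even = median(even)  # (3 + 5) / 2

print(mean_odd, median_odd, median_even)
```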
Practice
To get the mean, use the mean() method. It can be applied to an entire table, a
column, or grouped data.


The median() method is used, as the name suggests, to find the median. The
method can be applied to a table, a column, or grouped data.
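The sketch below applies mean() and median() to a column, a whole table, and grouped data, then fills a missing value with its group's median. The table is made up, and filling by group median is only one possible choice of representative value, not something the chapter prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical time spent on the site, with one missing value.
data = pd.DataFrame({
    'source': ['email', 'email', 'email', 'organic', 'organic'],
    'time_spent': [10.0, np.nan, 30.0, 5.0, 7.0],
})

# Column: missing values are skipped when computing the statistics.
column_mean = data['time_spent'].mean()
column_median = data['time_spent'].median()

# Whole table: one median per numeric column.
table_median = data.median(numeric_only=True)

# Grouped data: one median per traffic source.
group_median = data.groupby('source')['time_spent'].median()

# Fill each missing value with the median of its own group.
# transform('median') returns a Series aligned with the original rows.
per_row_median = data.groupby('source')['time_spent'].transform('median')
data['time_spent'] = data['time_spent'].fillna(per_row_median)

print(column_mean, column_median)
print(group_median)
print(data)
```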
