Professional Documents
Culture Documents
Step 1: Ask Questions
Step 1: Ask Questions
Step 1: Ask Questions
Either you're given data and ask questions based on it, or you ask questions first and gather data based
on that later. In both cases, great questions help you focus on relevant parts of your data and direct
your analysis towards meaningful insights.
Step 2: Wrangle data
You get the data you need in a form you can work with in three steps: gather, assess, clean. You gather
the data you need to answer your questions, assess your data to identify any problems in your data’s
quality or structure, and clean your data by modifying, replacing, or removing data to ensure that your
dataset is of the highest quality and as well-structured as possible.
Step 3: Perform EDA (Exploratory Data Analysis)
You explore and then augment your data to maximize the potential of your analyses, visualizations, and
models. Exploring involves finding patterns in your data, visualizing relationships in your data, and
building intuition about what you’re working with. After exploring, you can do things like remove
outliers and create better features from your data, also known as feature engineering.
Step 4: Draw conclusions (or even make predictions)
This step is typically approached with machine learning or inferential statistics.
Step 5: Communicate your results
You often need to justify and convey meaning in the insights you’ve found. Or, if your end goal is to build
a system, you usually need to share what you’ve built, explain how you reached design decisions, and
report how well it performs. There are many ways to communicate your results: reports, slide decks,
blog posts, emails, presentations, or even conversations. Data visualization will always be very valuable.
Data Types
Categorical Nominal data do not have an order or ranking (like the breeds of the dog
Continuous vs. Discrete
• There is a lot going on in this video - here is a recap of the big ideas.
• Rows and Columns
• If you aren't familiar with spreadsheets, this will be covered in detail
in future lessons. Spreadsheets are a common way to hold data. They
are composed of rows and columns. Rows run horizontally, while
columns run vertically. Each column in a spreadsheet commonly holds
a specific variable, while each row is commonly called an instance or
individual.
• The example used in the video is shown below.
Date Day of Week Time Spent On Site (X) Buy (Y)
June 15 Thursday 5 No
• Before collecting data, we usually start with a question, or many questions, that we
would like to answer. The purpose of data is to help us in answering these questions.
• Random Variables
• A random variable is a placeholder for the possible values of some process (mostly... the
term 'some process' is a bit ambiguous). As was stated before, notation is useful in that it
helps us take complex ideas and simplify (often to a single letter or single symbol). We see
random variables represented by capital letters (X, Y, or Z are common ways to represent a
random variable).
• We might have the random variable X, which is a holder for the possible values of the
amount of time someone spends on our site. Or the random variable Y, which is a holder
for the possible values of whether or not an individual purchases a product.
• X is 'a holder' of the values that could possibly occur for the amount of time spent on our
website. Any number from 0 to infinity really.
Capital vs. Lower Case Letters
• Random variables are represented by capital letters. Once we observe an outcome of these random variables, we
notate it as a lower case of the same letter.
• Example 1
• For example, the amount of time someone spends on our site is a random variable (we are not sure what the
outcome will be for any particular visitor), and we would notate this with X. Then when the first person visits the
website, if they spend 5 minutes, we have now observed this outcome of our random variable. We would notate
any outcome as a lowercase letter with a subscript associated with the order that we observed the outcome.
• If 5 individuals visit our website, the first spends 10 minutes, the second spends 20 minutes, the third spends 45
mins, the fourth spends 12 minutes, and the fifth spends 8 minutes; we can notate this problem in the following
way:
• X is the amount of time an individual spends on the website.
• x1 \bold{x_1}x1= 10, x2 \bold{x_2}x2= 20 x3 \bold{x_3}x3= 45 x4 \bold{x_4}x4= 12 x5
\bold{x_5}x5= 8.
• The capital X is associated with this idea of a random variable, while the observations of the random variable take
on lowercase x values.
Example 2
• Data integrity is ensured - only the data you want entered is entered, and only
certain users are able to enter data into the database.
• Data can be accessed quickly - SQL allows you to obtain results very quickly from
the data stored in a database. Code can be optimized to quickly pull results.
• Data is easily shared - multiple individuals can access data stored in a database,
and the data is the same for all users allowing for consistent results for anyone
with access to your database.