Data Analysis

You might also like

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 17

Example 1: Get started

Enter basic data:

1. Begin with a new spreadsheet.


2. Select cell A2.
3. Enter your first name.
4. Select cell B2.
5. Enter your last name.
Adjust the size of rows and columns:

To make the text fit in the rows and columns, adjust their sizes. Use either of the following
methods:

1. If your name is longer than the width of the column, select and drag the right edge of the
corresponding column until it fits.
2. To wrap text, select the cells, columns, or rows with text that you want to reformat.
3. Select the Format menu.
4. Under Wrapping, select Wrap.

Example 2: Add labels


Add labels, or attributes, to help you keep track of the data:

1. Select cell A1.


2. Enter First Name.
3. Select cell B1.
4. Enter Last Name.
5. Select cells A1 and B1. To do this, select a single cell and drag your cursor over to the
other cell to include it in the selection.
6. From the toolbar, select the bold icon.

Example 3: Add more attributes and data


Add more attributes and data to your spreadsheet:

1. Select cell C1 and enter Siblings.


2. Select cell D1 and enter Favorite Color.
3. Select cell E1 and enter Favorite Dessert.
4. Select all three cells and make them bold by selecting the bold icon from the toolbar.
5. Adjust the columns to fit the new text.
6. Enter the corresponding data in cells C2, D2, and E2 (your number of siblings, favorite
color, and favorite dessert).
7. Add data about two more people in rows 3 and 4. These can be people you know or
people you’ve just made up.
Example 4: Organize your data
One way to organize your data is by sorting it.

1. Select all columns that contain data. There are a few ways to select multiple cells:
a. To select nonadjacent cells and/or cell ranges, hold the Command (Mac) or Ctrl
(PC) key and select the cells.
b. To select a range of cells, hold the Shift key and either drag your cursor over
which cells you want to include or use the arrow keys to select a range.
c. Select a single cell and drag your cursor over the cells you want to include in your
selection.
2. Select the Data menu.
3. Select Sort range, then select Advanced range sorting options.
4. In the Advanced range sorting options window, select the checkbox for Data has
header row. Make sure that A to Z is selected.
5. Select the Sort by drop-down menu, then select Siblings.
6. Select Sort. This will organize the spreadsheet by the number of siblings, from lowest to
highest.

Example 5: Use a formula


Spreadsheets enable data professionals to analyze data. In this example, the instructor uses a
formula to calculate a sum.

1. Select the next empty cell in the Siblings column (C5).


2. Enter the formula =C2+C3+C4.
3. Press Enter on your keyboard to complete the formula.
4. The formula calculates the total number of siblings.

Terms and definitions for Course 1, Module 3


Attribute: A characteristic or quality of data used to label a column in a table

Function: A preset command that automatically performs a specified process or task using the
data in a spreadsheet

Observation: The attributes that describe a piece of data contained in a row of a table

Oversampling: The process of increasing the sample size of nondominant groups in a


population. This can help you better represent them and address imbalanced datasets

Self-reporting: A data collection technique where participants provide information about


themselves

ust as humans use different languages to communicate with others, so do computers. Structured
Query Language (or SQL, often pronounced “sequel”) enables data analysts to talk to their
databases. SQL is one of the most useful data analyst tools, especially when working with large
datasets in tables. It can help you investigate huge databases, track down text (referred to as
strings) and numbers, and filter for the exact kind of data you need—much faster than a
spreadsheet can.

If you haven’t used SQL before, this reading will help you learn the basics so you can appreciate
how useful SQL is and how useful SQL queries are in particular. You will be writing SQL
queries in no time at all.

What is a query?
A query is a request for data or information from a database. When you query databases, you
use SQL to communicate your question or request. You and the database can always exchange
information as long as you speak the same language.

Every programming language, including SQL, follows a unique set of guidelines known as
syntax. Syntax is the predetermined structure of a language that includes all required words,
symbols, and punctuation, as well as their proper placement. As soon as you enter your search
criteria using the correct syntax, the query starts working to pull the data you’ve requested from
the target database.

The syntax of every SQL query is the same:

 Use SELECT to choose the columns you want to return.


 Use FROM to choose the tables where the columns you want are located.
 Use WHERE to filter for certain information.
A SQL query is like filling in a template. You will find that if you are writing a SQL query from
scratch, it is helpful to start a query by writing the SELECT, FROM, and WHERE keywords in
the following format:

SELECT

FROM

WHERE

Next, enter the table name after the FROM; the table columns you want after the SELECT; and,
finally, the conditions you want to place on your query after the WHERE. Make sure to add a
new line and indent when adding these, as shown below:

SELECT Specifies the columns from which to retrieve data


FROM Specifies the table from which to retrieve data
WHERE Specifies criteria that the data must meet
Following this method each time makes it easier to write SQL queries. It can also help you make
fewer syntax errors.

Example of a query
Here is how a simple query would appear in BigQuery, a data warehouse on the Google Cloud
Platform.

SELECT first_name

FROM customer_data.customer_name

WHERE first_name = 'Tony'

The above query uses three commands to locate customers with the first_name, 'Tony':

1. SELECT the column named first_name


2. FROM a table named customer_name (in a dataset named customer_data) (The dataset
name is always followed by a dot, and then the table name.)
3. But only return the data WHERE the first_name is 'Tony'
The results from the query might be similar to the following:

first_name
Tony
Tony
Tony
As you can conclude, this query had the correct syntax, but wasn't very useful after the data was
returned.

Multiple columns in a query


Of course, as a data professional, you will need to work with more data beyond customers named
Tony. Multiple columns that are chosen by the same SELECT command can be indented and
grouped together.

If you are requesting multiple data fields from a table, you need to include these columns in your
SELECT command. Each column is separated by a comma as shown below:

SELECT

ColumnA,

ColumnB,

ColumnC

FROM

Table where the data lives

WHERE

Certain condition is met

Here is an example of how it would appear in BigQuery:

4
5

SELECT

customer_id,

first_name,

last_name

FROM

customer_data.customer_name

WHERE

first_name = 'Tony'

The above query uses three commands to locate customers with the first_name, 'Tony'.

1. SELECT the columns named customer_id, first_name, and last_name


2. FROM a table named customer_name (in a dataset named customer_data) (The dataset
name is always followed by a dot, and then the table name.)
3. But only return the data WHERE the first_name is 'Tony'
The only difference between this query and the previous one is that more data columns are
selected. The previous query selected first_name only while this query selects customer_id and
last_name in addition to first_name. In general, it is a more efficient use of resources to select
only the columns that you need. For example, it makes sense to select more columns if you will
actually use the additional fields in your WHERE clause. If you have multiple conditions in your
WHERE clause, they may be written like this:

9
10

SELECT

ColumnA,

ColumnB,

ColumnC

FROM

Table where the data lives

WHERE

Condition 1

AND Condition 2

AND Condition 3

Notice that unlike the SELECT command that uses a comma to separate fields / variables /
parameters, the WHERE command uses the AND statement to connect conditions. As you
become a more advanced writer of queries, you will make use of other connectors / operators
such as OR and NOT.

Here is a BigQuery example with multiple fields used in a WHERE clause:

10

SELECT

customer_id,

first_name,

last_name
FROM

customer_data.customer_name

WHERE

customer_id > 0

AND first_name = 'Tony'

AND last_name = 'Magnolia'

The above query uses three commands to locate customers with a valid (greater than 0),
customer_id whose first_name is 'Tony' and last_name is 'Magnolia'.

1. SELECT the columns named customer_id, first_name, and last_name


2. FROM a table named customer_name (in a dataset named customer_data) (The dataset
name is always followed by a dot, and then the table name.)
3. But only return the data WHERE customer_id is greater than 0, first_name is Tony, and
last_name is Magnolia.
Note that one of the conditions is a logical condition that checks to see if customer_id is greater
than zero.

If only one customer is named Tony Magnolia, the results from the query could be:

customer_id first_name last_name


1967 Tony Magnolia
If more than one customer has the same name, the results from the query could be:

customer_id first_name last_name


1967 Tony Magnolia
7689 Tony Magnolia

Key takeaways
The SELECT, FROM, and WHERE clauses are the essential building blocks of SQL queries.
Queries with multiple fields will become simpler after you practice writing your own SQL
queries later in the program.
Step 1: Explore the data for patterns
First, you ask your manager or the data owner for access to the current sales records and website
analytics reports. This includes information about how customers behave on the company’s
existing website, basic information about who visited, who bought from the company, and how
much they bought.

While reviewing the data you notice a pattern among those who visit the company’s website
most frequently: geography and larger amounts spent on purchases. With further analysis, this
information might explain why sales are so strong right now in the northeast—and help your
company find ways to make them even stronger through the new website.

Step 2: Plan your visuals


Next it is time to refine the data and present the results of your analysis. Right now, you have a
lot of data spread across several different tables, which isn’t an ideal way to share your results
with management and the marketing team. You will want to create a data visualization that
explains your findings quickly and effectively to your target audience. Since you know your
audience is sales oriented, you already know that the data visualization you use should:

 Show sales numbers over time


 Connect sales to location
 Show the relationship between sales and website use
 Show which customers fuel growth
Step 3: Create your visuals
Now that you have decided what kind of information and insights you want to display, it is time
to start creating the actual visualizations. Keep in mind that creating the right visualization for a
presentation or to share with stakeholders is a process. It involves trying different visualization
formats and making adjustments until you get what you are looking for. In this case, a mix of
different visuals will best communicate your findings and turn your analysis into the most
compelling story for stakeholders. So, you can use the built-in chart capabilities in your
spreadsheets to organize the data and create your visuals.
1) line charts can track sales over time 2) maps can connect sales to locations 3) donut charts can
show customer segments 4) bar charts can compare total visitors that make a purchase

Build your data visualization toolkit


There are many different tools you can use for data visualization.

 You can use the visualizations tools in your spreadsheet to create simple visualizations
such as line and bar charts.
 You can use more advanced tools such as Tableau that allow you to integrate data into
dashboard-style visualizations.
 If you’re working with the programming language R you can use the visualization tools
in RStudio.
Your choice of visualization will be driven by a variety of drivers including the size of your data,
the process you used for analyzing your data (spreadsheet, or databases/queries, or programming
languages). For now, just consider the basics.
Spreadsheets (Microsoft Excel or Google Sheets)
In our example, the built-in charts and graphs in spreadsheets made the process of creating
visuals quick and easy. Spreadsheets are great for creating simple visualizations like bar graphs
and pie charts, and even provide some advanced visualizations like maps, and waterfall and
funnel diagrams (shown in the following figures).

But sometimes you need a more powerful tool to truly bring your data to life. Tableau and
RStudio are two examples of widely used platforms that can help you plan, create, and present
effective and compelling data visualizations.

Visualization software (Tableau)


Tableau is a popular data visualization tool that lets you pull data from nearly any system and
turn it into compelling visuals or actionable insights. The platform offers built-in visual best
practices, which makes analyzing and sharing data fast, easy, and (most importantly) useful.
Tableau works well with a wide variety of data and includes an interactive dashboard that lets
you and your stakeholders click to explore the data interactively.

You can start exploring Tableau from the How-to Video resources. Tableau Public is free, easy
to use, and full of helpful information. The Resources page is a one-stop-shop for how-to videos,
examples, and datasets for you to practice with. To explore what other data analysts are sharing
on Tableau, visit the Viz of the Day page where you will find beautiful visuals ranging from the
Hunt for (Habitable) Planets to Who’s Talking in Popular Films.

Programming language (R with RStudio)


A lot of data analysts work with a programming language called R. Most people who work with
R end up also using RStudio, an integrated developer environment (IDE), for their data
visualization needs. As with Tableau, you can create dashboard-style data visualizations using
RStudio.
Check out their website to learn more about RStudio.

You could easily spend days exploring all the resources provided at RStudio.com, but the
RStudio Cheatsheets and the RStudio Visualize Data Primer are great places to start. When you
have more time, check out the webinars and videos which offer advice and helpful perspectives
for both beginners and advanced users.

Key takeaways
The best data analysts use lots of different tools and methods to visualize and share their data. As
you continue learning more about data visualization throughout this course, be sure to stay
curious, research different options, and continuously test new programs and platforms to help
you make the most of your data.
Que

Previously, you learned that part of a data professional’s responsibility is to make certain that
their analysis is fair. Fairness means ensuring your analysis doesn't create or reinforce bias. This
can be challenging, but if the analysis is not objective, the conclusions can be misleading and
even harmful. In this reading, you’re going to explore some best practices you can use to guide
your work toward a more fair analysis!
Consider fairness
Following are some strategies that support fair analysis:

Best practice Explanation Example


Part of your job as a data analyst is to determine A state’s Department of Transportation is
what data is going to be useful for your analysis. in measuring traffic patterns on holidays.
Often there will be data that isn’t relevant to what they only include metrics related to traffic
Consider all of
you’re focusing on or doesn’t seem to align with and the fact that the days are holidays. Bu
the available
your expectations. But you can’t just ignore it; it’s team realizes they failed to consider how
data
critical to consider all of the available data so that on these holidays might also affect traffic
your analysis reflects the truth and not just your own Considering this additional data helps them
expectations. more complete insights.
A human resources department wants to b
for employee vacation time in order to ant
As you’ll learn throughout these courses, context is staffing needs. HR uses a list of national b
key for you and your stakeholders to understand the holidays as a key part of the data-gatherin
Identify
final conclusions of any analysis. Similar to But they fail to consider important holiday
surrounding
considering all of the data, you also must understand aren’t on the bank calendar, which introdu
factors
surrounding factors that could influence the insights against employees who celebrate them. It
you’re gaining. HR less useful results because bank holid
not necessarily apply to their actual emplo
population.
A data analyst is working on a project for
Self-reporting is a data collection technique where and-mortar retailer. Their goal is to learn
participants provide information about themselves. about their customer base. This data analy
Self-reported data can be a great way to introduce they need to consider fairness when they c
fairness in your data collection process. People bring data; they decide to create a survey so tha
Include self- conscious and unconscious bias to their observations customers can self-report information abo
reported data about the world, including about other people. Using themselves. By doing that, they avoid bias
self-reporting methods to collect data can help avoid might be introduced with other demograp
these observer biases. Additionally, separating self- collection methods. For example, if they h
reported data from other data you collect provides associates report their observations about
important context to your conclusions! customers, they might introduce any unco
bias the employees had to the data.
When collecting data about a population, it’s A fitness company is releasing new digita
important to be aware of the actual makeup of that for users of their equipment. They are inte
population. Sometimes, oversampling can help you designing content that appeals to different
Use represent groups in that population that otherwise knowing that different people may interac
oversampling wouldn’t be represented fairly. Oversampling is the their equipment in different ways. For exa
effectively process of increasing the sample size of part of their user-base is age 70 or older. I
nondominant groups in a population. This can help represent these users, they oversample the
you better represent them and address imbalanced data. That way, decisions they make abou
datasets. fitness content will be more inclusive.
Think about To ensure that your analysis and final conclusions A data team kicks off a project by includi
Best practice Explanation Example
fairness measures in their data-collection
These measures include oversampling the
are fair, be sure to consider fairness from the earliest population and using self-reported data. H
stages of a project to when you act on the data they fail to inform stakeholders about thes
fairness from
insights. This means that data collection, cleaning, measures during the presentation. As a res
beginning to end
processing, and analysis are all performed with stakeholders leave with skewed understan
fairness in mind. the data. Learning from this experience, th
key information about fairness considerat
future stakeholder presentations.

Key takeaways
As a data professional, you will need to ensure you always consider fairness. This will allow you
to avoid creating or reinforcing bias or accidentally drawing misleading conclusions. Using these
best practices can help guide your analysis and make you a better data professional!
Data analyst roles and job descriptions
As technology continues to advance, being able to collect and analyze the data from that new
technology has become a huge competitive advantage for a lot of businesses. Everything from
websites to social media feeds are filled with fascinating data that, when analyzed and used
correctly, can help inform business decisions. A company’s ability to thrive now often depends
on how well it can leverage data, apply analytics, and implement new technologies.

This is why skilled data analysts are some of the most sought-after professionals in the world. A
study conducted by IBM estimates that there are over 380,000 job openings in the Data Analytics
field in the United States*. Because the demand is so strong, you’ll be able to find job
opportunities in virtually any industry. Do a quick search on any major job site and you’ll notice
that every type of business from zoos, to health clinics, to banks are seeking talented data
professionals. Even if the job title doesn’t use the exact term “data analyst,” the job description
for most roles involving data analysis will likely include a lot of the skills and qualifications
you’ll gain by the end of this program. In this reading, we’ll explore some of the data analyst-
related roles you might find in different companies and industries.

* Burning Glass data, Feb 1, 2021 - Jan 31, 2022, US

Decoding the job description


The data analyst role is one of many job titles that contain the word “analyst.”

To name a few others that sound similar but may not be the same role:

 Business analyst—analyzes data to help businesses improve processes, products, or


services
 Data analytics consultant—analyzes the systems and models for using data
 Data engineer—prepares and integrates data from different sources for analytical use
 Data scientist—uses expert skills in technology and social science to find trends through
data analysis
 Data specialist—organizes or converts data for use in databases or software systems
 Operations analyst—analyzes data to assess the performance of business operations and
workflows
Data analysts, data scientists, and data specialists sound very similar but focus on different tasks.
As you start to browse job listings online, you might notice that companies’ job descriptions
seem to combine these roles or look for candidates who may have overlapping skills. The fact
that companies often blur the lines between them means that you should take special care when
reading the job descriptions and the skills required.

The table below illustrates some of the overlap and distinctions between them:
Title: Decoding the job description data analysts: -problem solving: Use existing tools and methods
to solve problems with existing types of data -analysis: Analyze collected data to help stakeholders
make better decisions -other relevant skills: database queries, data visualization, dashboards,
reports and spreadsheets data scientists: -problem solving: Invent new tools and models, ask open-
ended questions, and collect new types of data -analysis: Analyze and interpret complex data to
make business predictions -other relevant skills: advanced statistics, machine learning, deep
learning, data optimization, and programming data specialists: -problem solving: Use in-depth
knowledge of databases as a tool to solve problems and manage data -analysis: Organize large
volumes of data for use in data analytics or business operations -other relevant skills: data
manipulation, information security, data models, scalability of data, and disaster recovery

We used the role of data specialist as one example of many specializations within data analytics,
but you don’t have to become a data specialist! Specializations can take a number of different
turns. For example, you could specialize in developing data visualizations and likewise go very
deep into that area.

Job specializations by industry


We learned that the data specialist role concentrates on in-depth knowledge of databases. In
similar fashion, other specialist roles for data analysts can focus on in-depth knowledge of
specific industries. For example, in a job as a business analyst you might wear some different
hats than in a more general position as a data analyst. As a business analyst, you would likely
collaborate with managers, share your data findings, and maybe explain how a small change in
the company’s project management system could save the company 3% each quarter. Although
you would still be working with data all the time, you would focus on using the data to improve
business operations, efficiencies, or the bottom line.

Other industry-specific specialist positions that you might come across in your data analyst job
search include:

 Marketing analyst—analyzes market conditions to assess the potential sales of products


and services
 HR/payroll analyst—analyzes payroll data for inefficiencies and errors
 Financial analyst—analyzes financial status by collecting, monitoring, and reviewing
data
 Risk analyst—analyzes financial documents, economic conditions, and client data to help
companies determine the level of risk involved in making a particular business decision
 Healthcare analyst—analyzes medical data to improve the business aspect of hospitals
and medical facilities

Key takeaways
Explore data analyst job descriptions and industry-specific analyst roles. You will start to get a
better sense of the different data analyst jobs out there and which types of roles you’re most
interested to go after.

You might also like