
DATACAMP CHAPTER 2

Data Sources
Data science for everyone
Course Instructor
Anam Shahid
Data sources
Previously, you learned about the data science workflow. In this chapter, we'll focus on the first step: data collection and
storage.

The data science workflow


Before we can start deriving insights from data, we first need to collect the data from different sources.

Sources of data
We are generating vast amounts of data on a daily basis simply by surfing the internet, tracking a run, or paying by card in
a shop. The companies behind the services we use collect this data internally and use it to make data-driven decisions. On
the other hand, there are also many free, open data sources available, meaning the data can be freely used, shared, and
built on by anyone. Note that sometimes companies share parts of their data with the wider public as well. Let's first take
a look at company data sources.

A. Company data
Some of the most common company sources of data are web events, survey data, customer data, logistics data, and
financial transactions.

1. Web data
When you visit a web page or click on a link, this information is usually tracked by companies in order to calculate
conversion rates or monitor the popularity of different pieces of content. For each event, the following information is
captured: the name of the event, which could be the URL of the page visited or an identifier for the element that was
clicked; the timestamp of the event; and an identifier for the user who performed the action.
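As a rough sketch, a single web event could be stored as a record like this (the field names and values are illustrative, not any specific company's schema):

```python
# One web event as a Python dictionary; field names are illustrative.
web_event = {
    "event": "click:signup-button",       # URL visited or element clicked
    "timestamp": "2024-03-03T14:25:07Z",  # when the action happened
    "user_id": "8f2b-41c7",               # who performed the action
}
```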
2. Survey data
Data can also be collected by asking people for their opinions in surveys. This can be, for example, in the form of a face-
to-face interview, online questionnaire, or a focus group.

3. Net Promoter Score


You've likely answered this kind of question before. It is a very common type of survey data used by companies: the Net
Promoter Score, or NPS, which asks how likely a user is to recommend a product to a friend or colleague, typically on a
scale from 0 to 10.
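As a minimal sketch, the score is conventionally computed by classifying the 0-10 responses into promoters (9-10) and detractors (0-6) and taking the difference as a percentage:

```python
# NPS = % promoters (scores 9-10) minus % detractors (scores 0-6).
def net_promoter_score(scores):
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

print(net_promoter_score([10, 9, 8, 7, 3, 10, 6]))  # ≈ 14.29
```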

B. Open data
There are multiple ways to access open data. Two of them are APIs and public records.

1. Public data APIs


Let's begin with APIs. API stands for Application Programming Interface. It's an easy way of requesting data from a third
party over the internet. Many companies have public APIs to let anyone access their data. Some notable APIs include
Twitter, Wikipedia, Yahoo! Finance, and Google Maps, but there are many, many more.

Tracking a hashtag
Let's look at an example of the Twitter API. Suppose we want to track Tweets with the hashtag #DataFramed,
DataCamp's wonderful podcast on data science. We can use the Twitter API to request all Tweets with this hashtag. At this
point, we have many options for analysis. We could perform a sentiment analysis on the text of each Tweet and get an
idea of how much people like our podcast. We could simply track how often #DataFramed appears each week. We could
also combine this data with our downloads data and see if positive Tweets are correlated with more downloads.
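A minimal sketch of such a request with Python's requests library is shown below. It assumes the Twitter v2 recent-search endpoint and a bearer token obtained from Twitter's developer portal; the API's endpoints and terms have changed over time, so treat the details as illustrative:

```python
import requests

BEARER_TOKEN = "your-token-here"  # placeholder credential

# Ask the (illustrative) Twitter v2 search endpoint for recent tweets
# containing the hashtag #DataFramed.
response = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={"query": "#DataFramed"},
)
for tweet in response.json().get("data", []):
    print(tweet["text"])  # raw tweet text, ready for sentiment analysis
```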
2. Public records
Public records are another great way of gathering data. They can be collected and shared by international organizations
like the World Bank, the UN, or the WTO; by national statistical offices, who use census and survey data; or by government
agencies, who make information about, for example, the weather, environment, or population publicly available. In the US,
data.gov has health, education, and commerce data available for free download. In the EU, data.europa.eu offers similar
data.

Data types
You now know where to collect data. But what does that data look like? In this topic we'll talk about the different types of
data.

Why care about data types?


You might wonder why it's important to know what type of data you have collected. This will be essential later on in the
data science process. For instance, it's especially relevant when you want to store the data, which we'll talk about in the
next topic, as not all types of data can be stored in the same place. Furthermore, when you're visualizing or analyzing the
data, it's important to know the type of data you are dealing with. Not all visualizations or analyses can be performed with
all data types. So, let's dive in.

Quantitative vs qualitative data


There are two general types of data: qualitative and quantitative. It's important to understand the key differences
between the two. Quantitative data can be counted, measured, and expressed using numbers. Qualitative data is
descriptive and conceptual; it can be observed but not measured. Now that we know the differences, let's dive into each
type with a real-world example.
1. Quantitative data
Quantitative data can be expressed in numbers. For example, the fridge is 60 inches tall, has two apples in it, and costs
1000 dollars.

2. Qualitative data
Qualitative data, on the other hand, are things that can be observed but not measured like: the fridge is red, was built in
Italy, and might need to be cleaned out because it smells like fish.

Other data types


Other than the traditional quantitative and qualitative data, there are many other data types that are becoming more and
more important: image data, text data, geospatial data, network data, and many more. Note that these other data types
aren't mutually exclusive with quantitative and qualitative data; often they are a mix of the two. Let's look at some
examples.

1. Other data types: Image data


Digital images are everywhere. An image is made up of pixels, and these pixels contain information about color and
intensity. Typically, the pixels are stored in computer memory, and if you zoom in far enough on an image, you can
distinguish the individual pixels.
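For instance, with the Pillow imaging library you can open an image and inspect individual pixels; the file name here is a placeholder:

```python
from PIL import Image  # the Pillow imaging library

img = Image.open("photo.png").convert("RGB")  # placeholder file name
print(img.size)              # (width, height) in pixels
print(img.getpixel((0, 0)))  # e.g. (255, 0, 0): red, green, blue intensities
```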

2. Other data types: Text data


Emails, documents, reviews, social media posts, and so on. As you can imagine, text data can be found in many places.
This data can be stored and analyzed to find relevant insights.
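As a toy illustration of analyzing text data, here is a word count over a couple of made-up reviews:

```python
from collections import Counter

# Two made-up reviews; real text data would come from emails,
# documents, social media posts, and so on.
reviews = [
    "great podcast with great guests",
    "the episode on data pipelines was great",
]
words = " ".join(reviews).split()
print(Counter(words).most_common(2))  # [('great', 3), ('podcast', 1)]
```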

3. Other data types: Geospatial data


Geospatial data are data with location information. Many different types of information can be captured using geospatial
data: for a specific region, we can keep track of where the roads, buildings, and vegetation are. This is especially useful
for navigation apps like Waze and Google Maps.
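A single geospatial record is often represented in the GeoJSON format: a geometry with (longitude, latitude) coordinates plus descriptive properties. A sketch with made-up values:

```python
# One GeoJSON-style feature; coordinates and properties are made up.
building = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [4.8952, 52.3702]},  # lon, lat
    "properties": {"kind": "building", "name": "Example Office"},
}
```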
4. Other data types: Network data
Network data consists of the people or things in a network, depicted by circles, and the relationships between them,
depicted by lines. In a social network, for example, you can easily see who knows whom.
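A small sketch of such a social network with the networkx library (the names are made up):

```python
import networkx as nx  # a common Python library for network data

# People are nodes (circles); "knows" relationships are edges (lines).
social = nx.Graph()
social.add_edges_from([("Ana", "Ben"), ("Ben", "Cara"),
                       ("Ana", "Cara"), ("Cara", "Dev")])
print(list(social.neighbors("Cara")))  # who Cara knows: ['Ben', 'Ana', 'Dev']
```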

Recap
In this topic, we looked at the most common data types: quantitative data, qualitative data, image data, text data,
geospatial data, and network data. These can all serve as inputs for your data science analysis. But before doing
that, the data needs to be stored. That's what we'll cover in the next topic.

Data storage and retrieval


Previously in this chapter, you learned about different data sources and data types.

The data science workflow


Now, let's discuss efficient ways of storing and retrieving the data that was collected. As you can see this is still part of the
first step in the data science workflow we defined before.

Things to consider when storing data


When storing data there are multiple things to take into consideration. First, we need to determine where we want to
store the data (location). Then, we need to know what kind of data we are storing (data type). And lastly, we need to
consider how we can retrieve our data from storage. Let's take a closer look.
A. Location:
1. Parallel storage solutions
Data science projects can require large amounts of data, at which point the data probably can't be stored on a single
computer anymore. In order to make sure that all data is saved and easy to access, it is stored across many different
computers. Large companies often have their own set of storage computers, called a "cluster" or a "server", on premises.

2. The cloud
Alternatively, you could pay another company to store data for you. This is referred to as “cloud storage”. Common
cloud storage providers include Microsoft Azure, Amazon Web Services, or AWS, and Google Cloud. These services
provide more than just data storage; they can also help your organization with data analytics, machine learning, and deep
learning. For now, we’ll just focus on data storage.

B. Types of data storage:


Different types of data require different storage solutions. Some data is unstructured, like email, text, video and audio
files, web pages, and social media messages. This type of data is often stored in a type of database called a Document
Database.

More commonly, data can be expressed as tables of information, like what you might find in a spreadsheet. A database
that stores information in tables is called a Relational Database. Both of these types of databases can be found on the
cloud storage providers that were mentioned earlier.
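To make the distinction concrete, here is the same customer sketched both ways; all field names are illustrative:

```python
# Document database: one flexible, nested document per customer.
customer_doc = {
    "name": "Ana Lopez",
    "address": {"street": "12 Main St", "state": "Montana"},
    "orders": [{"id": 1, "total": 40.0}, {"id": 2, "total": 12.5}],
}

# Relational database: flat rows with fixed columns, e.g. a customers
# table row of (customer_id, name, street, state).
customer_row = (101, "Ana Lopez", "12 Main St", "Montana")
```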

C. Retrieval: Data querying


Once data has been stored in a Document Database or a Relational Database, we’ll need to access it. At a basic level,
we’ll want to be able to request a specific piece of data, such as “All of the images that were created on March 3rd” or
“All of the customer addresses in Montana”. In addition, we might even want to do some analysis, such as summing,
counting, or averaging data.

Each type of database has its own query language: Document Databases mainly use NoSQL, while Relational
Databases mainly use SQL. SQL stands for "Structured Query Language" and NoSQL stands for "Not only SQL".
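As a minimal sketch of querying a relational database, here is the Montana example using Python's built-in SQLite module; the customers table and its columns are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (name TEXT, state TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ana", "Montana"), ("Ben", "Ohio"), ("Cara", "Montana")])

# "All of the customers in Montana", plus a simple count.
rows = conn.execute(
    "SELECT name FROM customers WHERE state = 'Montana'").fetchall()
total = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE state = 'Montana'").fetchone()[0]
print(rows, total)  # [('Ana',), ('Cara',)] 2
```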
Data Pipelines
Let’s learn about data pipelines. So far we've learned about data collection and storage, but how can we scale all this? This
is where data pipelines come in.

Data collection and storage


Data engineers work to collect and store data so that others, like analysts and data scientists, can access it for their
work, whether that's visualization or building machine learning models.

How do we scale?
But how do we scale this? Consider the different data sources you learned about - what if we're collecting data from more
than one data source? And then, what if these data sources have different types of data? For example, consider real-time
streaming data, which is data that is continuously being generated, like tweets from all around the world. This makes
storing this incoming data complicated, because as a data engineer, you want to make sure data is organized and easy to
access.

What is a data pipeline?


Enter the data pipeline. A data pipeline moves data through defined stages, for example, from data ingestion through an
API to loading data into a database. A key feature is that pipelines automate this movement. Data is constantly coming in,
and it would be tedious to ask a data engineer to manually run programs to collect and store it. Instead, a data engineer
schedules tasks to run hourly or daily, or has them triggered by an event. Because of this automation, data pipelines
need to be monitored. Luckily, alerts can be generated automatically, for example, when 95% of storage capacity has been
reached or when an API is responding with an error. Data pipelines aren't necessary for all data science projects, but they
are when working with a lot of data from different sources. There isn't a set way to make a pipeline; pipelines are highly
customized depending on your data, storage options, and ultimate usage of the data. ETL, which stands for extract,
transform, and load, is a popular framework for data pipelines. Let's explore it with a case study.
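Before the case study, here is a minimal, hand-rolled ETL sketch; the tweet records, filtering rule, and in-memory database are all illustrative:

```python
import sqlite3

def extract():
    # In reality: pull raw records from an API; hard-coded here.
    return [{"text": "Loving #DataFramed!", "lang": "en"},
            {"text": "hola", "lang": "es"}]

def transform(records):
    # Keep only English records and normalize the text.
    return [(r["text"].lower(),) for r in records if r["lang"] == "en"]

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS tweets (text TEXT)")
    conn.executemany("INSERT INTO tweets VALUES (?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM tweets").fetchall())  # [('loving #dataframed!',)]
```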

Case study: smart home


1. Extract

2. Transform and Load

3. Automation
Once we've set up all those steps, we automate. For example, we can say that every time we get a tweet, we transform it in
a certain way and store it in a specific table in our database. There are tools specialized for this; the most popular is
called Airflow.
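A hedged sketch of what such a scheduled pipeline could look like in Airflow 2.x; the DAG name, schedule, and task functions are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...             # placeholder: fetch new tweets
def transform_and_load(): ...  # placeholder: clean and store them

with DAG(dag_id="tweet_pipeline",        # made-up pipeline name
         start_date=datetime(2024, 1, 1),
         schedule="@hourly",             # run every hour
         catchup=False) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform_and_load",
                        python_callable=transform_and_load)
    t1 >> t2  # extract must finish before transform-and-load runs
```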

Reference link

https://campus.datacamp.com/courses/data-science-for-everyone/data-collection-and-storage-2?ex=1
