Data Science For Business 2 PDF

Data sources and risks
DATA SCIENCE FOR BUSINESS

Michael Chow
Assessment Research Lead, DataCamp
Common sources of data
Web events

Customer data

Logistics data

Financial transactions


Web data
Events

Timestamps

User information

user_id  event_name      timestamp
1234     homepage_visit  2019-01-01 12:01:01
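
As a rough illustration of how a logged web event like the row above might be turned into a structured record, here is a minimal Python sketch. The `WebEvent` class and `parse_event` function are hypothetical names introduced for this example, and the space-separated layout is assumed from the sample row.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WebEvent:
    """One logged web event (hypothetical structure for illustration)."""
    user_id: int
    event_name: str
    timestamp: datetime


def parse_event(line: str) -> WebEvent:
    """Parse 'user_id event_name YYYY-MM-DD HH:MM:SS' into a WebEvent."""
    # Split into at most three parts so the date and time stay together.
    user_id, event_name, raw_timestamp = line.split(maxsplit=2)
    return WebEvent(
        user_id=int(user_id),
        event_name=event_name,
        timestamp=datetime.strptime(raw_timestamp, "%Y-%m-%d %H:%M:%S"),
    )


event = parse_event("1234 homepage_visit 2019-01-01 12:01:01")
```

Storing the timestamp as a `datetime` rather than a string makes it straightforward to sort events or compute the time between them.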


Personally Identifiable Information (PII)

"Jane Doe" = Personally Identifiable Information (PII)


Data pseudonymization

Restricted access

Audit logs
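
One common way to pseudonymize an identifier is to replace it with a keyed hash, so the same person always maps to the same token but the name itself is not stored. The sketch below assumes this approach; the slides do not say which technique is meant, and `pseudonymize` and `SECRET_KEY` are hypothetical names for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for `value` using a keyed SHA-256 hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key for illustration; a real key would be stored securely
# and only be available to people with restricted access.
SECRET_KEY = b"example-key-not-for-production"

pseudonym = pseudonymize("Jane Doe", SECRET_KEY)
```

Because the mapping depends on the key, anyone holding the key can recompute it. That is why pseudonymized data is usually still treated as personal data, unlike fully anonymized data.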


Data anonymization


General Data Protection Regulation (GDPR)
Applies to personal data of individuals in the EU, even when the organization handling it is based elsewhere

Give individuals control over their personal data

Regulates how long data can be stored

Mandates appropriate anonymization

Disclose data collection and gain consent


Let's practice!
Solicited data

Why do we solicit data?
Create marketing collateral

De-risk decision making

Monitor quality


Types of solicited data
Surveys

Customer reviews

In-app questionnaires

Focus groups


Types of solicited data
Qualitative            Quantitative
Conversations          Multiple choice
Open-ended questions   Rating scale


Revealed and stated preferences
Stated preference   Revealed preference
Hypothetical        Actions
Subjective          Purchasing decisions


Best practices
Be specific

Do this: On a scale of 1 - 5, how would you rate the quality of content on DataCamp?
Not that: How would you rate DataCamp?

Avoid loaded language

Do this: Which of the following political issues is most important to you?
Not that: Which of the following controversial political issues is most important to you?


Best practices
Calibrate

Do this: Rate your interest in each of the following products at DataCamp.
Not that: Are you interested in Skill Assessment at DataCamp?

Require actionable results

Do this: Have a hypothesis for each question.
Not that: Ask a question just because it's interesting.


Let's practice!
Collecting additional data

Even more data
APIs

Public records

Mechanical Turk


Data APIs
Application Programming Interface
Request data over the internet

Examples:
Twitter
Wikipedia
Yahoo! Finance
Google Maps
Many more!
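
To make "request data over the internet" concrete, the sketch below builds a request to Wikipedia's public MediaWiki Action API and decodes the JSON reply. It uses only the Python standard library. The function names `build_summary_url` and `fetch_summary` and the `User-Agent` string are hypothetical choices for this example, and other APIs in the list above have their own endpoints and authentication rules.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def build_summary_url(title: str) -> str:
    """Build a URL asking for the plain-text intro of one Wikipedia page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    return f"{WIKIPEDIA_API}?{urlencode(params)}"


def fetch_summary(title: str) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    request = Request(
        build_summary_url(title),
        headers={"User-Agent": "data-api-example/0.1"},  # hypothetical client name
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)


url = build_summary_url("Data science")
```

Calling `fetch_summary("Data science")` would return the parsed response as a dictionary. Building the URL is kept separate so the request can be inspected without going online.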


Tracking a hashtag
All tweets with #DataFramed (DataCamp's podcast!)

Use Twitter API
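
The Twitter API itself needs credentials and is not shown here. As a small sketch of the tracking step, the code below assumes tweet texts have already been retrieved and simply keeps those that contain the hashtag. The `has_hashtag` function and the sample tweets are made up for illustration.

```python
import re


def has_hashtag(text: str, hashtag: str) -> bool:
    """Return True if `text` contains `hashtag` as a whole tag, ignoring case."""
    # (?!\w) stops "#DataFramed" from matching inside "#DataFramedLive".
    pattern = re.compile(re.escape(hashtag) + r"(?!\w)", re.IGNORECASE)
    return pattern.search(text) is not None


# Hypothetical tweet texts, made up for illustration.
tweets = [
    "New episode of #DataFramed is out!",
    "Learning SQL today",
    "Loved the latest #dataframed interview",
    "Watching #DataFramedLive",
]

matching = [tweet for tweet in tweets if has_hashtag(tweet, "#DataFramed")]
```

Here `matching` holds the first and third tweets. Counting such matches per day is one simple way to follow how much attention a hashtag receives over time.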


Public records
For the US, data.gov

For the EU, data.europa.eu


Building a training set


Mechanical Turk
Resource: AWS MTurk

Label customer reviews

Extract text from a form

Highlight key words in a sentence


Let's practice!
Data storage and retrieval

Parallel storage solutions


The cloud


Types of data storage
Unstructured

Email

Text

Video and audio files

Web pages

Social media

Document Database

Tabular

Customer Name   Customer Address   ...
Jane Doe        123 Maple St.      ...

Relational Database


Data querying
Storage Solution      Query Language
Document Database     NoSQL
Relational Database   SQL
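
To illustrate the difference, the sketch below queries a small relational table with SQL using Python's built-in `sqlite3` module, then filters a handful of free-form records. Real document databases each have their own query interface, so a plain list of dictionaries stands in for one here. The table name, field names, and sample records are assumptions for this example.

```python
import sqlite3

# Relational database: tabular rows with fixed columns, queried with SQL.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customers (name TEXT, address TEXT)")
connection.execute(
    "INSERT INTO customers (name, address) VALUES (?, ?)",
    ("Jane Doe", "123 Maple St."),
)
sql_rows = connection.execute(
    "SELECT name, address FROM customers WHERE name = ?",
    ("Jane Doe",),
).fetchall()
connection.close()

# Document store stand-in: records need not share the same fields.
documents = [
    {"type": "email", "from": "Jane Doe", "body": "Hello!"},
    {"type": "web_page", "url": "https://example.com"},
]
emails = [doc for doc in documents if doc.get("type") == "email"]
```

The SQL query relies on every row having the same columns, while the document filter has to tolerate records that lack a given field, which is why it uses `doc.get`.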


Putting it all together: Location
On-premises cluster

Cloud provider:
Azure

AWS

Google Cloud


Putting it all together: Data type

Data Type      Storage Solution
Unstructured   Document Database
Tabular        Relational Database


Putting it all together: Queries

Storage Solution      Query Language
Document Database     NoSQL
Relational Database   SQL


Let's practice!