Starting A Data Science Team: Dr. Jonathan D. Adler

Starting a Data Science Team

Dr. Jonathan D. Adler


jonathandadler@outlook.com
http://jadler.info
About me
• Degrees:
  • BS – Mathematical Sciences (Worcester Polytechnic Institute)
  • MS – Applied Mathematics (Worcester Polytechnic Institute)
  • PhD – Industrial Engineering (Arizona State University)
    • Research – real-time optimization in situations with uncertainty
• Jobs:
  • Vistaprint – Forecasting, customer segmentation, product recommendation engine
  • Boeing – Market forecasting
  • Promontory Growth and Innovation – Data science consulting
    • Created the data science team from scratch
  • Microsoft Studios – Xbox and Windows 10 analytics

The realization
• When a company gets to a certain size, it often realizes a few things:
  • It has a lot of data, and lots of questions
  • This data probably has useful information in it
  • There are people out there who can turn the data into useful insights
• “So let’s hire some people to turn that data into useful information”

There are a LOT of open questions
• What can we really do with data science?
• What makes a data science project successful?
• What are the skills the employees should have?
• Who should be our first hire? Our second? Our fifth?
• What are best practices for running the team?

Outline
1. What can we really do with data science?
2. What makes a data science project successful?
3. What are the skills the employees should have?
4. Who should be our first hire? Our second? Our fifth?
5. What are best practices for running the team?

1. What you can do with data science

Types of data science work
• Data science tends to fall into three broad categories, ordered from simple to complex:
  • Investigating – aggregating and inspecting data to get basic insights on what is currently happening
  • Predicting – taking the data and using it to understand what will happen in the future
  • Optimizing – using the data to choose the best course of action

Investigating
• Look at historic data to answer direct questions:
  • If you have two products, which is selling better? How many people are buying both?
  • How frequently do customers order?
  • How are sales changing each month?
• These questions are generally quick to answer and don’t require a mathematical model
• Difficulty is in knowing which measures to use and how to visualize/represent them
• Unfortunately, they don’t tell you much (“so what?”)
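
The investigating questions above can usually be answered with plain aggregation. A minimal sketch, assuming a hypothetical list of order records (the field layout and values are made up for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical order records: (customer_id, month, product, amount).
orders = [
    ("c1", "2024-01", "A", 20.0),
    ("c1", "2024-02", "B", 15.0),
    ("c2", "2024-01", "A", 20.0),
    ("c3", "2024-02", "A", 20.0),
    ("c3", "2024-02", "B", 15.0),
    ("c4", "2024-02", "A", 20.0),
]

# Which product is selling better? Count orders per product.
units_by_product = Counter(product for _, _, product, _ in orders)

# How many people are buying both products?
products_by_customer = defaultdict(set)
for customer, _, product, _ in orders:
    products_by_customer[customer].add(product)
bought_both = sorted(
    c for c, products in products_by_customer.items() if {"A", "B"} <= products
)

# How are sales changing each month? Sum revenue per month.
revenue_by_month = defaultdict(float)
for _, month, _, amount in orders:
    revenue_by_month[month] += amount

print(units_by_product.most_common())
print(bought_both)
print(dict(revenue_by_month))
```

No model is involved; the judgment is in choosing which counts and sums are worth showing.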

Predicting
• Look at historic data to predict:
  • How likely is a customer to come back?
  • How will a customer respond to a sale?
  • Is revenue going to increase over time?
• This information is a lot more meaningful; you are more likely to be able to act on it
• Requires mathematical modeling: regressions, clustering algorithms, etc.
• Sometimes the data isn’t there to make the prediction, sometimes the prediction is wrong, and sometimes it takes more skill to do well
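
As one illustration of the "is revenue going to increase over time?" question, here is a sketch of an ordinary least-squares trend line. The monthly revenue figures are hypothetical, and a straight line is only the simplest possible model, not a recommendation for real forecasting:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical revenue for six consecutive months.
months = [1, 2, 3, 4, 5, 6]
revenue = [100.0, 104.0, 109.0, 115.0, 118.0, 126.0]

slope, intercept = fit_line(months, revenue)
forecast_month_7 = intercept + slope * 7

print(f"trend per month: {slope:.2f}")
print(f"forecast for month 7: {forecast_month_7:.2f}")
```

A positive slope suggests growth, but as the slide notes, the prediction can still be wrong; checking it against held-out months is part of doing this well.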

Optimizing
• Look at the historic data to make the best decisions:
  • How much inventory should be held, and when should you reorder?
  • Which product should you recommend to a website visitor for the most profit?
  • What price should you set for a product? When should it go on sale?
• These problems are the hardest to get right
• They also directly provide the most profit
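
To make the pricing question concrete, here is a small sketch that searches a grid of candidate prices for the one with the highest expected profit. The demand curve and unit cost are assumptions invented for the example; in practice the demand curve would itself come from a predictive model:

```python
def expected_profit(price, unit_cost, demand_at):
    """Profit = margin per unit times the units expected to sell at that price."""
    return (price - unit_cost) * demand_at(price)


def best_price(candidates, unit_cost, demand_at):
    """Return the candidate price with the highest expected profit."""
    return max(candidates, key=lambda p: expected_profit(p, unit_cost, demand_at))


def demand(price):
    # Hypothetical demand curve: 100 units at a price of 0,
    # losing 5 units of demand for every extra dollar.
    return max(0.0, 100.0 - 5.0 * price)


# Candidate prices from 4.00 to 20.00 in steps of 0.50.
candidates = [p / 2 for p in range(8, 41)]
price = best_price(candidates, unit_cost=4.0, demand_at=demand)

print(price, expected_profit(price, 4.0, demand))
```

The optimization step is easy here; the hard part, as the slide says, is getting a demand estimate that is right.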

Things you can’t do with data science
• Solve problems where the core drivers aren’t in the data, or the signal is too weak in the noise
  • Which start-up companies are going to succeed
  • When is the next recession going to hit
• Having data isn’t sufficient for success
  • Knowing the last 10,000 flips of a fair coin won’t help me predict the next flip
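
The coin example can be checked with a quick simulation. This sketch uses one hypothetical strategy (guess whichever side has come up more often so far) and a fixed random seed; any strategy based only on past flips should land near 50% accuracy:

```python
import random


def majority_guess_accuracy(n_flips, seed=0):
    """Guess each flip from the history so far and report the hit rate."""
    rng = random.Random(seed)
    heads_so_far = 0
    correct = 0
    for i in range(n_flips):
        # Guess heads when heads has been at least as common as tails so far.
        guess_heads = heads_so_far * 2 >= i
        flip_heads = rng.random() < 0.5  # fair coin
        correct += guess_heads == flip_heads
        heads_so_far += flip_heads
    return correct / n_flips


accuracy = majority_guess_accuracy(10_000)
print(f"accuracy using all past flips: {accuracy:.3f}")
```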

2. What makes a data science project successful

Case Study 1: an e-commerce company
• A company that sold brownies online wanted to improve its marketing. It had two types of customers:
  • Consumers ordering for friends and family
  • Businesses ordering for their clients
• Wanted to target their customers differently, but couldn’t consistently tell if a customer was a business or a consumer
• Data science approach: analyze the text of the gift message to determine if it had “business” or “consumer” words
• Result: a continuously running script determined if each new order was for a business or consumer, and the customer was put into one of the two categories

Gift message | Probability of business
THANK YOU ALL FOR YOUR AND HARD WORK, IT IS TRULY APPRECIATED BY THE MANAGEMENT TEAM | 0.989084
CONGRATULATIONS AND BEST OF LUCK ON YOUR NEW JOB! WE ARE VERY PROUD OF YOU! LOVE MOM | 0.019581
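
The slides do not say which model scored the gift messages. As one plausible illustration, the sketch below uses a small naive Bayes classifier over word counts; the training messages are hypothetical, and the class and function names are made up for this example:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the message and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


class GiftMessageClassifier:
    """Naive Bayes with Laplace smoothing (an assumed technique, not the
    method confirmed by the case study)."""

    def __init__(self):
        self.word_counts = {"business": Counter(), "consumer": Counter()}
        self.doc_counts = Counter()

    def fit(self, labelled_messages):
        for text, label in labelled_messages:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        return self

    def prob_business(self, text):
        vocab = set(self.word_counts["business"]) | set(self.word_counts["consumer"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Convert the two log scores into a probability for "business".
        return 1.0 / (1.0 + math.exp(log_score["consumer"] - log_score["business"]))


# Hypothetical labelled training messages.
training = [
    ("thank you for your hard work from the management team", "business"),
    ("we appreciate your business and partnership", "business"),
    ("thank you for being a valued client", "business"),
    ("happy birthday love mom and dad", "consumer"),
    ("congratulations on your new job we are proud of you", "consumer"),
    ("get well soon love grandma", "consumer"),
]
model = GiftMessageClassifier().fit(training)

p_business = model.prob_business("Thank you team for your hard work")
p_consumer = model.prob_business("Love you mom, happy birthday")
print(round(p_business, 3), round(p_consumer, 3))
```

A production version would need far more labelled messages and a probability threshold chosen with the marketing team.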

The data science process
• All analyses begin with questions
• A data scientist will take the question and investigate the data
• Cases:
  • It’s clear the data isn’t right to answer the question
  • Advanced modeling is required
  • The answer to the question can be found immediately in the data
• If a result is found:
  • It can be productionized
  • It can raise more questions
[Diagram: Questions, Data, Analysis, Modeling, Abort, Result, Productionize]

Case Study 2: a manufacturing company
• A manufacturing company made highly customized machines
  • “build a machine that builds lightbulbs”
• The company needs to make quotes for the price without ever having made that machine before
• Costs could be substantially higher or lower than quoted
• Problem wasn’t with price but with quality of estimates
• Data science used to better predict how much a machine would cost
  • Question: how can we predict the cost of a machine?
  • Data: features of previous machines, their estimates, and the actual costs
  • Analysis: there is a relationship between features and the estimate/actual ratio
  • Model: a GAMLSS (generalized additive model for location, scale and shape) to predict the true cost and the possible error band
  • Result: the model predicted costs more accurately
  • Productionize: a simple GUI for the company, and a contract for refitting the model

The data science process: problems
Possible problems:
• Data inconsistent or poorly formatted
• Question ill-formed
• Model can be faulty – overfitting or incorrect assumptions
• Signs the model won’t work are ignored
• Stuck in a “what about” loop
• Model not built in a way that makes for simple productionizing

Case Study 3: a distributor
Situation:
• A company sells products in bulk to state governments
• Wants to discount each quote to the price that brings in the most profit
• Data science used to determine which price

Problems:
• No clear relationship found between customer and chance of accepting a quote
• A highly advanced model was used that found a relationship, but it was unintuitive and not robust, so it was hard to tell if it was working correctly
• Model was built on training data in a different format from the production data, so the entire model had to be rebuilt to productionize
• Was stuck in a “what about” loop: continuously cutting the data in different ways to satisfy the customer

End result: the project was over budget and under-delivered, and is now extremely difficult to maintain

3. Skills needed to be a data scientist

The five skills needed to do data science
Technical Skills
• Statistics and Math – The different techniques used on data: regressions, clustering algorithms, time series models
• Software Development – How to write code, how to manage a code base, how to store data in a database
• Business Experience – Where companies waste money, what makes a project succeed, how to get data from within a company
Personal Skills
• Leadership – How to help other data scientists, how to train them, and how to work with a customer to produce good results
• Adaptability – The ability to figure out a solution when presented with an entirely new problem

Data scientist archetypes
Junior data scientist (J) – Has a BS, and less than three years of experience in industry. Tends to know only the simple statistical techniques. Requires a lot of guidance, but is happy to do the less interesting work. ($)

Expert junior scientist (E) – A junior data scientist who has been working for 5+ years. Gets very comfortable doing the simple stuff and knows their business area deeply. May have gotten an MS to help their career. ($$)

[Chart per archetype: Statistics, Coding, Business, Leadership, Adaptability]

Data scientist archetypes
Senior data scientist (S) – A person with an advanced degree, and enough business experience to know what to do with it. Understands coding well enough to do things right. Is less willing to do the simpler work than a junior data scientist, but will if no one else is around. The big difference between senior and expert junior is the ability to learn independently. ($$$)

Principal data scientist (P) – Just like a senior data scientist, but also with experience leading a team and a project. Difficult to find. ($$$$)

[Chart per archetype: Statistics, Coding, Business, Leadership, Adaptability]

Data scientist archetypes: danger
Business intelligence analyst (B) – Understands a lot about the business and the data powering it. Doesn’t know much about statistics or what to do with the data. Can be dangerous without proper guidance. ($)

Academic (A) – Has an MS/PhD and not too much business experience. Loves to think about interesting problems, but is less willing to spend the time doing the mundane work to get a project done (and might not know how!). ($$$)

[Chart per archetype: Statistics, Coding, Business, Leadership, Adaptability]

4. Who you should hire

Hiring
• You only need to hire one data scientist; they’ll hire the rest
• Who you choose for the first hire dramatically alters how your team will end up

First hire choice
• [Expert] Junior data scientist: “the blind leading the blind.” This team will know how to do simple data science but won’t know how to do advanced work. Often won’t even know the advanced things exist. Often very inefficient because they don’t know any better.
  Resulting team: EJ EJ J J J J
• Academic: “the ivory tower.” A team of people who look at only the most complex problems, and spend tons of time talking about how there are interesting and innovative solutions for them. Won’t produce very many solutions.
  Resulting team: A A A A

First hire choice
• Senior data scientist: “the very strong and expensive team.” This team will be efficient and generally produce good results. The team members won’t enjoy doing the simpler work, but it’ll get done.
  Resulting team: S S S S
• Principal data scientist: “the balanced.” The principal data scientist will be expensive, but the people they hire won’t be. They will set the groundwork for the team to run efficiently, and will be able to support a junior team.
  Resulting team: P S S J EJ J

5. Best practices

From data to a result
• Data: tables in a database
• Result: a presentation
• Tools are needed for this process
[Diagram: Questions, Data, Analysis, Modeling, Abort, Result, Productionize]

From data to a result: work streams
Least efficient:
Tables in a database
1. Database queries to aggregate and join the data
2. SAS code to analyze the aggregated data and run a model
3. Excel worksheets to visualize the data
4. PowerPoint to make the presentation
A .pptx file

Most efficient:
Tables in a database
1. A single set of R code to aggregate and join the data, run the model, visualize the output, and make a presentation
A .pdf file

Firms underestimate the total loss from an inefficient work stream

Storing knowledge
• Data science produces lots of types of files
  • Raw data from the client
  • Processed intermediate data
  • Code to do the analysis
  • Base results
  • Finalized reports and presentations
• Often an analysis is done once
  • May never be looked at again
  • In a year, someone might ask to do the analysis again with changes

Storing knowledge: methods
Least robust:
Each data scientist has a folder on a share drive for each project containing all of the data, code, and results
• Doesn’t make clear what files are used for
• Doesn’t track changes over time
• Doesn’t indicate what was delivered to client

Most robust:
Materials split into three components
1. Input data is stored in folders for each project sharing a consistent scheme
2. Code for analysis is stored in version control to track changes
3. Output is stored in folders with marked versions connected to the code
• Anything delivered to the client is marked with how it was created
• Allows for clear change logs to see differences in versions
• Splitting input data allows for easy data updates
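
One way to connect output to the code that produced it is to stamp every output folder with the code version. This is a sketch under assumptions: the folder layout, the `write_versioned_output` helper, and the version string are all hypothetical, and in practice the version would come from the version-control system:

```python
import json
import tempfile
from pathlib import Path


def write_versioned_output(root, project, code_version, report_text):
    """Write a report into output/<project>/<code_version>/ with a manifest
    recording which code version and which input folder produced it."""
    out_dir = Path(root) / "output" / project / code_version
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "report.txt").write_text(report_text)
    manifest = {
        "project": project,
        "code_version": code_version,
        "input_dir": str(Path(root) / "input" / project),
    }
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return out_dir


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as root:
    out_dir = write_versioned_output(
        root, "pricing-study", "v1.2-3f9c2ab", "Summary of results"
    )
    manifest = json.loads((out_dir / "manifest.json").read_text())

print(manifest["code_version"])
```

With a manifest next to each deliverable, a request a year later to "redo the analysis with changes" starts from a known code version rather than a guess.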

Project management
• Data science process involves many small tasks
  • Finding data
  • Initial analysis
  • Attempting multiple models
• With multiple projects and multiple people, coordination is non-trivial
[Diagram: Questions, Data, Analysis, Modeling, Abort, Result, Productionize]

Project management: methods
• For project setup, have a standard expected timeline, for example:
  • Initial investigation: 2 weeks
  • Modeling: 4 weeks
  • Result validation: 2 weeks
  • Productionizing: 4 weeks
• In the timeline, have set points to meet with the client and review
• Use a card-based tool like Trello to track the progress of individual steps
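
The example timeline adds up to 12 weeks. A small sketch of turning it into calendar dates follows; the start date is arbitrary, and treating the end of each phase as a client review point is an assumption for illustration:

```python
from datetime import date, timedelta

# Phase lengths in weeks, taken from the example timeline.
PHASES = [
    ("Initial investigation", 2),
    ("Modeling", 4),
    ("Result validation", 2),
    ("Productionizing", 4),
]


def schedule(start, phases=PHASES):
    """Return (phase name, start date, end date) for phases run back to back."""
    plan = []
    current = start
    for name, weeks in phases:
        end = current + timedelta(weeks=weeks)
        plan.append((name, current, end))
        current = end
    return plan


plan = schedule(date(2024, 1, 1))
for name, phase_start, phase_end in plan:
    # Assumption: the client review happens at the end of each phase.
    print(f"{name}: {phase_start} to {phase_end} (review on {phase_end})")
```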

Conclusion

1. What can we really do with data science?
   Many problems that rely on data can be solved, from simple investigation of the data to building predictive models and optimization algorithms.
2. What makes a data science project successful?
   Having a clear path from the data to the result, and ensuring the project gets completed (or aborted) at the right time.
3. What are the skills the employees should have?
   Statistics, software development, business experience, leadership, and adaptability.
4. Who should be our first hire? Our second? Our fifth?
   Someone with all five of those skills, or failing that, someone with all of them but leadership.
5. What are best practices for running the team?
   Have a clear, efficient process for doing data science and storing the results.

Questions?

Appendix

Hiring roadmap
• If you can find a principal data scientist
  • Hire them, and have them set up the groundwork for the team
  • 3 months in, hire a senior data scientist
  • 6 months in, hire a junior data scientist
  • By 18 months in, have a team of 5-6 people
• If you can’t find a principal data scientist
  • Hire a senior data scientist to work independently
  • Every 3 months hire an additional senior data scientist
  • If at any point there seems to be too much simple work, start hiring junior data scientists and assign them senior data scientists as mentors

Ensuring candidates have these skills
During the hiring process, check:
• Statistics and Math – Do they know how to use a linear regression? What overfitting is? Supervised vs. unsupervised learning?
• Software Development – Have they used R, Python, or MATLAB? Have they used source control? Have they pulled data from a SQL database, and do they understand how to do joins?
• Business Experience – Do they have experience working in a company? Have they seen a project through to completion? Can they reflect on why a project succeeded or failed?
• Leadership – Have they managed a project? Have they led employees?
• Adaptability – Do they have experience in figuring out a solution to an entirely new problem without substantial guidance?
