
A Crash Course in Data Science - Review
Kartikeya Bolar

What is Data Science?
• Applied Statistics?
• Applied Machine Learning?
• Database Management?
• Answering Specific Questions with Data?
• Deep Learning?

Broad Areas of Statistics
• Descriptive - Involves basic summary tables and exploratory data analysis
(Focus is on profiling and exploring potential relationships)
• Inferential - Process of drawing conclusions about a population from a sample
(Focus is on parameter estimates and relationships between parameters)
• Predictive - Process of obtaining predictions irrespective of the statistical significance of
the relationship
(Focus is on sampling variance)
• Experimental Design - Balancing observed and unobserved covariates that may
contaminate our results
(Focus is on cause and effect)
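
As a rough illustration of the difference between the descriptive and inferential areas above, the sketch below first summarizes a sample and then uses it to estimate a population mean. The sample values are made up for illustration, and the interval relies on a normal approximation (z = 1.96), which is an assumption rather than anything stated in these slides.

```python
import math
import statistics

# Hypothetical sample of 10 measurements (made-up numbers for illustration).
sample = [4.1, 5.0, 4.8, 5.6, 4.4, 5.2, 4.9, 5.1, 4.7, 5.3]

# Descriptive: summarize the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Inferential: estimate the population mean with an approximate 95% interval.
# Assumption: normal approximation with z = 1.96; a t-quantile would be more
# appropriate for a sample this small.
standard_error = sample_sd / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"mean={sample_mean:.2f}, sd={sample_sd:.2f}")
print(f"approx. 95% CI for population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first two numbers describe only the ten observed values; the interval is a statement about the population the sample was drawn from.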

Machine Learning
• Obtain generalizability by testing on novel datasets
• Supervised (focus on prediction, judged by prediction performance)
• Unsupervised (clustering, association, principal component analysis)
• Traditional statistical approaches often differ from ML approaches by placing a higher
priority on parameter interpretability and simplicity (model specification) than on
prediction performance
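
One common way to "test on novel datasets" is to hold out part of the data while fitting. The sketch below is a minimal example under stated assumptions: the data are simulated, the 75/25 split is arbitrary, and the helper names fit_line and mean_squared_error are hypothetical, not taken from any library.

```python
import random

random.seed(0)  # fixed seed so the split is repeatable

# Hypothetical data: y is roughly 2*x + 1 plus noise (made up for illustration).
xs = [float(i) for i in range(40)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 1.0)) for x in xs]

# Hold out 25% of the rows as "novel" data the model never sees during fitting.
random.shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(rows, slope, intercept):
    """Average squared gap between observed y and the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)

slope, intercept = fit_line(train)
train_mse = mean_squared_error(train, slope, intercept)
test_mse = mean_squared_error(test, slope, intercept)
print(f"train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The test error is the figure that speaks to generalizability, because those rows played no part in choosing the slope and intercept.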

Software engineering for Data Science
• Software engineering is used to generalize data analyses into software so that
they can be applied in different situations
• Software packages provide a well-defined interface that can abstract low-level
technical details of data analysis routines
• Developing a function or a package depends upon the level of repetition of the
procedure or steps
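
A small sketch of the idea above: once the same summary steps are repeated on more than one dataset, they can be wrapped in a function. The function name summarize and the sales figures are hypothetical, chosen only for illustration.

```python
import statistics

def summarize(values):
    """Return a small summary table (as a dict) for one numeric column."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

# The same routine is reused on different (hypothetical) datasets.
sales_2022 = [120, 135, 150, 110, 160]
sales_2023 = [140, 155, 170, 130, 180]

for label, values in [("2022", sales_2022), ("2023", sales_2023)]:
    print(label, summarize(values))
```

If several such functions accumulate and are shared across projects, collecting them into a package is the natural next step.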

Structure of Data Science Project
• A data science project might start
with exploratory data analysis or
with defining/stating the question
• Decision making is not part of the
data analysis process

Output of Data Science Experiment - 1
Output Types
• Reports
• Presentations
Characteristics
• Clearly written
• Narrative
• Concise conclusions
• Omit the unnecessary
• Reproducible
• Tools: R Markdown, knitr, Presenter

Output of Data Science Experiment - 2
Output Types
• Interactive web pages (dashboards)
• Apps
Characteristics
• Easy to use
• Documentation
• Code commented
• Version control
• Tools: R Markdown Shiny, Shiny web app, Tableau, Power BI

Hallmarks of Successful Data Science
• New knowledge is created.
• Decisions or policies are made based on the outcome of the experiment.
• A report, presentation or app with impact is created.
• It is learned that the data can't answer the question being asked of it.

Data scientist’s toolbox
• Data programming languages (e.g. R, Python)
• Scaling computing frameworks (e.g. Apache Spark, Hadoop MapReduce)
• Web servers (e.g. Amazon Web Services, RStudio Cloud, Azure)
• Help websites (e.g. Stack Overflow)
• Databases (e.g. SQL Server, Excel)
• Chat tools (e.g. Slack)
• Reproducibility tools (e.g. R Markdown)
• Data products development tools (e.g. Shiny, Tableau, Power BI)

Separating Hype from Value
• What is the question you are trying to
answer with the data?
• Do you have the data to answer that
question?
• If you could answer the question,
could you use the answer?

