Revision
• What is Data Science?
• What is Big Data?
• What’s driving the Data Deluge?
• 6Vs of Big Data – Volume, Variety, Velocity, Veracity & Validity, Value and Vulnerability
• Facets of Big Data – Structured, Semi-Structured, Quasi-Structured and Unstructured Data
• Emerging Big Data Ecosystem – Data Devices, Data Collectors, Data Aggregators, Data Users/Buyers
Data Science Lifecycle
Lifecycle Phases –
1. Discovery
2. Data Preparation
3. Model Planning
4. Model Building
5. Communicate Results
6. Operationalize
Phase 1 – Discovery
1. Learning the business domain
2. Resources
3. Framing the problem
4. Identifying the key stakeholders
• Business User
• Project Sponsor
• Project Manager
• Business Intelligence Analyst
• Database Administrator (DBA)
• Data Engineer
• Data Scientist
5. Identifying the potential data sources
Phase 2 – Data Preparation
1. Preparing the analytic sandbox
2. Data Cleansing
3. Combining Data
4. Data Transformation
5. Common tools –
• Hadoop
• OpenRefine
• Alpine Miner
• Data Wrangler
Data Cleansing – Common Errors
• Outliers
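One common way to flag outliers during cleansing is the interquartile range (IQR) rule. The sketch below is illustrative only: the IQR rule, the function name `iqr_outliers`, the multiplier `k=1.5` and the sample `ages` data are assumptions, not taken from the slides.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: 240 looks like a data-entry error for an age field.
ages = [23, 25, 27, 29, 31, 33, 35, 240]
print(iqr_outliers(ages))
```

A flagged value is only a candidate error. Whether to correct, drop or keep it depends on the business domain learned in the Discovery phase.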
Data Cleansing – Techniques for handling missing values
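Two widely used techniques are dropping records with a missing value and imputing the missing value with the mean of the observed ones. The sketch below shows both. The helper names, the `income` field and the `customers` records are hypothetical, and the slides may cover other techniques as well.

```python
import statistics

def drop_missing(rows, field):
    """Keep only the rows where `field` is present and not None."""
    return [r for r in rows if r.get(field) is not None]

def impute_mean(rows, field):
    """Return copies of the rows with missing `field` values set to the observed mean."""
    observed = [r[field] for r in rows if r.get(field) is not None]
    fill = statistics.mean(observed)
    return [{**r, field: fill if r.get(field) is None else r[field]} for r in rows]

# Hypothetical records: customer 2 has no recorded income.
customers = [
    {"id": 1, "income": 40000},
    {"id": 2, "income": None},
    {"id": 3, "income": 60000},
]

print(drop_missing(customers, "income"))
print(impute_mean(customers, "income"))
```

Dropping is simple but loses records. Imputing keeps every record but can understate the variance of the field.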
Combining Data
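Data sets are typically combined in two ways: joining, which enriches records with fields from another table that shares a key, and appending, which stacks records that have the same structure. The sketch below illustrates both under assumed data. The tables, the `client_id` key and the `inner_join` helper are hypothetical.

```python
def inner_join(left, right, key):
    """Join two lists of dicts on `key`, keeping only keys present in both."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Joining: enrich orders with client details through a shared key.
clients = [{"client_id": 1, "name": "Ana"}, {"client_id": 2, "name": "Ben"}]
orders = [
    {"client_id": 1, "amount": 30},
    {"client_id": 1, "amount": 45},
    {"client_id": 3, "amount": 10},  # no matching client, so dropped by the join
]
joined = inner_join(clients, orders, "client_id")

# Appending: stack two tables that share the same fields.
january = [{"client_id": 1, "amount": 30}]
february = [{"client_id": 2, "amount": 20}]
appended = january + february

print(joined)
print(appended)
```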
Phase 3 – Model Planning
1. Data Exploration
2. Model Selection
3. Common tools for the Model Planning phase –
• R
• MATLAB
• SAS
Example – Data Exploration
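A first pass at data exploration usually means computing summary statistics for each variable before choosing a model. The following is a minimal sketch of that step, not the example from the original slide. The `summarize` helper and the `spend` values are hypothetical.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a numeric variable."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical variable: spend per customer visit.
spend = [12.0, 15.5, 9.0, 30.0, 14.5]
summary = summarize(spend)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean well above the median, as in this sample, hints at a skewed distribution, which can influence model selection.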