
www.bbds.ma
www.bigbang-datascience.com
Agenda

Introduction to Data Science
• Data Explosion
• Why Data Science?
• What is Data Science?
• Type of Analytics
• Data Science Portfolio
• Data Science Process
• Career in Data Science

Deep Learning | Machine Learning Practice
• Picking the right algorithm
• Misconceptions/Myths
• Hard Situations
• Limitations of ML today

Introduction to Machine Learning
• Big Picture
• Definitions, Origins
• Learning Mechanisms and Structure, Tasks and Success Metrics
• Jobs impacted by Machine Learning

Why Should You Become a Data Scientist?
• Make a Better World
• A Fuel of the 21st Century
• Career of Tomorrow
• High Demand
• Great Salaries
• Lucrative Career

Is Data Science for me?
Because of the media coverage around data science and the characterization of data scientists
as “rock stars,” you may feel like it’s impossible for you to enter into this realm. If you’re
the type of person who loves to solve puzzles and find patterns, whether or not you consider
yourself a quant, then data science is for you.

Cathy O’Neil & Rachel Schutt - Doing Data Science

Data Explosion
Some Interesting Facts About Data
▪ Every day, we create 2.5 quintillion bytes of data, so much that
90% of the data in the world today has been created in the last two
years alone.

▪ Walmart handles more than 1 million customer transactions every
hour, which are imported into databases estimated to contain more than
2.5 PB of data.

▪ Twitter generates 12 TB of data every day.

▪ Airbus A380 generates 10 TB every 30 minutes of flight.

▪ NYSE generates a TB of data every month.

What do we do with so much data?
Ignore it or use it.
How much data is getting generated?

The model has changed …

Old Model – only a few companies were generating data (like news outlets); all others were
consuming data

New Model – all of us are generating data, and all of us are consuming data
Opportunities for a New Approach to Analytics

Over 2.5 exabytes (2.5 billion gigabytes) of data are generated every day.

In 2020, the world will generate 50x more data than we generated in 2011
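
The daily figures quoted in this section (2.5 quintillion bytes, 2.5 exabytes, 2.5 billion gigabytes) are the same quantity expressed in different units. A minimal Python sketch of the conversion, assuming decimal (SI) units; the constant and function names are illustrative and do not come from the slides:

```python
# Decimal (SI) byte units. Assumption: the slide figures use powers of ten.
GIGABYTE = 10**9
EXABYTE = 10**18
QUINTILLION = 10**18  # short-scale quintillion

# "2.5 quintillion bytes of data" created per day, as quoted above.
DAILY_BYTES = 2.5 * QUINTILLION


def to_exabytes(n_bytes):
    """Convert a byte count to exabytes."""
    return n_bytes / EXABYTE


def to_gigabytes(n_bytes):
    """Convert a byte count to gigabytes."""
    return n_bytes / GIGABYTE


print("Exabytes per day:", to_exabytes(DAILY_BYTES))
print("Gigabytes per day:", to_gigabytes(DAILY_BYTES))
```

Both printed values describe the same daily volume, which is why the two slides agree with each other.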
Data - The Most Valuable Resource

“In its raw form, oil has little value. Once processed and refined, it helps power the world.”
—Ann Winblad

“Data is the new oil.”


—Clive Humby, CNBC
What is Data Science?
Data Science – A Definition

A decade after the term data science was first used, there is continued debate among
practitioners and academics about what data science means.

Data Science is the science which uses computer science, statistics and machine
learning, visualization and human-computer interactions to collect, clean, integrate,
analyze, visualize, and interact with data to create data products.

“The ability to take data—to be able to understand it, process it, to extract value
from it, to visualize it, to communicate it
—that’s going to be a hugely important skill in the next decades.”

- Hal Varian, Google’s Chief Economist

Data Science – A Definition

If you torture the data long enough, it will confess to anything


Ronald Coase
Data Science – A Visual Definition

Multidisciplinary

From my perspective, a data scientist has a blend of many skills:
➢ Statistics quantifies numbers
➢ Data Mining explains patterns
➢ Machine Learning predicts with models
➢ Artificial Intelligence behaves and reasons
Data Science – A Visual Definition

• Data science applies the scientific method to analyzing data
• It lies at the intersection of several disciplines
• It draws on industry knowledge that makes the analysis of Big Data possible

Industry knowledge is essential to knowing what to look for when exploring data

[Diagram labels: Machine Learning, Data Science, Industry Knowledge]
Why Data Science?

Harvard Business Review: Data scientist is the sexiest job of the 21st century
LinkedIn: Statistical Analysis & Data Mining were the hottest skills that got
recruiters’ attention in 2014/2015/2016/2017/2018/2019/2020/2021
Glassdoor ranked data scientist as the #1 job to pursue in 2016/2017/2018/2019/2020
McKinsey: the US alone faces a shortage of 150,000+ data analysts and an
additional 1.5 million data-savvy managers

Salary trends have followed the impact of data science. With a national
average salary of $118,000 (which increases to $126,000 in Silicon
Valley), data science has become a lucrative career path where
you can solve hard problems and drive social impact.

“Data Science”: an Emerging Field

The future belongs to the companies and people that turn
data into products

O’Reilly Radar report, 2011

Goal of Data Science
Turn data into data products.

Types of Analytics
There are four distinct types of Analytics:

• Explains what has happened
• Suggests why it happened
• Indicates what could happen
• Recommends what should happen

There are several areas of Analytics
What data science can do

Data science can be applied across any industry

• Empowers management to make better decisions
• Increases accountability and validates decisions
• Increases operational efficiency and investment from staff
• Identifies new opportunities to stay competitive
Data Science Portfolio
Data Scientist Profile (Competencies)
1. Quantitative skills, such as mathematics or statistics.

2. Technical aptitude, such as software engineering, machine learning, and programming skills.

3. Skeptical: this may seem a counterintuitive trait, but it is important that data scientists
can examine their work critically rather than in a one-sided way.

4. Curious & Creative: data scientists must be passionate about data and about finding creative
ways to solve problems and portray information.

5. Communicative & Collaborative: it is not enough to have strong
quantitative skills or engineering skills. To make a project resonate,
you must be able to articulate the business value in a clear way, and
work collaboratively with project sponsors and key stakeholders.

Data Science Is a Team Sport
“Citizen Data Scientist”?
Market trends indicate the emergence of the “Citizen Data Scientist”
Will Data Science take our jobs?
What Jobs Will be Lost?
• 5% of all jobs can be completely automated today
– Collecting data, processing data and predictable
physical work
– Manufacturing, food preparation, tax
preparation, financial advising etc.

• In about 60% of jobs, at least 30% of activities can be automated
– 25% of a CEO job can be automated

• Change dependent on:
– Technical Feasibility, cost of development and
integration, labor market dynamics, economic
benefits, regulatory & social acceptance

• Least susceptible to automation:
– Applying expertise, decision making, planning,
creative tasks, managing and developing others

Ref: MGI 2017b

New Jobs Due to AI
• Trainers: Teach AI systems how they should perform
– Help natural-language processors and language translators make fewer errors (understanding
context, detecting sarcasm).
– Teach them how to mimic human behaviors (empathy and humor).
– Ex: “I am stressed out about the exam.”

• Explainers: Bridge the gap between technologists and business leaders.
– Provide clarity, especially in light of the “right to explanation”
– Explain and correct unintended behaviors
– Ex: Ability to explain why a certain decision was reached

• Sustainers: Ensure that AI systems are operating as designed and that unintended
consequences are addressed.
– Maintain confidence in the fairness and auditability of their AI systems
– Be a watchdog and ombudsman for upholding norms of human values and morals
– Ex: Returning only white women when searching for “loving grandmother”; adverse actions
against minority groups
Ref: Wilson 2017

Data Science Jobs have Increased
• Data Scientist has been called the
sexiest job of the 21st century

• There will be 11.6 million new jobs by 2026 per statistics

• Post Covid everything will digitize faster, creating more opportunities

• This is one of the worst economic crises in centuries

• Reports of job losses and record unemployment worldwide

• AI, ML jobs are still expected to grow this year and accelerate further
Data Science Lifecycle
Career in Data Science
What You Need to Learn to Become a Data Scientist

This next section covers all of the data science skills you’ll need to learn. You’ll also learn
about the tools you need to do your job.

Most data scientists use a combination of skills every day, some of which they have taught
themselves on the job or otherwise. They also come from various backgrounds.

There isn’t any one specific academic credential that is required to be an effective data
scientist.
How to Become a Data Scientist?

Domain Expertise
• Health Care
• Sales
• Finance
• IT
• Management
• Accounting
• Transportation
• Media
• Travelling
• etc.

Programming Languages
• Python
• R
• Julia
• Scala
• Java
• C++

Math | Stats | Probability
• Measures of Positions
• Measures of Dispersion
• Measures of Shape
• Measures of Relationships
• Mean, Median, and Mode
• Variance and Standard Deviation
• Co-variance and Correlation
• Permutations and Combinations
• Unions and Intersections
• Conditional Probability
• Bayes Theorem
• Binomial Distribution
• Poisson Distribution
• Normal Distribution
• Sampling
• Central Limit Theorem
• Hypothesis Testing
• T-Distribution Testing
• Regression Analysis
• ANOVA
• Chi Squared
• … etc.

Lingo | Foundations
• Decision Tree
• Random Forest
• Data Science
• Entropy
• Data Split
• Model fitting
• Training set
• Testing set
• Target variable
• Classification
• Regression
• Clustering
• Big Data
• Machine Learning
• Deep Learning
• NLP
• R2
• Confusion Matrix
• ...etc.

Projects
• Customer Churn
• Fraud Detection
• Basket Analysis
• Loan default detection
• Image classification
• … etc.
Three-Legged Stool

Skills (5) Lifecycle (5)

Domains (5)

One way to understand the collaborations that lead to Data Science success is to think of a
three-legged stool. Each leg is critical to the stool remaining stable and fulfilling its intended
purpose
Three Legged Stool (Skills)
Data Viz
Tableau is not only an ultra-powerful tool for seasoned analysts, but is also so easy to learn that it is a great
entry point into the world of data. Tableau is like a Data Science career hack.

R/Python
These two programming languages have become the two titans of Data Science. While very different in nature,
they both facilitate the same thing: statistical analysis of unlimited complexity. Knowing at least one is a must.
Knowing both puts you miles ahead.

SQL (PostgreSQL)
Knowing how to efficiently query a database is a crucial part of a Data Scientist’s job: to analyze the data you first
need to go get it. SQL programming also develops a certain way of thinking about data which helps you see the
big picture and workflow of your analysis.
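
As a small illustration of "going to get the data" with SQL, the sketch below aggregates spend per customer. It makes several assumptions: Python's built-in sqlite3 module stands in for PostgreSQL so the example needs no server, and the table, columns and sample rows are made up for illustration.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Return (customer, total) pairs ordered by total spend, highest first."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
        query = """
            SELECT customer, SUM(amount) AS total
            FROM transactions
            GROUP BY customer
            ORDER BY total DESC
            LIMIT ?
        """
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


# Hypothetical sample data, not taken from the slides.
sample = [("alice", 20.0), ("bob", 5.0), ("alice", 15.0), ("carol", 30.0)]
print(top_customers(sample))
```

The GROUP BY / ORDER BY / LIMIT pattern carries over to PostgreSQL, although connection handling and parameter placeholders differ between database drivers.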
Statistics
Needless to say, if you want to be successful as a Data Scientist you will need to develop a certain level of
statistical acumen. Start with Logistic Regression, A/B testing and the Law of Large Numbers.
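
One way to get a feel for the Law of Large Numbers is to simulate it. The sketch below is only an illustration under stated assumptions: it uses a simulated fair coin, and the function name and seed are arbitrary choices. As the number of flips grows, the observed fraction of heads tends to settle near the true probability of 0.5.

```python
import random


def fraction_of_heads(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the observed fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# Larger samples should land closer to the true probability of 0.5.
for n in (10, 1_000, 100_000):
    print(n, fraction_of_heads(n))
```

Small samples can stray far from 0.5, which is one reason conclusions drawn from little data deserve skepticism.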

Presentation
Preparing the data, building models, creating visualizations and deriving insights – are only half of the job. To be
a successful Data Scientist you need to be able to communicate your insights to your audience
Three Legged Stool (Domains)
Data Mining /BI Tools
Also known as ad-hoc analytics, data mining is the process of deriving new insights from data. Though different
in essence, creating business intelligence (BI) Tools is closely related, because often these insights need to be
streamlined and integrated into the business

Machine Learning/Modeling
Machine Learning is popping up everywhere: recommender systems on Amazon & Netflix, speech-to-text, face
recognition on your phone – the list goes on

Advanced Analytics
With Advanced Analytics you create simulations to help real-world businesses identify opportunities for
improvements

Computer Forensics
Computer Forensics/Fraud Analytics/Cyber Security all deal with slightly different things; however, the overall
objectives are extraction, analysis, protection and even ethical hacking of information for legal purposes

Big Data
Big data refers to dealing with large and complex data sets which traditional applications simply cannot cope with.
Rule of “3Vs” – Volume, Variety, Velocity
Three Legged Stool (Lifecycle)
Phase 1: Identify the Problem
Ever heard the phrase “Here’s some data, can you find some insights?” Too often stakeholders approach Data
Scientists with vague or even undefined goals. Understanding the end goal is very important and sets up the rest
of the project for success – (Time consumption: 10%)

Phase 2: Prepare the Data
Data can come from many sources, be in the wrong format, have anomalies and a myriad of other problems. A
single mistake in this stage can render the rest of the analysis useless – (Time consumption: 70%)
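
To make the kinds of problems named above concrete (duplicates, wrong formats, missing values), here is a minimal standard-library sketch. The record layout ('id' and 'age' keys), the sample rows and the choice of median imputation are all assumptions made for illustration, not the program's prescribed method.

```python
from statistics import median


def prepare(records):
    """Clean a list of dicts that have 'id' and 'age' keys."""
    # 1. Remove duplicates: keep the first record seen for each id.
    seen, unique = set(), []
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            unique.append(dict(rec))  # copy so the input is not modified

    # 2. Fix formats: ages stored as text become integers; unparseable ones become None.
    for rec in unique:
        try:
            rec["age"] = int(rec["age"])
        except (TypeError, ValueError):
            rec["age"] = None

    # 3. Handle missing values: fill them with the median of the known ages.
    known = [rec["age"] for rec in unique if rec["age"] is not None]
    fill = median(known) if known else None
    for rec in unique:
        if rec["age"] is None:
            rec["age"] = fill
    return unique


# Hypothetical raw data with a duplicate, a missing value and a bad format.
raw = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": None},
    {"id": 1, "age": "34"},
    {"id": 3, "age": "n/a"},
    {"id": 4, "age": 40},
]
print(prepare(raw))
```

Each step encodes a decision (which duplicate to keep, how to fill gaps) that affects everything downstream, which is why this phase takes so much of the time.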

Phase 3: Analyze the Data
Creating models, performing data mining, running text analytics, setting up simulations. This is the most fun and
exciting part. If the previous stages have been done correctly, analyzing the data and deriving insights will feel like
a breeze – (Time consumption: 10%)
Phase 4: Visualize Insights
Visualizing comes hand-in-hand with analyzing. This is a very powerful technique, as seeing the data in various
forms and shapes can help uncover insights that are otherwise not evident – (Time consumption: 10%)

Phase 5: Present Findings
Presenting findings is a whole separate “Bonus” stage. You need to not only convey the insights in your
audience’s language but also get buy-in from them to take action based on those insights
BBDS 30 Weeks
Training Program
Program Overview

Our 30-week training program in Data Science and Machine Learning starts on
August 1 at 10:00 AM EST. The program is a 30-week live stream, 5 times a week and 3
hrs. per session, and the fee is $2999 with a 4-payment plan and a possible discount for direct
payment. Once you complete the program, if you are ready, we will help you with resume
preparation, interview preparation and job placement for free. If you are not ready, then we
will train you for another 30 weeks for free.

The BBDS program was built by identifying critical skills that hiring managers are asking for, with
actual tools used by analysts, delivered to you with 100% online training, including
practical labs where you get to use platforms hands-on.

Made for the Modern Student

No stuffy classrooms, outdated textbooks, or overpriced bootcamps. This program is made
for the forward-looking learner. You will get a deep dive into not only important skills, but
strategies of how to use them
Program Overview

1 - Program Quality
- The program has been acquired by a university in Latin America and it has been
transformed into a postgraduate diploma in Data Science & Machine Learning

2 - Program training material for this course is sourced from
• 7 years of Data Science industry experience
• 4 years of teaching this Data Science program, leading to a refined curriculum
• Columbia University Master's degree in Machine Learning and Applied Data Science
• Maryland University Master's degree in Data Science and Analytics

3 - What is New in Batch 12?


• Batch 12 is 29 weeks compared to batch 11 (18 weeks)
• Migrated to Canvas as the LMS
• Collaboration with DataCamp : 85+ mandatory videos & 350+ mandatory exercises
• Digital blockchain Certification of Completion
Program Overview
4 - Program Highlights:
• Hybrid classes (Virtual and Physical)
• Flexible schedule (Evenings and Weekends)
• 22+ Group Projects (R & Python)
• 1 Individual Capstone Project
• Extensive Live Online Training
• Instructor-Led Course
• Training Video Recordings
• Quality Training Materials
• Two-Way Interactive Sessions
• Job Oriented Training
• Mock Exam/ Graded Assessments
• Professional Certificate
• Interview Prep, Job Placement and Placement Guidance
• Repeat any time at no additional cost
• Extra help if needed 24/7

5 - Weekly Schedule:
• Saturday at 9:00 AM EST to 12:00 PM EST
• Monday at 8:00 PM EST to 11:00 PM EST
• Wednesday at 8:00 PM EST to 11:00 PM EST
• Thursday at 8:00 PM EST to 11:00 PM EST
Machine Learning Types & Techniques

• Factor Analysis – Dimensionality Reduction: PCA, Kernel PCA, LDA, T-SNE, Locally L. Embedding, SVD - LSA, Matrix Factorization
• Supervised Learning – Classification: Decision Tree, Naïve Bayes, Logistic Regression, SVM, Kernel SVM, KNN, Perceptron (MLP)
• Supervised Learning – Regression: Simple Linear, Multiple Linear, Polynomial, LS/Lasso, Ridge/ElN, Decision Tree, SVR, Partial L.S (PLS)
• Unsupervised Learning – Clustering: K-Means, HCA, Bisecting K-Means, Fuzzy C-Means, Mean Shift, Ex. Maximization, Anomalies Detection, Clustering Anomaly
• Unsupervised Learning – Pattern Search / Association Rules: Apriori, Eclat, FP-Growth, A/B Testing, Recommender Systems
• NLP – Deep Learning – Neural Nets: ANN, Convolutional Net (CNN), Recurrent Net (RNN), Generative Adversarial, Autoencoders (seq2seq), NLP, Text Analysis, Topic Modeling, Deep Learning for NLP, Deep Learning for TM
• Time Series Analysis: AR – MA, EST, ARMA, ARIMA
• Reinforcement Learning: Genetic Algorithm, Q-learning, SARSA, A3C, Deep Q-Network
• Ensemble Methods: Random Forest, Bagging, Boosting, Ada-Boost, G. Boosting, LightGBM, XGBoost, CatBoost, Stacking
• Evaluation – Assessment – Optimization: CM – ROC – R2 – MSE, Cross V. - Grid S.
Business Understanding
• Determine Business Objectives
• Frame the Problem
• Assess Feasibility
• Define Success Measurements
• Identify Target Variables (Y)
• Identify Analytical Approach
• Identify Deployment Plan
• Produce Project Plan
• Identify the team & Stakeholders

Data Understanding
• Collect Initial Data
• Install & Import Packages
• Read the Data
• Data Manipulation & Wrangling
• Exploratory Data Analysis (EDA)
• Data Visualization
• Statistical Analysis

Data Preparation
• Transform/Fix Target Variable
• Redundant & Duplicates
• Data Quality Audit (Missing Values)
• Data Quality Audit (Outliers)
• Data Quality Audit (Cardinality Check)
• Data Aggregation / Binning
• Data Transformation
• Feature Engineering
• Data Normalization
• Data Factorization
• Data Binarization
• Data Standardizing
• Data Correlations
• Data Conversion
• Data Decomposition
• Feature Selections (Importance, Low variance, PCA)

Modeling
• Design Features
• Select The Model
• Split Data
• Data Scaling
• Dummy Model
• Build Model
• Fit Model (Train)
• Predict (Test)
• Assess & Evaluate

Optimization
• Model Selection
• Model Optimization
• Parameters Tuning

Deployment
• Planning Deployment
• Monitoring & Maintenance
• Final Report
• Lessons Learned

Analytics Base Table (ABT) | Code Book | Quality Report | Data Version 2/3/4 | Best Model | Best Parameters | ROI
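
The modeling steps named above (split the data, fit on the training set, predict on the test set, assess with MSE and R2) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the data is synthetic and noise-free (y = 2x + 1), the model is a simple least-squares line, and all function names are hypothetical. A real project would more likely use a library such as scikit-learn.

```python
import random


def split_data(xs, ys, test_ratio=0.25, seed=0):
    """Shuffle indices and split the data into train and test portions."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [xs[i] for i in train_idx]
    y_train = [ys[i] for i in train_idx]
    x_test = [xs[i] for i in test_idx]
    y_test = [ys[i] for i in test_idx]
    return x_train, y_train, x_test, y_test


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, xs):
    """Apply a fitted (slope, intercept) pair to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Synthetic, noise-free data (an assumption for illustration): y = 2x + 1.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]

x_train, y_train, x_test, y_test = split_data(xs, ys)  # Split Data
model = fit_line(x_train, y_train)                     # Fit Model (Train)
y_pred = predict(model, x_test)                        # Predict (Test)
print("slope, intercept:", model)                      # Assess & Evaluate
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2:", r_squared(y_test, y_pred))
```

Evaluating on held-out test data rather than on the training data is what makes the MSE and R2 figures an estimate of how the model would behave on new observations.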
Certificate of Completion
Q&A
BIG BANG DATA SCIENCE SOLUTIONS

LEARN . ACHIEVE. STANDOUT
