Python For Data Science and Machine Learning

Rod Salvador
Senior Data Scientist
Reed Elsevier Philippines
https://www.extremetech.com/extreme/319005-the-day-i-learned-what-data-science-is
Contents:
• Fundamentals of Data Science
• Data Science Workflow
• Applications of Data Science
• Tools for Data Science
• Hands-on Activities
Data Engineering
Exploratory Data Analysis (EDA)
Data Preprocessing/Cleansing
Machine Learning Modeling
Data Visualization
• Q&A
Fundamentals of Data Science
Data science combines multiple fields, including mathematics, computer science, and domain expertise, to extract value from data.
It encompasses preparing data for analysis, including cleansing, aggregating, and manipulating the data to perform advanced data analysis, machine learning, visualization, and deployment [1].
Applications of Data Science
Optimize campaign efforts by analyzing which platforms end users use heavily and which they rarely use.
Improve sales by creating targeted recommendations for customers based on previous purchases and spending habits.
Identify customers likely to churn by analyzing data from profiles, marketing interactions, sales history, and surveys so sales and marketing can act to retain them.
Improve the event experience by analyzing the sentiment of exhibitors, visitors, and hosted buyers in open-text survey responses.
Applications of Data Science
Improve patient diagnoses by analyzing medical test data and reported symptoms so doctors can diagnose diseases earlier and treat them more effectively.
Improve efficiency by analyzing traffic patterns, weather conditions, and other factors so logistics companies can improve delivery speeds and reduce costs.
Forecast the growth of COVID-19 cases in a particular region, country, or continent.
Detect fraud in financial services by recognizing suspicious behaviors and anomalous actions.
Data Science Workflow
Tools for Data Science
Programming Languages
Python
Java
JavaScript
C/C++
SQL
Libraries
Data Engineering: requests, selenium, pyodbc, boto3, json5, beautifulsoup4, awswrangler, etc.
Data Analysis/Cleaning: numpy, pandas, pandas profiling, scipy, etc.
Machine Learning: scikit-learn, tensorflow, keras, pytorch, TPOT, etc.
Data Visualization: matplotlib, ggplot, d3.js, seaborn, plotly, etc.
NLP: nltk, textblob, twython, huggingface, etc.
Automation: pyautogui, selenium, etc.

Other Tools
IDE: JupyterLab/Jupyter Notebook via Anaconda Navigator, VS Code, Sublime Text, PyCharm, etc.
Data Sources: Kaggle, Google open datasets, KDnuggets, NASA, etc.
Research: arxiv.org, paperswithcode.com, Google Scholar, etc.
Cloud/Distributed Computing: GCP, Azure, AWS, Databricks, Hadoop, Spark, etc.
Version Control: Git/GitHub, Bitbucket, Subversion, etc.
Deployment: Heroku, Streamlit, Flask, Django, FastAPI, Docker, Kubernetes, Jenkins, etc.
Data Engineering
Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.
The ultimate goal is to make data accessible so that organizations can use it to evaluate and optimize their performance [2].
Data Engineering
What’s the difference between a data scientist/analyst and a data engineer?
Data scientists and data analysts analyze data sets to glean knowledge and insights.
Data engineers build systems for collecting, validating, and preparing that high-quality data [2].
Data Engineering
DEMO 1: Collect data from an API endpoint using the requests library
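The slides do not include the demo code, so the following is only a minimal sketch of what collecting data from an API endpoint with requests might look like. The helper name fetch_json, its parameters, and the endpoint URL in the comment are assumptions made for illustration; they do not come from the original demo.

```python
import requests


def fetch_json(url, params=None, timeout=10, session=None):
    """Send a GET request and return the decoded JSON body.

    `fetch_json` is a hypothetical helper written for this sketch.
    """
    # Use a caller-supplied session if given (useful for connection reuse),
    # otherwise fall back to the module-level requests API.
    http = session or requests
    response = http.get(url, params=params, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of failing silently.
    response.raise_for_status()
    return response.json()


# Example call (hypothetical endpoint, shown as a comment so nothing runs on import):
# records = fetch_json("https://api.example.com/v1/records", params={"limit": 5})
```

Setting a timeout and calling raise_for_status() are common safeguards, so that a slow or failing endpoint surfaces as an error rather than as silently missing data.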

Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods [3].
It helps determine how to manipulate data sources to get the answers you need, making it easier to:
1. discover patterns
2. spot anomalies
3. test a hypothesis
4. check assumptions
Exploratory Data Analysis
DEMO 2: Perform EDA using pandas profiling and create exploratory visuals using matplotlib
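The demo itself is not reproduced in the slides. As a rough sketch, the same kind of first look at a dataset can be done with plain pandas summaries (used here as a stand-in for a pandas profiling report) plus two matplotlib plots. The dataset and column names below are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so the script runs without a display.
import matplotlib.pyplot as plt
import pandas as pd

# Small made-up dataset, for illustration only.
df = pd.DataFrame(
    {
        "age": [23, 35, 41, 29, 52, 35, None, 47],
        "monthly_spend": [120.0, 340.5, 290.0, 150.0, 980.0, 310.0, 205.0, 400.0],
        "segment": ["A", "B", "B", "A", "C", "B", "A", "C"],
    }
)

summary = df.describe()                         # Summary statistics for numeric columns.
missing = df.isna().sum()                       # Missing values per column.
segment_counts = df["segment"].value_counts()   # Frequency of each category.

# Two exploratory visuals: a distribution and a relationship between variables.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df["monthly_spend"], bins=5)
axes[0].set_title("Distribution of monthly_spend")
axes[1].scatter(df["age"], df["monthly_spend"])
axes[1].set_title("age vs. monthly_spend")
fig.tight_layout()
# In a notebook, plt.show() would display the figure; fig.savefig(...) would write it to disk.
```

The histogram helps spot anomalies such as the unusually high spend value, while the summary tables support checking assumptions about ranges and missing data.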
Data Preprocessing/Cleansing
Data preprocessing is the process of transforming raw data into an understandable format.
The quality of the data should be checked before applying machine learning or data mining algorithms [4].
Characteristics of dirty data:
1. Incomplete data (e.g., missing/null values)
2. Duplicates
3. Inconsistent data (e.g., data types, data versions)
4. Outliers
5. Outdated data
6. Inaccurate data
7. Insecure data
Data Preprocessing/Cleansing
DEMO 3: Cleanse tabular data using numpy and pandas
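The original demo code is not part of the slides. The sketch below shows one possible way to handle several of the dirty-data characteristics listed earlier (duplicates, inconsistent types, missing values, outliers) with numpy and pandas. The table, its column names, the median imputation, and the 1.5 x IQR outlier rule are illustrative assumptions, not the author's method.

```python
import numpy as np
import pandas as pd

# Made-up raw table containing a duplicate row, numbers stored as text,
# missing values, and one extreme value.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4, 5],
        "age": ["34", "45", "45", None, "29", "51"],
        "monthly_spend": [250.0, 310.0, 310.0, 275.0, np.nan, 9000.0],
    }
)

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Fix inconsistent types: convert age from text to numbers (invalid values become NaN).
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# 3. Fill missing values with the column median (one simple imputation choice).
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["monthly_spend"] = clean["monthly_spend"].fillna(clean["monthly_spend"].median())

# 4. Flag outliers with the 1.5 * IQR rule rather than deleting them outright.
q1 = clean["monthly_spend"].quantile(0.25)
q3 = clean["monthly_spend"].quantile(0.75)
iqr = q3 - q1
clean["is_outlier"] = (clean["monthly_spend"] < q1 - 1.5 * iqr) | (
    clean["monthly_spend"] > q3 + 1.5 * iqr
)

print(clean)
```

Flagging outliers instead of dropping them keeps the decision visible: an extreme value may be a data-entry error or a genuine high-value customer, and that call usually needs domain knowledge.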

Break!
10 minutes
Introduction to Machine Learning
Machine learning is defined as the ability of a machine to learn from data without being explicitly programmed.
Machine learning is best used for…
• Problems for which existing solutions require a lot of hand-tuning or long lists of rules.
• Complex problems for which there is no good solution at all using a traditional approach.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.
Machine Learning Workflow

Types of Machine Learning
Supervised learning is a type of machine learning that requires both input data (features) and output data (labels). The goal is to find a mapping between the input and the output data.
https://ai.plainenglish.io/introduction-to-machine-learning-2316e048ade3
Types of Machine Learning
Unsupervised learning is a type of machine learning that only requires input data. The goal is to
find similarities, differences, and patterns in the data.
https://towardsdatascience.com/supervised-vs-unsupervised-learning-in-2-minutes-72dad148f242
Tasks under supervised learning
https://medium.com/big-data-at-berkeley/choosing-fine-tuning-your-machine-learning-model-8c28fc1bd2fc
Tasks under unsupervised learning
https://www.reddit.com/r/datascience/comments/d6buto/kmeans_be_like_mine_mine_mine/
https://towardsdatascience.com/dimensionality-reduction-cheatsheet-15060fee3aa
Introduction to Machine Learning
DEMO 4: Supervised learning using the scikit-learn library
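The slides do not show the demo code. As a minimal sketch of supervised learning with scikit-learn, the example below learns a mapping from features to labels on the bundled Iris dataset. The choice of dataset, of logistic regression as the model, and of an 80/20 split are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features (X) and labels (y): supervised learning needs both.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows so the model is evaluated on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit a simple classifier on the training split only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict labels for the held-out rows and compare with the true labels.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Evaluating on a held-out split, rather than on the training data, gives a more honest estimate of how the model would behave on new inputs.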

Introduction to Machine Learning
DEMO 5: Unsupervised learning using the scikit-learn library
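The demo code is not included in the slides. The sketch below illustrates two common unsupervised tasks with scikit-learn, clustering with k-means and dimensionality reduction with PCA, using only the input features of the Iris dataset. The dataset, the choice of three clusters, and the scaling step are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Only the input features are used; the labels are deliberately ignored.
X, _ = load_iris(return_X_y=True)

# Put every feature on the same scale so no single column dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# Clustering: group similar rows into 3 clusters (the number is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Dimensionality reduction: project the 4 features onto 2 components for plotting.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

print(cluster_labels[:10])
print(X_2d[:3])
```

The cluster numbers carry no inherent meaning; they only say which rows the algorithm considers similar, and interpreting the groups is left to the analyst.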

Data Visualization
Data visualization is the graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data [5].
Data Visualization
DEMO 6: Data visualization using seaborn and plotly

Q&A
References
1. https://www.oracle.com/ph/data-science/what-is-data-science/
2. https://www.coursera.org/articles/what-does-a-data-engineer-do-and-how-do-i-become-one
3. https://www.ibm.com/cloud/learn/exploratory-data-analysis
4. https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/
5. https://www.tableau.com/learn/articles/data-visualization
Thank you
rcsalvadorjr@gmail.com
