Python For Data Science and Machine Learning


Rod Salvador
Senior Data Scientist
Reed Elsevier Philippines

https://www.extremetech.com/extreme/319005-the-day-i-learned-what-data-science-is
Contents:
• Fundamentals of Data Science
• Data Science Workflow
• Applications of Data Science
• Tools for Data Science
• Hands-on Activities
  • Data Engineering
  • Exploratory Data Analysis (EDA)
  • Data Preprocessing/Cleansing
  • Machine Learning Modeling
  • Data Visualization
• Q&A
Fundamentals of Data Science

Data science combines multiple fields, including mathematics, computer science, and domain expertise to extract value from data.

It encompasses preparing data for analysis, including cleansing, aggregating, and manipulating the data to perform advanced data analysis, machine learning, visualization, and deployment [1].
Applications of Data Science

Optimize campaign efforts by analyzing which platforms are heavily used and rarely used by our end users.

Improve sales by creating targeted recommendations for customers based on previous purchases and spending habits.

Determine customer churn by analyzing data from profiles, marketing interactions, sales history, and surveys so sales and marketing can take action to retain them.

Improve events experience by analyzing the sentiment of exhibitors, visitors, and hosted buyers based on open text survey responses.

Improve patient diagnoses by analyzing medical test data and reported symptoms so doctors can diagnose diseases earlier and treat them more effectively.

Improve efficiency by analyzing traffic patterns, weather conditions, and other factors so logistics companies can improve delivery speeds and reduce costs.

Forecast the growth of COVID-19 cases in a particular region, country, continent, etc.

Detect fraud in financial services by recognizing suspicious behaviors and anomalous actions.
Data Science Workflow
Tools for Data Science
Programming Languages

Python

Java

JavaScript

C/C++

SQL
Libraries

Data Engineering: requests, selenium, pyodbc, boto3, json5, beautifulsoup4, awswrangler, etc.

Data Analysis/Cleaning: numpy, pandas, pandas profiling, scipy, etc.

Machine Learning: scikit-learn, tensorflow, keras, pytorch, TPOT, etc.

Data Visualization: matplotlib, ggplot, d3.js, seaborn, plotly, etc.

NLP: nltk, textblob, twython, huggingface, etc.

Automation: pyautogui, selenium, etc.


Other Tools

IDE: JupyterLab/Jupyter Notebook via Anaconda Navigator, VS Code, Sublime Text, PyCharm, etc.

Data Sources: Kaggle, Google open datasets, KDnuggets, NASA, etc.

Research: arxiv.org, paperswithcode.com, Google Scholar, etc.

Cloud/Distributed Computing: GCP, Azure, AWS, Databricks, Hadoop, Spark, etc.

Version Control: Git/GitHub, Bitbucket, Subversion, etc.

Deployment: Heroku, Streamlit, Flask, Django, FastAPI, Docker, Kubernetes, Jenkins, etc.
Data Engineering

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.

The ultimate goal is to make data accessible so that organizations can use it to evaluate and optimize their performance [2].

What’s the difference between a data scientist/analyst and a data engineer?

Data scientists and data analysts analyze data sets to glean knowledge and insights.

Data engineers build systems for collecting, validating, and preparing that high-quality data [2].

DEMO 1: Collect data from an API endpoint using requests library
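
The demo code itself is not part of these slides. As a rough sketch of what collecting data from an API endpoint with requests might look like, the helper below issues a GET request and returns the parsed JSON body. The function name fetch_json, its parameters, and the optional session argument are illustrative assumptions, not the presenter's code.

```python
import requests


def fetch_json(url, params=None, timeout=10, session=None):
    """GET `url` and return the decoded JSON body.

    `session` is an optional object with a requests-compatible .get()
    method (for example a requests.Session); it is a hypothetical
    convenience so connections can be reused or swapped out in tests.
    """
    http = session or requests
    response = http.get(url, params=params, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.json()
```

A call would look like fetch_json("https://api.example.com/items", params={"page": 1}), where the URL is a placeholder for whichever endpoint is being collected from. Setting a timeout and calling raise_for_status() keeps a slow or failing endpoint from silently producing bad data.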


Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods [3].

It helps determine how to manipulate data sources to get the answers you need, making it easier to:
1. discover patterns
2. spot anomalies
3. test a hypothesis
4. check assumptions

DEMO 2: Perform EDA using pandas profiling and create exploratory visuals using matplotlib
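
The demo notebook is not included here. The sketch below shows the same kind of first-pass checks using plain pandas (summary statistics, missing values, category counts, a correlation) instead of pandas profiling, followed by one exploratory scatter plot in matplotlib. The data frame and its column names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Small made-up data set standing in for real customer data.
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46, 29],
    "income": [28000, 52000, 47000, 91000, 76000, 39000],
    "segment": ["a", "b", "b", "c", "c", "a"],
})

summary = df.describe()                          # count, mean, std, quartiles per numeric column
missing = df.isna().sum()                        # missing values per column
segment_counts = df["segment"].value_counts()    # frequency of each category
age_income_corr = df["age"].corr(df["income"])   # Pearson correlation

print(summary)
print(missing)
print(segment_counts)
print(f"age/income correlation: {age_income_corr:.3f}")

# One exploratory visual: does income rise with age?
fig, ax = plt.subplots()
ax.scatter(df["age"], df["income"])
ax.set_xlabel("age")
ax.set_ylabel("income")
ax.set_title("Income vs. age (made-up data)")
```

In a notebook the figure displays inline; in a script, fig.savefig("income_vs_age.png") would write it to disk.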
Data Preprocessing/Cleansing

Data preprocessing is the process of transforming raw data into an understandable format.

The quality of the data should be checked before applying machine learning or data mining algorithms [4].

Characteristics of dirty data:

1. Incomplete data (e.g., missing/null values)
2. Duplicates
3. Inconsistent data (e.g., data types, data versions)
4. Outliers
5. Outdated data
6. Inaccurate data
7. Insecure data

DEMO 3: Cleanse tabular data using numpy and pandas
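
The demo data is not reproduced in the slides. As a minimal sketch of cleansing tabular data with numpy and pandas, the example below handles four of the dirty-data traits listed earlier: duplicates, inconsistent types and formatting, outliers, and missing values. The table, the 0 to 120 plausible-age range, and the choice of median imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up raw table with a duplicate row, ages stored as text,
# a missing age, an implausible age, and inconsistent city spelling.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "age": ["34", "41", "41", None, "29", "350"],
    "city": ["Manila", "cebu", "cebu", "Davao", " manila ", "Cebu"],
})

clean = raw.drop_duplicates().copy()                          # duplicates

clean["age"] = pd.to_numeric(clean["age"], errors="coerce")   # inconsistent types
clean["city"] = clean["city"].str.strip().str.title()         # inconsistent formatting

# Outliers: treat ages outside an assumed plausible range as missing.
clean.loc[~clean["age"].between(0, 120), "age"] = np.nan

# Incomplete data: fill missing ages with the median of the remaining values.
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```

Whether to impute, drop, or flag a bad value depends on the data set and the downstream model; median imputation is only one option.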


Break!

10 minutes
Introduction to Machine Learning

Machine learning is defined as the ability of a machine to learn from data without being explicitly programmed.

Machine learning is best used for…

• Problems for which existing solutions require a lot of hand-tuning or long lists of rules.
• Complex problems for which there is no good solution at all using a traditional approach.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.

Machine Learning Workflow



Types of Machine Learning

Supervised learning is a type of machine learning that requires both input (features) data and
output (label) data. The goal is to find a mapping between the input and the output data.

https://ai.plainenglish.io/introduction-to-machine-learning-2316e048ade3

Unsupervised learning is a type of machine learning that only requires input data. The goal is to
find similarities, differences, and patterns in the data.

https://towardsdatascience.com/supervised-vs-unsupervised-learning-in-2-minutes-72dad148f242

Tasks under supervised learning

https://medium.com/big-data-at-berkeley/choosing-fine-tuning-your-machine-learning-model-8c28fc1bd2fc

Tasks under unsupervised learning

https://www.reddit.com/r/datascience/comments/d6buto/kmeans_be_like_mine_mine_mine/
https://towardsdatascience.com/dimensionality-reduction-cheatsheet-15060fee3aa

DEMO 4: Supervised learning using scikit-learn library
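
The demo notebook is not included in the slides. As a minimal sketch of supervised learning with scikit-learn, the example below learns a mapping from features to labels on the bundled iris data set and measures accuracy on held-out rows. The data set, the logistic regression model, and the 75/25 split are illustrative choices, not necessarily those used in the session.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features (X) are flower measurements; labels (y) are the species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # learn the input-to-output mapping

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

This is a classification task because the label is a category; for a numeric label the same fit/predict pattern applies with a regressor and a regression metric.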



DEMO 5: Unsupervised learning using scikit-learn library
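
The demo notebook is not included in the slides. As a minimal sketch of unsupervised learning with scikit-learn, the example below covers clustering (k-means) and dimensionality reduction (PCA) on synthetic, unlabeled points. The generated data and the choice of three clusters are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 4-dimensional points; the true group labels are discarded
# because unsupervised methods only see the input data.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

# Clustering: group similar points together.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Dimensionality reduction: project to 2 components, e.g. for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```

With real data the number of clusters is usually unknown and has to be chosen, for example by comparing inertia or silhouette scores across several values.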


Data Visualization

Data visualization is the graphical representation of information and data.

By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data [5].

DEMO 6: Data visualization using seaborn and plotly


Q&A
References

1. https://www.oracle.com/ph/data-science/what-is-data-science/
2. https://www.coursera.org/articles/what-does-a-data-engineer-do-and-how-do-i-become-one
3. https://www.ibm.com/cloud/learn/exploratory-data-analysis
4. https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/
5. https://www.tableau.com/learn/articles/data-visualization
Thank you

rcsalvadorjr@gmail.com
