Unit2 PDS

Python for Data Science

Unit 2 - Data Science and Python

Dr Kruti Dangarwala
CSE & IT Department
SVMIT
Unit 2 - Part I Syllabus
Discovering the match between data science and Python:
• Defining the Sexiest Job of the 21st Century, Considering
the emergence of data science, Outlining the core
competencies of a data scientist, Linking data science, big
data, and AI, Understanding the role of programming,
Creating the Data Science Pipeline, Preparing the data,
Performing exploratory data analysis, Learning from data,
Visualizing, Obtaining insights and data products,
Understanding Python's Role in Data Science, Considering
the shifting profile of data scientists, Working with a
multipurpose, simple, and efficient language, Learning to
Use Python Fast, Loading data, Training a model, Viewing a
result.
Unit 2 - Part II Syllabus
Introducing Python's Capabilities and Wonders:
• Why Python?, Grasping Python's Core Philosophy, Contributing to
data science, Discovering present and future development goals,
Working with Python, Getting a taste of the language,
Understanding the need for indentation, Working at the command
line or in the IDE, Performing Rapid Prototyping and
Experimentation, Considering Speed of Execution, Visualizing
Power, Using the Python Ecosystem for Data Science, Accessing
scientific tools using SciPy, Performing fundamental scientific
computing using NumPy, Performing data analysis using pandas,
Implementing machine learning using Scikit-learn, Going for deep
learning with Keras and TensorFlow, Plotting the data using
matplotlib, Creating graphs with NetworkX, Parsing HTML
documents using Beautiful Soup.
Data Science
• Data science is devoted to the extraction of clean information from raw data
to form actionable insights.
• There is a great deal of data out there. By 2025, it is estimated that there will be
around 175 zettabytes of data in existence (a zettabyte is a trillion
gigabytes). Data has been called the “oil of the 21st century.”
• So, what do we do with all of this data?
• How do we make it useful to us?
• What are its real-world applications?
These questions are the domain of data science.
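The zettabyte figure quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) units in which a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) unit sizes in bytes; binary units (GiB, ZiB) would differ.
GIGABYTE = 10**9
ZETTABYTE = 10**21

# Gigabytes in one zettabyte: 10**12, i.e. one trillion.
gb_per_zb = ZETTABYTE // GIGABYTE

# The 175 ZB estimate quoted above, expressed in gigabytes.
estimated_gb = 175 * gb_per_zb

print(f"1 ZB = {gb_per_zb:,} GB")
print(f"175 ZB = {estimated_gb:,} GB")
```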
WHAT IS DATA SCIENCE?

• Data science is the process of using tools and
techniques to draw actionable information out of
huge volumes of noisy data.
• Data science is used for everything from business
decision making to sports analytics to insurance risk
assessment.
INTRODUCTION
• The data science field is growing rapidly and
revolutionizing so many industries.
• It has incalculable benefits in business, research and our
everyday lives.
• Your route to work, your most recent search engine query for
the nearest coffee shop, your Instagram post about what you
ate, and even the health data from your fitness tracker are all
important to different data scientists in different ways.
• Sifting through massive data lakes, looking for connections
and patterns, data science is responsible for bringing us new
products, delivering breakthrough insights and making our
lives more convenient.
DATA SCIENCE PIPELINE
How Does Data Science Work?

In simple words, a pipeline in data science is “a set of actions which
changes the raw (and confusing) data from various sources (surveys,
feedback, lists of purchases, votes, etc.) into an understandable format
so that we can store it and use it for analysis.”
Different stages within a pipeline
1. Fetching/Obtaining the Data

This stage involves identifying data on the internet or in internal/external
databases and extracting it into useful formats.
Prerequisite skills:
Distributed storage and processing: Hadoop, Apache Spark/Flink.
Database Management: MySQL, PostgreSQL, MongoDB.
Querying Relational Databases.
Retrieving Unstructured Data: text, videos, audio files, documents.
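
As an illustration of querying a relational database during the fetching stage, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the purchases table, and the fetch_totals helper are hypothetical stand-ins for a real data source; a production pipeline would connect to a server such as MySQL or PostgreSQL instead.

```python
import sqlite3

# In-memory database standing in for an internal data source (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("alice", "coffee", 3.5),
        ("bob", "tea", 2.0),
        ("alice", "bagel", 4.0),
    ],
)


def fetch_totals(connection):
    """Return total spend per customer as a list of dicts."""
    cursor = connection.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM purchases GROUP BY customer ORDER BY customer"
    )
    # cursor.description holds one tuple per column; the name is element 0.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


totals = fetch_totals(conn)
conn.close()
print(totals)
```

Returning plain dictionaries keeps the extracted rows in a format that the later cleaning and analysis stages can consume without knowing about the database.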

2. Scrubbing/Cleaning the Data


This is the most time-consuming stage of the pipeline and demands the most effort. It is
divided into two steps:
Examining Data:
identifying errors
identifying missing values
identifying corrupt records
Cleaning of data:
replace or fill missing values/errors
Prerequisite skills:
Coding languages: Python, R.
Data-manipulation tools: Python libraries (NumPy, pandas), R.
Distributed processing: Hadoop MapReduce, Spark.
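The two cleaning steps described above (examining, then cleaning) can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the is_valid_age rule, and the choice of filling with the mean are all hypothetical, and in practice a library such as pandas would usually do this work.

```python
from statistics import mean

# Hypothetical raw records: None marks a missing value, -1 an invalid age.
raw_records = [
    {"name": "A", "age": 25},
    {"name": "B", "age": None},
    {"name": "C", "age": -1},
    {"name": "D", "age": 35},
]


def is_valid_age(value):
    """An age is usable if it is present and non-negative."""
    return value is not None and value >= 0


# Examining: collect the names of records with missing or erroneous ages.
problem_rows = [r["name"] for r in raw_records if not is_valid_age(r["age"])]

# Cleaning: replace unusable ages with the mean of the valid ones.
fill_value = mean(r["age"] for r in raw_records if is_valid_age(r["age"]))
clean_records = [
    {**r, "age": r["age"] if is_valid_age(r["age"]) else fill_value}
    for r in raw_records
]

print(problem_rows)
print(clean_records)
```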
3. Exploratory Data Analysis
When data reaches this stage of the pipeline, it is free from errors and
missing values, and hence is suitable for finding patterns using
visualizations and charts.
Prerequisite skills:
Python: NumPy, Matplotlib, pandas, SciPy.
R: ggplot2, dplyr.
Statistics: random sampling, inferential statistics.
Data Visualization: Tableau.
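
A first pass at exploratory analysis often combines summary statistics with a quick look at the distribution. The sketch below uses only the standard library; the visits sample is invented for illustration, and the text histogram stands in for the charts that Matplotlib or Tableau would normally produce.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical cleaned sample: daily coffee-shop visits over two weeks.
visits = [12, 15, 11, 15, 18, 30, 28, 13, 15, 12, 16, 17, 29, 31]

summary = {
    "count": len(visits),
    "mean": mean(visits),
    "median": median(visits),
    "stdev": stdev(visits),
    "min": min(visits),
    "max": max(visits),
}

# Bucket values into bins of width 10 and draw a crude text histogram.
bins = Counter((v // 10) * 10 for v in visits)
for start in sorted(bins):
    print(f"{start:>2}-{start + 9:<2} {'#' * bins[start]}")

print(summary)
```

In this made-up sample the histogram shows most days falling in the 10-19 range with a smaller cluster of busier days, which is the kind of pattern this stage is meant to surface.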

4. Modeling the Data

This is the stage of the data science pipeline where machine learning
comes into play. With the help of machine learning, we create data models.
Data models are general rules in a statistical sense, which are
used as predictive tools to improve business decision-making.
Prerequisite skills:
Machine Learning: Supervised/Unsupervised algorithms.
Evaluation methods.
Machine Learning Libraries: Python (scikit-learn, NumPy).
Linear algebra and Multivariate Calculus.
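To make the idea of a data model as a "general rule" concrete, here is a minimal sketch of supervised learning with an evaluation step: a straight line fitted by ordinary least squares on training data and scored on held-out test data. The data values and function names are hypothetical, and a real project would more likely use scikit-learn than hand-written formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def mean_absolute_error(model, xs, ys):
    """Average absolute gap between predictions and observed values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: advertising spend (x) versus sales (y), roughly y = 2x + 1.
train_x, train_y = [1, 2, 3, 4, 5], [3.1, 4.9, 7.2, 8.8, 11.0]
test_x, test_y = [6, 7], [13.1, 14.8]

model = fit_line(train_x, train_y)
test_error = mean_absolute_error(model, test_x, test_y)
print(model, test_error)
```

Scoring the model on data it was not fitted to is what the "evaluation methods" skill refers to: it estimates how well the rule generalizes to new cases.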
CHOOSING A DATA SCIENCE LANGUAGE

Source: https://www.datacamp.com/blog/top-programming-languages-for-data-scientists-in-2022
5. Interpreting the Data
This stage is similar to paraphrasing your data science model. Always remember: if you can’t
explain it to a six-year-old, you don’t understand it yourself. Communication
is therefore the key. This is the most crucial stage of the pipeline, where, with the use of
psychological techniques, sound business domain knowledge, and strong
storytelling abilities, you explain your model to a non-technical audience.

Prerequisite skills:
Business domain knowledge.
Data visualization tools: Tableau, D3.js, Matplotlib, ggplot2, Seaborn.
Communication: Presenting/speaking and reporting/writing.

6. Revision
As the nature of the business changes, new features are introduced that
may degrade your existing models. Periodic reviews and updates are therefore
important from both the business’s and the data scientist’s point of view.
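One simple way to run such a periodic review is to compare a model's error on recent data with the error measured when it was deployed. The sketch below is an assumption-laden illustration, not a prescribed method: the predict function, the baseline error, the recent observations, and the 1.5x tolerance are all hypothetical.

```python
def mean_absolute_error(predict, xs, ys):
    """Average absolute gap between predictions and observed values."""
    return sum(abs(y - predict(x)) for x, y in zip(xs, ys)) / len(xs)


def needs_retraining(predict, baseline_error, new_xs, new_ys, tolerance=1.5):
    """Flag the model when error on recent data exceeds tolerance * baseline."""
    recent_error = mean_absolute_error(predict, new_xs, new_ys)
    return recent_error > tolerance * baseline_error, recent_error


def predict(x):
    """Hypothetical existing model, fitted when sales followed y = 2x + 1."""
    return 2 * x + 1


baseline_error = 0.2  # assumed error measured at deployment time

# Recent observations after a business change: the relationship has shifted.
recent_x, recent_y = [1, 2, 3, 4], [4.0, 7.1, 9.9, 13.0]

retrain, recent_error = needs_retraining(
    predict, baseline_error, recent_x, recent_y
)
print(retrain, recent_error)
```

With these made-up numbers the recent error is far above the baseline, so the check signals that the model should be reviewed and refitted.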
Python’s Role in Data Science / Why
Python for Data Science?
• Python is one of the most widely used programming languages in the field, and
most data scientists use it for data science.
• This dynamic language is easy to learn and read, so it is a good choice for
beginners.
• Python enables rapid development and can interface with high-performance
algorithms written in Fortran or C.
• For data scientists who need to incorporate statistical code into production
databases or integrate data with web-based applications, Python is often the
ideal choice.
• It is also well suited to implementing algorithms, which is something that data
scientists need to do often.
• There are also Python packages that are specifically tailored for certain
functions, including pandas, NumPy, and SciPy.
• Data scientists working on various machine learning tasks find that Python’s
scikit-learn is a useful and valuable tool.
• Matplotlib, another one of Python’s packages, is also a perfect solution for
data science projects that require graphics and other visuals.
Python Features That Captured the Imagination of the Data
Science Community
Easy to learn
• The most appealing quality of Python is that anyone who wants to learn it,
even a beginner, can do so quickly and easily. This is one of the
reasons why learners prefer Python for data science. It also works well
for busy professionals who have limited time to spend learning. Compared
with other languages, such as R, Python offers a shorter
learning curve thanks to its easy-to-understand syntax.
Scalability
• Compared with some other languages, such as R, Python excels
when it comes to scalability. It is also generally considered
faster than languages like MATLAB and Stata. It facilitates scale
because it gives data scientists flexibility and multiple ways to
approach different problems, which is one of the reasons why YouTube
migrated to the language. You can find Python across multiple
industries, powering the rapid development of applications for all
kinds of use cases.
Choice of data science libraries
• Another key benefit of using Python for data science is
access to a wide variety of data analysis and data science libraries. These
include pandas, NumPy, SciPy, statsmodels, and scikit-learn. These are just
some of the many available libraries, and the Python community continues to add to this
collection. Many data scientists who use Python find that this robust
programming language addresses a wide range of needs by offering new
solutions to problems that previously seemed unsolvable.
Python community
• One reason that Python is so well known is its community. As
the data science community continues to adopt it, more users are
volunteering to create additional data science libraries. This drives
the creation of the modern tools and advanced processing techniques
available today, which is why so many people prefer Python for data
science.
• The community is a tight-knit one, and finding a solution to a challenging
problem has never been easier. A quick internet search is often all you need to
find the answer to a question or to connect with others who
may be able to help. Programmers can also connect with their peers on
Codementor and Stack Overflow.
Graphics and visualization
• Python comes with many visualization options. Matplotlib provides
the solid foundation on which other libraries such as Seaborn,
pandas plotting, and ggplot have been built. The visualization
packages help make sense of data and create charts, graphical plots,
and web-ready interactive plots.
Bottom Line
• There is no denying that the current job market is competitive, as
the Bureau of Labor Statistics recently reported. If you’re looking
for a stable industry that isn’t going anywhere anytime soon, data
science is an excellent choice. But choosing a successful industry is
only half the battle when it comes to job security.
• There is also competition to consider, and it’s important to
remember that many qualified candidates are often competing
for the same job opening. One of the best ways to ensure you stand
out to recruiters and employers is to have the right credentials.
Earning a certification in Python for data science or another
relevant field is a good way to get your resume noticed by the
right people.
Most Popular Python Libraries for Data
Science
• Python is an interpreted, interactive, portable, and object-oriented
programming language. This open-source, general-purpose
language runs on Windows and on many Unix variants, including Linux and
macOS.
• TensorFlow
• NumPy
• SciPy
• Matplotlib
• Pandas
• Keras
• scikit-learn
• statsmodels
• Plotly
• Seaborn
