Download as pdf or txt
Download as pdf or txt
You are on page 1of 34

4/4/2019

Introduction
to
Data Science

Disclaimer: This material is protected under copyright act AnalytixLabs ©, 2011-2016. Unauthorized use and/ or duplication of this material or any part of this material
including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal actions

Introduction to Data Science

1
4/4/2019

What is Data Science?

“To gain insights into data through


computation,statistics,and visualization.”

QuoraThreads for Expert Definitions


 What is Data Science?
 What does a Data Scientist do?

Data Science is Process

 Ask an interesting question

 Get the data

 Explore the data

 Model the data

 Communicate and visualize your results Data Science

2
4/4/2019

Data Science is Multidisciplinary

 The Scientific Method (wiki)


 Programming
 Databases
 Statistics
 Machine Learning
 Domain Knowledge

Data Science is Multidisciplinary

3
4/4/2019

Science Paradigm

Why Data Science?

• The ability to take data – to be able to understand it, to process it, to


extract value from it, to visualize it, to communicate it's going to be a
hugely important skill in the next decades, not only at the professional
level but even at the educational level for elementary school kids, for
high school kids, for college kids. Because now we really do have
essentially free and ubiquitous data.”

• – HalVarian

4
4/4/2019

Who is Data Scientist?

“A data scientist… excels at analyzing data, particularly large amounts of data, to


help a business gain a competitive edge.”

“The analysis of data using the scientific method”

“A data scientist is an individual, organization or application that performs


statistical analysis, data mining and retrieval processes on a large amount of data
to identify trends, figures and other relevant information.”

WHO’S A DATA SCIENTIST

• “A data scientist is someone who knows more statistics than a


computer scientist and more computer science than a statistician.”
- Josh Blumenstock

“Data Scientist = statistician + programmer + coach +


storyteller + artist”
- Shlomo Aragmon

5
4/4/2019

WHO’S A DATA SCIENTIST

Who is Data Scientist?

6
4/4/2019

Who’s a Data Scientist?

What does a Data Scientist Do?

OSEMN Things!
Obtain data

Scrub data
BUILD
DATA DESCRIPTIVE
PREDICTIVE
Explore data

PRODUCTS PRESCRIPTIVE Build Models


tools built with data
iNterpret results
to inform decision making
Hence the acronym
O-S-E-M-N
(pronounced, ‘awesome’)

7
4/4/2019

.. And This

Hypothesis Data Machine Parallel


Testing Visualization Learning Computing

Deep Database
Coding Optimization
Learning Querying

Key Concepts

 use many data sources


 understand how the data were collected (sampling is essential)
 weight the data thoughtfully (not all polls are equally good)
 use statistical models (not just hacking around in Excel)
 understand correlations (e.g., states that trend similarly)
 think like a Bayesian, check like a frequentist (reconciliation)
 have good communication skills(What does a 60% probability even
mean?
 visualize, validate, and understand the conclusions

8
4/4/2019

Common Challenges

 Big (massive) data (millions of


users, billions of events)
 curse of dimensionality
(hundreds of variables)
 missing data (not missing at
random)
 need to avoid overfitting (test
data vs. training data)

Common Tasks

 data munging/scraping/sampling/cleaning in order to get an


informative, manageable data set;
 data storage and management in order to be able to access data
quickly and reliably during subsequent analysis;
 exploratory data analysis to generate hypotheses and intuition
about the data;
 prediction based on statistical tools such as regression,
classification, clustering, forecasting and optimization; and
 communication of results through visualization, stories, and
interpretable summaries.

9
4/4/2019

Tools for the course

Tools for the course - Python

10
4/4/2019

Python Is IOSEMN

Inquire

Obtain

Scrub

Explore

Model

iNterpret
js
Outsider

Python Data Science Ecosystem

11
4/4/2019

Python Data Science Ecosystem

Packages - Data Manipulation Packages - Modelling


NumPy Low level array operations SciPy FFTs, integration, other general algorithms

Data tables and in-memory manipulation Statistical distributions and tests

Dask Parallel out-of-core array manipulation Machine Learning pipelines

High level interface for databases and PyMC3 Bayesian Probabilistic Programming
different computational backends
24
Packages - Visualisation
IPython Notebooks
Widely used and pow erful plotting package
seaborn Opinionated but beautiful data visualisations

Bokeh Interactive plotting w ith server option

Graphics API w ith translation betw een


languages (e.g. Python -> D3)

12
4/4/2019

Packages - Description
• NumPy
NumPy is a low level libra ry wri tten in C (and FORTRAN) for high level mathema ti cal functions . NumPy cleverl y
overcomes the problem of running slower algori thms on Python by using mul tidimensional a rra ys and functions tha t
operate on a rra ys . Any algori thm can then be expressed as a functi on on a rra ys , allowing the algori thms to be run
qui ckly.
NumPy is pa rt of the Sci Py project, and is released as a separa te libra ry so people who onl y need the basic
requirements can use i t without i nstalling the rest of Sci Py.
NumPy i s compatible with Python versions 2.4 through to 2.7.2 a nd 3.1+

• SciPy
Sci Py is a libra ry tha t uses NumPy for more ma thema tical functions. Sci Py uses NumPy a rra ys as the basic data
s tructure, a nd comes wi th modules for va ri ous commonl y used tasks in s cientifi c programming, including linea r
a l gebra, i ntegration (calculus), ordinary di fferential equation solving a nd signal processing.

• Numba
Numba is a NumPy awa re Python compiler (just-in-time (JIT) specializing compiler) whi ch compiles annotated Python
(a nd NumPy) code to LLVM (Low Level Vi rtual Ma chine) through special decorators. Briefl y, Numba uses a sys tem tha t
compi les Python code with LLVM to code which ca n be natively executed at runtime.

Packages - Description
• scikit-learn
scikit-learn is a Python module for machine learning built on top of SciPy and distributed under
the 3-Clause BSD license.

• Pandas
Pandas is data manipulation library based on Numpy which provides many useful functions for
accessing, indexing, merging and grouping data easily. The main data structure (DataFrame) is
close to what could be found in the R statistical package; that is, heterogeneous data tables with
name indexing, time series operations and auto-alignment of data.

• Matplotlib
Matplotlib is a flexible plotting library for creating interactive 2D and 3D plots that can also be
saved as manuscript-quality figures. The API in many ways reflects that of MATLAB, easing
transition of MATLAB users to Python. Many examples, along with the source code to re-create
them, are available in the matplotlib gallery.

13
4/4/2019

Packages - Description
• Rpy2
Rpy2 is a Python binding for the R statistical package allowing the execution of R
functions from Python and passing data back and forth between the two environments.
Rpy2 is the object oriented implementation of the Rpy bindings.

• PsycoPy
PsychoPy is a library for cognitive scientists allowing the creation of cognitive psychology
and neuroscience experiments. The library handles presentation of stimuli, scripting of
experimental design and data collection.

Packages - Description
• datetime (or) time
Date and time functions to manage date and time data
• math
Core math functions and the constants like pi, e etc.
• pickle
Serializes objects to file
• os (or) os.path
Operating system interfaces.
• re
A library of perl-like regular expression operations
• string
Useful constants and classes related to strings.
• sys
System parameters and functions

14
4/4/2019

Who is using Python?

Why should I become a Data Scientist?

DEMAND & SUPPLY

"We project a need for 1.5 million additional managers and analysts in the United States who can ask the
right questions and consume the results of the analysis of Big Data effectively."

"A significant constraint on realizing value from BigData will be a shortage of talent, particularly of people
with deep expertise in statistics and machine learning, and the managers and analysts who know how to
operate companies by using insights from Big Data."

Big data: The next frontier for innovation, competition, and productivity, McKinsey report

"By 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million
managers and analysts capable of reaping actionable insights from the big data deluge."

Game changers: Five opportunities for US growth and renewal, McKinsey report

25

15
4/4/2019

OK. How so do I become a Data Scientist?

Rea d books on
- Sta tistics
- Ma chi ne Learning
- Progra mming
- Da ta bases

Ta ke University courses

Apply for internships to work on real-life projects

Spend hours debugging on


Sta ckOverflow

Pa rti cipate in Data Hackathons/Data


Dri ven competitions

27

What is Python?

• Programming language

• You write instructions to the computer

• Python “interpreter” runs those instructions

16
4/4/2019

Why python?

• It ’s awesome and popular!


• Free and Open Source language.
• Readable syntax.
• Great for interactive work
• Easy to learn and has an active community.
• Large amount of libraries.
• High level, general purpose.
• Backed up with fast C & Fortran numerical libraries

28

Python - Applications
Python i s a powerful multi-paradigm computer programming language. With Python, we ca n do ma ny things.
Bel ow a re some of the things that ca n be achieved using Python.

 Systems Programming: Python’s built-in i nterfaces to operating-system servi ces make it i deal for writing
porta ble, maintainable system-administration tools a nd utilities (sometimes called s hell tools). Python
progra ms ca n search files a nd directory trees, etc.

 GUIs: Python’s simplicity and ra pid turnaround makes i t a good match for graphical user
i nterface programming on the desktop. Python comes with a standard object-oriented i nterface to the Tk GUI
API ca l led tkinter (Tkinter i n 2.X) tha t allows Python programs to i mplement portable GUIs with a native l ook
a nd feel.

 Internet Scripting: Python comes wi th standard Internet modules that a llow Python programs to perform a
wi de variety of networking tasks in cl ient and s erver modes.

 Database Programming: For tra di tional database demands, there a re Python i nterfaces to all commonly used
rel ational database s ystems like Sybase, Oracle, Informix, ODBC, MySQL, Pos tgreSQL, SQLite, and more.

28

17
4/4/2019

Python - Applications
Rapid Prototyping: To Python programmers the components wri tten in Python and C l ook the same. Because
of thi s , it’s possible to prototype systems i n Python initially, a nd then move s elected components to a compiled
l a nguage s uch as C or C++ for delivery.

Numeric and Scientific Programming: Python i s also heavily used i n numeric programming, a domain that would
not tra ditionally have been considered to be i n the scope of scripting l anguages, but has grown to become one of
Python’s most compelling use cases.

Google makes extensive use of Python in its web search systems.

The other use cases a re as follows:


 The popular YouTube vi deo sharing servi ce is largely written i n Python.
 The Dropbox s torage s ervice codes, both i ts s erver a nd desktop cl ient software, i s primarily written i n
Python.
 The Ra spberry Pi single-board computer promotes Python as i ts educational l anguage.
 The wi despread BitTorrent peer-to-peer file s haring system began its life as a Python program.
 Industrial Li ght & Ma gi c, Pixar, a nd others use Python in their production of animated movies.
 Googl e’s App Engine web development framework uses Python as an application l anguage.

28

Is Python a Scripting Language?


 Python i s a general-purpose programming language that is often applied in scripting roles. It is
commonly defined a s an object-oriented scripting language, a definition that blends support for OOP
wi th a n overall orientation toward scripting roles.

 A s cri pting l anguage or s cript language is a programming l anguage that supports scripts, programs
wri tten for a special run-time environment that can i nterpret (rather than compile) a nd a utomate the
execution of tasks that could a lternatively be executed one-by-one by a human operator. Python comes
under i n this category. So i t is called a scripting language.

 Stil l, the term ‘scripting’ seems to have s tuck to Python like glue. This may be because, people often use
the word ‘script’ i nstead of ‘program’ to describe a Python code file.

28

18
4/4/2019

How to get Python & Anaconda?

How to get Python?


There are two ways to get Python.
Base Python:
 You can download Python from the www.python.org/downloads.
 Once it down loaded, you can install the python
 Ensure that you have pip installed which is the package manager for Python and will enable you to easily
install 3rd party packages that you'll need to perform data science tasks.

ANACONDA:
Free enterprise-ready cross platform Python distribution for large- scale data processing, predictive analytics,
and scientific computing.

Download here http://docs.continuum.io/anaconda/install

Features
https://www.continuum.io/why-anaconda

We use ANACONDA for our sessions

19
4/4/2019

Why ANACONDA?

Why ANACONDA?

20
4/4/2019

Remote Conda commands

Anaconda for Analytics

21
4/4/2019

How to Run Python Code?

How to run Python Code?

3 ways to run the python interpreter from the terminal window:


• type ‘python’

• type ‘ipython’

• type ‘python helloworld.py’

4th Ways is Using any IDE like Jupyter, Spider, Pycharm, Canopy,
Rodeo etc.

We are using Jupyter notebook as part of our training,

22
4/4/2019

Introduction to IPython Note Book

39

IPython Notebook
One of Python’s most useful features is its i nteractive interpreter.
It a l lows for very fast testing of ideas without the overhead of creating test files as is typi cal i n most programming
l a nguages. However, the i nterpreter s upplied with the standard Python distribution is somewhat limited for extended
i nteractive use.

Ipython:
A comprehensive environment for i nteractive and exploratory computing

Three Main Components:


An enhanced i nteractive Python shell. A decoupled two-process c communication model , which a llows for multiple clients
to connect to a computation kernel, most notably the web-based notebook. An architecture for i nteractive parallel
computing

Some of the many useful features of IPython includes:


 Comma nd history, which can be browsed with the up a nd down arrows on the keyboard.
 Ta b a uto-completion.
 In-l ine editing of code.
 Object i ntrospection, and automatic extract of documentation strings from python objects like classes a nd functions.
 Good i nteraction with operating system shell.
 Support for multiple parallel back-end processes, that can run on computing cl usters or cl oud s ervices like Amazon EC2.

23
4/4/2019

IPython Notebook

 IPython provides a rich architecture for interactive computing with:


 A powerful interactive shell.
 A kernel for Jupyter.
 Easy to use, high performance tools for parallel computing.
 Support for interactive data visualization and use of GUI toolkits.
 Flexible, embeddable interpreters to load into your own projects .

 Beyond the Terminal ...


 The REPL (read, eval, print loop) as a Network Protocol
 Kernels
 Execute Code
 Clients
 Read input
 Present Output
 Simple abstractions enable rich, sophisticated clients

IPython Notebook

The Four Most Helpful Commands


 The four most helpful commands is shown to you i n a banner, every ti me you start IPython:
Command Description
? Introduction and overvi ew of IPython’s features.
%qui ckref Qui ck reference.
hel p Python’s own help s ystem.
object? Details about object, use object?? for extra details.
Ta b Completion:
 Ta b completion, especially for attributes, i s a convenient way to explore the structure of a ny object you’re dealing
wi th. Si mply type object_name.<TAB> to view the object’s attributes. Besides Python objects a nd keywords, tab
compl etion a lso works on file and directory names

24
4/4/2019

IPython Notebook
 The %run magic command allows you to run a ny python script a nd load all of i ts data directly i nto the interactive
na mespace. Since the file is re -read from disk each time, changes you make to i t are reflected i mmediately (unlike
i mported modules, which have to be specifically reloaded). IPython also i ncludes dreload, a recursive reload function.
 %run ha s special flags for timing the execution of your scripts (-t), or for running them under the control of either
Python’s pdb debugger (-d) or profiler (-p).
 The %edit command gives a reasonable approximation of multiline editing, by i nvoking your favorite editor on the s pot.
IPython wi ll execute the code you type in there as if it were typed interactively.

Ma gi c Functions ...
 The following examples s how how to call the builtin %timeit ma gic, both in line a nd cell mode:

The builtin magics i nclude:


 Functions that work with code: %run, %edit, %save, %macro, %recall, etc.
 Functi ons which a ffect the shell: %colors, %xmode, %autoindent, %automagic, etc.
 Other functions such as %reset, %timeit, %%writefile, %load, or %paste.

IPython Notebook
Expl oring your Objects
 Typi ng object_name? will print all s orts of details a bout any object, i ncluding docstrings, function definition l ines (for
ca l l a rguments) and constructor details for cl asses.
 To get s pecific i nformation on a n object, you ca n use the magic commands % pdoc, %pdef, %psource and %pfile.

Magic Functions:
IPython has a set of predefined magic functions that you ca n ca ll with a command line s tyle syntax.
There a re two kinds of magics, line-oriented a nd cell-oriented.
 Li ne magics a re prefixed with the % character a nd work much l ike OS command -line calls: they get as an argument the
res t of the l ine, where arguments are passed without parentheses or quotes.
 Cel l magics a re prefixed with a double %%, and they a re functions that get as a n a rgument not only the rest of the line,
but a l so the lines below it i n a separate argument.
 You ca n run the s cript.py. You ca n toggle this behavi our by running the %automagic magic.
 A more detailed explanation of the magic s ystem can be obtained by ca lling %magic,
 To s ee all the available magic functions, call %lsmagic

25
4/4/2019

IPython Notebook
System Shell Commands:
To run a ny command a t the system s hell, simply prefix i t with !. You ca n ca pture the output i nto a Python l ist. To pass the
va l ues of Python va riables or expressions to system commands, prefix them with $.

System Aliases:
 It’s convenient to have aliases to the system commands you use most often.
 Thi s allows you to work seamlessly from i nside IPython with the same commands you a re used to i n your system s hell.
 IPython comes with s ome pre-defined aliases and a complete system for changing directories, both vi a a stack (%pushd,
%popd and %dhist) a nd vi a direct %cd.
 The l atter keeps a history of vi sited directories and allows you to go to any previously vi sited one.

System Shell Commands ...

IPython Notebook
History
 IPython s tores both the commands you enter, and the results i t produces. You can easily go through previous
comma nds with the up- a nd down-arrow keys, or access your history i n more sophisticated ways.
 Input a nd output history a re kept i n va riables called In a nd Out, keyed by the prompt numbers. The last three objects i n
output history a re also kept i n va riables named _, __ a nd ___.
 You ca n use the %history ma gic function to examine past input a nd output. Input history from previous sessions is saved
i n a database, and IPython ca n be configured to save output history.
 Several other magic functions can use your i nput history, i ncluding %edit, %rerun, %recall, %macro, %save and
%pa stebin.

You can use a standard format to refer to lines:

Thi s will ta ke line 3 a nd lines 18 to 20 from the current s ession, a nd l ines 1-5 from the previous s ession.

26
4/4/2019

IPython Notebook
Debugging
 After a n exception occurs, you can ca ll %debug to jump i nto the Python debugger (pdb) a nd examine the problem.
Al ternatively, i f you call %pdb, IPython will a utomatically start the debugger on any uncaught exception.
 You ca n print variables, see code, execute s tatements a nd even walk up a nd down the call stack to tra ck down the true
s ource of the problem. This can be an efficient way to develop and debug code, in many cases eliminating the need for
pri nt s tatements or external debugging tools.
 You ca n a lso step through a program from the beginning by ca lling %run -d theprogram.py.
.

Introduction to Jupyter

39

27
4/4/2019

Jupyter Notebook

Jupyter

28
4/4/2019

Jupyter

Jupyter

29
4/4/2019

Jupyter

Jupyter

30
4/4/2019

Jupyter

Jupyter

31
4/4/2019

Jupyter – Getting Started

Jupyter

32
4/4/2019

Jupyter

Exercise
1. Launch new Jupyter notebook (ipython) and save the code
2. Practice Assignment statements (create basic variables and perform mathematical operations)
3. Create markdown file with Some descriptions
4. Practice Short cuts (using Jupyter cheat sheet) like
1. run the code,
2. i ns ert/delete the cell,
3. a dd/remove the output & l i ne numbers,
4. Merge/split cells
5. Toggl e between different types of code formats, comments
6. etc

33
4/4/2019

Contact us

Vi s it us on: http://www.analytixlabs.in/

For cours e registration, please visit: http://www.analytixlabs.co.in/course-registration/

For more i nformation, please contact us: http://www.analytixlabs.co.in/contact-us/


Or ema il: info@analytixlabs.co.in
Ca l l us we would love to speak with you: (+91) 88021-73069

Joi n us on:
Twi tter - http://twitter.com/#!/AnalytixLabs
Fa cebook - http://www.facebook.com/analytixlabs
Li nkedIn - http://www.linkedin.com/in/analytixlabs
Bl og - http://www.analytixlabs.co.in/category/blog/

34

You might also like