Professional Documents
Culture Documents
Introduction To Data Science - 1650687630477
Introduction To Data Science - 1650687630477
Introduction
to
Data Science
Disclaimer: This material is protected under copyright act AnalytixLabs ©, 2011-2016. Unauthorized use and/ or duplication of this material or any part of this material
including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal actions
1
4/4/2019
2
4/4/2019
3
4/4/2019
Science Paradigm
• – HalVarian
4
4/4/2019
5
4/4/2019
6
4/4/2019
OSEMN Things!
Obtain data
Scrub data
BUILD
DATA DESCRIPTIVE
PREDICTIVE
Explore data
7
4/4/2019
.. And This
Deep Database
Coding Optimization
Learning Querying
Key Concepts
8
4/4/2019
Common Challenges
Common Tasks
9
4/4/2019
10
4/4/2019
Python Is IOSEMN
Inquire
Obtain
Scrub
Explore
Model
iNterpret
js
Outsider
11
4/4/2019
High level interface for databases and PyMC3 Bayesian Probabilistic Programming
different computational backends
24
Packages - Visualisation
IPython Notebooks
Widely used and pow erful plotting package
seaborn Opinionated but beautiful data visualisations
12
4/4/2019
Packages - Description
• NumPy
NumPy is a low level libra ry wri tten in C (and FORTRAN) for high level mathema ti cal functions . NumPy cleverl y
overcomes the problem of running slower algori thms on Python by using mul tidimensional a rra ys and functions tha t
operate on a rra ys . Any algori thm can then be expressed as a functi on on a rra ys , allowing the algori thms to be run
qui ckly.
NumPy is pa rt of the Sci Py project, and is released as a separa te libra ry so people who onl y need the basic
requirements can use i t without i nstalling the rest of Sci Py.
NumPy i s compatible with Python versions 2.4 through to 2.7.2 a nd 3.1+
• SciPy
Sci Py is a libra ry tha t uses NumPy for more ma thema tical functions. Sci Py uses NumPy a rra ys as the basic data
s tructure, a nd comes wi th modules for va ri ous commonl y used tasks in s cientifi c programming, including linea r
a l gebra, i ntegration (calculus), ordinary di fferential equation solving a nd signal processing.
• Numba
Numba is a NumPy awa re Python compiler (just-in-time (JIT) specializing compiler) whi ch compiles annotated Python
(a nd NumPy) code to LLVM (Low Level Vi rtual Ma chine) through special decorators. Briefl y, Numba uses a sys tem tha t
compi les Python code with LLVM to code which ca n be natively executed at runtime.
Packages - Description
• scikit-learn
scikit-learn is a Python module for machine learning built on top of SciPy and distributed under
the 3-Clause BSD license.
• Pandas
Pandas is data manipulation library based on Numpy which provides many useful functions for
accessing, indexing, merging and grouping data easily. The main data structure (DataFrame) is
close to what could be found in the R statistical package; that is, heterogeneous data tables with
name indexing, time series operations and auto-alignment of data.
• Matplotlib
Matplotlib is a flexible plotting library for creating interactive 2D and 3D plots that can also be
saved as manuscript-quality figures. The API in many ways reflects that of MATLAB, easing
transition of MATLAB users to Python. Many examples, along with the source code to re-create
them, are available in the matplotlib gallery.
13
4/4/2019
Packages - Description
• Rpy2
Rpy2 is a Python binding for the R statistical package allowing the execution of R
functions from Python and passing data back and forth between the two environments.
Rpy2 is the object oriented implementation of the Rpy bindings.
• PsycoPy
PsychoPy is a library for cognitive scientists allowing the creation of cognitive psychology
and neuroscience experiments. The library handles presentation of stimuli, scripting of
experimental design and data collection.
Packages - Description
• datetime (or) time
Date and time functions to manage date and time data
• math
Core math functions and the constants like pi, e etc.
• pickle
Serializes objects to file
• os (or) os.path
Operating system interfaces.
• re
A library of perl-like regular expression operations
• string
Useful constants and classes related to strings.
• sys
System parameters and functions
14
4/4/2019
"We project a need for 1.5 million additional managers and analysts in the United States who can ask the
right questions and consume the results of the analysis of Big Data effectively."
"A significant constraint on realizing value from BigData will be a shortage of talent, particularly of people
with deep expertise in statistics and machine learning, and the managers and analysts who know how to
operate companies by using insights from Big Data."
Big data: The next frontier for innovation, competition, and productivity, McKinsey report
"By 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million
managers and analysts capable of reaping actionable insights from the big data deluge."
Game changers: Five opportunities for US growth and renewal, McKinsey report
25
15
4/4/2019
Rea d books on
- Sta tistics
- Ma chi ne Learning
- Progra mming
- Da ta bases
Ta ke University courses
27
What is Python?
• Programming language
16
4/4/2019
Why python?
28
Python - Applications
Python i s a powerful multi-paradigm computer programming language. With Python, we ca n do ma ny things.
Bel ow a re some of the things that ca n be achieved using Python.
Systems Programming: Python’s built-in i nterfaces to operating-system servi ces make it i deal for writing
porta ble, maintainable system-administration tools a nd utilities (sometimes called s hell tools). Python
progra ms ca n search files a nd directory trees, etc.
GUIs: Python’s simplicity and ra pid turnaround makes i t a good match for graphical user
i nterface programming on the desktop. Python comes with a standard object-oriented i nterface to the Tk GUI
API ca l led tkinter (Tkinter i n 2.X) tha t allows Python programs to i mplement portable GUIs with a native l ook
a nd feel.
Internet Scripting: Python comes wi th standard Internet modules that a llow Python programs to perform a
wi de variety of networking tasks in cl ient and s erver modes.
Database Programming: For tra di tional database demands, there a re Python i nterfaces to all commonly used
rel ational database s ystems like Sybase, Oracle, Informix, ODBC, MySQL, Pos tgreSQL, SQLite, and more.
28
17
4/4/2019
Python - Applications
Rapid Prototyping: To Python programmers the components wri tten in Python and C l ook the same. Because
of thi s , it’s possible to prototype systems i n Python initially, a nd then move s elected components to a compiled
l a nguage s uch as C or C++ for delivery.
Numeric and Scientific Programming: Python i s also heavily used i n numeric programming, a domain that would
not tra ditionally have been considered to be i n the scope of scripting l anguages, but has grown to become one of
Python’s most compelling use cases.
28
A s cri pting l anguage or s cript language is a programming l anguage that supports scripts, programs
wri tten for a special run-time environment that can i nterpret (rather than compile) a nd a utomate the
execution of tasks that could a lternatively be executed one-by-one by a human operator. Python comes
under i n this category. So i t is called a scripting language.
Stil l, the term ‘scripting’ seems to have s tuck to Python like glue. This may be because, people often use
the word ‘script’ i nstead of ‘program’ to describe a Python code file.
28
18
4/4/2019
ANACONDA:
Free enterprise-ready cross platform Python distribution for large- scale data processing, predictive analytics,
and scientific computing.
Features
https://www.continuum.io/why-anaconda
19
4/4/2019
Why ANACONDA?
Why ANACONDA?
20
4/4/2019
21
4/4/2019
• type ‘ipython’
4th Ways is Using any IDE like Jupyter, Spider, Pycharm, Canopy,
Rodeo etc.
22
4/4/2019
39
IPython Notebook
One of Python’s most useful features is its i nteractive interpreter.
It a l lows for very fast testing of ideas without the overhead of creating test files as is typi cal i n most programming
l a nguages. However, the i nterpreter s upplied with the standard Python distribution is somewhat limited for extended
i nteractive use.
Ipython:
A comprehensive environment for i nteractive and exploratory computing
23
4/4/2019
IPython Notebook
IPython Notebook
24
4/4/2019
IPython Notebook
The %run magic command allows you to run a ny python script a nd load all of i ts data directly i nto the interactive
na mespace. Since the file is re -read from disk each time, changes you make to i t are reflected i mmediately (unlike
i mported modules, which have to be specifically reloaded). IPython also i ncludes dreload, a recursive reload function.
%run ha s special flags for timing the execution of your scripts (-t), or for running them under the control of either
Python’s pdb debugger (-d) or profiler (-p).
The %edit command gives a reasonable approximation of multiline editing, by i nvoking your favorite editor on the s pot.
IPython wi ll execute the code you type in there as if it were typed interactively.
Ma gi c Functions ...
The following examples s how how to call the builtin %timeit ma gic, both in line a nd cell mode:
IPython Notebook
Expl oring your Objects
Typi ng object_name? will print all s orts of details a bout any object, i ncluding docstrings, function definition l ines (for
ca l l a rguments) and constructor details for cl asses.
To get s pecific i nformation on a n object, you ca n use the magic commands % pdoc, %pdef, %psource and %pfile.
Magic Functions:
IPython has a set of predefined magic functions that you ca n ca ll with a command line s tyle syntax.
There a re two kinds of magics, line-oriented a nd cell-oriented.
Li ne magics a re prefixed with the % character a nd work much l ike OS command -line calls: they get as an argument the
res t of the l ine, where arguments are passed without parentheses or quotes.
Cel l magics a re prefixed with a double %%, and they a re functions that get as a n a rgument not only the rest of the line,
but a l so the lines below it i n a separate argument.
You ca n run the s cript.py. You ca n toggle this behavi our by running the %automagic magic.
A more detailed explanation of the magic s ystem can be obtained by ca lling %magic,
To s ee all the available magic functions, call %lsmagic
25
4/4/2019
IPython Notebook
System Shell Commands:
To run a ny command a t the system s hell, simply prefix i t with !. You ca n ca pture the output i nto a Python l ist. To pass the
va l ues of Python va riables or expressions to system commands, prefix them with $.
System Aliases:
It’s convenient to have aliases to the system commands you use most often.
Thi s allows you to work seamlessly from i nside IPython with the same commands you a re used to i n your system s hell.
IPython comes with s ome pre-defined aliases and a complete system for changing directories, both vi a a stack (%pushd,
%popd and %dhist) a nd vi a direct %cd.
The l atter keeps a history of vi sited directories and allows you to go to any previously vi sited one.
IPython Notebook
History
IPython s tores both the commands you enter, and the results i t produces. You can easily go through previous
comma nds with the up- a nd down-arrow keys, or access your history i n more sophisticated ways.
Input a nd output history a re kept i n va riables called In a nd Out, keyed by the prompt numbers. The last three objects i n
output history a re also kept i n va riables named _, __ a nd ___.
You ca n use the %history ma gic function to examine past input a nd output. Input history from previous sessions is saved
i n a database, and IPython ca n be configured to save output history.
Several other magic functions can use your i nput history, i ncluding %edit, %rerun, %recall, %macro, %save and
%pa stebin.
Thi s will ta ke line 3 a nd lines 18 to 20 from the current s ession, a nd l ines 1-5 from the previous s ession.
26
4/4/2019
IPython Notebook
Debugging
After a n exception occurs, you can ca ll %debug to jump i nto the Python debugger (pdb) a nd examine the problem.
Al ternatively, i f you call %pdb, IPython will a utomatically start the debugger on any uncaught exception.
You ca n print variables, see code, execute s tatements a nd even walk up a nd down the call stack to tra ck down the true
s ource of the problem. This can be an efficient way to develop and debug code, in many cases eliminating the need for
pri nt s tatements or external debugging tools.
You ca n a lso step through a program from the beginning by ca lling %run -d theprogram.py.
.
Introduction to Jupyter
39
27
4/4/2019
Jupyter Notebook
Jupyter
28
4/4/2019
Jupyter
Jupyter
29
4/4/2019
Jupyter
Jupyter
30
4/4/2019
Jupyter
Jupyter
31
4/4/2019
Jupyter
32
4/4/2019
Jupyter
Exercise
1. Launch new Jupyter notebook (ipython) and save the code
2. Practice Assignment statements (create basic variables and perform mathematical operations)
3. Create markdown file with Some descriptions
4. Practice Short cuts (using Jupyter cheat sheet) like
1. run the code,
2. i ns ert/delete the cell,
3. a dd/remove the output & l i ne numbers,
4. Merge/split cells
5. Toggl e between different types of code formats, comments
6. etc
33
4/4/2019
Contact us
Vi s it us on: http://www.analytixlabs.in/
Joi n us on:
Twi tter - http://twitter.com/#!/AnalytixLabs
Fa cebook - http://www.facebook.com/analytixlabs
Li nkedIn - http://www.linkedin.com/in/analytixlabs
Bl og - http://www.analytixlabs.co.in/category/blog/
34