Introduction: Python Programming for Data Science


INTRODUCTION

The programming requirements of data science demand a versatile yet flexible language that is simple to write code in but can handle highly complex mathematical processing. Python is well suited to these requirements, as it has already established itself as a language for both general-purpose and scientific computing. Moreover, it is continuously upgraded through new additions to its plethora of libraries aimed at different programming requirements. The following are features of Python that make it the preferred language for data science.
i) It is a simple and easy-to-learn language that achieves results in fewer lines of code than other similar languages such as R. Its simplicity also makes it robust for handling complex scenarios with minimal code and far less confusion about the general flow of the program.
ii) It is cross-platform, so the same code works in multiple environments without needing any change, which makes it perfect for a multi-environment setup.
iii) It executes faster than other similar languages used for data analysis like R and MATLAB.
iv) Its excellent memory management, especially garbage collection, makes it adept at gracefully managing very large volumes of data during transformation, slicing, dicing and visualization.
v) Python has a very large collection of libraries that serve as special-purpose analysis tools. For example, the NumPy package deals with scientific computing, and its array needs much less memory than a conventional Python list for managing numeric data. The number of such packages is also continuously growing.
vi) Python has packages that can directly use code from other languages such as Java or C. This helps in optimizing code performance by reusing existing code from other languages whenever it gives a better result.
Python Machine Learning Ecosystem
The Python machine learning ecosystem is a collection of libraries that enable developers to extract and transform data, perform data wrangling operations, apply existing robust Machine Learning algorithms and also develop custom algorithms easily. These libraries include NumPy, SciPy, pandas, scikit-learn, statsmodels, TensorFlow, Keras, etc. The following is a brief description of these libraries:
1. PANDAS: used for data analysis
2. NUMPY: used for numerical computation, i.e. for matrix and vector manipulation
3. MATPLOTLIB: used for data visualization
4. SCIPY: used for scientific computing
5. SEABORN: used for data visualization
6. TENSORFLOW: used in deep learning
7. SCIKIT-LEARN: used in machine learning, i.e. as a source of many machine learning algorithms and utilities
8. KERAS: used for neural networks and deep learning
Setting Up a Python Environment
The starting step for the journey into the world of Data Science is the setup of the Python environment. You
have two options for setting up the environment:
• Install Python and the necessary libraries individually
• Use a pre-packaged Python distribution that comes with necessary libraries, e.g. Anaconda

Anaconda is a packaged compilation of Python along with a whole suite of libraries, including core libraries that are widely used in Data Science. A major advantage of this distribution is that you don't require an elaborate setup, and it works well on all flavors of operating systems and platforms, especially Windows, which can often cause problems when installing specific Python packages. The Anaconda distribution is widely used across industry Data Science environments. It also comes with a wonderful IDE, Spyder (Scientific Python Development Environment), other useful utilities such as Jupyter notebooks and the IPython console, and an excellent package management tool, conda.



Setting Up Anaconda Python Environment
The first step in setting up your environment is to download the required installation package from the provider of the Anaconda distribution.

Steps
You can follow the steps below to set up a Python environment using Anaconda:
i) Download the required installation package from https://www.anaconda.com/download/. You can choose from Windows, Mac and Linux OS as per your requirement.
ii) Select the Python version you want to install on your machine. You will get options for both the 64-bit and 32-bit graphical installers.
iii) After you select the OS and Python version, the Anaconda installer will be downloaded to your computer. Double-click the file and the installer will install the Anaconda package.

Installing Libraries
In Python, the preferred way to install additional libraries is to use the pip installer. The basic syntax to install a package from the Python Package Index (PyPI) using pip is as follows:

pip install required_package

This will install required_package if it is present in PyPI. You can also use sources other than PyPI to install packages, but that is generally not required. The Anaconda distribution is already supplemented with a plethora of additional libraries, so it is very unlikely that you will need packages from other sources.
Another way to install packages, limited to Anaconda, is to use the conda install command. This will install the packages from the Anaconda package channels and is the recommended approach, especially on Windows.
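For example, the following command installs a package from the default Anaconda channel (numpy is used here purely as an illustrative package name):

conda install numpy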

Components of Python Machine Learning Ecosystem


The following are some of the core Data Science libraries that form the components of the Python Machine Learning ecosystem. These components are some of the reasons why Python is an important language for Data Science. The list discussed here is not exhaustive but is based on the importance of each component in the whole ecosystem.

1. Jupyter Notebook
The Jupyter Notebook, formerly known as the IPython notebook, is an interactive environment for running code in the browser. It is a great tool for exploratory data analysis and is widely used by data scientists.
The following are some of the features of Jupyter notebooks that make them one of the best components of the Python ML ecosystem:
• Jupyter notebooks can illustrate the analysis process step by step by arranging code, images, text, output, etc. in a stepwise manner.
• They help a data scientist document the thought process while developing the analysis.
• The results can be captured as part of the notebook.
• With the help of Jupyter notebooks, it is possible to share your work with your peers.

Installation and Execution


You don't require any additional installation for Jupyter notebooks, as they are already installed with the Anaconda distribution. You just need to go to the Anaconda Prompt and type the following command:

C:\>jupyter notebook

This will start a notebook server at the address localhost:8888 on your machine. Once you invoke this command, you can navigate to localhost:8888 in your browser to find the landing page, depicted in the diagram below, which can be used to access existing notebooks or create new ones.



On the landing page, you can initiate a new notebook by clicking the New button at the top right. By default it will use the default kernel (i.e., the Python 3.x kernel), but you can also associate the notebook with a different kernel (e.g. a Python 2.7 kernel, if installed on your system). Clicking the New button will take you to the new notebook, where you can start working as shown below.

Clicking on Python 3 you get the following screen:



A notebook is just a collection of cells. There are three major types of cells in a notebook:
i) Code cells: These are the cells you use to write your code and associated comments. The contents of these cells are sent to the kernel associated with the notebook, and the computed outputs are displayed as the cells' outputs.
ii) Markdown cells: Markdown can be used to annotate the computation process. These cells can contain simple text comments, HTML tags, images, and even LaTeX equations. They come in very handy when dealing with a new or non-standard algorithm where you also want to capture the stepwise math and logic related to the algorithm.
iii) Raw cells: These are the simplest cells; they display the text written in them as is. They can be used to add text that you don't want to be converted by the conversion mechanism of Jupyter notebooks.

2. NumPy
NumPy is the backbone of Machine Learning in Python. It is one of the most important libraries in Python for numerical computations and is used in almost all Machine Learning and scientific computing libraries. The name stands for Numerical Python, and it provides an efficient way to store and manipulate multidimensional arrays in Python. NumPy can also be seen as a replacement for MATLAB, as it is mostly used together with SciPy (Scientific Python) and Matplotlib (the plotting library).

Installation and Execution


If you are using the Anaconda distribution, there is no need to install NumPy separately as it is already installed with it. You just need to import the package into your Python script as follows:

import numpy as np

On the other hand, if you are using the standard Python distribution, then NumPy can be installed as follows:

pip install numpy

After installing NumPy, you can import it into your Python script as shown above.

Numpy ndarray
The numeric functionality of numpy is orchestrated by two important constituents of the numpy package, ndarray and ufuncs (universal functions).
• ndarray (simply an array or matrix) is a multi-dimensional array object that is the core data container for all numpy operations. An array will mostly hold a single data type (homogeneous) and may be multi-dimensional.
• Universal functions (ufuncs) are functions that operate on ndarrays in an element-by-element fashion, as in the brief sketch below.
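As a minimal sketch of ufuncs at work (the array values are just an illustration), np.sqrt and np.add are applied element by element:

import numpy as np

arr = np.array([1, 4, 9, 16])
print(np.sqrt(arr))       # element-wise square root -> [1. 2. 3. 4.]
print(np.add(arr, 10))    # element-wise addition    -> [11 14 19 26]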



Creating Arrays
Arrays can be created in multiple ways in numpy. A one-dimensional array can be created from a Python list using the np.array() method, as shown below:

In[3]: arr = np.array([1,3,4,5,6])

The shape attribute of the array object returns the size of each dimension in the form (rows, columns), while the size attribute returns the total number of elements in the array:

In [4]: arr.shape

Out[4]: (5,)
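As a quick illustration of the size attribute on the same five-element array:

arr.size        # returns 5, the total number of elements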

Unlike Python lists, NumPy arrays can explicitly be multidimensional. A multidimensional array is created
as shown below:

In[5]: import numpy as np


x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))

Out[5]:
x:
[[1 2 3]
[4 5 6]]

Creating Arrays from Scratch


For large arrays it is more efficient to create arrays from scratch using a number of special functions built into NumPy, as shown in the following examples:

i). np.zeros: Creates a matrix of specified dimensions containing only zeroes:

In[6]: np.zeros(10, dtype=int)


Out[6]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Create a 2x4 two-dimensional array filled with 0s

In [7]: arr = np.zeros((2,4))
   ...: arr

Out[7]: array([[ 0., 0., 0., 0.],
               [ 0., 0., 0., 0.]])

ii). np.ones: Creates a matrix of specified dimension containing only ones:

In [8]: arr = np.ones((2,4))
   ...: arr

Out[8]: array([[ 1., 1., 1., 1.],
               [ 1., 1., 1., 1.]])

iii). np.arange: creates an array filled with a linear sequence; in the example below it starts at 0, steps by 2, and stops before 20. This is similar to the built-in range() function.



In[9]: np.arange(0, 20, 2)
Out[9]: array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])

iv). np.identity: Creates an identity matrix of specified dimensions:

In [10]: arr = np.identity(3)
    ...: arr

Out[10]: array([[ 1., 0., 0.],
                [ 0., 1., 0.],
                [ 0., 0., 1.]])

Alternatively

In[11]: np.eye(3)
Out[11]: array([[ 1., 0., 0.],
                [ 0., 1., 0.],
                [ 0., 0., 1.]])

v). np.random.randn: An array of a specified dimension can be initialized with random values by using the randn function from the numpy.random package:

In [12]: arr = np.random.randn(3,4)
    ...: arr

Out[12]: array([[ 0.0102692 , -0.13489664,  1.03821719, -0.28564286],
                [-1.12651838,  1.41684764,  1.11657566, -0.1909584 ],
                [ 2.20532043,  0.14813109,  0.73521382,  1.1270668 ]])

• To create a 3x3 array of uniformly distributed random values between 0 and 1:

In[13]: np.random.random((3, 3))

Out[13]: array([[ 0.99844933,  0.52183819,  0.22421193],
                [ 0.08007488,  0.45429293,  0.20941444],
                [ 0.14360941,  0.96910973,  0.946117  ]])

• To create a 3x3 array of normally distributed random values with mean 0 and standard deviation 1:

In[14]: np.random.normal(0, 1, (3, 3))

Out[14]: array([[ 1.51772646,  0.39614948, -0.10634696],
                [ 0.25671348,  0.00732722,  0.37783601],
                [ 0.68446945,  0.15926039, -0.70744073]])
• To create a 3x3 array of random integers in the interval [0, 10):

In[15]: np.random.randint(0, 10, (3, 3))

Out[15]: array([[2, 3, 4],
                [5, 7, 8],
                [0, 5, 0]])



3. Pandas
Pandas is a Python library for data wrangling and analysis. It is built around a data structure called the DataFrame, which is modeled after the R data frame, i.e. it is similar to an Excel spreadsheet. Pandas provides a great range of methods to modify and operate on a table; in particular, it allows SQL-like queries and joins of tables. In contrast to NumPy, which requires that all entries in an array be of the same type, pandas allows each column to have a separate type. Another valuable tool provided by pandas is its ability to ingest data from a great variety of file formats and databases, such as SQL databases, Excel files, and comma-separated values (CSV) files.
The following is an example of creating a DataFrame using a dictionary:

In[1]:
import pandas as pd
from IPython.display import display  # allows "pretty printing" of dataframes in the Jupyter notebook

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}

data_pandas = pd.DataFrame(data)
display(data_pandas)

This produces the following output:
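As a rough plain-text sketch of what the displayed dataframe looks like (the notebook renders it as a formatted table):

    Name  Location  Age
0   John  New York   24
1   Anna     Paris   13
2  Peter    Berlin   53
3  Linda    London   33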

Notice that the keys of the dictionary are picked up as the column names of the dataframe, and since no index was supplied, the default integer index of normal arrays is used.

There are several possible ways to query the table. For example, the following selects the rows whose Age column is greater than 30:

In[2]:display(data_pandas[data_pandas.Age > 30])

This produces the following result:
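Again as a rough plain-text sketch of the result:

    Name Location  Age
2  Peter   Berlin   53
3  Linda   London   33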

Data Processing in Pandas


Using Pandas, you can process data using the following five steps:
i) Load
ii) Prepare
iii) Manipulate

iv) Model
v) Analyze

Data Structures of Pandas


All the data representation in pandas is done using two primary data structures:
i) Series
ii) Dataframes

i) Series
A Series in pandas is a one-dimensional ndarray with axis labels, i.e. its functionality is similar to that of a simple array. The values in a Series have an index whose labels need to be hashable. This requirement matters when we perform manipulation and summarization on the data contained in a Series, as in the brief sketch below.
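A minimal sketch of creating a Series with a custom index (the values and labels are just for illustration):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])       # access by label -> 20
print(s.mean())     # simple summarization -> 20.0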

ii) Dataframe
The dataframe is the most important and useful data structure, used for almost all kinds of data representation and manipulation in pandas. Dataframes are extremely useful for representing raw datasets as well as processed feature sets in Machine Learning and Data Science. All the operations can be performed along the axes, rows, and columns, of a dataframe.

Data Retrieval
Pandas provides numerous ways to retrieve and read in data. You can convert data from CSV files, databases, flat files, etc. into dataframes. You can also convert a list of dictionaries (Python dicts) into a dataframe.
The following are the most important data sources:

i) List of Dictionaries to Dataframe


This is one of the simplest methods of creating a dataframe. It is useful in scenarios where you arrive at the data you want to analyze after performing some computations and manipulations on the raw data. It allows the integration of a pandas-based analysis into data being generated by other Python processing pipelines, as in the brief sketch below.
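A minimal sketch (the records are just for illustration):

import pandas as pd

records = [{'city': 'Nairobi', 'population': 4397073},
           {'city': 'Mombasa', 'population': 1208333}]
df = pd.DataFrame(records)   # each dict becomes a row, keys become the columns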

ii) CSV Files to Dataframe


This is perhaps one of the most widely used ways of creating a dataframe. You can easily read a CSV, or any delimited file (such as a TSV), and convert it into a dataframe using pandas. The following is a sample slice of a CSV file containing data on the cities of the world, from http://simplemaps.com/data/world-cities.

The data is obtained using the following code

In [3]: city_data = pd.read_csv(filepath_or_buffer='simplemaps-worldcities-basic.csv')
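A quick way to inspect what was loaded (a sketch; the exact columns depend on the downloaded file):

city_data.head()     # first five rows of the dataframe
city_data.shape      # (number of rows, number of columns)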

iii) Databases to Dataframe
The most important data sources for data scientists are the existing data sources used by their organizations. Relational databases (DBs) and data warehouses are the de facto standard of data storage in almost all organizations. Pandas provides capabilities to connect to these databases directly, execute queries on them to extract data, and then convert the result of the query into a structured dataframe. The pandas.read_sql function, combined with Python's powerful database libraries, means that the task of getting data from DBs is simple and easy.

Example:
The following code can be used to read data from a Microsoft SQL Server database (using the pymssql library).

import pymssql
import pandas as pd

server = 'xxxxxxxx'    # Address of the database server
user = 'xxxxxx'        # Username for the database server
password = 'xxxxx'     # Password for the user
database = 'xxxxx'     # Database in which the table is present

conn = pymssql.connect(server=server, user=user, password=password,
                       database=database)
query = "select * from some_table"
df = pd.read_sql(query, conn)

conn is a connection object that identifies the database server information and the type of database for pandas.

4. Matplotlib
matplotlib is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, and scatter plots. Visualizing your data and different aspects of your analysis can give you important insights.
When working inside the Jupyter Notebook, you can show figures directly in the browser by using the %matplotlib notebook and %matplotlib inline magic commands.

Example:
The following code produces a line plot of the sine function:

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# or you can use "from matplotlib import pyplot as plt"

# Generate a sequence of numbers from -10 to 10 with 100 steps in between
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")



5. SciPy
SciPy is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. scikit-learn draws from SciPy's collection of functions for implementing its algorithms. One of the most important parts of SciPy is scipy.sparse, which provides sparse matrices, another representation used for data in scikit-learn. Sparse matrices are used whenever we want to store a 2D array that contains mostly zeros.

In[1]:
import numpy as np
from scipy import sparse

# A 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))

It produces the following output

Out[1]:
NumPy array:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]

In[2]:
# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n{}".format(sparse_matrix))

It produces the following output

Out[2]:
SciPy sparse CSR matrix:
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
(3, 3) 1.0

6. Scikit-learn
Scikit-learn is one of the most important and indispensable Python frameworks for Data Science and Machine Learning. It is built on top of the NumPy and SciPy scientific Python libraries and implements a wide range of Machine Learning algorithms covering major areas of Machine Learning such as classification, clustering, and regression. All the mainstream Machine Learning algorithms, such as support vector machines, logistic regression, random forests, K-means clustering, and hierarchical clustering, are implemented efficiently in this library. It arguably forms the foundation of applied and practical Machine Learning. Besides this, its easy-to-use API and code design patterns have been widely adopted across other frameworks.

Installation and Execution


If you are using the Anaconda distribution, there is no need to install Scikit-learn separately as it is already installed with it. You just need to import the package into your Python script. For example, the following statement imports a dataset of breast cancer patients from Scikit-learn:



from sklearn.datasets import load_breast_cancer
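Continuing from that import, a quick sketch of loading the dataset and inspecting it (the attributes shown follow scikit-learn's dataset Bunch convention):

cancer = load_breast_cancer()
print(cancer.data.shape)       # (569, 30): 569 samples, 30 features
print(cancer.target_names)     # ['malignant' 'benign']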

Core APIs
Scikit-learn is built on a small and simple set of core API ideas and design patterns. The following are brief descriptions of the core APIs on which the central operations of scikit-learn are based.

i) Dataset representation:
The data representation of most Machine Learning tasks is quite similar. Very often we have a collection of data points represented by data point vectors. A data point vector contains multiple independent variables (or features) and one or more dependent variables (response variables). For example, a linear regression problem can be represented as [(X1, X2, X3, X4, ..., Xn), (Y)], where the independent variables (features) are represented by the Xs and the dependent variable (response variable) is represented by Y.
The idea is to predict Y by fitting a model on the features. This data representation resembles a matrix (considering multiple data point vectors), and a natural way to depict it is by using numpy arrays, as in the brief sketch below.
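A minimal sketch of this representation (the numbers are just an illustration):

import numpy as np

# Feature matrix X: each row is one data point, each column one feature
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0]])
# Response vector y: one value per data point
y = np.array([3.0, 3.0, 7.0])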

ii) Estimators: The estimator interface is one of the most important components of the scikit-learn library. All the Machine Learning algorithms in the package implement the estimator interface. Learning is handled as a two-step process. The first step is the initialization of the estimator object; this involves selecting the appropriate class object for the algorithm and supplying its parameters or hyperparameters. The second step is applying the fit function to the supplied data (feature set and response variables). The fit function learns the output parameters of the Machine Learning algorithm and exposes them as public attributes of the object for easy inspection of the final model. The data for the fit function is generally supplied as an input-output matrix pair, as in the brief sketch below.
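A minimal sketch of the two-step estimator workflow, using scikit-learn's LinearRegression on the toy X and y above (the hyperparameter shown is just an example):

from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)   # step 1: initialize with hyperparameters
model.fit(X, y)                                # step 2: fit to the feature matrix and responses
print(model.coef_, model.intercept_)           # learned parameters exposed as public attributes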

iii) Predictors: The predictor interface is implemented to generate predictions, forecasts, etc. using a learned estimator on unseen data. For example, in a supervised learning problem, the predictor interface will provide predicted classes for the unknown test array supplied to it. A requirement of a predictor implementation is to provide a score function; this function provides a scalar value for the test input supplied to it, which quantifies the effectiveness of the model. Such values can later be used for tuning the Machine Learning model, as in the brief sketch below.
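Continuing the sketch above, the predictor interface on a new (arbitrary) data point:

X_new = np.array([[4.0, 5.0]])
print(model.predict(X_new))     # predicted response for the unseen data point
print(model.score(X, y))        # scalar score (R^2 for regression) quantifying model fit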

iv) Transformers: Transformation of the input data before a model is learned is a very common task in Machine Learning. Some data transformations are simple, such as replacing missing data with a constant or taking a log transform, while others are similar to learning algorithms themselves (for example, PCA). To simplify such transformations, some estimator objects implement the transformer interface. This interface allows you to perform a non-trivial transformation on the input data and supply the output to the actual learning algorithm, as in the brief sketch below.
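A minimal sketch of the transformer interface using StandardScaler, a common preprocessing estimator, applied to the toy X above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # learn the mean/std from X, then transform X
print(X_scaled)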

