UNIT 3
Data
Working with Real Data
• Data science applications require data by definition. It would be nice if you could
simply go to a data store somewhere, purchase the data you need in an easy-
open package, and then write an application to access that data. However, data
is messy. It appears in all sorts of places, in many different forms. Every
organization has a different method of viewing data and stores it in a different
manner as well. Even when the data management system used by one company
is the same as the data management system used by another company, the
chances are slim that the data will appear in the same format or even use the
same data types. In short, before you can do any data science work, you must
discover how to access the data in all its myriad forms. Real data requires a lot of
work to use; fortunately, Python is up to the task of manipulating it as needed.
• Now we will see the techniques required to access data in a number of forms
and locations. For example, memory streams represent a form of data storage
that your computer supports natively; flat files exist on your hard drive; relational
databases commonly appear on networks (although smaller relational databases,
such as those found in Access, could appear on your hard drive as well).
• The Scikit-learn library includes a number of toy datasets (small datasets meant
for you to play with). These datasets are complex enough to perform a number of
tasks, such as experimenting with Python to perform data science tasks.
Uploading, Streaming, and Sampling Data
• Storing data in local computer memory represents the fastest and
most reliable means to access it. The data could reside anywhere.
However, you don’t actually interact with the data in its storage
location. You load the data into memory from the storage location
and then interact with it in memory.
• This is the technique which we will use to access all the toy datasets
found in the Scikit-learn library, so you see this technique used
relatively often in the book.
• Data scientists call the columns in a database features or variables.
The rows are cases. Each row represents a collection of variables that
you can analyze.
File handling in Python
A file is a named location on disk used to store related information.
It stores data permanently in non-volatile memory.
When we need to read/write data from/to a file, we need to open it first,
and after completing the operation on the file, we need to close it so that
the resources tied to the file are freed.
Python file operations take place in the following order:
1. Open a file
2. Read/Write (perform the action)
3. Close the file
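A minimal sketch of the three steps (the file name demo.txt is illustrative; the file is created first so the example is self-contained):

```python
# 1. Open a file (here created fresh in write mode)
f = open("demo.txt", "w")
# 2. Perform the action (write)
f.write("hello file handling\n")
# 3. Close the file so its resources are freed
f.close()

# The same three steps again, this time for reading:
f = open("demo.txt", "r")
print(f.read())
f.close()
```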
Opening a file
Before we can read/write a file, we have to open it using Python's built-in
function open(). This function returns a file object, also called a handle, as
it is used to read/modify the file accordingly.
We can specify a mode while opening a file. In the mode we specify whether
we want to open the file for read ('r'), write ('w'), or append ('a').
We can also specify whether to open a file in text or binary format. Reading
defaults to text mode; in this mode we get strings when we read from the
file.
On the other hand, binary mode returns bytes, and this is the mode to use
when dealing with non-text files like image/exe files.
Opening a file
Syntax: fileobject = open(filename [, accessmode])
Example: f = open('C:/python33/abc.txt', 'r')
Example: f1 = open('1.txt', 'w')
filename: a string value that contains the name of the file
that we want to access.
accessmode: this is an optional field. The default file access mode is 'r' (read).
The access mode determines the mode in which you want to read/write,
which can be 'r', 'rb', 'r+', 'rb+', 'w', 'wb', 'w+', 'wb+', 'a', 'ab', 'a+', 'ab+', 'x'
Mode of Opening a file
'r' – Opens a file for reading only. The file pointer is placed at the beginning
of the file. This is the default mode.
'rb' – Opens a file for reading only in binary format. The file pointer is
placed at the beginning of the file.
'r+' – Opens a file for both reading and writing. The file pointer is placed at
the beginning of the file.
'rb+' – Opens a file for both reading and writing in binary format. The file
pointer is placed at the beginning of the file.
'w' – Opens a file for writing only. Overwrites the file if it exists. If the
file does not exist, creates a new file for writing.
'wb' – Opens a file for writing only in binary format. Overwrites the file if
it exists. If the file does not exist, creates a new file for writing.
Mode of Opening a file
'w+' – Opens a file for both reading and writing. Overwrites the file if it
exists. If the file does not exist, creates a new file for reading and writing.
'wb+' – Opens a file for both reading and writing in binary format.
Overwrites the file if it exists. If the file does not exist, creates a new
file for reading and writing.
'a' – Opens a file for appending. The file pointer is at the end of the file if
it exists. If the file does not exist, creates a new file for writing.
'ab' – Opens a file for appending in binary format. The file pointer is at the
end of the file if it exists. If the file does not exist, creates a new file
for writing.
'a+' – Opens a file for appending and reading. The file pointer is at the
end of the file if it exists. If the file does not exist, creates a new file
for reading and writing.
Mode of Opening a file
'ab+' – Opens a file for appending and reading in binary format. The file
pointer is at the end of the file if it exists. If the file does not exist,
creates a new file for reading and writing.
'x' – Opens the file for exclusive creation. If the file already exists, the
operation fails.
Closing A file
When we have finished operations on a file, we need to close it properly.
Python has a garbage collector to clean up unreferenced objects, but we
must not rely on it to close the file. Closing a file frees up the resources
that were tied to it and is done using the close() method.
Syntax: fileObject.close()
Example:
# Open a file
fo = open("f2.txt", "r+")
str = fo.read()
print("Read String is : ", str)
fo.close()
Example:
# Open a file
fo = open("f2.txt", "r")
for i in fo:
    print(i)
O/P:
Python is a great language.
Yeah its great!!
The with statement
The 'with' statement can be used while opening a file. The advantage of the
'with' statement is that it takes care of closing the file it opened, so
there is no need to close the file explicitly.
In case of an exception, the with statement closes the file before the
exception is handled.
Syntax: with open('filename', 'mode') as fileobject:
Example:
with open('f7.txt', 'w') as f1:
    f1.write('I am learner\n')
    f1.write('python is attractive\n')
with open('f7.txt', 'r') as f2:
    for line in f2:
        print(line)
O/P:
I am learner
python is attractive
UPLOADING SMALL AMOUNTS OF DATA INTO MEMORY
• The most convenient method that you can use to work with data is to
load it directly into memory. This technique shows up a couple of times
earlier in the book but uses the toy datasets from the Scikit-learn
library. This section uses the Colors.txt file, shown in Figure 6-1, for
input.
• The example also relies on native Python functionality to get the task
done. When you load a file (of any type), the entire dataset is available
at all times and the loading process is quite short. Here is an example of
how this technique works.
• As the code performs data reads in the for loop, the file pointer moves to the next
record. Each record appears one at a time in observation. The code outputs the
value in observation using a print statement, so you should see each record
from the file printed as output.
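The example code itself is not reproduced in these notes; the following is a minimal sketch consistent with the description (a small stand-in Colors.txt is written first so the sketch runs on its own):

```python
# Write a small stand-in for the Colors.txt file (tab-separated records).
with open("Colors.txt", "w") as f:
    f.write("Color\tValue\n")
    f.write("Red\t1\n")
    f.write("Orange\t2\n")

# Load and stream the file: each loop iteration advances the file
# pointer to the next record, which appears in observation.
with open("Colors.txt", "r") as open_file:
    for observation in open_file:
        print("Reading Data: " + observation, end="")
```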
Sampling
• Many times the dataset we are dealing with can be too large to be
handled in Python. A workaround is to take random samples out of
the dataset and work on them.
• There are situations where sampling is appropriate, as it gives a near
representation of the underlying population.
• Sampling in Python
• We need to use sample() function
import pandas as pd
Online_Retail=pd.read_csv("datasets\\Online Retail Sales Data\\Online
Retail.csv")
Online_Retail.shape
O/P:
(541909, 8)
• Using the sample() function
sample_data=Online_Retail.sample(n=1000, replace=False)
sample_data.shape
O/P:
(1000, 8)
• Using the .sample() function on our dataset, we have taken a random
sample of 1000 rows out of the total 541909 rows of the full data.
SAMPLING DATA IN DIFFERENT WAYS
• Data streaming obtains all the records from a data source, while
sampling accesses only specific records, like every fifth record or
random samples.
• You can save time and resources by simply sampling the data if you
don't want all the records. The following code shows how to retrieve every
other record in the Colors.txt file:
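The code is not reproduced in these notes; here is a sketch of the idea, with n = 2 to keep every other record (the stand-in Colors.txt is written first so the sketch is self-contained):

```python
n = 2  # keep every other record

# Stand-in data so the sketch runs on its own.
with open("Colors.txt", "w") as f:
    f.write("Color\tValue\n")
    for j, color in enumerate(["Red", "Orange", "Yellow", "Green"]):
        f.write(color + "\t" + str(j + 1) + "\n")

# Print only those records whose position is a multiple of n.
with open("Colors.txt", "r") as open_file:
    for j, observation in enumerate(open_file):
        if j % n == 0:
            print("Reading Line: " + str(j) + " " + observation, end="")
```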
Load an image file from hard disk using Matplotlib:
1. imread() method from the image module of matplotlib to load the image
2. imshow() method from the pyplot module of matplotlib to load the
image into a plot
3. show() method of the pyplot module to show the image on the plot
from matplotlib.image import imread
from matplotlib.pyplot import *
%matplotlib inline
i = imread('imagename.ext')
imshow(i)
show()
Accessing Data in Structured Flat-File Form
• A flat file presents the easiest kind of file to work with. The data appears as a
simple list of entries that you can read one at a time. Example: text files, CSV Files
and Excel Files
• A problem with using native Python techniques is that the input isn’t
intelligent. For example, when a file contains a header, Python simply reads it
as yet more data to process, rather than as a header. You can’t easily select a
particular column of data.
• The pandas library used in the sections that follow makes it much easier to read
and understand flat-file data. Classes and methods in the pandas library interpret
(parse) the flat-file data to make it easier to manipulate.
• Flat-file forms: text files, CSV files, Excel files
• The simplest and easiest form of flat file is the text file, but it has a
limitation: a text file treats all data as strings, so you often have to
convert numeric data into other forms. A comma-separated value (CSV) file
provides more formatting and more information, but it requires a little more
effort to read. At the high end of flat-file formatting are custom data
formats, such as an Excel file, which contains extensive formatting and
could include multiple datasets in a single file.
READING FROM A TEXT FILE
• Text files can use a variety of storage formats.
However, a common format is to have a header
line that documents the purpose of each field,
followed by another line for each record in the
file. The file separates the fields using tabs. Refer
to Figure for an example of the Colors.txt file used
for the example in this section.
• To read file data, apart from native Python you can
use pandas. In pandas you find a set of parsers: code
used to read individual bits of data and determine
the purpose of each bit according to the format of
the entire file.
import pandas as pd
color_table = pd.io.parsers.read_table("Colors.txt")
print(color_table)
READING FROM A TEXT FILE
• The code imports the pandas library, uses the read_table() method to
read Colors.txt into a variable named color_table, and then displays
the resulting in-memory data onscreen using the print function.
• Notice that the parser correctly interprets the first row as consisting
of field names. It numbers the records from 0 through 7. You can control
the parsing using read_table() method arguments.
Reading CSV delimited format
• A CSV file provides more formatting than a simple text file. In fact,
CSV files can become quite complicated.
• There is a standard that defines the format of CSV files as below:
1. A header defines each of the fields
2. Fields are separated by commas
3. Records are separated by linefeeds
4. Strings are enclosed in double quotes
5. Integers and real numbers appear without double quotes
Reading CSV delimited format
• The Titanic.csv file is used for this example. You can see the raw format
using any text editor. (Sibsp: siblings + spouse; Parch: parents + children.)
Applications such as Excel can import and format CSV files so that they
become easier to read. Figure 6-4 shows the same file in Excel.
Reading CSV delimited format
• Excel actually recognizes the header as a header. If you were to use
features such as data sorting, you could select header columns to
obtain the desired result. Fortunately, pandas also makes it possible
to work with the CSV file as formatted data, as shown in the following
example:

import pandas as pd
titanic = pd.io.parsers.read_csv("Titanic.csv")
# print(titanic) prints the whole table
X = titanic[['age']]
print(X)

O/P:
          age
0     29.0000
1      0.9167
2      2.0000
3     30.0000
4     25.0000
5     48.0000
...
1304  14.5000
1305  9999.0000
1306  26.5000
1307  27.0000
1308  29.0000
[1309 rows x 1 columns]

To create the output as a list, you simply change the line of code to read
X = titanic[['age']].values. Notice the addition of the values property. The
output changes to something like this (some values omitted for the sake of
space): [[29. ] [ 0.91670001] [ 2. ] ... [26.5 ] [27. ] [29. ]]
READING EXCEL AND OTHER MICROSOFT OFFICE FILES
• Excel and other Microsoft Office applications provide highly
formatted content. You can specify every aspect of the information
these files contain. The Values.xls file used for this example provides a
listing of sine, cosine, and tangent values for a random list of angles.
• an Excel file can contain more
than one worksheet, so you need
to tell pandas which worksheet to
process. In fact, you can choose to
process multiple worksheets, if
desired. When working with other
Office products, you have to be
specific about what to process.
• Just telling pandas to process
something isn’t good enough.
Here’s an example of working
with the Values.xls file.
READING EXCEL AND OTHER MICROSOFT OFFICE FILES
import pandas as pd
xls = pd.ExcelFile("Values.xls")
trig_values = xls.parse('Sheet1', index_col=None,na_values=['NA'])
print(trig_values)
Summary: To upload/stream/sample a file

1. Using native Python file handling (text files):
   • Uploading: using the open() function and the read() method
     with open('filename', 'r') as f:
         print(f.read())
   • Streaming: using the open() function and a loop over the file object
     with open('filename', 'r') as f:
         for i in f:
             print(i)
   • Sampling: using the open() function and a for loop with a condition
     (note that the loop variable is a line of text, so a counter from
     enumerate() is used for the modulo test)
     with open('filename', 'r') as f:
         for j, i in enumerate(f):
             if j % 2 == 0:
                 print(i)
2. Using pandas (text files): uploading with the read_table() method of
   pandas.io.parsers
     import pandas as pd
     t = pd.io.parsers.read_table('color.txt')
     print(t)
Summary: To upload/stream/sample a file

Using pandas (CSV files): uploading with the read_csv() method of
pandas.io.parsers
     import pandas as pd
     c = pd.io.parsers.read_csv('Titanic.csv')
     print(c)
     print(c[['age']])  # for column access

Using pandas (Excel files): uploading with an ExcelFile class object and
the parse() method
     import pandas as pd
     ex = pd.ExcelFile("Values.xls")
     trig_values = ex.parse('Sheet1', index_col=None)
     print(trig_values)
Sending Data in Unstructured File Form
• Unstructured data files consist of a series of bits. The file doesn’t separate
the bits from each other in any way.
• You can’t simply look into the file and see any structure because there
isn’t any to see. Unstructured file formats rely on the file user to know
how to interpret the data. For example, each pixel of a picture file could
consist of three 32-bit fields. Knowing that each field is 32-bits is up to
you. A header at the beginning of the file may provide clues about
interpreting the file, but even so, it’s up to you to know how to interact
with the file.
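The 32-bit-fields idea can be made concrete with a small sketch (the pixels.bin file and its three-field layout are invented for illustration):

```python
import struct

# Hypothetical unstructured format: each "pixel" is three 32-bit
# unsigned integers (R, G, B) packed back to back, with no header.
pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]

with open("pixels.bin", "wb") as f:
    for r, g, b in pixels:
        f.write(struct.pack("<III", r, g, b))  # 12 bytes per pixel

# Nothing in the file marks where one pixel ends and the next begins;
# the reader must already know that every 12 bytes form one pixel.
with open("pixels.bin", "rb") as f:
    raw = f.read()

decoded = [struct.unpack_from("<III", raw, offset)
           for offset in range(0, len(raw), 12)]
print(decoded)
```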
• The example in this section shows how to work with a picture as an
unstructured file. The example image is a public domain offering from
http://commons.wikimedia.org/wiki/Main_Page. To work with images, you
need to access the Scikit-image library (http://scikit-image.org/)
• For installation of the library, give the install command at the Anaconda
prompt.
example_file = ("http://upload.wikimedia.org/wikipedia/commons/7/7d/Dog_face.png")
plt.imshow(image3, cmap=cm.gray)
plt.show()
OUTPUT
data type: <class 'numpy.ndarray'>, shape: (900,)
Managing Data from Relational Databases
• The vast majority of data used by organizations relies on relational
databases, because these databases provide the means for organizing
massive amounts of complex data in a manner that makes the data easy to
manipulate. The goal of a database manager is to make data easy to
manipulate; the focus of most data storage is to make data easy to
retrieve.
• The one common denominator between many relational databases is
that they all rely on a form of the same language to perform data
manipulation, which makes the data scientist's job easier. The
Structured Query Language (SQL) lets you perform all sorts of
management tasks in a relational database, retrieve data as needed,
and even shape it in a particular way so that additional shaping is
unnecessary.
• Creating a connection to a database: step 1: gain access to database
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
• The output of a read method is always a DataFrame object that contains the
requested data. To write data, you must create a DataFrame object or use an
existing DataFrame object. You normally use these methods to perform most
tasks:
• read_sql_table(): Reads data from a SQL table to a DataFrame object
• read_sql_query(): Reads data from a database using a SQL query to a
DataFrame object
• read_sql(): Reads data from either a SQL table or query to a DataFrame object
• DataFrame.to_sql(): Writes the content of a DataFrame object to the specified
tables in the database
• The sqlalchemy library provides support for a broad range of SQL databases. The
following list contains just a few of them:
• SQLite
• MySQL
• PostgreSQL
• SQL Server
• ODBC (Open Database Connectivity)
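A round-trip sketch of these read/write methods, using Python's built-in sqlite3 connection in place of an SQLAlchemy engine (pandas accepts either for SQLite; the students table and its values are made up):

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for a real relational database.
con = sqlite3.connect(":memory:")

# DataFrame.to_sql(): write a DataFrame out as a table.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})
df.to_sql("students", con, index=False)

# read_sql(): read the result of a SQL query back into a DataFrame.
back = pd.read_sql("SELECT * FROM students WHERE score > 86", con)
print(back)
```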
Interacting with Data from NoSQL Databases
• NoSQL is an approach to database design that can accommodate a wide
variety of data models, including key-value, document, columnar and graph
formats. NoSQL, which stands for "not only SQL," is an alternative to
traditional relational databases in which data is placed in tables and data
schema is carefully designed before the database is built. NoSQL databases
are especially useful for working with large sets of distributed data.
• These Not only SQL (NoSQL) databases are used in large data storage
scenarios in which the relational model can become overly complex or can
break down in other ways.
• Of course, you find fewer of these DBMSes used in the corporate
environment because they require special handling and training. Still, some
common DBMSes are used because they provide special functionality or
meet unique requirements. The process is essentially the same for using
NoSQL databases as it is for relational databases:
• 1. Import required database engine functionality.
• 2. Create a database engine.
• 3. Make any required queries using the database engine and the
functionality supported by the DBMS.
Interacting with Data from NoSQL Databases
• Working with MongoDB
• Use the PyMongo library (https://api.mongodb.org/python/current/) and
the MongoClient class to create the required engine.
• The MongoDB engine relies heavily on the find() function to locate
data. The following is a pseudo-code example of a MongoDB session:
import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
input_data = db.collection_name
data = pd.DataFrame(list(input_data.find()))
Accessing Data from the Web
• A Web Service is a service offered by an application to
another application, communicating with each other via
the World Wide Web.
• A web service is a kind of web application that provides
a means to ask questions and receive answers. Web
services usually host a number of input types. In fact, a
particular web service may host entire groups of query
inputs.
• Web services allow applications developed in different
technologies to communicate with each other through
a common format like XML, JSON, etc. Web services are
not tied to any one operating system or programming
language. For example, an application developed in Java
can communicate with one developed in C#,
Android, etc., and vice versa.
Accessing Data from the Web
• Working with web services
means working with XML (in
most cases).
• With this in mind, the
example in this section
works with XML data found
in the XMLData.xml file,
shown in Figure
Accessing data from an XML file
from lxml import objectify
import pandas as pd

xml = objectify.parse(open('XMLData.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))
for i in range(0, 4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'],
                   [obj[0].text, obj[1].text, obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
print(df)
• The example begins by importing libraries and parsing the data file using
the objectify.parse() method. Every XML document must contain a root
node, which is <MyDataset> in this case. The root node encapsulates the
rest of the content, and every node under it is a child. To do anything
practical with the document, you must obtain access to the root node
using the getroot() method.
• The next step is to create an empty DataFrame object that contains the
correct column names for each record entry: Number, String, and Boolean.
As with all other pandas data handling, XML data handling relies on a
DataFrame. The for loop fills the DataFrame with the four records from the
XML file (each in a <Record> node).
• The process looks complex but follows a logical order. The obj variable
contains all the children for one <Record> node. These children are loaded
into a dictionary object in which the keys are Number, String, and Boolean
to match the DataFrame columns. There is now a dictionary object that
contains the row data.
• The code creates an actual row for the DataFrame next. It gives the row
the name of the current for loop iteration. It then appends the row to the
DataFrame. To see that everything worked as expected, the code prints the
result.
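Since XMLData.xml is not bundled with these notes, here is a self-contained stand-in using the standard library's xml.etree.ElementTree and the record layout described above (root <MyDataset>, <Record> children with Number, String, and Boolean); the rows are collected into a list of dictionaries before building the DataFrame, which avoids the now-removed DataFrame.append():

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Inline stand-in for XMLData.xml, matching the structure described
# in the text: a <MyDataset> root with <Record> children.
xml_text = """<MyDataset>
  <Record><Number>1</Number><String>First</String><Boolean>True</Boolean></Record>
  <Record><Number>2</Number><String>Second</String><Boolean>False</Boolean></Record>
</MyDataset>"""

root = ET.fromstring(xml_text)

# Each record becomes a dict keyed by the child tag names.
rows = [{child.tag: child.text for child in record}
        for record in root.findall("Record")]
df = pd.DataFrame(rows, columns=["Number", "String", "Boolean"])
print(df)
```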
OUTPUT (bag-of-words NumPy array over words and sentences):
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Shape: (2369, 34442) – 2369 sentences (rows) and 34442 words (columns);
'innovision' is at index 17786.
In the matrix, rows are the sentences from the files and columns are the
words generated by stemming.
TF-IDF Transformation (Term Frequency & Inverse Document Frequency)
• Text data is converted to a vector whose frequencies reflect the importance
of each word in the document.
• TF = (number of repetitions of a word in a sentence) / (number of words in
the sentence)
• IDF = log((number of sentences + 1) / (number of sentences containing the
word + 1)) + 1
• TF-IDF = TF * IDF
• Application: gives the importance of every word in a document.
• Example:
  S1 = "good boy"
  S2 = "good girl"
  S3 = "boy girl good"

Word frequency: boy 2, girl 2, good 3

TF:
word | s1  | s2  | s3
good | 1/2 | 1/2 | 1/3
boy  | 1/2 | 0   | 1/3
girl | 0   | 1/2 | 1/3

IDF:
word | IDF
good | log(4/4) + 1 = 1
boy  | log(4/3) + 1
girl | log(4/3) + 1

TF-IDF vectors (TF * IDF):
s  | good | boy                  | girl
s1 | 1/2  | 1/2 * (log(4/3) + 1) | 0
s2 | 1/2  | 0                    | 1/2 * (log(4/3) + 1)
s3 | 1/3  | 1/3 * (log(4/3) + 1) | 1/3 * (log(4/3) + 1)

In TF-IDF, the frequency represents the semantic importance of a word.
Application: search engines.
TF-IDF Transformation implementation for sentences s1, s2, s3
• Text data is converted to a vector of frequencies (importance of each word
in the document).
• Package: sklearn.feature_extraction.text.TfidfVectorizer()

from sklearn.feature_extraction.text import *
s1 = 'good boy'
s2 = 'good girl'
s3 = 'boy girl good'
tfidf = TfidfVectorizer()
tfidfm = tfidf.fit_transform([s1, s2, s3]).toarray()
print(tfidf.vocabulary_)
print(tfidfm)
print('bow')
bow = CountVectorizer()
bowm = bow.fit_transform([s1, s2, s3]).toarray()
print(bowm)

O/P:
{'good': 2, 'boy': 0, 'girl': 1}
[[0.78980693 0.         0.61335554]
 [0.         0.78980693 0.61335554]
 [0.61980538 0.61980538 0.48133417]]
bow
[[1 0 1]
 [0 1 1]
 [1 1 1]]
(columns: boy, girl, good; rows: s1, s2, s3)

Remark: the TF-IDF matrix output will vary from our hand calculation because
TfidfVectorizer() applies Euclidean normalization after the TF-IDF
calculation of every term.
TF-IDF Transformation of a document (20 newsgroups dataset)

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import *
categories = ['sci.space']
s = fetch_20newsgroups(subset='train',
                       categories=categories,
                       remove=('headers', 'footers', 'quotes'))
vec = CountVectorizer()
bowm = vec.fit_transform(s.data).toarray()
tfidf = TfidfVectorizer()
tfidfm = tfidf.fit_transform(s.data).toarray()
print(tfidfm.shape)
print(bowm.shape)
print()
print(bowm)
print(tfidfm)

O/P:
(593, 13564)
(593, 13564)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
N-GRAM Modeling: Text modelling techniques
• An N-gram is a contiguous sequence of N items from a given
sample of text.
• Items can be characters or words.
• Types of N-grams: character N-grams and word N-grams.
N-GRAM Modeling: Character N-grams
• "the bird is flying on the blue sky"
• N=2, bigrams: 'th', 'he', 'e ', ' b', 'bi', 'ir', 'rd', 'd ', ' i', 'is', 's ', etc.
• N=3, trigrams: 'the', 'he ', 'e b', ' bi', 'bir', 'ird', 'rd ', 'd i', etc.
Trigram → next character ("the bird is flying on the blue sky"):
'the' → ' ' (space)
'he ' → 'b'
'e b' → 'i'
' bi' → 'r'
N-GRAM Modeling: Word N-grams
• "the bird is flying on the blue sky."
• N=3
• Trigrams: 'the bird is', 'bird is flying', 'is flying on', 'flying on the',
'on the blue', 'the blue sky'
Trigram → next word:
'the bird is' → 'flying'
'bird is flying' → 'on'
'is flying on' → 'the'
'flying on the' → 'blue'
'on the blue' → 'sky'
'the blue sky' → '.'
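The two kinds of N-grams above can be generated with a couple of short helpers (a sketch; the function names are our own):

```python
def char_ngrams(text, n):
    # Slide a window of n characters across the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    # Slide a window of n words across the tokenized text.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the bird is flying on the blue sky"
print(char_ngrams(sentence, 2)[:4])  # first few character bigrams
print(word_ngrams(sentence, 3))      # all word trigrams
```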
N-GRAM Modeling: Word N-grams for N sentences
• "the bird is flying on the blue sky"
• If there is more than one sentence, then N-grams are generated for each
sentence.
import re
str = 'cat mat bat rat cat mat'
result = re.search(r'm\w\w', str)
if result != None:
    print(result.group())
OUTPUT:
mat
Note: search() returns only the first matched occurrence instead of all
occurrences.
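To get all occurrences instead of only the first, findall() can be used (a sketch on the same string):

```python
import re

s = 'cat mat bat rat cat mat'
# findall() returns every non-overlapping match, not just the first.
print(re.findall(r'm\w\w', s))  # → ['mat', 'mat']
```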
Python program to create regular expression to search phone number
(form: DDD-DDD-DDDD) from string
import re
data1 = 'My phone number is: 800-555-1212.'
data2 = '800-555-1234 is my phone number.'
pattern = r'\d{3}-\d{3}-\d{4}'
m1 = re.search(pattern, data1)
print(m1.group())
m2 = re.search(pattern, data2)
print(m2.group())
OUTPUT:
800-555-1212
800-555-1234