
Working with Real Data (Unit 3)
• Data science applications require data by definition. It would be nice if you could
simply go to a data store somewhere, purchase the data you need in an easy-to-
open package, and then write an application to access that data. However, data
is messy. It appears in all sorts of places, in many different forms. Every
organization has a different method of viewing data and stores it in a different
manner as well. Even when the data management system used by one company
is the same as the data management system used by another company, the
chances are slim that the data will appear in the same format or even use the
same data types. In short, before you can do any data science work, you must
discover how to access the data in all its myriad forms. Real data requires a lot of
work to use; fortunately, Python is up to the task of manipulating it as needed.
• Now we will see the techniques required to access data in a number of forms
and locations. For example, memory streams represent a form of data storage
that your computer supports natively; flat files exist on your hard drive; relational
databases commonly appear on networks (although smaller relational databases,
such as those found in Access, could appear on your hard drive as well).
• The Scikit-learn library includes a number of toy datasets (small datasets meant
for you to play with). These datasets are complex enough to let you experiment
with a number of data science tasks in Python.
Uploading, Streaming, and Sampling Data
• Storing data in local computer memory represents the fastest and
most reliable means to access it. The data could reside anywhere.
However, you don’t actually interact with the data in its storage
location. You load the data into memory from the storage location
and then interact with it in memory.
• This is the technique we will use to access all the toy datasets
found in the Scikit-learn library, so you will see it used
relatively often in the book.
• Data scientists call the columns in a database features or variables.
The rows are cases. Each row represents a collection of variables that
you can analyze.
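For instance, a minimal sketch (using load_iris, one of the Scikit-learn toy datasets
mentioned above) shows rows as cases and columns as features:

# Load one of Scikit-learn's toy datasets directly into memory.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 cases (rows), 4 features (columns)
print(iris.feature_names)    # names of the four feature columns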
File handling in Python
 A file is a named location on disk used to store related information.
 It is used to store data permanently in non-volatile memory.
 When we need to read or write data to/from a file, we need to open it first,
and after completing the operation on the file, we need to close it, so that the
resources that are tied to the file are freed.
 Python file operations take place in the following order:
 1. Open a file
 2. Read/Write (perform the action)
 3. Close the file
Opening a file
 Before we can read/write a file, we have to open it using Python's built-in
function open(). This function returns a file object, also called a handle, as
it is used to read/modify the file accordingly.
 We can specify a mode while opening a file. In the mode we specify whether we
want to open the file for read 'r', for write 'w', or append 'a'.
 We can also specify whether to open a file in text or binary format. The default
is to read in text mode. In this mode we get strings when we read from the
file.
 On the other hand, binary mode returns bytes, and this is the mode to be
used when dealing with non-text files such as image/exe files.
Opening a file
 Syntax: file_object = open(filename [, accessmode])
 Example: f = open('C:/python33/abc.txt', 'r')
 Example: f1 = open('1.txt', 'w')
 filename: The filename argument is a string value that contains the name of the file
that we want to access.
 accessmode: This is an optional field. The default file access mode is 'r' (read).
The access mode determines the mode in which you want to read/write,
which can be 'r', 'rb', 'r+', 'rb+', 'w', 'wb', 'w+', 'wb+', 'a', 'ab', 'a+', 'ab+', or 'x'
Mode of Opening a file
 'r' – Opens a file for reading only. The file pointer is placed at the beginning of the
file. This is the default mode.
 'rb' – Opens a file for reading only in binary format. The file pointer is
placed at the beginning of the file.
 'r+' – Opens a file for both reading and writing. The file pointer is placed at the
beginning of the file.
 'rb+' – Opens a file for both reading and writing in binary format. The file
pointer is placed at the beginning of the file.
 'w' – Opens a file for writing only. Overwrites the file if the file exists. If the
file does not exist, creates a new file for writing.
 'wb' – Opens a file for writing only in binary format. Overwrites the file if
the file exists. If the file does not exist, creates a new file for writing.
Mode of Opening a file
 'w+' – Opens a file for both reading and writing. Overwrites the file if the
file exists. If the file does not exist, creates a new file for reading and writing.
 'wb+' – Opens a file for both reading and writing in binary format.
Overwrites the file if the file exists. If the file does not exist, creates a new
file for reading and writing.
 'a' – Opens a file for appending. The file pointer is at the end of the file if the
file exists. If the file does not exist, it will create a new file for writing.
 'ab' – Opens a file for appending in binary format. The file pointer is at the
end of the file if the file exists. If the file does not exist, it will create a new file
for writing.
 'a+' – Opens a file for appending and reading. The file pointer is at the
end of the file if the file exists. If the file does not exist, it will create a new file
for reading and writing.
Mode of Opening a file
 'ab+' – Opens a file for appending and reading in binary format. The file
pointer is at the end of the file if the file exists. If the file does not exist, it
will create a new file for reading and writing.
 'x' – Opens the file for exclusive creation. If the file already exists, the
operation fails.
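A minimal sketch of the 'x' mode's failure behavior (the filename new.txt is just a
placeholder, assumed not to exist before the first call):

# Exclusive creation: succeeds only if the file does not exist yet.
f = open('new.txt', 'x')   # creates new.txt for writing
f.write('created exactly once\n')
f.close()

try:
    open('new.txt', 'x')   # second attempt fails: the file now exists
except FileExistsError as e:
    print('operation failed:', e)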
Closing A file
 When we have finished operations on a file, we need to close it properly.
 Python has a garbage collector to clean up unreferenced objects, but we
must not rely on it to close the file. Closing a file will free up the resources
that were tied to the file and is done using the close() method.
 Syntax: fileObject.close()
 Example program:

# Open a file in binary format
fo = open("foo.txt", "wb")
print ("Name of the file: ", fo.name)
# Close the opened file
fo.close()
print('closed or not:', fo.closed)

O/P:
Name of the file: foo.txt
closed or not: True
Writing to a file
 In order to write into a file, we need to open it with write 'w', append 'a', or
exclusive creation 'x' mode.
 We need to be careful with 'w' mode, as it will overwrite the file if the file
already exists. All previous data will be deleted.
 The write() method writes any string to an open file. It is important to
note that Python strings can have binary data and not just text.
 The write() method does not add a newline character ('\n') to the end of
the string.
 Syntax: fileObject.write(string)
Example program to write string to file
# Open a file
fo = open("f2.txt", "w")
fo.write( "Python is a great language.\nYeah its great!!\n")
# Close the opened file
fo.close()

The above code would create the f2.txt file,
write the given content into that file, and finally
close it. If you were to open this
file, it would have the following content:

Python is a great language.
Yeah its great!!
The read() Method
 To read the content of a file, we must open the file in reading mode.
 The read() method reads a string from an open file. It is important to note
that Python strings can contain binary data, apart from text data.
 Syntax: fileObject.read([size])
 Here, the size parameter is the number of bytes to be read from the
opened file. This method starts reading from the beginning of the file, and
if size is missing, it tries to read as much as possible, perhaps until the
end of the file.
Example program to read content of file using read()
# Open a file
fo = open("f2.txt", "r+")
str = fo.read(10)
print ("Read String is : ", str)

O/P
Read String is : Python is_
(_ indicates a space)
Example 2
# Open a file
fo = open("f2.txt", "r+")
str = fo.read()
print ("Read String is : ", str)
# Close the opened file
fo.close()

O/P
Read String is : Python is a great language.
Yeah its great!!
Alternative of read(): Read content of file line by line

Example
# Open a file
fo = open("f2.txt", "r")
for i in fo:
    print(i)

O/P
Python is a great language.
Yeah its great!!
The with statement
 The 'with' statement can be used while opening a file. The advantage of the 'with'
statement is that it takes care of closing the file that it opened;
hence you need not close the file explicitly.
 In case of an exception, the with statement will close the file before the exception is
handled.
 Syntax: with open('filename', 'mode') as fileobject:

Example:
with open('f7.txt','w') as f1:
    f1.write('I am learner\n')
    f1.write('python is attractive\n')
with open('f7.txt','r') as f2:
    for line in f2:
        print(line)

O/P
I am learner
python is attractive
UPLOADING SMALL AMOUNTS OF DATA INTO MEMORY
• The most convenient method that you can use to work with data is to
load it directly into memory. This technique shows up a couple of
times earlier in the book but uses the toy dataset from the Scikit-
learn library. This section uses the Colors.txt file, shown in Figure 6-1,
for input.
• The example also relies on native Python functionality to
get the task done. When you load a file (of any type), the
entire dataset is available at all times and the loading
process is quite short. Here is an example of how this
technique works:

with open("Colors.txt", 'r') as open_file:
    print('Colors.txt content:\n' + open_file.read())
UPLOADING SMALL AMOUNTS OF DATA INTO
MEMORY
• The example begins by using the open() method to obtain a file object. The
open() function accepts the filename and an access mode. In this case, the
access mode is read (r). It then uses the read() method of the file object to
read all the data in the file. If you were to specify a size argument as part of
read(), such as read(15), Python would read only the number of characters
that you specify or stop when it reaches the End Of File (EOF)
• The entire dataset is loaded from the library into free memory. Of course,
the loading process will fail if your system lacks sufficient memory to hold
the dataset. When this problem occurs, you need to consider other
techniques for working with the dataset, such as streaming it or sampling
it. In short, before you use this technique, you must ensure that the
dataset will actually fit in memory. You won’t normally experience any
problems when working with the toy datasets in the Scikit-learn library.
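One way to guard against this in advance (a sketch, not from the original text) is to
compare the file's size on disk against a memory budget before loading it whole:

import os

# Check the file size before deciding to load the whole thing into memory.
size_bytes = os.path.getsize("Colors.txt")
print('File size: %d bytes' % size_bytes)

# The 100 MB threshold is an arbitrary illustration; pick your own budget.
if size_bytes > 100 * 1024 * 1024:
    print('Consider streaming or sampling instead of loading the file whole.')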
What is Streaming?
• Streaming is a method used for processing data piece by piece,
instead of trying to load the entire dataset into memory all at once.
• This process involves reading, working on, and writing data one piece
after another, which makes it perfect for managing large datasets that
are too big to fit into memory.
• By using streaming, you can greatly reduce the amount of memory
used and enhance the overall performance of your program.
• There are many reasons why streaming is important for handling
data:
• Large dataset size: When working with datasets that are bigger than
the memory available, streaming allows you to process the data
efficiently without overloading the system.

• Memory constraints: Streaming makes it possible to process data
within the limits of the available memory, which lowers the risk of
running into memory overflow errors.
• Error handling: Sometimes, you might need to use streaming because
certain rows in the data cause errors when you try to read everything
at once. Streaming helps you identify and handle these errors more
effectively.
• Real-time processing: Streaming is essential for applications that
require real-time data processing, such as collecting data from
sensors or analyzing social media feeds.
• Bandwidth optimization: Streaming helps make better use of
network bandwidth by minimizing the amount of data sent at any
given moment. This is especially important with cloud infrastructures
like AWS, Azure, and GCP, as it helps reduce data transfer costs and
improve performance
STREAMING LARGE AMOUNTS OF DATA INTO MEMORY
• Some datasets will be so large that you won't be able to fit them entirely in
memory at one time. In addition, you may find that some datasets load slowly
because they reside on a remote site. Streaming answers both needs by making it
possible to work with the data a little at a time. You download individual pieces,
making it possible to work with just part of the data and to work with it as you
receive it, rather than waiting for the entire dataset to download. Here's an
example of how you can stream data using Python:

with open("Colors.txt", 'r') as open_file:
    for observation in open_file:
        print('Reading Data: ' + observation)

• As the code performs data reads in the for loop, the file pointer moves to the next
record. Each record appears one at a time in observation. The code outputs the
value in observation using a print statement, one line of output per record.
Sampling
• Often the dataset we are dealing with is too large to be
handled in Python. A workaround is to take random samples out of
the dataset and work on them.
• There are situations where sampling is appropriate, as it gives a near
representation of the underlying population.
• Sampling in Python
• We need to use the sample() function
import pandas as pd
Online_Retail=pd.read_csv("datasets\\Online Retail Sales Data\\Online Retail.csv")
Online_Retail.shape
O/P:
(541909, 8)
• Using the sample function (note that replace expects a boolean, not the
string "False" — a non-empty string is truthy — so pass replace=False to
sample without replacement):
sample_data=Online_Retail.sample(n=1000,replace=False)
sample_data.shape
O/P:
(1000, 8)
• Using the .sample() function on our data set, we have taken a random
sample of 1000 rows out of the total 541909 rows of the full data.
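As a usage note (not in the original slides), sample() also accepts a random_state
seed so that repeated runs draw the same rows:

# Reproducible sampling: the seed value 42 is arbitrary.
sample_data = Online_Retail.sample(n=1000, replace=False, random_state=42)
print(sample_data.shape)   # (1000, 8)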
SAMPLING DATA IN DIFFERENT WAYS
• Data streaming obtains all the records from a data source, while
sampling accesses only specific records, like every fifth record or
random samples.
• You can save time and resources by simply sampling the data if you
don't want all the records. The code sketched below shows how to retrieve
every other record in the Colors.txt file:
• enumerate() retrieves a row number along with the row content; here the
row number starts with 0.
• j % n == 0 checks whether the row number is evenly divisible by n (with
n = 2, the even-numbered rows).
• If we change n to 3, then in the output we get row numbers 0, 3, and 6.
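The slide's code block did not survive conversion; a minimal sketch consistent with
the description above (n = 2 selects every other record) is:

# Select every other record: keep rows whose number is a multiple of n.
n = 2
with open("Colors.txt", 'r') as open_file:
    for j, observation in enumerate(open_file):
        if j % n == 0:
            print('Reading Line: ' + str(j) + ' Content: ' + observation)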
SAMPLING DATA IN DIFFERENT WAYS
• You can perform random sampling as well:

from random import random
sample_size = 0.25
with open("Colors.txt", 'r') as open_file:
    for j, observation in enumerate(open_file):
        if random()<=sample_size:
            print('Reading Line: ' + str(j) + ' Content: ' + observation)

O/P (will vary between runs):
Reading Line: 1 Content: Red 1
Reading Line: 3 Content: Yellow 3
Reading Line: 8 Content: White 8

• To make this form of selection work, you must import the random() function
from the random module. The random() method outputs a value between 0 and 1.
However, Python randomizes the output so that you don't know what value you
receive. The sample_size variable contains a number between 0 and 1 to determine
the sample size. For example, 0.25 selects 25 percent of the items in the file.
GENERATING VARIATIONS ON IMAGE DATA
• Import and analyze image data.
• Simply read a local image in, obtain statistics about that image, and display the
image onscreen, as shown in the following code:

import matplotlib.image as img
import matplotlib.pyplot as plt
%matplotlib inline
image = img.imread("steve.jpg")
print(image.shape)
print(image.size)
plt.imshow(image)
plt.show()

• The image library reads the image into memory, while the pyplot library
displays it onscreen.
• After the code reads the file, it begins by displaying the image shape
property — the number of horizontal pixels, vertical pixels, and pixel depth.
The figure shows that the image is 195 x 259 x 3 pixels. The image size property
is the product of these three elements, or 151515 bytes. imshow() loads the
image into the plot and show() displays the image.
Summary To Load Image files

Task: Load image from the web using IPython.display
Method/Class: Image class
Code:
from IPython.display import Image
i = Image('http://..........')   # URL
i

Task: Load image file from hard disk using Matplotlib
Method/Class:
1. imread() method from the image module of matplotlib to load the image
2. imshow() method from the pyplot module of matplotlib to load the image into a plot
3. show() method of the pyplot module to show the image on the plot
Code:
from matplotlib.image import imread
from matplotlib.pyplot import *
%matplotlib inline
i = imread('imagename.ext')
imshow(i)
show()
Accessing Data in Structured Flat-File Form
• A flat file presents the easiest kind of file to work with. The data appears as a
simple list of entries that you can read one at a time. Example: text files, CSV Files
and Excel Files
• A problem with using native Python techniques is that the input isn’t
intelligent. For example, when a file contains a header, Python simply reads it
as yet more data to process, rather than as a header. You can’t easily select a
particular column of data.
• The pandas library used in the sections that follow makes it much easier to read
and understand flat-file data. Classes and methods in the pandas library interpret
(parse) the flat-file data to make it easier to manipulate.
• Flat-file forms: text files, CSV files, Excel files.
• The simplest and easiest form of flat file is the text file, but it has a limitation:
a text file treats all data as strings, so you often have to convert
numeric data into other forms. A comma-separated value (CSV) file provides
more formatting and more information, but it requires a little more effort to
read. At the high end of flat-file formatting are custom data formats, such as an
Excel file, which contains extensive formatting and could include multiple
datasets in a single file.
READING FROM A TEXT FILE
• Text files can use a variety of storage formats.
However, a common format is to have a header
line that documents the purpose of each field,
followed by another line for each record in the
file. The file separates the fields using tabs. Refer
to Figure for an example of the Colors.txt file used
for the example in this section.
• To read file data, apart from native Python you can
use pandas. In pandas you find a set
of parsers: code used to read individual bits of
data and determine the purpose of each bit
according to the format of the entire file.
import pandas as pd
color_table = pd.io.parsers.read_table("Colors.txt")
print(color_table)
READING FROM A TEXT FILE
• The code imports the pandas library, uses the read_table() method to
read Colors.txt into a variable named color_table, and then displays
the resulting in-memory data onscreen using the print function.
• Notice that the parser correctly interprets the first row as consisting
of field names. It numbers the records from 0 through 7. You can control
this parsing behavior using the read_table() method's arguments.
Reading CSV delimited format
• A CSV file provides more formatting than a simple text file. In fact,
CSV files can become quite complicated.
• There is a standard that defines the format of CSV files as below:
1. A header defines each of the fields
2. Fields are separated by commas
3. Records are separated by linefeeds
4. Strings are enclosed in double quotes
5. Integers and real numbers appear without double quotes
Reading CSV delimited format
• The Titanic.csv file is used for this example. You can see the raw format
using any text editor. (Sibsp: siblings + spouse; Parch: parents + children)
• Applications such as Excel can import and format CSV
files so that they become easier to read. Figure 6-4
shows the same file in Excel.
Reading CSV delimited format
• Excel actually recognizes the header as a header. If you were to use
features such as data sorting, you could select header columns to
obtain the desired result. Fortunately, pandas also makes it possible
to work with the CSV file as formatted data, as shown in the following
example:

import pandas as pd
titanic = pd.io.parsers.read_csv("Titanic.csv")
#print(titanic) prints the whole table
X = titanic[['age']]
print(X)

O/P:
           age
0      29.0000
1       0.9167
2       2.0000
3      30.0000
4      25.0000
5      48.0000
...
1304   14.5000
1305 9999.0000
1306   26.5000
1307   27.0000
1308   29.0000
[1309 rows x 1 columns]

To create the output as a list, you simply change the selection line to
X = titanic[['age']].values. Notice the addition of the values property.
The output changes to something like this (some values omitted for the
sake of space): [[29. ] [ 0.91670001] [ 2. ] ... [26.5 ] [27. ] [29. ]]
READING EXCEL AND OTHER MICROSOFT OFFICE FILES
• Excel and other Microsoft Office applications provide highly
formatted content. You can specify every aspect of the information
these files contain. The Values.xls file used for this example provides a
listing of sine, cosine, and tangent values for a random list of angles.
• an Excel file can contain more
than one worksheet, so you need
to tell pandas which worksheet to
process. In fact, you can choose to
process multiple worksheets, if
desired. When working with other
Office products, you have to be
specific about what to process.
• Just telling pandas to process
something isn’t good enough.
Here’s an example of working
with the Values.xls file.
READING EXCEL AND OTHER MICROSOFT OFFICE FILES
• In the code below, the ExcelFile() constructor creates a pointer to the Excel file.
The xls pointer allows you to access the file, define an index column, and specify
how to present empty values.
• The index column is the one that the worksheet uses to index the
records. Using a value of None means that pandas should generate an
index for you. The parse() method obtains the values you request.

import pandas as pd
xls = pd.ExcelFile("Values.xls")
trig_values = xls.parse('Sheet1', index_col=None, na_values=['NA'])
print(trig_values)
READING EXCEL AND OTHER MICROSOFT OFFICE FILES
• The same example, without specifying na_values:

import pandas as pd
xls = pd.ExcelFile("Values.xls")
trig_values = xls.parse('Sheet1', index_col=None)
print(trig_values)
Summary To Upload/stream/sampling file

1. Using native Python file handling (text files):

Uploading: using the open() function and read() method
with open('filename','r') as f:
    print(f.read())

Streaming: using the open() function and the file object with a loop
with open('filename','r') as f:
    for i in f:
        print(i)

Sampling: using the open() function, enumerate(), and a for loop
(note: the test must use the row number, not the row string itself)
with open('filename','r') as f:
    for j, i in enumerate(f):
        if j % 2 == 0:
            print(i)

2. Using pandas (text files):

Uploading: using the read_table() method of pandas.io.parsers
import pandas as pd
t = pd.io.parsers.read_table('color.txt')
print(t)
Summary To Upload/stream/sampling file

Using pandas (CSV files):
Uploading: using the read_csv() method of pandas.io.parsers
import pandas as pd
c = pd.io.parsers.read_csv('Titanic.csv')
print(c)
print(c[['age']])   # for column access

Using pandas (Excel files):
Uploading: using an ExcelFile class object and the parse() method
import pandas as pd
ex = pd.ExcelFile('Values.xls')
trig_values = ex.parse('Sheet1', index_col=None)
print(trig_values)
Sending Data in Unstructured File Form
• Unstructured data files consist of a series of bits. The file doesn’t separate
the bits from each other in any way.
• You can’t simply look into the file and see any structure because there
isn’t any to see. Unstructured file formats rely on the file user to know
how to interpret the data. For example, each pixel of a picture file could
consist of three 32-bit fields. Knowing that each field is 32-bits is up to
you. A header at the beginning of the file may provide clues about
interpreting the file, but even so, it’s up to you to know how to interact
with the file.
• The example in this section shows how to work with a picture as an
unstructured file. The example image is a public domain offering from
http://commons.wikimedia.org/wiki/Main_Page. To work with images, you
need to access the Scikit-image library (http://scikit-image.org/)
• To install the library, give the following command at the Anaconda prompt:
conda install -c conda-forge scikit-image


from skimage.io import imread
from skimage.transform import resize
from matplotlib import pyplot as plt
import matplotlib.cm as cm

example_file = ("http://upload.wikimedia.org/wikipedia/commons/7/7d/Dog_face.png")
image = imread(example_file, as_grey=True)
plt.imshow(image, cmap=cm.gray)
plt.show()

• The code creates a string that points to the example file online and places it in
example_file. This string is part of the imread() method call, along with as_grey,
which is set to True. The as_grey argument tells Python to turn any color images
into grayscale. Any images that are already in grayscale remain that way. (Note:
newer versions of scikit-image rename this argument to as_gray.)
• To load the image on screen, rendering (making it ready to display onscreen)
is done with imshow(). The imshow() function performs the rendering and
uses a grayscale color map. The show() function actually displays the image
for you, as shown in the figure.
• You now have an image in memory and you may want to find out
more about it. When you run the following code, you discover the
image type and size:

print("data type: %s, shape: %s" % (type(image), image.shape))

OUTPUT
data type: <class 'numpy.ndarray'>, shape: (90, 90)

• The output from this call tells you that the image type is a numpy.ndarray and
that the image size is 90 pixels by 90 pixels. The image is actually an array of
pixels that you can manipulate in various ways. For example, if you want to crop
the image, you can use the following code to manipulate the image array:

image2 = image[5:70,0:70]
plt.imshow(image2, cmap=cm.gray)
plt.show()

• The numpy.ndarray in image2 is smaller than the one in image, so the output is
smaller as well. The figure shows typical results. The purpose of cropping the
image is to make it a specific size. Both images must be the same size for you to
analyze them. Cropping is one way to ensure that the images are the correct size
for analysis.
• Another method that you can use to change the image size is to resize it. The
following code resizes the image to a specific size for analysis:

image3 = resize(image2, (30, 30), mode='symmetric')
plt.imshow(image3, cmap=cm.gray)
print("data type: %s, shape: %s" % (type(image3), image3.shape))

• The output from the print() function tells you that the image is now 30 pixels
by 30 pixels in size. You can compare it to any image with the same dimensions.
• After you have all the images the right size, you need to flatten them.
A dataset row is always a single dimension, not two dimensions. The
image is currently an array of 30 pixels by 30 pixels, so you can’t make
it part of a dataset. The following code flattens image3 so that it
becomes an array of 900 elements that is stored in image_row.
print("data type: %s, shape: %s" %(type(image3), image3.shape))
image_row = image3.flatten()
print("data type: %s, shape: %s" %(type(image_row), image_row.shape))

OUTPUT
data type: <class 'numpy.ndarray'>, shape: (30, 30)
data type: <class 'numpy.ndarray'>, shape: (900,)
Managing Data from Relational Databases
• The vast majority of data used by organizations relies on relational
databases because these databases provide the means for organizing
massive amounts of complex data in a manner that makes
the data easy to manipulate. The goal of a database manager is to
make data easy to manipulate; the focus of most data storage is to
make data easy to retrieve.
• The one common denominator between many relational databases is
that they all rely on a form of the same language to perform data
manipulation, which does make the data scientist’s job easier. The
Structured Query Language (SQL) lets you perform all sorts of
management tasks in a relational database, retrieve data as needed,
and even shape it in a particular way so that the need to perform
additional shaping is unnecessary.
• Creating a connection to a database: step 1: gain access to database
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
• The output of a read method is always a DataFrame object that contains the
requested data. To write data, you must create a DataFrame object or use an
existing DataFrame object. You normally use these methods to perform most
tasks:
• read_sql_table(): Reads data from a SQL table to a DataFrame object
• read_sql_query(): Reads data from a database using a SQL query to a
DataFrame object
• read_sql(): Reads data from either a SQL table or query to a DataFrame object
• DataFrame.to_sql(): Writes the content of a DataFrame object to the specified
tables in the database
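A minimal sketch tying create_engine() to these read/write methods; the table name
('colors') and its contents are invented for illustration:

import pandas as pd
from sqlalchemy import create_engine

# Step 1: gain access to a database (here, an in-memory SQLite database).
engine = create_engine('sqlite:///:memory:')

# Write a DataFrame to a table in the database...
df = pd.DataFrame({'color': ['Red', 'Blue'], 'value': [1, 2]})
df.to_sql('colors', engine, index=False)

# ...then read it back with a SQL query; the result is a DataFrame.
result = pd.read_sql_query('SELECT * FROM colors WHERE value > 1', engine)
print(result)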

• The sqlalchemy library provides support for a broad range of SQL databases. The
following list contains just a few of them:
• SQLite
• MySQL
• PostgreSQL
• SQL Server
• ODBC (Open Database Connectivity)
Interacting with Data from NoSQL Databases
• NoSQL is an approach to database design that can accommodate a wide
variety of data models, including key-value, document, columnar and graph
formats. NoSQL, which stands for "not only SQL," is an alternative to
traditional relational databases in which data is placed in tables and data
schema is carefully designed before the database is built. NoSQL databases
are especially useful for working with large sets of distributed data.
• These Not only SQL (NoSQL) databases are used in large data storage
scenarios in which the relational model can become overly complex or can
break down in other ways
• Of course, you find fewer of these DBMSes used in the corporate
environment because they require special handling and training. Still, some
common DBMSes are used because they provide special functionality or
meet unique requirements. The process is essentially the same for using
NoSQL databases as it is for relational databases:
• 1. Import required database engine functionality.
• 2. Create a database engine.
• 3. Make any required queries using the database engine and the
functionality supported by the DBMS.
Interacting with Data from NoSQL Databases
• Working with MongoDB
• Use the PyMongo library (https://api.mongodb.org/python/current/) and
the MongoClient class to create the required engine.
• The MongoDB engine relies heavily on the find() function to locate
data. Following is a pseudo-code example of a MongoDB session
(database_name and collection_name are placeholders):

import pymongo
import pandas as pd
from pymongo import MongoClient
connection = MongoClient()
db = connection.database_name
input_data = db.collection_name
data = pd.DataFrame(list(input_data.find()))
Accessing Data from the Web
• A web service is a service offered by an application to
another application, communicating with each other via
the World Wide Web.
• A web service is a kind of web application that provides
a means to ask questions and receive answers. Web
services usually host a number of input types. In fact, a
particular web service may host entire groups of query
inputs.
• Web services allow applications developed in different
technologies to communicate with each other through
a common format like XML, JSON, etc. Web services are
not tied to any one operating system or programming
language. For example, an application developed in Java
can communicate with one developed in C#,
Android, etc., and vice versa.
Accessing Data from the Web
• Working with web services
means working with XML (in
most cases).
• With this in mind, the
example in this section
works with XML data found
in the XMLData.xml file,
shown in Figure
Accessing data from an XML file
from lxml import objectify
import pandas as pd

xml = objectify.parse(open('XMLData.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))
for i in range(0,4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'],
                   [obj[0].text, obj[1].text, obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
print(df)
• The example begins by importing libraries and parsing the data file using
the objectify.parse() method. Every XML document must contain a root
node, which is <MyDataset> in this case. The root node encapsulates the
rest of the content, and every node under it is a child. To do anything
practical with the document, you must obtain access to the root node
using the getroot() method.
• The next step is to create an empty DataFrame object that contains the
correct column names for each record entry: Number, String, and Boolean.
As with all other pandas data handling, XML data handling relies on a
DataFrame. The for loop fills the DataFrame with the four records from the
XML file (each in a <Record> node).
• The process looks complex but follows a logical order. The obj variable
contains all the children for one <Record> node. These children are loaded
into a dictionary object in which the keys are Number, String, and Boolean
to match the DataFrame columns. There is now a dictionary object that
contains the row data.
• The code creates an actual row for the DataFrame next. It gives the row the
name of the current for loop iteration. It then appends the row to the
DataFrame. To see that everything worked as expected, the code prints the
result.
OUTPUT (bowm: a bag-of-words NumPy array; rows are SENTENCES, columns are WORDS)
2369
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
34442
(2369, 34442)
innovision is at index 17786

In the array, the rows are sentences from the files and the columns are words
generated by stemming: 2369 sentences by 34442 words, with the word
'innovision' at column index 17786.
TF-IDF Transformation (Term Frequency & Inverse Document Frequency)
• Text data will be converted to a vector whose frequencies reflect the importance
of each word in the document.
• Application: provides the importance of every word in a document.

TF = (number of repetitions of a word in a sentence) / (number of words in the sentence)
IDF = log((number of sentences + 1) / (number of sentences containing the word + 1)) + 1
TF-IDF = TF * IDF

• Example:
• S1 = "good boy"
• S2 = "good girl"
• S3 = "boy girl good"

Word frequency:
boy   2
girl  2
good  3

TF table:
word   s1    s2    s3
good   1/2   1/2   1/3
boy    1/2   0     1/3
girl   0     1/2   1/3

IDF table:
word   IDF
good   log(4/4)+1 = 1
boy    log(4/3)+1
girl   log(4/3)+1

TF-IDF vectors (TF * IDF):
s    good   boy                girl
s1   1/2    1/2*(log(4/3)+1)   0
s2   1/2    0                  1/2*(log(4/3)+1)
s3   1/3    1/3*(log(4/3)+1)   1/3*(log(4/3)+1)

In TF-IDF, the frequency represents the semantic importance of a word.
Application: search engines.
TF-IDF Transformation implementation for sentences s1, s2, s3
• Text data will be converted to a vector with frequency (importance of a word in
the document).
• Package: sklearn.feature_extraction.text.TfidfVectorizer()

from sklearn.feature_extraction.text import *
s1='good boy'
s2='good girl'
s3='boy girl good'
tfidf=TfidfVectorizer()
tfidfm=tfidf.fit_transform([s1,s2,s3]).toarray()
print(tfidf.vocabulary_)
print(tfidfm)
print('bow')
bow=CountVectorizer()
bowm=bow.fit_transform([s1,s2,s3]).toarray()
print(bowm)

OUTPUT (columns in vocabulary order boy, girl, good; rows s1, s2, s3):
{'good': 2, 'boy': 0, 'girl': 1}
[[0.78980693 0.         0.61335554]
 [0.         0.78980693 0.61335554]
 [0.61980538 0.61980538 0.48133417]]
bow
[[1 0 1]
 [0 1 1]
 [1 1 1]]

Remark: The output of the TF-IDF matrix will vary from our hand calculation
because TfidfVectorizer() applies Euclidean normalization after the TF-IDF
calculation of every term.
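To see that the hand calculation and the library agree, a small check (a sketch, not
part of the original slides) can L2-normalize the manual TF-IDF row for s1:

import numpy as np

# Manual TF-IDF row for s1 = "good boy", in vocabulary order [boy, girl, good].
idf_boy  = np.log(4/3) + 1   # boy appears in 2 of 3 sentences: log((3+1)/(2+1)) + 1
idf_good = np.log(4/4) + 1   # good appears in all 3 sentences: log((3+1)/(3+1)) + 1
row = np.array([0.5 * idf_boy, 0.0, 0.5 * idf_good])

# Euclidean (L2) normalization, as TfidfVectorizer applies by default.
print(row / np.linalg.norm(row))   # approx [0.7898, 0.0, 0.6134]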
TF-IDF Transformation of a document (20 newsgroups dataset)

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import *
categories = ['sci.space']
s = fetch_20newsgroups(subset='train',
                       categories=categories,
                       remove=('headers', 'footers', 'quotes'))
vec = CountVectorizer()
bowm=vec.fit_transform(s.data).toarray()
tfidf = TfidfVectorizer()
tfidfm=tfidf.fit_transform(s.data).toarray()
print(tfidfm.shape)
print(bowm.shape)
print()
print(bowm)
print(tfidfm)

OUTPUT:
(593, 13564)
(593, 13564)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
N-GRAM Modeling: Text modelling techniques
• An N-gram is a continuous sequence of N items from a given
sample of text.
• Items can be characters or words.
• Types of N-gram: character N-gram and word N-gram.
N-GRAM Modeling: Character N-gram
• "the bird is flying on the blue sky"
• N=2
• Bigrams: 'th', 'he', 'e ', ' b', 'bi', 'ir', 'rd', 'd ', ' i', 'is', 's ', etc.
• N=3:
• Trigrams: 'the', 'he ', 'e b', ' bi', 'bir', 'ird', 'rd ', 'd i', etc.

For "the bird is flying on the blue sky":
Trigram    Next character
'the'      space
'he '      b
'e b'      i
' bi'      r
N-GRAM Modeling: Word N-gram
• "the bird is flying on the blue sky."
• N=3
• Trigrams: 'the bird is', 'bird is flying', 'is flying on', 'flying on the',
'on the blue', 'the blue sky'

Trigram          Next word
the bird is      flying
bird is flying   on
is flying on     the
flying on the    blue
on the blue      sky
the blue sky     .
N-GRAM Modeling: Word N-gram for N number of sentences
• "the bird is flying on the blue sky"
• If the number of sentences is more than one, then each trigram can be
followed by several possible next words (see the sketch after this table):

Trigram          Next word
the bird is      [flying, eating, sleeping]
bird is flying   [on, through]
is flying on     the
flying on the    [blue, orange]
on the blue      sky
the blue sky     .
N-GRAM Modeling with stop words
• Consider the sentence: "shreedhar is a good boy"
• Stop words: 'is', 'a'
• With N=3, the word N-gram generator removes the stop words first and
generates the trigram: 'shreedhar good boy'
For implementation: parameters of the CountVectorizer() class
• analyzer : string, {'word', 'char', 'char_wb'} or callable, default='word'
• Whether the feature should be made of word n-grams or character n-grams.
Option 'char_wb' creates character n-grams only from text inside word
boundaries.
• ngram_range : tuple (min_n, max_n), default=(1, 1)
• The lower and upper boundary of the range of n-values for different word n-
grams or char n-grams to be extracted. All values of n such that min_n
<= n <= max_n will be used. For example, an ngram_range of (1, 1) means only
unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
Only applies if analyzer is not callable.
• max_features : int, default=None. If None, the whole vocabulary is
considered; otherwise only the specified number of n-grams will be
considered.
N-gram using a sentence

from sklearn.feature_extraction.text import *
s1='the bird is flying on the blue sky'
bow_cngm = CountVectorizer(analyzer='char_wb',
                           ngram_range=(3,3))
bow_cngm_m=bow_cngm.fit_transform([s1]).toarray()
print(bow_cngm.get_feature_names())
bow_cngm_new = CountVectorizer(analyzer='char_wb',
                               ngram_range=(3,3), max_features=5)
bow_cngm_new_m=bow_cngm_new.fit_transform([s1]).toarray()
print(bow_cngm_new.get_feature_names())
bow_wngrm = CountVectorizer(analyzer='word',
                            ngram_range=(2,2), stop_words='english')
bow_wngrm_m=bow_wngrm.fit_transform([s1]).toarray()
print(bow_wngrm.get_feature_names())
bow_wngrm_new = CountVectorizer(analyzer='word',
                                ngram_range=(2,2),
                                max_features=3, stop_words='english')
bow_wngrm_new_m=bow_wngrm_new.fit_transform([s1]).toarray()
print(bow_wngrm_new.get_feature_names())

OUTPUT:
[' bi', ' bl', ' fl', ' is', ' on', ' sk', ' th', 'bir', 'blu', 'fly',
 'he ', 'ing', 'ird', 'is ', 'ky ', 'lue', 'lyi', 'ng ', 'on ', 'rd ',
 'sky', 'the', 'ue ', 'yin']
[' bi', ' th', 'he ', 'sky', 'the']
['bird flying', 'blue sky', 'flying blue']
['bird flying', 'blue sky', 'flying blue']
N-gram example using 20newsgroups

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import *
categories = ['sci.space']
s = fetch_20newsgroups(subset='train',
                       categories=categories,
                       remove=('headers', 'footers', 'quotes'))
bow_cngrm = CountVectorizer(analyzer='char_wb',
                            ngram_range=(3,3),
                            max_features=10)
x1=bow_cngrm.fit_transform(s.data)
print(bow_cngrm.get_feature_names())
print(x1.toarray())
bow_wngrm = CountVectorizer(analyzer='word',
                            ngram_range=(2,2),
                            max_features=10,
                            stop_words='english')
x2=bow_wngrm.fit_transform(s.data)
print(bow_wngrm.get_feature_names())
print(x2.toarray())

OUTPUT
[' an', ' in', ' of', ' th', ' to', 'he ', 'ing', 'ion', 'nd ', 'the']
[[ 4 0 3 ... 6 3 10]
 [ 0 0 2 ... 2 0 5]
 [ 3 5 3 ... 0 3 10]
 ...
 [ 3 2 2 ... 0 2 7]
 [ 1 2 4 ... 1 0 3]
 [10 5 14 ... 7 9 26]]
['anonymous ftp', 'commercial space', 'gamma ray', 'nasa gov', 'national space', 'remote sensing', 'sci space', 'space shuttle', 'space station', 'washington dc']
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Regular Expressions
 A regular expression is a string that contains special
symbols and characters to find and extract the information
we need from the given data. A regular expression
helps us to search, match, find, and split
information as per our requirements.
 In Python, regular expressions are also known as regex.
 Python provides the 're' module, which stands for regular
expression. This module contains many methods that can
be used for finding information in the available data.
Regular Expressions
reg = r'm\w\w'
Here the prefix 'r' represents a raw string of the regular expression.
Generally we write regular expressions as raw strings.

str = 'This is normal\nstring'
print(str)
O/P:
This is normal
string

 In a normal string, '\n' is interpreted as a new line, while in
a 'raw' string '\n' is intended for a different purpose: it will not
act as a new line.
str = r'This is raw\nstring'
print(str)
O/P:
This is raw\nstring
sequence characters in Regular Expressions
Character   Description
\d          Represents any digit (0-9)
\D          Represents any non-digit
\s          Represents white space, e.g. \t, \n, \v
\S          Represents any non-white-space character
\w          Represents any alphanumeric character (a-z, A-Z, 0-9)
\W          Represents any non-alphanumeric character
\A          Matches only at the start of the string
\Z          Matches only at the end of the string
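A quick demonstration of a few of these sequence characters (the example string is
made up for illustration):

import re

text = 'Order 66 shipped'
print(re.findall(r'\d', text))    # ['6', '6'] : individual digits
print(re.findall(r'\w+', text))   # ['Order', '66', 'shipped'] : alphanumeric runs
print(re.search(r'\AOrder', text).group())   # 'Order' : matches only at the start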
Methods of re module used in Regular Expressions…4 marks
 1. search(): This method searches the string from the beginning
till the end and returns the first occurrence of the matching
string. We can use the group() method to retrieve the string from the
object returned by search().
ob = re.search(raw_string_of_RE, input_string)
 To obtain the matching part of the regular expression from the input
string, we have to call group() on ob:
ob.group(): returns the matched part
Python program to create a regular expression to search for strings starting
with m and having 3 characters in total, using the search()
method…implementation-1…4 marks

import re
str='cat mat bat rat cat mat'
result=re.search(r'm\w\w',str)
if result!=None:
    print(result.group())

OUTPUT:
mat

search() returns only the first matched
occurrence instead of all occurrences.
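To get all occurrences rather than just the first, the re module's findall() function
can be used (a small sketch on the same string):

import re

str = 'cat mat bat rat cat mat'
# findall() returns every non-overlapping match as a list of strings.
print(re.findall(r'm\w\w', str))   # ['mat', 'mat']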
Python program to create a regular expression to search for a phone number
(form: DDD-DDD-DDDD) in a string

import re
data1 = 'My phone number is: 800-555-1212.'
data2 = '800-555-1234 is my phone number.'
pattern = r'\d{3}-\d{3}-\d{4}'
m1=re.search(pattern,data1)
print(m1.group())
m2=re.search(pattern,data2)
print(m2.group())

OUTPUT:
800-555-1212
800-555-1234
