
An

Industrial Training Seminar Report


on

“1. Introduction to Python Programming


&
2. Python for Data Science”
Submitted in partial fulfillment for the award of degree of

Bachelor of Technology
in
Department of Electronics & Communication Engineering

Submitted to: Submitted by:


Ms. Pooja Choudhary Saransh Sharma
Assistant Professor-II 17ESKEC064
Department of ECE VII Sem
SKIT M& G, Jaipur ECE B
Ms. Gloria Joseph
Assistant Professor-II
Department of ECE
SKIT M& G, Jaipur

Department of Electronics and Communication Engineering


Swami Keshvanand Institute of Technology, Management
& Gramothan, Jaipur
Rajasthan Technical University, Kota
Session 2020-2021

CERTIFICATE

Acknowledgement

Some people play a vital role in the successful completion of any work, and this work would
not have been possible without the kind support and help of many individuals and organizations.
I would like to extend my sincere thanks to all of them.

I am highly indebted to my mentors, Mrs. Pooja Choudhary, Assistant Professor, and Ms.
Gloria Joseph, Assistant Professor, for their guidance and constant supervision, for providing
the necessary information regarding the industrial training seminar, and for their support in
completing the Industrial Training. I would like to thank Mr. Lalit Kumar Lata, Faculty,
Department of Electronics and Communication, SKIT M & G, Jaipur, for his kind support and
guidance in completing my Industrial Training successfully. Their excellent guidance throughout
the training has been instrumental in making it a success.

I would like to thank Prof. (Dr.) Mukesh Arora, Professor & Head, Department of Electronics
and Communication, SKIT M & G, Jaipur, for giving me the opportunity to undertake this training
and for providing consistent direction and the adequate means and support to pursue it.

Finally, earnest and sincere thanks to all the Faculty members of the Electronics and Communication
Department, SKIT M & G, Jaipur, for their direct and indirect support in the completion of this
industrial training.

Last but not least, I sincerely express my deepest gratitude to my family for their
wholehearted support and encouragement to take up this course. In addition, a very special
thanks to my colleagues and friends for their support.

Saransh Sharma
17ESKEC064

TABLE OF CONTENTS

Certificate ii
Acknowledgement iv
Table of Content v
List of Figures vii
PART A/ Course I
Chapter 1:Introduction………………….………………………………………………………2
1.1 Introduction……………………………………………………………………………………2
1.2 Why Python?………………………………………………………………………………….2
1.3 Characteristics of Python……………………………………………………………………..2
1.4 Local Environment Setup……………………………………………………………………..3
Chapter 2: Strings………………………………………………………………………………..4
2.1 Strings is a Sequence…….……………………………………………………………………4
2.2 Getting the length of a string………………………………………………………………….4
2.3 Traversal through a string with a loop…………………………………………………………

LIST OF FIGURES

Fig. 1.1 Disc harrow and disc 6


Fig. 1.2 Spring harrow 7
Fig. 1.3 Roller harrow 7

Part A

(Introduction to Scripting in Python)

Chapter 1

INTRODUCTION TO PYTHON
1.1 Introduction

Python is a general-purpose, interpreted, interactive, object-oriented, high-level
programming language. Guido van Rossum began developing it in the late 1980s, and the first
version was released in 1991. Python source code is freely available under the Python Software
Foundation License, an open-source license that is compatible with the GNU General Public
License (GPL). Python is used in web development, data science, software prototyping, and many
other areas. Its simple, easy-to-use syntax makes it an excellent first language for beginners.
1.2 Why Python?
Python is one of the most loved programming languages among developers, data scientists,
software engineers, and even hackers because of its versatility, flexibility, and object-oriented
features. Many of the web and mobile applications we enjoy today are built on Python’s
abundant libraries, frameworks, and modules. Python is suitable for everything from small
projects to large enterprise web services, and it works well alongside other programming
languages.
• Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to
compile your program before executing it. This is similar to PERL and PHP.
• Python is Interactive − You can sit at a Python prompt and interact with the
interpreter directly to write your programs.
• Python is Object-Oriented − Python supports the Object-Oriented style of
programming, which encapsulates code within objects.
• Python is a Beginner's Language − Python is a great language for beginner-level
programmers and supports the development of a wide range of applications, from simple
text processing to WWW browsers to games.
• Easy-to-read − Python code is clearly structured and easy to read.

1.3 Characteristics of Python

The following are important characteristics of Python programming:
• It supports functional and structured programming methods as well as OOP.
• It can be used as a scripting language or can be compiled to byte-code for building large
applications.
• It provides very high-level dynamic data types and supports dynamic type checking.
• It supports automatic garbage collection.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

1.4 Local Environment Setup

Python is available on a wide variety of platforms including Linux and Mac OS X. Let's
understand how to set up our Python environment on Windows.

Open a terminal window and type "python" to find out if it is already installed and which version
is installed.

1.4.1 Getting Python


The most up-to-date and current source code, binaries, documentation, news, etc., is available on
the official website of Python https://www.python.org/
You can download Python documentation from https://www.python.org/doc/. The
documentation is available in HTML, PDF, and PostScript formats.
1.4.2 Installing Python
Python distribution is available for a wide variety of platforms. You need to download
only the binary code applicable for your platform and install Python.
If the binary code for your platform is not available, you need a C compiler to compile the
source code manually. Compiling the source code offers more flexibility in terms of choice of
features that you require in your installation.
Here is a quick overview of installing Python on the Windows platform:
• Open a web browser and go to https://www.python.org/downloads/.
• Follow the link for the Windows installer python-XYZ.msi file, where XYZ is the version
you need to install. (Recent Python releases provide the Windows installer as an .exe file
instead of an .msi file.)
• To use the python-XYZ.msi installer, the Windows system must support Microsoft
Installer 2.0. Save the installer file to your local machine and then run it to find out if
your machine supports MSI.
• Run the downloaded file. This brings up the Python install wizard, which is easy to
use. Just accept the default settings, wait until the install is finished, and you are done.

1.4.3 Setting up PATH


Programs and other executable files can be in many directories, so operating systems
provide a search path that lists the directories that the OS searches for executables.
The path is stored in an environment variable, which is a named string maintained by the
operating system. This variable contains information available to the command shell and other
programs.
The path variable is named as PATH in Unix or Path in Windows (Unix is case sensitive;
Windows is not).
In Mac OS, the installer handles the path details. To invoke the Python interpreter from any
particular directory, you must add the Python directory to your path.
To add the Python directory to the path for a particular session in Windows, type the following
at the command prompt and press Enter:

path %path%;C:\Python

Note − C:\Python is the path of the Python directory.
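
Once the interpreter is on the search path, a short script can confirm which executable the operating system finds and which version is running. The following is a minimal sketch using only the standard library; the helper name find_python is hypothetical and is not part of Python itself, and on some systems the command is python3 rather than python.

```python
import shutil
import sys


def find_python(command="python"):
    """Return the full path that the OS search path resolves `command` to, or None."""
    return shutil.which(command)


# Path and version of the interpreter that is running this script.
print("Running interpreter:", sys.executable)
print("Version:", sys.version.split()[0])

# Path the shell would pick for the bare command "python" (None if it is not on PATH).
print("'python' on PATH:", find_python())
```

If the last line prints None, the Python directory has probably not been added to the path yet.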

Chapter 2

STRINGS

2.1 A string is a sequence


A string is a sequence of characters. You can access the characters one at a time with the bracket
operator:

>>> fruit = 'banana'
>>> letter = fruit[1]

The second statement extracts the character at index position 1 from the fruit variable and
assigns it to the letter variable. The expression in brackets is called an index. The index indicates
which character in the sequence you want (hence the name).
2.1.1 String Indexes
You can use any expression, including variables and operators, as an index, but the value of the
index has to be an integer. Otherwise you get:

>>> letter = fruit[1.5]
TypeError: string indices must be integers

2.2 Getting the length of a string using len


len is a built-in function that returns the number of characters in a string:

>>> fruit = 'banana'
>>> len(fruit)
6

To get the last letter of a string, you might be tempted to try something like this:

>>> length = len(fruit)
>>> last = fruit[length]
IndexError: string index out of range

The reason for the IndexError is that there is no letter in “banana” with the index 6. Since we
started counting at zero, the six letters are numbered 0 to 5. To get the last character, you
have to subtract 1 from length:

>>> last = fruit[length-1]
>>> print(last)
a

Alternatively, you can use negative indices, which count backward from the end of the string.
The expression fruit[-1] yields the last letter, fruit[-2] yields the second to last, and so on.
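
As a small illustration of the two indexing directions described above, the sketch below reads the same string from both ends. The variable names are chosen only for this example.

```python
fruit = 'banana'

# Positive indices count from the left, starting at 0.
first = fruit[0]
last_by_length = fruit[len(fruit) - 1]

# Negative indices count from the right, starting at -1.
last = fruit[-1]
second_to_last = fruit[-2]

print(first, last_by_length, last, second_to_last)  # b a a n
```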

2.3 Traversal through a string with a loop


A lot of computations involve processing a string one character at a time. Often they start at the
beginning, select each character in turn, do something to it, and continue until the end. This
pattern of processing is called a traversal. One way to write a traversal is with a while loop:

index = 0
while index < len(fruit):
    letter = fruit[index]
    print(letter)
    index = index + 1

This loop traverses the string and displays each letter on a line by itself. The loop condition
is index < len(fruit), so when index is equal to the length of the string, the condition is false, and
the body of the loop is not executed. The last character accessed is the one with the
index len(fruit)-1, which is the last character in the string. Another way to write a traversal is
with a for loop:

for char in fruit:
    print(char)

Each time through the loop, the next character in the string is assigned to the variable char. The
loop continues until no characters are left.

2.4 String slices


A segment of a string is called a slice. Selecting a slice is similar to selecting a character:

>>> s = 'Monty Python'
>>> print(s[0:5])
Monty
>>> print(s[6:12])
Python

The operator [n:m] returns the part of the string from the “n-th” character to the “m-th” character,
including the first but excluding the last.
If you omit the first index (before the colon), the slice starts at the beginning of the string. If you
omit the second index, the slice goes to the end of the string:

>>> fruit = 'banana'
>>> fruit[:3]
'ban'
>>> fruit[3:]
'ana'

2.5 Strings are immutable


It is tempting to use the [] operator on the left side of an assignment, with the intention of
changing a character in a string. For example:

>>> greeting = 'Hello, world!'
>>> greeting[0] = 'J'
TypeError: 'str' object does not support item assignment

The reason for the error is that strings are immutable, which means you can’t change an existing
string.
The best you can do is create a new string that is a variation on the original:

>>> greeting = 'Hello, world!'
>>> new_greeting = 'J' + greeting[1:]
>>> print(new_greeting)
Jello, world!

This example concatenates a new first letter onto a slice of greeting. It has no effect on the
original string.
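
The same idea can be wrapped in a small function. This is only a sketch: replace_char is a hypothetical helper written for this report, and it assumes a non-negative index that lies inside the string.

```python
def replace_char(text, index, new_char):
    """Return a copy of `text` with the character at `index` replaced.

    Assumes 0 <= index < len(text); the original string is never modified.
    """
    return text[:index] + new_char + text[index + 1:]


greeting = 'Hello, world!'
new_greeting = replace_char(greeting, 0, 'J')

print(new_greeting)  # Jello, world!
print(greeting)      # Hello, world!  (the original is unchanged)
```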

2.6 Looping and counting


The following program counts the number of times the letter “a” appears in a string:

word = 'banana'
count = 0
for letter in word:
    if letter == 'a':
        count = count + 1
print(count)

This program demonstrates another pattern of computation called a counter. The
variable count is initialized to 0 and then incremented each time an “a” is found. When the loop
exits, count contains the result: the total number of a’s.
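
The counter pattern can be generalized to any letter by placing it in a function. The sketch below uses a hypothetical function name, count_char, and compares its result with the built-in count method of strings.

```python
def count_char(word, target):
    """Count how many times `target` appears in `word` using the counter pattern."""
    count = 0
    for letter in word:
        if letter == target:
            count = count + 1
    return count


print(count_char('banana', 'a'))  # 3
print('banana'.count('a'))        # 3, the built-in string method agrees
```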

2.7 The in operator
The word in is a boolean operator that takes two strings and returns True if the first appears as a
substring in the second:

>>> 'a' in 'banana'
True
>>> 'seed' in 'banana'
False

2.8 String comparison


The comparison operators work on strings. To see if two strings are equal:

if word == 'banana':
    print('All right, bananas.')

Other comparison operations are useful for putting words in alphabetical order:

if word < 'banana':
    print('Your word, ' + word + ', comes before banana.')
elif word > 'banana':
    print('Your word, ' + word + ', comes after banana.')
else:
    print('All right, bananas.')

2.9 String methods

Python has a set of built-in methods that you can use on strings.

Note: All string methods return new values. They do not change the original string.

A method is like a function, but it runs "on" an object. If the variable s is a string, then
the code s.lower() runs the lower() method on that string object and returns the result (this
idea of a method running on an object is one of the basic ideas that make up Object Oriented
Programming, OOP). Here are some of the most common string methods:

• s.lower(), s.upper() -- return the lowercase or uppercase version of the string
• s.strip() -- returns a string with whitespace removed from the start and end
• s.isalpha()/s.isdigit()/s.isspace()... -- test if all the string chars are in the various character
classes
• s.startswith('other'), s.endswith('other') -- test if the string starts or ends with the given
other string
• s.find('other') -- searches for the given other string (not a regular expression) within s, and
returns the first index where it begins or -1 if not found
• s.replace('old', 'new') -- returns a string where all occurrences of 'old' have been replaced
by 'new'
• s.split('delim') -- returns a list of substrings separated by the given delimiter. The
delimiter is not a regular expression, it's just text. 'aaa,bbb,ccc'.split(',') -> ['aaa', 'bbb',
'ccc']. As a convenient special case s.split() (with no arguments) splits on all whitespace
chars.
• s.join(list) -- opposite of split(), joins the elements in the given list together using the
string as the delimiter. e.g. '---'.join(['aaa', 'bbb', 'ccc']) -> 'aaa---bbb---ccc'
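
The methods listed above can be tried out together in a short script. This is an illustrative sketch only; the sample string and variable names are made up for the example.

```python
s = '  Hello, World  '

stripped = s.strip()                            # 'Hello, World'
lowered = stripped.lower()                      # 'hello, world'
position = stripped.find('World')               # 7
missing = stripped.find('Python')               # -1 because it is not found
replaced = stripped.replace('World', 'Python')  # 'Hello, Python'
parts = 'aaa,bbb,ccc'.split(',')                # ['aaa', 'bbb', 'ccc']
joined = '---'.join(parts)                      # 'aaa---bbb---ccc'

print(stripped, lowered, position, missing, replaced, parts, joined)
print(repr(s))  # the original string still has its surrounding spaces
```

Every call returns a new value, which is why s is unchanged at the end.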

Chapter 3
LISTS

3.1 A list is a sequence


Like a string, a list is a sequence of values. In a string, the values are characters; in a list, they
can be any type. The values in a list are called elements or sometimes items. There are several
ways to create a new list; the simplest is to enclose the elements in square brackets (“[” and “]”):

[10, 20, 30, 40]
['crunchy frog', 'ram bladder', 'lark vomit']

The first example is a list of four integers. The second is a list of three strings. The elements of a
list don’t have to be the same type. The following list contains a string, a float, an integer, and
(lo!) another list:

['spam', 2.0, 5, [10, 20]]

A list within another list is nested. A list that contains no elements is called an empty list; you
can create one with empty brackets, []. As you might expect, you can assign list values to
variables:

>>> cheeses = ['Cheddar', 'Edam', 'Gouda']
>>> numbers = [17, 123]
>>> empty = []
>>> print(cheeses, numbers, empty)
['Cheddar', 'Edam', 'Gouda'] [17, 123] []

3.2 Lists are mutable


Unlike strings, lists are mutable because you can change the order of items in a list or reassign an
item in a list. When the bracket operator appears on the left side of an assignment, it identifies
the element of the list that will be assigned.

>>> numbers = [17, 123]
>>> numbers[1] = 5
>>> print(numbers)
[17, 5]

The element at index 1 of numbers, which used to be 123, is now 5.


List indices work the same way as string indices:
• Any integer expression can be used as an index.
• If you try to read or write an element that does not exist, you get an IndexError.
• If an index has a negative value, it counts backward from the end of the list.

The in operator also works on lists.

>>> cheeses = ['Cheddar', 'Edam', 'Gouda']
>>> 'Edam' in cheeses
True
>>> 'Brie' in cheeses
False

3.3 List slices


The slice operator also works on lists:

>>> t = ['a', 'b', 'c', 'd', 'e', 'f']
>>> t[1:3]
['b', 'c']
>>> t[:4]
['a', 'b', 'c', 'd']
>>> t[3:]
['d', 'e', 'f']

If you omit the first index, the slice starts at the beginning. If you omit the second, the slice goes
to the end. So if you omit both, the slice is a copy of the whole list.

>>> t[:]
['a', 'b', 'c', 'd', 'e', 'f']

Since lists are mutable, it is often useful to make a copy before performing operations that fold,
spindle, or mutilate lists.
A slice operator on the left side of an assignment can update multiple elements:

>>> t = ['a', 'b', 'c', 'd', 'e', 'f']
>>> t[1:3] = ['x', 'y']
>>> print(t)
['a', 'x', 'y', 'd', 'e', 'f']
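
To illustrate why copying with a slice matters for mutable lists, the following sketch contrasts a second name for the same list with a slice copy. The names alias and duplicate are used only for this example.

```python
original = ['a', 'b', 'c', 'd', 'e', 'f']

alias = original         # a second name for the same list object
duplicate = original[:]  # a new list with the same elements

duplicate[1:3] = ['x', 'y']  # changes only the copy
alias[0] = 'z'               # changes the original, because alias is the same object

print(original)   # ['z', 'b', 'c', 'd', 'e', 'f']
print(duplicate)  # ['a', 'x', 'y', 'd', 'e', 'f']
```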

3.4 List methods


Python provides methods that operate on lists. For example, append adds a new element to the
end of a list:

>>> t = ['a', 'b', 'c']
>>> t.append('d')
>>> print(t)
['a', 'b', 'c', 'd']

extend takes a list as an argument and appends all of the elements:

>>> t1 = ['a', 'b', 'c']
>>> t2 = ['d', 'e']
>>> t1.extend(t2)
>>> print(t1)
['a', 'b', 'c', 'd', 'e']

This example leaves t2 unmodified.

sort arranges the elements of the list from low to high:

>>> t = ['d', 'c', 'e', 'b', 'a']
>>> t.sort()
>>> print(t)
['a', 'b', 'c', 'd', 'e']

Most list methods are void; they modify the list and return None. If you accidentally write t =
t.sort(), you will be disappointed with the result.
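
The pitfall described above is easy to see in a short script. The sketch below also shows the built-in sorted function, which returns a new list and leaves the original unchanged.

```python
t = ['d', 'c', 'e', 'b', 'a']

result = t.sort()  # sorts the list in place and returns None
print(result)      # None
print(t)           # ['a', 'b', 'c', 'd', 'e']

u = ['d', 'c', 'e', 'b', 'a']
ordered = sorted(u)  # returns a new sorted list
print(ordered)       # ['a', 'b', 'c', 'd', 'e']
print(u)             # ['d', 'c', 'e', 'b', 'a'] (unchanged)
```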

3.5 Deleting elements
There are several ways to delete elements from a list. If you know the index of the element you
want, you can use pop:

>>> t = ['a', 'b', 'c']
>>> x = t.pop(1)
>>> print(t)
['a', 'c']
>>> print(x)
b

pop modifies the list and returns the element that was removed. If you don’t provide an index, it
deletes and returns the last element.
If you don’t need the removed value, you can use the del operator:

>>> t = ['a', 'b', 'c']
>>> del t[1]
>>> print(t)
['a', 'c']

If you know the element you want to remove (but not the index), you can use remove:

>>> t = ['a', 'b', 'c']
>>> t.remove('b')
>>> print(t)
['a', 'c']

The return value from remove is None.

To remove more than one element, you can use del with a slice index:

>>> t = ['a', 'b', 'c', 'd', 'e', 'f']
>>> del t[1:5]
>>> print(t)
['a', 'f']

3.6 Lists and functions
There are a number of built-in functions that can be used on lists that allow you to quickly look
through a list without writing your own loops:

>>> nums = [3, 41, 12, 9, 74, 15]
>>> print(len(nums))
6
>>> print(max(nums))
74
>>> print(min(nums))
3
>>> print(sum(nums))
154
>>> print(sum(nums)/len(nums))
25.666666666666668

The sum() function only works when the list elements are numbers. The other functions
(max(), len(), etc.) work with lists of strings and other types that can be compared.
We could use a list to rewrite an earlier program that computed the average of numbers entered
by the user.
First, the program to compute an average without a list:

total = 0
count = 0
while True:
    inp = input('Enter a number: ')
    if inp == 'done':
        break
    value = float(inp)
    total = total + value
    count = count + 1
average = total / count
print('Average:', average)

In this program, we have count and total variables to keep the number and running total of the
user’s numbers as we repeatedly prompt the user for a number.
We could simply remember each number as the user entered it and use built-in functions to
compute the sum and count at the end.

numlist = list()
while True:
    inp = input('Enter a number: ')
    if inp == 'done':
        break
    value = float(inp)
    numlist.append(value)

average = sum(numlist) / len(numlist)
print('Average:', average)

We make an empty list before the loop starts, and then each time we have a number, we append
it to the list. At the end of the program, we simply compute the sum of the numbers in the list
and divide it by the count of the numbers in the list to come up with the average.
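
Both versions of the program divide by zero if the user types done before entering any number. The sketch below separates the calculation into a function that guards against an empty list. The function name average and the choice to return None for an empty list are assumptions made for this example, not part of the original program.

```python
def average(numbers):
    """Return the mean of a list of numbers, or None for an empty list."""
    if len(numbers) == 0:
        return None  # avoids dividing by zero when no numbers were entered
    return sum(numbers) / len(numbers)


numlist = [3, 41, 12, 9, 74, 15]
print(average(numlist))  # 25.666666666666668
print(average([]))       # None
```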

3.7 Lists and strings


A string is a sequence of characters and a list is a sequence of values, but a list of characters is
not the same as a string. To convert from a string to a list of characters, you can use list:

>>> s = 'spam'
>>> t = list(s)
>>> print(t)
['s', 'p', 'a', 'm']

Because list is the name of a built-in function, you should avoid using it as a variable name. I
also avoid the letter “l” because it looks too much like the number “1”. So that’s why I use “t”.
The list function breaks a string into individual letters. If you want to break a string into words,
you can use the split method:

>>> s = 'pining for the fjords'
>>> t = s.split()
>>> print(t)
['pining', 'for', 'the', 'fjords']
>>> print(t[2])
the

Once you have used split to break the string into a list of words, you can use the index operator
(square bracket) to look at a particular word in the list.
You can call split with an optional argument called a delimiter that specifies which characters to
use as word boundaries. The following example uses a hyphen as a delimiter:

>>> s = 'spam-spam-spam'
>>> delimiter = '-'
>>> s.split(delimiter)
['spam', 'spam', 'spam']

join is the inverse of split. It takes a list of strings and concatenates the elements. join is a string
method, so you have to invoke it on the delimiter and pass the list as a parameter:

>>> t = ['pining', 'for', 'the', 'fjords']
>>> delimiter = ' '
>>> delimiter.join(t)
'pining for the fjords'

In this case the delimiter is a space character, so join puts a space between words. To concatenate
strings without spaces, you can use the empty string, "", as a delimiter.
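
A short round trip shows how split and join undo each other, and how the choice of delimiter changes the result. The variable names below are used only for illustration.

```python
s = 'pining for the fjords'

words = s.split()             # ['pining', 'for', 'the', 'fjords']
rebuilt = ' '.join(words)     # 'pining for the fjords'
squashed = ''.join(words)     # 'piningforthefjords'
hyphenated = '-'.join(words)  # 'pining-for-the-fjords'

print(rebuilt == s)  # True
print(squashed)
print(hyphenated)
```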
3.8 List Comprehensions
List comprehensions provide a concise way to create lists. Common applications are to make
new lists where each element is the result of some operations applied to each member of another
sequence or iterable, or to create a subsequence of those elements that satisfy a certain condition.

A list comprehension consists of brackets containing an expression followed by a for clause, then
zero or more for or if clauses. The result will be a new list resulting from evaluating the
expression in the context of the for and if clauses which follow it. For example, this listcomp
combines the elements of two lists if they are not equal:

>>> [(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]
[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
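
It can help to see a comprehension next to the loops it replaces. The sketch below builds the same list of pairs with explicit nested loops and then shows a simpler comprehension with a single for clause and a condition; the variable names are illustrative.

```python
# The comprehension from the text.
pairs = [(x, y) for x in [1, 2, 3] for y in [3, 1, 4] if x != y]

# The same result built with explicit nested loops.
pairs_loop = []
for x in [1, 2, 3]:
    for y in [3, 1, 4]:
        if x != y:
            pairs_loop.append((x, y))

print(pairs == pairs_loop)  # True

# A simpler comprehension: the squares of the even numbers below 10.
even_squares = [n * n for n in range(10) if n % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]
```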

Chapter 4

DICTIONARIES

4.1 Overview of Dictionaries

A dictionary is like a list, but more general. In a list, the index positions have to be integers; in a
dictionary, the indices can be (almost) any type.
You can think of a dictionary as a mapping between a set of indices (which are called keys) and a
set of values. Each key maps to a value. The association of a key and a value is called a key-
value pair or sometimes an item.
As an example, we’ll build a dictionary that maps from English to Spanish words, so the keys
and the values are all strings.
The function dict creates a new dictionary with no items. Because dict is the name of a built-in
function, you should avoid using it as a variable name.

>>> eng2sp = dict()
>>> print(eng2sp)
{}

The curly brackets, {}, represent an empty dictionary. To add items to the dictionary, you can
use square brackets:

>>> eng2sp['one'] = 'uno'

This line creates an item that maps from the key 'one' to the value 'uno'. If we print the
dictionary again, we see a key-value pair with a colon between the key and value:

>>> print(eng2sp)
{'one': 'uno'}

This output format is also an input format. For example, you can create a new dictionary with
three items:

>>> eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
>>> print(eng2sp)
{'one': 'uno', 'two': 'dos', 'three': 'tres'}

In Python 3.7 and later, a dictionary keeps its items in the order in which they were inserted,
so the pairs print in the order shown. In older versions the order was unpredictable, and the
same example could print {'one': 'uno', 'three': 'tres', 'two': 'dos'}.
Either way, the elements of a dictionary are never indexed with integer
indices. Instead, you use the keys to look up the corresponding values:

>>> print(eng2sp['two'])
dos

The key 'two' always maps to the value 'dos', so the order of the items doesn’t matter.
If the key isn’t in the dictionary, you get an exception:

>>> print(eng2sp['four'])
KeyError: 'four'

The len function works on dictionaries; it returns the number of key-value pairs:

>>> len(eng2sp)
3

The in operator works on dictionaries; it tells you whether something appears as a key in the
dictionary (appearing as a value is not good enough).

>>> 'one' in eng2sp
True
>>> 'uno' in eng2sp
False

To see whether something appears as a value in a dictionary, you can use the method values,
which returns the values as a type that can be converted to a list, and then use the in operator:

>>> vals = list(eng2sp.values())
>>> 'uno' in vals
True

The in operator uses different algorithms for lists and dictionaries. For lists, it uses a linear
search algorithm. As the list gets longer, the search time gets longer in direct proportion to the
length of the list. For dictionaries, Python uses an algorithm called a hash table that has a
remarkable property: the in operator takes about the same amount of time no matter how many
items there are in a dictionary.
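
The difference between the two search strategies can be observed with the standard timeit module. The following is a rough sketch rather than a careful benchmark: the container size, the number of repetitions, and the variable names are arbitrary choices, and the exact timings depend on the machine. The list lookup is expected to be noticeably slower because the target sits at the end of the list.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(range(n))  # keys 0..n-1, every value is None

target = n - 1  # worst case for the list: the last element

# Time 100 membership tests on each container.
list_time = timeit.timeit(lambda: target in as_list, number=100)
dict_time = timeit.timeit(lambda: target in as_dict, number=100)

print('list:', list_time)
print('dict:', dict_time)
```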

4.2 Dictionary as a set of counters
Suppose you are given a string and you want to count how many times each letter appears. There
are several ways you could do it:
1. You could create 26 variables, one for each letter of the alphabet. Then you could
traverse the string and, for each character, increment the corresponding counter, probably
using a chained conditional.
2. You could create a list with 26 elements. Then you could convert each character to a
number (using the built-in function ord), use the number as an index into the list, and
increment the appropriate counter.
3. You could create a dictionary with characters as keys and counters as the corresponding
values. The first time you see a character, you would add an item to the dictionary. After
that you would increment the value of an existing item.
Each of these options performs the same computation, but each of them implements that
computation in a different way.
An implementation is a way of performing a computation; some implementations are better than
others. For example, an advantage of the dictionary implementation is that we don’t have to
know ahead of time which letters appear in the string and we only have to make room for the
letters that do appear.
Here is what the code might look like:

word = 'brontosaurus'
d = dict()
for c in word:
    if c not in d:
        d[c] = 1
    else:
        d[c] = d[c] + 1
print(d)

We are effectively computing a histogram, which is a statistical term for a set of counters (or
frequencies).
The for loop traverses the string. Each time through the loop, if the character c is not in the
dictionary, we create a new item with key c and the initial value 1 (since we have seen this letter
once). If c is already in the dictionary we increment d[c].
Here’s the output of the program (on Python 3.7 or later, where the keys keep the order in which
they were first added):

{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}

The histogram indicates that the letters “a” and “b” appear once; “o” appears twice, and so on.
Dictionaries have a method called get that takes a key and a default value. If the key appears in
the dictionary, get returns the corresponding value; otherwise it returns the default value. For
example:

>>> counts = {'chuck': 1, 'annie': 42, 'jan': 100}
>>> print(counts.get('jan', 0))
100
>>> print(counts.get('tim', 0))
0

We can use get to write our histogram loop more concisely. Because the get method
automatically handles the case where a key is not in a dictionary, we can reduce four lines down
to one and eliminate the if statement.

word = 'brontosaurus'
d = dict()
for c in word:
    d[c] = d.get(c, 0) + 1
print(d)

The use of the get method to simplify this counting loop is a very commonly used
“idiom” in Python, and we will use it again in the rest of this report. So you should take a
moment and compare the loop using the if statement and in operator with the loop using
the get method. They do exactly the same thing, but one is more succinct.
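
For comparison, the standard library also provides collections.Counter, which performs the same kind of counting. The sketch below checks that it agrees with the get idiom. Counter is not used elsewhere in this report; it is shown here only as an alternative.

```python
from collections import Counter

word = 'brontosaurus'

# The get idiom from the text.
d = dict()
for c in word:
    d[c] = d.get(c, 0) + 1

# collections.Counter from the standard library does the same counting.
counted = Counter(word)

print(d == dict(counted))  # True
print(counted['o'])        # 2
print(counted['z'])        # 0, a missing key counts as zero
```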

4.3 Dictionaries and files


One of the common uses of a dictionary is to count the occurrence of words in a file with some
written text. Let’s start with a very simple file of words taken from the text of Romeo and Juliet.
For the first set of examples, we will use a shortened and simplified version of the text with no
punctuation. Later we will work with the text of the scene with punctuation included.

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
We will write a Python program to read through the lines of the file, break each line into a list of
words, and then loop through each of the words in the line and count each word using a
dictionary.
You will see that we have two for loops. The outer loop is reading the lines of the file and the
inner loop is iterating through each of the words on that particular line. This is an example of a
pattern called nested loops because one of the loops is the outer loop and the other loop is
the inner loop.
Because the inner loop executes all of its iterations each time the outer loop makes a single
iteration, we think of the inner loop as iterating “more quickly” and the outer loop as iterating
more slowly.
The combination of the two nested loops ensures that we will count every word on every line of
the input file.

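fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except OSError:
    # open raises OSError (for example FileNotFoundError) if the file cannot be read
    print('File cannot be opened:', fname)
    exit()
counts = dict()
for line in fhand:           # outer loop: one line of the file at a time
    words = line.split()
    for word in words:       # inner loop: one word of that line at a time
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
print(counts)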

In our else statement, we use the more compact alternative for incrementing a variable.
counts[word] += 1 is equivalent to counts[word] = counts[word] + 1. Either method can
be used to change the value of a variable by any desired amount. Similar alternatives exist for
-=, *=, and /=.
When we run the program, we see a raw dump of all of the counts in unsorted order. (The listing
below shows the arbitrary hash order of older Python versions; Python 3.7 and later list the words
in the order in which they first appear in the file.)

python count1.py
Enter the file name: romeo.txt
{'and': 3, 'envious': 1, 'already': 1, 'fair': 1,
'is': 3, 'through': 1, 'pale': 1, 'yonder': 1,
'what': 1, 'sun': 2, 'Who': 1, 'But': 1, 'moon': 1,
'window': 1, 'sick': 1, 'east': 1, 'breaks': 1,
'grief': 1, 'with': 1, 'light': 1, 'It': 1, 'Arise': 1,
'kill': 1, 'the': 3, 'soft': 1, 'Juliet': 1}

It is a bit inconvenient to look through the dictionary to find the most common words and their
counts, so we need to add some more Python code to get us the output that will be more helpful.
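
One possible way to get more helpful output is to sort the words by their counts. The sketch below is an assumption about what such extra code could look like, not the program used in the training: it keeps the four lines of text in a string instead of reading romeo.txt, and it uses the built-in sorted function with a key.

```python
text = """But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief"""

# Count the words with the get idiom.
counts = dict()
for line in text.splitlines():
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1

# Sort (word, count) pairs by count, largest first.
by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Print the four most common words and their counts.
for word, count in by_count[:4]:
    print(word, count)
```

The words "is", "the", and "and" each appear three times, followed by "sun" with two occurrences.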

4.4 Looping and dictionaries


If you use a dictionary as the sequence in a for statement, it traverses the keys of the dictionary.
This loop prints each key and the corresponding value:

counts = {'chuck': 1, 'annie': 42, 'jan': 100}
for key in counts:
    print(key, counts[key])

Here’s what the output looks like:

chuck 1
annie 42
jan 100

In Python 3.7 and later the keys appear in insertion order; in older versions they were in no
particular order.


We can use this pattern to implement the various loop idioms that we have described earlier. For
example if we wanted to find all the entries in a dictionary with a value above ten, we could
write the following code:

counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}


for key in counts:
if counts[key] > 10 :
print(key, counts[key])

The for loop iterates through the keys of the dictionary, so we must use the index operator to
retrieve the corresponding value for each key. Here’s what the output looks like:

jan 100
annie 42

We see only the entries with a value above 10.


If you want to print the keys in alphabetical order, you first make a list of the keys in the
dictionary using the keys method available in dictionary objects, and then sort that list and loop
through the sorted list, looking up each key and printing out key-value pairs in sorted order as
follows:

counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}
lst = list(counts.keys())
print(lst)
lst.sort()
for key in lst:
    print(key, counts[key])

Here’s what the output looks like:

['jan', 'chuck', 'annie']
annie 42
chuck 1
jan 100

First you see the list of keys in unsorted order that we get from the keys method. Then we see the
key-value pairs in order from the for loop.

Chapter 5
TUPLES AND SETS
5.1 Tuples

5.1.1 Tuples are immutable


A tuple is a sequence of values much like a list. The values stored in a tuple can be any type,
and they are indexed by integers. The important difference is that tuples are immutable. Tuples
are also comparable and hashable so we can sort lists of them and use tuples as key values in
Python dictionaries. Syntactically, a tuple is a comma-separated list of values:

>>> t = 'a', 'b', 'c', 'd', 'e'

Although it is not necessary, it is common to enclose tuples in parentheses to help us quickly
identify tuples when we look at Python code:

>>> t = ('a', 'b', 'c', 'd', 'e')

To create a tuple with a single element, you have to include the final comma:

>>> t1 = ('a',)
>>> type(t1)
<class 'tuple'>

Without the comma Python treats ('a') as an expression with a string in parentheses that evaluates
to a string:

>>> t2 = ('a')
>>> type(t2)
<class 'str'>

Another way to construct a tuple is the built-in function tuple. With no argument, it creates an
empty tuple:

>>> t = tuple()
>>> print(t)
()

If the argument is a sequence (string, list, or tuple), the result of the call to tuple is a tuple with
the elements of the sequence:

>>> t = tuple('lupins')
>>> print(t)
('l', 'u', 'p', 'i', 'n', 's')

Because tuple is the name of a constructor, you should avoid using it as a variable name.
Most list operators also work on tuples. The bracket operator indexes an element:

>>> t = ('a', 'b', 'c', 'd', 'e')
>>> print(t[0])
a

And the slice operator selects a range of elements.

>>> print(t[1:3])
('b', 'c')

But if you try to modify one of the elements of the tuple, you get an error:

>>> t[0] = 'A'
TypeError: 'tuple' object does not support item assignment

You can’t modify the elements of a tuple, but you can replace one tuple with another:

>>> t = ('A',) + t[1:]
>>> print(t)
('A', 'b', 'c', 'd', 'e')

5.1.2 Comparing tuples


The comparison operators work with tuples and other sequences. Python starts by comparing the
first element from each sequence. If they are equal, it goes on to the next element, and so on,
until it finds elements that differ. Subsequent elements are not considered (even if they are really
big).

>>> (0, 1, 2) < (0, 3, 4)
True
>>> (0, 1, 2000000) < (0, 3, 4)
True

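The sort function works the same way: it sorts primarily by the first element and, in the case of
a tie, by the second element, and so on. This feature lends itself to a pattern called DSU:
Decorate a sequence by building a list of tuples with one or more sort keys preceding the
elements from the sequence,
Sort the list of tuples using the Python built-in sort, and
Undecorate by extracting the sorted elements of the sequence.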
For example, suppose you have a list of words and you want to sort them from longest to
shortest:

txt = 'but soft what light in yonder window breaks'
words = txt.split()
t = list()
for word in words:
    t.append((len(word), word))

t.sort(reverse=True)

res = list()
for length, word in t:
    res.append(word)

print(res)

The first loop builds a list of tuples, where each tuple is a word preceded by its length.
sort compares the first element, length, first, and only considers the second element to break ties.
The keyword argument reverse=True tells sort to go in decreasing order.
The second loop traverses the list of tuples and builds a list of words in descending order of
length. The four-character words are sorted in reverse alphabetical order, so “what” appears
before “soft” in the following list. The output of the program is as follows:

['yonder', 'window', 'breaks', 'light', 'what',
'soft', 'but', 'in']
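
The same ordering can also be obtained without building the intermediate list by hand. As a
sketch (assuming the built-in sorted function with a key argument is acceptable in place of the
explicit decorate and undecorate loops), the key function returns the same (length, word) tuple
that the DSU version stores:

```python
txt = 'but soft what light in yonder window breaks'
words = txt.split()

# The key function "decorates" each word on the fly with (length, word);
# sorted() compares these tuples and returns only the original words.
res = sorted(words, key=lambda word: (len(word), word), reverse=True)
print(res)
```

Using only key=len would also sort by length, but words of equal length would then keep their
original relative order instead of being sorted in reverse alphabetical order.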

5.1.3 Dictionaries and tuples


Dictionaries have a method called items that returns the key-value pairs as tuples; passing the
result to list() gives a list of tuples:

>>> d = {'a':10, 'b':1, 'c':22}
>>> t = list(d.items())
>>> print(t)
[('b', 1), ('a', 10), ('c', 22)]

As you should expect from a dictionary, the items are in no particular order.

However, since the list of tuples is a list, and tuples are comparable, we can now sort the list of
tuples. Converting a dictionary to a list of tuples is a way for us to output the contents of a
dictionary sorted by key:

>>> d = {'a':10, 'b':1, 'c':22}
>>> t = list(d.items())
>>> t
[('b', 1), ('a', 10), ('c', 22)]
>>> t.sort()
>>> t
[('a', 10), ('b', 1), ('c', 22)]

The new list is sorted in ascending alphabetical order by the key value.

5.1.4 Multiple assignment with dictionaries


Combining items, tuple assignment, and for, you can see a nice code pattern for traversing the
keys and values of a dictionary in a single loop:

for key, val in list(d.items()):
    print(val, key)

This loop has two iteration variables because items returns a sequence of tuples and key, val is a
tuple assignment that successively iterates through each of the key-value pairs in the dictionary.
For each iteration through the loop, both key and val are advanced to the next key-value pair in
the dictionary (still in hash order).
The output of this loop is:

10 a
22 c
1 b

Again, it is in hash key order (i.e., no particular order).


If we combine these two techniques, we can print out the contents of a dictionary sorted by
the value stored in each key-value pair.
To do this, we first make a list of tuples where each tuple is (value, key). The items method
would give us a list of (key, value) tuples, but this time we want to sort by value, not key. Once
we have constructed the list with the value-key tuples, it is a simple matter to sort the list in
reverse order and print out the new, sorted list.

>>> d = {'a':10, 'b':1, 'c':22}
>>> l = list()
>>> for key, val in d.items() :
...     l.append( (val, key) )
...
>>> l
[(10, 'a'), (22, 'c'), (1, 'b')]
>>> l.sort(reverse=True)
>>> l
[(22, 'c'), (10, 'a'), (1, 'b')]
>>>

5.1.5 Using tuples as keys in dictionaries


Because tuples are hashable and lists are not, if we want to create a composite key to use in a
dictionary we must use a tuple as the key.
We would encounter a composite key if we wanted to create a telephone directory that maps
from last-name, first-name pairs to telephone numbers. Assuming that we have defined the
variables last, first, and number, we could write a dictionary assignment statement as follows:

directory[last,first] = number

The expression in brackets is a tuple. We could use tuple assignment in a for loop to traverse this
dictionary.

for last, first in directory:
    print(first, last, directory[last,first])

This loop traverses the keys in directory, which are tuples. It assigns the elements of each tuple
to last and first, then prints the name and corresponding telephone number.
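
A small self-contained version of this idea is sketched below. The names and telephone numbers
are hypothetical sample data, and the attempt to use a list as a key is included only to show
why a tuple is required.

```python
# Hypothetical sample data: (last, first) tuples map to phone numbers
directory = dict()
directory['Lovelace', 'Ada'] = '555-0101'
directory['Turing', 'Alan'] = '555-0102'

# A list cannot be used as a key because lists are not hashable
try:
    directory[['Hopper', 'Grace']] = '555-0103'
except TypeError as err:
    print('Lists cannot be keys:', err)

# Traverse the tuple keys in sorted order using tuple assignment
for last, first in sorted(directory):
    print(first, last, directory[last, first])
```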

5.2 Sets
In Python, a set is an unordered collection that is iterable, mutable and contains no duplicate
elements. A set may hold elements of different types, but the order of its elements is undefined.
The major advantage of using a set, as opposed to a list, is that it has a highly optimized method for
checking whether a specific element is contained in the set.

5.2.1 Creating a Set

Sets can be created by passing an iterable object or a sequence to the built-in set() function, or
by placing the elements inside curly braces, separated by commas.
Note – A set cannot contain mutable elements such as a list, a set or a dictionary.

# Python program to demonstrate
# Creation of Set in Python

# Creating a Set
set1 = set()
print("Initial blank Set: ")
print(set1)

# Creating a Set with
# the use of a String
set1 = set("PythonProgramming")
print("\nSet with the use of String: ")
print(set1)

# Creating a Set with
# the use of Constructor
# (Using object to Store String)
String = 'PythonProgramming'
set1 = set(String)
print("\nSet with the use of an Object: ")
print(set1)

# Creating a Set with
# the use of a List
set1 = set(["Coding", "Is", "Fun"])
print("\nSet with the use of List: ")
print(set1)

Output (the order of the elements within each set may vary between runs):
Initial blank Set:
set()

Set with the use of String:
{'P', 'y', 't', 'h', 'o', 'n', 'r', 'g', 'a', 'm', 'i'}

Set with the use of an Object:
{'P', 'y', 't', 'h', 'o', 'n', 'r', 'g', 'a', 'm', 'i'}

Set with the use of List:
{'Coding', 'Is', 'Fun'}
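
The properties mentioned at the start of this section (no duplicates, fast membership tests, no
mutable elements) can be checked with a short sketch. The sample values are made up for
illustration.

```python
# Duplicates are dropped when a list is converted to a set
numbers = [3, 1, 3, 2, 1]
unique = set(numbers)
print(sorted(unique))

# Membership tests use the "in" operator
print(2 in unique)
print(7 in unique)

# Mutable elements such as lists are rejected
try:
    bad = {[1, 2], [3, 4]}
except TypeError as err:
    print('TypeError:', err)

# A tuple is immutable and hashable, so it can be a set element
pairs = {(1, 2), (3, 4), (1, 2)}
print(len(pairs))
```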

Chapter 6
6. Hands on Projects
Now that we have covered the Python data structures, this section discusses two interesting
hands-on projects, at an introductory and an intermediate level respectively.

6.1 Chat-bot Python Project:


A chatbot is an intelligent piece of software that is capable of communicating and
performing actions similar to a human. Chatbots are widely used in customer interaction, marketing
on social network sites and instant messaging with clients. There are two basic types of chatbot
models based on how they are built: retrieval-based and generative models. In this Python project
with source code, we are going to build a chatbot using deep learning techniques. The chatbot will
be trained on a dataset which contains categories (intents), patterns and responses. We use a
special recurrent neural network (LSTM) to classify which category the user’s message belongs to,
and then we give a random response from the list of responses.

6.1.1 Working
Here are the steps to create a chatbot in Python from scratch:

1. Import and load the data file-


First, create a file named train_chatbot.py. We import the necessary packages for our
chatbot and initialize the variables we will use in our Python project.

The data file is in JSON format, so we use the json package to parse the JSON file into Python.

This is what our intents.json file looks like-
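
The original listing of intents.json is not reproduced here. As a rough sketch only, a file of
this kind could be structured as shown below; the field names intents, tag, patterns and
responses and the sample sentences are assumptions for illustration, not the project's actual
data. The json package parses such text into ordinary Python dictionaries and lists.

```python
import json

# Hypothetical content; the real intents.json may use other fields
raw = '''
{
  "intents": [
    {"tag": "greeting",
     "patterns": ["Hi there", "Hello"],
     "responses": ["Hello!", "Good to see you"]},
    {"tag": "goodbye",
     "patterns": ["Bye", "See you later"],
     "responses": ["Goodbye!"]}
  ]
}
'''

intents = json.loads(raw)
for intent in intents['intents']:
    print(intent['tag'], len(intent['patterns']), 'patterns')
```

For a file on disk, json.load would be used on the opened file object instead of json.loads on a
string.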

2. Preprocess data
When working with text data, we need to perform various preprocessing on the data
before we make a machine learning or a deep learning model. Tokenizing is the most basic and
first thing you can do on text data. Tokenizing is the process of breaking the whole text into
small parts like words.

Here we iterate through the patterns, tokenize each sentence using the nltk.word_tokenize()
function and append each word to the words list. We also create a list of classes for our tags.
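
The loop described above can be sketched as follows. To keep the example free of external
downloads, a simple regular expression stands in for nltk.word_tokenize() (an assumption; the
two do not split text identically), and the intents data is a hypothetical sample in the
assumed structure.

```python
import re

# Hypothetical sample in the assumed intents structure
intents = {'intents': [
    {'tag': 'greeting', 'patterns': ['Hi there', 'Hello, how are you?']},
    {'tag': 'goodbye', 'patterns': ['Bye', 'See you later']},
]}


def tokenize(sentence):
    # Stand-in for nltk.word_tokenize(): keeps runs of word characters only
    return re.findall(r"\w+", sentence)


words = []      # every token from every pattern
classes = []    # one entry per distinct tag
documents = []  # (tokens, tag) pairs used later for training

for intent in intents['intents']:
    for pattern in intent['patterns']:
        tokens = tokenize(pattern)
        words.extend(tokens)
        documents.append((tokens, intent['tag']))
    if intent['tag'] not in classes:
        classes.append(intent['tag'])

print(len(documents), 'documents')
print(classes)
```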

Now we will lemmatize each word and remove duplicate words from the list. Lemmatizing is the
process of converting a word into its lemma (base) form. We then create pickle files to store the
Python objects which we will use while predicting.
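
A minimal sketch of this step is shown below. Lowercasing is used as a crude placeholder for
real lemmatization (an assumption made so that the example needs no external packages; a
lemmatizer would also map, for example, plural forms to the singular), and the list is pickled
to bytes in memory rather than to a file.

```python
import pickle

words = ['Hi', 'there', 'Hello', 'how', 'are', 'you', 'hello', 'You']
ignore = {'?', '!'}  # hypothetical punctuation tokens to drop

# Placeholder normalisation, then remove duplicates and sort
words = sorted(set(w.lower() for w in words if w not in ignore))
print(words)

# Serialise the list and restore it, as is done before prediction
blob = pickle.dumps(words)
restored = pickle.loads(blob)
print(restored == words)
```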

3. Create training and testing data-


Now, we will create the training data in which we will provide the input and the output.
Our input will be the pattern and output will be the class our input pattern belongs to. But the
computer doesn’t understand text so we will convert text into numbers.
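
One common way to turn text into numbers, assumed here because the original listing is not
reproduced, is a bag-of-words encoding: each pattern becomes a list of 0s and 1s marking which
vocabulary words it contains, and each class becomes a one-hot list. The vocabulary and helper
names below are hypothetical.

```python
words = ['are', 'bye', 'hello', 'hi', 'how', 'later', 'see', 'there', 'you']
classes = ['goodbye', 'greeting']


def bag_of_words(tokens, vocabulary):
    # 1 if the vocabulary word occurs in the pattern, else 0
    tokens = {t.lower() for t in tokens}
    return [1 if w in tokens else 0 for w in vocabulary]


def one_hot(tag, labels):
    # 1 at the position of the pattern's class, 0 elsewhere
    return [1 if label == tag else 0 for label in labels]


x = bag_of_words(['Hello', 'how', 'are', 'you'], words)
y = one_hot('greeting', classes)
print(x)
print(y)
```

Every input vector has the same length as the vocabulary, which is what allows patterns of
different lengths to be fed to the same network.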

4. Build the model-
Now that our training data is ready, we will build a deep neural network that has 3
layers. We use the Keras sequential API for this. After training the model for 200 epochs, we
achieved 100% accuracy on our model. Let us save the model as ‘chatbot_model.h5’.

5. Predict the response (Graphical User Interface)-

Now, to predict the class of a sentence and return a response to the user, let us create a new file
‘chatapp.py’.

We will load the trained model and then use a graphical user interface that will predict the
response from the bot. The model will only tell us the class the message belongs to, so we will
implement some functions which will identify the class and then retrieve a random response from
the list of responses. Again we import the necessary packages and load the ‘words.pkl’ and
‘classes.pkl’ pickle files which we created when we trained our model:

To predict the class, we will need to provide input in the same way as we did while training. So
we will create some functions that will perform text preprocessing and then predict the class.
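
The selection logic can be sketched independently of the trained model. In the example below
the probabilities are made up to represent what a classifier might return for one message; the
threshold value, the fallback sentence and the function names are assumptions for illustration.

```python
import random

classes = ['goodbye', 'greeting']
responses = {  # hypothetical responses per class
    'goodbye': ['Goodbye!', 'See you soon'],
    'greeting': ['Hello!', 'Good to see you'],
}


def pick_class(probabilities, labels, threshold=0.25):
    # Return the most likely class, or None if its probability
    # does not exceed the threshold
    ranked = sorted(zip(probabilities, labels), reverse=True)
    best_prob, best_label = ranked[0]
    return best_label if best_prob > threshold else None


def get_response(label):
    # Choose a random response for the class, with a fallback message
    if label is None:
        return 'Sorry, I did not understand that.'
    return random.choice(responses[label])


# Probabilities as a trained model might return them (made up here)
label = pick_class([0.1, 0.9], classes)
print(label, '->', get_response(label))
```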

Now we will code a graphical user interface. For this, we use the Tkinter library, which ships
with Python. We will take the input message from the user and then use the helper functions
we have created to get the response from the bot and display it on the GUI. Here is the full
source code for the GUI.

6. Run the chatbot
To run the chatbot, we have two main files; train_chatbot.py and chatapp.py.

First, we train the model using the command in the terminal:

python train_chatbot.py

If we don’t see any error during training, we have successfully created the model. Then to run
the app, we run the second file.

python chatapp.py
The program will open up a GUI window within a few seconds. With the GUI you can easily
chat with the bot.

Fig. 6.1 Demo of Chat Bot

6.1.2 Summary

In this Python data science project, we learned about chatbots and implemented a deep
learning version of a chatbot in Python. You can customize the data according to business
requirements and retrain the chatbot. Chatbots are widely used, and many businesses are looking
to implement bots in their workflow.

Part B
(Python for Data Science)

Chapter 1
Introduction
1.1 Data Science-
Data science is an inter-disciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from structured and unstructured
data. Data science is related to data mining, machine learning and big data.

Data science is a "concept to unify statistics, data analysis and their related methods" in
order to "understand and analyze actual phenomena" with data. It uses techniques and theories
drawn from many fields within the context of mathematics, statistics, computer science, domain
knowledge and information science. Turing award winner Jim Gray imagined data science as a
"fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and
asserted that "everything about science is changing because of the impact of information
technology" and the data deluge.

1.2 Foundation-
Data science is an interdisciplinary field focused on extracting knowledge from data sets,
which are typically large (see big data). The field encompasses analysis, preparing data for
analysis, and presenting findings to inform high-level decisions in an organization. As such, it
incorporates skills from computer science, mathematics, statistics, information visualization,
graphic design, complex systems, communication and business. Statistician Nathan Yau,
drawing on Ben Fry, also links data science to human-computer interaction: users should be able
to intuitively control and explore data. In 2015, the American Statistical Association identified
database management, statistics and machine learning, and distributed and parallel systems as the
three emerging foundational professional communities.

1.2.1 Relationship to statistics

Many statisticians, including Nate Silver, have argued that data science is not a
new field, but rather another name for statistics. Others argue that data science is distinct
from statistics because it focuses on problems and techniques unique to digital data.
Vasant Dhar writes that statistics emphasizes quantitative data and description. In
contrast, data science deals with quantitative and qualitative data (e.g. images) and
emphasizes prediction and action. Andrew Gelman of Columbia University and data
scientist Vincent Granville have described statistics as a nonessential part of data science.
Stanford professor David Donoho writes that data science is not distinguished from
statistics by the size of datasets or use of computing, and that many graduate programs
misleadingly advertise their analytics and statistics training as the essence of a data
science program. He describes data science as an applied field growing out of traditional
statistics. In summary, data science can therefore be described as an applied branch of
statistics.
1.3 Impact of Data Science -
Big data is very quickly becoming a vital tool for businesses and companies of all sizes.
[29] The availability and interpretation of big data has altered the business models of old
industries and enabled the creation of new ones.[29] Data-driven businesses are worth $1.2
trillion collectively in 2020, an increase from $333 billion in the year 2015.[30] Data scientists
are responsible for breaking down big data into usable information and creating software and
algorithms that help companies and organizations determine optimal operations.[30] As big data
continues to have a major impact on the world, data science does as well due to the close
relationship between the two.

Chapter 2
Technologies and Methods

2.1 Technologies and Techniques


There are a variety of different technologies and techniques that are used for data science
which depend on the application. More recently, full-featured, end-to-end platforms have been
developed and heavily used for data science and machine learning.
2.1.1 Techniques
• Linear Regression
• Logistic Regression
• Decision trees are used as prediction models for classification and data fitting. The
decision tree structure can be used to generate rules able to classify or predict a
target/class/label variable based on the observation attributes.
• Support Vector Machine (SVM)
• Clustering is a technique used to group data together.
• Dimensionality reduction is used to reduce the complexity of data computation so
that it can be performed more quickly.
• Machine learning is a technique used to perform tasks by inferring patterns from
data.
2.1.2 Languages
• Python is a programming language with simple syntax that is commonly used for
data science. There are a number of Python libraries that are used in data science
including NumPy, pandas, Matplotlib and SciPy.
• R is a programming language that was designed for statisticians and data mining
and is optimized for computation.
• Julia is a high-level, high-performance, dynamic programming language well-suited
for numerical analysis and computational science.

2.1.3 Frameworks
• TensorFlow is a framework for creating machine learning models developed by
Google.
• PyTorch is another framework for machine learning developed by Facebook.
• Jupyter Notebook is an interactive web interface for Python that allows faster
experimentation.
• Apache Hadoop is a software framework that is used to process data over large
distributed systems.

2.1.4 Visualization Tool

• Plotly provides a rich set of interactive scientific graphing libraries.
• Tableau makes a variety of software that is used for data visualization.[33]
• PowerBI is a business analytics service by Microsoft.
• Qlik produces software such as QlikView and Qlik Sense used for data
visualization and business intelligence.
• AnyChart provides JavaScript libraries and other tools for data visualization in
charts and dashboards.
• Google Charts is a JavaScript-based web service made and supported by Google for
creating graphical charts.
• Sisense provides a front-end for building data visualizations including dashboards
and reports.
• Webix is a UI toolkit that includes dedicated tools for information visualization.

Chapter 3
Linear Regression
3.1 Definition
In statistics, linear regression is a linear approach to modelling the relationship between a
scalar response and one or more explanatory variables (also known as dependent and
independent variables). The case of one explanatory variable is called simple linear regression;
for more than one, the process is called multiple linear regression. This term is distinct from
multivariate linear regression, where multiple correlated dependent variables are predicted, rather
than a single scalar variable.
3.1.1 Use
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:
• If the goal is prediction, forecasting, or error reduction,
linear regression can be used to fit a predictive model to an observed data set of
values of the response and explanatory variables. After developing such a
model, if additional values of the explanatory variables are collected without an
accompanying response value, the fitted model can be used to make a
prediction of the response.
• If the goal is to explain variation in the response variable that can be attributed
to variation in the explanatory variables, linear regression analysis can be
applied to quantify the strength of the relationship between the response and the
explanatory variables, and in particular to determine whether some explanatory
variables may have no linear relationship with the response at all, or to identify
which subsets of explanatory variables may contain redundant information
about the response.

3.2 Mathematical Approach
Simple and multiple linear regression

Fig. 3.1 Example of simple linear regression, which has one independent variable

The very simplest case of a single scalar predictor variable x and a single scalar response
variable y is known as simple linear regression. The extension to multiple and/or
vector-valued predictor variables (denoted with a capital X) is known as multiple linear
regression, also known as multivariable linear regression.

Multiple linear regression is a generalization of simple linear regression to the case of more
than one independent variable, and a special case of general linear models, restricted to one
dependent variable. The basic model for multiple linear regression is

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + \epsilon_i$$

for each observation i = 1, ... , n.

In the formula above we consider n observations of one dependent variable
and p independent variables. Thus, Yi is the ith observation of the dependent
variable, Xij is the ith observation of the jth independent variable, j = 1, 2, ..., p. The
values βj represent parameters to be estimated, and εi is the ith independent identically
distributed normal error.
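
For the special case p = 1 (simple linear regression), the parameters that minimise the sum of
squared errors have a well-known closed form: the slope is the covariance of x and y divided by
the variance of x, and the intercept follows from the two means. A small sketch with made-up
data (the function name fit_line is an assumption):

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up observations lying exactly on y = 1 + 2x
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

Because the sample points lie exactly on a line, the error terms are all zero here and the
estimates recover the intercept and slope used to generate the data.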

In the more general multivariate linear regression, there is one equation of the above form
for each of m > 1 dependent variables that share the same set of explanatory variables and
hence are estimated simultaneously with each other:

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{i1} + \beta_{2j} X_{i2} + \ldots + \beta_{pj} X_{ip} + \epsilon_{ij}$$

for all observations indexed as i = 1, ... , n and for all dependent variables indexed as j =
1, ... , m.

Nearly all real-world regression models involve multiple predictors, and basic descriptions
of linear regression are often phrased in terms of the multiple regression model. Note,
however, that in these cases the response variable y is still a scalar. Another
term, multivariate linear regression, refers to cases where y is a vector, i.e., the same
as general linear regression.

3.3 Other estimation techniques

• Bayesian linear regression applies the framework of Bayesian statistics to linear regression.
(See also Bayesian multivariate linear regression.) In particular, the regression coefficients β
are assumed to be random variables with a specified prior distribution. The prior distribution
can bias the solutions for the regression coefficients, in a way similar to (but more general
than) ridge regression or lasso regression. In addition, the Bayesian estimation process
produces not a single point estimate for the "best" values of the regression coefficients but an
entire posterior distribution, completely describing the uncertainty surrounding the quantity.
This can be used to estimate the "best" coefficients using the mean, mode, median, any
quantile (see quantile regression), or any other function of the posterior distribution.
• Quantile regression focuses on the conditional quantiles of y given X rather than the
conditional mean of y given X. Linear quantile regression models a particular conditional
quantile, for example the conditional median, as a linear function βᵀx of the predictors.
• Mixed models are widely used to analyze linear regression relationships involving dependent
data when the dependencies have a known structure. Common applications of mixed models
include analysis of data involving repeated measurements, such as longitudinal data, or data
obtained from cluster sampling. They are generally fit as parametric models, using maximum
likelihood or Bayesian estimation. In the case where the errors are modeled as normal
random variables, there is a close connection between mixed models and generalized least
squares.[18] Fixed effects estimation is an alternative approach to analyzing this type of data.
• Principal component regression (PCR)[7][8] is used when the number of predictor variables
is large, or when strong correlations exist among the predictor variables. This two-stage
procedure first reduces the predictor variables using principal component analysis, then uses
the reduced variables in an OLS regression fit. While it often works well in practice, there is
no general theoretical reason that the most informative linear function of the predictor
variables should lie among the dominant principal components of the multivariate
distribution of the predictor variables. Partial least squares regression is an extension of
the PCR method which does not suffer from the mentioned deficiency.
• Least-angle regression[6] is an estimation procedure for linear regression models that was
developed to handle high-dimensional covariate vectors, potentially with more covariates
than observations.
• The Theil–Sen estimator is a simple robust estimation technique that chooses the slope of the
fit line to be the median of the slopes of the lines through pairs of sample points. It has
similar statistical efficiency properties to simple linear regression but is much less sensitive to
outliers.[19]
• Other robust estimation techniques, including the α-trimmed mean approach
and L-, M-, S-, and R-estimators, have been introduced.
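
Of the techniques listed above, the Theil–Sen estimator is simple enough to sketch in a few
lines. The example below uses made-up data with one deliberate outlier; taking the intercept as
the median of y - slope * x is one common choice and is an assumption here, since the text only
defines the slope.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Return (intercept, slope) using the median of pairwise slopes."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    slope = median(slopes)
    # One common choice of intercept: median of y - slope * x
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope


# Points on y = 1 + 2x, with the last observation replaced by an outlier
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 3, 5, 7, 9, 11, 100]
intercept, slope = theil_sen(xs, ys)
print(intercept, slope)
```

Most of the pairwise slopes come from pairs that do not involve the outlier, so the median is
unaffected by it, whereas an ordinary least-squares fit would be pulled towards the outlying
point.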

Chapter 4

Advantages and Uses

4.1 Application

Linear regression is widely used in biological, behavioral and social sciences to
describe possible relationships between variables. It ranks as one of the most important tools
used in these disciplines.

4.1.1 Finance

The capital asset pricing model uses linear regression as well as the concept of
beta for analyzing and quantifying the systematic risk of an investment. This comes directly
from the beta coefficient of the linear regression model that relates the return on the investment
to the return on all risky assets.
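
In regression terms, beta is the slope obtained when the investment's returns are regressed on
the market's returns, that is, the covariance of the two return series divided by the variance
of the market returns. The sketch below uses made-up return figures and a hypothetical function
name.

```python
from statistics import mean


def beta(asset_returns, market_returns):
    """Slope of the regression of asset returns on market returns."""
    mean_a = mean(asset_returns)
    mean_m = mean(market_returns)
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    return cov / var


# Made-up periodic returns (in percent); the asset moves 1.5x the market
market = [1.0, -2.0, 3.0, 0.0, 2.0]
asset = [1.5, -3.0, 4.5, 0.0, 3.0]
print(beta(asset, market))
```

A beta above 1 indicates that the investment tends to move more than the market, and a beta
below 1 that it tends to move less.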

4.1.2 Epidemiology

Early evidence relating tobacco smoking to mortality and morbidity came from
observational studies employing regression analysis. In order to reduce spurious
correlations when analyzing observational data, researchers usually include several
variables in their regression models in addition to the variable of primary interest. For
example, in a regression model in which cigarette smoking is the independent variable of
primary interest and the dependent variable is lifespan measured in years, researchers
might include education and income as additional independent variables, to ensure that
any observed effect of smoking on lifespan is not due to those other socio-economic
factors. However, it is never possible to include all possible confounding variables in an
empirical analysis. For example, a hypothetical gene might increase mortality and also
cause people to smoke more. For this reason, randomized controlled trials are often able
to generate more compelling evidence of causal relationships than can be obtained using
regression analyses of observational data. When controlled experiments are not feasible,
variants of regression analysis such as instrumental variables regression may be used to
attempt to estimate causal relationships from observational data.

4.1.3 Machine learning

Linear regression plays an important role in machine learning, a subfield of artificial
intelligence. The linear regression algorithm is one of the fundamental
supervised machine-learning algorithms due to its relative simplicity and well-known
properties.

4.1.4 Economics

Linear regression is the predominant empirical tool in economics. For example, it
is used to predict consumption spending, fixed investment spending, inventory
investment, purchases of a country's exports, spending on imports, the demand to hold
liquid assets, labor demand, and labor supply.

4.1.5 Health care

In the health-care industry, data science is making great leaps. The areas of
health care making use of data science are:

i. Medical Image Analysis

In medical image analysis, data science has created a strong sphere of
influence for analyzing medical images such as X-rays, MRIs, CT scans, etc.
Previously, doctors and medical examiners would have to manually search for
clues in the medical images. However, with the advancements in computing
technologies and surge in data, it is possible to create machines that can
automatically detect flaws in the imagery. Data Scientists have created powerful
image recognition tools that allow doctors to have an in-depth understanding of
complex medical imagery.

ii. Genomic Data Science

Genomic Data Science applies statistical techniques to genomic sequences,
allowing bioinformaticians and geneticists to understand defects in genetic
structures. It is also helpful in classifying diseases that are genetic in nature. With data
science, we can analyze how genes react to varying kinds of medicines. Also, several big
data technologies like MapReduce have significantly reduced the processing time for
genome sequencing.

iii. Drug Discovery

Another important field making use of data science is drug discovery. In drug
discovery, new candidate medicines are formulated. Drug Discovery is a tedious and
often complex process. Data Science can help us to simplify this process and provide us
with an early insight into the success rate of the newly discovered drug. With Machine
Learning, we can also analyze several combinations of drugs and their effect on different
gene structures to predict the outcome.

iv. Predictive Modeling for Diagnosis

With the advancements in predictive modeling, data scientists can help to predict
the outcome of disease given the historical data of the patients. Data Science has enabled
practitioners to analyze the data, make correlations between the variables of the data and
also provide insights to doctors and medical practitioners.

Chapter 5
5.1 Hands on project (Stock Price Prediction)
Predicting how the stock market will perform is one of the most difficult things to do.
There are so many factors involved in the prediction: physical and psychological factors,
rational and irrational behaviour, etc. All these aspects combine to make share prices volatile and
very difficult to predict with a high degree of accuracy. We will implement a mix of machine
learning algorithms to predict the future stock price of a company, starting with simple
algorithms like averaging and linear regression, and then move on to advanced techniques like
LSTM.
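
As an illustration of the simplest of these approaches, the sketch below predicts the next
closing price as the average of the most recent prices. The prices, the window length and the
function name are assumptions for illustration and are not taken from the project's dataset.

```python
def moving_average_forecast(prices, window):
    """Predict the next value as the mean of the last `window` prices."""
    if window <= 0 or window > len(prices):
        raise ValueError('window must be between 1 and len(prices)')
    recent = prices[-window:]
    return sum(recent) / window


# Made-up closing prices, not taken from the project's dataset
closes = [100.0, 102.0, 101.0, 105.0, 107.0]
print(moving_average_forecast(closes, window=3))
```

Such a baseline is useful mainly as a reference point against which more advanced models can be
compared.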
5.1.1 Fundamental Analysis and Technical Analysis
• Fundamental Analysis involves analyzing the company’s future profitability on the
basis of its current business environment and financial performance.
• Technical Analysis, on the other hand, includes reading the charts and using
statistical figures to identify the trends in the stock market.
We will first load the dataset and define the target variable for the problem:

The profit or loss calculation is usually determined by the closing price of a stock for the
day, hence we will consider the closing price as the target variable. Let’s plot the target variable
to understand how it’s shaping up in our data:

import pandas as pd
import matplotlib.pyplot as plt

#setting index as date (df is the DataFrame loaded above)
df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
df.index = df['Date']

#plot
plt.figure(figsize=(16,8))
plt.plot(df['Close'], label='Close Price history')

5.1.2 Long Short-Term Memory (LSTM)


LSTMs are widely used for sequence prediction problems and have proven to be very
effective. They work well because an LSTM can retain past information that is
important and forget information that is not. An LSTM has three gates:

• The input gate: adds new information to the cell state
• The forget gate: removes information that is no longer required by the model
• The output gate: selects the information to be shown as output
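The role of the three gates can be sketched with a single LSTM time step written in NumPy. This follows the standard textbook formulation of an LSTM cell and is meant only as an illustration; it is not the code used by Keras internally. The names lstm_step, W, U and b, the hidden size of 4 and the random weights are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b hold the parameters of blocks i, f, o, g."""
    z = {k: W[k] @ x_t + U[k] @ h_prev + b[k] for k in ("i", "f", "o", "g")}
    i = sigmoid(z["i"])   # input gate: how much new information to add
    f = sigmoid(z["f"])   # forget gate: how much of the old cell state to keep
    o = sigmoid(z["o"])   # output gate: how much of the cell state to show
    g = np.tanh(z["g"])   # candidate values for the cell state
    c_t = f * c_prev + i * g      # updated cell state
    h_t = o * np.tanh(c_t)        # output (hidden state)
    return h_t, c_t

# random illustrative weights: 1 input feature, 4 hidden units
rng = np.random.default_rng(0)
n_in, n_hidden = 1, 4
W = {k: rng.normal(size=(n_hidden, n_in)) for k in "ifog"}
U = {k: rng.normal(size=(n_hidden, n_hidden)) for k in "ifog"}
b = {k: np.zeros(n_hidden) for k in "ifog"}

# run the cell over a short made-up sequence
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(np.array([x]), h, c, W, U, b)

print(h.shape, c.shape)
```

Since each gate is a sigmoid with values between 0 and 1, it acts as a soft switch: a forget gate value close to 0 discards the old cell state, while a value close to 1 keeps it.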

Implementation

#importing required libraries
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

#creating dataframe
data = df.sort_index(ascending=True, axis=0)

#keeping only the closing price; the index is already the date,
#so a row-by-row copy into a new dataframe is not needed
new_data = data[['Close']].copy()

#creating train and test sets
dataset = new_data.values
train = dataset[0:987,:]
valid = dataset[987:,:]

#converting dataset into x_train and y_train
#note: the scaler is fit on the full dataset, so validation values influence the scaling
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)

#each sample is the previous 60 scaled closing prices; the target is the next one
x_train, y_train = [], []
for i in range(60,len(train)):
    x_train.append(scaled_data[i-60:i,0])
    y_train.append(scaled_data[i,0])
x_train, y_train = np.array(x_train), np.array(y_train)

#LSTM expects input of shape (samples, time steps, features)
x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))

# create and fit the LSTM network
# two stacked LSTM layers followed by a single-unit output layer
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(x_train.shape[1],1)))
model.add(LSTM(units=50))
model.add(Dense(1))

model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=1, batch_size=1, verbose=2)
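The windowing used to build the training samples can be seen more easily on a toy series. The sketch below uses a look-back of 3 instead of 60 and the integers 0 to 9 instead of scaled prices; the helper name make_windows is an assumption made for this example.

```python
import numpy as np

def make_windows(series, lookback):
    """Return inputs of shape (samples, lookback, 1) and the next-step targets."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i - lookback:i]
                  for i in range(lookback, len(series))])
    y = series[lookback:]
    return X.reshape(X.shape[0], lookback, 1), y

series = np.arange(10)            # toy data: 0, 1, ..., 9
X, y = make_windows(series, lookback=3)

print(X.shape, y.shape)           # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])         # [0. 1. 2.] 3.0
```

Each row of X holds the values that come just before the matching entry of y, which is the arrangement expected by the LSTM layer: samples, time steps and features.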

#predicting 246 values, using past 60 from the train data
inputs = new_data[len(new_data) - len(valid) - 60:].values
inputs = inputs.reshape(-1,1)
inputs = scaler.transform(inputs)

X_test = []
for i in range(60,inputs.shape[0]):
    X_test.append(inputs[i-60:i,0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0],X_test.shape[1],1))

#predicting and converting back to the original price scale
closing_price = model.predict(X_test)
closing_price = scaler.inverse_transform(closing_price)

Results

rms=np.sqrt(np.mean(np.power((valid-closing_price),2)))

rms

11.772259608962642
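The value above is a root mean square error (RMSE): the square root of the average squared difference between the actual and the predicted closing prices, expressed in the same unit as the price. A small worked example with made-up numbers is sketched below; the function name rmse and the values are assumptions for illustration and are not related to the result reported above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between two arrays of the same shape."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# made-up prices, shaped (n, 1) like the validation data
actual = [[100.0], [102.0], [104.0]]
predicted = [[103.0], [98.0], [104.0]]

# errors are -3, 4 and 0, so the mean squared error is 25/3
error = rmse(actual, predicted)
print(error)
```

Here the result is the square root of 25/3, about 2.89, so the predictions are off by roughly 2.89 price units on average, with larger errors weighted more heavily.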

#for plotting
train = new_data[:987]
valid = new_data[987:].copy()  #copy so that adding a column does not touch new_data
valid['Predictions'] = closing_price
plt.plot(train['Close'])
plt.plot(valid[['Close','Predictions']])

Fig. 5.2 Prediction Output Visualization

Chapter 6
Conclusion
6.1 Conclusion

The practice of data science can best be described as a combination of analytical
engineering and exploration. The business presents a problem we would like to solve. Rarely is
the business problem directly one of our basic data mining tasks. We decompose the problem
into subtasks that we think we can solve, usually starting with existing tools. For some of these
tasks we may not know how well we can solve them, so we have to mine the data and conduct
evaluation to see. If that does not succeed, we may need to try something completely different. In
the process we may discover knowledge that will help us to solve the problem we had set out to
solve, or we may discover something unexpected that leads us to other important successes.

Neither the analytical engineering nor the exploration should be omitted when
considering the application of data science methods to solve a business problem. Omitting the
engineering aspect usually makes it much less likely that the results of mining data will actually
solve the business problem. Omitting the understanding of the process as one of exploration and
discovery often keeps an organization from putting the right management, incentives, and
investments in place for the project to succeed.

We conclude that Data Scientists are the backbone of data-intensive
companies. The purpose of Data Scientists is to extract, preprocess and analyze data so that
companies can make better decisions. Different companies have their own requirements and
use data accordingly. In the end, the goal of a Data Scientist is to help businesses grow. With
the decisions and insights provided, companies can adopt appropriate strategies and customize
themselves for an enhanced customer experience.
