
GUJARAT TECHNOLOGICAL UNIVERSITY

Academic year
(2021-2022)

ITM UNIVERSITY
INTERNSHIP REPORT UNDER

SUBJECT OF

SUMMER INTERNSHIP (3170001)

B.E. SEMESTER-VII

BRANCH - Computer Science and Engineering

Submitted by:

KUSHAL BHATT

BrainyBeam Technologies Pvt. Ltd.

Mr. KULKARNI GAURAV SHARAD MR. SAGAR JASANI

(Internal Guide) (External Guide)


Company Profile

Company Name: BrainyBeam Technologies Pvt Ltd


Address: 118, Sukan Mall, Science City Road, Ahmedabad
Contact No: +91 9033237336

Email Id: sagar@brainybeam.com
Website: www.brainybeam.com
About Us

At BrainyBeam, we see Innovation as a clear differentiator. Innovation, along with focus on deep,
long-lasting client relationships and strong domain expertise, drives every facet of our day-to-day
operations.

BrainyBeam Technologies was founded with a vision to address growing businesses' needs of
reducing the time to market and cost effectiveness required to develop and maintain unique and
customized web and mobile solutions. We are uniquely and strategically positioned to partner
with startups and leading brands to help them expand their business and offer the most effective
and cost-efficient solutions that provide revenues and value to their business needs.

Vision
To become the most trusted and preferred offshore IT solutions partner for Startups, SMBs and
Enterprises through innovation and technology leadership. Understanding your ambitious
vision, honing in on its essence, creating a design strategy, and knowing how to technically
execute it is what we do best. Our promise? The integrity of your vision will be maintained and
we'll enhance it to best reach your target customers. With our primary focus on creating
amazing user experiences, we'll help you understand the tradeoffs, prioritize features, and
distill valuable functionality. It's an art form we care about getting right.
Joining Letter
Completion Certificate
Kushal Bhatt

ACKNOWLEDGEMENT

I would like to express my deepest gratitude to all those who made the
completion of this internship possible. I give special thanks to our Assistant
Professor, Prof. Shweta Rajput, whose stimulating suggestions and encouragement
helped me coordinate the internship, especially in drafting this report.

Furthermore, I would also like to acknowledge with much appreciation the crucial
role of the Head of Department, Dr. Avani Vasant, who gave permission to use
all required equipment and the necessary material to fulfil the task. Last but not
least, many thanks go to the teachers, my friends and my family, who have
invested their full effort in guiding us towards achieving the goal.

I also appreciate the guidance given by the developer at BrainyBeam, Mr. Raj,
as well as the internship panel members, who advised and guided me at every
moment of the internship.


Abstract

Data science and analysis plays a significant role today, covering almost
every industry in the market, for example finance, e-commerce, business,
education and government.
Organizations now take a 360-degree view of their customers, analysing their
behaviour and interests in order to take decisions in their favour. Data is
analysed using programming languages such as Python, one of the most
versatile languages, which makes a lot of these tasks possible.
Netflix is essentially a data science project that reached the top by analysing
every single interest of its customers. Key terminology used in data science
includes: data visualization, Anaconda/Jupyter Notebook, exploratory data
analysis, machine learning, data wrangling, and evaluation using the
scikit Surprise library.


||| DAY - 1
BASIC INTRODUCTION AND DOMAIN KNOWLEDGE
Explained the workflow of the whole internship and discussed some basic
domain knowledge.
Introduction about the Field
i. Discussed some basic points about Python, how Python works, and the
advantages of Python for working in data science.
ii. Also explained how to install and run Python, Jupyter Notebook and
other useful tools.

Difference between Data Science, Data Analysis & Machine Learning


i. Data Science: Uses mathematical skills to get the desired outcome from data.
ii. Data Analysis: Analysing data with different charts and tables.
iii. Machine Learning: Based entirely on mathematics; used for prediction, for
building models, etc.
Basic Linux Commands
1. cd – use the cd command to change the directory
2. mkdir – use the mkdir command to create a folder or directory
3. touch – create a new file
4. rmdir – delete a directory
5. ls – list all the files and directories in the present working directory
6. pwd – display the path of the present working directory
7. rm – use the rm command to delete files and directories

AIM: Task: build a python program which can take input of students with their
subject marks, and gives their total marks obtained.
Program:
n = int(input("Enter the number of students "))
for i in range(n):
    name = input("Enter the Name: ")
    sub = int(input("No. of subjects "))
    total = 0  # reset the running total for each student
    for j in range(sub):
        marks = int(input("Enter marks: "))
        total = total + marks
    print(name, "- total marks obtained:", total)


||| DAY - 2
AIM: List out the methods used commonly in list, set, tuple,
dictionary with their rules
Data Types in Python
1. str: A string is traditionally a sequence of characters, either as a literal
constant or as some kind of variable. The latter may allow its elements to be mutated
and the length changed, or it may be fixed after creation.
2. Numbers: int, float, complex and long are the numeric data types. We can store
whole numbers in int, floating point values in float, complex numbers in the complex
data type, and integers of unlimited size in long (Python 2; in Python 3, int itself is unbounded).
3. Lists are built-in data types in Python that are used to store multiple items in a
single variable. The data is stored in [ ].
4. Sets are also used to store multiple items in a single variable. A set has no order
and no index. Data is stored between { }.
5. Tuples: Like lists, tuples are ordered; unlike lists, tuples are
immutable. Stored in ( ).
6. Dictionary: Stores key-value pairs; ordered, changeable (mutable), and does not allow
duplicate keys.

LIST:
Example: a = ['Jai', 'ni', 'sh']
Lists are built-in data types in Python that are used to store multiple items in a single
variable. The plus point of a list is that the order of the list does not change, the items in the
list are changeable (mutable), and a list allows duplicate values too.
LIST Methods:
- .append(x) : Add an item to the end of the list
- .insert(i, x): Inserting an item at a given position
- .remove(x) : removing the first item from the list whose value is equal to x
- copy(): Copying of the list
- count(): Number of elements with the specified value
- reverse() : reverse the list
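A small illustrative sketch (with made-up values) showing these list methods in use:

fruits = ['apple', 'banana', 'cherry']
fruits.append('mango')        # add 'mango' at the end
fruits.insert(1, 'kiwi')      # insert 'kiwi' at index 1
fruits.remove('banana')       # remove the first occurrence of 'banana'
copied = fruits.copy()        # shallow copy of the list
print(fruits.count('apple'))  # 1
fruits.reverse()              # reverse the list in place
print(fruits)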


SET
- Sets are also used to store multiple items in a single variable.
- A set has no order and no index.
- The downside of the set data type is that individual elements cannot be changed once
they are added; elements themselves must be immutable, although elements can still be
added to or removed from the set.
- Repetition of values is not allowed in a set.
Sets Methods:
a. add(): adds element to a set
b. discard(): Removes an Element from The Set
c. isdisjoint(): Checks Disjoint Sets
d. issubset(): Checks if a Set is Subset of Another Set
e. union(): Returns the union of sets
f. update(): Add elements to the set
g. clear(): remove all elements from a set
CODE.
# set of vowels
vowels = {'a', 'e', 'i', 'u'}

print(vowels) # adding 'o'


vowels.add('o')
print('Vowels are:', vowels)

#discard 'o'


vowels.discard("o")
print(vowels)

#isdisjoin()
A = {1, 2, 3, 4}
B = {5, 6, 7}
C = {4, 5, 6}
print('Are A and B disjoint?', A.isdisjoint(B))
print('Are A and C disjoint?', A.isdisjoint(C))

#issubset()
A1 = {1, 2, 3}
B1 = {1, 2, 3, 4, 5}
print(A1.issubset(B1))

#union
A2 = {'a', 'c', 'd'}
B2 = {'c', 'd', 2 }
print('A U B =', A2.union(B2))

#update
A3 = {'a', 'b'}
B3 = {1, 2, 3}
result = A3.update(B3)
print('A =', A3)

#clear()
vowels.clear()
print('Vowels (after clear):', vowels)


TUPLE
- Stores multiple items in one variable.
- Like lists, tuples are ordered; unlike lists, tuples are immutable.
- Tuples also allow duplicates.
Tuple Methods:
a. .count( ): Returns the number of times a specified value occurs in a tuple.
b. .index( ): Searches the tuple for a specified value and returns the position where it was
found.
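A small illustrative example (values made up) of the two tuple methods:

t = (1, 2, 3, 2, 2)
print(t.count(2))   # 3 -> the value 2 occurs three times
print(t.index(3))   # 2 -> position of the first occurrence of 3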

Dictionaries
- Stores key-value pairs
- Ordered, changeable (mutable), and does not allow duplicate keys

Dictionary Methods

get() - Returns the value of the specified key


items() - Returns a list containing a tuple for each key value pair
keys() - Returns a list containing the dictionary's keys
pop() - Removes the element with the specified key
popitem() - Removes the last inserted key-value pair

Code.
#get()
person = {'name': 'Jainish', 'age': 21}
print('Name: ', person.get('name'))
print('Age: ', person.get('age'))


#items()
print(person.items())

#keys
print(person.keys())

#setdefault()
age = person.setdefault('age')

print('person = ',person)
print('Age = ',age)

#values()
print(person.values())

#clear()
person.clear()
print(person)


||| DAY - 3
AIM:
1) Random module functions with explanation
2) Build password generator program containing numbers, Alphabets
and special characters.
3) Write a note about NLP, NLU, and NLG with examples.
4) Perform text to speech examples using gtts.

Program:

1) Random module functions with explanation


- The random module in Python is used to generate random numbers.
- Typical uses are generating random numbers, automatic password generators
or OTP generators.
RANDOM Methods:
- seed() initializes the random number generator
- sample() returns a sample of a sequence
- uniform() returns a random float number between the two parameters
- getstate() returns the current internal state of the random number generator
- randrange() returns a random number within the given range


CODE:
import random as rand
x = rand.randrange(5)
print(x) # returns a random number in the given number range

rand.seed(20)
print(rand.random())

y = rand.randint(1,50)
print(y)

mylist = ["apple", "banana", "cherry"]


print(rand.sample(mylist, k=2))

b = rand.uniform(1.0,5.0)
print(b)

print(rand.getstate())

print(rand.randrange(3, 9))

2) Build password generator program containing numbers, Alphabets


and special characters.
import random
length = int(input("enter length: "))
num = '0123456789qwertyuiopasdfghjklzxcvbnm!@#$%^&*()'
# random.sample picks characters without repetition, so length must not exceed len(num)
print("".join(random.sample(num, length)))


Learning about external and internal libraries


Among the external libraries taught was gtts (Google Text-to-Speech), which we implemented
after reading its documentation and discussing its uses in daily life.
from gtts import gTTS
>>> a= gTTS("hello how are you")
>>> a.save('voice.mp3')
>>> exit()

3) Write a note about NLP, NLU, and NLG with examples.


NLP – Natural Language Processing is the automatic manipulation of natural language, such as
speech or text, using software or deep learning and machine learning concepts.
So, NLP is the field of AI where the programmer trains the machine to read,
understand and derive meaning from human languages.
E.g. sentiment analysis in organisations, the cognitive assistant developed by
IBM, search engines and many more.


NLU – Natural Language Understanding is the branch of NLP concerned with transforming
human language into a machine-readable format. It allows computers to understand
commands without the formalized syntax of computer languages and to
communicate back to humans in their own languages.

NLG – Natural Language Generation is a subfield of AI: software that automatically
transforms data into plain-English content. With the right data in the right format, an NLG
system can automatically turn numbers in a spreadsheet into data-driven narratives or even
use associations between words to create partially or fully machine-written text.

4) Perform text to speech examples using gtts


gTTS (Google Text to Speech) is a Python library and command-line tool to interface with
the text-to-speech API developed by Google.

It can be installed with:

pip install gtts

To use it in our program we can import it with this command:

from gtts import gTTS

gTTS Functions:
a. .get_bodies() : Get request bodies sent to the TTS API.
b. .save(): Do the TTS API request and write results to file.
c. .write_to_fp() : Do the TTS API request(s) and write bytes to a file-like object.
d. .lang() : Support for different languages.

from gtts import gTTS


a=gTTS("Jainish")
a.save("voice1.mp3")

Output

voice1.mp3

Similarly, there are many more libraries, such as pyaudio for audio recording.
Here is a small implementation of pyaudio.
import pyaudio
import wave

FORMAT = pyaudio.paInt16


CHANNELS = 2
RATE = 44100
CHUNK = 1024
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "file.wav"

audio = pyaudio.PyAudio()

# start Recording
stream = audio.open(format=FORMAT, channels=CHANNELS,
rate=RATE, input=True,
frames_per_buffer=CHUNK)
print("recording...")
frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print("finished recording")

# stop Recording
stream.stop_stream()
stream.close()
audio.terminate()

waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')


waveFile.setnchannels(CHANNELS)
waveFile.setsampwidth(audio.get_sample_size(FORMAT))
waveFile.setframerate(RATE)
waveFile.writeframes(b''.join(frames))
waveFile.close()

So the program will record the voice and store it in the folder you are working on.


||| DAY - 4
AIM:
1) List out 5 methods of pandas and numpy with output.
2) Reshape(-1,1) explanation
3) Linear regression working with mathematical equation.
4) Sales prediction with csv
Day 4: On day 4, we were taught the numpy and pandas methods that are most
important for performing tasks in a data science project, such as importing files and creating
dataframes. We then covered linear regression as the first algorithm for building models and
predicting data.
Program:
1) List out 5 methods of pandas and numpy with output.
NUMPY:

import numpy as np
a = np.array([[1,2,3,4,5,6],[7,8,9,10,11,12]])
a
a.ndim              # 1. ndim returns the number of dimensions of the array

a.shape             # 2. returns the shape of the array

a.reshape(4,3)      # 3. reshaping the array
print(np.mean(a))   # 4. mean

print(np.std(a))    # standard deviation

print(np.var(a))    # variance

print(a.T)          # transpose of the matrix


Pandas:
import pandas as pd
df = pd.read_csv('Data1.csv')
df

df[df.language=='java'] # query from the dataframe

a= df[df.language == 'python'][['rating','userid']]
df['rating']*2

d={'a':[1,2,3,4],'b':[4,5,6,7],'c':[7,8,9,10]} # To Convert Dictionary into Data-frame


df = pd.DataFrame(d)
df

d={'a':[5],'b':[4],'c':[6]} # converting dict. To Dataframe


df = pd.DataFrame(d)
df

sa = np.array([3,4,5,6]) # building a DataFrame from two pandas Series


sa1 = np.array([11,2,45,6])
s1 = pd.Series(sa1)
s = pd.Series(sa)
p = pd.DataFrame([s1,s],columns =['A','B','C','D'])
p

ls_of_ls =[[1,2,3,4],[5,6,7,8],[4,5,6,7]] # list to dataframe


print(ls_of_ls)


df2 =pd.DataFrame(ls_of_ls)
df2

d = {'a':[1,2,3,4,5],'b':[4,5,6,7,8]}
df2 = pd.DataFrame(d)
df2

s1 = pd.Series([1,2,3,4]) # to concat two series


s2 = pd.Series([5,6,7,8])
c = pd.concat([s1,s2],axis=1)
c

df.info()

The groupby() function is used to split the data into groups based on some criteria. pandas
objects can be split on any of their axes.
df.groupby("column_name")
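A minimal sketch of groupby, assuming the same Data1.csv dataframe used above with its
'language' and 'rating' columns:

grouped = df.groupby('language')   # split the rows into one group per language
print(grouped['rating'].mean())    # average rating per language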


Task 1: Why must we give a list (or another sequence) rather than scalar values when
converting a data structure into a dataframe?

Code:
import pandas as pd
d = {'a':1,'b':2}

# If we write the above code it will show a ValueError: "If using all scalar values, you must
# pass an index"
# The reason is that we are providing scalar values here.
# So the program cannot decide whether to take these values as rows or columns.

#df1 = pd.DataFrame(d)

#So for successful execution we need to pass an index.


df1 = pd.DataFrame(d,index=[0])

print(df1)
#Or we can write like below
d1 = {'a':[1,2],'b':[4,5]}
df2 = pd.DataFrame(d1)
df2

Task 2: Explanation of reshape(-1,1).


The reshape() function is used to give a new shape to an array without changing its data.
The new shape should be compatible with the original shape.
- Passing -1 for an axis tells reshape() to infer the size of that axis automatically
from the array's total size and the remaining dimensions.
- For example, if we have an array of shape (6,4) and reshape it with (-1,1),
the array gets reshaped in such a way that the resulting array has only 1
column, which is only possible with 24 rows, hence (24,1).
Code:
import numpy as np

z = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
print(z.shape)

z1 = z.reshape(-1,1)

print(z1.shape)


3) Linear regression working with mathematical equation.


CODE

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1,2,3,4,5]).reshape(-1,1) # reshaping x into a column vector (2D array with one column)


y = np.array([10,26,38,41,63])

testing = np.array([8,9])

model = LinearRegression()
model.fit(x,y)
model.predict(testing.reshape(-1,1))

# Flattening array means converting a multidimensional array into a 1D


# array. We can use reshape(-1) to do this.

m = model.coef_
print(m)
print(model.intercept_)
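As a quick check of the underlying equation y = m*x + b, the slope and intercept printed
above can be used to reproduce a prediction by hand; a small sketch continuing the code
above (x_new is an illustrative value):

x_new = 8
y_manual = m[0] * x_new + model.intercept_      # y = m*x + b computed manually
print(y_manual)
print(model.predict(np.array([[x_new]])))       # should give the same value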


4) Sales prediction with csv

CODE
import numpy as np
import pandas as pd

df1 = pd.read_csv('sales.csv')
df1

data1 = df1.groupby(['month']).mean()
x = np.array(data1.index)
y = np.array(data1['sales'])
testing = np.array([17,12])

from sklearn.linear_model import LinearRegression


model = LinearRegression()
model.fit(x.reshape(-1,1),y)
print(model.predict(testing.reshape(-1,1)))


||| DAY - 5
AIM :
1) Try to clean train.csv data
2) decision tree explanation
3) random state working

1) Try to clean the train.csv data


CODE
# Loading data in pandas
import numpy as np
import pandas as pd
df = pd.read_csv('train.csv')
df

# creating a different columns as Gender using Label Encoding


from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

df['Gender']= le.fit_transform(df['Sex'])
df.info()
# data format
df.describe()

df.describe(include=['O'])

# Dropping Columns which are not useful


cols = ['Name', 'Ticket', 'Cabin']
df = df.drop(cols, axis=1)

# Dropping rows having missing values


df = df.dropna()

df.info()

# Taking care of missing data , so age columns Nan Values are interpolated and filled...
df['Age'] = df['Age'].interpolate()

df["Fare"] = df["Fare"].fillna(df["Fare"].median())


2) decision tree explanation


Decision Tree : Tool for classification and prediction. A Decision tree is a flowchart like tree
structure, where each internal node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (terminal node) holds a class label.
Although a real dataset will have many more features and this would just be a branch in a much
bigger tree, you cannot ignore the simplicity of this algorithm. The feature importance is
clear and relations can be viewed easily. This methodology is commonly known as
learning a decision tree from data; the tree above is called a classification tree, as the target is to
classify a passenger as survived or died. Regression trees are represented in the same manner,
except that they predict continuous values such as the price of a house. In general, Decision Tree
algorithms are referred to as CART, or Classification and Regression Trees.

Mathematics behind the Decision Tree algorithm: Before going to Information Gain, we first
have to understand entropy.
Entropy: Entropy is the measure of impurity, disorder, or uncertainty in a bunch of
examples.
Purpose of Entropy:
Entropy controls how a Decision Tree decides to split the data. It affects how a Decision
Tree draws its boundaries.
Entropy values range from 0 to 1; the lower the entropy, the purer (and more trustworthy) the split.

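A minimal sketch of the entropy formula E = -sum(p_i * log2(p_i)) over the class
probabilities, using made-up proportions:

import numpy as np

def entropy(probabilities):
    p = np.array(probabilities)
    p = p[p > 0]                       # ignore zero probabilities
    return -np.sum(p * np.log2(p))     # E = -sum(p * log2(p))

print(entropy([0.5, 0.5]))   # 1.0 -> maximum impurity for two equally likely classes
print(entropy([1.0, 0.0]))   # 0.0 -> a pure node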

Attribute Selection Measures


If the dataset consists of N attributes, then deciding which attribute to place at the root or at
different levels of the tree as internal nodes is a complicated step. Just randomly selecting
any node to be the root cannot solve the issue; a random approach may give us
bad results with low accuracy.
To solve this attribute selection problem, researchers devised some solutions.
They suggested using criteria such as:
Entropy,
Information gain,
Gini index,
Gain Ratio,
Reduction in Variance
Chi-Square
3) random state working
ML_model(n_estimators=100, max_depth=5, gamma=0, random_state=0, ...)
train_test_split randomly selects the train and test sets based on the ratio given. Every
time you run this function you get randomly selected train and test values
based on the train/test size ratio. This also has an impact on the evaluation metrics:
if random_state is not fixed, the evaluation values will differ between runs. The values 0 and
1 are the most commonly used, but you can select your own value. Think of it as shuffling a
deck of cards, where the first shuffle corresponds to random state 1, the second shuffle to
random state 2, and so on.
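A minimal sketch (with toy data) showing that fixing random_state makes train_test_split
reproducible:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features
y = np.arange(10)                  # toy labels

# same random_state -> identical split on every run
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=0)
print(np.array_equal(X_te1, X_te2))   # True

# without random_state the split (and hence the evaluation metrics) can change each run
X_tr3, X_te3, y_tr3, y_te3 = train_test_split(X, y, test_size=0.3)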


||| DAY - 6
AIM:
On day 6 of the internship we were taught more data cleaning
operations and then the decision tree algorithm and its parameters to better
understand the algorithm. We were then assigned 4 tasks:
1) Explain 5 parameters used in the decision tree model.
2) Mention data cleaning methods and how they work.
3) Make a diagram explaining decision tree parameters with the Titanic
dataset and its equation.
4) Obtain 90% accuracy from the dataset.

1) Explain 5 parameters used in decision tree model


a. max_depth
• The first hyper parameter to tune in a Decision Tree is max_depth.
• It indicates how deep the decision tree can be.
• The deeper the tree, the more splits it has and it captures more information
about data.
• A Decision Tree overfits for large depth values. The tree perfectly predicts all
of the train data but it fails to generalize the findings for the new data.

b. min_samples_split
• An internal node will have further splits (also called children).
• min_samples_split specifies the minimum number of samples required to split
an internal node.
• We can either specify number to denote the minimum number or a fraction to
denote the percentage of samples in an internal node.

c. min_samples_leaf
• A leaf node is a node without any children.
• min_samples_leaf is the minimum number of the samples required to be at a
leaf node.
• This parameter is similar to min_samples_split; however, it describes the
minimum number of samples at the leaves, the base of the tree.


d. max_features
• max_features represents the number of features to consider when looking for
the best split.

• We can either specify a number to denote the max_features at each split or a


fraction to denote the percentage of features to consider while making a split.
• We also have options such as sqrt, log2, None.

e. criterion
criterion - string, optional, default = "gini"
• Supported criteria are "gini" and "entropy". This is the function used to measure the quality of
a split.
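A sketch of how these parameters can be passed to scikit-learn's DecisionTreeClassifier; the
values below are illustrative only and would need to be tuned for a real dataset:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion='entropy',      # or 'gini' (the default)
    max_depth=5,              # limit the depth of the tree to reduce overfitting
    min_samples_split=10,     # minimum samples needed to split an internal node
    min_samples_leaf=5,       # minimum samples required at a leaf node
    max_features='sqrt',      # number of features considered at each split
    random_state=0,
)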

2) Mention data cleaning methods and how they work.


Data cleansing or data cleaning is the process of identifying and removing (or
correcting) inaccurate records from a dataset, table, or database and refers to
recognizing unfinished, unreliable, inaccurate, or non-relevant parts of the data
and then restoring, remodeling, or removing the dirty or crude data.
This improves the quality of the training data for analytics and enables accurate
decision-making.

a) Check for Missing Values – for detecting missing values across
different array dtypes, pandas provides the functions isnull() and
notnull().
Example:
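A minimal sketch (with made-up values) of isnull() and notnull():

import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, 3])
print(s.isnull())    # True where the value is missing (NaN)
print(s.notnull())   # the opposite mask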


b) Cleaning/Filling Missing Data


Pandas provides the fillna() function for filling missing values, i.e. NaN
values, in a couple of ways, for example filling with the mean, variance or standard
deviation.
Replace NaN with a Scalar Value

Filling NaN values Forward and Backward


- pad/ffill: fill values forward
- bfill/backfill: fill values backward
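A minimal sketch (made-up values) of fillna() with a scalar, the mean, and the
forward/backward fill methods:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.fillna(0))                 # replace NaN with a scalar value
print(s.fillna(s.mean()))          # replace NaN with the mean
print(s.fillna(method='ffill'))    # pad/ffill: propagate the last valid value forward
print(s.fillna(method='bfill'))    # bfill/backfill: use the next valid value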


c) Drop Missing Values – to simply exclude the missing values, use the
dropna function along with the axis argument. By default, axis = 0 (along
rows), which means that if any value within a row is NA then the whole
row is excluded.

d) Replace Missing Values: replacing the missing values with some


specific values, so that the efficiency of the model increases by replacing
a small number of values.
Replacing NA with a scalar value is equivalent to the behaviour of the fillna()
function.
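A minimal sketch (made-up values) of dropna() and of replacing missing values with a
specific value:

import pandas as pd
import numpy as np

df_miss = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})
print(df_miss.dropna())              # drop every row that contains a NaN (axis=0 is the default)
print(df_miss.replace(np.nan, -1))   # replace missing values with a specific value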

e) Tidying up Fields in the Data: cleaning specific columns and


getting them into a uniform format to better understand the
dataset and enforce consistency. For example, let's assume a dataset of
library books with a column named Date of Publication which has
values such as the following.


A particular book can have only one date of publication. Therefore, we need to do
the following:

- Remove the extra dates in square brackets, wherever present: 1879 [1878]
- Convert date ranges to their "start date", wherever present: 1860-63; 1839,
38-54
- Completely remove the dates we are not certain about and replace them with
NumPy's NaN: [1897?]
- Convert the string nan to NumPy's NaN value


Synthesizing these patterns, we can actually take advantage of a single regular


expression to extract the publication year:

regex = r'^(\d{4})'
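A sketch of applying this regex, assuming a hypothetical dataframe df with the
'Date of Publication' column from the library-books example above:

extracted = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
df['Date of Publication'] = pd.to_numeric(extracted)   # values that do not match become NaN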

3) Make a diagram explaining decision tree parameters with the Titanic


dataset and its equation.


4) Obtain 90% accuracy from the dataset.


### importing libraries

import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier


train_df = pd.read_csv("train.csv")
train_df.info()

total = train_df.isnull().sum().sort_values(ascending=False)
percent_1 = train_df.isnull().sum()/train_df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)

train_df.columns.values


train_df = train_df.drop(['PassengerId'],axis=1)
### Dealing with the missing data
import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train_df]

for dataset in data:
    dataset['Cabin'] = dataset['Cabin'].fillna("U0")
    dataset['Deck'] = dataset['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    dataset['Deck'] = dataset['Deck'].map(deck)
    dataset['Deck'] = dataset['Deck'].fillna(0)
    dataset['Deck'] = dataset['Deck'].astype(int)

# Now we can drop the Cabin feature


train_df = train_df.drop(['Cabin'],axis=1)

### Age 's missing values

data = [train_df]

for dataset in data:
    mean = train_df["Age"].mean()
    std = train_df["Age"].std()
    is_null = dataset["Age"].isnull().sum()

    # compute is_null random numbers between mean-std and mean+std
    rand_age = np.random.randint(mean - std, mean + std, size=is_null)

    # fill NaN values in the Age column with the random values generated
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice
    dataset["Age"] = train_df["Age"].astype(int)

train_df["Age"].isnull().sum()

### Embarked missing values


train_df['Embarked'].describe()

common_value = 'S'
data = [train_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)

### Converting "Fare" from float to int64


data = [train_df]

for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)

### Convert 'Sex' feature into numeric


genders = {"male": 0, "female": 1}
data = [train_df]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

### Checking the Ticket for missing values


train_df = train_df.drop(['Ticket'],axis =1)
train_df

x = train_df.drop(["Survived","Name"],axis=1)
y = train_df["Survived"]

from sklearn.model_selection import train_test_split


x_train, x_test,y_train,y_test = train_test_split(x,y,test_size=0.1)

decision_tree = DecisionTreeClassifier()
decision_tree.fit(x_train,y_train)
y_pred = decision_tree.predict(x_test)
acc_decision_tree = round(decision_tree.score(x_train,y_train) *100,2)

print(acc_decision_tree)


||| DAY - 7
Task List:
1) Find the polarity of unique products using apply with a function.
2) Use frequency distribution and POS tagging from nltk.
3) Make a list of meta characters and make patterns for email and phone
number.

a) Find the polarity of unique products using apply with a function.

The range of polarity is from -1 to 1 (negative to positive) and tells us whether the
text contains positive or negative feedback.
The polarity of words is retrieved from the pattern package, and the sentence
polarity is calculated as the sum of the polarities of all the words in a sentence
divided by the total number of words in the sentence.

import pandas as pd
import numpy as np
import nltk
from textblob import TextBlob  # needed for the polarity calculation below
df = pd.read_csv("cloths-rating.csv")

df['grouped'] = df.groupby(['ProductID'])['Text'].transform(lambda x: ' '.join(x))

def sentiment_calc(text):
    try:
        return TextBlob(str(text)).sentiment.polarity
    except:
        return None

df['sentiment'] = df['grouped'].apply(sentiment_calc)


b) Use frequency distribution and POS tagging from nltk

import nltk

text = "The titular threat of The Blob has always struck me as the ultimate movie ... rampant."
print(text)

from nltk.tokenize import blankline_tokenize


b=blankline_tokenize(text)
len(b)

from nltk.tokenize import word_tokenize


from nltk.probability import FreqDist
freq=FreqDist()
tokens=word_tokenize(text)
len(tokens)

for i in tokens:
    freq[i] += 1
freq

cw =freq.most_common(5)
cw

pos=nltk.pos_tag(tokens)
pos

from nltk import ne_chunk


from nltk import word_tokenize, pos_tag, ne_chunk
chunk=ne_chunk(pos)
chunk


c) Make a list of meta characters and make patterns for email and phone
number.
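For reference, a short (non-exhaustive) list of common regex meta characters, written as
Python comments; the patterns below use several of them:

# .      any character except a newline
# ^      start of the string
# $      end of the string
# *      zero or more repetitions
# +      one or more repetitions
# ?      zero or one repetition
# {m,n}  between m and n repetitions
# []     a set of characters, e.g. [6-9]
# \      escapes a meta character or starts a class such as \d or \w
# |      alternation (either/or)
# ()     grouping / capturing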


import re
reg = r'^(\w|\.|\_|\-)+[@](\w|\_|\-|\.)+[.]\w{2,3}$'

em = input("enter email: ")

if re.search(reg, em):
    print("Nice, it's a valid email id, enjoy")
else:
    print("So sorry, not a valid email id")

import re
regex = r'^[6-9]\d{9}$'
phone = input("enter number: ")
if re.search(regex, phone):
    print("Valid Phone")
else:
    print("Invalid Phone")


||| DAY - 8
Task: Explain TF-IDF with example.

TF-IDF stands for "Term Frequency - Inverse Document Frequency". It is a technique
for quantifying a word in documents: we compute a weight for each word which signifies
the importance of the word in the document and corpus. This method is widely used
in Information Retrieval and Text Mining.

t — term (word)
d — document (set of words)
N — number of documents in the corpus
corpus — the total document set

Term Frequency (tf): gives us the frequency of the word in each document in the corpus. It is
the ratio of the number of times the word appears in a document to the total number of
words in that document. It increases as the number of occurrences of that word within the
document increases. Each document has its own tf.
Inverse Document Frequency (idf): measures how informative a word is across the corpus. It is
the logarithm of the ratio of the total number of documents N to the number of documents that
contain the word, so words that appear everywhere get a weight close to zero while rare words
get a higher weight. The tf-idf weight of a word in a document is the product tf × idf.
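A tiny worked sketch of the weighting tf-idf(t, d) = tf(t, d) * log(N / df(t)), using two
made-up documents (note that scikit-learn's TfidfVectorizer used below applies a smoothed
variant of this formula):

import math

docs = [['cat', 'sat', 'on', 'the', 'mat'],
        ['the', 'dog', 'sat']]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

print(tf('cat', docs[0]) * idf('cat', docs))   # 'cat' is rare -> non-zero weight
print(tf('the', docs[0]) * idf('the', docs))   # 'the' appears in every document -> weight 0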

EXAMPLE.
from sklearn.feature_extraction.text import TfidfVectorizer
sentences = ['what is your name', 'where do you live', 'do you live in surat', 'what is your lastname']

vectors = TfidfVectorizer()
vectors.fit(sentences)
transform = vectors.transform(sentences)
print(transform)
print(vectors.vocabulary_)
transform.shape


import pandas as pd
d={'name':['a','b','c'],'id':[1,2,3]}
a=pd.DataFrame(d)
a

import pandas as pd
d=[1,2,3,4,5]
g=[6,7,8,9,10]
a=pd.Series(d)
b=pd.Series(g)
a+b

df = pd.DataFrame(vectors.fit_transform(sentences).toarray(), columns=vectors.get_feature_names())
vectors = TfidfVectorizer(binary=True, min_df=2, max_df=3)
vectors.fit(sentences)
transform = vectors.transform(sentences)
print(transform)
print(vectors.vocabulary_)
df = pd.DataFrame(vectors.fit_transform(sentences).toarray(), columns=vectors.get_feature_names())


||| DAY - 9
Task: Explain three techniques of stemming.

Stemming is the process of reducing inflection in words to their root forms such as mapping a
group of words to the same stem even if the stem itself is not a valid word in the Language.

The differences between the Snowball, Lancaster and Porter stemmers are:

SNOWBALL Stemming:
When compared to the Porter stemmer, the Snowball stemmer can map non-English words too.
Since it supports other languages, the Snowball stemmer can be called a multi-lingual
stemmer. The Snowball stemmers are also imported from the nltk package. This stemmer is
based on a programming language called 'Snowball' that processes small strings and is the
most widely used stemmer. A lot of the things added to the Snowball stemmer were because
of issues noticed with the Porter stemmer. There is about a 5% difference in the way that
Snowball stems versus Porter.
LANCASTER Stemming:
The Lancaster stemmer is more aggressive and dynamic compared to the other two
stemmers. It is really fast, but the algorithm can be confusing when dealing
with small words, and it is not as efficient as the Snowball stemmer. The Lancaster
stemmer stores its rules externally and basically uses an iterative algorithm.
PORTER:
It is not too complex and development on it is frozen. Typically, it is a nice basic
stemmer to start with, but it is not really advised for any production/complex application. It is
based on the idea that the suffixes in the English language are made up of a combination of
smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The main
applications of the Porter stemmer include data mining and information retrieval. However, its
applications are limited to English words. Also, a group of stems is mapped onto the
same stem and the output stem is not necessarily a meaningful word. The algorithm is
fairly lengthy and is known to be the oldest stemmer.
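A minimal sketch comparing the three stemmers from nltk on a few made-up words:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

for word in ['running', 'generously', 'happiness']:
    print(word, porter.stem(word), lancaster.stem(word), snowball.stem(word))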


||| DAY - 10
Task:
1) Explain collaborative and content-based filtering with examples.
2) Explain cosine similarity with its equation.
3) Explain RMSE and MSE with their mathematical equations.

a) Explain collaborative and content based filtering with example.

Content-based filtering makes recommendations based on user preferences for product


features. Content-based filtering can recommend a new item, but needs more data on user
preferences in order to find the best match.
Content-based filtering relies on similarities between the features of the items. It recommends
items to a customer based on the items previously rated highest by the same customer.
- Compare what and how many features match and collect scores
- Recommend the highest-scored item
- The code is based on an algorithm that, given some item, finds the most similar item

Collaborative filtering mimics user-to-user recommendations. It predicts user preferences


as a linear, weighted combination of other users' preferences. Collaborative filtering needs a
large dataset with active users who have rated a product before in order to make accurate
predictions.
Collaborative filtering relies on how other users responded to the same items. It doesn't
rely on the features of the item, but on the preferences of other users.
- Users have a table with the different items they chose or liked and their ratings
- Based on the similarities, a prediction can be made of what the user might like, based
on what similar users did.
- The list is filtered and matched to users who used the same items for comparison
and recommendations
- A collaborative algorithm uses "user behaviour" for recommending items. It
exploits the behaviour of other users and items in terms of transaction history, ratings,
selection and purchase information. Other users' behaviour and preferences over the
items are used to recommend items to new users. In this case, the features of the items
need not be known.

b) Explain cosine similarity with its equation

Cosine similarity measures the similarity between two vectors of an inner product space. It is
measured by the cosine of the angle between the two vectors and determines whether the two
vectors are pointing in roughly the same direction. It is often used to measure document
similarity in text analysis.

Cosine similarity is a measure of similarity that can be used to compare documents or, say,
give a ranking of documents with respect to a given vector of query words. Let x and y be two
vectors for comparison. Using the cosine measure as a similarity function, we have
sim(x, y) = (x . y) / (||x|| ||y||), i.e. the dot product of the two vectors divided by the product of
their magnitudes.
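A minimal numpy sketch of this formula, with made-up vectors:

import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))   # (x . y) / (||x|| * ||y||)

x = np.array([1, 2, 3])
y = np.array([2, 4, 6])
print(cosine_similarity(x, y))   # 1.0 -> the vectors point in the same direction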

c) Explain RMSE and MSE with their mathematical equations.

Mean Squared Error (MSE) is an error metric for judging the accuracy and error rate of
any machine learning algorithm for a regression problem. MSE is a risk function that
gives the average squared difference between the predicted and the actual value of a
feature or variable: MSE = (1/n) * sum((y_actual - y_predicted)^2).


RMSE is an acronym for Root Mean Squared Error, which is the square root of the value
obtained from the Mean Squared Error function: RMSE = sqrt(MSE). Using RMSE, we can
easily see the difference between the estimated and actual values of a parameter of the model.
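A minimal sketch of both metrics with made-up values (scikit-learn's mean_squared_error is
used only as a cross-check):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])

mse = np.mean((y_true - y_pred) ** 2)   # MSE = (1/n) * sum((y - y_hat)^2)
rmse = np.sqrt(mse)                     # RMSE = sqrt(MSE)
print(mse, rmse)
print(mean_squared_error(y_true, y_pred))   # same MSE computed by scikit-learn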


||| DAY - 11
Task: Perform recommendation (based on rating) with any dataset

In the following task, we were asked to perform recommendation based on ratings from
any dataset. I was provided with the Amazon Electronics Rating Dataset, whose attributes
(columns) are reviewerID, asin, reviewerName, helpful, reviewText, overall, summary,
unixReviewTime and reviewTime. We mainly consider reviewerID, asin,
overall, reviewText and summary for the recommendation and sentiment analysis.

The first step is to import the libraries, which is the most important task for any exploratory or
sentiment analysis.

Then we have to import the dataset. The dataset is a JSON file downloaded
from the link http://jmcauley.ucsd.edu/data/amazon/. The original data was in JSON format;
the JSON was imported and decoded to convert it to CSV format. A sample record
is shown below:
Sample review Dataset:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful
time playing these old hymns. The music is at times hard to read because we think the book
was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

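A sketch of how such line-delimited JSON reviews can be loaded into pandas and written out
as CSV; the file name 'Electronics_5.json' is only an assumption for illustration:

import pandas as pd

reviews = pd.read_json('Electronics_5.json', lines=True)   # one JSON object per line
reviews[['reviewerID', 'asin', 'overall', 'reviewText', 'summary']].to_csv('electronics.csv', index=False)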

||| DAY - 12
Task: Find the key from the dictionary containing 1-5 ratings as keys and 40
values.

dict1 = {'1': [-5, -4, -3.75, -3, -2.5, -2.25, -2, -1.5, -1.25, -1, -0.75, -0.5, -0.25],
         '2': [-0.24, 0.25, 0.5, 0.75, 1],
         '3': [1.01, 1.25, 1.5, 2],
         '4': [2.01, 2.25, 2.5, 3],
         '5': [3.01, 3.75, 4, 5]}

def fun(val):
    for i in dict1:
        for j in dict1[i]:
            if val == j:
                return i
            if val <= j:
                return i
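A quick usage check of the function above (the key is returned as a string):

print(fun(-2.0))   # 1
print(fun(0.6))    # 2
print(fun(2.3))    # 4
print(fun(4.5))    # 5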


||| DAY - 13
Task: Perform recommendation (based on Rating) with any dataset

import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.metrics.pairwise import cosine_similarity
LOADING THE CSV FILE
df = pd.read_csv("cloths-rating.csv")
df.head()

Find Sentiment on Text (Reviews)
def sentiment_calc(text):
    try:
        return TextBlob(str(text)).sentiment.polarity
    except:
        return None

df['sentiment'] = df['Text'].apply(sentiment_calc)
df

Apply Multiplication B/W Rating & Sentiment


df['Updated_score'] = df['Rating']*df['sentiment']
df


Make UserId into Normal Form


from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['UserID'] = le.fit_transform(df['UserID'])
df

Make function for classify updated_score


dict1 = {'1': [-5, -4, -3.75, -3, -2.5, -2.25, -2, -1.5, -1.25, -1, -0.75, -0.5, -0.25],
         '2': [-0.24, 0.25, 0.5, 0.75, 1],
         '3': [1.01, 1.25, 1.5, 2],
         '4': [2.01, 2.25, 2.5, 3],
         '5': [3.01, 3.75, 4, 5]}

def fun(val):
    for i in dict1:
        for j in dict1[i]:
            if val == j:
                return i
            if val <= j:
                return i

Apply function on updated_score and put into New_score column


df['New_score'] = df['Updated_score'].apply(fun)
df['New_score'] = pd.to_numeric(df['New_score'])
df
Pivot table of ProductID, UserID and New_score
df_pivot =df.pivot_table(index='ProductID',columns='UserID',values='Rating').fillna(0)
df_pivot

Sparse Matrix (Compressed Sparse Row)


from scipy.sparse import csr_matrix

df_pivot_matrix = csr_matrix(df_pivot.values)
print(df_pivot_matrix)


Fitting data into NearestNeighborsModel


from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric = 'cosine', n_neighbors=20, radius=1)

model_knn.fit(df_pivot_matrix)

# Assumed step (not shown in the original listing): pick a product and query its
# nearest neighbours; kneighbors returns the cosine distances and the indices.
query_index = np.random.choice(df_pivot.shape[0])
similarity, indices = model_knn.kneighbors(
    df_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors=6)

data_dict = {}
for i in range(0, len(similarity.flatten())):  # length of the similarity (distance) array
    if i == 0:
        print('Recommendations for {0}:\n'.format(df_pivot.index[query_index]))
    else:
        data_dict[str(df_pivot.index[indices.flatten()[i]])] = float(similarity.flatten()[i])
        print(f'{df_pivot.index[indices.flatten()[i]]}, similarity distance = {similarity.flatten()[i]:.20f}')

print(data_dict)


There is a very slight difference between the recommendation using ratings and the
recommendation using reviews.


||| DAY - 14
Task: Explain surprise package and its working
Scikit-Surprise is an easy-to-use Python scikit for recommender systems; another example of a
Python scikit is scikit-learn, which has lots of awesome estimators. Singular value
decomposition (SVD), as used here, employs gradient descent to minimize the squared error
between the predicted rating and the actual rating, eventually arriving at the best model.
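A minimal sketch of how Surprise's SVD can be trained and evaluated; the small ratings
dataframe is made up for illustration:

import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate

ratings = pd.DataFrame({
    'userID': ['u1', 'u1', 'u2', 'u2', 'u3', 'u3'],
    'itemID': ['i1', 'i2', 'i1', 'i3', 'i2', 'i3'],
    'rating': [5, 3, 4, 2, 5, 1],
})

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['userID', 'itemID', 'rating']], reader)

algo = SVD()  # matrix factorisation trained with (stochastic) gradient descent
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)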
