

SUMMER TRAINING REPORT

ON

“DATA SCIENCE”

Submitted to
RAJASTHAN TECHNICAL UNIVERSITY

In Partial Fulfilment of the Requirement for the Award of

BACHELOR’S DEGREE IN
COMPUTER SCIENCE AND ENGINEERING

BY

VEDANT KALIA 20ESKCA068

UNDER THE GUIDANCE OF


Prof Nidhi Shrivastav

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SWAMI KESHVANAND INSTITUTE OF TECHNOLOGY ,
JAIPUR

2022-2023
Swami Keshvanand Institute of Technology, Jaipur
Department of Computer Science and Engineering

CERTIFICATE
Acknowledgement

It is my pleasure to be indebted to various people who directly or indirectly contributed to the development of this work and who influenced my thinking, behavior, and acts during the course of study.

I express my sincere gratitude to Dr. Mukesh Kumar Gupta, HOD (Department), for providing me an opportunity to work in a consistent direction and for providing all necessary means to complete my presentations and the report thereafter.

I would like to thank my esteemed supervisor Ms. Nidhi Srivastav, Department of Computer Science & Engineering, Swami Keshvanand Institute of Technology, Management and Gramothan, Jaipur, for her valuable suggestions, keen interest, constant encouragement, incessant inspiration and continuous help throughout this work. Her excellent guidance has been instrumental in making this work a success.

I express my sincere heartfelt gratitude to all the staff members of the Department of Computer Science & Engineering who helped me directly or indirectly during this course of work.

I would also like to express my thanks to my parents for their support and blessings. Special thanks go to all my friends for their support in the completion of this work.

VEDANT KALIA
20ESKCA068
Contents

1 Introduction
    OVERVIEW
    MOTIVATION
    OBJECTIVES OF TRAINING

2 Introduction to Project / Modules
    Problem Statement

3 Description of Modules
    MODULE 1 : Python For Data Science
    MODULE 2 : Understanding Statistics For Data Science
    MODULE 3 : Predictive Modeling and Introduction to Machine Learning

4 Conclusion
    TAKEAWAYS OF TRAINING
    FUTURE SCOPE

List of Figures

4.1 Screenshot of assignments


Chapter 1

Introduction
OVERVIEW

The Data Science Training by Internshala is a 6-week online training program in which Internshala aims to provide you with a comprehensive introduction to data science. In this training program, you will learn the basics of Python, statistics, predictive modeling, and machine learning. The program has video tutorials and is packed with assignments, assessment tests, quizzes, and practice exercises that give you hands-on learning experience. At the end of the program, you will have a solid understanding of data science and will be able to build an end-to-end predictive model.

MOTIVATION

For me, learning machine learning grew purely out of an interest in Artificial Intelligence. Just hearing the word robots, or anything about AI, fascinated me, although at that time it was only a moment of joy. I believe that data is power and that Machine Learning is something that can unlock the immense potential of data.

Besides this, I am eager to learn data science so that I can get involved in the various fields where data science plays a major role.
OBJECTIVES OF TRAINING

Upon successful completion of the Certificate, graduates should be able to:

1. Use their learned skills, knowledge and abilities to deal with data for the internet and apply basic design principles to present ideas, information, products, and services when creating models.
3. Demonstrate communication skills, service management skills, and presentation skills.
4. Complete job preparation tasks including writing resumes and cover letters, conducting job interviews and developing an E-Portfolio. Apply employability skills including fundamental skills, personal management skills, and teamwork skills.
Chapter 2
Problem Description

The following files are provided: train.csv and test.csv.

Use the train.csv dataset to train the model. This file contains all the client and call details as well as the target variable “subscribed”. Then use the trained model to predict whether a new set of clients will subscribe to the term deposit.
In [63]: train['subscribed'].replace('no', 0, inplace=True)
         train['subscribed'].replace('yes', 1, inplace=True)

In [64]: train.head()

[Output: table of the first five rows of train (index 0 to 4), with the columns ID, age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome and subscribed]

In [65]: train.isnull().sum()

ID            0
age           0
job           0
marital       0
education     0
default       0
balance       0
housing       0
loan          0
contact       0
day           0
month         0
duration      0
campaign      0
pdays         0
previous      0
poutcome      0
subscribed    0
dtype: int64

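In [66]: # numeric_only=True keeps text columns out of the correlation on recent pandas versions
         corr = train.corr(numeric_only=True)
         mask = np.array(corr)
         # Unmask the lower triangle so that each pair of variables is drawn once
         mask[np.tril_indices_from(mask)] = False

         plt.figure(figsize=(14, 5))
         sn.heatmap(corr, linewidths=0.3, mask=mask, annot=True, square=True)

<matplotlib.axes._subplots.AxesSubplot at 0x...>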


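target = train['subscribed']
train = train.drop('subscribed', axis=1)

train = pd.get_dummies(train)

from sklearn.model_selection import train_test_split

# Hold out part of the training data for validation (test_size=0.2 is an assumed value)
x_train, x_value, y_train, y_value = train_test_split(train, target, test_size=0.2)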

Logistic Regression

from sklearn.linear_model import LogisticRegression

lreg = LogisticRegression()

lreg.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
                   penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

pred = lreg.predict(x_value)

from sklearn.metrics import accuracy_score

accuracy_score(y_value, pred)
0.8955766192733018

Decision Tree

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=4)

clf.fit(x_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4, max_features=None,
                       max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                       splitter='best')

predict = clf.predict(x_value)
accuracy_score(y_value, predict)
0.9072669826224329

test = pd.get_dummies(test)

test_prediction = clf.predict(test)

submission = pd.DataFrame()

submission['ID'] = test['ID']

submission['subscribed'] = test_prediction

submission['subscribed'].replace(0, 'no', inplace=True)
submission['subscribed'].replace(1, 'yes', inplace=True)

submission.to_csv('submission.csv', header=True, index=False)
Chapter 3

Description of Modules
Module-1: Python for Data Science

Introduction to Python
Python is a high-level, general-purpose and very popular programming language. Python (currently Python 3) is used in web development, machine learning applications and many other cutting-edge technologies in the software industry. It is well suited to beginners, and also to experienced programmers coming from other languages such as C++ and Java.

Below are some facts about the Python programming language:

• Python is currently one of the most widely used multi-purpose, high-level programming languages.
• Python allows programming in Object-Oriented and Procedural paradigms.
• Python programs are generally shorter than equivalent programs in languages like Java. Programmers have to type relatively less, and the indentation requirement of the language keeps the code readable.
• Python is used by almost all tech-giant companies, such as Google, Amazon, Facebook, Instagram, Dropbox and Uber.
• The biggest strength of Python is its huge collection of standard and third-party libraries, which can be used for the following:
    • Machine Learning
    • GUI applications (like Kivy, Tkinter, PyQt etc.)
    • Web frameworks like Django (used by YouTube, Instagram, Dropbox)
    • Image processing (like OpenCV, Pillow)
    • Web scraping (like Scrapy, BeautifulSoup, Selenium)
    • Test frameworks
    • Multimedia
    • Scientific computing
    • Text processing and many more.
Understanding Operators
a. Arithmetic operators:
Arithmetic operators are used to perform mathematical operations like addition, subtraction, multiplication and division.

OPERATOR   DESCRIPTION
+          Addition: adds two operands
-          Subtraction: subtracts the second operand from the first
*          Multiplication: multiplies two operands
/          Division (float): divides the first operand by the second
//         Division (floor): divides the first operand by the second and rounds the result down
%          Modulus: returns the remainder when the first operand is divided by the second
**         Power: returns the first operand raised to the power of the second
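
The difference between the two division operators and the modulus is easiest to see on concrete numbers. The following is a minimal sketch; the operand values 17 and 5 are chosen only for illustration.

```python
# Illustrative operands; any two numbers would do
a, b = 17, 5

total = a + b             # addition
difference = a - b        # subtraction
product = a * b           # multiplication
quotient = a / b          # float division always returns a float
floor_quotient = a // b   # floor division rounds the result down
remainder = a % b         # remainder of the division
power = a ** 2            # a raised to the power 2

# Floor division and modulus are linked: a == (a // b) * b + (a % b)
print(quotient, floor_quotient, remainder, power)  # 3.4 3 2 289
```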


b. Relational Operators:
Relational operators compare two values and return either True or False according to the condition.

OPERATOR   DESCRIPTION                                                                          SYNTAX
>          Greater than: True if the left operand is greater than the right                     x > y
<          Less than: True if the left operand is less than the right                           x < y
==         Equal to: True if both operands are equal                                            x == y
!=         Not equal to: True if the operands are not equal                                     x != y
>=         Greater than or equal to: True if the left operand is greater than or equal to the right   x >= y
<=         Less than or equal to: True if the left operand is less than or equal to the right   x <= y
c. Logical operators:
Logical operators perform Logical AND, Logical OR and Logical NOT operations.

OPERATOR   DESCRIPTION                                         SYNTAX
and        Logical AND: True if both the operands are true     x and y
or         Logical OR: True if either of the operands is true  x or y
not        Logical NOT: True if the operand is false           not x
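
Relational and logical operators are usually combined in one condition. A short sketch follows; the variable names and the threshold values are assumptions made for the example.

```python
# Illustrative values (assumptions, not taken from the training data)
age = 31
balance = 116

is_adult = age >= 18                  # a relational operator returns a bool
has_funds = balance > 1000
eligible = is_adult and has_funds     # True only if both are True
contactable = is_adult or has_funds   # True if at least one is True
not_adult = not is_adult              # inverts the truth value

print(is_adult, has_funds, eligible, contactable, not_adult)
# True False False True False
```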

d. Bitwise operators:
Bitwise operators act on bits and perform bit-by-bit operations.

OPERATOR   DESCRIPTION           SYNTAX
&          Bitwise AND           x & y
|          Bitwise OR            x | y
~          Bitwise NOT           ~x
^          Bitwise XOR           x ^ y
>>         Bitwise right shift   x >> y
<<         Bitwise left shift    x << y
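
Writing the operands in binary makes the bit-by-bit behaviour visible. This is a small sketch with two arbitrarily chosen numbers.

```python
x, y = 0b1100, 0b1010   # 12 and 10 written in binary

and_bits = x & y        # bits set in both operands
or_bits = x | y         # bits set in either operand
xor_bits = x ^ y        # bits set in exactly one operand
inverted = ~x           # ~x equals -(x + 1)
shifted_right = x >> 2  # shift right by two bits (floor-divide by 4)
shifted_left = x << 2   # shift left by two bits (multiply by 4)

print(bin(and_bits), bin(or_bits), bin(xor_bits))  # 0b1000 0b1110 0b110
print(inverted, shifted_right, shifted_left)       # -13 3 48
```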


e. Assignment operators:
Assignment operators are used to assign values to variables.

OPERATOR   DESCRIPTION                                                                                        SYNTAX      EQUIVALENT TO
=          Assign the value of the right-side expression to the left-side operand                            x = y + z
+=         Add AND: add the right operand to the left operand, then assign to the left operand               a += b      a = a + b
-=         Subtract AND: subtract the right operand from the left operand, then assign to the left operand   a -= b      a = a - b
*=         Multiply AND: multiply the left operand by the right operand, then assign to the left operand     a *= b      a = a * b
/=         Divide AND: divide the left operand by the right operand, then assign to the left operand         a /= b      a = a / b
%=         Modulus AND: take the modulus of the left and right operands, then assign to the left operand     a %= b      a = a % b
//=        Divide (floor) AND: floor-divide the left operand by the right operand, then assign to the left operand   a //= b     a = a // b
**=        Exponent AND: raise the left operand to the power of the right operand, then assign to the left operand   a **= b     a = a ** b
&=         Perform bitwise AND on the operands and assign the value to the left operand                      a &= b      a = a & b
|=         Perform bitwise OR on the operands and assign the value to the left operand                       a |= b      a = a | b
^=         Perform bitwise XOR on the operands and assign the value to the left operand                      a ^= b      a = a ^ b
>>=        Perform bitwise right shift on the operands and assign the value to the left operand              a >>= b     a = a >> b
<<=        Perform bitwise left shift on the operands and assign the value to the left operand               a <<= b     a = a << b
f. Special operators:
There are some special types of operators, such as:
Identity operators:
is and is not are the identity operators. Both are used to check whether two values are located in the same part of memory. Two variables that are equal are not necessarily identical.
is        True if the operands are identical
is not    True if the operands are not identical

Membership operators:
in and not in are the membership operators; they are used to test whether a value or variable is in a sequence.
in        True if the value is found in the sequence
not in    True if the value is not found in the sequence
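
The distinction between equal and identical can be sketched with two lists that hold the same values; the list contents are illustrative.

```python
first = [1, 2, 3]
second = [1, 2, 3]   # equal contents, but a separate list object
alias = first        # a second name for the same object

equal = first == second          # True: the values are equal
identical = first is second      # False: they are two different objects
same_object = first is alias     # True: both names refer to one object

found = 2 in first               # True
missing = 5 not in first         # True

print(equal, identical, same_object, found, missing)
# True False True True True
```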

g. Precedence and Associativity of Operators:
Operator precedence and associativity determine the order in which the operators of an expression are evaluated.
Operator Precedence:
When an expression contains more than one operator with different precedence, precedence determines which operation is performed first.
Operator Associativity:
If an expression contains two or more operators with the same precedence, associativity determines the order of evaluation. It can be either left to right or right to left.

OPERATOR           DESCRIPTION                                                        ASSOCIATIVITY
()                 Parentheses                                                        left-to-right
**                 Exponent                                                           right-to-left
* / %              Multiplication/division/modulus                                    left-to-right
+ -                Addition/subtraction                                               left-to-right
<< >>              Bitwise shift left, bitwise shift right                            left-to-right
< <= > >= == !=    Relational: less than, less than or equal to, greater than,       left-to-right
                   greater than or equal to, equal to, not equal to
                   (in Python these all share one precedence level)
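
A few one-line expressions show how precedence and associativity change a result. The numbers are arbitrary.

```python
by_precedence = 2 + 3 * 4       # * binds tighter than +, so 2 + 12
with_parens = (2 + 3) * 4       # parentheses are evaluated first
right_to_left = 2 ** 3 ** 2     # ** groups right-to-left: 2 ** (3 ** 2)
left_to_right = 20 - 5 - 3      # - groups left-to-right: (20 - 5) - 3
shift_after_add = 1 << 2 + 1    # + binds tighter than <<, so 1 << 3

print(by_precedence, with_parens, right_to_left, left_to_right, shift_after_add)
# 14 20 512 12 8
```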

Variables and Data Types


Variables:
a. Python Variables Naming Rules:
There are certain rules for what you can name a variable (the name is called an identifier).
• Python variables can only begin with a letter (A-Z/a-z) or an underscore (_).
• The rest of the identifier may contain letters (A-Z/a-z), underscores (_), and digits (0-9).
• Python is case-sensitive, and so are Python identifiers. Name and name are two different identifiers.
b. Assigning and Reassigning Python Variables:
• To assign a value to a Python variable, you don’t need to declare its type.
• You name it according to the rules stated in section a, and type the value after the equal sign (=).
• The identifier must be on the left-hand side of the equal sign; you can’t write the value first, as in 7 = age.
• You also can’t use a keyword as a variable name.
c. Multiple Assignments:
• You can assign values to multiple Python variables in one statement.
• You can assign the same value to multiple Python variables.
d. Deleting Variables:
• You can also delete Python variables using the keyword ‘del’.
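
The assignment rules above can be sketched in a few lines. All names and values here are illustrative.

```python
age = 31                 # no type declaration is needed
age = "thirty-one"       # the same name can be rebound to a value of another type

x, y, z = 1, 2.5, "three"   # several values in one statement
a = b = c = 0               # the same value for several names

del age                  # remove the name
try:
    age
except NameError:
    deleted = True       # using a deleted name raises NameError
```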
Data Types:
A. Python Numbers:
There are four numeric Python data types, of which three remain in Python 3.
a. int
int stands for integer. This Python data type holds signed integers. We can use the type() function to find which class a value belongs to.
b. float
This Python data type holds floating-point real values. An int can only store the number 3, but a float can store 3.25 if you want.
c. long
This Python data type held a long integer of unlimited length in Python 2. The construct does not exist in Python 3.x, where int itself has unlimited length.
d. complex
This Python data type holds a complex number. A complex number looks like this: a+bj. Here, a is the real part, b is the imaginary part, and j is the imaginary unit.
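
A brief sketch of the numeric types that exist in Python 3, with values picked only for the example:

```python
whole = 3
real = 3.25
big = 2 ** 100            # Python 3 ints have no fixed size limit
z = 2 + 3j                # complex number: real part 2, imaginary part 3

print(type(whole).__name__, type(real).__name__, type(z).__name__)
# int float complex
print(z.real, z.imag)     # 2.0 3.0
print(big)                # 1267650600228229401496703205376
```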
B. Strings:
A string is a sequence of characters. Unlike Java, Python does not have a char data type. You can delimit a string using single quotes or double quotes.
a. Spanning a String across Lines:
To span a string across multiple lines, you can use triple quotes.
b. Displaying Part of a String:
You can display a character from a string using its index in the string. Remember, indexing starts with 0.
c. String Formatters:
String formatters allow us to print characters and values at once. You can use the % operator.
d. String Concatenation:
You can concatenate (join) strings using the + operator. However, you cannot concatenate a string with a value of a different type, such as an int, without converting it first.
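
The string operations described above, sketched on an example string chosen for illustration:

```python
name = "Data Science"
multi_line = """first line
second line"""                  # triple quotes span several lines

first_char = name[0]            # indexing starts at 0
part = name[5:]                 # slicing from index 5 to the end
message = "%s has %d characters" % (name, len(name))   # % formatter
joined = name + " " + str(2023)  # a number must be converted before joining

print(first_char, part)   # D Science
print(message)            # Data Science has 12 characters
print(joined)             # Data Science 2023
```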
C. Python Lists:
A list is a collection of values, which may be of different types. To define a list, put the values, separated by commas, in square brackets. You don’t need to declare a type for a list either.
a. Slicing a List
You can slice a list the way you’d slice a string, with the slicing operator. Indexing for a list begins with 0, as for a string. Python has no built-in array type like C or Java, so lists are generally used in its place.
b. Length of a List
The built-in len() function returns the length of a list.
c. Reassigning Elements of a List
A list is mutable. This means that you can reassign elements later on.
d. Iterating on the List
To iterate over a list we can use the for loop. By iterating, we can access each element one by one, which is very helpful when we need to perform some operation on each element of the list.
e. Multidimensional Lists
A list may have more than one dimension. For a detailed look at this, see DataFlair’s tutorial on Python Lists.
D. Python Tuples:
A tuple is like a list. You declare it using parentheses instead.
a. Accessing and Slicing a Tuple
You access a tuple the same way as you’d access a list. The same goes for
slicing it.
b. A tuple is Immutable
Python tuple is immutable. Once declared, you can’t change its size or elements.
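
The contrast between a mutable list and an immutable tuple can be sketched as follows. The values are illustrative.

```python
values = [10, "ten", 10.0]     # a list may mix types
values[1] = "TEN"              # lists are mutable
first_two = values[0:2]        # slicing returns a new list

point = (3, 4, 5)              # a tuple uses parentheses
rest = point[1:]               # tuples are indexed and sliced like lists

try:
    point[0] = 99              # tuples are immutable
except TypeError as err:
    tuple_error = str(err)

matrix = [[1, 2], [3, 4]]      # a two-dimensional list
print(first_two, len(values))  # [10, 'TEN'] 3
print(point[0], rest)          # 3 (4, 5)
print(matrix[1][0])            # 3
```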
E. Dictionaries:
A dictionary holds key-value pairs. Declare it in curly braces, with pairs separated by commas. Separate keys and values by a colon (:). The type() function works with dictionaries too.
a. Accessing a Value
To access a value, you mention the key in square brackets.
b. Reassigning Elements
You can reassign a value to a key.
c. List of Keys
Use the keys() method to get the keys in the dictionary.
F. Bool:
A Boolean value can be True or False.
G. Sets:
A set is a collection of values. Define it using curly braces. It keeps only one instance of any value that is present more than once. A set is unordered, so it doesn’t support indexing. It is, however, mutable: you can add or remove elements with the add() and remove() methods.
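
A short sketch of dictionary and set operations. The keys and values are made up for the example.

```python
client = {"age": 31, "job": "technician"}
client["age"] = 32              # reassign the value of an existing key
client["marital"] = "single"    # add a new key-value pair
keys = list(client.keys())

months = {"nov", "jun", "feb", "nov"}   # the duplicate 'nov' is stored once
months.add("may")
months.remove("feb")

print(client["job"], keys)          # technician ['age', 'job', 'marital']
print(len(months), "nov" in months) # 3 True
```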
H. Type Conversion:
Since Python is dynamically-typed, you may want to convert a value into
another type. Python supports a list of functions for the same.
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
Conditional Statements
a. if statements
The if statement is one of the most commonly used conditional statements in most programming languages. It decides whether certain statements need to be executed or not. The if statement evaluates a Boolean expression and executes the block of code inside the if block only when that expression is True.
Syntax:
if (Boolean expression):
    Block of code #Set of statements to execute if the condition is true

b. if-else statements
The if-else statement evaluates a Boolean expression. If the condition is True, the statements inside the if block are executed; if the condition is False, the statements inside the else block are executed. The else block is where you perform the actions needed when the condition is not true.
Syntax:
if (Boolean expression):
    Block of code #Set of statements to execute if condition is true
else:
    Block of code #Set of statements to execute if condition is false

c. elif statements
In Python, we have one more conditional statement, the elif statement. An elif condition is checked only if the preceding if condition is false. It is similar to an if-else statement; the only difference is that else does not check a condition, whereas elif does. This allows elif statements to evaluate multiple conditions.
Syntax:
if (condition):
    #Set of statements to execute if condition is true
elif (condition):
    #Set of statements to be executed when if condition is false and elif condition is true
else:
    #Set of statements to be executed when both if and elif conditions are false

d. Nested if-else statements
Nested if-else statements mean that an if statement or if-else statement is present inside another if or if-else block. Python provides this feature as well; this in turn helps us to check multiple conditions in a given program. An if statement can be present inside another if statement, which is itself inside another if statement, and so on.
Nested if Syntax:
if (condition):
    #Statements to execute if condition is true
    if (condition):
        #Statements to execute if condition is true
    #end of nested if
#end of if
Nested if-else Syntax:
if (condition):
    #Statements to execute if condition is true
    if (condition):
        #Statements to execute if condition is true
    else:
        #Statements to execute if condition is false
else:
    #Statements to execute if condition is false

e. elif Ladder
We have seen the elif statement, but what is an elif ladder? As the name suggests, it is a program that contains a ladder of elif statements, that is, elif statements structured in the form of a ladder. It is used to test multiple expressions.
Syntax:
if (condition):
    #Set of statements to execute if condition is true
elif (condition):
    #Set of statements to be executed when if condition is false and elif condition is true
elif (condition):
    #Set of statements to be executed when both if and first elif condition are false and second elif condition is true
elif (condition):
    #Set of statements to be executed when if, first elif and second elif conditions are false and third elif condition is true
else:
    #Set of statements to be executed when all if and elif conditions are false
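
The syntax templates above can be turned into a small runnable sketch. The function name age_group and its cut-off values are hypothetical, chosen only to show an elif ladder with a nested if-else.

```python
def age_group(age):
    """Return a label for an age (the cut-off values are illustrative)."""
    if age < 0:
        return "invalid"
    elif age < 18:
        return "minor"
    elif age < 60:
        if age < 30:            # nested if-else inside an elif branch
            return "young adult"
        else:
            return "adult"
    else:
        return "senior"

print(age_group(31))   # adult
```

Only the first branch whose condition is true runs, so the order of the conditions matters.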
Looping Constructs
Loops:
a. while loop:
Repeats a statement or group of statements while a given condition is True. It tests the condition before executing the loop body.
Syntax:
while expression:
    statement(s)
b. for loop:
Executes a sequence of statements multiple times and abbreviates the code that manages the loop variable.
Syntax:
for iterating_var in sequence:
    statement(s)
c. nested loops:
You can use one or more loops inside another while or for loop. (Python has no do..while loop.)
Syntax of nested for loop:
for iterating_var in sequence:
    for iterating_var in sequence:
        statement(s)
    statement(s)
Syntax of nested while loop:
while expression:
    while expression:
        statement(s)
    statement(s)
Loop Control Statements:
a. break statement:
Terminates the loop statement and transfers execution to the statement immediately following the loop.
b. continue statement:
Causes the loop to skip the remainder of its body and immediately retest its condition prior to reiterating.
c. pass statement:
The pass statement in Python is used when a statement is required syntactically but you do not want any command or code to execute.
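
The loops and loop control statements can be sketched together on a small list. The data is illustrative.

```python
numbers = [4, 7, -1, 10, 0, 3]   # illustrative data

# for loop with continue and break
positives = []
for n in numbers:
    if n < 0:
        continue        # skip negative values
    if n == 0:
        break           # stop at the first zero
    positives.append(n)
print(positives)        # [4, 7, 10]

# while loop: the condition is tested before each pass
countdown = []
i = 3
while i > 0:
    countdown.append(i)
    i -= 1
print(countdown)        # [3, 2, 1]

# nested loops, with pass as a placeholder statement
pairs = []
for row in range(2):
    for col in range(2):
        if row == col:
            pass        # nothing to do here yet
        pairs.append((row, col))
print(pairs)            # [(0, 0), (0, 1), (1, 0), (1, 1)]
```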
Functions
A. Built-in Functions or pre-defined functions:
These are the functions which are already defined by Python. For example: id(), type(), print(), etc.
B. User-Defined Functions:
These are functions that are defined by the users for simplicity and to avoid repetition of code. They are defined using the def keyword.
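
A minimal sketch of a user-defined function used alongside built-in ones. The helper name fahrenheit and its ndigits parameter are hypothetical.

```python
def fahrenheit(celsius, ndigits=1):
    """Convert a temperature from Celsius to Fahrenheit (hypothetical helper)."""
    return round(celsius * 9 / 5 + 32, ndigits)

readings = [0, 37, 100]
converted = []
for c in readings:
    converted.append(fahrenheit(c))   # reuse the function instead of repeating the formula

print(converted)                  # [32.0, 98.6, 212.0]
print(type(converted).__name__)   # the built-in type() reports: list
```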
Data Structure
Python has built-in support for data structures which enable you to store and access data. These structures are called List, Dictionary, Tuple and Set.
Lists
Lists in Python are the most versatile data structure. They are used to store
heterogeneous data items, from integers to strings or even another list! They
are also mutable, which means that their elements can be changed even after
the list is created.
Creating Lists
Lists are created by enclosing elements within [square] brackets and each
item is separated by a comma.
Creating lists in Python
Since each element in a list has its own distinct position, having duplicate values in a list is not a problem.
Accessing List elements
To access elements of a list, we use Indexing. Each element in a list has an
index related to it depending on its position in the list. The first element of
the list has the index 0, the next element has index 1, and so on. The last
element of the list has an index of one less than the length of the list.
Indexing in Python lists
While positive indexes return elements from the start of the list, negative indexes return values from the end of the list. This saves us from the trivial calculation which we would otherwise have to perform if we wanted to return the nth element from the end of the list. So instead of trying to return the List_name[len(List_name)-1] element, we can simply write List_name[-1].

Using negative indexes, we can return the nth element from the end of the
list easily. If we wanted to return the first element from the end, or the last
index, the associated index is -1. Similarly, the index for the second last
element will be -2, and so on. Remember, the 0th index will still refer to the
very first element in the list.
Appending values in Lists
We can add new elements to an existing list using the append() or insert() methods:
append() – Adds an element to the end of the list
insert() – Adds an element at a specific position in the list, which needs to be specified along with the value
Removing elements from Lists
Removing elements from a list is as easy as adding them and can be done using the remove() or pop() methods:
remove() – Removes the first occurrence in the list that matches the given value
pop() – This is used when we want to remove an element at a specified index from the list. However, if we don’t provide an index value, the last element will be removed from the list.
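
Indexing, adding and removing can be sketched on one small list. The job names are illustrative values.

```python
jobs = ["admin", "services", "technician"]   # illustrative values

first, last_item, second_last = jobs[0], jobs[-1], jobs[-2]
print(first, last_item, second_last)   # admin technician services

jobs.append("management")     # add at the end
jobs.insert(1, "student")     # add at index 1
print(jobs)   # ['admin', 'student', 'services', 'technician', 'management']

jobs.remove("services")       # remove the first matching value
popped_last = jobs.pop()      # remove and return the last element
popped_at_1 = jobs.pop(1)     # remove and return the element at index 1
print(popped_last, popped_at_1, jobs)   # management student ['admin', 'technician']
```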
Sorting Lists
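A list can be sorted in place with the sort() method, or a sorted copy can be obtained with the built-in sorted() function. When the list contains strings, two strings are compared by the integer values of their characters, starting from the beginning. If the characters at the same position are equal, the next characters are compared, until two differing characters are found.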
Concatenating Lists
We can even concatenate two or more lists by simply using the + symbol. This will return a new list containing elements from both the lists.
List comprehensions
A very interesting application of Lists is List comprehension which provides
a neat way of creating new lists. These new lists are created by applying an
operation on each element of an existing list. It will be easy to see their
impact if we first check out how it can be done using the good old for-loops.
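
The comparison with the for loop can be sketched as follows, together with the sorting and concatenation described above. The numbers and names are illustrative.

```python
balances = [1933, 3, 891, 3287, 119]   # illustrative values

# for-loop version
doubled_loop = []
for b in balances:
    doubled_loop.append(b * 2)

# list comprehension version of the same operation
doubled = [b * 2 for b in balances]

# a comprehension can also filter
large = [b for b in balances if b > 500]

names = sorted(["services", "Admin", "admin"])   # compared character by character
combined = [1, 2] + [3, 4]                       # concatenation returns a new list

print(doubled == doubled_loop, large)   # True [1933, 891, 3287]
print(names, combined)                  # ['Admin', 'admin', 'services'] [1, 2, 3, 4]
```

Uppercase letters have smaller character values than lowercase ones, which is why 'Admin' sorts before 'admin'.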
Stacks & Queues using Lists
A list is an in-built data structure in Python. But we can use it to create user-
defined data structures. Two very popular user-defined data structures built
using lists are Stacks and Queues.
Stacks are a list of elements in which the addition or deletion of elements is
done from the end of the list. Think of it as a stack of books. Whenever you
need to add or remove a book from the stack, you do it from the top. It uses
the simple concept of Last-In-First-Out.
Queues, on the other hand, are a list of elements in which the addition of elements takes place at the end of the list, but the deletion of elements takes place from the front of the list. You can think of it as a queue in the real world. The queue becomes shorter when people at the front exit the queue. The queue becomes longer when someone new joins the queue at the end. It uses the concept of First-In-First-Out.
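
A minimal sketch of both structures. The stack uses a plain list as described above. For the queue, collections.deque from the standard library is swapped in here, because removing from the front of a plain list with pop(0) has to shift every remaining element; a list would give the same result on small data.

```python
from collections import deque

# Stack: add and remove at the same end (Last-In-First-Out)
stack = []
stack.append("book 1")
stack.append("book 2")
stack.append("book 3")
top = stack.pop()            # 'book 3' was added last, so it leaves first

# Queue: add at the end, remove from the front (First-In-First-Out)
queue = deque()
queue.append("person 1")
queue.append("person 2")
queue.append("person 3")
front = queue.popleft()      # 'person 1' arrived first, so it leaves first

print(top, front)            # book 3 person 1
```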
Module-2: Understanding the Statistics for Data Science

Introduction to Statistics
Statistics simply means numerical data, and it is the field of mathematics that deals with the collection, tabulation, and interpretation of numerical data. It is a form of mathematical analysis that uses quantitative models to describe a set of experimental data or real-life studies. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. Statistics deals with how data can be used to solve complex problems. Some people consider statistics to be a distinct mathematical science rather than a branch of mathematics. Statistics makes work easier and provides a clear picture of the work you do on a regular basis.
Basic terminology of Statistics:
Population –
It is the collection of individuals, objects or events whose properties are to be analyzed.
Sample –
It is a subset of a population.
Types of Statistics :
Measures of Central Tendency
(i) Mean :
It is the average of all the values in a sample set.
For example, the mean of 2, 4 and 9 is (2 + 4 + 9) / 3 = 5.

(ii) Median:
It is the central value of a sample set. The data set is ordered from the lowest to the highest value, and the exact middle value is taken.
For example, the median of 7, 1 and 5 is 5, since the ordered values are 1, 5, 7.
(iii) Mode:
It is the value that occurs most frequently in a sample set.
For example, the mode of 2, 3, 3 and 8 is 3.
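
The three measures can also be computed with Python's standard statistics module. This is a sketch on a small sample whose values are made up for illustration.

```python
import statistics

ages = [56, 31, 27, 57, 31]   # illustrative sample

mean_age = statistics.mean(ages)       # (56 + 31 + 27 + 57 + 31) / 5
median_age = statistics.median(ages)   # middle of the sorted values 27, 31, 31, 56, 57
mode_age = statistics.mode(ages)       # most frequent value

print(mean_age, median_age, mode_age)  # 40.4 31 31
```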

Understanding the spread of data


Measures of variability, also known as measures of dispersion, are used to describe the variability in a sample or population. In statistics, there are three common measures of variability, as shown below:
(i) Range :
It is a measure of how far apart the values in a sample set or data set are spread.
Range = Maximum value - Minimum value
(ii) Variance:
It describes how much a random variable differs from its expected value, and it is computed as the average of the squared deviations from the mean.
$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
In this formula, n represents the total number of data points, $\bar{x}$ represents the mean of the data points, and $x_i$ represents the individual data points.
(iii) Standard Deviation:
It is a measure of the dispersion of a set of data from its mean, and it is the square root of the variance.
$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
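
The formulas above divide by n, so they correspond to the population variance and standard deviation. A minimal sketch on made-up data, checked against the pvariance and pstdev functions of the standard statistics module:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data set

value_range = max(data) - min(data)          # 9 - 2
mean = sum(data) / len(data)                 # 40 / 8

# variance: average of the squared deviations from the mean (divides by n)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)

print(value_range, variance, std_dev)        # 7 4.0 2.0

# the standard library gives the same population values
assert statistics.pvariance(data) == variance
assert statistics.pstdev(data) == std_dev
```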

Data Distribution
Terms related to Exploration of Data Distribution
-> Boxplot
-> Frequency Table
-> Histogram
-> Density Plot
Boxplot : It is based on the percentiles of the data. The top and bottom of the box are the 75th and 25th percentiles of the data. The extended lines, known as whiskers, cover the range of the rest of the data.
Code – Boxplot

# BoxPlot Population In Millions
# data is assumed to be a pandas DataFrame with a PopulationInMillions column
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax1 = plt.subplots()
fig.set_size_inches(9, 15)

# passing the column as y draws a vertical boxplot on the existing axes
ax1 = sns.boxplot(y = data.PopulationInMillions, ax = ax1)

ax1.set_ylabel("Population by State in Millions", fontsize = 15)
ax1.set_title("Population - BoxPlot", fontsize = 20)

Frequency Table: It is a tool that distributes the data into equally spaced ranges (segments) and tells us how many values fall in each segment.
Histogram: It is a way of visualizing a data distribution through a frequency table, with bins on the x-axis and data counts on the y-axis.
Code – Histogram

# Histogram Population In Millions
fig, ax2 = plt.subplots()
fig.set_size_inches(9, 15)

# histplot replaces the deprecated distplot(..., kde = False)
ax2 = sns.histplot(x = data.PopulationInMillions, kde = False, ax = ax2)

ax2.set_ylabel("Frequency", fontsize = 15)
ax2.set_xlabel("Population by State in Millions", fontsize = 15)
ax2.set_title("Population - Histogram", fontsize = 20)

Density Plot: It is related to the histogram, as it shows the data values distributed as a continuous line. It is a smoothed version of the histogram. The code below superposes the density plot over the histogram.
Code – Density Plot for the data

# Density Plot - Population
fig, ax3 = plt.subplots()
fig.set_size_inches(7, 9)

# stat = "density" scales the bars so that the smoothed curve (kde) fits over them
ax3 = sns.histplot(x = data.Population, kde = True, stat = "density", ax = ax3)

ax3.set_ylabel("Density", fontsize = 15)
ax3.set_xlabel("Population", fontsize = 15)
ax3.set_title("Density Plot - Population", fontsize = 20)

Introduction to Probability
Probability refers to the extent to which an event is likely to occur. When an experiment such as throwing a ball or picking a card from a deck is performed, there must be some probability associated with each of its events.
In terms of mathematics, probability refers to the ratio of wanted outcomes to the total number of possible outcomes. There are three approaches to the theory of probability, namely:
1. Empirical Approach
2. Classical Approach
3. Axiomatic Approach
In this section, we study the Axiomatic Approach. In this approach, we represent the probability in terms of the sample space (S) and other terms.
Basic Terminologies:
Random Event – If an experiment is repeated several times
under similar conditions and does not produce the same
outcome every time, but the outcome of each trial is one of
several possible outcomes, then such an experiment is called
a random event or a probabilistic event (it is also commonly
called a random experiment).
Elementary Event – An elementary event is a single outcome
of a random event. Each outcome that can result when the
random event is performed is known as an elementary event.
Sample Space – The sample space is the set of all possible
outcomes of a random event. For example, when a coin is
tossed, the possible outcomes are head and tail.
Event – An event is a subset of the sample space associated
with a random event.
Occurrence of an Event – An event associated with a random
event is said to occur if any one of the elementary events
belonging to it is the outcome.
Sure Event – An event associated with a random event is said
to be a sure event if it always occurs whenever the random
event is performed.
Impossible Event – An event associated with a random event is
said to be an impossible event if it never occurs when the
random event is performed.
Compound Event – An event associated with a random event is
said to be a compound event if it is the disjoint union of two
or more elementary events.
Mutually Exclusive Events – Two or more events associated
with a random event are said to be mutually exclusive if the
occurrence of any one of them prevents the occurrence of all
the others. This means that no two of them can occur
simultaneously.
Exhaustive Events – Two or more events associated with a
random event are said to be exhaustive events if their union
is the sample space.
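The terms above can be illustrated with Python sets. The following is only a minimal sketch: it assumes one roll of a six-sided die as the random event, and the set and function names are illustrative choices, not part of the training material.

```python
# Sample space for one roll of a six-sided die (assumed example).
sample_space = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space.
even = {2, 4, 6}
odd = {1, 3, 5}
at_most_two = {1, 2}


def occurred(event, outcome):
    """An event occurs if the observed outcome is one of its elementary events."""
    return outcome in event


def mutually_exclusive(a, b):
    """Two events are mutually exclusive if they share no outcome."""
    return a.isdisjoint(b)


def exhaustive(*events):
    """Events are exhaustive if their union is the whole sample space."""
    return set().union(*events) == sample_space


print(occurred(even, 4))                      # True
print(mutually_exclusive(even, odd))          # True
print(mutually_exclusive(even, at_most_two))  # False, both contain 2
print(exhaustive(even, odd))                  # True
print(exhaustive(even, at_most_two))          # False, 3 and 5 are missing
```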
Probability of an Event – If there are a total of p equally likely possible
outcomes associated with a random experiment and q of them are favourable to
the event A, then the probability of event A is denoted by P(A) and is given by
P(A) = q/p
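As a worked instance of P(A) = q/p, the sketch below counts favourable and total outcomes for a die roll. It assumes that all outcomes are equally likely; the function name classical_probability is a hypothetical helper introduced here for illustration.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """Return P(A) = q / p, assuming all outcomes are equally likely."""
    favourable = len(event & sample_space)  # q: outcomes favourable to A
    total = len(sample_space)               # p: all possible outcomes
    return Fraction(favourable, total)


sample_space = set(range(1, 7))  # one roll of a fair die (assumed example)
even = {2, 4, 6}

p_even = classical_probability(even, sample_space)
print(p_even)         # 1/2
print(float(p_even))  # 0.5
```

Using Fraction keeps the result exact, so the ratio q/p can be read directly from the output.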
Probabilities of Discrete and Continuous Variables
A random variable is a function that maps the sample space to the set of real
numbers. Its purpose is to give an idea about the result of a particular
situation where we are given the probabilities of the different outcomes.

Discrete Random Variable:
A random variable X is said to be discrete if it takes on a finite (or
countably infinite) number of values. The probability function associated
with it is called the PMF (probability mass function).
P(xi) = Probability that X = xi = PMF of X = pi, where
1. 0 ≤ pi ≤ 1.
2. ∑pi = 1, where the sum is taken over all possible values of x.
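The two PMF conditions can be checked on a small example. This sketch assumes X is the sum of two fair dice, which is not an example from the training material; is_valid_pmf is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

# Build the PMF of X = sum of two fair dice (assumed example).
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)


def is_valid_pmf(pmf):
    """Check 0 <= pi <= 1 for every value and that all pi sum to 1."""
    in_range = all(0 <= p <= 1 for p in pmf.values())
    sums_to_one = sum(pmf.values()) == 1
    return in_range and sums_to_one


print(pmf[7])             # 1/6
print(is_valid_pmf(pmf))  # True
```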
Continuous Random Variable:
A random variable X is said to be continuous if it can take any value in an
interval, that is, an uncountably infinite number of values. The probability
function associated with it is called the PDF (probability density function).
PDF: If X is a continuous random variable, then
P(x < X < x + dx) = f(x)*dx, where
1. f(x) ≥ 0 for all x (a density may exceed 1, unlike a probability)
2. ∫ f(x) dx = 1 over all values of x
Then f(x) is said to be the PDF of the distribution.
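The PDF properties can be checked numerically. The sketch below assumes a standard normal density as the example distribution and uses a simple midpoint rule for integration; both choices, and the function names, are illustrative assumptions rather than part of the training material.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std. deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# The total area under the density should be very close to 1.
total = integrate(normal_pdf, -10.0, 10.0)

# A probability is the area over an interval, here P(-1 < X < 1).
p_within_one_sigma = integrate(normal_pdf, -1.0, 1.0)

# For a small dx, P(x < X < x + dx) is approximately f(x) * dx.
dx = 1e-4
p_small = integrate(normal_pdf, 0.5, 0.5 + dx, n=100)

print(round(total, 6))
print(round(p_within_one_sigma, 4))  # about 0.6827
print(math.isclose(p_small, normal_pdf(0.5) * dx, rel_tol=1e-3))
```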
Module-3: Predictive Modeling and Basics of Machine Learning

Introduction to Predictive Modeling


Predictive analytics involves analysing existing data sets with the goal of
identifying trends and patterns, which are then used to predict future
outcomes and performance. It is also known as prognostic analysis; the word
prognostic means predictive. Predictive analytics uses data, statistical
algorithms and machine learning techniques to estimate the probability of
future outcomes based on historical data.
Understanding the types of Predictive Models
Supervised learning
Supervised learning, as the name indicates, involves the presence of a
supervisor acting as a teacher. In supervised learning we teach or train the
machine using data that is well labeled, which means each training example is
already tagged with the correct answer. After that, the machine is given a
new set of examples (data), and the supervised learning algorithm uses what
it learned from the labeled training data to predict an outcome for each of
them.
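A minimal sketch of supervised learning is shown below, using a 1-nearest-neighbour rule as a stand-in for any supervised algorithm. The training values, the labels and the predict function are made up for illustration and are not taken from the training material.

```python
import math

# Labeled training data: (height_cm, weight_kg) -> label. Values are made up.
training_data = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]


def predict(features, training_data):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]


# New, unlabeled examples are classified using the labeled examples.
print(predict((152.0, 52.0), training_data))  # small
print(predict((182.0, 88.0), training_data))  # large
```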
Unsupervised learning
Unsupervised learning is the training of a machine using information that is
neither classified nor labeled, allowing the algorithm to act on that
information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns and differences
without any prior training on labeled data.
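Grouping unlabeled data by similarity can be sketched with a small k-means routine on one-dimensional values. This is an illustrative example under assumed data and starting centroids; k_means_1d is a hypothetical helper, and k-means is only one of many clustering techniques.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around the given starting centroids (plain k-means)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: put each value in the cluster of its nearest centroid.
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# No labels are given; the algorithm only sees the raw values.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])
print(clusters)
print(centroids)
```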
Stages of Predictive Models
Steps To Perform Predictive Analysis:
Some basic steps should be performed in order to perform predictive analysis.
1. Define Problem Statement:
Define the project outcomes, the scope of the effort, objectives; identify the
datasets that are going to be used.
2. Data Collection:
Data collection involves gathering the necessary details required for the analysis.
It involves the historical or past data from an authorized source over which
predictive analysis is to be performed.
3. Data Cleaning:
Data cleaning is the process in which we refine our data sets. In the process
of data cleaning, we remove unnecessary and erroneous data. It involves
removing redundant and duplicate data from our data sets.
4. Data Analysis:
It involves the exploration of data. We explore the data and analyze it
thoroughly in order to identify patterns or new outcomes from the data
set. In this stage, we discover useful information and conclude by
identifying some patterns or trends.
5. Build Predictive Model:
In this stage of predictive analysis, we use various algorithms to build
predictive models based on the patterns observed. It requires knowledge of
statistics and of tools such as Python, R and MATLAB. We also test our
hypothesis using standard statistical models.
6. Validation:
It is a very important step in predictive analysis. In this step, we check the
efficiency of our model by performing various tests. Here we provide
sample input sets to check the validity of our model. The model needs to be
evaluated for its accuracy in this stage.
7. Deployment:
In deployment, we make our model work in a real environment, where it is
available for use and helps in everyday decision making.
8. Model Monitoring:
Regularly monitor the models to check their performance and ensure that they
produce proper results. This means seeing how model predictions are
performing against actual data sets.
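The cleaning, model building and validation steps above can be sketched end to end in a few lines. The example below is only an illustration under stated assumptions: the data is synthetic (y = 2x + 1 plus small noise), the model is a simple least-squares line, and the function names clean, fit_line and mean_squared_error are hypothetical helpers, not the method used in the training.

```python
import random


def clean(rows):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


def fit_line(rows):
    """Build model: least-squares fit of y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, rows):
    """Validation: average squared gap between prediction and actual value."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in rows) / len(rows)


# Data collection (synthetic here): y = 2x + 1 plus small noise, with defects.
rng = random.Random(0)
raw = [(float(x), 2.0 * x + 1.0 + rng.uniform(-0.1, 0.1)) for x in range(20)]
raw += [raw[0], (3.0, None)]  # one duplicate row and one missing value

data = clean(raw)
rng.shuffle(data)
train, holdout = data[:15], data[15:]  # hold out 5 rows for validation

model = fit_line(train)
print("slope, intercept:", model)
print("validation MSE:", mean_squared_error(model, holdout))
```

Holding out rows that the model never saw during fitting is what makes the validation error a fair estimate of how the model would behave on new data; monitoring repeats the same comparison on data collected after deployment.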
Takeaways Of Training

Biggest Takeaways from My Data Science Summer Internship


1) Spend time working on your own
This is an important aspect that I myself should have handled better
in my internship. Take some time out to do pair programming with
friends and to discuss the code and/or the project, but also sit
down, put on your headphones and focus on writing some code yourself.
When I needed help, I looked things up on Google.
2) Learned different technologies and frameworks
I learned a lot of things, including different tools, frameworks and
languages, such as Python and Pandas.

Conclusion

Data science is one of the most innovative fields in the modern world.
It offers data-driven ways of tackling the challenges posed by
increasing demand and the need for a sustainable future. The need for
data scientists is expanding along with the importance of data science.
Data scientists are the world's future. A data scientist must therefore
be able to offer excellent solutions that address the problems in all
industries.

Future Scope

The future scope for data scientists is as follows:

• Data is being regularly collected by businesses and companies through
transactions and website interactions. Many companies face a common
challenge in analyzing and categorizing the data that is collected and
stored. A data scientist becomes the savior in a situation of mayhem
like this. Companies can progress a lot with proper and efficient
handling of data, which results in higher productivity.

• Data is generated by everyone on a daily basis, with and without our
notice. The interaction we have with data daily will only keep
increasing as time passes. In addition, the amount of data existing in
the world will increase at lightning speed. As data production rises,
the demand for data scientists will rise with it.
