COMPUTER VISION UNIT 4
Object Detection
Unit: 4
2 Architectures: Representation of a three-dimensional moving scene; convolutional layers, pooling layers, and padding; transfer learning and pre-trained models. Architecture design: LeNet-5, AlexNet, VGGNet, GoogLeNet, ResNet, EfficientNet, MobileNet. RNN introduction; the perceptron; backpropagation in CNNs and RNNs.
3 Segmentation: Popular image segmentation architectures; FCN architecture; upsampling methods; pixel transformations; geometric operations; spatial operations in image processing; instance segmentation; localisation; object detection and image segmentation using CNNs; LSTMs and GRUs. Vision models; vision languages; quality analysis; visual dialogue; other attention models; self-attention and transformers. Active contours and applications; split and merge; mean shift and mode finding; normalized cuts.
4 Object Detection: Object detection and sliding windows; R-CNN; Fast R-CNN; object recognition; 3-D vision and geometry; digital watermarking. Object detection; face recognition; instance recognition; category recognition of objects, scenes, and activities; object classification and detection. Encoder in code; decoder in code; U-Net code (encoder, decoder). Few-shot and zero-shot learning; self-supervised learning; adversarial robustness; pruning and model compression; neural architecture search; objects in scenes. YOLO; fundamentals of image formation; convolution and filtering.
5 Visualization and Deep Generative Models: Benefits of interpretability; Fashion-MNIST class activation map code walkthrough; Grad-CAM; ZFNet. Image compression methods and their requirements; statistical compression; spatial compression; contour coding. Deep generative models introduction; generative adversarial networks; combining VAEs and GANs; other VAE- and GAN-based deep generative models. GAN improvements; deep generative models across multiple domains; deep generative models for image and video applications.
Course Objective: To learn the key features of computer vision, and to design, implement, and continuously improve the accuracy and outcomes on various datasets, with more reliable and concise analysis results.
CO4: Design and deploy deep learning models with the help of use cases. (K5)
CO5: Understand and analyse different theories of deep learning using neural networks. (K3)
Computer Vision: CO-PO Mapping
[Table: each course outcome mapped to programme outcomes PO1-PO12 with correlation levels 1-3, plus an average row.]
B.TECH (SEM-VII) THEORY EXAMINATION 20__-20__
Computer Vision
Time: 3 Hours    Total Marks: 100
Note: 1. Attempt all Sections. If any required data is missing, choose it suitably.
SECTION A
1. Attempt all questions in brief. (2 x 10 = 20)
Q.No.   Question   Marks   CO
1                  2
2                  2
...                ...
10                 2
3. Attempt any one part of the following: (1 x 10 = 10)
Q.No.   Question   Marks   CO
1                  10
2                  10
4. Attempt any one part of the following: (1 x 10 = 10)
Q.No.   Question   Marks   CO
1                  10
2                  10
5. Attempt any one part of the following: (1 x 10 = 10)
Q.No.   Question   Marks   CO
1                  10
2                  10
6. Attempt any one part of the following: (1 x 10 = 10)
Q.No.   Question   Marks   CO
1                  10
2                  10
• R-CNN
• Fast R-CNN
• Object recognition
• 3-D vision and geometry
• Digital watermarking
• Object detection
• Face recognition
• Instance recognition
• Category recognition: objects, scenes, activities
• Object classification
Students should be able to create, design, and manipulate images, and have a working understanding of pixels.
Prerequisites
No prior experience with computer vision is assumed, although previous knowledge of visual computing or signal processing will be helpful (e.g., CSCI 1230). The following skills are necessary for this class:
• Math: Linear algebra, vector calculus, and probability. Linear algebra is the most important and is required.
• Data structures: You will write code that represents images as matrices, high-dimensional features, and geometric constructions.
• Programming and toolchains: A good working knowledge is expected. Intro CS is required, and an intermediate systems course is strongly encouraged.
Object Detection
Motivation
• Image searching
• CCTV surveillance
In addition to predicting the presence of an object within the region proposals, the algorithm also predicts four offset values that increase the precision of the bounding box. For example, given a region proposal, the algorithm might predict the presence of a person, but the face of that person within the region proposal could be cut in half. The offset values help adjust the bounding box of the region proposal.
CNN architecture of R-CNN
• These regions are then warped into square crops of the fixed size required by the CNN model. The CNN model used here is a pre-trained AlexNet, the state-of-the-art CNN for image classification at the time. Let's look at the AlexNet architecture here.
• However, one issue with the SVM training is that it requires AlexNet feature vectors, so AlexNet and the SVMs cannot be trained independently in parallel. This challenge is resolved in later versions of R-CNN (Fast R-CNN, Faster R-CNN, etc.).
• Bounding Box Regressor
• In order to precisely locate the bounding box in the image, we use a scale-invariant linear regression model called the bounding box regressor. To train this model, we take pairs of predicted and ground-truth values for the four dimensions of the localization.
• These dimensions are (x, y, w, h), where x and y are the pixel coordinates of the center of the bounding box, and w and h are the width and height of the bounding box. This method increases the mean average precision (mAP) of the result by 3-4%.
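To make the regressor's role concrete, here is a minimal NumPy sketch of applying predicted offsets (tx, ty, tw, th) to a proposal box. It follows the standard R-CNN-style parameterization; the function and variable names are illustrative.

```python
import numpy as np

def apply_box_deltas(box, deltas):
    """Apply R-CNN-style regression offsets (tx, ty, tw, th) to a
    proposal box given as (x, y, w, h), with (x, y) the box centre.
    A sketch of the standard parameterization; names are illustrative."""
    x, y, w, h = box
    tx, ty, tw, th = deltas
    # Shift the centre by a fraction of the proposal's width/height,
    # and scale width/height exponentially (deltas are log-space ratios).
    new_x = x + w * tx
    new_y = y + h * ty
    new_w = w * np.exp(tw)
    new_h = h * np.exp(th)
    return np.array([new_x, new_y, new_w, new_h])

# Example: nudge a proposal slightly right and make it 10% taller.
print(apply_box_deltas((100, 100, 50, 80), (0.1, 0.0, 0.0, np.log(1.1))))
```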
• From the convolutional feature map, we identify the region of proposals and warp them into
squares and by using a RoI pooling layer we reshape them into a fixed size so that it can be
fed into a fully connected layer. From the RoI feature vector, we use a softmax layer to
predict the class of the proposed region and also the offset values for the bounding box.
• The reason “Fast R-CNN” is faster than R-CNN is because you don’t have to feed 2000
region proposals to the convolutional neural network every time. Instead, the convolution
operation is done only once per image and a feature map is generated from it.
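As an illustration of the RoI pooling step described above, the following sketch uses torchvision's roi_pool on a dummy feature map; the tensor sizes and box coordinates are made up for the example.

```python
import torch
from torchvision.ops import roi_pool

# One image's feature map: (batch=1, channels=256, H=50, W=50),
# produced once per image by the backbone CNN.
features = torch.randn(1, 256, 50, 50)

# Two RoIs in feature-map coordinates: (batch_idx, x1, y1, x2, y2).
rois = torch.tensor([[0.,  4.,  4., 24., 30.],
                     [0., 10., 12., 40., 44.]])

# RoI pooling reshapes each variable-sized region to a fixed 7x7 grid,
# so the result can be fed into fully connected layers.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```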
• Faster R-CNN
What knowledge do you use to analyze this image?
What objects are shown in this image?
How can you estimate distance from the camera? What feature changes with distance?
3D Shape from X
• shading, silhouette, texture (mainly research)
• stereo, light striping, motion (used in practice)
Perspective in 2D (Simplified)
[Figure: a 3D object point P = (xc, yc, zc) projects along a ray through the focal point F onto the image plane at P' = (xi, yi, f); f is the focal length and the z-axis is the optical axis. Here camera coordinates equal world coordinates (zw = zc).]
By similar triangles:
xi / f = xc / zc   =>   xi = (f / zc) xc
yi / f = yc / zc   =>   yi = (f / zc) yc
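These similar-triangle relations are easy to check numerically; a small Python sketch (function name illustrative):

```python
def project(point_cam, f):
    """Pinhole projection of a 3D point (xc, yc, zc), given in camera
    coordinates, onto the image plane at focal length f, using
    xi = (f/zc)*xc and yi = (f/zc)*yc."""
    xc, yc, zc = point_cam
    return (f / zc) * xc, (f / zc) * yc

# A point twice as far away projects half as large:
print(project((2.0, 1.0, 10.0), f=0.05))  # (0.01, 0.005)
print(project((2.0, 1.0, 20.0), f=0.05))  # (0.005, 0.0025)
```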
3D from Stereo
[Figure: a 3D point P = (x, z) viewed by two cameras separated by baseline b; the y-axis is perpendicular to the page.]
From similar triangles:
x / xl = (x - b) / xr = z / f   and   y / yl = y / yr = z / f
so the depth is z = f b / (xl - xr), where (xl - xr) is the disparity.
Finding Correspondences (dense or sparse)
[Figure: points P1 and P2 seen from camera centres C1 and C2 separated by baseline b; the epipolar plane cuts each image in an epipolar line.]
The match for P1 (or P2) in the other image must lie on the same epipolar line.
Epipolar Geometry: General Case
[Figure: a point P seen by two cameras with centres C1 and C2; its images P1 and P2 lie on epipolar lines through the epipoles e1 and e2.]
Constraints
1. Epipolar constraint: matching points lie on corresponding epipolar lines.
2. Ordering constraint: matches usually appear in the same order along the corresponding epipolar lines.
[Figure: points P and Q viewed from camera centres C1 and C2, with epipoles e1 and e2.]
Structured Light
3D data can also be derived using a single camera and a light source that projects a light stripe onto the object.
[Figure: light source and camera arranged for light-stripe triangulation.]
Structured Light: 3D Computation
With the camera and projector in a triangulated setup, a 3D point (x, y, z) is recovered from its image point (x', y') and the stripe angle θ via a relation of the form (x, y, z) ∝ (x', y', f) / (f cot θ - x').
[Figure: a projector and camera scanning a 3D object on a rotation table.]
The Coordinate Frames
• Object
• World (W)
• Camera (C)
• Real image
• Pixel image
[Figure: a pyramid object shown with its object, world (xw, yw, zw), camera (xc, yc, zc), real-image (xf, yf), and pixel-image (xp, yp) coordinate frames.]
[Figure: the object model is rotated, translated, and scaled to place an instance of the object in the world frame.]
Rotation by angle θ about the x-axis:

| Px' |   | 1    0       0     0 | | Px |
| Py' | = | 0  cos θ  -sin θ   0 | | Py |
| Pz' |   | 0  sin θ   cos θ   0 | | Pz |
| 1   |   | 0    0       0     1 | | 1  |
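A quick NumPy check of this rotation (the helper name is illustrative):

```python
import numpy as np

def rot_x(theta):
    """Homogeneous 4x4 rotation by angle theta about the x-axis,
    matching the matrix above."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0,  0, 0],
                     [0, c, -s, 0],
                     [0, s,  c, 0],
                     [0, 0,  0, 1]])

# Rotating the point (0, 1, 0) by 90 degrees about x gives (0, 0, 1).
P = np.array([0, 1, 0, 1])
print(np.round(rot_x(np.pi / 2) @ P, 3))  # [0. 0. 1. 1.]
```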
The camera matrix C combines the perspective transformation (f) with the translation T and rotations R1, R2. In homogeneous coordinates:

| s Ipr |   | c11 c12 c13 c14 | | Px |
| s Ipc | = | c21 c22 c23 c24 | | Py |
| s     |   | c31 c32 c33  1  | | Pz |
                                | 1  |

image point = camera matrix C x world point
What's in C?
The camera model handles the rigid-body transformation from world coordinates to camera coordinates plus the perspective transformation to image coordinates.
1. CP = T R WP
2. IP = (f) CP (perspective projection with focal length f)

| s IPx |   | 1 0  0    0 | | CPx |
| s IPy | = | 0 1  0    0 | | CPy |
| s     |   | 0 0  1/f  0 | | CPz |
                            |  1  |

image point = perspective transformation x 3D point in camera coordinates
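Putting the two steps together, here is a minimal NumPy sketch of the world-to-image pipeline, assuming a given rotation R, translation t, and focal length f (all names illustrative):

```python
import numpy as np

def world_to_image(P_world, R, t, f):
    """Step 1: CP = T R WP (rigid-body transform into camera coords).
    Step 2: perspective transform of CP with focal length f.
    A sketch of the two-step model above; R is 3x3, t is a 3-vector,
    P_world is a homogeneous 4-vector."""
    TR = np.eye(4)                 # build the 4x4 rigid-body transform
    TR[:3, :3] = R
    TR[:3, 3] = t
    CP = TR @ P_world              # camera coordinates

    persp = np.array([[1, 0, 0,     0],   # perspective matrix with
                      [0, 1, 0,     0],   # third row (0, 0, 1/f, 0)
                      [0, 0, 1 / f, 0]])
    s_ip = persp @ CP              # (s*IPx, s*IPy, s)
    return s_ip[:2] / s_ip[2]      # divide out the scale s

# Identity rotation, camera 10 units back along z, focal length 0.05:
print(world_to_image(np.array([2.0, 1.0, 0.0, 1.0]),
                     np.eye(3), np.array([0, 0, 10.0]), f=0.05))
# -> [0.01 0.005], matching the similar-triangle example earlier
```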
Camera Calibration
• In order to work in 3D, we need to know the parameters of the particular camera setup.
Intrinsic parameters
• focal length f
Extrinsic parameters
• translation parameters t = [tx ty tz]
• rotation matrix
Calibration Object
[Figure: a 3D calibration point Pi = (xi, yi, zi) projects to the image point pi = (ui, vi) on the image plane, with principal point p0.]
Tsai's Procedure
• Estimates:
  • the 9 rotation matrix values
  • the 3 translation values
  • the focal length
  • the lens distortion factor
[Figure: corresponding points P1 = (r1, c1) and P2 = (r2, c2) in two images with camera centres C1 and C2 and epipoles e1 and e2.]
For a correspondence (r1, c1) in image 1 to (r2, c2) in image 2, the match must lie on the corresponding epipolar line.
Digital Watermarking
Information Hiding
Microchip - Application
What is a watermark? A distinguishing mark impressed on paper during manufacture; visible when the paper is held up to the light (e.g., a dollar bill).
Digital watermarking: an application of information hiding (hiding watermarks in digital media, such as images).
Applications
Comparison: Watermarking vs. Cryptography
Embed (D, W, K) = Dw
Watermarking Process
Example – Embedding (Dw = D + W)
• Matrix representation (12 blocks – 3 x 4 matrix)
(Algorithm Used: Random number generator RNG), Seed for RNG = K,
D = Matrix representation, W = Author’s name
1 2 3 4
5 6 7 8
9 10 11 12
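A toy Python sketch of this additive scheme: the seed K drives a random generator that picks which blocks to modify. The function names and the bit-vector watermark are illustrative; the extraction shown is the informed (private) variant in the categorization below, since it needs the original D.

```python
import numpy as np

def embed(D, W_bits, K):
    """Dw = D + W: add watermark values into blocks chosen by a
    random-number generator seeded with the key K."""
    rng = np.random.default_rng(K)
    positions = rng.choice(D.size, size=len(W_bits), replace=False)
    Dw = D.copy()
    Dw[positions] += W_bits            # additive spatial embedding
    return Dw

def extract(Dw, D, n_bits, K):
    """Regenerate the same random positions from the seed K and read
    the watermark back out (informed extraction: requires D)."""
    rng = np.random.default_rng(K)
    positions = rng.choice(D.size, size=n_bits, replace=False)
    return Dw[positions] - D[positions]

D = np.arange(1, 13, dtype=float)      # the 12 blocks above
W = np.array([1.0, 0.0, 1.0])          # watermark bits
Dw = embed(D, W, K=42)
print(extract(Dw, D, n_bits=3, K=42))  # [1. 0. 1.]
```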
Watermarking Process
Example - Extraction
• The watermark can be identified by regenerating the random numbers using the seed K, which points back to the embedded blocks (e.g., blocks 6, 8, 10).
• Spatial Watermarking: direct use of the data values to embed and extract the watermark (e.g., the voltage values of audio data).
Extraction Categorization
• Informed (private): extract using {D, K, W}
• Semi-blind (semi-private): extract using {K, W}
• Blind (public): extract using {K}
Categorization of Watermarks
Robustness
• Reorder Attack: reversal of the sequence of data values; e.g., a reverse filter on an audio signal reverses the order of the data values in time (the sequence 1 2 3 becomes 3 2 1; a bit pattern 0 1 1 1 1 0 is reversed).
Requirements of a Watermark
• Security: the watermark should be accessible only by authorized parties; if the key used during watermarking is leaked, anyone can read the watermark and remove it.
• Robustness: resistance against hostile or user-dependent changes.
• Capacity: the amount of watermark information that can be embedded.
• Imperceptibility (invisibility)
Object Detection
• What objects are in the scene?
Classification and Localization
Suppose there are five categories of objects with their corresponding labels:
• Flower (1: [1,0,0,0,0])
• Fruit (2: [0,1,0,0,0])
• Bird (3: [0,0,1,0,0])
• Insect (4: [0,0,0,1,0])
• Background only, i.e. none of the above (5: [0,0,0,0,1])
The CNN output would be 'flower' with a bounding box given by its centre, height, and width.
[Figure: a flower with its bounding box.]
Classification and Localization
Suppose there are four categories of objects: flower, bird, insect, and background. A fully connected layer (4096 to 5) outputs the class scores, for example:
Flower: 0.92, Fruit: 0.023, Bird: 0.02, Insect: 0.07, Background: 0.32
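As a sketch, such a head can be written as one linear layer for the class scores plus a parallel one for the box. Only the 4096-to-5 sizing comes from the slide; everything else here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    """Classification-with-localization head: a fully connected layer
    maps the 4096-d feature vector to 5 class scores, and a parallel
    layer regresses the box (cx, cy, w, h)."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.cls_head = nn.Linear(4096, num_classes)  # class scores
        self.box_head = nn.Linear(4096, 4)            # (cx, cy, w, h)

    def forward(self, feats):
        scores = torch.softmax(self.cls_head(feats), dim=-1)
        box = self.box_head(feats)
        return scores, box

feats = torch.randn(1, 4096)           # e.g. from a CNN backbone
scores, box = ClassifyAndLocalize()(feats)
print(scores.shape, box.shape)         # torch.Size([1, 5]) torch.Size([1, 4])
```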
Detection as a Regression Problem
In Selective Search, regions a and b are merged using a similarity that combines size and texture:
S(a, b) = S_size(a, b) + S_texture(a, b)
Source: https://jhui.github.io/2017/03/15/Fast-R-CNN-and-Faster-R-CNN/
Faster R-CNN
• Faster R-CNN does not use an external region proposal method (such as Selective Search) to create region proposals.
Region Proposal Network
• Instead, a Region Proposal Network (RPN) generates the proposals.
• Yet it is still not suitable for real-time applications.
YOLO Features
• Computationally fast; can be used in a real-time environment.
• Processes the entire image globally, only once, with a single CNN.
YOLO
If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.
YOLO
Each grid cell predicts B bounding boxes and confidence scores for those boxes.
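A simplified decoding sketch for a single cell's prediction (one box per cell, no anchor boxes; the names and the tensor layout are assumptions):

```python
def decode_yolo_cell(pred, row, col, S=19):
    """Decode one grid cell's raw prediction (tx, ty, w, h, conf) into
    an image-relative box. (tx, ty) locate the box centre relative to
    the cell that 'owns' the object; w and h are relative to the
    whole image. A simplified single-box sketch."""
    tx, ty, w, h, conf = pred
    cx = (col + tx) / S            # cell offset + position within cell
    cy = (row + ty) / S
    return (cx, cy, w, h), conf

# An object centred slightly right of the middle of cell (9, 9):
box, conf = decode_yolo_cell([0.6, 0.5, 0.2, 0.3, 0.9], row=9, col=9)
print(box, conf)                   # ((0.5053, 0.5, 0.2, 0.3), 0.9)
```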
YOLO Algorithm
[Figure: each predicted box is parameterized by its centre (x, y) and its height h and width.]
Example
YOLO's Prediction
• For each of the 19 x 19 grid cells, take the maximum of the probability scores (a max across both the 5 anchor boxes and the different classes).
• Color each grid cell according to the object that grid cell considers the most likely.
Too Many Boxes!
• Ignore boxes with a low score, that is, boxes that are not very confident about detecting a class.
• Select only one box when several boxes overlap with each other and detect the same object. How?
[Figure: a ground-truth bounding box compared with two predicted boxes B1 and B2.]
Remove all boxes whose score is less than the threshold.
Non-Max Suppression
• A second-level filter for selecting the right boxes.
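A minimal sketch of this two-level filtering: a score threshold followed by greedy IoU-based suppression. The threshold values are illustrative.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Drop low-score boxes, then repeatedly keep the highest-scoring
    box and suppress boxes overlapping it by more than iou_thresh."""
    order = [i for i in np.argsort(scores)[::-1]
             if scores[i] >= score_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2] -- the overlapping box 1 is suppressed
```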
YOLO CNN Architecture
YOLO Versions
• YOLOv1: Joseph Redmon (June 2015)
• YOLOv2-v3: Joseph Redmon and Ali Farhadi (2016-18)
• YOLOv4: Alexey Bochkovskiy (April 2020)
• YOLOv5: Glenn Jocher (May 2020)
• Controversies and comparisons: https://medium.com/deelvin-machine-learning/yolov4-vs-yolov5-db1e0ac7962b
SSD
• Faster than YOLO, and as accurate as two-stage methods like Faster R-CNN.
SSD Framework
• SSD only needs an input image and ground-truth boxes for each object during training.
SSD Framework
• For each default box, both the shape offsets and the confidence scores for all object categories are predicted.
• At training time, these default boxes are matched with the ground-truth boxes. For example, two default boxes may be matched with the cat and one with the dog; those are treated as positives and the rest as negatives.
Image source: Liu et al., Single Shot MultiBox Detector, December 2015.
Results
Source: https://towardsdatascience.com
M2Det
• Zhao et al. (2018) introduced a new single-shot object detector based on a multi-level feature pyramid network.
• Apart from scale variation, appearance-complexity variation should also be considered for the object detection task.
• Object instances of similar size can still look quite different.
• M2Det adds a new dimension to multi-scale detection: multi-level learning.
• Deeper levels learn features for objects with more appearance-complexity variation (e.g., a pedestrian on a road), while shallower levels learn features for simpler objects (e.g., a traffic light).
[Figure: the M2Det architecture compared with SSD.]
M2Det
• Three modules: the Feature Fusion Module (FFM), the Thinned U-shape Module (TUM), and the Scale-wise Feature Aggregation Module (SFAM).
With the rise of autonomous vehicles, smart video surveillance, facial detection and
various people counting applications, fast and accurate object detection systems are rising
in demand. These systems involve not only recognizing and classifying every object in an
image, but localizing each one by drawing the appropriate bounding box around it. This
makes object detection a significantly harder task than its traditional computer vision
predecessor, image classification.
R-CNN
R-CNN is the grand-daddy of Faster R-CNN. In other words, R-CNN really kicked things off.
R-CNN, or Region-based Convolutional Neural Network, consisted of 3 simple steps:
1.Scan the input image for possible objects using an algorithm called Selective Search,
generating ~2000 region proposals
2.Run a convolutional neural net (CNN) on top of each of these region proposals
3.Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear
regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN
R-CNN’s immediate descendant was Fast-R-CNN. Fast R-CNN resembled the original in many
ways, but improved on its detection speed through two main augmentations:
1.Performing feature extraction over the image before proposing regions, thus only running one
CNN over the entire image instead of 2000 CNN’s over 2000 overlapping regions
2.Replacing the SVM with a softmax layer, thus extending the neural network for predictions
instead of creating a new model
As we can see from the image, we are now generating region proposals based on the last
feature map of the network, not from the original image itself. As a result, we can train just one
CNN for the entire image.
In addition, instead of training many different SVM’s to classify each object class, there is a
single softmax layer that outputs the class probabilities directly. Now we only have one neural
net to train, as opposed to one neural net and many SVM’s.
Faster R-CNN
At this point, we’re back to our original target: Faster R-CNN. The main insight of Faster R-
CNN was to replace the slow selective search algorithm with a fast neural net. Specifically, it
introduced the region proposal network (RPN).
Here’s how the RPN worked:
•At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and
maps it to a lower dimension (e.g. 256-d)
•For each sliding-window location, it generates multiple possible regions based on k fixed-
ratio anchor boxes (default bounding boxes)
•Each region proposal consists of a) an “objectness” score for that region and b) 4 coordinates
representing the bounding box of the region
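A sketch of the anchor generation the RPN relies on; the stride, scales, and aspect ratios below are common Faster R-CNN defaults but are assumptions here, as are the names.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) fixed-ratio anchor boxes
    at every sliding-window position of the feature map. Returns
    (feat_h * feat_w * k, 4) boxes as (x1, y1, x2, y2) in image coords."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = j * stride, i * stride   # window centre in the image
            for s in scales:
                for r in ratios:
                    # r is the w/h aspect ratio; the box keeps area s*s.
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(make_anchors(2, 2).shape)   # (36, 4): 2*2 positions x 9 anchors
```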
R-FCN
The same pipeline can be made fully convolutional with position-sensitive score maps (this is the R-FCN architecture):
1. Run a CNN (in this case, ResNet) over the input image.
2. Add a fully convolutional layer to generate a score bank of the aforementioned "position-sensitive score maps." There should be k²(C+1) score maps, with k² representing the number of relative positions to divide an object into (e.g., 3² for a 3-by-3 grid) and C+1 representing the number of classes plus the background.
3. Run a fully convolutional region proposal network (RPN) to generate regions of interest (RoIs).
4. For each RoI, divide it into the same k² "bins" or subregions as the score maps.
5. For each bin, check the score bank to see if that bin matches the corresponding position of some object. For example, if I'm on the "upper-left" bin, I will grab the score maps that correspond to the "upper-left" corner of an object and average those values over the RoI region. This process is repeated for each class.
6. Once each of the k² bins has an "object match" value for each class, average the bins to get a single score per class.
7. Classify the RoI with a softmax over the resulting (C+1)-dimensional vector.
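A simplified NumPy sketch of the position-sensitive pooling in steps 4-6; this is not the reference R-FCN implementation, and the shapes and names are illustrative.

```python
import numpy as np

def ps_roi_pool_scores(score_maps, roi, k=3, num_classes=2):
    """Position-sensitive pooling: score_maps has k*k*(C+1) channels;
    bin (u, v) of the RoI is averaged only over the score map that
    corresponds to that relative position, separately per class."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / k
    bin_w = (x2 - x1) / k
    C1 = num_classes + 1                       # classes + background
    votes = np.zeros((k, k, C1))
    for u in range(k):                         # bin row ("upper", ...)
        for v in range(k):                     # bin column ("left", ...)
            ys = slice(int(y1 + u * bin_h), int(y1 + (u + 1) * bin_h))
            xs = slice(int(x1 + v * bin_w), int(x1 + (v + 1) * bin_w))
            for c in range(C1):
                ch = (u * k + v) * C1 + c      # the (u, v)-specific map
                votes[u, v, c] = score_maps[ch, ys, xs].mean()
    return votes.mean(axis=(0, 1))             # one score per class

maps = np.random.rand(3 * 3 * 3, 40, 40)       # k^2 * (C+1) maps, C=2
print(ps_roi_pool_scores(maps, roi=(8, 8, 26, 26)))  # 3 class scores
```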
https://nptel.ac.in/courses/106106093/
https://www.youtube.com/watch?v=m-aKj5ovDfg
https://www.youtube.com/watch?v=G4NYQox4n2g
1. Explain the concept of object tracking in computer vision. Discuss different algorithms or
techniques used for object tracking.
2. Describe the process of image recognition using convolutional neural networks (CNNs).
What are the key components and steps involved?
3. Discuss the concept of depth estimation in computer vision. Explain how depth
information can be extracted from 2D images.
4. Explain the concept of image stitching and its applications. How are multiple images
combined to create a panoramic image?
5. Discuss the challenges and approaches for handling scale invariance in object detection
algorithms.
6. Describe the concept of facial recognition in computer vision. Discuss its applications,
advantages, and potential privacy concerns.
7. Explain the concept of semantic segmentation and its applications in computer vision.
Provide examples of scenarios where semantic segmentation is useful.
8. Discuss the concept of object recognition using feature descriptors. Explain popular
feature descriptor algorithms such as SIFT or SURF.
9. Explain the concept of image super-resolution and its applications. How can low-
resolution images be enhanced to improve their quality?
10. Discuss the role of transfer learning in computer vision. How can pre-trained models be
utilized for new tasks or datasets?
MCQs
Question 1: What is computer vision?
A. The study of computers and their components
B. The field of processing and understanding visual data by computers
C. The development of computer software for image editing
D. The study of visual perception in humans
Question 3: Which technique is commonly used for feature extraction in computer vision?
A. Convolutional Neural Networks (CNN)
B. Decision Trees
C. Support Vector Machines (SVM)
D. K-means clustering
1.What is computer vision? Answer: c. Computer vision is a field of artificial intelligence that
focuses on enabling computers to interpret and understand visual information from images or
videos.
2.Name three common applications of computer vision. Answer: a. Autonomous vehicles, b.
Object recognition, c. Medical image analysis
3.What is the purpose of image segmentation in computer vision? Answer: b. Image
segmentation aims to partition an image into meaningful regions or segments to facilitate
object detection, tracking, or analysis.
4.What is the difference between object detection and object recognition? Answer: c. Object
detection involves both localizing and classifying objects within an image, while object
recognition focuses solely on identifying objects without localizing them.
5. Explain the concept of convolution in convolutional neural networks (CNNs). Answer: a. Convolution involves applying a filter/kernel to an input image or feature map, computing element-wise multiplications, and summing the results to produce a feature map.
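To ground this answer, here is a tiny NumPy sketch of the sliding-window multiply-and-sum; note that CNN "convolution" layers actually compute cross-correlation, as this does.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution as described above: slide the kernel over
    the image, multiply element-wise, and sum each window."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1., 0., -1.]] * 3)     # simple vertical-edge kernel
print(conv2d(image, edge))               # 2x2 feature map
```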
MCQs (cont'd)
1.What is optical character recognition (OCR) used for in computer vision? Answer: b. OCR is
used to convert printed or handwritten text from images into machine-readable text, enabling
automated text analysis or data extraction.
2.What is the purpose of non-maximum suppression in object detection algorithms? Answer: a.
Non-maximum suppression is used to eliminate redundant bounding box detections by keeping
only the most confident detection and suppressing overlapping or lower-confidence detections.
3.What are some common challenges faced in computer vision tasks? Answer: c. Variations in
lighting conditions, occlusion, viewpoint changes, and limited labeled data are common
challenges in computer vision tasks.
4.What is the difference between supervised and unsupervised learning in computer vision?
Answer: b. Supervised learning requires labeled training data, where input images are associated
with corresponding ground-truth labels. Unsupervised learning involves learning patterns or
structures from unlabeled data without explicit labels.
Computer vision is a field of study that deals with the extraction of information
from images or videos to understand and interpret visual data. It involves the
development of algorithms and techniques to enable computers to perceive and
understand the visual world in a way similar to humans.