
NOIDA INSTITUTE OF ENGINEERING AND TECHNOLOGY, GREATER NOIDA

Object Detection

Unit: 4

Subject Name: Computer Vision
Course Details: B.Tech 7th Sem
Faculty Name: Dr Preeti Gera
Affiliation: Associate Professor, Department of CSE
UNIT-I: Introduction to Computer Vision

Computer Vision, Research and Applications (Self-Driving Cars, Facial Recognition, Augmented & Mixed Reality, Healthcare). Most popular examples: Categorization of Images, Object Detection, Observation of Moving Objects, Retrieval of Images Based on Their Contents. Computer Vision Tasks: classification, object detection, instance segmentation. Convolutional Neural Networks, Evolution of CNN Architectures for Images, Recent CNN Architectures.


UNIT-II: Architectures

Representation of a Three-Dimensional Moving Scene. Convolutional layers, pooling layers, and padding. Transfer learning and pre-trained model architectures.

Architecture Design: LeNet-5, AlexNet, VGGNet, GoogLeNet, ResNet, EfficientNet, MobileNet, RNN Introduction.


UNIT-III: Segmentation

Popular Image Segmentation Architectures, FCN Architecture, Upsampling Methods, Pixel Transformations, Geometric Operations, Spatial Operations in Image Processing, Instance Segmentation, Localisation, Object detection and image segmentation using CNNs, LSTM and GRUs. Vision Models, Vision Languages, Quality Analysis, Visual Dialogue, Active Contours & Applications, Split & Merge, Mean Shift & Mode Finding, Normalized Cuts.


UNIT-IV: Object Detection

Object Detection and Sliding Windows, R-CNN, Fast R-CNN, Object Recognition, 3-D Vision and Geometry, Digital Watermarking. Face Recognition, Instance Recognition, Category Recognition of Objects, Scenes, and Activities, Object Classification.


UNIT-V: Visualization and Generative Models

Benefits of Interpretability, Fashion MNIST Class Activation Map code walkthrough, Grad-CAM, ZFNet. Image compression methods and their requirements: statistical compression, spatial compression, contour coding. Deep Generative Models introduction, Generative Adversarial Networks, combinations of VAEs and GANs, other deep generative models. GAN improvements, Deep Generative Models across multiple domains, image and video applications.


Syllabus

Unit 1: Introduction to Computer Vision. Computer Vision, Research and Applications (Self-Driving Cars, Facial Recognition, Augmented & Mixed Reality, Healthcare). Most popular examples: Categorization of Images, Object Detection, Observation of Moving Objects, Retrieval of Images Based on Their Contents. Computer Vision Tasks: classification, object detection, instance segmentation. Convolutional Neural Networks, Evolution of CNN Architectures for Images, Recent CNN Architectures.

Unit 2: Architectures. Representation of a Three-Dimensional Moving Scene. Convolutional layers, pooling layers, and padding. Transfer learning and pre-trained model architectures. Architecture Design: LeNet-5, AlexNet, VGGNet, GoogLeNet, ResNet, EfficientNet, MobileNet. RNN Introduction, perceptron, backpropagation in CNNs and RNNs.

Unit 3: Segmentation. Popular Image Segmentation Architectures, FCN Architecture, Upsampling Methods, Pixel Transformations, Geometric Operations, Spatial Operations in Image Processing, Instance Segmentation, Localisation, Object detection and image segmentation using CNNs, LSTM and GRUs. Vision Models, Vision Languages, Quality Analysis, Visual Dialogue, other attention models, self-attention and transformers. Active Contours & Applications, Split & Merge, Mean Shift & Mode Finding, Normalized Cuts.

Unit 4: Object Detection. Object Detection and Sliding Windows, R-CNN, Fast R-CNN, Object Recognition, 3-D Vision and Geometry, Digital Watermarking. Face recognition, instance recognition, category recognition of objects, scenes, and activities, object classification and detection. Encoder in code, decoder in code, U-Net code: encoder, decoder. Few-shot and zero-shot learning, self-supervised learning, adversarial robustness, pruning and model compression, neural architecture search, objects in scenes. YOLO, fundamentals of image formation, convolution and filtering.

Unit 5: Visualization and Generative Models. Benefits of Interpretability, Fashion MNIST Class Activation Map code walkthrough, Grad-CAM, ZFNet. Image compression methods and their requirements: statistical compression, spatial compression, contour coding. Deep Generative Models introduction, Generative Adversarial Networks, combinations of VAEs and GANs, other deep generative models. GAN improvements, Deep Generative Models across multiple domains, image and video applications.
Course Objective

Course Objective: To learn the key concepts of Computer Vision; to design and implement vision systems; and to continuously improve the accuracy and reliability of analysis results across a variety of datasets.
Course Outcome

Course outcome: After completion of this course students will be able to:

CO1: Analyse knowledge of deep architectures used for solving various vision and pattern association tasks. (Bloom's Taxonomy: K4)
CO2: Develop appropriate learning rules for each of the architectures of perceptron and learn about different factors of backpropagation. (K2)
CO3: Deploy training algorithms for pattern association with the help of memory networks. (K5)
CO4: Design and deploy deep learning models with the help of use cases. (K5)
CO5: Understand and analyse different theories of deep learning using neural networks. (K3)


Program Outcome

At the end of the semester, the student will be able to demonstrate:

POs: Engineering Graduates will be able to
PO1  Engineering Knowledge
PO2  Problem Analysis
PO3  Design & Development of Solutions
PO4  Conduct Investigation of Complex Problems
PO5  Modern Tool Usage
PO6  The Engineer and Society
PO7  Environment and Sustainability
PO8  Ethics
PO9  Individual & Team Work
PO10 Communication
PO11 Project Management and Finance
PO12 Life Long Learning


CO-PO and PSO Mapping

Computer Vision

CO.K   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
CO1    3    2    1
CO2    2    1
CO3    2    3
CO4    3    3    2
CO5    3    2    1    1
avg


Program Educational Objectives (PEOs)

PEO1: To have an excellent scientific and engineering breadth so as to comprehend, analyze, design, and provide sustainable solutions for real-life problems using state-of-the-art technologies.
PEO2: To have a successful career in industry, to pursue higher studies, or to support entrepreneurial endeavours, and to face global challenges.
PEO3: To have effective communication skills, a professional attitude, ethical values, and a desire to learn specific knowledge in emerging trends and technologies for research, innovation, product development, and contribution to society.
PEO4: To have life-long learning for up-skilling and re-skilling for a successful professional career as an engineer, scientist, entrepreneur, or bureaucrat, for the betterment of society.


End Semester Question Paper Template

B.TECH (SEM-VII) THEORY EXAMINATION 20__-20__
Computer Vision
Time: 3 Hours                                  Total Marks: 100
Note: 1. Attempt all Sections. If any required data is missing, choose suitably.

SECTION A
1. Attempt all questions in brief.             2 x 10 = 20
Q.No.   Question   Marks   CO
1                  2
2                  2
.                  .
10                 2


End Semester Question Paper Templates
SECTION B
2. Attempt any three of the following: 3 x 10 = 30

Q.No. Question Marks CO


1 10
2 10
. .
5 10
SECTION C
3. Attempt any one part of the following: 1 x 10 = 10

Q.No. Question Marks CO

1 10
2 10



End Semester Question Paper Templates
4. Attempt any one part of the following: 1 x 10 = 10
Q.No. Question Marks CO

1 10
2 10
5. Attempt any one part of the following: 1 x 10 = 10
Q.No. Question Marks CO
1 10
2 10
6. Attempt any one part of the following: 1 x 10 = 10
Q.No. Question Marks CO

1 10
2 10



CONTENT

Unit IV: Object Detection

• Object Detection and Sliding Windows
• R-CNN
• Fast R-CNN
• Object Recognition
• 3-D Vision and Geometry
• Digital Watermarking
• Face Recognition
• Instance Recognition
• Category Recognition: Objects, Scenes, Activities
• Object Classification


CO-PO and PSO Mapping

       PSO1  PSO2  PSO3
CO1    1     2
CO2    2
CO3    1
CO4    1     2     3
CO5    1     1     2
avg

*3 = High   *2 = Medium   *1 = Low


PREREQUISITE

• Students should have knowledge of computer organization.
• Students should know the basic concepts of artificial intelligence.
• Students should be able to create, design, and manipulate images, and have an idea of pixels.


PREREQUISITE

Prerequisites
No prior experience with computer vision is assumed, although previous knowledge of visual computing or signal processing will be helpful (e.g., CSCI 1230). The following skills are necessary for this class:
• Math: Linear algebra, vector calculus, and probability. Linear algebra is the most important and is required.
• Data structures: You will write code that represents images as matrices, high-dimensional features, and geometric constructions.
• Programming and toolchains: A good working knowledge. Intro CS is required, and an intermediate systems course is strongly encouraged.
Object Detection

What do you notice in the image?

Some of you may respond as follows:
● A bear in a grassy area.
● A black bear walking in a grassy area.
● A black bear and some green grass.
● A bear eating grass.


Object Detection

Motivation

Aid to Visually Challenged People:
❏ We can develop a product for the blind that will assist them in navigating roadways without the assistance of others.
❏ This may be accomplished by first translating the scene to text, then the text to speech.
❏ Both are now well-known deep learning applications.
Object Detection

Motivation

Self-Driving Cars:
❏ Automatic driving is one of the most difficult problems, and accurately captioning the environment around the vehicle can help the self-driving system.
Object Detection

Motivation

Image Searching:
❏ Automatic captioning might help Google Image Search become as good as Google Search.
❏ Every image could be turned into a caption first, and then searches could be conducted based on the caption.
Object Detection

Motivation

CCTV Surveillance:
❏ CCTV cameras are now common; beyond simply observing people, we can generate appropriate captions for what they see.
❏ We can trigger warnings as soon as harmful activity is detected. This is likely to help decrease crime and/or accidents.
❏ It can be used to describe video in real time.
Object Detection

Image Captioning Model

★ A deep learning based image captioning architecture relies on two components: a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
★ The CNN works as an encoder for image encoding and feature extraction.
★ The RNN is used as a decoder for language modeling; it generates the caption word by word.
Object Detection

Softmax Activation Function

Softmax is an activation function that scales numbers/logits into probabilities.

Image courtesy: https://towardsdatascience.com/
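
As a minimal sketch (my addition, not from the slides), a numerically stable softmax can be written in a few lines of NumPy:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # e.g. [0.659 0.242 0.099], sums to 1
```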


Object Detection

Fully Connected Layer


❖ A fully-connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features represented by the output of the convolutional layers.
❖ Softmax is applied over the output of the fully connected layer to classify the input.

Image courtesy: https://towardsdatascience.com/


Object Detection

Convolutional Neural Network


★ Convolutional neural networks (CNN/ConvNet) are a type of deep neural network frequently used to analyse visual images.
★ A CNN has the following components: convolutional layers, pooling layers, and fully connected layers.

Image courtesy: https://towardsdatascience.com/


Object Detection
R-CNN
To bypass the problem of selecting a huge number of regions, Ross Girshick et al. proposed a method where we use selective search to extract just 2000 regions from the image, called region proposals. Therefore, instead of trying to classify a huge number of regions, you can just work with 2000 regions.

The CNN acts as a feature extractor: its output dense layer contains the features extracted from the image, and these extracted features are fed into an SVM to classify the presence of the object within that candidate region proposal.

In addition to predicting the presence of an object within the region proposals, the algorithm also predicts four offset values to increase the precision of the bounding box. For example, given a region proposal, the algorithm might have predicted the presence of a person, but the face of that person within the region proposal could have been cut in half. The offset values help in adjusting the bounding box of the region proposal.
CNN architecture of R-CNN

• After that, these regions are warped into a single square of the dimensions required by the CNN model. The CNN model used here is a pre-trained AlexNet model, the state-of-the-art CNN model for image classification at that time. Let's look at the AlexNet architecture here.

• Here the input of AlexNet is (227, 227, 3). So, whether the region proposals are small or large, we need to resize each region proposal to the given dimensions.
• From the above architecture, we remove the last softmax layer to get the (1, 4096) feature vector. We pass this feature vector into the SVM and the bounding box regressor.


• SVM (Support Vector Machine)
• The feature vector generated by the CNN is then consumed by a binary SVM, trained for each class independently. This SVM model takes the feature vector generated by the previous CNN architecture and outputs a confidence score for the presence of an object in that region.

• However, one issue with training the SVM is that it requires the AlexNet feature vectors, so we cannot train AlexNet and the SVM independently in parallel. This challenge is resolved in later versions of R-CNN (Fast R-CNN, Faster R-CNN, etc.).
• Bounding Box Regressor
• In order to precisely locate the bounding box in the image, a scale-invariant linear regression model called a bounding box regressor is used. For training this model, we take predicted and ground-truth pairs of the four localization dimensions.

• These dimensions are (x, y, w, h), where x and y are the pixel coordinates of the center of the bounding box, and w and h represent the width and height of the bounding box. This method increases the mean Average Precision (mAP) of the result by 3-4%.


• Output:
• Now we have region proposals that are classified for every class label. To deal with the extra bounding boxes generated by the above model, we use an algorithm called Non-Maximum Suppression. It works in 3 steps:
• Discard those boxes whose confidence score is less than a certain threshold value (say 0.5).
• Select the region that has the highest probability among the candidate regions for the object as the predicted region.
• In the final step, discard those regions whose IoU (Intersection over Union) with the predicted region is over 0.5.
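
A hedged NumPy sketch of these three steps (my illustration; the (x1, y1, x2, y2) box format is assumed, and both 0.5 thresholds are the values quoted above):

```python
import numpy as np

def iou(box, boxes):
    # IoU of one box against each row of `boxes`; all boxes are (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    # Step 1: discard low-confidence boxes.
    keep_mask = scores >= score_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(scores)[::-1]           # highest score first
    kept = []
    while order.size > 0:
        best = order[0]                        # Step 2: pick the highest-scoring box.
        kept.append(best)
        rest = order[1:]
        # Step 3: drop boxes that overlap the pick by more than iou_thresh.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return boxes[kept], scores[kept]
```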


• Problems with R-CNN
• It still takes a huge amount of time to train the network, as you would have to classify 2000 region proposals per image.
• It cannot be implemented in real time, as it takes around 47 seconds for each test image.
• The selective search algorithm is a fixed algorithm, so no learning happens at that stage. This can lead to the generation of bad candidate region proposals.

• Fast R-CNN


• The approach is similar to the R-CNN algorithm. But, instead of feeding the region proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature map.

• From the convolutional feature map, we identify the region proposals and warp them into squares, and by using an RoI pooling layer we reshape them into a fixed size so that they can be fed into a fully connected layer. From the RoI feature vector, we use a softmax layer to predict the class of the proposed region and also the offset values for the bounding box.

• The reason Fast R-CNN is faster than R-CNN is that you don't have to feed 2000 region proposals to the convolutional neural network every time. Instead, the convolution operation is done only once per image, and a feature map is generated from it.

• Faster R-CNN


• Both of the above algorithms (R-CNN and Fast R-CNN) use selective search to find the region proposals. Selective search is a slow and time-consuming process that affects the performance of the network. Therefore, Shaoqing Ren et al. came up with an object detection algorithm that eliminates the selective search algorithm and lets the network learn the region proposals.

• Similar to Fast R-CNN, the image is provided as an input to a convolutional network, which produces a convolutional feature map. Instead of using a selective search algorithm on the feature map to identify the region proposals, a separate network is used to predict the region proposals. The predicted region proposals are then reshaped using an RoI pooling layer, which is then used to classify the image within the proposed region and predict the offset values for the bounding boxes.

• The Fast R-CNN part consists of a CNN (usually pre-trained on the ImageNet classification task) with its final pooling layer replaced by an "RoI pooling" layer and its final FC layer replaced by two branches: a (K + 1)-category softmax layer branch and a category-specific bounding box regression branch.
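
A minimal PyTorch sketch (my illustration, with an assumed K = 20 classes and 4096-d pooled RoI features) of the two-branch head described above:

```python
import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    def __init__(self, in_features=4096, num_classes=20):  # K = 20 is assumed
        super().__init__()
        self.cls_branch = nn.Linear(in_features, num_classes + 1)  # (K + 1) class scores
        self.box_branch = nn.Linear(in_features, num_classes * 4)  # per-class box offsets

    def forward(self, roi_feats):                # roi_feats: (num_rois, in_features)
        cls_scores = self.cls_branch(roi_feats)  # softmax is applied inside the loss
        box_deltas = self.box_branch(roi_feats)
        return cls_scores, box_deltas

head = FastRCNNHead()
scores, deltas = head(torch.randn(8, 4096))     # 8 pooled RoIs
print(scores.shape, deltas.shape)               # torch.Size([8, 21]) torch.Size([8, 80])
```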


Object Detection

What can you determine about
1. the sizes of objects?
2. the distances of objects from the camera?

What knowledge do you use to analyze this image?
What objects are shown in this image?
How can you estimate distance from the camera?
What feature changes with distance?
Object Detection

3D Shape from X

• shading
• silhouette      } mainly research
• texture

• stereo
• light striping  } used in practice
• motion
Object Detection

Perspective Imaging Model: 1D

(Figure: a 1D camera with lens center of projection O, a 3D object point B at depth zc and offset xc, its real image point E at xi on the real image plane at distance f behind the lens, and the front image plane, whose axis we use.)

The 1D perspective equation is:

    xi / f = xc / zc
Object Detection

Perspective in 2D (Simplified)

(Figure: a camera with optical axis zc, focal length f, 3D object point P = (xc, yc, zc) = (xw, yw, zw), and image point P' = (xi, yi, f) where the ray crosses the image plane.)

    xi / f = xc / zc   =>   xi = (f/zc) xc
    yi / f = yc / zc   =>   yi = (f/zc) yc

Here camera coordinates equal world coordinates.
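
A minimal sketch of these projection equations (example values are assumed, and the camera frame is taken to equal the world frame as above):

```python
# Project a 3D camera-frame point onto the image plane at focal distance f.
def project(point_cam, f):
    xc, yc, zc = point_cam
    return (f / zc) * xc, (f / zc) * yc   # (xi, yi)

print(project((2.0, 1.0, 10.0), f=0.05))  # (0.01, 0.005)
```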
Object Detection

3D from Stereo

(Figure: one 3D point projected into a left image and a right image.)

Disparity: the difference in image location of the same 3D point when projected under perspective to two different cameras.

    d = xleft - xright
Object Detection

Depth Perception from Stereo

Simple Model: Parallel Optic Axes

(Figure: cameras L and R with parallel optical axes, focal length f, baseline b along the x axis, image coordinates xl and xr, and 3D point P = (x, z); the y-axis is perpendicular to the page.)

    z / f = x / xl        z / f = (x - b) / xr        z / f = y / yl = y / yr
Object Detection

Resultant Depth Calculation

For stereo cameras with parallel optical axes, focal length f, baseline b, corresponding image points (xl, yl) and (xr, yr), and disparity d:

    z = f*b / (xl - xr) = f*b / d
    x = xl*z / f   or   b + xr*z / f
    y = yl*z / f   or   yr*z / f

This method of determining depth from disparity is called triangulation.
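
A minimal sketch of these triangulation formulas with made-up numbers (f, b, and the image coordinates must be in consistent units):

```python
def depth_from_disparity(xl, xr, yl, f, b):
    d = xl - xr        # disparity
    z = f * b / d      # depth from disparity
    x = xl * z / f
    y = yl * z / f
    return x, y, z

print(depth_from_disparity(xl=12.0, xr=10.0, yl=4.0, f=50.0, b=0.3))  # (1.8, 0.6, 7.5)
```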
Object Detection

Finding Correspondences

• If the correspondence is correct, triangulation works VERY well.

• But correspondence finding is not perfectly solved for the general stereo problem.

• For some very specific applications, it can be solved for those specific kinds of images, e.g. the windshield of a car.
Object Detection

3 Main Matching Methods

1. Cross correlation using small windows. (dense)

2. Symbolic feature matching, usually using segments/corners. (sparse)

3. Use the newer interest operators, e.g. SIFT. (sparse)
Object Detection

Epipolar Geometry Constraint: 1. Normal Pair of Images

(Figure: a 3D point P projects to P1 and P2 in two images with centers C1 and C2 separated by baseline b; the epipolar plane cuts through the image planes, forming 2 epipolar lines.)

The match for P1 (or P2) in the other image must lie on the same epipolar line.
Object Detection

Epipolar Geometry: General Case

(Figure: point P projects to P1 = (x1, y1) and P2 = (x2, y2); the epipoles e1 and e2 lie where the baseline C1-C2 pierces the two image planes.)
Object Detection

Constraints

1. Epipolar Constraint: matching points lie on corresponding epipolar lines.

2. Ordering Constraint: points P and Q usually appear in the same order along corresponding epipolar lines.
Object Detection

Structured Light

3D data can also be derived using
• a single camera
• a light source that can produce stripe(s) on the 3D object

(Figure: a light source projects a light stripe onto the object, observed by a camera.)
Object Detection

Structured Light: 3D Computation

(Figure: camera at the origin (0, 0, 0), light source offset by baseline b along the x axis and projecting at angle θ; the image point is (x', y', f) and the recovered 3D point is (x, y, z).)

    [x y z] = ( b / (f cot θ - x') ) [x' y' f]
Object Detection

Depth from Multiple Light Stripes

What are these objects?
Object Detection

Our (Former) System: 4-Camera Light-Striping Stereo

(Figure: four cameras and a projector arranged around a rotation table holding the 3D object.)
Object Detection

Camera Model: Recall There Are 5 Different Frames of Reference

• Object
• World
• Camera
• Real Image
• Pixel Image

(Figure: a pyramid object with its own frame A (axes xp, yp, zp), the world frame W (xw, yw, zw), the camera frame C (xc, yc, zc), and the real and pixel image frames (xf, yf).)
Object Detection

Rigid Body Transformations in 3D

(Figure: a pyramid model in its own model space, with axes xp, yp, zp, is scaled, rotated, and translated into an instance of the object in the world frame W with axes xw, yw, zw.)
Object Detection

Translation and Scaling in 3D


Object Detection

Rotation in 3D is about an axis

Rotation by angle θ about the x axis:

    | Px' |   | 1    0       0     0 | | Px |
    | Py' | = | 0  cos θ  -sin θ   0 | | Py |
    | Pz' |   | 0  sin θ   cos θ   0 | | Pz |
    | 1   |   | 0    0       0     1 | | 1  |
Object Detection

Rotation About an Arbitrary Axis

T, R1, R2: one translation and two rotations line the axis up with a major axis. Now rotate about that axis. Then apply the reverse transformations (R2, R1, T) to move it back.

    | Px' |   | r11 r12 r13 tx | | Px |
    | Py' | = | r21 r22 r23 ty | | Py |
    | Pz' |   | r31 r32 r33 tz | | Pz |
    | 1   |   |  0   0   0   1 | | 1  |
Object Detection

The Camera Model

How do we get an image point IP from a world point P?

    | s·IPr |   | c11 c12 c13 c14 | | Px |
    | s·IPc | = | c21 c22 c23 c24 | | Py |
    | s     |   | c31 c32 c33  1  | | Pz |
                                    | 1  |

    image point = camera matrix C × world point

What's in C?
Object Detection

The camera model handles the rigid body transformation from world coordinates to camera coordinates plus the perspective transformation to image coordinates.

    1. CP = T R WP
    2. IP = π(f) CP

    | s·IPx |   | 1  0   0   0 | | CPx |
    | s·IPy | = | 0  1   0   0 | | CPy |
    | s     |   | 0  0  1/f  1 | | CPz |
                                 |  1  |

    image point = perspective transformation × 3D point in camera coordinates
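
A minimal numeric sketch of the two-step camera model above; R, T, f, and the world point are assumed example values, and the perspective matrix is the one on this slide:

```python
import numpy as np

f = 0.05
R = np.eye(3)                      # assumed: camera aligned with world axes
T = np.array([0.0, 0.0, 2.0])      # assumed: translation along the optical axis

WP = np.array([0.4, 0.2, 8.0])     # world point
CP = R @ WP + T                    # step 1: CP = T R WP

persp = np.array([[1, 0, 0,    0],
                  [0, 1, 0,    0],
                  [0, 0, 1/f,  1]])
s_ip = persp @ np.append(CP, 1.0)  # step 2: homogeneous image point (s·IPx, s·IPy, s)
print(s_ip[:2] / s_ip[2])          # divide by s to get (IPx, IPy)
```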
Object Detection

Camera Calibration

• In order to work in 3D, we need to know the parameters of the particular camera setup.

• Solving for the camera parameters is called calibration.

• Intrinsic parameters are properties of the camera device itself.
• Extrinsic parameters describe where the camera sits in the world.
Object Detection

Intrinsic Parameters

• principal point (u0, v0)
• scale factors (dx, dy)
• aspect ratio distortion factor γ
• focal length f
• lens distortion factor κ (models radial lens distortion)
Object Detection

Extrinsic Parameters

• translation parameters: t = [tx ty tz]

• rotation matrix:

        | r11 r12 r13 0 |
    R = | r21 r22 r23 0 |
        | r31 r32 r33 0 |
        |  0   0   0  1 |

Are there really nine independent parameters? (No: a rotation has only three degrees of freedom.)
Object Detection

Calibration Object

The idea is to snap images at different depths and get a lot of 2D-3D point correspondences.
Object Detection

The Tsai Procedure

• The Tsai procedure was developed by Roger Tsai at IBM Research and is the most widely used.

• Several images are taken of the calibration object, yielding point correspondences at different distances.

• Tsai's algorithm requires n > 5 correspondences

    {((xi, yi, zi), (ui, vi)) | i = 1, ..., n}

  between (real) image points and 3D points.
Object Detection

Tsai's Geometric Setup

(Figure: camera center Oc, image plane with principal point p0 and image point pi = (ui, vi); the 3D point Pi = (xi, yi, zi) lies on the ray through the camera, which crosses the optical axis at (0, 0, zi).)
Object Detection

Tsai's Procedure

• Given n point correspondences ((xi, yi, zi), (ui, vi))

• Estimates:
  • 9 rotation matrix values
  • 3 translation vector values
  • focal length
  • lens distortion factor

• By solving several systems of equations.
Object Detection

We use the calibrated cameras for general stereo.

(Figure: point P1 = (r1, c1) in image 1 and P2 = (r2, c2) in image 2, with camera centers C1 and C2 and epipoles e1 and e2.)
Object Detection

For a correspondence (r1, c1) in image 1 to (r2, c2) in image 2:

1. Both cameras were calibrated, so both camera matrices are known. From the two camera equations we get 4 linear equations in 3 unknowns:

    r1 = (b11 - b31*r1)x + (b12 - b32*r1)y + (b13 - b33*r1)z
    c1 = (b21 - b31*c1)x + (b22 - b32*c1)y + (b23 - b33*c1)z

    r2 = (c11 - c31*r2)x + (c12 - c32*r2)y + (c13 - c33*r2)z
    c2 = (c21 - c31*c2)x + (c22 - c32*c2)y + (c23 - c33*c2)z

A direct solution that uses only 3 of the equations won't give reliable results; a least-squares sketch over all 4 follows below.
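
A minimal sketch of solving these 4 equations for (x, y, z) by least squares (my illustration; B and C are the two calibrated 3x4 camera matrices, called bij and cij above, and the general b14/b24-style offsets are included, which the slide's simplified form omits):

```python
import numpy as np

def triangulate(B, C, r1, c1, r2, c2):
    # Build the 4x3 system A [x y z]^T = rhs from the two camera equations:
    # row0·X = r (row2·X) and row1·X = c (row2·X), with X = [x, y, z, 1].
    rows, rhs = [], []
    for M, (r, c) in ((B, (r1, c1)), (C, (r2, c2))):
        rows.append(M[0, :3] - r * M[2, :3]); rhs.append(r * M[2, 3] - M[0, 3])
        rows.append(M[1, :3] - c * M[2, :3]); rhs.append(c * M[2, 3] - M[1, 3])
    xyz, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return xyz
```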
Object Detection

Solve by computing the closest approach of the two skew rays.

(Figure: two rays through P1 and Q1 that nearly intersect at P; V is the shortest segment connecting them.)

If the rays intersected perfectly in 3D, the intersection would be P. Instead, we solve for the shortest line segment connecting the two rays and let P be its midpoint.
Object Detection

Application: Kari Pulli's reconstruction of 3D objects from light-striping stereo.

Application: Zhenrong Qian's 3D blood vessel reconstruction from Visible Human data.

Object Detection

Digital Watermarking
Object Detection

Information Hiding

• Information hiding started with steganography (the art of hidden writing):

The art and science of writing hidden messages in such a way that no one apart from the intended recipient knows of the existence of the message. The existence of the information itself is secret.

Stego (hidden) + graphy (writing) → 'the art of hidden writing'
Object Detection

Steganography (dates back to 440 BC)

• Histiaeus used his slaves (information was tattooed on a slave's shaved head).

Initial application of information hiding → passing secret messages.
Object Detection

Microdots - Application

• The Germans used microdots (text or images shrunk to the size of a printed dot) in World War II.

Initial application of information hiding → passing secret messages.
Object Detection

What is a watermark?

A distinguishing mark impressed on paper during manufacture, visible when the paper is held up to the light (e.g. a $ bill).

Application for print media → verifying the authenticity of print media.


Object Detection

What is a digital watermark?

Digital Watermarking: an application of information hiding (hiding watermarks in digital media, such as images).

Digital watermarking can be:
- Perceptible (e.g. author information in a .doc)
- Imperceptible (e.g. author information hidden in images)

Visibility is application dependent; invisible watermarks are generally preferred.


Object Detection

Applications

Copyright Protection: to prove the ownership of digital media.
E.g. cut-and-paste of images.
Hidden watermarks represent the copyright information.
Object Detection

Applications

Tamper Proofing: to find out if data was tampered with.
E.g. changing the meaning of images.
Hidden watermarks track changes in meaning.
Issues: accuracy of detection.
Object Detection

Applications

Quality Assessment: degradation of visual quality.
E.g. loss of visual quality.
Hidden watermarks track changes in visual quality.
Object Detection

Comparison

• Watermarking vs Cryptography

Watermark D → hide information in D
Encrypt D → change the form of D
Object Detection

Watermarking Process

• Data (D), Watermark (W), Stego Key (K), Watermarked Data (Dw)

    Embed(D, W, K) = Dw

    Extract(Dw) = W', then compare W' with W
    (e.g. find the linear correlation and compare it to a threshold)

Q. How do we make this system secure?
A. K is secret. (Use cryptography to make the hidden information more secure.)
Object Detection

Watermarking Process

Example – Embedding (Dw = D + W)
• Matrix representation (12 blocks in a 3 x 4 matrix).
• Algorithm used: a random number generator (RNG); the seed for the RNG is K; D is the matrix representation; W is the author's name.

     1   2   3   4
     5   6   7   8
     9  10  11  12
Object Detection

Watermarking Process

Example – Extraction
• The watermark can be located by regenerating the same random block numbers from the seed K (here, blocks 6, 8, and 10 carry the watermark).
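
A minimal sketch of this keyed embed/extract idea (my simplification, not the slide's exact algorithm): a K-seeded RNG picks the blocks, and watermark bits are written into the least significant bit of each chosen block value.

```python
import numpy as np

def pick_blocks(key, n_blocks, n_marks):
    rng = np.random.default_rng(key)              # the key K is the RNG seed
    return rng.choice(n_blocks, size=n_marks, replace=False)

def embed(D, w_bits, key):
    Dw = D.flatten().copy()
    idx = pick_blocks(key, Dw.size, len(w_bits))
    Dw[idx] = (Dw[idx] & ~1) | np.array(w_bits)   # write bits into the LSBs
    return Dw.reshape(D.shape)

def extract(Dw, n_bits, key):
    flat = Dw.flatten()
    idx = pick_blocks(key, flat.size, n_bits)
    return [int(b) for b in flat[idx] & 1]        # read the LSBs back

D = np.arange(1, 13).reshape(3, 4)                # the 3 x 4 block matrix above
Dw = embed(D, [1, 0, 1], key=42)
print(extract(Dw, 3, key=42))                     # [1, 0, 1]: blind extraction using only K
```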
Object Detection

Data Domain Categorization

• Spatial Watermarking
  Direct use of the data to embed and extract the watermark,
  e.g. voltage values for audio data.

• Transform-Based Watermarking
  Conversion of the data to another domain to embed and extract,
  e.g. converting a 3D model to a polar coordinate system makes the watermark robust against scaling.
Object Detection

Extraction Categorization

• Informed (Private): extract using {D, K, W}
• Semi-Blind (Semi-Private): extract using {K, W}
• Blind (Public): extract using {K}

- Blind extraction requires less information storage.
- Informed techniques are more robust to tampering.
Object Detection

Robustness Categorization

• Fragile (for tamper proofing; e.g. losing the watermark implies tampering)

• Semi-Fragile (robust against user-level operations, e.g. image compression)

• Robust (against adversary-based attacks, e.g. noise addition to images)

This categorization is application dependent.
Object Detection

Categorization of Watermarks

E.g. 1: Robust, private, spatial watermarks
E.g. 2: Blind, fragile, DCT-based watermarks
E.g. 3: Blind, semi-fragile, spatial watermarks
Object Detection

Watermarking Example

Application: Copyright Protection


Design Requirements:
- Imperceptibility
- Capacity
- Robustness
- Security
Object Detection

Imperceptibility

Stanford Bunny 3D model: visible watermarks in the bunny model → distortion.

Stanford Bunny 3D model: invisible watermarks in the bunny model → minimal distortion.
Object Detection

Robustness

Adversaries can attack the data set and remove the watermark.

Attacks are generally data dependent, e.g. compression that adds noise can be used as an attack to remove the watermark. Different data types can have different compression schemes.
Object Detection

Robustness

• Value Change Attacks
  - Noise addition, e.g. lossy compression
  - Uniform affine transformation, e.g. a 3D model being rotated in 3D space, or an image being scaled

If the encoding of the watermark depends on the data values → the watermark is lost → the extraction process fails.
Object Detection

Robustness

• Sample Loss Attacks
  - Cropping, e.g. cropping in images
  - Smoothing, e.g. smoothing of audio signals
  - Change in sample rates, e.g. a change in the sampling rate of audio data results in a loss of samples

If the watermark is encoded in parts of the data set which are lost → the watermark is lost → the extraction process fails.
Object Detection

Robustness

• Reorder Attack
  - Reversal of the sequence of data values, e.g. a reverse filter on an audio signal reverses the order of data values in time:

    samples (0, 1, 1) at times 1, 2, 3  →(attack)→  samples (1, 1, 0) at times 1, 2, 3

If the encoding depends on an order and the order is changed → the watermark is lost → the extraction process fails.
Object Detection

Capacity

• Multiple watermarks can be supported.

• More capacity implies more robustness, since watermarks can be replicated.

• Spatial methods have higher capacity than transform-based techniques.
Object Detection

Security

• If the key used during watermarking is lost, anyone can read the watermark and remove it.

• If the watermark is public, it can be re-encoded and the copyright information is lost.
Object Detection

Watermarking Algorithm Design Requirements

• As much information (as many watermarks) as possible → Capacity
• Only accessible by authorized parties → Security
• Resistance against hostile/user-dependent changes → Robustness
• Invisibility → Imperceptibility
Object Detection

• What objects are in the scene?

• Can you locate them?

• How did you locate them?
Object Detection

Object Classification, Detection and Segmentation
Object Detection

Classification and Localization

• Suppose there are five categories of objects with their corresponding labels:
  • Flower (1: [1,0,0,0,0])
  • Fruit (2: [0,1,0,0,0])
  • Bird (3: [0,0,1,0,0])
  • Insect (4: [0,0,0,1,0])
  • Background only, i.e. none of the above (5: [0,0,0,0,1])
• The CNN output would be 'flower' with a bounding box: centre, height, and width.

(Figure: flower with bounding box.)
Object Detection

Classification and Localization

• Classification with localization: suppose there are four object categories (Flower, Bird, Insect) plus Background.
• A fully connected layer maps the 4096-d feature vector to 5 class scores, e.g. Flower: 0.92, Fruit: 0.023, Bird: 0.02, Insect: 0.07, Background: 0.32.
• The CNN output would be 'flower' with a bounding box: centre, height, and width.
• Localisation is a regression problem!

(Figure: flower with bounding box.)
Object Detection

Detection as a Regression Problem

(Figure: three example images, one with 2 classes and 4 boxes, one with 1 class and 1 box, one with 1 class and 4 boxes.)

Each image can give a different number of outputs!
Object Detection

Object Detection: Data Labels

• Two classes, four instances. How will you label?
• With the five-class dataset above, the label is:

( [1,0,0,0,0], bounding box of flower location 1,
  [0,0,1,0,0], bounding box of bird,
  [1,0,0,0,0], bounding box of flower location 2,
  [1,0,0,0,0], bounding box of flower location 3 )
Object Detection

Object Detection Methods Using CNNs

• Two types of methods:
• Two-stage methods: initial feature extraction and then classification of each segmented/local region.
• Single-stage methods: both object classification and localisation in a single pass through the CNN.
Object Detection

Region Based CNN (R-CNN): Two-Stage Method

• Region proposal: propose category-independent regions of interest by selective search (~2000 per image).
• Classification of regions: use a CNN for feature extraction and an SVM for classification.

Image source: Girshick et al., 2014
Object Detection

Region Based CNN (R-CNN)...

• Category-independent region proposals: define the set of candidate regions for the detector.
• A large convolutional neural network: extracts a fixed-length feature vector from each region.
• A set of class-specific linear SVMs: provides binary classification for each proposal.
Object Detection

Region Based CNN (R-CNN)...

• Selective search for region proposals:
• Start with thousands of tiny initial regions (divide the image into a grid and process the grid cells to extract information).

• Use a greedy algorithm to grow a region: similar regions are merged, with the similarity measure S between regions a and b defined as (a sketch of the merge loop follows below):

    S(a, b) = S_size(a, b) + S_texture(a, b)
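
A hedged sketch of this greedy merge loop (my simplification; the real size and texture measures come from the selective search paper by Uijlings et al., here stood in for by a pixel-count term and a histogram intersection):

```python
import numpy as np

def s_size(a, b, image_size):
    # Encourages small regions to merge early.
    return 1.0 - (a["size"] + b["size"]) / image_size

def s_texture(a, b):
    # Histogram intersection of (normalized) texture descriptors.
    return np.minimum(a["hist"], b["hist"]).sum()

def similarity(a, b, image_size):
    return s_size(a, b, image_size) + s_texture(a, b)

def merge(a, b):
    size = a["size"] + b["size"]
    hist = (a["hist"] * a["size"] + b["hist"] * b["size"]) / size  # size-weighted
    return {"size": size, "hist": hist}

def selective_search(regions, image_size):
    # Repeatedly merge the most similar pair; every region ever formed is a proposal.
    proposals = list(regions)
    while len(regions) > 1:
        pairs = [(similarity(a, b, image_size), i, j)
                 for i, a in enumerate(regions)
                 for j, b in enumerate(regions) if i < j]
        _, i, j = max(pairs)
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)
    return proposals

rng = np.random.default_rng(0)
seeds = [{"size": 10, "hist": rng.dirichlet(np.ones(8))} for _ in range(4)]
print(len(selective_search(seeds, image_size=100)))  # 7 proposals from 4 seed regions
```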
Object Detection

Selective Search for Region Proposals

Image source: https://jhui.github.io/2017/03/15/Fast-R-CNN-and-Faster-R-CNN/
Object Detection

Main Drawback of R-CNN and Improvement by Fast R-CNN

• Very slow in training and inference.
• Nearly 2,000 region proposals must be processed by a CNN to extract features, so R-CNN repeats the CNN feature extraction process approximately 2,000 times per image.
• Fast R-CNN was introduced by Girshick et al. (2015) to overcome this processing issue.

Image source: https://jhui.github.io/2017/03/15/Fast-R-CNN-and-Faster-R-CNN/
Object Detection

Faster R-CNN

• Faster R-CNN does not use a separate region proposal method to create region proposals.

• A region proposal network (RPN) is trained to extract region proposals from the feature maps.

• These proposals are then fed into the Region of Interest (RoI) pooling layer of a Fast R-CNN type network.

Ren et al. (2015)
Object Detection

Region Proposal Network

Image source: Ren et al. (2015)

Faster R-CNN exhibited good performance of up to 17 frames per second (fps) and 70% mAP (mean average precision), yet it is still not fast enough for many real-time applications.
Object Detection

YOLO: You Only Look Once

• A single-shot detector that runs a single CNN once over the whole image for all the objects in the scene.
• Basic idea: predict a class and a bounding box for every location in a grid.

https://arxiv.org/abs/1506.02640 (Redmon et al., CVPR 2016)
Object Detection

YOLO Features

• Computationally fast; can be used in real-time environments.
• Processes the entire image globally, only once, with a single CNN.
• Learns generalizable representations.
• Maintains a high accuracy range.
Object Detection

How Does YOLO Work?

• The algorithm "only looks once" at the image.
• It needs only one forward propagation pass through the network to make predictions. The network reasons globally about the full image and all the objects in the image in one go.
• It uses features from the entire image to predict each bounding box for objects.
• It also predicts all bounding boxes across all classes for an image simultaneously.
• It then outputs the recognized objects together with their bounding boxes after a process called non-max suppression (described shortly).
• The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
Object Detection

YOLO

If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Object Detection

YOLO

Each grid cell predicts B bounding boxes and confidence scores for those boxes.

These confidence scores reflect how confident the model is that the box contains an object, and also how accurate it thinks the predicted box is.
Object Detection

YOLO Algorithm

(Figure: each predicted bounding box is parameterized by its center (x, y), width w, and height h.)

Example

Source: Andrew Ng's lectures on YOLO, Coursera
Object Detection

For each anchor box, compute the elementwise product to extract the probability that the box contains a certain class.

Source: Andrew Ng's lectures on YOLO, Coursera


Object Detection

YOLO's Prediction

• For each of the 19x19 grid cells, find the maximum of the probability scores (taking the max across both the 5 anchor boxes and the different classes).
• Color that grid cell according to which object that grid cell considers most likely.
Object Detection

Too Many Boxes!
Object Detection

Dealing with Anchor Boxes

• Two-stage filtering out of anchor boxes:

• Set a threshold on the confidence of a box detecting a class.
• Ignore boxes with a low score, that is, when the box is not very confident about detecting a class.

• Select only one box when several boxes overlap with each other and detect the same object. How?
Object Detection

IoU: Intersection over Union

For an estimated bounding box B1 and a ground truth bounding box B2:

    IoU = area(B1 ∩ B2) / area(B1 ∪ B2)
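
For instance (a made-up example): if B1 = (0, 0, 2, 2) and B2 = (1, 1, 3, 3), the intersection is a 1 x 1 square of area 1 and the union has area 4 + 4 - 1 = 7, so IoU = 1/7 ≈ 0.14, which would fail a typical 0.5 threshold.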
Object Detection

First-Level Filtering Out (Boxes)

Remove all boxes whose score is less than the threshold, as in the sketch below.
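
A minimal sketch of the score computation and first-level filter (the 19x19 grid and 5 anchors are from the slides; the 80 classes and the 0.6 threshold are assumed example values):

```python
import numpy as np

box_confidence = np.random.rand(19, 19, 5, 1)     # P(object) per box
class_probs    = np.random.rand(19, 19, 5, 80)    # P(class | object)

box_scores       = box_confidence * class_probs   # elementwise product
box_classes      = box_scores.argmax(axis=-1)     # most likely class per box
box_class_scores = box_scores.max(axis=-1)        # that class's score

keep = box_class_scores >= 0.6                    # first-level score filter
print(keep.sum(), "boxes survive out of", keep.size)
```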
Object Detection

Non-Max Suppression

• A second-level filter for selecting the right boxes:

1. Select the box that has the highest score.
2. Compute its overlap with all other boxes, and remove the boxes that overlap it more than the IoU threshold.
3. Go back to step 1 and iterate until there are no more boxes with a lower score than the currently selected box.
Object Detection

YOLO CNN Architecture
Object Detection

YOLO Versions

• YOLOv1: Joseph Redmon (June 2015)
• YOLOv2-v3: Joseph Redmon and Ali Farhadi (2016-18)
• YOLOv4: Alexey Bochkovskiy (April 2020)
• YOLOv5: Glenn Jocher (May 2020)
• Controversies and comparisons:
  https://medium.com/deelvin-machine-learning/yolov4-vs-yolov5-db1e0ac7962b
Object Detection

SSD: Single Shot MultiBox Detector

• Developed by Liu et al. (December 2015); as reported in their paper:
• Faster than YOLO, and as accurate as two-stage methods like Faster R-CNN.
• Predicts categories and box offsets.
• Uses small convolutional filters applied to feature maps.
• Makes predictions using feature maps of different scales.
Object Detection

SSD Framework

• SSD only needs an input image and ground truth boxes for each object during training.

• Through the CNN, a small set of default boxes of different aspect ratios is evaluated at each location.

• This is done in several feature maps with different scales (e.g. 8 x 8 and 4 x 4).
Object Detection

SSD Framework

• For each default box, both the shape offsets and the confidence scores for all object categories are predicted.

• At training time, these default boxes are matched with the ground truth boxes.

• The model loss is a weighted sum of the localization loss and the confidence loss, as sketched below.
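
As a one-line sketch: in the SSD paper the objective is L = (1/N)(L_conf + α·L_loc), with α = 1 by default and N the number of matched default boxes (these details come from the paper, not the slide):

```python
def ssd_loss(conf_loss, loc_loss, n_matched, alpha=1.0):
    # Weighted sum of confidence and localization losses, normalized by matches.
    return (conf_loss + alpha * loc_loss) / max(n_matched, 1)
```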
Object Detection

SSD Framework

Two default boxes are matched with the cat and one with the dog; these are treated as positives and the rest as negatives.
Object Detection

SSD Model vs YOLO Model

Image source: Liu et al., SSD: Single Shot MultiBox Detector, December 2015.
Object Detection

Default Boxes and Aspect Ratios

• Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.

Image source: Liu et al., SSD: Single Shot MultiBox Detector, December 2015.
Object Detection

Feature Maps of Different Scales

• Lower-resolution feature maps detect larger-scale objects, and higher-resolution feature maps detect smaller-scale objects.

Image source: https://medium.com/@jonathan_hui
Object Detection

Feature Pyramids of Different Scales

Source: Zhao et al. (2018)
Object Detection

Results

Source: https://towardsdatascience.com

Object Detection

Real-Time Performance Evaluation

• Let us check out their real-time performance.
Object Detection

M2Det

• Zhao et al. (2018) introduced a new single-shot object detector based on a multi-level feature pyramid network.
• Apart from scale variation, appearance-complexity variation should also be considered for the object detection task.
• Object instances with similar sizes can be quite different in appearance.
• M2Det adds a new dimension to multi-scale detection: multi-level learning.
• Deeper levels learn features for objects with more appearance-complexity variation (e.g. a pedestrian on a road), while shallower levels learn features for simpler objects (e.g. a traffic light).
Object Detection

M2Det

Source: Zhao et al. (2018)

Object Detection

M2Det: Multi-Level Features

Source: Zhao et al. (2018)

Object Detection

M2Det vs SSD: Feature Maps

(Figure: SSD feature maps compared with M2Det's multi-level feature maps.)

Source: Zhao et al. (2018)
Object Detection

M2Det

• Three modules: Feature Fusion Module (FFM), Thinned U-shape Module (TUM), and Scale-wise Feature Aggregation Module (SFAM).

Source: Zhao et al. (2018)
Object Detection

M2Det: Feature Fusion Module

• FFMv1 enriches the semantic information in the base features by fusing feature maps of the backbone.

• FFMv2 modules extract multi-level multi-scale features together with the TUMs.

Source: Zhao et al. (2018)
Object Detection

M2Det: Thinned U-shape Module

• Each TUM generates a group of multi-scale features.

• TUMs and FFMv2s together extract multi-level multi-scale features.

Source: Zhao et al. (2018)
Object Detection

M2Det: Scale-wise Feature Aggregation Module

• SFAM aggregates the multi-level multi-scale features generated by the TUMs into a multi-level feature pyramid.

Source: Zhao et al. (2018)
Object Detection

Distinguish between these three computer vision tasks:

• Image Classification: Predict the type or class of an object in an image.
  • Input: An image with a single object, such as a photograph.
  • Output: A class label (e.g. one or more integers that are mapped to class labels).
• Object Localization: Locate the presence of objects in an image and indicate their location with a bounding box.
  • Input: An image with one or more objects, such as a photograph.
  • Output: One or more bounding boxes (e.g. defined by a point, width, and height).
• Object Detection: Locate the presence of objects with a bounding box and the types or classes of the located objects in an image.
  • Input: An image with one or more objects, such as a photograph.
  • Output: One or more bounding boxes (e.g. defined by a point, width, and height), and a class label for each bounding box.


Convolutional Neural Networks


Object Detection

With the rise of autonomous vehicles, smart video surveillance, facial detection and
various people counting applications, fast and accurate object detection systems are rising
in demand. These systems involve not only recognizing and classifying every object in an
image, but localizing each one by drawing the appropriate bounding box around it. This
makes object detection a significantly harder task than its traditional computer vision
predecessor, image classification.



Faster R-CNN Object Detection

Faster R-CNN is now a canonical model for deep learning-based object detection. It helped inspire many detection and segmentation models that came after it, including the two others we're going to examine today. Unfortunately, we can't really begin to understand Faster R-CNN without understanding its own predecessors, R-CNN and Fast R-CNN, so let's take a quick dive into its ancestry.

R-CNN
R-CNN is the grand-daddy of Faster R-CNN. In other words, R-CNN really kicked things off. R-CNN, or Region-based Convolutional Neural Network, consisted of 3 simple steps:
1. Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals.
2. Run a convolutional neural net (CNN) on top of each of these region proposals.
3. Take the output of each CNN and feed it into a) an SVM to classify the region, and b) a linear regressor to tighten the bounding box of the object, if such an object exists.


Object Detection


Object Detection

Fast R-CNN
R-CNN's immediate descendant was Fast R-CNN. Fast R-CNN resembled the original in many ways, but improved on its detection speed through two main augmentations:
1. Performing feature extraction over the image before proposing regions, thus running only one CNN over the entire image instead of 2000 CNNs over 2000 overlapping regions.
2. Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model.


Object Detection


Object Detection

As we can see from the image, we are now generating region proposals based on the last feature map of the network, not from the original image itself. As a result, we can train just one CNN for the entire image.

In addition, instead of training many different SVMs to classify each object class, there is a single softmax layer that outputs the class probabilities directly. Now we only have one neural net to train, as opposed to one neural net and many SVMs.


Object Detection

Faster R-CNN
At this point, we're back to our original target: Faster R-CNN. The main insight of Faster R-CNN was to replace the slow selective search algorithm with a fast neural net. Specifically, it introduced the region proposal network (RPN).
Here's how the RPN worked:
• At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d).
• For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes); a sketch of anchor generation follows below.
• Each region proposal consists of a) an "objectness" score for that region, and b) 4 coordinates representing the bounding box of the region.
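
A minimal sketch of generating the k anchor boxes at one sliding-window location (the scales and ratios are assumed example values; 3 scales x 3 ratios gives k = 9):

```python
import itertools, math

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s, r in itertools.product(scales, ratios):
        # Width/height chosen so the box area is s^2 and the aspect ratio is r.
        w, h = s * math.sqrt(r), s / math.sqrt(r)
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(anchors_at(100, 100)))  # 9 anchors at this location
```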


Object Detection

We look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding-window location:


The 2k scores represent the softmax probability of each of the k bounding boxes being an "object." Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an "objectness" score above a certain threshold, that box's coordinates get passed forward as a region proposal.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.


Object Detection

(These steps describe R-FCN, the position-sensitive successor to Faster R-CNN:)

1. Run a CNN (in this case, ResNet) over the input image.
2. Add a fully convolutional layer to generate a score bank of the aforementioned "position-sensitive score maps." There should be k²(C+1) score maps, with k² representing the number of relative positions an object is divided into (e.g. 3² for a 3 by 3 grid) and C+1 representing the number of classes plus the background.
3. Run a fully convolutional region proposal network (RPN) to generate regions of interest (RoIs).
4. For each RoI, divide it into the same k² "bins" or subregions as the score maps.
5. For each bin, check the score bank to see if that bin matches the corresponding position of some object. For example, for the "upper-left" bin, grab the score maps that correspond to the "upper-left" corner of an object and average those values over the RoI region. This process is repeated for each class.
6. Once each of the k² bins has an "object match" value for each class, average the bins to get a single score per class.
7. Classify the RoI with a softmax over the resulting C+1 dimensional vector.


Most Popular Examples: Categorization of Images

Computer vision tasks: classification, object detection, instance segmentation. Convolutional Neural Networks, Evolution of CNN Architectures for Images, Recent CNN Architectures.


Faculty Video Links, YouTube & NPTEL Video Links and Online Courses Details

• YouTube/other video links:

https://nptel.ac.in/courses/106106093/
https://www.youtube.com/watch?v=m-aKj5ovDfg
https://www.youtube.com/watch?v=G4NYQox4n2g


DAILY QUIZ

1. What is computer vision?

2. Name three common applications of computer vision.

3. What is the purpose of image segmentation in computer vision?

4. What is the difference between object detection and object


recognition?

5. Explain the concept of convolution in convolutional neural networks


(CNNs).



DAILY QUIZ

1. What is optical character recognition (OCR) used for in computer vision?

2. What is the purpose of non-maximum suppression in object detection


algorithms?

3. What are some common challenges faced in computer vision tasks?

4. What is the difference between supervised and unsupervised learning in


computer vision?

5. Name three popular deep learning architectures used in computer vision.



WEEKLY ASSIGNMENT

1. Explain the concept of image filtering and provide examples of


commonly used filters in computer vision.
2. Discuss the differences between image classification and object
detection in computer vision. Provide examples of each.
3. Explain the process of feature extraction in computer vision. How are
features used in tasks like object recognition or image matching?
4. Describe the steps involved in building a convolutional neural network
(CNN) for image classification. Discuss the purpose of each step.



WEEKLY ASSIGNMENT

1. Discuss the challenges and potential solutions for handling occlusion in


object detection algorithms.
2. Compare and contrast traditional computer vision techniques with
deep learning-based approaches. What are the advantages and
limitations of each?
3. Explain the concept of image segmentation and its applications in
computer vision. Discuss different segmentation methods.
4. Discuss the concept of optical flow in computer vision. How is it used to
analyze motion in videos or sequences of images?
5. Explain the concept of image registration and its applications in
computer vision. Provide examples of scenarios where image
registration is useful.
6. Discuss the role of data augmentation techniques in computer vision
tasks. How can data augmentation improve the performance of deep
learning models?



WEEKLY ASSIGNMENT

1. Explain the concept of object tracking in computer vision. Discuss different algorithms or
techniques used for object tracking.
2. Describe the process of image recognition using convolutional neural networks (CNNs).
What are the key components and steps involved?
3. Discuss the concept of depth estimation in computer vision. Explain how depth
information can be extracted from 2D images.
4. Explain the concept of image stitching and its applications. How are multiple images
combined to create a panoramic image?
5. Discuss the challenges and approaches for handling scale invariance in object detection
algorithms.
6. Describe the concept of facial recognition in computer vision. Discuss its applications,
advantages, and potential privacy concerns.
7. Explain the concept of semantic segmentation and its applications in computer vision.
Provide examples of scenarios where semantic segmentation is useful.
8. Discuss the concept of object recognition using feature descriptors. Explain popular
feature descriptor algorithms such as SIFT or SURF.
9. Explain the concept of image super-resolution and its applications. How can low-
resolution images be enhanced to improve their quality?
10. Discuss the role of transfer learning in computer vision. How can pre-trained models be
utilized for new tasks or datasets?
MCQs
Question 1: What is computer vision?
A. The study of computers and their components
B. The field of processing and understanding visual data by computers
C. The development of computer software for image editing
D. The study of visual perception in humans

Question 2: Which of the following is an application of computer vision?


A. Speech recognition
B. Natural language processing
C. Object detection
D. Network security

Question 3: Which technique is commonly used for feature extraction in computer vision?
A. Convolutional Neural Networks (CNN)
B. Decision Trees
C. Support Vector Machines (SVM)
D. K-means clustering



MCQs

Question 4: What is the purpose of image segmentation in computer vision?


A. Classifying images into different categories
B. Detecting and recognizing objects in images
C. Enhancing and manipulating image quality
D. Dividing an image into meaningful regions or segments
Question 5: Which of the following is an example of an object recognition task in
computer vision?
A. Determining the sentiment of an image
B. Identifying the boundaries of objects in an image
C. Recognizing specific objects in an image, such as cars or faces
D. Analyzing the texture or color distribution of an image
Question 6: Which technique is commonly used for image classification in computer
vision?
A. Principal Component Analysis (PCA)
B. Naive Bayes classifier
C. Latent Semantic Analysis (LSA)
D. Convolutional Neural Networks (CNN)
MCQs (Cont'd)

1. What is computer vision? Answer: Computer vision is a field of artificial intelligence that focuses on enabling computers to interpret and understand visual information from images or videos.
2. Name three common applications of computer vision. Answer: Autonomous vehicles, object recognition, and medical image analysis.
3. What is the purpose of image segmentation in computer vision? Answer: Image segmentation aims to partition an image into meaningful regions or segments to facilitate object detection, tracking, or analysis.
4. What is the difference between object detection and object recognition? Answer: Object detection involves both localizing and classifying objects within an image, while object recognition focuses solely on identifying objects without localizing them.
5. Explain the concept of convolution in convolutional neural networks (CNNs). Answer: Convolution involves applying a filter/kernel to an input image or feature map, computing element-wise multiplications, and summing the results to produce a feature map (see the sketch below).
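
To make the convolution described in answer 5 concrete, here is a minimal NumPy sketch (illustrative only, not taken from the slides or any library; the function and variable names are our own). Like most CNN frameworks, it actually computes cross-correlation, i.e., the kernel is applied without flipping:

import numpy as np

def convolve2d(image, kernel):
    # "Valid" convolution: the kernel stays fully inside the image, no padding.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

image = np.random.rand(8, 8)                    # toy 8x8 grayscale image
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])          # a simple edge-detection filter
feature_map = convolve2d(image, edge_kernel)
print(feature_map.shape)                        # (6, 6)

A 3x3 kernel over an 8x8 image yields a 6x6 feature map; in a CNN the kernel values are learned during training rather than handcrafted as here.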
MCQs (Cont'd)

1. What is optical character recognition (OCR) used for in computer vision? Answer: OCR is used to convert printed or handwritten text from images into machine-readable text, enabling automated text analysis or data extraction.
2. What is the purpose of non-maximum suppression in object detection algorithms? Answer: Non-maximum suppression is used to eliminate redundant bounding box detections by keeping only the most confident detection and suppressing overlapping or lower-confidence detections (see the sketch after this list).
3. What are some common challenges faced in computer vision tasks? Answer: Variations in lighting conditions, occlusion, viewpoint changes, and limited labeled data are common challenges in computer vision tasks.
4. What is the difference between supervised and unsupervised learning in computer vision? Answer: Supervised learning requires labeled training data, where input images are associated with corresponding ground-truth labels. Unsupervised learning involves learning patterns or structures from unlabeled data without explicit labels.
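
As an illustration of the non-maximum suppression described in answer 2, here is a minimal NumPy sketch (the box format (x1, y1, x2, y2), function name, and threshold are our own assumptions, not a specific library's API):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # Greedily keep the most confident box, then drop boxes that overlap it too much.
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection-over-union of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # suppress heavy overlaps
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is dropped

Production detectors usually call an optimized routine (for example torchvision.ops.nms), but the logic is the same: sort by confidence, keep the best box, and discard boxes whose IoU with it exceeds the threshold.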



Old Question Papers

• Why do we use data augmentation?

• What are the metrics used for object detection?


• When do you say that an object detection method is efficient?
• How many types of recognition are there in artificial intelligence?
• How can you evaluate the predictions in an Object Detection model?
• What are the main steps in a typical Computer Vision pipeline?
• What is the difference between Semantic Segmentation and Instance Segmentation in Computer Vision?
• How do Neural Networks distinguish useful features from non-useful features in Computer Vision?
• How would you decide when to grayscale the input images for a Computer Vision problem?
• Provide an intuitive explanation of how the sliding window approach works in object detection.
• What image noise filtering techniques do you know?





EXPECTED QUESTIONS FOR UNIVERSITY EXAM


SUMMARY

 Computer vision is a field of study that deals with the extraction of information
from images or videos to understand and interpret visual data. It involves the
development of algorithms and techniques to enable computers to perceive and
understand the visual world in a way similar to humans.

 Computer vision encompasses various tasks and applications, including image classification, object detection, image segmentation, facial recognition, scene understanding, and video analysis. These tasks involve processing and analyzing visual data to extract meaningful information and make decisions based on it.

 The fundamental concepts in computer vision include image formation, image processing, feature extraction, and pattern recognition. Image formation deals with how images are captured and represented using pixels. Image processing techniques are used to enhance and manipulate images to improve their quality or extract specific information. Feature extraction involves identifying relevant visual characteristics or patterns from images that can be used for tasks like object recognition or tracking.


 Computer vision techniques employ both traditional computer vision algorithms and deep learning approaches. Traditional algorithms rely on handcrafted features and mathematical models to process and analyze visual data. Deep learning methods, particularly convolutional neural networks (CNNs), have gained popularity in recent years due to their ability to learn directly from raw pixel data and achieve state-of-the-art results in various computer vision tasks.

 Computer vision finds applications in diverse fields such as autonomous vehicles, surveillance systems, medical imaging, robotics, augmented reality, and industrial automation. It plays a crucial role in enabling machines to understand and interact with the visual world, opening up possibilities for advanced applications and advancements in numerous domains.

 As computer vision continues to evolve, researchers and practitioners explore new techniques, algorithms, and applications to tackle more complex challenges and improve the accuracy and efficiency of visual understanding by machines.



REFERENCES

• "Computer Vision: Algorithms and Applications" by Richard Szeliski


• This comprehensive book covers the fundamental concepts and algorithms in computer
vision, including image formation, image features, stereo vision, multiple view geometry,
and object recognition. It also includes numerous examples and MATLAB code snippets.
• "Deep Learning for Computer Vision with Python" by Adrian Rosebrock
• This book focuses on applying deep learning techniques to solve computer vision
problems. It covers topics such as convolutional neural networks (CNNs), image
classification, object detection, and image segmentation. The book provides practical
examples and code implementations using Python and the Keras library.
• "Computer Vision: Models, Learning, and Inference" by Simon J.D. Prince
• This book provides a comprehensive introduction to computer vision, covering various
topics such as image formation, filtering, feature detection and matching, object
recognition, and 3D reconstruction. It also includes discussions on statistical and
probabilistic models in computer vision.



Any Certification/Courses for this subject

Course Name: Introduction to Computer Vision and Image Processing
Offered By: IBM
Duration: 1-4 weeks
Rating: 4.4
Link: https://www.coursera.org/learn/introduction-computer-vision-watson-opencv#about

Course Name: Computer Vision Basics
Offered By: University at Buffalo
Duration: 1-4 weeks
Rating: 4.2
Link: https://www.coursera.org/learn/computer-vision-basics#syllabus



OPTIONAL / CASE STUDIES

Factors influencing the adoption of new mobility technologies and services:

1. Drone Systems and Applications (Healthcare, Agriculture, Security)
2. Autonomous Vehicular Systems
3. Motion Prediction for Autonomous Vehicles
4. Clinical Applications of Robotic Surgery



Thank You
