COMPUTER VISION UNIT 4
Object Detection
Unit: 4
2 Architectures: Representation of a three-dimensional moving scene; convolutional layers, pooling layers, and padding; transfer learning and pre-trained models. Architecture design: LeNet-5, AlexNet, VGGNet, GoogLeNet, ResNet, EfficientNet, MobileNet. RNN introduction; the perceptron; backpropagation in CNNs and RNNs.
3 Segmentation: Popular image segmentation architectures; FCN architecture; upsampling methods; pixel transformations; geometric operations; spatial operations in image processing; instance segmentation; localisation; object detection and image segmentation using CNNs; LSTMs and GRUs. Vision models; vision languages; quality analysis; visual dialogue; other attention models; self-attention and transformers. Active contours and applications; split and merge; mean shift and mode finding; normalized cuts.
4 Object Detection: Object detection and sliding windows; R-CNN; Fast R-CNN; object recognition; 3-D vision and geometry; digital watermarking. Object detection; face recognition; instance recognition; category recognition of objects, scenes, and activities; object classification and detection. Encoder in code; decoder in code; U-Net code (encoder, decoder). Few-shot and zero-shot learning; self-supervised learning; adversarial robustness; pruning and model compression; neural architecture search; objects in scenes. YOLO; fundamentals of image formation; convolution and filtering.
5 Visualization and Deep Generative Models: Benefits of interpretability; Fashion-MNIST class activation map code walkthrough; Grad-CAM; ZFNet. Image compression methods and their requirements; statistical compression; spatial compression; contour coding. Deep generative models introduction; generative adversarial networks; combining VAEs and GANs; other VAE- and GAN-based deep generative models. GAN improvements; deep generative models across multiple domains; deep generative models for image and video applications.
Course Objective: To learn the key features of computer vision, and to design, implement, and continuously improve the accuracy and outcomes on various datasets, with more reliable and concise analysis results.
CO4: Design and deploy deep learning models with the help of use cases. (K5)
CO5: Understand and analyse different theories of deep learning using neural networks. (K3)
Computer Vision: CO-PO Mapping
[Table: each course outcome mapped to programme outcomes PO1-PO12 with correlation levels 1-3, plus an average row.]
B.TECH (SEM-VII) THEORY EXAMINATION 20__-20__
Computer Vision
Time: 3 Hours    Total Marks: 100
Note: 1. Attempt all Sections. If any required data is missing, choose it suitably.
SECTION A
1. Attempt all questions in brief. (2 x 10 = 20)
Q.No.   Question   Marks   CO
1                  2
2                  2
...                ...
10                 2
3. Attempt any one part of the following: (1 x 10 = 10)
Q.No.   Question   Marks   CO
1                  10
2                  10
4. Attempt any one part of the following: (1 x 10 = 10)
Q.No.   Question   Marks   CO
1                  10
2                  10
5. Attempt any one part of the following: (1 x 10 = 10)
Q.No.   Question   Marks   CO
1                  10
2                  10
6. Attempt any one part of the following: (1 x 10 = 10)
Q.No.   Question   Marks   CO
1                  10
2                  10
• R-CNN
• Fast R-CNN
• Object recognition
• 3-D vision and geometry
• Digital watermarking
• Object detection
• Face recognition
• Instance recognition
• Category recognition: objects, scenes, activities
• Object classification
Students should be able to create, design, and manipulate images, and have a working understanding of pixels.
Prerequisites
No prior experience with computer vision is assumed, although previous knowledge of visual computing or signal processing will be helpful (e.g., CSCI 1230). The following skills are necessary for this class:
• Math: Linear algebra, vector calculus, and probability. Linear algebra is the most important and is required.
• Data structures: You will write code that represents images as matrices, high-dimensional features, and geometric constructions.
• Programming and toolchains: A good working knowledge is expected. Intro CS is required, and an intermediate systems course is strongly encouraged.
Object Detection
Motivation
• Image searching
• CCTV surveillance
In addition to predicting the presence of an object within the region proposals, the algorithm also predicts four offset values that increase the precision of the bounding box. For example, given a region proposal, the algorithm might predict the presence of a person, but the face of that person within the region proposal could be cut in half. The offset values help adjust the bounding box of the region proposal.
CNN architecture of R-CNN
• These regions are then warped into square crops of the fixed size required by the CNN model. The CNN model used here is a pre-trained AlexNet, the state-of-the-art CNN for image classification at the time. Let's look at the AlexNet architecture here.
• However, one issue with the SVM training is that it requires AlexNet feature vectors, so AlexNet and the SVMs cannot be trained independently in parallel. This challenge is resolved in later versions of R-CNN (Fast R-CNN, Faster R-CNN, etc.).
• Bounding Box Regressor
• In order to precisely locate the bounding box in the image, we use a scale-invariant linear regression model called the bounding box regressor. To train this model, we take pairs of predicted and ground-truth values for the four dimensions of the localization.
• These dimensions are (x, y, w, h), where x and y are the pixel coordinates of the center of the bounding box, and w and h are the width and height of the bounding box. This method increases the mean average precision (mAP) of the result by 3-4%.
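To make the regressor's role concrete, here is a minimal NumPy sketch of applying predicted offsets (tx, ty, tw, th) to a proposal box. It follows the standard R-CNN-style parameterization; the function and variable names are illustrative.

```python
import numpy as np

def apply_box_deltas(box, deltas):
    """Apply R-CNN-style regression offsets (tx, ty, tw, th) to a
    proposal box given as (x, y, w, h), with (x, y) the box centre.
    A sketch of the standard parameterization; names are illustrative."""
    x, y, w, h = box
    tx, ty, tw, th = deltas
    # Shift the centre by a fraction of the proposal's width/height,
    # and scale width/height exponentially (deltas are log-space ratios).
    new_x = x + w * tx
    new_y = y + h * ty
    new_w = w * np.exp(tw)
    new_h = h * np.exp(th)
    return np.array([new_x, new_y, new_w, new_h])

# Example: nudge a proposal slightly right and make it 10% taller.
print(apply_box_deltas((100, 100, 50, 80), (0.1, 0.0, 0.0, np.log(1.1))))
```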
• From the convolutional feature map, we identify the region of proposals and warp them into
squares and by using a RoI pooling layer we reshape them into a fixed size so that it can be
fed into a fully connected layer. From the RoI feature vector, we use a softmax layer to
predict the class of the proposed region and also the offset values for the bounding box.
• The reason “Fast R-CNN” is faster than R-CNN is because you don’t have to feed 2000
region proposals to the convolutional neural network every time. Instead, the convolution
operation is done only once per image and a feature map is generated from it.
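As an illustration of the RoI pooling step described above, the following sketch uses torchvision's roi_pool on a dummy feature map; the tensor sizes and box coordinates are made up for the example.

```python
import torch
from torchvision.ops import roi_pool

# One image's feature map: (batch=1, channels=256, H=50, W=50),
# produced once per image by the backbone CNN.
features = torch.randn(1, 256, 50, 50)

# Two RoIs in feature-map coordinates: (batch_idx, x1, y1, x2, y2).
rois = torch.tensor([[0.,  4.,  4., 24., 30.],
                     [0., 10., 12., 40., 44.]])

# RoI pooling reshapes each variable-sized region to a fixed 7x7 grid,
# so the result can be fed into fully connected layers.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```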
• Faster R-CNN
What knowledge do you use to analyze this image?
What objects are shown in this image?
How can you estimate distance from the camera? What feature changes with distance?
3D Shape from X
• shading, silhouette, texture (mainly research)
• stereo, light striping, motion (used in practice)
Perspective in 2D (Simplified)
[Figure: a 3D object point P = (xc, yc, zc) projects along a ray through the focal point F onto the image plane at P' = (xi, yi, f); f is the focal length and the z-axis is the optical axis. Here camera coordinates equal world coordinates (zw = zc).]
By similar triangles:
xi / f = xc / zc   =>   xi = (f / zc) xc
yi / f = yc / zc   =>   yi = (f / zc) yc
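These similar-triangle relations are easy to check numerically; a small Python sketch (function name illustrative):

```python
def project(point_cam, f):
    """Pinhole projection of a 3D point (xc, yc, zc), given in camera
    coordinates, onto the image plane at focal length f, using
    xi = (f/zc)*xc and yi = (f/zc)*yc."""
    xc, yc, zc = point_cam
    return (f / zc) * xc, (f / zc) * yc

# A point twice as far away projects half as large:
print(project((2.0, 1.0, 10.0), f=0.05))  # (0.01, 0.005)
print(project((2.0, 1.0, 20.0), f=0.05))  # (0.005, 0.0025)
```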
3D from Stereo
[Figure: a 3D point P = (x, z) viewed by two cameras separated by baseline b; the y-axis is perpendicular to the page.]
From similar triangles:
x / xl = (x - b) / xr = z / f   and   y / yl = y / yr = z / f
so the depth is z = f b / (xl - xr), where (xl - xr) is the disparity.
Finding Correspondences (dense or sparse)
[Figure: points P1 and P2 seen from camera centres C1 and C2 separated by baseline b; the epipolar plane cuts each image in an epipolar line.]
The match for P1 (or P2) in the other image must lie on the same epipolar line.
Epipolar Geometry: General Case
[Figure: a point P seen by two cameras with centres C1 and C2; its images P1 and P2 lie on epipolar lines through the epipoles e1 and e2.]
Constraints
1. Epipolar constraint: matching points lie on corresponding epipolar lines.
2. Ordering constraint: matches usually appear in the same order along the corresponding epipolar lines.
[Figure: points P and Q viewed from camera centres C1 and C2, with epipoles e1 and e2.]
Structured Light
3D data can also be derived using a single camera and a light source that projects a light stripe onto the object.
[Figure: light source and camera arranged for light-stripe triangulation.]
Structured Light: 3D Computation
With the camera and projector in a triangulated setup, a 3D point (x, y, z) is recovered from its image point (x', y') and the stripe angle θ via a relation of the form (x, y, z) ∝ (x', y', f) / (f cot θ - x').
[Figure: a projector and camera scanning a 3D object on a rotation table.]
The Coordinate Frames
• Object
• World (W)
• Camera (C)
• Real image
• Pixel image
[Figure: a pyramid object shown with its object, world (xw, yw, zw), camera (xc, yc, zc), real-image (xf, yf), and pixel-image (xp, yp) coordinate frames.]
[Figure: the object model is rotated, translated, and scaled to place an instance of the object in the world frame.]
Rotation by angle θ about the x-axis:

| Px' |   | 1    0       0     0 | | Px |
| Py' | = | 0  cos θ  -sin θ   0 | | Py |
| Pz' |   | 0  sin θ   cos θ   0 | | Pz |
| 1   |   | 0    0       0     1 | | 1  |
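A quick NumPy check of this rotation (the helper name is illustrative):

```python
import numpy as np

def rot_x(theta):
    """Homogeneous 4x4 rotation by angle theta about the x-axis,
    matching the matrix above."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0,  0, 0],
                     [0, c, -s, 0],
                     [0, s,  c, 0],
                     [0, 0,  0, 1]])

# Rotating the point (0, 1, 0) by 90 degrees about x gives (0, 0, 1).
P = np.array([0, 1, 0, 1])
print(np.round(rot_x(np.pi / 2) @ P, 3))  # [0. 0. 1. 1.]
```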
The camera matrix C combines the perspective transformation (f) with the translation T and rotations R1, R2. In homogeneous coordinates:

| s Ipr |   | c11 c12 c13 c14 | | Px |
| s Ipc | = | c21 c22 c23 c24 | | Py |
| s     |   | c31 c32 c33  1  | | Pz |
                                | 1  |

image point = camera matrix C x world point
What's in C?
The camera model handles the rigid-body transformation from world coordinates to camera coordinates plus the perspective transformation to image coordinates.
1. CP = T R WP
2. IP = (f) CP (perspective projection with focal length f)

| s IPx |   | 1 0  0    0 | | CPx |
| s IPy | = | 0 1  0    0 | | CPy |
| s     |   | 0 0  1/f  0 | | CPz |
                            |  1  |

image point = perspective transformation x 3D point in camera coordinates
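Putting the two steps together, here is a minimal NumPy sketch of the world-to-image pipeline, assuming a given rotation R, translation t, and focal length f (all names illustrative):

```python
import numpy as np

def world_to_image(P_world, R, t, f):
    """Step 1: CP = T R WP (rigid-body transform into camera coords).
    Step 2: perspective transform of CP with focal length f.
    A sketch of the two-step model above; R is 3x3, t is a 3-vector,
    P_world is a homogeneous 4-vector."""
    TR = np.eye(4)                 # build the 4x4 rigid-body transform
    TR[:3, :3] = R
    TR[:3, 3] = t
    CP = TR @ P_world              # camera coordinates

    persp = np.array([[1, 0, 0,     0],   # perspective matrix with
                      [0, 1, 0,     0],   # third row (0, 0, 1/f, 0)
                      [0, 0, 1 / f, 0]])
    s_ip = persp @ CP              # (s*IPx, s*IPy, s)
    return s_ip[:2] / s_ip[2]      # divide out the scale s

# Identity rotation, camera 10 units back along z, focal length 0.05:
print(world_to_image(np.array([2.0, 1.0, 0.0, 1.0]),
                     np.eye(3), np.array([0, 0, 10.0]), f=0.05))
# -> [0.01 0.005], matching the similar-triangle example earlier
```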
Camera Calibration
• In order to work in 3D, we need to know the parameters of the particular camera setup.
Intrinsic parameters
• focal length f
Extrinsic parameters
• translation parameters t = [tx ty tz]
• rotation matrix
Calibration Object
[Figure: a 3D calibration point Pi = (xi, yi, zi) projects to the image point pi = (ui, vi) on the image plane, with principal point p0.]
Tsai's Procedure
• Estimates:
  • the 9 rotation matrix values
  • the 3 translation values
  • the focal length
  • the lens distortion factor
[Figure: corresponding points P1 = (r1, c1) and P2 = (r2, c2) in two images with camera centres C1 and C2 and epipoles e1 and e2.]
For a correspondence (r1, c1) in image 1 to (r2, c2) in image 2, the match must lie on the corresponding epipolar line.
Digital Watermarking
Information Hiding
Microchip - Application
What is a watermark? A distinguishing mark impressed on paper during manufacture; visible when the paper is held up to the light (e.g., a dollar bill).
Digital watermarking: an application of information hiding (hiding watermarks in digital media, such as images).
Applications
Comparison: Watermarking vs. Cryptography
Embed (D, W, K) = Dw
Watermarking Process
Example – Embedding (Dw = D + W)
• Matrix representation (12 blocks – 3 x 4 matrix)
(Algorithm Used: Random number generator RNG), Seed for RNG = K,
D = Matrix representation, W = Author’s name
1 2 3 4
5 6 7 8
9 10 11 12
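A toy Python sketch of this additive scheme: the seed K drives a random generator that picks which blocks to modify. The function names and the bit-vector watermark are illustrative; the extraction shown is the informed (private) variant in the categorization below, since it needs the original D.

```python
import numpy as np

def embed(D, W_bits, K):
    """Dw = D + W: add watermark values into blocks chosen by a
    random-number generator seeded with the key K."""
    rng = np.random.default_rng(K)
    positions = rng.choice(D.size, size=len(W_bits), replace=False)
    Dw = D.copy()
    Dw[positions] += W_bits            # additive spatial embedding
    return Dw

def extract(Dw, D, n_bits, K):
    """Regenerate the same random positions from the seed K and read
    the watermark back out (informed extraction: requires D)."""
    rng = np.random.default_rng(K)
    positions = rng.choice(D.size, size=n_bits, replace=False)
    return Dw[positions] - D[positions]

D = np.arange(1, 13, dtype=float)      # the 12 blocks above
W = np.array([1.0, 0.0, 1.0])          # watermark bits
Dw = embed(D, W, K=42)
print(extract(Dw, D, n_bits=3, K=42))  # [1. 0. 1.]
```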
Watermarking Process
Example - Extraction
• The watermark can be identified by regenerating the random numbers using the seed K, which points back to the embedded blocks (e.g., blocks 6, 8, 10).
• Spatial Watermarking: direct use of the data values to embed and extract the watermark (e.g., the voltage values of audio data).
Extraction Categorization
• Informed (private): extract using {D, K, W}
• Semi-blind (semi-private): extract using {K, W}
• Blind (public): extract using {K}
Categorization of Watermarks
Robustness
• Reorder Attack: reversal of the sequence of data values; e.g., a reverse filter on an audio signal reverses the order of the data values in time (the sequence 1 2 3 becomes 3 2 1; a bit pattern 0 1 1 1 1 0 is reversed).
Requirements of a Watermark
• Security: the watermark should be accessible only by authorized parties; if the key used during watermarking is leaked, anyone can read the watermark and remove it.
• Robustness: resistance against hostile or user-dependent changes.
• Capacity: the amount of watermark information that can be embedded.
• Imperceptibility (invisibility)
Object Detection
• What objects are in the scene?
Classification and Localization
Suppose there are five categories of objects with their corresponding labels:
• Flower (1: [1,0,0,0,0])
• Fruit (2: [0,1,0,0,0])
• Bird (3: [0,0,1,0,0])
• Insect (4: [0,0,0,1,0])
• Background only, i.e. none of the above (5: [0,0,0,0,1])
The CNN output would be 'flower' with a bounding box given by its centre, height, and width.
[Figure: a flower with its bounding box.]
Classification and Localization
Suppose there are four categories of objects: flower, bird, insect, and background. A fully connected layer (4096 to 5) outputs the class scores, for example:
Flower: 0.92, Fruit: 0.023, Bird: 0.02, Insect: 0.07, Background: 0.32
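As a sketch, such a head can be written as one linear layer for the class scores plus a parallel one for the box. Only the 4096-to-5 sizing comes from the slide; everything else here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    """Classification-with-localization head: a fully connected layer
    maps the 4096-d feature vector to 5 class scores, and a parallel
    layer regresses the box (cx, cy, w, h)."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.cls_head = nn.Linear(4096, num_classes)  # class scores
        self.box_head = nn.Linear(4096, 4)            # (cx, cy, w, h)

    def forward(self, feats):
        scores = torch.softmax(self.cls_head(feats), dim=-1)
        box = self.box_head(feats)
        return scores, box

feats = torch.randn(1, 4096)           # e.g. from a CNN backbone
scores, box = ClassifyAndLocalize()(feats)
print(scores.shape, box.shape)         # torch.Size([1, 5]) torch.Size([1, 4])
```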
Detection as a Regression Problem
In Selective Search, regions a and b are merged using a similarity that combines size and texture:
S(a, b) = S_size(a, b) + S_texture(a, b)
Source: https://jhui.github.io/2017/03/15/Fast-R-CNN-and-Faster-R-CNN/
Faster R-CNN
• Faster R-CNN does not use an external region proposal method (such as Selective Search) to create region proposals.
Region Proposal Network
• Instead, a Region Proposal Network (RPN) generates the proposals.
• Yet it is still not suitable for real-time applications.
YOLO Features
• Computationally fast; can be used in a real-time environment.
• Processes the entire image globally, only once, with a single CNN.
YOLO
If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.
YOLO
Each grid cell predicts B bounding boxes and confidence scores for those boxes.
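A simplified decoding sketch for a single cell's prediction (one box per cell, no anchor boxes; the names and the tensor layout are assumptions):

```python
def decode_yolo_cell(pred, row, col, S=19):
    """Decode one grid cell's raw prediction (tx, ty, w, h, conf) into
    an image-relative box. (tx, ty) locate the box centre relative to
    the cell that 'owns' the object; w and h are relative to the
    whole image. A simplified single-box sketch."""
    tx, ty, w, h, conf = pred
    cx = (col + tx) / S            # cell offset + position within cell
    cy = (row + ty) / S
    return (cx, cy, w, h), conf

# An object centred slightly right of the middle of cell (9, 9):
box, conf = decode_yolo_cell([0.6, 0.5, 0.2, 0.3, 0.9], row=9, col=9)
print(box, conf)                   # ((0.5053, 0.5, 0.2, 0.3), 0.9)
```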
YOLO Algorithm
[Figure: each predicted box is parameterized by its centre (x, y) and its height h and width.]
Example
YOLO's Prediction
• For each of the 19 x 19 grid cells, take the maximum of the probability scores (a max across both the 5 anchor boxes and the different classes).
• Color each grid cell according to the object that grid cell considers the most likely.
Too Many Boxes!
• Ignore boxes with a low score, that is, boxes that are not very confident about detecting a class.
• Select only one box when several boxes overlap with each other and detect the same object. How?
[Figure: a ground-truth bounding box compared with two predicted boxes B1 and B2.]
Remove all boxes whose score is less than the threshold.
Non-Max Suppression
• A second-level filter for selecting the right boxes.
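A minimal sketch of this two-level filtering: a score threshold followed by greedy IoU-based suppression. The threshold values are illustrative.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Drop low-score boxes, then repeatedly keep the highest-scoring
    box and suppress boxes overlapping it by more than iou_thresh."""
    order = [i for i in np.argsort(scores)[::-1]
             if scores[i] >= score_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2] -- the overlapping box 1 is suppressed
```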
YOLO CNN Architecture
YOLO Versions
• YOLOv1: Joseph Redmon (June 2015)
• YOLOv2-v3: Joseph Redmon and Ali Farhadi (2016-18)
• YOLOv4: Alexey Bochkovskiy (April 2020)
• YOLOv5: Glenn Jocher (May 2020)
• Controversies and comparisons: https://medium.com/deelvin-machine-learning/yolov4-vs-yolov5-db1e0ac7962b
SSD
• Faster than YOLO, and as accurate as two-stage methods like Faster R-CNN.
SSD Framework
• SSD only needs an input image and ground-truth boxes for each object during training.
SSD Framework
• For each default box, both the shape offsets and the confidence scores for all object categories are predicted.
• At training time, these default boxes are matched with the ground-truth boxes. For example, two default boxes may be matched with the cat and one with the dog; those are treated as positives and the rest as negatives.
Image source: Liu et al., Single Shot MultiBox Detector, December 2015.
Results
Source: https://towardsdatascience.com
M2Det
• Zhao et al. (2018) introduced a new single-shot object detector based on a multi-level feature pyramid network.
• Apart from scale variation, appearance-complexity variation should also be considered for the object detection task.
• Object instances of similar size can still look quite different.
• M2Det adds a new dimension to multi-scale detection: multi-level learning.
• Deeper levels learn features for objects with more appearance-complexity variation (e.g., a pedestrian on a road), while shallower levels learn features for simpler objects (e.g., a traffic light).
[Figure: the M2Det architecture compared with SSD.]
M2Det
• Three modules: the Feature Fusion Module (FFM), the Thinned U-shape Module (TUM), and the Scale-wise Feature Aggregation Module (SFAM).
With the rise of autonomous vehicles, smart video surveillance, facial detection and
various people counting applications, fast and accurate object detection systems are rising
in demand. These systems involve not only recognizing and classifying every object in an
image, but localizing each one by drawing the appropriate bounding box around it. This
makes object detection a significantly harder task than its traditional computer vision
predecessor, image classification.
R-CNN
R-CNN is the grand-daddy of Faster R-CNN. In other words, R-CNN really kicked things off.
R-CNN, or Region-based Convolutional Neural Network, consisted of 3 simple steps:
1.Scan the input image for possible objects using an algorithm called Selective Search,
generating ~2000 region proposals
2.Run a convolutional neural net (CNN) on top of each of these region proposals
3.Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear
regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN
R-CNN’s immediate descendant was Fast-R-CNN. Fast R-CNN resembled the original in many
ways, but improved on its detection speed through two main augmentations:
1.Performing feature extraction over the image before proposing regions, thus only running one
CNN over the entire image instead of 2000 CNN’s over 2000 overlapping regions
2.Replacing the SVM with a softmax layer, thus extending the neural network for predictions
instead of creating a new model
As we can see from the image, we are now generating region proposals based on the last
feature map of the network, not from the original image itself. As a result, we can train just one
CNN for the entire image.
In addition, instead of training many different SVM’s to classify each object class, there is a
single softmax layer that outputs the class probabilities directly. Now we only have one neural
net to train, as opposed to one neural net and many SVM’s.
Faster R-CNN
At this point, we’re back to our original target: Faster R-CNN. The main insight of Faster R-
CNN was to replace the slow selective search algorithm with a fast neural net. Specifically, it
introduced the region proposal network (RPN).
Here’s how the RPN worked:
•At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and
maps it to a lower dimension (e.g. 256-d)
•For each sliding-window location, it generates multiple possible regions based on k fixed-
ratio anchor boxes (default bounding boxes)
•Each region proposal consists of a) an “objectness” score for that region and b) 4 coordinates
representing the bounding box of the region
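A sketch of the anchor generation the RPN relies on; the stride, scales, and aspect ratios below are common Faster R-CNN defaults but are assumptions here, as are the names.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) fixed-ratio anchor boxes
    at every sliding-window position of the feature map. Returns
    (feat_h * feat_w * k, 4) boxes as (x1, y1, x2, y2) in image coords."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = j * stride, i * stride   # window centre in the image
            for s in scales:
                for r in ratios:
                    # r is the w/h aspect ratio; the box keeps area s*s.
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(make_anchors(2, 2).shape)   # (36, 4): 2*2 positions x 9 anchors
```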
R-FCN
The same pipeline can be made fully convolutional with position-sensitive score maps (this is the R-FCN architecture):
1. Run a CNN (in this case, ResNet) over the input image.
2. Add a fully convolutional layer to generate a score bank of the aforementioned "position-sensitive score maps." There should be k²(C+1) score maps, with k² representing the number of relative positions to divide an object into (e.g., 3² for a 3-by-3 grid) and C+1 representing the number of classes plus the background.
3. Run a fully convolutional region proposal network (RPN) to generate regions of interest (RoIs).
4. For each RoI, divide it into the same k² "bins" or subregions as the score maps.
5. For each bin, check the score bank to see if that bin matches the corresponding position of some object. For example, if I'm on the "upper-left" bin, I will grab the score maps that correspond to the "upper-left" corner of an object and average those values over the RoI region. This process is repeated for each class.
6. Once each of the k² bins has an "object match" value for each class, average the bins to get a single score per class.
7. Classify the RoI with a softmax over the resulting (C+1)-dimensional vector.
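A simplified NumPy sketch of the position-sensitive pooling in steps 4-6; this is not the reference R-FCN implementation, and the shapes and names are illustrative.

```python
import numpy as np

def ps_roi_pool_scores(score_maps, roi, k=3, num_classes=2):
    """Position-sensitive pooling: score_maps has k*k*(C+1) channels;
    bin (u, v) of the RoI is averaged only over the score map that
    corresponds to that relative position, separately per class."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / k
    bin_w = (x2 - x1) / k
    C1 = num_classes + 1                       # classes + background
    votes = np.zeros((k, k, C1))
    for u in range(k):                         # bin row ("upper", ...)
        for v in range(k):                     # bin column ("left", ...)
            ys = slice(int(y1 + u * bin_h), int(y1 + (u + 1) * bin_h))
            xs = slice(int(x1 + v * bin_w), int(x1 + (v + 1) * bin_w))
            for c in range(C1):
                ch = (u * k + v) * C1 + c      # the (u, v)-specific map
                votes[u, v, c] = score_maps[ch, ys, xs].mean()
    return votes.mean(axis=(0, 1))             # one score per class

maps = np.random.rand(3 * 3 * 3, 40, 40)       # k^2 * (C+1) maps, C=2
print(ps_roi_pool_scores(maps, roi=(8, 8, 26, 26)))  # 3 class scores
```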
https://nptel.ac.in/courses/106106093/
https://www.youtube.com/watch?v=m-aKj5ovDfg
https://www.youtube.com/watch?v=G4NYQox4n2g
1. Explain the concept of object tracking in computer vision. Discuss different algorithms or
techniques used for object tracking.
2. Describe the process of image recognition using convolutional neural networks (CNNs).
What are the key components and steps involved?
3. Discuss the concept of depth estimation in computer vision. Explain how depth
information can be extracted from 2D images.
4. Explain the concept of image stitching and its applications. How are multiple images
combined to create a panoramic image?
5. Discuss the challenges and approaches for handling scale invariance in object detection
algorithms.
6. Describe the concept of facial recognition in computer vision. Discuss its applications,
advantages, and potential privacy concerns.
7. Explain the concept of semantic segmentation and its applications in computer vision.
Provide examples of scenarios where semantic segmentation is useful.
8. Discuss the concept of object recognition using feature descriptors. Explain popular
feature descriptor algorithms such as SIFT or SURF.
9. Explain the concept of image super-resolution and its applications. How can low-
resolution images be enhanced to improve their quality?
10. Discuss the role of transfer learning in computer vision. How can pre-trained models be
utilized for new tasks or datasets?
MCQs
Question 1: What is computer vision?
A. The study of computers and their components
B. The field of processing and understanding visual data by computers
C. The development of computer software for image editing
D. The study of visual perception in humans
Question 3: Which technique is commonly used for feature extraction in computer vision?
A. Convolutional Neural Networks (CNN)
B. Decision Trees
C. Support Vector Machines (SVM)
D. K-means clustering
1.What is computer vision? Answer: c. Computer vision is a field of artificial intelligence that
focuses on enabling computers to interpret and understand visual information from images or
videos.
2.Name three common applications of computer vision. Answer: a. Autonomous vehicles, b.
Object recognition, c. Medical image analysis
3.What is the purpose of image segmentation in computer vision? Answer: b. Image
segmentation aims to partition an image into meaningful regions or segments to facilitate
object detection, tracking, or analysis.
4.What is the difference between object detection and object recognition? Answer: c. Object
detection involves both localizing and classifying objects within an image, while object
recognition focuses solely on identifying objects without localizing them.
5. Explain the concept of convolution in convolutional neural networks (CNNs). Answer: a. Convolution involves applying a filter/kernel to an input image or feature map, computing element-wise multiplications, and summing the results to produce a feature map.
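To ground this answer, here is a tiny NumPy sketch of the sliding-window multiply-and-sum; note that CNN "convolution" layers actually compute cross-correlation, as this does.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution as described above: slide the kernel over
    the image, multiply element-wise, and sum each window."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1., 0., -1.]] * 3)     # simple vertical-edge kernel
print(conv2d(image, edge))               # 2x2 feature map
```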
MCQs (cont'd)
1.What is optical character recognition (OCR) used for in computer vision? Answer: b. OCR is
used to convert printed or handwritten text from images into machine-readable text, enabling
automated text analysis or data extraction.
2.What is the purpose of non-maximum suppression in object detection algorithms? Answer: a.
Non-maximum suppression is used to eliminate redundant bounding box detections by keeping
only the most confident detection and suppressing overlapping or lower-confidence detections.
3.What are some common challenges faced in computer vision tasks? Answer: c. Variations in
lighting conditions, occlusion, viewpoint changes, and limited labeled data are common
challenges in computer vision tasks.
4.What is the difference between supervised and unsupervised learning in computer vision?
Answer: b. Supervised learning requires labeled training data, where input images are associated
with corresponding ground-truth labels. Unsupervised learning involves learning patterns or
structures from unlabeled data without explicit labels.
Computer vision is a field of study that deals with the extraction of information
from images or videos to understand and interpret visual data. It involves the
development of algorithms and techniques to enable computers to perceive and
understand the visual world in a way similar to humans.