Computer Vision Assignment


Assignment-4

Sayed Aman Konen


Redg.: 1801227460
Q1. Use bag of words for category recognition. Explain image signatures in this context.
Ans: The bag-of-words (BoW) methodology was first proposed in the text retrieval domain for text document analysis, and it was later adapted for computer vision applications [24]. For image analysis, the BoW model uses a visual analogue of a word, obtained by vector quantization: low-level visual features of local regions or points, such as color, texture, and so forth, are clustered into a visual vocabulary.

Extracting the BoW feature from an image involves the following steps:

(i) automatically detect regions/points of interest,

(ii) compute local descriptors over those regions/points,

(iii) quantize the descriptors into words to form the visual vocabulary, and

(iv) find the occurrences in the image of each specific word in the vocabulary to construct the BoW feature (a histogram of word frequencies). Figure 1 describes these four steps to extract the BoW feature from images.

The BoW model can be defined as follows. Given a training dataset D containing n images, represented by D = {d1, d2, ..., dn}, where di is the set of visual features extracted from image i, an unsupervised learning algorithm such as k-means is used to group D into a fixed number of visual words (or categories) W = {w1, w2, ..., wV}, where V is the number of clusters. We can then summarize the data in a V x n co-occurrence table of counts N(wi, dj), where N(wi, dj) denotes how often the word wi occurs in image dj.
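As a rough illustration of these four steps, here is a minimal sketch (not a definitive implementation), assuming the opencv-python and scikit-learn packages are available, that image_paths is a hypothetical list of training image files, and that every image yields at least one keypoint:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(image_paths, n_words=100):
    """Build a BoW histogram per image: detect keypoints, describe them,
    quantize the descriptors into visual words, then count word occurrences."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)   # steps (i) and (ii)
        per_image.append(desc)

    # Step (iii): cluster all descriptors into a visual vocabulary
    vocab = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(per_image))

    # Step (iv): histogram of visual-word frequencies for each image
    histograms = []
    for desc in per_image:
        words = vocab.predict(desc)
        hist, _ = np.histogram(words, bins=np.arange(n_words + 1))
        histograms.append(hist)
    return np.array(histograms)
```

The resulting histograms are the image signatures in this context: each image is summarized by how often each visual word occurs, and these fixed-length vectors can be fed to any standard classifier for category recognition.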

Q2. Explain different techniques for face recognition. Explain active appearance and 3D shape
models.

Ans: Face detection techniques

Viola-Jones Algorithm


It is an efficient algorithm for face detection.

The developers of this algorithm showed faces being detected in real time on a webcam feed.

It was the most stunning demonstration of computer vision and its potential at the time.
Soon it was implemented in OpenCV, and face detection became synonymous with the Viola-Jones algorithm.
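A minimal sketch of running a Viola-Jones detector through OpenCV's Haar cascade interface (group_photo.jpg is a hypothetical input file assumed to exist):

```python
import cv2

# Load the pre-trained frontal-face Haar cascade that ships with OpenCV
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group_photo.jpg")            # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Multi-scale sliding-window detection, as in the Viola-Jones framework
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces_detected.jpg", img)
```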

Histogram Of Oriented Gradients

BASIC IDEA :

For an image I, analyze each pixel P(i) of the image for the relative darkness of the pixels directly surrounding it.

Then add an arrow pointing in the direction of the flow of darkness relative to P(i).

This process of assigning an oriented gradient to a pixel P(i) by analyzing its surrounding pixels is performed for every pixel in the image.

Assuming HOG(I) is a function that takes an image I as input, what it does is replace every pixel with an arrow. Arrows = gradients.

Gradients show the flow from light to dark across the entire image.

Complex features like eyes may give too many gradients, so we need to aggregate the whole HOG(I) in order to make a 'global representation'. We break the image up into squares of 16 x 16 pixels and assign an aggregate gradient G′ to each square, where the aggregation function could be max(), min(), etc.
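In practice, the aggregation per cell is usually a histogram of gradient orientations rather than max() or min(). A minimal sketch using scikit-image's hog function, with a bundled sample image standing in for a face photo (scikit-image assumed to be installed):

```python
from skimage import color, data
from skimage.feature import hog

# Sample RGB image bundled with scikit-image stands in for a face photo
image = color.rgb2gray(data.astronaut())

features, hog_image = hog(
    image,
    orientations=9,             # number of gradient-direction bins per cell
    pixels_per_cell=(16, 16),   # the 16 x 16 squares described above
    cells_per_block=(2, 2),     # local normalisation blocks
    visualize=True)             # also return an image of the oriented gradients
print(features.shape)           # one flat feature vector for the whole image
```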

R-CNN
BASIC IDEA :

R-CNN creates bounding boxes, or regions, using selective search.

Selective search looks at the image through windows of different sizes, and for each size it tries to group together adjacent pixels by texture, color, or intensity to identify objects.

Generate a set of regions for bounding boxes.

Run the images in the bounding boxes through a pre-trained neural network and finally an SVM to see what object the image in the box is.

Run the box through a linear regression model to output tighter coordinates for the box once the object has been classified.
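A minimal sketch of the proposal stage using OpenCV's selective search (this needs the opencv-contrib-python package for the ximgproc module; street.jpg is a hypothetical input, and the later CNN + SVM + box-regression stages are only indicated in comments):

```python
import cv2

img = cv2.imread("street.jpg")                   # hypothetical input image

# Selective search groups adjacent pixels by colour/texture at several scales
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
rects = ss.process()                             # (x, y, w, h) proposal boxes
print(f"{len(rects)} region proposals")

# In R-CNN, each proposal would then be cropped, resized and passed through a
# pre-trained CNN + SVM classifier, followed by bounding-box regression.
for (x, y, w, h) in rects[:100]:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imwrite("proposals.jpg", img)
```
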
Holistic Matching
In this approach, the complete face region is taken as input to the face recognition system. The best-known examples of holistic methods are Eigenfaces (based on PCA), Linear Discriminant Analysis, and Independent Component Analysis.

Let's see the steps of the Eigenfaces method:


This approach treats face recognition as a two-dimensional recognition problem. A set of images is inserted into a database; these images are called the training set because they will be used when we compare images and create the eigenfaces.

Eigenfaces are made by extracting characteristic features from the faces. The input images are normalized to line up the eyes and mouths. Then they are resized so that they have the same size. Eigenfaces can now be extracted from the image data by using a mathematical tool called PCA.

Now each image is represented as a vector of weights. The system is ready to accept queries. The weights of the incoming unknown image are found and then compared to the weights of the images already present in the system.

If the input image's distance to the stored faces is over a given threshold, it is considered unidentified. Otherwise, the identification of the input image is done by finding the image in the database whose weights are closest to the weights of the input image.

The image in the database with the closest weights will be returned as a hit to the user.
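A minimal numpy sketch of the eigenface computation described above (the function name is made up; the training data here is random stand-in data, whereas real use assumes normalized, aligned and equally sized face crops):

```python
import numpy as np

def compute_eigenfaces(training_images, k=20):
    """PCA on a training set of aligned, equally sized face images.

    training_images : (n, h*w) array, each row a flattened face image.
    Returns the mean face, the top-k eigenfaces and the per-image weights.
    """
    mean_face = training_images.mean(axis=0)
    centered = training_images - mean_face
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = Vt[:k]                 # directions with the largest eigenvalues
    weights = centered @ eigenfaces.T   # each image as a vector of k weights
    return mean_face, eigenfaces, weights

# Usage with random stand-in data
faces = np.random.default_rng(0).normal(size=(50, 64 * 64))
mean_face, eigenfaces, weights = compute_eigenfaces(faces)
print(eigenfaces.shape, weights.shape)   # (20, 4096) (50, 20)
```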

Feature-based
Here local features such as the eyes, nose and mouth are extracted first, and their locations, geometry and appearance are fed into a structural classifier. A challenge for feature extraction methods is feature "restoration": this is when the system tries to retrieve features that are invisible due to large variations, e.g. head pose when matching a frontal image with a profile image.

Different extraction methods:

Generic methods based on edges, lines, and curves

Feature-template-based methods
Structural matching methods

Model Based
The model-based approach tries to model a face. The new sample is introduced to the model, and the parameters of the model are used to recognise the image. Model-based methods can be classified as 2D or 3D. A well-known 2D example is the Active Appearance Model (AAM), which jointly models the shape and texture of a face, while 3D shape (morphable) models fit a deformable 3D face model to the image.

Hybrid Methods
This uses a combination of both holistic and feature extraction methods. Generally, 3D images are used in these methods. The image of a face is captured in 3D to note the curves of the eye sockets, or the shapes of the chin or forehead. Even a face in profile would serve, because the system uses depth and an axis of measurement, which gives it enough information to construct a full face. The 3D system includes Detection, Position, Measurement, Representation and Matching.

Detection - Capturing a face by scanning a photograph or photographing a person's face in real time.

Position - Determining the location, size and angle of the head.

Measurement - Assigning measurements to each curve of the face to make a template.

Representation - Converting the template into a numerical representation of the face.

Matching - Comparing the received data with faces in the database. The 3D image which is to be compared with an existing 3D image needs to have no alterations.
Recent research works are based on the hybrid approach.

Q3. Explain object detection. Give brief about any real application of object detection.

Ans: Object detection is a computer vision technique that works to identify and locate objects within an
image or video. Specifically, object detection draws bounding boxes around these detected objects, which
allow us to locate where said objects are in (or how they move through) a given scene.

Object detection is commonly confused with image recognition, so before we proceed, it’s important that
we clarify the distinctions between them.

Image recognition assigns a label to an image. A picture of a dog receives the label "dog". A picture of two dogs still receives the single label "dog". Object detection, on the other hand, draws a box around each dog and labels each box "dog". The model predicts where each object is and what label should be applied. In that way, object detection provides more information about an image than recognition.

Here’s an example of how this distinction looks in practice:

Tracking objects
In the field of security and surveillance, object detection plays an even more important role. With object tracking it is easier to follow a person through a video. Object tracking can also be used to track the motion of a ball during a match. In the field of traffic monitoring, too, object tracking plays a crucial role.

Counting the crowd


Crowd counting, or people counting, is another significant application of object detection. During a big festival, or in a crowded mall, this application comes in handy as it helps in analysing the crowd and measuring different groups.

Self-driving cars
Another notable application of the object detection technique is self-driving cars. A self-driving car can only navigate through a street safely if it can detect all the objects on the road, such as people, other cars and road signs, in order to decide what action to take.

Detecting a vehicle
In a road full of speeding vehicles, object detection can help in a big way by tracking a particular vehicle and even its number plate. So, if a car gets into an accident or breaks traffic rules, it is easier to identify that particular car using an object detection model, thereby decreasing the rate of crime while enhancing security.

Detecting anomaly
Another useful application of object detection is spotting anomalies, and it has industry-specific uses. For instance, in the field of agriculture object detection helps in identifying infected crops and thereby helps farmers take measures accordingly. It can also help identify skin problems in healthcare. In the manufacturing industry, the object detection technique can help in detecting problematic parts very fast and thereby allow the company to take the right steps.

Q4. Explain instance recognition. How large database help in this.

Ans: Instance Level Recognition (ILR) is a visual recognition task that recognizes a specific instance of an object, not just the object class. For example, painting is an object class, and "Mona Lisa" by Leonardo da Vinci is an instance of that class. Similarly, the Taj Mahal, India, is an instance of the object class building. Typical ILR tasks include:

 Landmark Recognition: Recognize landmarks in images.

 Landmark Retrieval: Retrieve relevant landmark images from a large-scale database.

 Artwork Recognition: Recognize artworks in images.

 Product Retrieval: Retrieve relevant product images from a large-scale database.

Instance-level recognition will unravel the true potential of deep learning technologies for semantic image classification/retrieval in eCommerce, travel, media & entertainment, agriculture, etc. Some of the major building blocks of an efficient instance-level solution are:
 Backbone Network Selection (Residual, Squeeze & Excitation, EfficientNet)

 Data Augmentation (Albumentation, AutoAugment, Cutout, etc).

 Loss function (ArcFace, AccuracyLoss).

 Multi-scale processing.

 Fine-tuning and post-processing.
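As an illustration of one of these building blocks, here is a minimal numpy sketch of the ArcFace idea (an additive angular margin applied to the target-class angle before scaling); the function name, shapes and hyper-parameter values are just examples, not a reference implementation:

```python
import numpy as np

def arcface_logits(embedding, class_weights, target, s=64.0, m=0.5):
    """Toy ArcFace logit computation for a single sample (illustrative only).

    embedding     : (d,) feature vector from the backbone network
    class_weights : (num_classes, d) one prototype vector per class
    target        : index of the ground-truth class
    s, m          : scale and additive angular margin hyper-parameters
    """
    # L2-normalise features and class prototypes so dot products are cosines
    x = embedding / np.linalg.norm(embedding)
    W = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)

    cos_theta = np.clip(W @ x, -1.0, 1.0)   # cosine similarity to every class
    theta = np.arccos(cos_theta)

    # Add the angular margin m only to the ground-truth class, then rescale
    theta[target] += m
    return s * np.cos(theta)

# Usage: these logits would feed a standard softmax cross-entropy loss
rng = np.random.default_rng(0)
logits = arcface_logits(rng.normal(size=128), rng.normal(size=(10, 128)), target=3)
print(logits.shape)   # (10,)
```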

Q5. Use eigen space for face recognition. Explain different terminology.

Ans: In order to make things easier and somewhat faster, we need to reduce the dimension of these images while still retaining as much information as possible. This is where eigenvectors and eigenvalues come to the rescue; from this point on, things become easier. When the dimensions of the matrices (face images) have been reduced, each resulting eigenvector is known as an eigenface, because when displayed it produces a ghostly face. Another advantage is that this method extracts the important features of the face, which may not be directly linked to the facial features we intuitively think of, such as eyes, noses, lips, etc. It is also worth noting that the number of eigenvectors is equal to the number of matrices (face images) present, so we pick the vectors whose eigenvalues are highest.

Now, these N′-dimensional vectors (eigenvectors) span an N′-dimensional subspace called a "face space", which represents all the possible eigenfaces. To understand the concept of a face space, let's imagine we have a smart magical board that groups objects together based on their color or some specified feature. Suppose we have 3 sacks of balls, where each sack is filled with balls of the same color; the colors we have are blue, red, and green.

When we throw these balls on the board, the board clusters all the balls with the same color together. Now let's say some balls, irrespective of color, represent the letter "A" while some do not. Here the dynamics change: when the board clusters, we would have a region on the board with a combination of all the colors that represent the letter A together, while the balls that do not represent anything are scattered about the space. Now, let's say the balls that represent A are the eigenvectors; in other words, if A represents faces, we say they are eigenfaces, and the region on the board that has the combination of balls together is called the face space. This is somewhat expected: since faces have a similar structure, they would most likely not appear randomly in a vector space.

The next thing would be to classify a new face, that is, "is this a face or not?" What we do is reduce the dimension of this image, obtain the new vector, and project it onto this face space, i.e. get a new ball, make it smaller and throw it on the board; since the board is magical, it would know where to place the new ball. Mathematically speaking, we find the Euclidean distance between the new vector and the face space. If the minimum distance is within a specified threshold of the face space, we can say this is a face; the farther away it is, the more confident we are in saying it is not a face.

Assuming we have different faces, i.e. the balls represent face images of people: blue — faces of Dave, red — faces of Kene, green — faces of Sam. On the magical board we are going to have a face space containing three different regions of eigenfaces. When we get a new ball (a new face), we make it smaller (get the eigenvector, i.e. the eigenface) and then project it onto the face space. As the board did earlier, it will find the Euclidean distance; if the ball is closest to or in the region of the "green eigenfaces", we say the new image is probably or most likely a face image of Sam.

That’s it, basically we are just applying eigenvectors and eigenvalues in the field of face
recognition.
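A minimal numpy sketch of this decision procedure, assuming mean_face, eigenfaces and the per-image weights have already been computed by a PCA step such as the compute_eigenfaces() sketch in Q2 (the function name and threshold parameters are hypothetical):

```python
import numpy as np

def classify_face(image, mean_face, eigenfaces, weights, labels,
                  face_threshold, id_threshold):
    """Project a flattened image onto the face space and classify it."""
    w = (image - mean_face) @ eigenfaces.T           # projection onto face space
    reconstruction = mean_face + w @ eigenfaces
    if np.linalg.norm(image - reconstruction) > face_threshold:
        return "not a face"                          # too far from the face space

    distances = np.linalg.norm(weights - w, axis=1)  # Euclidean distance to each known face
    nearest = int(np.argmin(distances))
    if distances[nearest] > id_threshold:
        return "unknown face"                        # a face, but not in the database
    return labels[nearest]                           # e.g. "Sam"
```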

Q6. Illustrate surface interpolation in 3D reconstruction. Explain radial basis function in this
context.

Ans: In solving a nonlinearly separable pattern classification problem, there is generally a practical benefit in mapping the input space into a new space of sufficiently high dimension. This is an important point that comes from Cover's theorem on the separability of patterns. Let us consider a feedforward network with an input layer, a single hidden layer, and an output layer having a single unit. The network can be designed to perform a nonlinear mapping from the input space to the hidden space, and a linear mapping from the hidden space to the output space. The network represents a map from the p-dimensional input space to the one-dimensional output space, expressed as s : R^p → R^1. The theory of multivariable interpolation in high-dimensional space has a long history, starting with Davis. The interpolation problem, in its strict sense, can be stated as follows: given N distinct points xi in R^p and N corresponding real values di, find a surface F : R^p → R that satisfies the interpolation condition F(xi) = di for i = 1, ..., N. In radial basis function (RBF) interpolation, F is taken as a linear combination of basis functions centred at the data points, F(x) = Σi wi φ(||x − xi||), which is how scattered 3D surface samples can be interpolated into a continuous surface.
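A minimal sketch of RBF surface interpolation using SciPy's RBFInterpolator (SciPy 1.7 or newer assumed; the scattered depth samples here are synthetic stand-ins for points from a sparse 3D reconstruction):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Scattered depth samples (x, y) -> z
rng = np.random.default_rng(42)
xy = rng.uniform(-1.0, 1.0, size=(200, 2))          # sample locations
z = np.exp(-4.0 * (xy ** 2).sum(axis=1))            # synthetic surface heights

# Fit a radial basis function interpolant (thin-plate spline kernel)
rbf = RBFInterpolator(xy, z, kernel="thin_plate_spline")

# Evaluate the interpolated surface on a regular grid
gx, gy = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
surface = rbf(grid).reshape(gx.shape)                # dense height map
print(surface.shape)  # (50, 50)
```
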
Q7. Explain shape from shading in 3D reconstruction. What are the constraints required?

Ans: To obtain an accurate 3D reconstruction of a face, an object that has very smooth regions and subtle shading, traditional stereo or structure-from-motion algorithms may not be effective. Shape from shading (SFS) is an effective approach complementary to stereo and SfM. The constraints typically required are a Lambertian reflectance model, a known (usually single, distant) light-source direction, known or constant albedo, and smoothness/integrability of the recovered surface.
In [135], PCA was suggested as a tool for solving the parametric SFS problem. An eigenhead approximation of a 3D head was obtained after training on about 300 laser-scanned range images of real human heads. The ill-posed SFS problem is thereby transformed into a parametric problem, but constant albedo is still assumed. This assumption does not hold for most real face images, and it is one of the reasons why most SFS algorithms fail on real face images. The authors of [159] proposed using a varying-albedo reflectance model (Equation 9), where the albedo ρ is also a function of pixel location. They first proposed a symmetric SFS scheme based on the use of the self-ratio image rI. Unlike existing SFS algorithms, the symmetric SFS method theoretically allows pointwise shape information, the gradients (p, q), to be uniquely recovered from a single 2D image.
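Shape from shading inverts an image-formation model of this kind. Below is a minimal numpy sketch of the forward varying-albedo Lambertian model (the function name and all inputs are made up for illustration, not the exact Equation 9 of the cited work):

```python
import numpy as np

def lambertian_image(normals, albedo, light):
    """Render intensities I(x, y) = albedo(x, y) * max(0, n(x, y) . s).

    normals : (h, w, 3) unit surface normals n(x, y)
    albedo  : (h, w) per-pixel albedo rho(x, y) (varying-albedo model)
    light   : (3,) unit vector s pointing towards the light source
    """
    shading = np.clip(normals @ light, 0.0, None)   # Lambert's cosine law
    return albedo * shading

# Toy example: a fronto-parallel plane lit slightly from one side
h, w = 4, 4
normals = np.zeros((h, w, 3)); normals[..., 2] = 1.0
albedo = np.full((h, w), 0.8)
light = np.array([0.3, 0.0, 1.0]); light /= np.linalg.norm(light)
print(lambertian_image(normals, albedo, light))
```
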
Q8. Explain volumetric representation for 3D reconstruction.
Ans: Volumetric methods were developed as a low-cost and versatile alternative to the above-mentioned 3D reconstruction approaches. Since volumetric (or voxel-based) methods work in the 3D object space and do not require a matching process between the images used in the reconstruction, they are well suited for objects with smooth surfaces or objects that suffer from occlusion problems.
The first volumetric methods proposed were called shape-from-silhouettes or shape-from-contours (e.g. [6], [7]), and were based on the visual-hull concept. They combine the object's silhouette images with the cameras' calibration information to build a set of visual rays in the scene space for all silhouette points. These visual rays define generalized cones within which the object to be reconstructed is guaranteed to lie. The intersection of these cones is commonly referred to as the visual hull (Figure 1). The major drawback of these silhouette-based methods is that they cannot deal with concavities on the object. More recent volumetric methods use color consistency to verify whether a certain voxel belongs to the object's surface (Figure 2). Thus, they are not only able to reconstruct objects with complex geometry, but they also generate colour models without the need for an extra colouring step. The first method to use a color-consistency measure was Voxel Coloring. Other well-known methods are Space Carving, Generalized Voxel Coloring, Roxels, etc. The accuracy of the reconstructions built using volumetric methods depends on the number of images used, the position of each associated viewpoint, the accuracy of the camera calibration procedure, and the complexity of the object's shape.
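A minimal numpy sketch of the shape-from-silhouettes idea, carving away voxels that project outside any silhouette (the function name and all inputs are assumed to be supplied by the caller; this is illustrative only, not an optimized implementation):

```python
import numpy as np

def carve_visual_hull(voxels, silhouettes, projections):
    """Keep only voxels inside every silhouette (an approximate visual hull).

    voxels      : (N, 3) voxel centres in world coordinates
    silhouettes : list of (h, w) boolean masks, True where the object is seen
    projections : list of (3, 4) camera projection matrices, one per silhouette
    Returns a boolean array marking the voxels that survive carving.
    """
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])    # (N, 4)
    inside = np.ones(len(voxels), dtype=bool)
    for mask, P in zip(silhouettes, projections):
        uvw = homog @ P.T                                     # project to image
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        h, w = mask.shape
        visible = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[visible] = mask[v[visible], u[visible]]
        inside &= hit            # carve away voxels outside this silhouette
    return inside
```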

Q9. Explain range data merging in brief.


Ans: When combining data from multiple sources, there are often many issues to correct for. Different sources will often have different naming conventions than your main source, different ways of grouping data, etc. Most of the time the additional data source was created at a very different point in time, by different engineers and stakeholders, and (almost) always with different goals and use-cases. In light of this, it should not be surprising to encounter a wide array of differences between multiple sources.

Here I would like to explore various ways of simplifying (hopefully) the merging process in a way that delivers concrete value to downstream users. There are many use cases where this could be of value: for example, you have two systems that operate in parallel to each other and you need to perform some analysis of their relationship, or you have a legacy system with poorly formatted data that needs to be integrated into a crisp new system, etc. The example I would like to dive into is the analysis of parallel systems.

Q10.Explain model based reconstruction of 3D model of head and faces.


Ans: 3D human face model reconstruction is essential to the generation of facial animations, which are widely used in the field of virtual reality (VR). The main issues of image-based 3D facial model reconstruction by vision technologies are twofold: one is to select and match the corresponding features of the face from two images with minimal interaction, and the other is to generate a realistic-looking human face model. One such algorithm for realistic-looking face reconstruction is based on stereo vision. Firstly, a pattern is printed and attached to a planar surface for camera calibration, and corner generation and corner matching between the two images are performed by integrating a modified pyramid Lucas-Kanade (PLK) algorithm and a local adjustment algorithm; then the 3D coordinates of the corners are obtained by 3D reconstruction. An individual face model is generated by deformation of a general 3D model and interpolation of the features. Finally, a realistic-looking human face model is obtained after texture mapping and eye modeling. Some application examples in the field of VR are also given, and experimental results show that the algorithm is robust and the 3D model is photo-realistic.
