DLCV Day2

1. Bag-of-Words (BoW) Model

The Bag-of-Words model is an approach to image representation that treats an image as an
unordered collection of local features. This method is inspired by the text processing
technique of the same name.

Detailed Process:

1. Feature Extraction:
o Detect and describe features: Use feature detectors like SIFT (Scale-
Invariant Feature Transform), SURF (Speeded-Up Robust Features), ORB
(Oriented FAST and Rotated BRIEF), or others to detect key points and
extract local feature descriptors from images. These descriptors capture
important information about local image patches.
o Descriptors: Each keypoint is described by a vector, which typically has
dimensions corresponding to the algorithm used (e.g., 128 for SIFT).
2. Codebook Generation:
o Clustering: Collect feature descriptors from a set of training images and
cluster them using a clustering algorithm such as k-means. Each cluster center
is considered a "visual word."
o Vocabulary: The set of all cluster centers forms the vocabulary or codebook,
where each cluster center represents a visual word in the dictionary.
3. Image Representation:
o Histogram Construction: For a given image, assign each feature descriptor to
the nearest visual word in the codebook. Construct a histogram of visual
words where each bin corresponds to a visual word and its value represents the
frequency of that word in the image.
4. Matching:
o Distance Metrics: Compare the histograms of different images using distance
metrics like Euclidean distance, Manhattan distance, or chi-squared distance to
find similar images.
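
A minimal sketch of steps 1-4, using ORB descriptors from OpenCV and k-means from
scikit-learn. The vocabulary size, the cast of binary ORB descriptors to float for
Euclidean k-means, and the chi-squared comparison are illustrative choices, not
prescribed by the notes above:

```python
# Bag-of-Words sketch: ORB features -> k-means codebook -> histogram matching.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_descriptors(image_path):
    """Step 1: detect keypoints and compute local descriptors."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create()
    _, desc = orb.detectAndCompute(img, None)
    return desc.astype(np.float32) if desc is not None else None

def build_codebook(descriptor_sets, n_words=100):
    """Step 2: cluster all training descriptors; centers are the visual words."""
    return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(descriptor_sets))

def bow_histogram(descriptors, codebook):
    """Step 3: assign descriptors to nearest words and count frequencies."""
    words = codebook.predict(descriptors)
    hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
    return hist / max(hist.sum(), 1)  # normalize so histograms are comparable

def chi2_distance(h1, h2, eps=1e-10):
    """Step 4: chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```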

Applications:

• Image Classification: Categorize images into predefined classes based on the
frequency of visual words.
• Object Recognition: Identify objects within images by comparing the histograms of
regions of interest with those of known objects.

2. Vector of Locally Aggregated Descriptors (VLAD)

VLAD is an advanced technique that improves upon the Bag-of-Words model by aggregating
local image descriptors in a way that preserves first-order information about their
distribution around each visual word, producing a more discriminative representation.

Detailed Process:

1. Feature Extraction:
o Detect and describe features: Extract local features from images using
methods like SIFT, SURF, or ORB.
2. Codebook Generation:
o Clustering: Use k-means clustering to generate a set of cluster centers (visual
words) from a large set of training feature descriptors.
3. Residual Calculation:
o Assignment: For each feature descriptor in an image, find the nearest cluster
center (visual word).
o Residuals: Compute the residual vector, which is the difference between the
feature descriptor and its assigned cluster center.
4. Aggregation:
o Aggregate Residuals: For each cluster center, sum the residuals of all
descriptors assigned to it.
o Concatenation: Concatenate these summed residuals to form the VLAD
descriptor.
5. Normalization:
o Normalization: Normalize the aggregated vector to ensure that the descriptor
is invariant to the number of features and their scale. Common normalization
methods include L2 normalization and power normalization.
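
A minimal sketch of steps 3-5, assuming a k-means codebook trained as in the BoW
example above; the function name is illustrative, and the power-then-L2 normalization
follows the methods mentioned in step 5:

```python
# VLAD sketch: sum residuals per cluster, concatenate, then normalize.
import numpy as np

def vlad_descriptor(descriptors, codebook):
    """descriptors: (n, d) float array; codebook: a fitted sklearn KMeans."""
    centers = codebook.cluster_centers_            # (k, d) visual words
    assignments = codebook.predict(descriptors)    # nearest word per descriptor
    k, d = centers.shape
    vlad = np.zeros((k, d), dtype=np.float32)
    for i in range(k):
        assigned = descriptors[assignments == i]
        if len(assigned):
            vlad[i] = (assigned - centers[i]).sum(axis=0)  # summed residuals
    vlad = vlad.flatten()                          # concatenation
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # power normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad       # L2 normalization
```

Comparing two images then reduces to a Euclidean or cosine distance between their
k × d-dimensional VLAD vectors.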

Advantages:

• Discriminative Power: VLAD captures more detailed information about the
distribution of features around each visual word.
• Compactness: Produces a compact representation suitable for efficient storage and
comparison.

Applications:

• Image Retrieval: Efficiently retrieve images from a large database by comparing
VLAD descriptors.
• Object Recognition: Identify objects within images using the discriminative power of
VLAD descriptors.

3. RANSAC (Random Sample Consensus)

RANSAC is an iterative method for robustly fitting models to data sets that contain a
significant number of outliers.

Detailed Process:

1. Model Hypothesis:
o Random Sampling: Randomly select a minimal subset of data points needed
to estimate the model parameters.
o Model Estimation: Fit the model to these points.
2. Model Verification:
o Inliers Determination: Determine which data points from the entire set are
consistent with the estimated model within a predefined tolerance.
o Inlier Count: Count the number of inliers.
3. Iteration:
o Repeat: Repeat the process of random sampling and model estimation for a
fixed number of iterations or until a satisfactory model with a high number of
inliers is found.
4. Final Model:
o Best Model: Select the model with the highest number of inliers.
o Refinement: Optionally, refine the model parameters using all inliers to
improve accuracy.
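
A minimal sketch of the loop above, fitting a 2D line y = mx + c to noisy points; the
iteration count and inlier tolerance are illustrative values that in practice depend on
the expected outlier ratio and noise level:

```python
# RANSAC sketch for a 2D line: hypothesize from minimal samples, verify
# against all points, keep the best model, then refine on its inliers.
import numpy as np

def ransac_line(points, n_iters=1000, tolerance=1.0):
    """points: (n, 2) array of (x, y). Returns (m, c) and an inlier mask."""
    rng = np.random.default_rng()
    best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # Step 1: two points are the minimal sample for a line.
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue  # skip verticals in this slope/intercept parameterization
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        # Step 2: inliers are points within the tolerance band of the line.
        residuals = np.abs(points[:, 1] - (m * points[:, 0] + c))
        inliers = residuals < tolerance
        # Step 3: keep the hypothesis with the most support so far.
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = (m, c), inliers
    # Step 4: refine with a least-squares fit over all inliers.
    if best_model is not None:
        m, c = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], 1)
        best_model = (m, c)
    return best_model, best_inliers
```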

Applications:

• Homography Estimation: Estimate the homography matrix for aligning two images.
• Fundamental Matrix Estimation: Estimate the fundamental matrix in stereo vision.
• Object Recognition and Pose Estimation: Fit geometric models to data in the
presence of noise and outliers.

4. Hough Transform

The Hough Transform is a technique for detecting geometric shapes such as lines, circles, and
ellipses in an image by transforming the problem into a parameter space.

Detailed Process:

1. Edge Detection:
o Detect Edges: Use an edge detection algorithm like Canny edge detection to
find edges in the image.
2. Hough Space Transformation:
o Parameter Space: For each edge point, transform it into a parameter space
where each point represents a potential geometric shape passing through the
edge point.
o Lines: For line detection, use the polar representation (ρ, θ) where ρ is the
distance from the origin to the line, and θ is the angle of the normal to the line.
o Circles: For circle detection, use parameters (a, b, r) where (a, b) is the center
of the circle, and r is the radius.
3. Voting:
o Accumulator Array: Use an accumulator array to keep track of votes in
parameter space. Each edge point votes for all possible shapes that could pass
through it.
o Peak Detection: Identify peaks in the accumulator array, which correspond to
the parameters of the detected shapes.
4. Shape Detection:
o Extract Shapes: Extract the parameters of the shapes corresponding to the
peaks in the accumulator array.
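
A minimal line-detection sketch using OpenCV's accumulator-based implementation; the
input filename, Canny thresholds, and vote threshold are illustrative:

```python
# Hough line detection: Canny edges -> (rho, theta) voting -> peak extraction.
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
edges = cv2.Canny(img, 50, 150)                      # step 1: edge detection

# Steps 2-3: each edge point votes in (rho, theta) space; OpenCV returns the
# accumulator peaks that received more than 200 votes.
lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)   # rho step 1 px, theta 1 deg

# Step 4: convert each (rho, theta) peak back to image-space points to draw.
if lines is not None:
    for rho, theta in lines[:, 0]:
        a, b = np.cos(theta), np.sin(theta)
        x0, y0 = a * rho, b * rho                    # closest point on the line
        p1 = (int(x0 - 1000 * b), int(y0 + 1000 * a))
        p2 = (int(x0 + 1000 * b), int(y0 - 1000 * a))
        cv2.line(img, p1, p2, 255, 2)
```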

Applications:

• Line Detection: Detect lines in images, useful in applications like lane detection in
autonomous driving.
• Circle Detection: Detect circles, useful in identifying circular objects like coins or
eyes in images.
• General Shape Detection: Detect other geometric shapes by extending the Hough
Transform to different parameter spaces.

5. Pyramid Matching

Pyramid Matching is a technique for matching sets of features by comparing them at multiple
resolutions, making it robust to variations in scale and translation.

Detailed Process:

1. Feature Extraction:
o Detect and describe features: Extract features from images using methods
like SIFT, SURF, or ORB.
2. Pyramid Construction:
o Multi-Resolution Pyramid: Construct a pyramid of feature representations at
multiple resolutions by progressively down-sampling the image and extracting
features at each level.
3. Matching:
o Coarse-to-Fine Matching: Begin by matching features at the coarsest level of
the pyramid. Use these matches to constrain and refine the matching process at
finer levels.
o Feature Correspondence: Determine correspondences between features at
each level, ensuring that matches are consistent across levels.
4. Scoring:
o Similarity Score: Compute a similarity score based on the number and quality
of matches at each pyramid level.
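
A minimal sketch of multi-resolution matching with ORB; it simplifies the coarse-to-fine
scheme above by matching independently at each level and weighting finer-level matches
more heavily, and the level count and distance threshold are illustrative:

```python
# Pyramid matching sketch: down-sample, match ORB features per level,
# and accumulate a weighted similarity score.
import cv2

def build_pyramid(img, levels=3):
    """Step 2: progressively down-sample to build a resolution pyramid."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid

def pyramid_match_score(img_a, img_b, levels=3, max_dist=40):
    """Steps 3-4: match per level; level 0 (finest) is weighted highest."""
    orb = cv2.ORB_create()
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    score = 0.0
    for level, (a, b) in enumerate(zip(build_pyramid(img_a, levels),
                                       build_pyramid(img_b, levels))):
        _, da = orb.detectAndCompute(a, None)
        _, db = orb.detectAndCompute(b, None)
        if da is None or db is None:
            continue
        matches = [m for m in bf.match(da, db) if m.distance < max_dist]
        score += len(matches) / (2 ** level)  # halve the weight per level
    return score
```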

Advantages:

• Robustness: Handles variations in scale and translation effectively.
• Efficiency: Coarse-to-fine matching reduces computational complexity.

Applications:

• Object Recognition: Recognize objects in images by matching features across
multiple scales.
• Image Retrieval: Retrieve similar images by comparing pyramid representations.

6. Optical Flow

Optical Flow refers to the apparent motion of objects in a visual scene caused by the relative
motion between the observer and the scene. It is used to estimate motion vectors for each
pixel in a sequence of images.

Detailed Process:

1. Feature Detection:
o Interest Points: Detect points of interest (keypoints) in the image using
methods like Harris corner detection or FAST.
2. Flow Computation:
o Local Methods: Use local methods like the Lucas-Kanade method, which
assumes small motion and uses a local window to estimate the motion vectors.
o Global Methods: Use global methods like the Horn-Schunck method, which
assumes smoothness of the flow field and uses global optimization to estimate
the motion vectors.
3. Flow Field Representation:
o Velocity Vectors: Represent the motion as a flow field, where each pixel in
the image is associated with a velocity vector indicating its movement
between frames.
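
A minimal sparse-flow sketch with OpenCV's pyramidal Lucas-Kanade implementation; the
frame filenames and detector parameters are illustrative, with Harris corners standing
in for the interest-point step:

```python
# Sparse optical flow: detect corners in frame 1, track them into frame 2.
import cv2

prev = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
curr = cv2.imread("frame2.jpg", cv2.IMREAD_GRAYSCALE)

# Step 1: Harris corners as the interest points.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01,
                             minDistance=7, useHarrisDetector=True)

# Step 2: Lucas-Kanade estimates where each point moved between frames.
p1, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)

# Step 3: keep successfully tracked points; differences are velocity vectors.
good_old = p0[status.flatten() == 1]
good_new = p1[status.flatten() == 1]
velocities = good_new - good_old   # per-point (dx, dy) motion
```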

Algorithms:

• Lucas-Kanade Method:
o Assumes small displacements.
o Solves for the optical flow by minimizing the error in a local neighborhood.
• Horn-Schunck Method:
o Assumes smoothness of the flow field.
o Uses a global optimization approach to minimize both the error in image
intensity and the smoothness constraint.
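
For the dense, per-pixel flow field described above, a common stand-in is OpenCV's
Farneback algorithm; note this is a polynomial-expansion method, not Horn-Schunck
itself, and the parameter values are illustrative defaults (reusing the prev and curr
frames from the sketch above):

```python
# Dense optical flow: a (dx, dy) velocity vector for every pixel.
import cv2

flow = cv2.calcOpticalFlowFarneback(prev, curr, None, pyr_scale=0.5,
                                    levels=3, winsize=15, iterations=3,
                                    poly_n=5, poly_sigma=1.2, flags=0)
# flow has shape (H, W, 2); flow[y, x] is the motion of pixel (x, y).
```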

Applications:

• Motion Tracking: Track moving objects in video sequences.
• Video Stabilization: Stabilize shaky videos by compensating for camera motion.
• Object Detection and Segmentation: Segment moving objects in videos based on
their motion.

These techniques form the foundation of many computer vision applications and are often
used in combination to achieve robust and accurate results.
