Multimedia Retrieval: Jimmy Lin, University of Maryland
• images
• video
• speech
2
A Picture…
3
… is composed of pixels
4
This is nothing new!
5
The Semantic Gap
6
The Semantic Gap
• Image-level / content descriptors: SKY, MOUNTAINS, TREES
• Semantic content (this is what we want): "Photo of Yosemite valley showing El Capitan and Glacier Point, with the Half Dome in the distance"
8
Sub-image matching
10
Example 1: Result
• Best matching image, with the sub-image identified
NB: the query is before restoration work; the target is a restored image. The query and target images also differ in resolution.
11
Example 2: Query
12
Example 2: Result
Best match found, with the sub-image identified
13
…Subsequent Best Matches
Retrieved results are ordered from top-left to bottom-right.
14
The IR Black Box
• Query → representation function
• Multimedia objects (documents) → representation function → index
• Comparison function matches the query representation against the index, producing hits
15
Recipe for Multimedia Retrieval
• Extract features
– Low-level features: blobs, textures, color histograms
– Textual annotations: captions, ASR, video OCR, human labels
• Match features
– From "bag of words" to "bag of features"
16
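The "bag of features" matching step can be sketched with one of the low-level features named above, a color histogram, matched by histogram intersection (the Swain & Ballard measure cited later in the deck). The pixel data and bin count below are illustrative, not from any system in the slides.

```python
# Sketch: matching a "bag of features" via coarse color histograms.
from collections import Counter

def color_histogram(pixels, bins=4):
    """Quantize each (r, g, b) pixel into a coarse bin and count occurrences."""
    step = 256 // bins
    hist = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(hist.values())
    return {k: v / total for k, v in hist.items()}  # normalize to sum to 1

def histogram_intersection(h1, h2):
    """Histogram intersection [Swain & Ballard]: 1.0 = identical distributions."""
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in set(h1) | set(h2))

# Illustrative data: a mostly-red query vs. a slightly different mostly-red target.
query = color_histogram([(255, 0, 0)] * 90 + [(0, 255, 0)] * 10)
target = color_histogram([(250, 5, 5)] * 80 + [(0, 250, 0)] * 20)
print(histogram_intersection(query, target))  # 0.9
```

Coarse quantization is what makes the feature robust to small color shifts: the query pixel (255, 0, 0) and the target pixel (250, 5, 5) fall into the same bin.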
Visual Features ...
Texture
Color
Shape
17
Combination of Evidence
18
New Algorithm for Similarity-Based Retrieval of Images
• Images in the database are stored as JPEG-
compressed images
19
Histogram of DC Coefficients for the Image "Elephant"
20
Comparison of Histograms of DC Coefficients
21
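The slides do not spell out the algorithm, but the idea can be sketched: in a JPEG image, the DC coefficient of each 8×8 DCT block is (up to scaling) the block's average intensity, so a histogram of DC coefficients is a cheap image signature that needs no full decompression. The block-average stand-in, bin count, and L1 distance below are assumptions for illustration.

```python
# Sketch of the DC-histogram signature; block averages stand in for the
# DC coefficients that would be read directly from a JPEG bitstream.
def dc_coefficients(image, block=8):
    """Average intensity per 8x8 block of a grayscale image (list of rows)."""
    h, w = len(image), len(image[0])
    coeffs = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            s = sum(image[y + dy][x + dx] for dy in range(block) for dx in range(block))
            coeffs.append(s // (block * block))
    return coeffs

def dc_histogram(coeffs, bins=16):
    """Normalized histogram of DC values (0-255 range)."""
    hist = [0] * bins
    for c in coeffs:
        hist[min(c * bins // 256, bins - 1)] += 1
    n = len(coeffs)
    return [v / n for v in hist]

def l1_distance(h1, h2):
    """Sum of absolute bin differences; 0 = identical, 2 = disjoint."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

dark = [[10] * 16 for _ in range(16)]     # uniform dark 16x16 image
bright = [[200] * 16 for _ in range(16)]  # uniform bright 16x16 image
h_dark = dc_histogram(dc_coefficients(dark))
h_bright = dc_histogram(dc_coefficients(bright))
print(l1_distance(h_dark, h_bright))  # 2.0: the histograms do not overlap
```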
Example of Similarity-Based Retrieval Using the DC Histograms
22
Similarity-Based Retrieval of Compressed Video
23
DC Histogram Technique Applied to Video Partitioning
[Plot: NSD × 100 (%) vs. frame number, frames 1–24]
24
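The partitioning idea behind the plot can be sketched as thresholding the normalized histogram difference (NSD) between consecutive frames: a spike in NSD suggests a shot boundary. The 20% threshold and the frame data below are illustrative assumptions, not values from the slides.

```python
# Sketch: video partitioning by thresholding the normalized difference
# between consecutive frames' DC histograms.
def nsd(h1, h2):
    """Normalized sum of absolute differences between two normalized
    histograms: 0 = identical, 1 = completely disjoint."""
    return sum(abs(a - b) for a, b in zip(h1, h2)) / 2.0

def shot_boundaries(frame_histograms, threshold=0.20):
    """Indices where a new shot begins (NSD to the previous frame exceeds
    the threshold). Threshold is an illustrative choice."""
    return [i for i in range(1, len(frame_histograms))
            if nsd(frame_histograms[i - 1], frame_histograms[i]) > threshold]

# Three near-identical frames, then an abrupt change of histogram:
shot_a = [0.7, 0.2, 0.1, 0.0]
shot_b = [0.1, 0.1, 0.1, 0.7]
frames = [shot_a, shot_a, shot_a, shot_b, shot_b]
print(shot_boundaries(frames))  # [3]: a cut is detected at frame 3
```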
Example of Similarity-Based Retrieval of Key Frames Using DC Histograms
25
outline
• images
• video
• speech
26
Images and Video
• A digital image = a collection of pixels
– Each pixel has a “color”
• Different types of pixels
– Binary (1 bit): black/white
– Grayscale (8 bits)
– Color (3 colors, 8 bits each): red, green, blue
• A video is simply lots of images in rapid
sequence
– Each image is called a frame
– Smooth motion requires about 24 frames/sec
• Compression is the key!
27
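A quick back-of-the-envelope calculation shows why compression is the key; the frame size is an assumed example, with the color depth and frame rate from the slide above.

```python
# Raw bitrate of uncompressed color video (frame size is an assumed example).
width, height = 640, 480   # pixels per frame (illustrative resolution)
bytes_per_pixel = 3        # 8 bits each for red, green, blue
fps = 24                   # frames/sec needed for smooth motion

bytes_per_second = width * height * bytes_per_pixel * fps
print(bytes_per_second / 1e6, "MB/s")            # ~22 MB/s uncompressed
print(bytes_per_second * 3600 / 1e9, "GB/hour")  # ~80 GB per hour
```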
The Structure of Video
Video
Scenes
Shots
Frames
28
TREC For Video Retrieval?
• TREC Video Track (TRECVID)
– Started in 2001
– Goal is to investigate content-based retrieval
from digital video
– Focus on the shot as the unit of information
retrieval
(why?)
http://www-nlpir.nist.gov/projects/trecvid/
29
Searching Performance
31
Market Trends (2005)
– How many of you are aware of video on the Web?
• A large portion of online users
– How many have viewed a video on the Web in the last 3 months? (50%) The last 6 months? Ever?
– Would you watch video on your devices (iPod/wireless)?
• 1M downloads in 20 days (iPod)
– How many of you have produced video (personal or otherwise) recently?
• Continuing to skyrocket with digital camera phones/devices
– How many of you have shared that with your friends/community? Would you have liked to?
• Huge interest & adoption in viral communities
32
Market Trends
– Technology is more media friendly
• Storage costs plummeting (GB → TB)
• CPU speed continuing to double (Moore's law)
• Increased bandwidth
• Device support for media
• Adding media to sites drives traffic
• Web continues to propel scalable infrastructure for media products/communities
33
Video Search
[Diagram contrasting structured + unstructured data with content features + video production meta-content, against unstructured data with video consumption meta-content / community]
Video Search
35
Video Search
Active research area.
Graph from "A new perspective on Visual Information Retrieval", Horst Eidenberger, 2004.
Black: "image retrieval"; grey: "video retrieval". Source: IEEE Digital Library.
36
Video Search
Popular features/techniques:
– Color, shape, and texture descriptors
– OCR, ASR
– A number of prototype or research products with small data sets
– More researched for visual queries
37
Video Search: Features
38
Video Search: Features
Color
• Robust to background; independent of size and orientation
• Color histogram [Swain & Ballard]
• Histograms are "sensitive to noise and sparse"; cumulative histograms [Stricker & Orengo] address this
• Color moments
• Color sets: map RGB color space to hue/saturation/value and quantize [Smith & Chang]
• Color layout: local color features obtained by dividing the image into regions
• Color autocorrelograms

Texture
• One of the earliest image features [Haralick et al., 1970s]
• Co-occurrence matrix: orientation and distance on gray-scale pixels; contrast, inverse difference moment, and entropy [Gotlieb & Kreyszig]
• Human visual texture properties: coarseness, contrast, directionality, line-likeness, regularity, and roughness [Tamura et al.]
• Wavelet transforms [1990s]: [Smith & Chang] extracted mean and variance from wavelet subbands
• Gabor filters, and so on

Region Segmentation
• Partition the image into regions
• Strong segmentation (object segmentation) is difficult
• Weak segmentation: region segmentation based on some homogeneity criterion

Scene Segmentation
• Shot detection, scene detection
• Look for changes in color, texture, brightness
• Context-based scene segmentation applied to certain categories such as broadcast news
39
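Of the color features listed above, color moments are the simplest to sketch: the mean, standard deviation, and skewness of each channel give a compact nine-number color signature for an RGB image. This is a generic illustration, not a specific system from the slides.

```python
# Sketch: the "color moments" feature, per channel.
def color_moments(channel):
    """First three moments of one color channel (a list of 0-255 values)."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = var ** 0.5
    # Signed cube root of the third central moment keeps the units
    # comparable to the mean and standard deviation.
    third = sum((v - mean) ** 3 for v in channel) / n
    skew = abs(third) ** (1 / 3) * (1 if third >= 0 else -1)
    return mean, std, skew

# A symmetric channel: half black, half white pixels.
print(color_moments([0, 0, 255, 255]))  # (127.5, 127.5, 0.0)
```

Because only nine numbers describe a whole image, color moments trade discrimination power for a very cheap comparison, typically a weighted L1 distance between moment vectors.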
Video Search: Features
Shape
• Outer-boundary-based vs. region-based
• Fourier descriptors
• Moment invariants
• Finite element method (stiffness matrix: how each point is connected to the others; eigenvectors of the matrix)
• Turning-function-based (similar to Fourier descriptors) for convex/concave polygons [Arkin et al.]
• Wavelet transforms leverage multiresolution [Chuang & Kuo]
• Chamfer matching for comparing two shapes (linear dimension rather than area)
• 3-D object representations using similar invariant features
• Well-known edge detection algorithms

Face
• Face detection is highly reliable
– Neural networks [Rowley]
– Wavelet-based histograms of facial features [Schneiderman]
• Face recognition for video is still a challenging problem
– Eigenfaces: extract eigenvectors and use them as a feature space

OCR
• OCR is a fairly successful technology
• Accurate, especially with good matching vocabularies
• Script recognition is still an open problem

ASR
• Automatic speech recognition is fairly accurate for medium- to large-vocabulary broadcast-type data
• Large number of available speech vendors
• Still open for free conversational speech in noisy conditions
40
Video Search: Video TREC
• Overview:
– Shot detection, story segmentation, semantic
feature extraction, information retrieval
– Corpora of documentaries, advertising films
– Broadcast news added in 2003
– Interactive and non-interactive tests
– CBIR features
– Speech transcribed (LIMSI)
– OCR
41
Video Search
Traditional media search: structured + unstructured data; content features + video production meta-content.
Web video search: unstructured data; video consumption meta-content / community.
42
Opportunities
Example 1:
– User query “zorro”
43
44
Opportunities
Example 2:
– For mainstream "head" content such as news videos
– Metadata are fairly descriptive
– Usually queried based on non-visual attributes
45
46
Opportunities
Example 3:
– Creative Home Video
– Community video rendering!
47
Play Video 1 Play Video 2
48
Opportunities
A hard example:
– "Suppose you want to find videos that depict a monkey/chimp doing karate!"
– CBIR approach:
• Train models for chimps/monkeys
• Motion analysis for karate movement models
– Many open issues/problems!
49
Play Video
50
Video Data Management
1. Video parsing
• Manipulation of the whole video to break it down into key frames
– Scene: a single dramatic event taken by a small number of related cameras
– Shot: a sequence taken by a single camera
– Frame: a still image
2. Video indexing
• Retrieving information about each frame for indexing in a database
3. Video retrieval and browsing
• Users access the database through queries or through interactions
51
System overview
[Pipeline: test data → shot boundary detection → key frames → feature generation → feature vectors → k-NN matching (boosting, VSM) of the query against each shot → weighted sum of distances (ΣwD) → retrieved results, with relevance feedback from the results back into the query]
52
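The ΣwD step in the overview, a weighted sum of per-feature distances, can be sketched as follows. The feature names, weights, and vectors are illustrative; in the system above, relevance feedback would adjust the weights over time.

```python
# Sketch: combining per-feature distances with weights (the ΣwD step).
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def rank_shots(query, shots, weights):
    """query and each shot: dict mapping feature name -> vector.
    Returns shot ids sorted by weighted total distance (best first)."""
    def score(shot_feats):
        return sum(w * euclidean(query[f], shot_feats[f])
                   for f, w in weights.items())
    return sorted(shots, key=lambda sid: score(shots[sid]))

# Hypothetical two-feature index of two shots:
query = {"color": [0.9, 0.1], "texture": [0.5]}
shots = {
    "shot_a": {"color": [0.9, 0.1], "texture": [0.4]},  # near the query
    "shot_b": {"color": [0.1, 0.9], "texture": [0.5]},  # far in color
}
print(rank_shots(query, shots, weights={"color": 0.7, "texture": 0.3}))
# ['shot_a', 'shot_b']
```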
An Architecture for a Video Database System
[Layered architecture, bottom-up: raw video is stored in a frame database; spatial abstraction yields image features, a physical object database, and indexed sequences of frames; intra/inter-frame analysis (motion analysis) performs object identification and tracking; above that sit the spatial semantics of objects (human, building, …) and their semantic associations (President, Capitol, …); the top layer captures spatio-temporal semantics: object definitions (events/concepts), formal specification of events/activities/episodes for content-based retrieval, and inter-object movement analysis]
53
Video Data Management
• Metadata-based method
• Text-based method
• Audio-based method
• Content-based method
• Integrated approach
54
Metadata-based Method
• Video is indexed and retrieved based on structured metadata information, using a traditional DBMS
55
Text-based Method
• Video is indexed and retrieved based on associated subtitles (text), using traditional IR techniques for text documents
56
Text-based Method
• The basic method is human annotation
• Can be done automatically where subtitles / transcriptions exist
– BBC: 100% of output subtitled by 2008
• Speech recognition for archive material
57
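The text-based method reduces video search to ordinary text IR: an inverted index maps words to the shots whose captions contain them. The shot ids and captions below are hypothetical.

```python
# Sketch: keyword search over subtitles via an inverted index.
from collections import defaultdict

def build_index(subtitles):
    """subtitles: dict mapping shot_id -> caption text."""
    index = defaultdict(set)
    for shot_id, text in subtitles.items():
        for word in text.lower().split():
            index[word].add(shot_id)
    return index

def search(index, query):
    """AND semantics: shots whose captions contain every query word."""
    results = None
    for word in query.lower().split():
        hits = index.get(word, set())
        results = hits if results is None else results & hits
    return results or set()

subs = {"s1": "the president arrives in Washington",
        "s2": "weather report for Washington"}
index = build_index(subs)
print(search(index, "president washington"))  # {'s1'}
```

Stopping, stemming, and ranking (covered elsewhere in the course) would apply here exactly as for text documents.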
Text Detection
58
[Figure: video frames (1/2 s intervals) → filtered frames → AND-ed frames]
Text-based Method
• Keyword search based on subtitles
• Content based
• Live demo: http://km.doc.ic.ac.uk/vse/
61
Text-based Method
62
outline
• images
• video
• speech
63
Spoken Document Retrieval
Acoustic modeling describes the sounds that make up speech.
[Speech recognition pipeline: speech → signal processing → phonetic probability estimator (acoustic model) → decoder → words; the decoder also consults a grammar (language model) and a pronunciation lexicon]
65
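The decoder in the pipeline above combines the acoustic model and the language model in noisy-channel fashion, choosing the words that maximize P(acoustics | words) · P(words). A toy sketch with made-up probabilities:

```python
# Sketch: noisy-channel word choice. All probabilities are invented
# for illustration; a real decoder searches over word sequences.
candidates = {
    # word: (acoustic likelihood P(A|W), language-model prior P(W))
    "recognize": (0.30, 0.010),
    "wreck a nice": (0.32, 0.001),  # acoustically similar, unlikely a priori
}

def decode(cands):
    """Pick the candidate maximizing P(A|W) * P(W)."""
    return max(cands, key=lambda w: cands[w][0] * cands[w][1])

print(decode(candidates))  # "recognize": the language model breaks the tie
```

The "hints" on the next slides amount to sharpening the prior: conditioning P(words) on extra context X (topic, news of the day, image information) shifts probability mass toward the words the context makes likely.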
Hints for Better Recognition
• Goal: improve the estimate of p(word | acoustic_signal)
• Main idea: p(word | acoustic_signal) → p(word | acoustic_signal, X)
• What could X be?
– Topical information
– News of the day
– Image information
• Lip reading
• Video optical character recognition (VOCR)
Speech Recognition Accuracy
[Chart: word error rate (%, scale 0–100) by condition: commercials, documentary, news, dialog, TV studio, lab benchmark]
Information Retrieval Precision vs. Speech Accuracy
[Plot: retrieval precision relative to text IR (%, 30–100) as a function of word error rate (0–70%)]
70
Demos
http://images.google.com/
http://video.google.com/
http://www.hermitagemuseum.org/fcgi-bin/db2www/qbicSearch.mac/qbic?selLang=English
http://amazon.ece.utexas.edu/~qasim/research.htm
http://mp7.watson.ibm.com/
http://viper.unige.ch/research/video/
71
END
72
exam topics
• Evaluation
– Recall, precision, E, F
– AP
– R-prec
– Precision at cutoff
– Precision-recall curves
• Retrieval models
– Boolean
– Vector space
– Language modeling
– Inference networks
• Indexing
– Manual vs. automatic
– Tokens, stopping, stemming
• File organization
– Bitmaps
– Signature files
– Inverted files
• Statistics of text
– Zipf's law
– Heaps' law
– Information theory
• Compression
– Huffman
– Lempel-Ziv
• Relevance feedback
– Real
– Assumed
• Clustering
– Graph, partitioning, nearest neighbor
– Clustering algorithms
• Markov chains
– Stationary distribution
• PageRank formula
• Metasearch
– CombSUM
– Borda
– Condorcet
• Collaborative filtering
• P2P
– 3 generations
73