
CS 550-Multimedia and Web Services

Content Analysis:
Text recognition, Similarity Based search, Video analysis, Audio analysis. SMIL tutorials.

Text Recognition
Text recognition in images is a research area that aims to develop computer systems able to read text from images automatically. There is a growing demand to store the information held in paper documents on computer storage and to reuse it later through search. A simple way to capture such paper documents is to scan them and store the results as images. However, image files make it very difficult to read or search the individual contents of these documents line by line and word by word. The main challenges are the font characteristics of the characters in the paper documents and the quality of the scanned images; because of them, a computer cannot recognize the characters directly. Character recognition mechanisms are therefore needed to perform Document Image Analysis (DIA), which transforms documents from paper format into electronic format. In this section we discuss methods for text recognition from images.
For example, when the source of a PDF was an image rather than a typed document, the PDF file does not contain searchable text by default. We have to use software with a built-in Optical Character Recognition (OCR) feature to convert the PDF document into text.

Optical character recognition (OCR)


Optical Character Recognition or Optical Character Reader (OCR) is the conversion of images of typed,
handwritten or printed text into machine encoded text, whether from a scanned document, a photo of a
document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from
subtitle text superimposed on an image (for example: from a television broadcast).

Widely used as a form of data entry from printed paper data records – whether passport documents,
invoices, bank statements, computerized receipts, business cards, mail, printouts of static-data, or any
suitable documentation – it is a common method of digitizing printed texts so that they can be
electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes
such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining.
OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced software systems capable of producing a high degree of recognition accuracy for most fonts are now common, with support for a variety of digital image file formats as input. Some systems are capable of reproducing formatted output that closely approximates the original page, including images, columns, and other non-textual components.

In the 2000s, OCR was made available online as a service (WebOCR), in a cloud
computing environment, and in mobile applications like real-time translation of foreign-language signs
on a smartphone. With the advent of smartphones and smart glasses, OCR can be used in internet-connected mobile device applications that extract text captured using the device's camera. Devices that do not have OCR functionality built into the operating system typically use an OCR API to extract the text from the image file captured by the device. The OCR API returns the extracted text, along with information about the location of the detected text in the original image, to the device app for further processing (such as text-to-speech) or display.
Various commercial and open source OCR systems are available for most common writing systems,
including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese,
Japanese, and Korean characters.
Applications
OCR engines have been developed into many kinds of domain-specific OCR applications, such as
receipt OCR, invoice OCR, check OCR, legal billing document OCR.
They can be used for:
• Data entry for business documents, e.g. cheques, passports, invoices, bank statements and receipts
• Automatic number plate recognition
• In airports, for passport recognition and information extraction
• Automatic insurance documents key information extraction
• Traffic sign recognition
TEXT_DETECTION detects and extracts text from any image. For example, a photograph
might contain a street sign or traffic sign.

• Extracting business card information into a contact list


• Making electronic images of printed documents searchable, e.g. Google Books
• Converting handwriting in real time to control a computer (pen computing)
DOCUMENT_TEXT_DETECTION also extracts text from an image, but the response
is optimized for dense text and documents. One specific use
of DOCUMENT_TEXT_DETECTION is to detect handwriting in an image.

• Assistive technology for blind and visually impaired users


• Writing the instructions for vehicles by identifying CAD images in a database that are
appropriate to the vehicle design as it changes in real time.
• Making scanned documents searchable by converting them to searchable PDFs

Types
• Optical character recognition (OCR) – targets typewritten text, one glyph or character at a
time.
• Optical word recognition – targets typewritten text, one word at a time (for languages that use
a space as a word divider). (Usually just called "OCR".)
• Intelligent character recognition (ICR) – also targets handwritten printscript or cursive text
(Cursive is also known as script or joined writing and is a unique form of handwriting in which
the language symbols are conjointly written in a flowing style. The initial purpose
of cursive writing was to create a smoother, faster way to write) one glyph or character at a time,
usually involving machine learning.
• Intelligent word recognition (IWR) – also targets handwritten printscript or cursive text, one
word at a time. This is especially useful for languages where glyphs are not separated in cursive
script.
• Handwriting movement analysis can be used as input to handwriting recognition. Instead of
merely using the shapes of glyphs and words, this technique is able to capture motions, such as
the order in which segments are drawn, the direction, and the pattern of putting the pen down
and lifting it. This additional information can make the end-to-end process more accurate. This
technology is also known as "on-line character recognition", "dynamic character recognition",
"real-time character recognition", and "intelligent character recognition".
Techniques:
1. Pre-processing
OCR software often "pre-processes" images to improve the chances of successful recognition.
Techniques include:
• De-skew – If the document was not aligned properly when scanned, it may need to be tilted a
few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal
or vertical.
• Despeckle – remove positive and negative spots, smoothing edges
• Binarization – Convert an image from color or greyscale to black-and-white (called a "binary image" because there are only two colors). Binarization is performed as a simple way of separating the text (or any other desired image component) from the background. It is necessary because most commercial recognition algorithms work only on binary images, which are simpler to process. Moreover, the effectiveness of the binarization step significantly influences the quality of the character recognition stage, so the binarization method must be chosen carefully for the given input image type (scanned document, scene-text image, historical degraded document, etc.). A small code sketch of this step is given after this list.
• Line removal – Cleans up non-glyph boxes and lines
• Layout analysis or "zoning" – Identifies columns, paragraphs, captions, etc. as distinct blocks.
Especially important in multi-column layouts and tables.
• Line and word detection – Establishes baseline for word and character shapes, separates words
if necessary.
• Script recognition – In multilingual documents, the script may change at the level of the words
and hence, identification of the script is necessary, before the right OCR can be invoked to
handle the specific script.
• Character isolation or "segmentation" – For per-character OCR, multiple characters that are
connected due to image artifacts must be separated; single characters that are broken into
multiple pieces due to artifacts must be connected.
• Normalize aspect ratio and scale
• Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a
uniform grid based on where vertical grid lines will least often intersect black areas.
For proportional fonts, more sophisticated techniques are needed because whitespace between
letters can sometimes be greater than that between words, and vertical lines can intersect more
than one character.

2. Text recognition
There are two basic types of core OCR algorithm, which may produce a ranked list of candidate
characters.
• Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is
also known as "pattern matching", "pattern recognition", or "image correlation". This relies on
the input glyph being correctly isolated from the rest of the image, and on the stored glyph being
in a similar font and at the same scale. This technique works best with typewritten text and does
not work well when new fonts are encountered. This is the technique the early physical
photocell-based OCR implemented, rather directly.
• Feature extraction decomposes glyphs into "features" like lines, closed loops, line direction,
and line intersections. Extracting features reduces the dimensionality of the representation
and makes the recognition process computationally efficient. These features are compared with
an abstract vector-like representation of a character, which might reduce to one or more glyph
prototypes. General techniques of feature detection in computer vision are applicable to this
type of OCR, which is commonly seen in "intelligent" handwriting recognition and indeed most
modern OCR software. Nearest neighbour classifiers such as the k-nearest neighbors
algorithm are used to compare image features with stored glyph features and choose the nearest
match.
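To make the feature-matching idea concrete, the following minimal Java sketch classifies an isolated glyph by comparing its feature vector against stored prototypes with a 1-nearest-neighbour rule. The three "features" used here (say, loop count, stroke count and aspect ratio) and all of the numbers are purely illustrative.

import java.util.ArrayList;
import java.util.List;

public class GlyphMatcher {
    // A stored prototype: a feature vector plus the character it represents.
    record Prototype(double[] features, char label) {}

    // Euclidean distance between two feature vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // 1-nearest-neighbour classification: return the label of the closest prototype.
    static char classify(double[] glyphFeatures, List<Prototype> prototypes) {
        char best = '?';
        double bestDist = Double.MAX_VALUE;
        for (Prototype p : prototypes) {
            double d = distance(glyphFeatures, p.features());
            if (d < bestDist) {
                bestDist = d;
                best = p.label();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Prototype> prototypes = new ArrayList<>();
        prototypes.add(new Prototype(new double[] {1, 2, 1.0}, 'a'));
        prototypes.add(new Prototype(new double[] {0, 3, 2.0}, 'k'));
        prototypes.add(new Prototype(new double[] {2, 1, 1.5}, 'g'));

        System.out.println(classify(new double[] {1, 2, 1.1}, prototypes)); // prints a
    }
}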
Software such as Cuneiform and Tesseract use a two-pass approach to character recognition. The
second pass is known as "adaptive recognition" and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded). Modern OCR software such as OCRopus or Tesseract uses neural networks trained to recognize whole lines of text instead of focusing on single characters.

3. Post-processing
OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are allowed
to occur in a document. This might be, for example, all the words in the English language, or a more
technical lexicon for a specific field. This technique can be problematic if the document contains words
not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character
segmentation step, for improved accuracy.
The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems
can preserve the original layout of the page and produce, for example, an annotated PDF that includes
both the original image of the page and a searchable textual representation.
§ "Near-neighbor analysis" can make use of co-occurrence frequencies to correct errors, by noting
that certain words are often seen together. For example, "Washington, D.C." is generally far
more common in English than "Washington DOC".
§ Knowledge of the grammar of the language being scanned can also help determine if a word is
likely to be a verb or a noun, for example, allowing greater accuracy.
§ The Levenshtein Distance algorithm has also been used in OCR post-processing to further
optimize results from an OCR API.
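As an illustration of the Levenshtein-distance idea, the hedged Java sketch below snaps an OCR token to the closest word in a small lexicon whenever the edit distance is small enough; the lexicon and the misrecognized token are made up.

import java.util.List;

public class LexiconCorrector {
    // Standard dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Replace an OCR token with the closest lexicon word, if one is close enough.
    static String correct(String token, List<String> lexicon, int maxDistance) {
        String best = token;
        int bestDist = maxDistance + 1;
        for (String word : lexicon) {
            int dist = levenshtein(token.toLowerCase(), word.toLowerCase());
            if (dist < bestDist) {
                bestDist = dist;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lexicon = List.of("Washington", "invoice", "receipt");
        System.out.println(correct("Wash1ngton", lexicon, 2)); // prints Washington
    }
}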
An example OCR API will be discussed in later section.

Application-specific optimizations
In recent years, the major OCR technology providers began to tweak OCR systems to deal more
efficiently with specific types of input. Beyond an application-specific lexicon, better performance may
be had by taking into account business rules, standard expressions, or rich information contained in color
images. This strategy is called "Application-Oriented OCR" or "Customized OCR", and has been
applied to OCR of license plates, invoices, screenshots, ID cards, driver licenses, and automobile
manufacturing.

Accuracy
Recognition of Latin-script, typewritten text is still not 100% accurate even where clear imaging is
available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded
that character-by-character OCR accuracy for commercial OCR software varied from 81% to 99%; total
accuracy can be achieved by human review or Data Dictionary Authentication. Other areas—including
recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East
Asian language characters which have many strokes for a single character)—are still the subject of active
research. The Modified National Institute of Standards and Technology (MNIST) database is commonly
used for testing systems' ability to recognize handwritten digits.
Accuracy rates can be measured in several ways, and how they are measured can greatly affect the
reported accuracy rate. For example, if word context (basically a lexicon of words) is not used to correct
software finding non-existent words, a character error rate of 1% (99% accuracy) may result in an error
rate of 5% (95% accuracy) or worse if the measurement is based on whether each whole word was
recognized with no incorrect letters.
An example of the difficulties inherent in digitizing old text is the inability of OCR to differentiate
between the "long s" and "f" characters.
Web-based OCR systems for recognizing hand-printed text on the fly have become well known as
commercial products in recent years. Accuracy rates of 80% to 90% on neat, clean hand-printed
characters can be achieved by pen computing software, but that accuracy rate still translates to dozens
of errors per page, making the technology useful only in very limited applications.
Recognition of cursive text is an active area of research, with recognition rates even lower than that
of hand-printed text. Higher rates of recognition of general cursive script will likely not be possible
without the use of contextual or grammatical information. For example, recognizing entire words from
a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of
a cheque (which is always a written-out number) is an example where using a smaller dictionary can
increase recognition rates greatly. The shapes of individual cursive characters themselves simply do not
contain enough information to accurately (greater than 98%) recognize all handwritten cursive script.
Most programs allow users to set "confidence rates". This means that if the software does not achieve the desired level of accuracy, the user can be notified for manual review.

An error introduced by OCR scanning is sometimes termed a "scanno" (by analogy with the
term "typo").

Before looking into the OCR API, let us learn about XML, HTML, and JSON: what's the difference?

HTML
HTML is used for creating Web pages. HTML is basically a markup language: it represents the page structure. Using HTML you are telling a browser "this is what my page should look like". It does not store data. HTML is rendered directly by the browser.

XML
XML is a markup language designed to store data, and it is popularly used to transfer data. It is case sensitive. XML lets you define your own markup elements and generate a customized markup language. The basic unit in XML is known as an element. The extension of an XML file is .xml.

JSON
JSON stands for JavaScript Object Notation. JSON is a format mainly used for sending and receiving data. This matters because when we exchange data between a browser and a server, the data travels in the form of text. JSON offers a human-readable collection of data which can be accessed logically.
With the rise of AJAX-powered sites (AJAX stands for Asynchronous JavaScript And XML. In a
nutshell, it is the use of the XMLHttpRequest object to communicate with servers. It can send and
receive information in various formats, including JSON, XML, HTML), it’s becoming more and more
important for sites to be able to load data quickly and asynchronously, or in the background without
delaying page rendering. Switching up the contents of a certain element within our layouts without
requiring a page refresh adds a “wow” factor to our applications, not to mention the added convenience
for our users. Because of the popularity and ease of social media, many sites rely on the content provided
by sites such as Twitter, Flickr, and others.

History of JSON
Here are important landmarks that form the history of JSON:
• Douglas Crockford specified the JSON format in the early 2000s.
• The official website was launched in 2002.
• In December 2005, Yahoo! started offering some of its web services in JSON.
• JSON became an ECMA international standard in 2013.
• The most updated JSON format standard was published in 2017.

History of XML
Here are the important landmarks from the history of XML:
• XML is the Extensible Markup Language.
• 1970: Charles Goldfarb, Ed Mosher, and Ray Lorie invented GML.
• XML was derived from the Standard Generalized Markup Language (SGML).
• The development of XML started in 1996 at Sun Microsystems.
• Version 1.0 of XML was released in February 1998.
• January 2001: IETF Proposed Standard: XML Media Types.

Features of JSON
• Easy to use - the JSON API offers a high-level facade which helps you simplify commonly used use-cases.
• Performance - JSON is quite fast and consumes very little memory, which makes it especially suitable for large object graphs or systems.
• Free tool - the JSON library is open source and free to use.
• No mapping required - the Jackson API provides default mapping for many objects to be serialized.
• Clean JSON - creates clean, compatible JSON results that are easy to read.
• Dependency - the JSON library does not require any other library for processing.

Features of XML
• XML tags are not predefined: you need to define your own customized tags.
• XML was designed to carry data, not to display it.
• XML mark-up code is easy for a human to understand.
• The well-structured format is easy to read and write from programs.
• XML is a markup language, like HTML, but it is extensible.

Difference between JSON and XML

• A JSON object has a type; XML data is typeless.
• JSON types are string, number, array and Boolean; all XML data is string.
• Data is readily accessible as JSON objects; XML data needs to be parsed.
• JSON is supported by most browsers; cross-browser XML parsing can be tricky.
• JSON has no display capabilities; XML offers the capability to display data because it is a markup language.
• JSON supports only text and number data types; XML supports various data types such as number, text, images, charts and graphs, and also provides options for transferring the structure or format of the data along with the actual data.
• Retrieving a value is easy in JSON; retrieving a value is difficult in XML.
• JSON is supported by many Ajax toolkits; XML is not fully supported by Ajax toolkits.
• JSON gives a fully automated way of serializing/deserializing JavaScript; with XML, developers have to write JavaScript code to serialize/deserialize.
• JSON has native support for objects; in XML an object has to be expressed by conventions, mostly by the use of attributes and elements.
• JSON supports only UTF-8 encoding; XML supports various encodings.
• JSON does not support comments; XML supports comments.
• JSON files are easy to read compared with XML; XML documents are relatively more difficult to read and interpret.
• JSON does not provide any support for namespaces; XML supports namespaces.
• JSON is less secure; XML is more secure than JSON.


Advantages of using JSON


Here are the important benefits/pros of using JSON:
• Provides support for all browsers
• Easy to read and write
• Straightforward syntax
• You can parse it natively in JavaScript using the eval() function
• Easy to create and manipulate
• Supported by all major JavaScript frameworks
• Supported by most backend technologies
• JSON is recognized natively by JavaScript
• It allows you to transmit and serialize structured data over a network connection
• You can use it with modern programming languages
• Any JavaScript object can be converted to JSON text and sent to the server

Advantages of using XML


Here are the significant benefits/pros of using XML:
• Makes documents transportable across systems and applications. With the help of XML, you can exchange data quickly between different platforms.
• XML separates the data from HTML.
• XML simplifies the platform-change process.
Disadvantages of using JSON
Here are the cons/drawbacks of using JSON:
• No namespace support, hence poor extensibility
• Limited development-tool support
• No support for formal grammar definition

Disadvantages of using XML


Here are the cons/drawbacks of using XML:
• XML requires a processing application
• The XML syntax is very similar to other 'text-based' data transmission formats, which is sometimes confusing
• No intrinsic data type support
• The XML syntax is redundant

JSON is promoted as a low-overhead alternative to XML as both of these formats have widespread
support for creation, reading, and decoding in the real-world situations where they are commonly used.
In layman's terms, you can think of JSON as a language interpreter between two people who do not speak the same language.

The following example shows a possible XML and JSON representation describing personal
information.

JSON Sample program

{
  "first_name": "John",
  "last_name": "Smith",
  "age": 25,
  "address": {
    "street_address": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postal_code": "10021"
  },
  "phone_numbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ],
  "sex": {
    "type": "male"
  }
}

XML sample program

<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
  <age>25</age>
  <address>
    <streetAddress>21 2nd Street</streetAddress>
    <city>New York</city>
    <state>NY</state>
    <postalCode>10021</postalCode>
  </address>
  <phoneNumbers>
    <phoneNumber>
      <type>home</type>
      <number>212 555-1234</number>
    </phoneNumber>
    <phoneNumber>
      <type>fax</type>
      <number>646 555-4567</number>
    </phoneNumber>
  </phoneNumbers>
  <sex>
    <type>male</type>
  </sex>
</person>
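To show how a program consumes the two representations, the following hedged Java sketch reads a reduced version of each sample: JSON through the Jackson library mentioned earlier (assumed to be on the classpath), and XML through the JDK's built-in DOM parser.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PersonParsing {
    public static void main(String[] args) throws Exception {
        String json = "{ \"first_name\": \"John\", \"last_name\": \"Smith\", \"age\": 25 }";
        String xml = "<person><firstName>John</firstName><lastName>Smith</lastName><age>25</age></person>";

        // JSON: Jackson maps the text directly onto a tree of typed nodes.
        JsonNode root = new ObjectMapper().readTree(json);
        System.out.println(root.get("first_name").asText() + ", age " + root.get("age").asInt());

        // XML: the document is parsed into a DOM tree and navigated by tag name;
        // every value comes back as a string.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(doc.getElementsByTagName("firstName").item(0).getTextContent()
                + ", age " + doc.getElementsByTagName("age").item(0).getTextContent());
    }
}

The contrast mirrors the comparison above: the JSON value 25 is already a number, while the XML value remains text until the program converts it.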

OCR API
The OCR API provides a simple way of parsing images and multi-page PDF documents (PDF OCR)
and getting the extracted text results returned in JSON format (JavaScript Object Notation (JSON) is an open-standard file format and data-interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs, array data types, or any other serializable value).
One such API is the Google Cloud Vision API.

The Google Cloud Vision API allows developers to easily integrate vision detection features within
applications, including image labeling, face and landmark detection, optical character recognition
(OCR), and tagging of explicit content.

The Vision API can perform feature detection on a local image file or remote image file by sending the
contents of the image file as a base64 encoded string in the body of your request.

(Base64 is most commonly used to encode binary data, for example images or sound files, for embedding into HTML, CSS, EML, and other text documents. Images can be inlined into HTML code using base64 encoding; a base64-encoded image becomes part of the HTML and is displayed without the browser having to download the image separately.)
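As a small illustration, the Java sketch below produces such a base64 string from a local image file using the JDK's java.util.Base64 class; the file name sign.jpg is only an example.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class EncodeImage {
    public static void main(String[] args) throws Exception {
        // Read the raw bytes of a local image (example file name) ...
        byte[] imageBytes = Files.readAllBytes(Paths.get("sign.jpg"));
        // ... and turn them into the base64 string placed in the request body.
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        System.out.println(base64.substring(0, Math.min(40, base64.length())) + "...");
    }
}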

Sample C# and Java programs for detecting text in images (sent as base64-encoded strings) are given below.

(i) Detect document text in a local image:

C# (Before trying this sample, follow the C# setup instructions in the Vision Quickstart Using Client
Libraries. For more information, see the Vision C# API reference documentation.)

// Load an image from a local file.
var image = Image.FromFile(filePath);
var client = ImageAnnotatorClient.Create();
var response = client.DetectDocumentText(image);
foreach (var page in response.Pages)
{
    foreach (var block in page.Blocks)
    {
        foreach (var paragraph in block.Paragraphs)
        {
            Console.WriteLine(string.Join("\n", paragraph.Words));
        }
    }
}

Java (Before trying this sample, follow the Java setup instructions in the Vision API Quickstart Using
Client Libraries. For more information, see the Vision API Java API reference documentation.)

public static void detectDocumentText(String filePath) throws IOException {
  List<AnnotateImageRequest> requests = new ArrayList<>();

  ByteString imgBytes = ByteString.readFrom(new FileInputStream(filePath));

  Image img = Image.newBuilder().setContent(imgBytes).build();
  Feature feat = Feature.newBuilder().setType(Type.DOCUMENT_TEXT_DETECTION).build();
  AnnotateImageRequest request =
      AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
  requests.add(request);

  // Initialize client that will be used to send requests. This client only needs to be created
  // once, and can be reused for multiple requests. After completing all of your requests, call
  // the "close" method on the client to safely clean up any remaining background resources.
  try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
    BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
    List<AnnotateImageResponse> responses = response.getResponsesList();
    client.close();

    for (AnnotateImageResponse res : responses) {
      if (res.hasError()) {
        System.out.format("Error: %s%n", res.getError().getMessage());
        return;
      }

      // For full list of available annotations, see http://g.co/cloud/vision/docs
      TextAnnotation annotation = res.getFullTextAnnotation();
      for (Page page : annotation.getPagesList()) {
        String pageText = "";
        for (Block block : page.getBlocksList()) {
          String blockText = "";
          for (Paragraph para : block.getParagraphsList()) {
            String paraText = "";
            for (Word word : para.getWordsList()) {
              String wordText = "";
              for (Symbol symbol : word.getSymbolsList()) {
                wordText = wordText + symbol.getText();
                System.out.format(
                    "Symbol text: %s (confidence: %f)%n",
                    symbol.getText(), symbol.getConfidence());
              }
              System.out.format(
                  "Word text: %s (confidence: %f)%n%n", wordText, word.getConfidence());
              paraText = String.format("%s %s", paraText, wordText);
            }
            // Output Example using Paragraph:
            System.out.format("%nParagraph:%n%s%n", paraText);
            System.out.format("Paragraph Confidence: %f%n", para.getConfidence());
            blockText = blockText + paraText;
          }
          pageText = pageText + blockText;
        }
      }
      System.out.format("%nComplete annotation:%n");
      System.out.println(annotation.getText());
    }
  }
}

(ii) Detect document text in a remote image

The Vision API can perform feature detection directly on an image file located in Google Cloud Storage
or on the Web without the need to send the contents of the image file in the body of your request.

C# (Before trying this sample, follow the C# setup instructions in the Vision Quickstart Using Client
Libraries. For more information, see the Vision C# API reference documentation.)

// Specify a Google Cloud Storage uri for the image
// or a publicly accessible HTTP or HTTPS uri.
var image = Image.FromUri(uri);
var client = ImageAnnotatorClient.Create();
var response = client.DetectDocumentText(image);
foreach (var page in response.Pages)
{
    foreach (var block in page.Blocks)
    {
        foreach (var paragraph in block.Paragraphs)
        {
            Console.WriteLine(string.Join("\n", paragraph.Words));
        }
    }
}

Java (Before trying this sample, follow the Java setup instructions in the Vision API Quickstart Using
Client Libraries. For more information, see the Vision API Java API reference documentation.)

public static void detectDocumentTextGcs(String gcsPath) throws IOException {
  List<AnnotateImageRequest> requests = new ArrayList<>();

  ImageSource imgSource = ImageSource.newBuilder().setGcsImageUri(gcsPath).build();

  Image img = Image.newBuilder().setSource(imgSource).build();
  Feature feat = Feature.newBuilder().setType(Type.DOCUMENT_TEXT_DETECTION).build();
  AnnotateImageRequest request =
      AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
  requests.add(request);

  // Initialize client that will be used to send requests. This client only needs to be created
  // once, and can be reused for multiple requests. After completing all of your requests, call
  // the "close" method on the client to safely clean up any remaining background resources.
  try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
    BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
    List<AnnotateImageResponse> responses = response.getResponsesList();
    client.close();

    for (AnnotateImageResponse res : responses) {
      if (res.hasError()) {
        System.out.format("Error: %s%n", res.getError().getMessage());
        return;
      }

      // For full list of available annotations, see http://g.co/cloud/vision/docs
      TextAnnotation annotation = res.getFullTextAnnotation();
      for (Page page : annotation.getPagesList()) {
        String pageText = "";
        for (Block block : page.getBlocksList()) {
          String blockText = "";
          for (Paragraph para : block.getParagraphsList()) {
            String paraText = "";
            for (Word word : para.getWordsList()) {
              String wordText = "";
              for (Symbol symbol : word.getSymbolsList()) {
                wordText = wordText + symbol.getText();
                System.out.format(
                    "Symbol text: %s (confidence: %f)%n",
                    symbol.getText(), symbol.getConfidence());
              }
              System.out.format(
                  "Word text: %s (confidence: %f)%n%n", wordText, word.getConfidence());
              paraText = String.format("%s %s", paraText, wordText);
            }
            // Output Example using Paragraph:
            System.out.format("%nParagraph:%n%s%n", paraText);
            System.out.format("Paragraph Confidence: %f%n", para.getConfidence());
            blockText = blockText + paraText;
          }
          pageText = pageText + blockText;
        }
      }
      System.out.format("%nComplete annotation:%n");
      System.out.println(annotation.getText());
    }
  }
}

Specify the language (optional)


Both types of OCR requests support one or more languageHints that specify the language of any text in
the image. However, in most cases, an empty value yields the best results since it enables automatic
language detection. In rare cases, when the language of the text in the image is known, setting a hint
helps get better results (although it can be a significant hindrance if the hint is wrong). Text detection
returns an error if one or more of the specified languages is not one of the supported languages.
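As a hedged illustration, the fragment below shows how such a hint could be attached to the request built in the Java samples above (feat, img and requests are the variables from those samples; ImageContext is the request option object generated by the same client library).

// Attach an optional language hint to the request. Leaving the hint out
// usually gives the best results, since it enables automatic detection.
ImageContext context = ImageContext.newBuilder()
    .addLanguageHints("en")   // hint: the text is expected to be English
    .build();

AnnotateImageRequest request =
    AnnotateImageRequest.newBuilder()
        .addFeatures(feat)
        .setImage(img)
        .setImageContext(context)
        .build();
requests.add(request);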
Similarity Based search:
Similarity search is the most general term used for a range of mechanisms which share the principle of
searching (typically, very large) spaces of objects where the only available comparator is the similarity
between any pair of objects. This is becoming increasingly important in an age of large information
repositories where the objects contained do not possess any natural order, for example large collections
of images, sounds and other sophisticated digital objects.

Nearest neighbor search and range queries are important subclasses of similarity search, and a number of solutions exist.

Nearest neighbor search - Given a set S of objects and a query object q, find the object in S that is closest (most similar) to q under the chosen distance measure.
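A brute-force (linear-scan) solution makes the problem statement concrete: compare the query against every object in S and keep the closest one. Index structures exist to avoid the full scan, but the sketch below, with made-up 2-D points, shows the baseline.

public class NearestNeighbour {
    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Linear scan: return the index of the object in s closest to the query.
    static int nearest(double[][] s, double[] query) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < s.length; i++) {
            double d = distance(s[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] s = { {0, 0}, {3, 4}, {10, 1} };
        System.out.println(nearest(s, new double[] {2.5, 3.5})); // prints 1
    }
}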

Problem definition:

Range queries -The range searching problem most generally consists of preprocessing a set S of
objects, in order to determine which objects from S intersect with a query object, called a range. For
example, if S is a set of points corresponding to the coordinates of several cities, a geometric variant of
the problem is to find cities within a certain latitude and longitude range.

There are several variations of the problem, and different data structures may be necessary for different
variations. In order to obtain an efficient solution, several aspects of the problem need to be specified:
• Object types: Algorithms depend on whether S consists of points, lines, line
segments, boxes, polygons. The simplest and most studied objects to search are points.
• Range types: The query ranges also need to be drawn from a predetermined set. Some well-
studied sets of ranges, and the names of the respective problems are axis-aligned rectangles
(orthogonal range searching), simplices, halfspaces, and spheres/circles.
• Query types: If the list of all objects that intersect the query range must be reported, the problem
is called range reporting, and the query is called a reporting query. Sometimes, only the number
of objects that intersect the range is required. In this case, the problem is called range counting,
and the query is called a counting query. The emptiness query reports whether there is at least
one object that intersects the range. In the semigroup version, a commutative semigroup (S,+)
is specified, each point is assigned a weight from S, and it is required to report the semigroup
sum of the weights of the points that intersect the range.
• Dynamic range searching vs. static range searching: In the static setting the set S is known in
advance. In dynamic setting objects may be inserted or deleted between queries.
• Offline range searching: Both the set of objects and the whole set of queries are known in
advance.
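As a small illustration of a reporting query, the Java sketch below returns the cities whose coordinates fall inside an axis-aligned latitude/longitude rectangle, echoing the city example above; the coordinates are approximate and purely illustrative, and no index structure is used.

import java.util.ArrayList;
import java.util.List;

public class RangeReporting {
    record City(String name, double lat, double lon) {}

    // Orthogonal range reporting: list every city inside the query rectangle.
    static List<City> report(List<City> cities, double latMin, double latMax,
                             double lonMin, double lonMax) {
        List<City> hits = new ArrayList<>();
        for (City c : cities) {
            if (c.lat() >= latMin && c.lat() <= latMax && c.lon() >= lonMin && c.lon() <= lonMax) {
                hits.add(c);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<City> cities = List.of(
            new City("Hoboken", 40.74, -74.03),
            new City("Chicago", 41.88, -87.63),
            new City("Boston", 42.36, -71.06));
        // Report all cities between 40N-43N and 75W-70W.
        System.out.println(report(cities, 40, 43, -75, -70)); // Hoboken and Boston
    }
}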

Research in Similarity Search is dominated by the inherent problems of searching over complex objects.
Such objects cause most known techniques to lose traction over large collections, due to a manifestation
of the so-called Curse of dimensionality, and there are still many unsolved problems. Unfortunately,
in many cases where similarity search is necessary, the objects are inherently complex.

Curse of dimensionality- As the number of features or dimensions grows, the amount of data we need
to generalize accurately grows exponentially.
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data
in high-dimensional spaces that do not occur in low-dimensional settings such as the three-
dimensional physical space of everyday experience.

Let's take an example. Suppose a data set has 10 data points in one dimension, i.e. only one feature. They can be represented on a line with only 10 values, x = 1, 2, 3, ..., 10. If we add one more feature, the same data must be represented in 2 dimensions, increasing the dimension space to 10*10 = 100. If we then add a 3rd feature, the dimension space grows to 10*10*10 = 1000. As the number of dimensions grows, the dimension space grows exponentially.
10^1 = 10

10^2 = 100

10^3 = 1000 and so on...

This exponential growth causes high sparsity in the data set and unnecessarily increases storage space and processing time for the modelling algorithm. Think of an image recognition problem with high-resolution images: 1280 × 720 = 921,600 pixels, i.e. 921,600 dimensions. That is why it is called the Curse of Dimensionality: the value added by each additional dimension is much smaller than the overhead it adds to the algorithm.
The bottom line is that data which can be represented using 10 space units in its one true dimension needs 1000 space units after adding 2 more observed dimensions. The true dimension is the dimension that accurately generalizes the data; observed dimensions are whatever other dimensions we consider in the dataset, which may or may not contribute to generalizing the data accurately.
The most general approach to similarity search relies upon the mathematical notion of metric space,
which allows the construction of efficient index structures in order to achieve scalability in the search
domain.

Metric Search
Metric search is similarity search which takes place within metric spaces (In mathematics, a metric
space is a set together with a metric on the set. The metric is a function that defines a concept
of distance between any two members of the set, which are usually called points). While
the semimetric properties are more or less necessary for any kind of search to be meaningful, the further
property of triangle inequality (In mathematics, the triangle inequality states that for any triangle, the
sum of the lengths of any two sides must be greater than or equal to the length of the remaining side) is
useful for engineering, rather than conceptual, purposes.

In a metric space M with metric d, the triangle inequality is a requirement upon distance:
d(x,z) ≤ d(x,y) + d(y,z)
for all x, y, z in M. That is, the distance from x to z is at most as large as the sum of the distance
from x to y and the distance from y to z.
The triangle inequality is responsible for most of the interesting structure on a metric space, namely,
convergence. This is because the remaining requirements for a metric are rather simplistic in
comparison.

A simple corollary of triangle inequality is that, if any two objects within the space are far apart, then
no third object can be close to both. This observation allows data structures to be built, based on
distances measured within the data collection, which allow subsets of the data to be excluded when a
query is executed. As a simple example, a reference object can be chosen from the data set, and the
remainder of the set divided into two parts based on distance to this object: those close to the reference
object in set A, and those far from the object in set B. If, when the set is later queried, the distance from
the query to the reference object is large, then none of the objects within set A can be very close to the
query; if it is very small, then no object within set B can be close to the query.

Once such situations are quantified and studied, many different metric indexing structures can be
designed, variously suitable for different types of collections. The research domain of metric search can
thus be characterized as the study of pre-processing algorithms over large and relatively static collections
of data which, using the properties of metric spaces, allow efficient similarity search to be performed.
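The exclusion rule just described can be sketched directly. The code below partitions made-up 2-D points around a reference (pivot) object and answers a range query of radius t, using the triangle inequality to skip whichever partition cannot contain an answer; the data, the pivot and the radii are all illustrative.

import java.util.ArrayList;
import java.util.List;

public class PivotPruning {
    static double d(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {2, 0}, {0, 2}, {9, 9}, {10, 8}, {8, 10} };
        double[] pivot = {0, 0};          // reference object chosen from the space
        double r = 4.0;                   // split radius: A = close to the pivot, B = far from it
        List<double[]> A = new ArrayList<>(), B = new ArrayList<>();
        for (double[] x : data) (d(x, pivot) <= r ? A : B).add(x);

        double[] q = {9, 10};             // query object
        double t = 2.0;                   // report everything within distance t of q
        double dqp = d(q, pivot);

        // Triangle inequality: every x in A satisfies d(q, x) >= dqp - r, so the whole of A
        // can be skipped when dqp - r > t; symmetrically, B can be skipped when r - dqp > t.
        List<double[]> result = new ArrayList<>();
        if (dqp - r <= t) for (double[] x : A) if (d(q, x) <= t) result.add(x);
        if (r - dqp <= t) for (double[] x : B) if (d(q, x) <= t) result.add(x);

        for (double[] x : result) System.out.println(x[0] + ", " + x[1]); // (9,9) and (8,10); A was never scanned
    }
}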

Locality-sensitive hashing
A popular approach to similarity search is locality-sensitive hashing (LSH). LSH hashes input items so that similar items map to the same "buckets" in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied in nearest neighbor search on large-scale high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases.
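One common LSH family for real-valued feature vectors hashes a vector to the pattern of signs of its dot products with a few random hyperplanes; vectors separated by a small angle tend to receive the same bits and therefore land in the same bucket. The sketch below is a minimal illustration of that idea with made-up vectors.

import java.util.Random;

public class RandomHyperplaneLsh {
    final double[][] planes;   // one random hyperplane (normal vector) per hash bit

    RandomHyperplaneLsh(int bits, int dim, long seed) {
        Random rnd = new Random(seed);
        planes = new double[bits][dim];
        for (double[] p : planes)
            for (int i = 0; i < dim; i++) p[i] = rnd.nextGaussian();
    }

    // The hash is the sign pattern of the dot products with the random hyperplanes.
    int hash(double[] v) {
        int h = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0;
            for (int i = 0; i < v.length; i++) dot += planes[b][i] * v[i];
            if (dot >= 0) h |= (1 << b);
        }
        return h;
    }

    public static void main(String[] args) {
        RandomHyperplaneLsh lsh = new RandomHyperplaneLsh(8, 3, 42);
        System.out.println(lsh.hash(new double[] {1.0, 0.20, 0.10}));   // two similar vectors ...
        System.out.println(lsh.hash(new double[] {0.9, 0.25, 0.05}));   // ... usually share a bucket
        System.out.println(lsh.hash(new double[] {-1.0, 0.80, -0.30})); // a dissimilar vector usually does not
    }
}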

Audio/Video Analysis principles and searching

The World Wide Web (Web) is an immense repository of multimedia information that includes
combinations of text, image, video, film or audio artifacts. Many museums and repositories of
multimedia information are going online. The hypertext transfer protocol (HTTP) lends itself to the easy
transfer of audio, video, and image formats integrated with textual information.
In general, Web users search for multimedia information as they search for textual information. The
simplest image search algorithm used by information retrieval (IR) systems locates multimedia files by
searching for file extensions and matching the filename to terms in a query. Some Web IR systems may
retrieve on-line documents with embedded multimedia files. The multimedia filename may not match
the query terms, but the Web document may contain text that does. The approach is that multimedia
searching is performed in an identical way to text searching. No additional burden is placed on the
searcher. If the searcher desires a multimedia document, the searcher enters a query and specifies a
multimedia attribute. For example, a user searching for recordings of Jimmy Buffet songs could enter
“Jimmy Buffet songs” or “audio of Jimmy Buffet songs.” This may retrieve lyric sheets of Jimmy
Buffet's songs, rather than the actual audio files. The searcher could also use audio file extensions, such as wav or mp3. The same procedures are utilized for video or image retrieval, using appropriate terms and file extensions for each medium. The disadvantage of this approach is that it places a contextual-knowledge burden on the searcher, who may not be familiar with multimedia formats.

Research in principle starts with some kind of searching and collecting of materials. The search for
relevant materials often relies on previous analysis, annotation and sometimes transcription. There is no
absolute point of origin for searching since almost every search relies on a prior categorisation.
Searching and collecting take very different forms, and the technologies needed vary widely.

1. Searching the spoken word:

Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies
with methods from information retrieval (IR). SCR provides users with access to digitized audio-visual
content with a spoken language component.

In recent years, the phenomenon of “speech media,” media involving the spoken word, has developed
in four important respects.

• First, and perhaps most often noted, is the unprecedented volume of stored digital spoken
content that has accumulated online and in institutional, enterprise and other private contexts.
Speech media collections contain valuable information, but their sheer volume makes this
information useless unless spoken audio can be effectively browsed and searched.
• Second, the form taken by speech media has grown progressively diverse. Most obviously,
speech media includes spoken-word audio collections and collections of video containing
spoken content. However, a speech track can accompany an increasingly broad range of media.
For example, speech annotation can be associated with images captured with smartphones.
Current developments are characterized by dramatic growth in the volume of spoken content
that is spontaneous and is recorded outside of the studio, often in conversational settings.
• Third, the different functions fulfilled by speech media have increased in variety. The spoken
word can be used as a medium for communicating factual information. Examples of this
function range from material that has been scripted and produced explicitly as video, such as
television documentaries, to material produced for a live audience and then recorded, such as
lectures. The spoken word can be used as a historical record. Examples include speech media
that records events directly, such as meetings, as well as speech media that captures events that
are recounted, such as interviews. The spoken word can also be used as a form of entertainment.
The importance of the entertainment function is reflected in creative efforts ranging from
professional film to user-generated video on the Internet.
• Fourth, user attitudes towards speech media and the use of speech media have evolved greatly.
Although privacy concerns dominate, the acceptance of the creation of speech recordings, for
example, of call center conversations, has recently grown. Also, users are becoming increasingly
acquainted with the concept of the spoken word as a basis on which media can be searched and
browsed. The expectation has arisen that access to speech media should be as intuitive, reliable
and comfortable as access to conventional text media.
The convergence of these four developments has served to change the playing field. As a result, the
present time is one of unprecedented potential for innovative new applications for SCR that will bring
benefit to a broad range of users. Search engines and retrieval systems that make use of SCR are better
able to connect users with multimedia items that match their needs for information and content.

The basic technology used for SCR is Automatic Speech Recognition (ASR), which generates text
transcripts from spoken audio. SCR can be considered as the application of Information Retrieval (IR)
techniques to ASR transcripts. The overarching challenges of SCR present themselves differently in
different application domains. This survey takes the position that an SCR system for a particular
application domain will be more effective if careful consideration is given to the integration of ASR and
IR. Undeniably, ASR has made considerable progress in recent years. However, developing raw
technologies and computational power alone will not achieve the aim of making large volumes of speech
media content searchable. Rather, it is necessary to understand the nature of the spoken word, spoken
word collections and the interplay between ASR and IR technologies, in order to achieve this goal.

General Architecture of an SCR System


Although SCR systems are implemented differently depending on the deployment domain and the use
scenario, the underlying architecture consists of a set of conventional components that remain more or
less stable. This architecture is presented schematically in Figure 2.1, in order to present an initial
impression of the technologies “under the hood” of a typical SCR system.

Fig. 2.1 Block diagram depicting an abstraction of a typical spoken content retrieval system.

The Query depicted on the left represents the user input to the system. We emphasize that the query is
not the actual information need of the user, but rather an attempt of the user to express this information
need. Often the query is a highly under-specified representation of the information need, and part of the
goal of the system will be to automatically enhance this specification in order to return useful results to
the user.
The Retrieval System has the function of matching the query with the items in the collection. This
matching takes place using one of the many IR frameworks that have been developed for text-based
applications.
The retrieval system consults the index, here labeled Timed Media Index, which contains features that
represent the items in the collections and, in general, also time-codes indicating the time points within
each item associated with occurrences of these features. An index is a representation of a collection as
indexing features, together with information that associates those features with particular items.

The process of indexing involves generation of an index, and can be defined as follows:
Spoken Content Indexing is the task of generating representations of spoken content for use in a retrieval
system. These representations include indexing features (i.e., terms consisting of words and phrases
derived from the spoken content and also terms describing the spoken content, such as speaker
identities), weights for the indexing terms and also time codes indicating the time points within the
spoken content associated with the indexing terms.
The exact form of the index depends on the domain and the application. For example, for some
applications, speech media is pre-segmented into documents and only entire items are returned to the
user in the results list. In this case, the time code information returned by the speech recognizer may be
discarded, and the index may contain only the information regarding which indexing term is associated
with which document. In other cases, the system might return a time point within an item as a result. In
this case, time code information must be retained in the index. The structure of the index is optimized
for efficient calculation of matches between the query and the speech media item.
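A minimal sketch of such a timed index is given below: each indexing term maps to postings that record the item and the time code at which the term was recognized, so a query can return a jump-in point rather than a whole recording. The item identifiers and time codes are made up.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TimedMediaIndex {
    // A posting: which spoken item the term occurred in, and at what time offset (seconds).
    record Posting(String itemId, double timeSeconds) {}

    private final Map<String, List<Posting>> index = new HashMap<>();

    // Called once per recognized word while indexing an ASR transcript.
    void add(String term, String itemId, double timeSeconds) {
        index.computeIfAbsent(term.toLowerCase(), k -> new ArrayList<>())
             .add(new Posting(itemId, timeSeconds));
    }

    // A query term returns the matching items together with their jump-in points.
    List<Posting> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        TimedMediaIndex idx = new TimedMediaIndex();
        idx.add("multimedia", "lecture-01", 12.4);
        idx.add("multimedia", "lecture-03", 301.9);
        idx.add("retrieval", "lecture-03", 305.2);
        System.out.println(idx.lookup("multimedia"));
    }
}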

Indexing features are generated by the Speech Recognition System that processes the material in the
Speech Collection at indexing time. The major source of features in the index is ASR, that is, the process
that transcribes the spoken word to text. It is also called Speech-To-Text (STT) technology, especially
in contexts such as dialogue systems. The designation STT emphasizes that ASR is essentially the
inverse of Text-To-Speech (TTS), which is also known as speech synthesis.

ASR systems range from isolated word recognition systems used, for example, in command-and-control
applications, to Large Vocabulary Continuous Speech Recognition (LVCSR) systems that transcribe
human speech in unconstrained human-to-human communication. “Large Vocabulary” speech
recognition aims to provide significant coverage of the large and diverse range of word forms used by
humans. “Continuous” speech recognition recognizes words in the natural stream of language, where
they are generally unseparated by pauses or other cues that could signal a word boundary. However,
other sources of information, such as metadata and rich transcripts that include labels indicating who
spoke when, are also important.

As the output of the SCR system, the user receives from the system a list of Ranked Results, a set of
results ordered in terms of their likelihood of potential relevance to the query. A result can take the form
of a spoken content item, a segment of the speech stream whose scope is dynamically customized to the
query (sometimes referred to as a “relevance interval”) or a time point at which the user should start
viewing/listening to the content (a so-called “listen-in point” or “jump- in point”). The choice of the
form of results depends on the domain and on the use scenario. In any case, it is important that the results
list contain proper surrogates, representations of each result that allow the user to make an initial
judgment of whether or not the result is a good match to the information need without having to initiate
playback.

Finally, the SCR system must offer the user a means of Visualization and Playback of the individual
results. Result visualization in a playback interface is necessary for the same reason as surrogates are
important for the results list: users must be able to judge the relevance of speech media results without
listening to or viewing long swaths of audio or video content, a very time-consuming process. Time can
also be saved by providing the user with an intelligent multimedia player, which makes it possible to
jump directly to certain points within a speech media result, for example the beginning of a particular
speaker turn.
The most widely used and highly developed search systems work with text, and so searching spoken
word collections often relies on previous annotation, transcription or content analysis to derive text from,
or associate text with, the spoken word.

Search methods:
(i) Transcript search
Systems supporting the free-text querying of textual transcripts are now ubiquitous and similar systems
exist for searching speech by querying the time-aligned transcripts automatically derived by speech-to-
text systems.
(ii) Browsing via metadata
Metadata (generated manually or automatically) can be added to indices of various types, analogous to
but more flexible than those found for books. Thus, a user might choose to browse only segments
corresponding to a particular speaker or those that have been associated with particular named entities
such as people, places or locations.

2. Searching for Music and Sound

An audio search engine is a web-based search engine which crawls the web for audio content. The
information can consist of web pages, images, audio files, or another type of document. Various
techniques exist for research on these engines.

Types of Search
(i) Audio search from text
Text entered into a search bar by the user is compared to the search engine's database. Matching results
are accompanied by a brief description of the audio file and its characteristics such as sample frequency,
bit rate, type of file, length, duration, or coding type. The user is given the option of downloading the
resulting files.
(ii) Audio search from image
The Query by Example (QBE) system is a searching algorithm that uses content-based image
retrieval (CBIR). Keywords are generated from the analysed image. These keywords are used to search
for audio files in the database. The results of the search are displayed according to the user's preferences regarding the type of file (wav, mp3, aiff…) or other characteristics.
(iii) Audio search from audio

(Figure: the waveform of a sound A, above, and the spectrogram of a sound B, below.)

In audio search from audio, the user must play the audio of a song either with a music player, by singing
or by humming to the computer microphone. Subsequently, a sound pattern, A, is derived from the audio
waveform, and a frequency representation is derived from its Fourier Transform. This pattern will be
matched with a pattern, B, corresponding to the waveform and transform of sound files found in the
database. All those audio files in the database whose patterns are similar to the search pattern will be displayed as search results.

Algorithms used
Audio search has evolved slowly through several basic search formats which exist today and all
use keywords. The keywords for each search can be found in the title of the media, any text attached to
the media and content linked web pages, also defined by authors and users of video hosted resources.
Some search engines can search recorded speech such as podcasts, though this can be difficult if there
is background noise. Rather than applying a text search algorithm after speech-to-text processing is
completed, some engines use a phonetic search algorithm to find results within the spoken word. Others
work by listening to the entire podcast and creating a text transcription.

Applications such as Munax use several independent ranking processes that combine the inverted index with hundreds of search parameters to produce the final ranking for each document.

Shazam, for example, works by analyzing the captured sound and seeking a match based on an acoustic
fingerprint in a database of more than 11 million songs. Shazam identifies songs based on an audio
fingerprint based on a time-frequency graph called a spectrogram. Shazam stores a catalogue of audio
fingerprints in a database. The user tags a song for 10 seconds and the application creates an audio
fingerprint. Once it creates the fingerprint of the audio, Shazam starts the search for matches in the
database. If there is a match, it returns the information to the user; otherwise it returns a "song not
known" dialogue. Shazam can identify prerecorded music being broadcast from any source, such as a
radio, television, cinema or music in a club, provided that the background noise level is not high enough
to prevent an acoustic fingerprint being taken, and that the song is present in the software's database.
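A much-simplified sketch of the offset-voting idea behind this kind of acoustic-fingerprint matching is given below. Real systems derive the hash values from pairs of spectrogram peaks; here the hashes, songs and times are made up, and the clip matches a catalogue song when many of its hashes agree on one constant time offset.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FingerprintMatcher {
    record Entry(String songId, int songTime) {}      // where a hash occurs in a catalogue song
    record QueryHash(long hash, int queryTime) {}     // a hash extracted from the recorded clip

    // Count, per song, how many query hashes agree on a single time offset.
    static String bestMatch(Map<Long, List<Entry>> catalogue, List<QueryHash> clip) {
        Map<String, Integer> votes = new HashMap<>();
        for (QueryHash q : clip) {
            for (Entry e : catalogue.getOrDefault(q.hash(), List.of())) {
                String key = e.songId() + "@" + (e.songTime() - q.queryTime());
                votes.merge(key, 1, Integer::sum);
            }
        }
        // Report the song behind the strongest single offset; otherwise "song not known".
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(e -> e.getKey().split("@")[0])
                .orElse("song not known");
    }

    public static void main(String[] args) {
        Map<Long, List<Entry>> catalogue = Map.of(
            101L, List.of(new Entry("songA", 30)),
            102L, List.of(new Entry("songA", 31), new Entry("songB", 75)),
            103L, List.of(new Entry("songB", 12)));
        // A short clip whose hashes line up with songA at a constant offset of 25 seconds.
        List<QueryHash> clip = List.of(new QueryHash(101L, 5), new QueryHash(102L, 6));
        System.out.println(bestMatch(catalogue, clip));   // songA
    }
}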

Notable Search Engines


(i) Deep audio search
• Picsearch Audio Search has been licensed to search portals since 2006. Picsearch is a search
technology provider who powers image, video and audio search for over 100 major search engines
around the world.

(ii) For smartphones


• SoundHound (previously known as Midomi) is a software and company (both with the same name)
that lets users find results with audio. Its features are both an audio-based artificial
intelligence service and services to find songs and details about them by singing, humming or
recording them.
• Shazam is an app for smartphone or Mac best known for its music identification capabilities. It uses
a built-in microphone to gather a brief sample of the audio being played. It creates an acoustic
fingerprint based on the sample, and compares it against a central database for a match. If it finds a
match, it sends information such as the artist, song title, and album back to the user.
• Doreso identifies a song from humming or singing the melody into a microphone, or by direct input of the name of a song or singer. The app gives information about the song title and its singer, and allows you to purchase the song.
• Munax (defunct) is a company that released the first version of its all-content search engine in 2005. Its PlayAudioVideo multimedia search engine, created in July 2007, was the first true search engine for multimedia, providing search on the web for images, video and audio in the same search engine and allowing users to preview them on the same page.

3. Audio Content Analysis

The objective of Audio Content Analysis (ACA) is the extraction of information from audio signals
such as music recordings stored on digital media. The information to be extracted is usually referred to
as metadata: it is data about (audio) data and can essentially cover any information allowing a
meaningful description or explanation of the raw audio data. The metadata represents (among other things) the musical content of the recording. Nowadays, attempts are being made to automatically extract practically everything from a music recording, including formal, perceptual, musical, and technical metadata. Examples range from tempo and key analysis (ultimately leading to the complete transcription of recordings into a score-like format), through the analysis of artists' performances of specific pieces of music, to approaches to modeling the emotional affect of a human listener.
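
As an illustration, a minimal sketch of extracting two such pieces of musical metadata (tempo and a rough key guess) from a recording, assuming the third-party librosa library is installed; the key estimate here is a naive chroma argmax rather than a full key-detection model:

import numpy as np
import librosa

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def extract_metadata(path):
    y, sr = librosa.load(path, mono=True)                 # decode audio to a mono signal
    tempo, _beats = librosa.beat.beat_track(y=y, sr=sr)   # estimated tempo in BPM
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # 12-bin pitch-class energies
    tonic = PITCH_CLASSES[int(np.argmax(chroma.mean(axis=1)))]
    return {"tempo_bpm": float(tempo), "rough_key_tonic": tonic}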

In addition to the metadata extractable from the signal itself there is also metadata which is neither
implicitly nor explicitly included in the music signal itself but represents additional information on the
signal, such as the year of the composition or recording, the record label, the song title, information on
the artists, etc.

The term audio content analysis is not the only one used for systems analyzing audio signals. Frequently,
the research field is also called Music Information Retrieval (MIR). MIR should be understood as a
more general, broader field of which ACA is a part. In contrast to ACA, MIR also includes the analysis
of symbolic non-audio music formats such as musical scores and files or signals compliant to the so-
called Musical Instrument Digital Interface (MIDI) protocol. Furthermore, MIR may include the
analysis and retrieval of information that is music-related but cannot be (easily) extracted from the audio
signal such as the song lyrics, user ratings, performance instructions in the score, or bibliographical
information such as publisher, publishing date, the work's title, etc. Therefore, the term audio content
analysis seems to be the most accurate term.
In the past, other terms have been used more or less synonymously to the term audio content analysis.
Examples of such synonyms are machine listening and computer audition. Computational Auditory
Scene Analysis (CASA) is closely related to ACA but usually has a strong focus on modeling the human
perception of audio.

Audio content analysis systems can be used on a relatively wide variety of tasks. Obviously, the
automatic generation of metadata is of great use for the retrieval of music signals with specific
characteristics from large databases or the Internet. Here, the manual annotation of metadata by humans
is simply not feasible due to the sheer amount of (audio) data.

Therefore, only computerized tags can be used to find files or excerpts of files with, e.g., a specific
tempo, instrumentation, chord progression, etc. The same information can be used in end consumer
applications such as for the automatic generation of play lists in music players or in automatic music
recommendation systems based on the user's music database or listening habits. Another typical area of
application is music production software. Here, the aim of ACA is, on the one hand, to allow the user to interact with a more "musical" software interface (e.g., by displaying score-like information along with the audio data), thus enabling a more intuitive approach to visualizing and editing the audio data. On the other hand, the software can support the user by giving suggestions on how to combine and
process different audio signals. For instance, software applications for DJs nowadays include technology
allowing the (semi-) automatic alignment of audio loops and complete mixes based on previously
extracted information such as the tempo and key of the signals.

In summary, ACA can help with


• automatic organization of audio content in large databases as well as search and retrieve audio
files with specific characteristics in such databases (including the tasks of song identification
and recommendation),
• new approaches and interfaces to search and retrieval of audio data such as query-by humming
systems,
• new ways of sound visualization, user interaction, and musical processing in music software
such as an audio editor displaying the current score position or an automatically generated
accompaniment,
• intelligent, content-dependent control of audio processing (effect parameters, intelligent cross
fades, time stretching, etc.) and audio coding algorithms, and
• automatic play list generation in media players.

Audio Content and its characteristics


The content or information conveyed by recordings of music is obviously multi-faceted. It originates
from three different sources:
• Score: The term score will be used broadly as a definition of musical ideas. It can refer to any
form of notating music from the basso continuo (a historic way of defining the harmonic
structure) and the classic western score notation to the lead sheet and other forms of notation
used for contemporary and popular music.
Examples of information originating in the score are the melody or hook line, the key and the harmony
progression, rhythmic aspects and specific temporal patterns, the instrumentation, as well as structural
information such as repetitions and phrase boundaries.
• Performance: Music as a performing art requires a performer or group of performers to generate
a unique acoustical rendition of the underlying musical ideas. The performers will use the information provided by the score but may interpret and modify it, and they may also dismiss parts of the contained information or add new information.
Typical performance aspects include the tempo and its variation as well as the microtiming, the
realization of musical dynamics, accents and instantaneous dynamic modulations such as tremolo, the
usage of specific temperaments and expressive intonation and vibrato, and specific playing (e.g.,
bowing) techniques influencing the sound quality.
• Production: The process of recording the performance and the (post-) production process will
impact certain characteristics of the recording. These are mainly the sound quality of the
recording (by microphone positioning, equalization, and by applying effects to the signal) and
the dynamics (by applying manual or automatic gain adjustments). Changes in timing and pitch
may occur as well by editing the recording and applying software for pitch correction.

There are certain characteristics which cannot easily be assigned to a single category; for example, the timbre of a recording can be determined by the instrumentation indicated by the score, by
the specific choice of instruments (e.g., historical instruments, specific guitar amps, etc.), by
specific playing techniques, and by sound processing choices made by the sound engineer or
producer.

ACA systems may in principle cover the extraction of information from all three categories.

In many cases, however, researchers and their systems make no distinction between these categories. The reason is that popular music in the tradition of western music is one of the main targets of the research for several (not least commercial) reasons, and that with popular music a score-like raw representation of musical ideas cannot be distinguished as easily from the performance and production as in "classical" or traditional western music.
From a technical point of view, five general classes can be identified to describe the content of a music recording on a low level:
• statistical or technical signal characteristics derived from the audio data, such as the amplitude distribution,
• timbre or sound quality characteristics,
• intensity-related characteristics such as envelope-, level-, and loudness-related properties,
• tonal characteristics, which include the pitches and pitch relations in the signal, and
• temporal characteristics such as rhythmic and timing properties of the signal.
The basic information clustered in each individual class can be used and combined to gain a deeper knowledge of the music, such as its musical structure, style, performance characteristics, or even the mood or emotional affect it conveys. Moreover, while some music properties are independent of the perceptual context (e.g., key, tempo), other properties depend on the individual listener's musical experience or way of perceiving music. Not only might this experience vary between the multitude of different listeners, but it might also vary with the listener's individual mood and situation.
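
To make the five classes concrete, a minimal sketch that computes one representative low-level feature per class, again assuming the third-party librosa library is available; the specific feature choices are illustrative:

import numpy as np
import librosa

def low_level_features(path):
    y, sr = librosa.load(path, mono=True)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    return {
        # statistical / technical signal characteristic
        "amplitude_std": float(np.std(y)),
        # timbre / sound quality characteristic
        "spectral_centroid": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
        # intensity-related characteristic
        "rms_level": float(np.mean(librosa.feature.rms(y=y))),
        # tonal characteristic (mean chroma vector over time)
        "mean_chroma": librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1).tolist(),
        # temporal characteristic
        "tempo_bpm": float(tempo),
    }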

4. Searching Video and Film

A video search engine is a web-based search engine which crawls the web for video content. Some
video search engines parse externally hosted content while others allow content to be uploaded and
hosted on their own servers. Some engines also allow users to search by video format type and by length
of the clip. The video search results are usually accompanied by a thumbnail view of the video.
Video search engines are computer programs designed to find videos stored on digital devices, either
through Internet servers or in storage units from the same computer. These searches can be made through
audiovisual indexing, which can extract information from audiovisual material and record it as metadata,
which will be tracked by search engines.

The main motivation for these search engines is the ever-increasing creation of audiovisual content and the need to manage it properly. The digitization of audiovisual archives and the establishment of the Internet have led to large quantities of video files stored in big databases, whose retrieval can be very difficult because of the huge volumes of data and the existence of a semantic gap.

Search Criteria:
The search criterion used by each search engine depends on its nature and the purpose of its searches.

(i) Metadata
Metadata is information about the video itself: who its author is, its creation date, its duration, and any other information that can be extracted from and included in the file. On the Internet, metadata is often encoded in XML, which works very well on the web and is readable by people. The information contained in these files is therefore the easiest way to find data of interest to us.
Videos carry two types of metadata: internal metadata that can be embedded in the video file itself, and external metadata from the page where the video is published. In both cases, the metadata should be optimized so that it is indexed well.
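
As an illustration of the XML encoding mentioned above, a minimal sketch using only the Python standard library; the element names and values are invented for this example and do not follow any particular video metadata schema:

import xml.etree.ElementTree as ET

video_xml = """
<video>
  <title>Lecture 12: Optical Character Recognition</title>
  <author>CS 550 Teaching Staff</author>
  <duration unit="seconds">3185</duration>
  <created>2023-04-01</created>
</video>
"""

root = ET.fromstring(video_xml.strip())
metadata = {child.tag: child.text for child in root}   # tag name -> text value
print(metadata["title"], metadata["duration"])
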
Internal metadata
All video formats incorporate their own metadata. The title, description, coding quality or a transcription of the content are possible fields. Programs such as FLV MetaData Injector, Sorenson Squeeze or Castfire can be used to review these data; each has its own utilities and specifications.
Converting from one format to another can lose much of this data, so check that the information in the new format is correct. It is therefore advisable to have the video in multiple formats, so that all search robots will be able to find and index it.

External metadata
In most cases the same mechanisms must be applied as for positioning an image or text content.
• Title and description
They are the most important factors when positioning a video, because they contain most of the necessary information. Titles have to be clearly descriptive, and every word or phrase that is not useful should be removed.
• Filename
It should be descriptive, including keywords that describe the video without the need to see its title or description. Ideally, separate the words with dashes "-".
• Tags
On the page where the video is published, there should be a list of keywords linked with the "rel-tag" microformat. These words will be used by search engines as a basis for organizing information.
• Transcription and subtitles
Although not completely standard, there are two formats that store text with a time component: one for subtitles and another for transcripts, which can also be used for subtitles. The formats are SRT or SUB for subtitles and TTXT for transcripts.

(ii) Speech recognition


Speech recognition consists of transcribing the speech on the audio track of a video, creating a text file. In this way, and with the help of a phrase extractor, one can easily determine whether the video content is of interest. Some search engines, apart from using speech recognition to search for videos, also use it to find the specific point in a multimedia file at which a specific word or phrase is spoken, and so jump directly to that point. Gaudi (Google Audio Indexing), a project developed by Google Labs, uses voice recognition technology to locate the exact moment at which one or more words are spoken within an audio track, allowing the user to go directly to that moment. If the search query matches some videos from YouTube, the positions are indicated by yellow markers, and the user must hover the mouse over them to read the transcribed text.
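
A minimal sketch of the "jump to the spoken word" idea: given word-level timestamps as a speech recognizer might produce them, return the moments at which a query word was spoken. The transcript data here is invented for the example:

from typing import List, Tuple

Transcript = List[Tuple[float, str]]  # (time in seconds, word)

def find_spoken_word(transcript: Transcript, query: str) -> List[float]:
    """Return every timestamp at which the query word occurs."""
    query = query.lower()
    return [t for t, word in transcript if word.lower() == query]

transcript = [(0.0, "welcome"), (0.6, "to"), (0.9, "multimedia"), (1.7, "retrieval")]
print(find_spoken_word(transcript, "multimedia"))  # -> [0.9]
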
(iii) Text recognition
Text recognition can be very useful for recognizing characters that appear in videos, for example in "chyrons" (on-screen captions). As with speech recognizers, there are search engines that allow (through character recognition) playing a video from a particular point.
TalkMiner, an example of searching for specific fragments of video by text recognition, analyzes each video once per second looking for the identifying signs of a slide, such as its shape and static nature, captures the image of the slide and uses Optical Character Recognition (OCR) to detect the words on the slides. These words are then indexed in the TalkMiner search engine, which currently offers users more than 20,000 videos from institutions such as Stanford University, the University of California at Berkeley, and TED.
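
A minimal sketch of this frame-sampling-plus-OCR idea (not TalkMiner's actual implementation), assuming the third-party opencv-python and pytesseract packages, and the Tesseract OCR engine, are installed:

import cv2
import pytesseract

def index_video_text(path):
    capture = cv2.VideoCapture(path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0           # fall back to a nominal frame rate
    words_by_second = {}
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % int(round(fps)) == 0:             # roughly one frame per second
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray)        # OCR on the sampled frame
            words_by_second[int(frame_index / fps)] = text.split()
        frame_index += 1
    capture.release()
    return words_by_second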

(iv) Frame analysis


Through visual descriptors we can analyze the frames of a video and extract information that can be stored as metadata. Descriptions are generated automatically and can describe different aspects of the frames, such as color, texture, shape, motion, and situation.
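
A minimal sketch of one simple visual descriptor from this list, a per-frame color histogram that could be stored as metadata, assuming opencv-python is installed:

import cv2

def color_descriptor(frame, bins=(8, 8, 8)):
    """Return a normalized 3-D color histogram of a single video frame."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()             # flat feature vector for indexing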

Ranking criteria:
The usefulness of a search engine depends on the relevance of the result set it returns. While there may be millions of videos that include a particular word or phrase, some videos may be more relevant, popular or authoritative than others. This ordering has a lot to do with search engine optimization.
Most search engines use different methods to rank the results and present the best videos first. However, most programs allow sorting the results by several criteria.
• Order by relevance
This criterion is the most ambiguous and least objective, but it is sometimes the closest to what we want; it depends entirely on the search engine and the algorithm its owner has chosen. That is why it has always been debated, and now that search results are so ingrained in our society it is debated even more. This type of ranking often depends on the number of times the searched word appears, the number of viewings of the video, the number of pages that link to the content, and the ratings given by users who have seen it.
• Order by date of upload
This criterion is based entirely on the timeline. Results can be sorted according to their age in the repository.
• Order by number of views
It can give us an idea of the popularity of each video.
• Order by length
This is the duration of the video and can give an idea of what kind of video it is.
• Order by user rating
It is common practice in repositories to let users rate the videos, so that content of high quality and relevance ranks high on the list of results and gains visibility. This practice is closely related to virtual communities.
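
The ordering criteria above map directly onto simple sort keys. A minimal sketch in Python, using hypothetical video records whose field names are illustrative and do not belong to any particular search engine:

videos = [
    {"title": "Intro to OCR", "uploaded": "2023-01-10", "views": 5400, "rating": 4.2, "length_s": 610},
    {"title": "SMIL basics",  "uploaded": "2023-03-02", "views": 1200, "rating": 4.8, "length_s": 300},
]

by_date   = sorted(videos, key=lambda v: v["uploaded"], reverse=True)  # newest first
by_views  = sorted(videos, key=lambda v: v["views"],    reverse=True)  # most popular first
by_rating = sorted(videos, key=lambda v: v["rating"],   reverse=True)  # best rated first
by_length = sorted(videos, key=lambda v: v["length_s"])                # shortest first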

Interfaces:
We can distinguish two basic types of interfaces: some are web pages hosted on servers, accessed via the Internet and searched across the network, while the others are computer programs that search within a private network.

(i) Internet
Within Internet interfaces we can find repositories that host video files and incorporate a search engine that searches only their own databases, and video searchers without a repository that search external sources.
• Repositories with video searcher
These sites host video files stored on their own servers and usually have an integrated search engine that searches through the videos uploaded by their users. Among the first, or at least the most famous, web repositories are the portals Vimeo, Dailymotion and YouTube.
Their searches are often based on reading the metadata, tags, titles and descriptions that users assign to their videos. The ordering criteria for the results of these searches are usually selectable among the file upload date, the number of viewings, or what they call relevance. Still, sorting criteria are nowadays the main weapon of these websites, because the positioning of videos is important in terms of promotion.
• Video searchers without repository
These are websites specialized in searching for videos across the network or in certain pre-selected repositories. They work using web spiders that inspect the network in an automated way to create copies of the visited websites, which are then indexed by the search engine so that it can provide faster searches.

(ii) Private network

(Figure: functioning scheme.)
Sometimes a search engine only searches audiovisual files stored within a computer or, as happens in television, on a private server that users access through a local area network. These searchers are usually software or rich Internet applications with very specific search options for maximum speed and efficiency when presenting the results. They are typically used for large databases and are therefore highly focused on satisfying the needs of television companies.

An example of this type of software is the Digition Suite. The strongest point of this particular suite is perhaps that it integrates the entire process of creating, indexing, storing, searching, editing and retrieving content. Once an audiovisual item has been digitized, it is indexed with techniques of different levels, depending on the importance of the content, and then stored. When users want to retrieve a particular file, they fill in search fields such as the program title, broadcast date, characters who appear, or the name of the producer, and the robot starts the search. Once the results appear, arranged according to the user's preferences, the user can play low-quality versions of the videos in order to work as quickly as possible. When the desired content is found, it is downloaded in good definition, edited and played out.

Video search Engines:


(i) Agnostic search
Search that is not affected by where the video is hosted; results are agnostic, no matter where the video is located:
• blinkx was launched in 2004 and uses speech recognition and visual analysis to process spidered
video rather than rely on metadata alone. blinkx claims to have the largest archive of video on the
web and puts its collection at around 26,000,000 hours of content.
• CastTV is a Web-wide video search engine that was founded in 2006 and funded by Draper Fisher
Jurvetson, Ron Conway, and Marc Andreessen.
• Munax released their first version all-content search engine in 2005 and powers both nationwide
and worldwide search engines with video search.
• Picsearch Video Search has been licensed to search portals since 2006. Picsearch is a search
technology provider who powers image, video and audio search for over 100 major search engines
around the world.
(ii) Non-agnostic search
Search results are modified, or suspect, because video hosted by the provider itself is given preferential treatment in the search results:
• AOL Video offers a video search engine that can be used to find video located on popular video
destinations across the web. In December 2005, AOL acquired Truveo Video Search.
• Bing video search is a search engine powered by Bing and also used by Yahoo! Video Search.
• Google Videos is a video search engine from Google.
• Tencent Video offers video search from Tencent.

5. Video Content Analysis

Video content analysis (also video content analytics, VCA) is the capability of automatically
analyzing video to detect and determine temporal and spatial events.
This technical capability is used in a wide range of domains including entertainment, video
retrieval and video browsing, health-care, retail, automotive, transport, home automation, flame and
smoke detection, safety and security. The algorithms can be implemented as software on general
purpose machines, or as hardware in specialized video processing units.
Many different functionalities can be implemented in VCA. Video Motion Detection is one of the
simpler forms where motion is detected with regard to a fixed background scene. More advanced
functionalities include video tracking and egomotion estimation.
Based on the internal representation that VCA generates in the machine, it is possible to build other
functionalities, such as identification, behavior analysis or other forms of situation awareness.
VCA relies on good input video, so it is often combined with video enhancement technologies such
as video denoising, image stabilization, unsharp masking and super-resolution.
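
A minimal sketch of the Video Motion Detection functionality mentioned above, using background subtraction; it assumes the third-party opencv-python package is installed, and the subtractor choice and area threshold are illustrative:

import cv2

def detect_motion(path, min_area=500):
    capture = cv2.VideoCapture(path)
    subtractor = cv2.createBackgroundSubtractorMOG2()      # learned background model
    motion_frames = []
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        mask = subtractor.apply(frame)                      # foreground (moving) pixels
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if any(cv2.contourArea(c) > min_area for c in contours):
            motion_frames.append(frame_index)               # frame contains relevant motion
        frame_index += 1
    capture.release()
    return motion_frames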

Functionalities:
Several articles provide an overview of the modules involved in the development of video analytic applications.[4][5] The following is a list of known functionalities with a short description of each.
• Dynamic masking: Blocking a part of the video signal based on the signal itself, for example because of privacy concerns.
• Flame and smoke detection: IP cameras with intelligent video surveillance technology can be used to detect flame and smoke in 15–20 seconds or even less because of the built-in DSP chip. The chip processes algorithms that analyze the captured video for flame and smoke characteristics such as color chrominance, flickering ratio, shape, pattern and moving direction.
• Egomotion estimation: Egomotion estimation is used to determine the location of a camera by analyzing its output signal.
• Motion detection: Motion detection is used to determine the presence of relevant motion in the observed scene.
• Shape recognition: Shape recognition is used to recognize shapes in the input video, for example circles or squares. This functionality is typically used in more advanced functionalities such as object detection.
• Object detection: Object detection is used to determine the presence of a type of object or entity, for example a person or car. Other examples include fire and smoke detection.
• Recognition: Face recognition and Automatic Number Plate Recognition are used to recognize, and therefore possibly identify, persons or cars.
• Style detection: Style detection is used in settings where the video signal has been produced, for example for television broadcast. It detects the style of the production process.
• Tamper detection: Tamper detection is used to determine whether the camera or output signal is tampered with.
• Video tracking: Video tracking is used to determine the location of persons or objects in the video signal, possibly with regard to an external reference grid.
• Video error level analysis (VELA): Video scene content tamper analysis using free software.
• Object co-segmentation: Joint object discovery, classification and segmentation of targets in one or multiple related video sequences.

Commercial Applications:
VCA is a relatively new technology, with numerous companies releasing VCA-enhanced products in the mid-2000s. While there are many applications, the track records of different VCA solutions differ widely. Functionalities such as motion detection, people counting and gun detection are available as commercial off-the-shelf products and are believed to have a decent track record (for example, even freeware such as dsprobotics Flowstone can handle movement and color analysis). In response to the COVID-19 pandemic, many software manufacturers have introduced new public health analytics such as face mask detection or social distancing tracking.

In many domains VCA is implemented on CCTV systems, either distributed on the cameras (at-the-edge) or centralized on dedicated processing systems. Video Analytics and Smart CCTV are commercial terms for VCA in the security domain. In the UK, the British Security Industry Association (BSIA) has developed an introduction guide for VCA in the security domain. In addition to video analytics, and to complement it, audio analytics can also be used.

Video management software manufacturers are constantly expanding the range of available video analytics modules. With newer suspect-tracking technology, it is possible to track all of a subject's movements easily: where they came from, and when, where, and how they moved. Within a particular surveillance system, the indexing technology is able to locate people with similar features who were within the cameras' viewpoints during a specific period of time. Usually, the system finds many different people with similar features and presents them in the form of snapshots. The operator only needs to click on the images and subjects that need to be tracked. Within a minute or so, it is possible to track all the movements of a particular person, and even to create a step-by-step video of those movements.

Kinect is an add-on peripheral for the Xbox 360 gaming console that uses VCA for part of the user input.
In the retail industry, VCA is used to track shoppers inside a store. In this way, a heatmap of the store can be obtained, which is useful for store design and marketing optimization. Other applications include measuring dwell time when shoppers look at a product and detecting when items are removed or left behind.

The quality of VCA in the commercial setting is difficult to determine. It depends on many variables
such as use case, implementation, system configuration and computing platform. Typical methods to
get an objective idea of the quality in commercial settings include independent benchmarking and
designated test locations.
VCA has been used for crowd management purposes, notably at The O2 Arena in London and The
London Eye.
