
VikeshKaran. P (16PGM54)
What is text analytics
 Text analytics is the process of converting
unstructured text into meaningful data for analysis:
measuring customer opinions, product reviews, and
feedback, and providing search, sentiment analysis,
and entity modeling to support fact-based decision
making.
 Text analysis is about deriving high-quality
structured data from unstructured text. Another
name for text analytics is text mining.
IE – information extraction
 Information extraction (IE): Identification and
extraction of relevant facts and relationships from
unstructured text; the process of making structured
data from unstructured and semi-structured text.
NAMED ENTITY RECOGNITION
RELATION EXTRACTION
How IE works
 Feature Selection takes raw text as input and identifies low-
level entities called features. Features can be things like
capitalized words, sequences of numbers, or the names of
Fortune 500 companies.
 Identification uses features to build more complex entities and
the relationships among them, including sentiment and events.
For example, a common first name followed by a capitalized
word might resolve into a person’s full name.
 Resolution involves cleaning up ambiguities that arise in the
output of the “Identification” step (that is, the entities,
relationships, events, and sentiment identified in the text). For
example, a document may use several different strings – first
name, full name, a pronoun – to identify the same person.
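The three stages above can be sketched in a few lines of Python. This is only an illustrative toy, not how SystemT implements them: the function names, the `FIRST_NAMES` set, and the sample sentence are all assumptions made for the example.

```python
import re

# Hypothetical list of common first names (assumption for this sketch).
FIRST_NAMES = {"Alice", "Bob"}

def select_features(text):
    """Feature selection: tag low-level features (capitalized words, numbers)."""
    features = []
    for m in re.finditer(r"[A-Z][a-z]+|\d+", text):
        kind = "number" if m.group().isdigit() else "capitalized"
        features.append((kind, m.group(), m.start(), m.end()))
    return features

def identify_persons(text, features):
    """Identification: a known first name followed by a capitalized word."""
    caps = [f for f in features if f[0] == "capitalized"]
    persons = []
    for first, second in zip(caps, caps[1:]):
        gap = text[first[3]:second[2]]  # characters between the two words
        if first[1] in FIRST_NAMES and gap == " ":
            persons.append(text[first[2]:second[3]])
    return persons

def resolve(persons):
    """Resolution: map a bare first name to the full name it refers to."""
    return {full.split()[0]: full for full in persons}

text = "Alice Smith joined in 2019. Alice now leads 12 people."
features = select_features(text)
persons = identify_persons(text, features)
aliases = resolve(persons)
```

Here the later mention "Alice" resolves to the same person as "Alice Smith"; a real resolver would also handle pronouns and conflicting candidates.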
Flow chart of how IE works
What is the SystemT web tool
 The SystemT web tool is a drag-and-drop graphical
interface, available as part of IBM BigInsights.
 It aids rapid iterative development, allowing users to develop
extractors, run them, immediately view results, and refine the
extractors.
 It offers the benefit of text analytics, without requiring users to
write code.
 It contains a rich library of pre-built extractors (general,
domain-specific, and task-specific).
BASIC VISUAL CONSTRUCTS FOR CREATING
EXTRACTORS USING THE SYSTEMT WEB TOOL

 Atomic constructs
 Composite constructs
 Output refinement constructs
Information extraction with AQL
 In SystemT, extraction programs are expressed in a
language called Annotation Query Language (AQL). In
the SystemT web tool, the visual specification of an
extractor is in fact automatically translated into an
AQL program.
 AQL is a declarative language: the developer declares the
semantics of the extractor in AQL in a logical way, without
specifying how the AQL program should be executed. The
SystemT Optimizer compiles the AQL program into
a compiled execution plan. This compiled plan is
executed on an input document to output the extraction
results.
Advantages of the declarative AQL
language
 It separates the extractor semantics from the implementation.
This means that the AQL developer must think only about
"what" to extract, enabling an Optimizer to automatically
determine "how" to implement it efficiently. Hence, you get
better runtime performance through optimization.
 It implements the full power of the SystemT algebra, which
enables rich, clean rule semantics. This means that you can
program extractors with higher quality of results compared to
another language that has a smaller number of operators, or
whose semantics are intertwined with the implementation,
leading to a less rich set of core operators.
 It uses syntax similar to the SQL language used to query
relational databases, which makes it easier to learn the language.
Flow Chart
THE BASIC AQL DATA MODEL
 Span: A contiguous region of characters in a text
object
 Integer: A 32-bit signed integer
 Float: A single-precision floating-point number
 Text: A Unicode string with additional metadata
to indicate the tuple to which the string belongs
 List: The built-in list type, representing a bag of
values of the same scalar type such as Span,
Integer, Float, or Text
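To make the Span type concrete, here is a minimal Python sketch of a span as a pair of character offsets into a text object. The class, its field names, and the half-open `[begin, end)` convention are assumptions for illustration; they are not the SystemT implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    text: str   # the text object the span points into
    begin: int  # offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)

    def covered_text(self) -> str:
        """Return the contiguous region of characters the span covers."""
        return self.text[self.begin:self.end]

    def overlaps(self, other: "Span") -> bool:
        """True if the two spans share at least one character."""
        return self.begin < other.end and other.begin < self.end

doc = "Call 555-1234 now"
phone = Span(doc, 5, 13)
prefix = Span(doc, 5, 8)
tail = Span(doc, 14, 17)
```

Because a span stores offsets rather than a copy of the string, two spans over the same text can be compared for overlap or containment, which is what consolidation relies on.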
Extract & Select Statement
 Extract statement
An Extract statement is used to perform extraction
based on constructs such as regular expressions,
dictionaries, and sequence patterns.
An Extract statement may include a Having clause to
specify predicates, and/or a Consolidate clause to
specify how to remove overlapping matches.
 Select statement
A Select statement may include a Where clause to
specify predicates, a Consolidate clause to specify how
to remove overlapping matches, and other clauses
such as Group By, Order By, and Limit.
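The roles of extraction, a predicate, and consolidation can be illustrated with a short Python sketch. It is a conceptual analogue only and not AQL: `extract_regex`, its `having` parameter, and `consolidate_contained` are hypothetical names, and the consolidation rule shown (drop any match contained in a larger one) is just one possible policy.

```python
import re

def extract_regex(pattern, text, having=lambda s: True):
    """Extract (begin, end) matches of a regex, keeping those that pass a predicate."""
    matches = [(m.start(), m.end()) for m in re.finditer(pattern, text)]
    return [m for m in matches if having(text[m[0]:m[1]])]

def consolidate_contained(spans):
    """Drop any (begin, end) span that lies inside a different, larger span."""
    unique = sorted(set(spans))
    kept = []
    for s in unique:
        contained = any(
            o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique
        )
        if not contained:
            kept.append(s)
    return kept

# Predicate plays the role of a Having/Where filter: keep numbers longer than 2 digits.
long_numbers = extract_regex(r"\d+", "a 12 b 3456", having=lambda s: len(s) > 2)

# Overlapping candidates, e.g. "Alice", "Alice Smith", "Smith", and an unrelated match.
candidates = [(0, 5), (0, 11), (6, 11), (20, 24)]
merged = consolidate_contained(candidates)
```

After consolidation only the longest match in each overlapping group survives, so downstream rules see "Alice Smith" once rather than three partial matches.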
USER DEFINED FUNCTIONS (UDFS)
 User Defined Functions are custom functions that
bridge the gap between your application requirements
and SystemT’s built-in capabilities. They define
capabilities not yet supported by the built-in
constructs. For example, you can invoke a trained SVM
classifier in AQL by defining your own UDF.
Scalar and table UDFs
 Scalar UDFs:
 Input can be scalar or table value.
 Return value is of scalar type, such as boolean, integer, string,
text, span, or scalar list.
 Table UDFs:
 Input can be scalar or table value.
 Return value is a table value in the form of one or more
tuples.
Steps for using UDFs
 Implement the UDF in Java.
 Declare the UDF in the AQL code using
the Create Function statement.
 Use the UDF in AQL.
 If it is a scalar UDF, you can use it in a
similar way as a built-in scalar function.
 If it is a table UDF, you can use it in a
similar way as a regular view.
