
Organizing Information: Metadata and Controlled Vocabularies

Ray R. Larson
University of California, Berkeley
School of Information Management and Systems
12/11/98 SIMS Affiliates Meeting

Overview: Metadata and Controlled Vocabularies


Definitions
Origins and Uses of Controlled Vocabularies for Information Retrieval
Metadata
Types of Indexing Languages, Thesauri and Classification Systems
Process of Design and Development of Thesauri

Information Organization and Retrieval


To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.
Knowledge is knowing, familiarity gained by experience; a person's range of information; a theoretical or practical understanding of; the sum of what is known.
To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.
Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.
The Oxford English Dictionary, cf. Rowley


Information Properties
Information can be communicated electronically
    Broadcasting
    Networking

Information can be easily duplicated and shared
    Problems of Ownership
    Problems of Control
Adapted from Silicon Dreams by Robert W. Lucky


Information Hierarchy
Data
The raw material of information

Information
Data organized and presented by someone

Knowledge
Information read, heard or seen and understood

Wisdom
Distilled and integrated knowledge and understanding

Information Hierarchy
Wisdom
Knowledge
Information
Data


Information Life Cycle


[Diagram: the information life cycle. Labels recovered from the figure:]
Creation; Active; Semi-Active; Inactive
Utilization; Disposition; Retention/Mining; Discard
Authoring; Modifying; Using; Creating; Organizing; Indexing; Storing; Retrieval; Distribution; Networking; Accessing; Filtering; Searching

Information Life Cycle


Authoring/Modifying
Organizing/Indexing
Storing/Retrieving
Distribution/Networking
Accessing/Filtering
Using/Creating

Origins
Very early history of content representation
    Sumerian tokens and envelopes
    Alexandria - pinakes
    Indices

Origins
Biblical Indexes and Concordances (Hugo de St. Caro & 500 monks, 1247 -- KWIC)
Journal Indexes
Information Explosion following WWII
    Cranfield Studies of indexing languages and information retrieval
    Development of bibliographic databases
        Index Medicus -- production and MEDLARS searching

Origins
Communication theory revisited
Problems with transmission of meaning
[Diagram: Source -> Message -> Encoding -> Channel (Noise) -> Decoding -> Message -> Destination]
[Diagram: Source -> Message -> Encoding (writing/indexing) -> Storage -> Decoding (retrieval/reading) -> Message -> Destination]

Structure of an IR System
[Diagram: Information Storage and Retrieval System]
Search line: Interest profiles & queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store 1: Profiles / Search requests
Storage line: Documents & data -> Indexing (descriptive and subject) -> Storage of documents -> Store 2: Document representations
Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
Comparison / Matching -> Potentially relevant documents

Adapted from Soergel, p. 19
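
The comparison/matching step in the diagram above can be illustrated with a minimal sketch. It assumes that both stores hold plain sets of descriptors and that matching simply counts shared descriptors; the store contents, the function name `match`, and the ranking rule are illustrative assumptions, not part of Soergel's model.

```python
def match(profile, document_store):
    """Rank document ids by how many descriptors they share with a profile."""
    wanted = set(profile)
    scored = []
    for doc_id, descriptors in document_store.items():
        overlap = len(wanted & set(descriptors))
        if overlap:  # documents sharing no descriptor are not retrieved
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Store 2: document representations produced by indexing (hypothetical data).
store2 = {
    "doc1": {"thesauri", "indexing"},
    "doc2": {"metadata", "cataloging"},
    "doc3": {"indexing", "classification", "thesauri"},
}

# Store 1: search requests already expressed as descriptors (hypothetical data).
store1 = {"profile_a": {"thesauri", "indexing"}}

potentially_relevant = match(store1["profile_a"], store2)
```

Because the query and the documents are both expressed in the same indexing language, matching reduces to a comparison of descriptor sets.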


Metadata
Data about data
Information about Information
Description of information structure and contents for individual information items, or entire collections of information

Types of Metadata
Element names.
Element description.
Element representation.
Element coding.
Element semantics.
Element classification.

Metadata Systems
AACR II/MARC
Dublin Core
RDF (Resource Description Framework)
SGML/XML
DBMS Metadata
Controlled vocabularies

Goals of Descriptive Cataloging (AACR II/MARC)
1. To enable a person to find a document of which
    the author, or
    the title, or
    the subject is known
2. To show what a library has
    by a given author
    on a given subject (and related subjects)
    in a given kind (or form) of literature.
3. To assist in the choice of a document
    as to its edition (bibliographically)
    as to its character (literary or topical)
Charles A. Cutter, 1876

Dublin Core Elements


Title
Creator
Subject
Description
Publisher
Other Contributors
Date
Resource Type
Format
Resource Identifier
Source
Language
Relation
Coverage
Rights Management
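
As a rough illustration of how the element set can be used, the sketch below describes this talk with a few of the elements and checks a record against the list. The element names are taken from the slide; the helper `unknown_elements`, the dictionary layout, and the Subject values are assumptions made for the example, not a standard encoding.

```python
# The fifteen element names as listed above.
DUBLIN_CORE_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Other Contributors", "Date", "Resource Type", "Format",
    "Resource Identifier", "Source", "Language", "Relation",
    "Coverage", "Rights Management",
]


def unknown_elements(record):
    """Return the keys of `record` that are not Dublin Core element names."""
    return sorted(set(record) - set(DUBLIN_CORE_ELEMENTS))


# A partial description of this talk; the Subject values are illustrative.
record = {
    "Title": "Organizing Information: Metadata and Controlled Vocabularies",
    "Creator": "Ray R. Larson",
    "Date": "12/11/98",
    "Subject": ["Metadata", "Controlled vocabularies"],
}

problems = unknown_elements(record)
```

In this sketch every element is optional, so a record may use any subset of the names.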


RDF (W3C)
A model for representing named properties and property values
    Resources (the things described)
    Properties (aspects, attributes, characteristics of resources)
    Statements (Resource + Property + Value of Property for the Resource)

Expressed in XML
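
The resource/property/value model can be sketched as a list of three-part statements. This is only an in-memory illustration under stated assumptions: the identifier `http://example.org/talks/metadata` is hypothetical, the property names are borrowed from the Dublin Core list, and no attempt is made to reproduce the XML syntax.

```python
from collections import namedtuple

# One statement = resource + property + value of that property for the resource.
Statement = namedtuple("Statement", ["resource", "prop", "value"])

TALK = "http://example.org/talks/metadata"  # hypothetical resource identifier

statements = [
    Statement(TALK, "Title", "Organizing Information: Metadata and Controlled Vocabularies"),
    Statement(TALK, "Creator", "Ray R. Larson"),
]


def values_of(statements, resource, prop):
    """Return every value stated for the given property of the given resource."""
    return [s.value for s in statements if s.resource == resource and s.prop == prop]


creators = values_of(statements, TALK, "Creator")
```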

SGML & XML


What is SGML/XML?
Document Type Definitions
Document Markup
Sources and Resources

Databases & Metadata


Particularly in the relational model, metadata is part of the database itself, providing information about the structure and contents of the database:
    What relations (tables) are in the DB
    Relation (table) attributes (domains)
    Attribute representation and storage
    Other information (indexes, etc.)
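
A small sketch of metadata stored inside the database, using SQLite through Python's standard `sqlite3` module. The `document` table and `idx_author` index are hypothetical; `sqlite_master` and `PRAGMA table_info` are SQLite's own catalog mechanisms, and other DBMSs expose similar information through different system tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.execute("CREATE INDEX idx_author ON document (author)")

# What relations (tables) are in the database?
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Attributes of one relation and their declared types.
# Each row of table_info is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(document)")]

# Other information: indexes defined on the relation.
indexes = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'index' AND tbl_name = 'document' ORDER BY name")]
```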

Controlled Vocabularies
Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.


Controlled Vocabularies
Names and name authorities
Design of controlled vocabularies for subject access -- Thesaurus design

Names
Cutter's (1876) objectives of bibliographic description:
    To enable a person to find a document of which the author is known.
    To show what the library has by a given author.

The first serves access. The second serves collocation.

Problems with Names


How many names should be associated with a document?
Which of these should be the main entry?
What form should each of the names take?
What references should be made from other possible forms of names that haven't been used?

The problem
Proliferation of the forms of names
    Different names for the same person
    Different people with the same names

Examples
    from Books in Print (semi-controlled but not consistent)
    ERIC author index (not controlled)

Rules for description


AACR II and other sets of descriptive cataloging rules provide guidelines for:
    Determining the number of name entries
    Choosing a main entry
    Deciding on the form of name to be used
    Deciding when to make references

Authority control
Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established headings) based on some set of rules.
If you have rules, why do you need to keep track of all of the headings?

Conditions of Authorship?
Single person or single corporate entity
Unknown or anonymous authors
Shared responsibility
Collections or editorially assembled works
Works of mixed responsibility (e.g. translations)
Related Works

Added Entries
Personal names
    Collaborators
    Editors, compilers, writers
    Translators (in some cases)
    Illustrators (in some cases)
    Other persons associated with the work (such as the honoree in a Festschrift).

Corporate Names
    Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.

Choice of Name
AACR II says that the predominant form of the name used in a particular author's writings should be chosen as the form of name. References should be made from the other forms of the name.

When names appear in multiple forms, one form needs to be chosen. Criteria for choice are:
    Fullness (e.g., full names vs. initials only)
    Language of the name
    Spelling (choose predominant form)

Form of the Name

Entry element:
    John Smith or Smith, John?
    Mao Zedong or Zedong, Mao? (Mao Tse Tung?)

Name Authority Files


ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80 RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-21-91
Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973

Different names for the same person


Name Authority Files


ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91 RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)

Name authority files


ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81 RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 06-06-91
Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

Different people writing with the same name
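
The two problems illustrated by the authority records above (one person with many names, one name used by several people) can be sketched with a simple lookup table. The headings are transcribed from the 100 and 400 fields above with subfield codes written out; the record layout, the date-stripping rule, and the function names are assumptions for illustration and do not reflect how an actual authority system is implemented.

```python
import re
from collections import defaultdict

# Simplified authority records: established heading plus see-from references.
records = [
    {"heading": "Creasey, John",
     "see_from": ["Cooke, M. E.", "Fecamps, Elise", "Wilde, Jimmy"]},
    {"heading": "Marric, J. J., 1908-1973", "see_from": []},
    {"heading": "Butler, William Vivian, 1927-",
     "see_from": ["Butler, W. V. (William Vivian), 1927-", "Marric, J. J., 1927-"]},
]

# Trailing life dates such as ", 1908-1973" or ", 1927-".
DATES = re.compile(r",\s*\d{4}-(\d{4})?$")


def without_dates(name):
    """Drop trailing life dates so a searcher can look up the bare name."""
    return DATES.sub("", name)


def build_index(records):
    """Map every name form (without dates) to the established headings it leads to."""
    index = defaultdict(list)
    for record in records:
        for form in [record["heading"]] + record["see_from"]:
            key = without_dates(form)
            if record["heading"] not in index[key]:
                index[key].append(record["heading"])
    return index


def lookup(name, index):
    """Return the established headings reachable from a name form."""
    return list(index.get(without_dates(name), []))


index = build_index(records)
same_person = lookup("Wilde, Jimmy", index)
shared_name = lookup("Marric, J. J.", index)
```

A lookup that returns more than one heading signals a name shared by different people, which is where the dates in the headings do their disambiguating work.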

Controlled Vocabularies for Information Access


"The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all." (W.H. Auden)
Similarly, there are too many ways of expressing or explaining the topic of a document.
A controlled vocabulary comprises a set of rules for topic identification and indexing, plus a THESAURUS, which consists of a lead-in vocabulary and a limited and selective Indexing Language, sometimes with special coding or structures.


Uses of Controlled Vocabularies


Library Subject Headings, Classification and Authority Files.
Commercial Journal Indexing Services and databases
Yahoo, and other Web classification schemes
Online and Manual Systems within organizations
    SunSolve
    MacArthur

Types of Indexing Languages


Uncontrolled Keyword Indexing
Indexing Languages
    Controlled, but not structured
Thesauri
    Controlled and Structured
Classification Systems
    Controlled, Structured, and Coded
Faceted Classification Systems

Indexing Languages
An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.
An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms.

Indexing Languages
Library of Congress Subject Headings
Yellow Pages Topics
Wilson Indexes (Readers' Guide)

Thesauri
A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
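
One way to picture these links is as a table keyed by preferred term. The sketch below is a minimal in-memory representation under stated assumptions: the terms, the UF/BT/NT/RT field names (used for, broader, narrower, related), and the two helper functions are illustrative and are not drawn from any published thesaurus.

```python
# Each descriptor lists its non-preferred synonyms (UF) and its
# broader (BT), narrower (NT), and related (RT) terms. Hypothetical content.
thesaurus = {
    "Indexing languages": {"UF": ["Index languages"], "BT": [],
                           "NT": ["Thesauri", "Classification systems"],
                           "RT": ["Indexing"]},
    "Thesauri": {"UF": [], "BT": ["Indexing languages"], "NT": [],
                 "RT": ["Vocabulary control"]},
    "Classification systems": {"UF": [], "BT": ["Indexing languages"],
                               "NT": ["Faceted classification systems"], "RT": []},
    "Faceted classification systems": {"UF": [], "BT": ["Classification systems"],
                                       "NT": [], "RT": []},
}


def narrower_closure(term, thesaurus):
    """Return every term below `term` in the hierarchy, nearest levels first."""
    found = []
    pending = list(thesaurus.get(term, {}).get("NT", []))
    while pending:
        current = pending.pop(0)
        if current not in found:
            found.append(current)
            pending.extend(thesaurus.get(current, {}).get("NT", []))
    return found


def missing_reciprocals(thesaurus):
    """List (broader, narrower) pairs whose NT link has no matching BT link."""
    problems = []
    for term, entry in thesaurus.items():
        for narrower in entry["NT"]:
            if term not in thesaurus.get(narrower, {}).get("BT", []):
                problems.append((term, narrower))
    return problems


expanded = narrower_closure("Indexing languages", thesaurus)
```

Following narrower-term links in this way is one common use of the structure: a search on a broad descriptor can be expanded to everything beneath it.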


Thesauri (cont.)
National and International Standards for Thesauri
    ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
    ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval
    ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
    ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri

Thesauri (cont.)
Examples:
    The ERIC Thesaurus of Descriptors
    The Art and Architecture Thesaurus
    The Medical Subject Headings (MeSH) of the National Library of Medicine

Why develop a thesaurus?


To provide a conceptual structure or space for a body of information
To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity
To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material).

Why develop a thesaurus?


To provide vocabulary (or terminological) control.
When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with.
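
The lead-in function described above amounts to a many-to-one mapping from entry terms to a single descriptor. A minimal sketch, assuming a hypothetical lead-in vocabulary and a simple normalization rule (lower-casing and collapsing whitespace):

```python
# Hypothetical lead-in vocabulary: every entry term points to one descriptor.
LEAD_IN = {
    "automobiles": "Automobiles",
    "cars": "Automobiles",
    "motor cars": "Automobiles",
    "autos": "Automobiles",
}


def preferred_term(entry_term):
    """Return the descriptor for an entry term, or None if the term is unknown."""
    key = " ".join(entry_term.lower().split())
    return LEAD_IN.get(key)


from_synonym = preferred_term("Cars")
from_descriptor = preferred_term("Automobiles")
```

Indexer and searcher arrive at the same descriptor whichever synonym they start from, which is the point of vocabulary control.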


Preliminary considerations
What is used now?
    Continue using an existing thesaurus?
    Ad hoc modification of existing thesaurus?
    Develop a new well-structured thesaurus?

What is the scope and complexity of the subject field?
What kind of retrieval objects or data will be dealt with?
How exhaustive and specific is the desired description of objects?

Preliminary Considerations
The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus.
It is better to plan for a larger and more comprehensive system than for a smaller system that will rapidly become inadequate as the database grows.

Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists.

Development of a Thesaurus
Term Selection.
Merging and Development of Concept Classes.
Definition of Broad Subject Fields and Subfields.
Development of Classificatory structure
Review, Testing, Application, Revision.

Flow of Work in Thesaurus Construction


[Flowchart: steps and decision points recovered from the figure]
Select Sources
Assign codes
Select Terms
Record Selected Terms
Sort Terms
Merge identical Terms
Merge Terms in Same Concept class
Define Broad Subject Fields
Sort Terms into Broad Subject Fields
Define Subfields within one Subject Field
Work out detailed structure of the Subject Field
Select Preferred Terms
Improve Class Structure
Print Classified Index and review
Discuss with Experts and Users
Many Modifications? (Yes: Revise as needed / No)
Select descriptors and checklist items
All Subfields of Broad Subject finished? (Yes / No)
All Broad Subjects finished? (Yes / No)
Assign Notation
Produce Full Thesaurus and Check references
Review and Test
Based on Soergel, pp 327-333

The Indexing Process


Concept identification
Term selection (via thesaurus)
Term assignment

Application: The Indexing Process (Manual)


[Flowchart: boxes and decision points recovered from the figure]
Start
Examine Document and Identify Significant Concepts
Consider First Concept
Does Thesaurus contain term for Concept? (YES / NO)
Is Term suitable? (YES / NO)
Preferred Term? (YES / NO)
Select Preferred Term
Consider Preferred Term
Consider any associated terms in Thesaurus (NT, BT)
Consider Each of These Terms
Would Concept be better represented by one of these terms? (YES / NO)
Prefer Alternative Term(s)
Can Concept be expressed combining terms? (YES / NO)
Select Alternative term to represent Concept
Establish Term Denoting Concept
Admit New Term Into Thesaurus
Assign Terms to Document
Is There Another Concept? (YES / NO)
End

Adapted from ISO 5963, p.5
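
A much-simplified sketch of the decision sequence in the flowchart above. It covers only three of the branches (the thesaurus has the term, the concept can be expressed by combining terms, a new term must be admitted) and leaves out the review of associated broader and narrower terms. All names and data are hypothetical, and the code is not a transcription of the ISO 5963 procedure.

```python
def index_document(concepts, descriptors, use_references, combinations):
    """Assign thesaurus terms to a document, one identified concept at a time."""
    assigned = []    # terms assigned to the document
    candidates = []  # concepts proposed for admission into the thesaurus
    for concept in concepts:
        if concept in descriptors:
            terms = [concept]                   # already a preferred term
        elif concept in use_references:
            terms = [use_references[concept]]   # follow the lead-in to the preferred term
        elif concept in combinations:
            terms = combinations[concept]       # express the concept by combining terms
        else:
            candidates.append(concept)          # establish a new term for the concept
            terms = [concept]
        for term in terms:
            if term not in assigned:
                assigned.append(term)
    return assigned, candidates


descriptors = {"Thesauri", "Indexing", "Automation"}
use_references = {"Thesauruses": "Thesauri"}
combinations = {"Automatic indexing": ["Indexing", "Automation"]}

assigned, candidates = index_document(
    ["Thesauruses", "Automatic indexing", "Name authority files"],
    descriptors, use_references, combinations,
)
```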

Classification Systems
A classification system is an indexing language often based on a broad ordering of topical areas. Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics. Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms.
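
To show how a coded notation can carry a topic's place in the scheme, the sketch below uses a made-up hierarchical notation in which each added character narrows the class. The codes and captions are hypothetical and do not come from any of the classification systems named in this talk; real notations follow their own rules.

```python
# Hypothetical scheme: a longer code denotes a narrower class.
SCHEME = {
    "A": "Information organization",
    "A1": "Indexing languages",
    "A11": "Thesauri",
    "A12": "Classification systems",
    "B": "Information retrieval",
}


def broader_classes(notation, scheme):
    """Return the enclosing classes, nearest first, read off the code's prefixes."""
    return [notation[:i] for i in range(len(notation) - 1, 0, -1)
            if notation[:i] in scheme]


def narrower_classes(notation, scheme):
    """Return every class whose code extends the given code."""
    return sorted(code for code in scheme
                  if code.startswith(notation) and code != notation)


above = broader_classes("A12", SCHEME)
below = narrower_classes("A1", SCHEME)
```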

Classification Systems (cont.)


Examples:
    The Library of Congress Classification System
    The Dewey Decimal Classification System
    The ACM Computing Reviews Categories
    The American Mathematical Society Classification System

Automatic Indexing and Classification


Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.
More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.
Automatic classification attempts to automatically group similar documents using either:
    A fully automatic clustering method.
    An established classification scheme and set of documents already indexed by that scheme.
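
The simple form of automatic indexing described above (derive keywords, give access to all of them) can be sketched as an inverted index. The tokenization rule, the short stopword list, and the sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict

# A deliberately tiny stopword list, assumed for this example.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}


def derive_keywords(text):
    """Return the distinct non-stopword words of a text, in alphabetical order."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({word for word in words if word not in STOPWORDS})


def build_inverted_index(documents):
    """Map each derived keyword to the ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for keyword in derive_keywords(documents[doc_id]):
            index[keyword].append(doc_id)
    return dict(index)


documents = {
    "d1": "Thesauri and indexing languages",
    "d2": "Indexing of the documents in a database",
}
index = build_inverted_index(documents)
```

Nothing here is controlled: "thesauri" and "thesaurus" would be separate entries, which is the gap the more complex systems try to close by mapping document words to controlled vocabulary terms.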


Clustering
Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered; clusters are order-dependent.

Rocchio's method
1. Select initial centers (i.e., seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
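
The three steps listed above can be sketched as a single pass over a small collection. This is a simplified reading of the slide, not a full account of Rocchio's algorithm: it assumes term-frequency vectors, cosine similarity as the matching function, exclusive assignment, and caller-chosen seed documents, and all names and data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-weight mappings."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average a list of term-weight mappings."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: weight / len(vectors) for term, weight in total.items()}


def assign(documents, centers):
    """Put each document in the cluster of its highest-matching center."""
    clusters = [[] for _ in centers]
    for name, vector in documents.items():
        best = max(range(len(centers)), key=lambda i: cosine(vector, centers[i]))
        clusters[best].append(name)
    return clusters


def cluster(documents, seed_names):
    centers = [documents[name] for name in seed_names]   # 1. seed the space
    clusters = assign(documents, centers)                # 2. assign to best center...
    centers = [                                          #    ...and compute centroids
        centroid([documents[name] for name in members]) if members else center
        for members, center in zip(clusters, centers)
    ]
    return assign(documents, centers)                    # 3. reassign to centroids


documents = {
    "d1": Counter("thesaurus indexing thesaurus".split()),
    "d2": Counter("indexing thesaurus terms".split()),
    "d3": Counter("database relation table".split()),
    "d4": Counter("relation table metadata".split()),
}
clusters = cluster(documents, ["d1", "d3"])
```

Choosing different seeds can produce different clusters, which is one sense in which the result depends on how the space is seeded.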


Automatic Class Assignment


Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme

[Diagram: Docs -> Search Engine]

1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
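
The four steps above can be sketched with a toy "search engine" that ranks class pseudo-documents by cosine similarity to the incoming document. The class names, pseudo-document texts, whitespace tokenization, and threshold values are all assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a term-frequency vector (whitespace tokens, lower-cased)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_classes(document_text, pseudo_documents):
    """Steps 2-3: search with the document's contents and return a ranked list."""
    query = vectorize(document_text)
    scores = [(cosine(query, vectorize(text)), name)
              for name, text in pseudo_documents.items()]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


def assign_classes(document_text, pseudo_documents, threshold=None):
    """Step 4: assign every class over the threshold, or only the top-ranked class."""
    ranked = rank_classes(document_text, pseudo_documents)
    if threshold is None:
        return [ranked[0][1]] if ranked and ranked[0][0] > 0 else []
    return [name for score, name in ranked if score >= threshold]


# Step 1: pseudo-documents standing in for intellectually derived classes.
pseudo_documents = {
    "Cataloging": "cataloging authority names headings entries",
    "Indexing": "indexing thesaurus descriptors terms vocabulary",
    "Databases": "database relation table attribute index",
}
document = "thesaurus terms and descriptors for indexing names"

top_class = assign_classes(document, pseudo_documents)
over_threshold = assign_classes(document, pseudo_documents, threshold=0.1)
```

Each document is ranked against the class descriptions independently of the others, so the result does not depend on the order in which documents arrive.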


References
Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.
Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.
Standards:
    ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
    ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri
    ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
