Organizing Information: Metadata and Controlled Vocabularies
Ray R. Larson University of California, Berkeley School of Information Management and Systems
12/11/98 SIMS Affiliates Meeting 1998
Information Properties
Information can be communicated electronically
Broadcasting Networking
Information Hierarchy
Data
The raw material of information
Information
Data organized and presented by someone
Knowledge
Information read, heard or seen and understood
Wisdom
Distilled and integrated knowledge and understanding
Information Hierarchy
Figure labels: Accessing, Filtering, Storing, Retrieval, Distribution, Networking, Utilization, Disposition; stages: Semi-Active, Inactive
Searching
Origins
Very early history of content representation
Sumerian tokens and envelopes
Alexandria - pinakes
Indices
Origins
Biblical Indexes and Concordances (Hugo de St. Caro & 500 monks, 1247 -- KWIC)
Journal Indexes
Information Explosion following WWII
Cranfield Studies of indexing languages and information retrieval
Development of bibliographic databases
Index Medicus -- production and Medlars searching
Origins
Communication theory revisited
Problems with transmission of meaning
Figure: a message passes from Source through Encoding, Channel, and Decoding to Destination
Figure: Source, Storage, Destination
Structure of an IR System
Search line: interest profiles & queries
Storage line: documents & data
Information storage and retrieval system: storage of profiles, storage of documents, comparison/matching
Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language)
Metadata
Data about data
Information about information
Description of information structure and contents for individual information items, or entire collections of information
Types of Metadata
Element names. Element description. Element representation. Element coding. Element semantics. Element classification.
Metadata Systems
AACRII/MARC
Dublin Core
RDF (Resource Description Framework)
SGML/XML
DBMS Metadata
Controlled vocabularies
RDF (W3C)
A model for representing named properties and property values
Resources (the things described)
Properties (aspects, attributes, characteristics of resources)
Statements (Resource + Property + Value of Property for the Resource)
Expressed in XML
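The resource/property/value model can be sketched as a list of three-part statements. This is only an illustration of the data model, not RDF's XML syntax; the `Statement` type, the helper functions, and the example.org identifiers are assumptions made up for this sketch.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One statement: a resource, a named property, and that property's value."""
    resource: str  # the thing described (hypothetical identifiers below)
    prop: str      # the named property
    value: str     # the value of the property for the resource


# Hypothetical example data, not taken from the slides.
statements = [
    Statement("http://example.org/doc1", "creator", "Jane Doe"),
    Statement("http://example.org/doc1", "title", "An Example Report"),
    Statement("http://example.org/doc2", "creator", "Jane Doe"),
]


def properties_of(resource, stmts):
    """Collect the property/value pairs stated about one resource."""
    return {s.prop: s.value for s in stmts if s.resource == resource}


def resources_with(prop, value, stmts):
    """Find the resources for which a property has a given value."""
    return [s.resource for s in stmts if s.prop == prop and s.value == value]


print(properties_of("http://example.org/doc1", statements))
print(resources_with("creator", "Jane Doe", statements))
```

Keeping statements as flat triples means descriptions from different sources can simply be concatenated, which is one reason the model suits metadata exchange.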
What relations (tables) are in the DB
Relation (table) attributes (domains)
Attribute representation and storage
Other information (indexes, etc.)
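As one concrete illustration of DBMS metadata, the sketch below asks a database to describe itself. It assumes SQLite through Python's standard `sqlite3` module; the `documents` table, the `idx_author` index, and the `describe_database` helper are hypothetical names chosen for the example, and other database systems expose their catalogs differently.

```python
import sqlite3

# Build a small in-memory database to inspect (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT)"
)
conn.execute("CREATE INDEX idx_author ON documents (author)")


def describe_database(conn):
    """Return the tables, their columns (name, declared type), and their indexes."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]
        # PRAGMA index_list rows: (seq, name, unique, origin, partial)
        indexes = [row[1] for row in conn.execute(f"PRAGMA index_list({table})")]
        catalog[table] = {"columns": columns, "indexes": indexes}
    return catalog


catalog = describe_database(conn)
print(catalog)
```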
Controlled Vocabularies
Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.
Controlled Vocabularies
Names and name authorities
Design of controlled vocabularies for subject access -- Thesaurus design
Names
Cutter's (1876) objectives of bibliographic description:
To enable a person to find a document of which the author is known.
To show what the library has by a given author.
The problem
Proliferation of the forms of names
Different names for the same person
Different people with the same names
Examples
From Books in Print (semi-controlled but not consistent)
ERIC author index (not controlled)
Authority control
Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established) based on some set of rules.
If you have rules, why do you need to keep track of all of the headings?
Conditions of Authorship?
Single person or single corporate entity
Unknown or anonymous authors
Shared responsibility
Collections or editorially assembled works
Works of mixed responsibility (e.g. translations)
Related works
Added Entries
Personal names
Collaborators
Editors, compilers, writers
Translators (in some cases)
Illustrators (in some cases)
Other persons associated with the work (such as the honoree in a Festschrift)
Corporate Names
Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.
Choice of Name
AACR II says that the predominant form of the name used in a particular author's writings should be chosen as the form of name. References should be made from the other forms of the name.
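A name authority file of this kind can be sketched as a lookup from every known form of a name to the one established heading. The `NameAuthority` class, its crude normalization rule, and the sample heading are assumptions for illustration only; they are not drawn from AACR II or from any real authority file.

```python
class NameAuthority:
    """Maps variant forms of a name to a single established heading."""

    def __init__(self):
        self._established = {}  # normalized form -> established heading

    @staticmethod
    def _normalize(name):
        # Assumed normalization: ignore case, commas, hyphens, and extra spaces.
        return " ".join(name.lower().replace(",", " ").replace("-", " ").split())

    def establish(self, heading, variants=()):
        """Record a heading and make a reference from each variant form to it."""
        for form in (heading, *variants):
            self._established[self._normalize(form)] = heading

    def lookup(self, name):
        """Return the established heading for any known form, or None."""
        return self._established.get(self._normalize(name))


authority = NameAuthority()
# Illustrative entry: the predominant form is established, the other form refers to it.
authority.establish("Twain, Mark", variants=["Clemens, Samuel Langhorne"])

print(authority.lookup("Clemens, Samuel Langhorne"))
print(authority.lookup("Smith, John"))
```

This sketch handles different names for the same person. Distinguishing different people with the same name needs extra qualifiers in the heading, which the sketch does not attempt.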
Entry element:
John Smith or Smith, John?
Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
Structure of an IR System
Search line: interest profiles & queries
Storage line: documents & data
Information storage and retrieval system: storage of profiles, storage of documents, comparison/matching
Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language)
Thesauri
Controlled and Structured
Classification Systems
Controlled, Structured, and Coded
Indexing Languages
An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.
An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms.
Indexing Languages
Library of Congress Subject Headings
Yellow Pages Topics
Wilson Indexes (Readers Guide)
Thesauri
A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
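The links named in this definition can be sketched as a small data structure. The `Term` and `Thesaurus` classes and the sample vocabulary are hypothetical; real thesauri built to the standards carry more relationship types and scope notes than this minimal version.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A preferred term (descriptor) and its links to other terms."""
    label: str
    used_for: set = field(default_factory=set)  # synonymous/equivalent entry terms
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)


class Thesaurus:
    def __init__(self):
        self.terms = {}    # preferred label -> Term
        self.lead_in = {}  # non-preferred entry term -> preferred label

    def add_term(self, label):
        return self.terms.setdefault(label, Term(label))

    def add_synonym(self, preferred, synonym):
        self.add_term(preferred).used_for.add(synonym)
        self.lead_in[synonym] = preferred

    def add_broader(self, narrow, broad):
        # Broader/narrower links are reciprocal, so both sides are recorded.
        self.add_term(narrow).broader.add(broad)
        self.add_term(broad).narrower.add(narrow)

    def add_related(self, a, b):
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

    def preferred(self, word):
        """Map any entry term to its preferred term, or None if unknown."""
        return word if word in self.terms else self.lead_in.get(word)

    def expand(self, word):
        """Return the preferred term plus all of its narrower terms, transitively."""
        start = self.preferred(word)
        if start is None:
            return set()
        seen, stack = set(), [start]
        while stack:
            label = stack.pop()
            if label not in seen:
                seen.add(label)
                stack.extend(self.terms[label].narrower)
        return seen


# Hypothetical sample vocabulary.
thesaurus = Thesaurus()
thesaurus.add_broader("Automobiles", "Vehicles")
thesaurus.add_broader("Sports cars", "Automobiles")
thesaurus.add_synonym("Automobiles", "Cars")
thesaurus.add_related("Automobiles", "Roads")

print(thesaurus.preferred("Cars"))
print(sorted(thesaurus.expand("Cars")))
```

Expanding a query term to its narrower terms is one common way such a structure aids the searcher: a search on the broad concept also retrieves items indexed under the specific ones.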
Thesauri (cont.)
National and International Standards for Thesauri
ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval
ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri
Thesauri (cont.)
Examples:
The ERIC Thesaurus of Descriptors
The Art and Architecture Thesaurus
The Medical Subject Headings (MESH) of the National Library of Medicine
Preliminary considerations
What is used now?
Continue using an existing thesaurus?
Ad hoc modification of existing thesaurus?
Develop a new well-structured thesaurus?
What is the scope and complexity of the subject field?
What kind of retrieval objects or data will be dealt with?
How exhaustive and specific is the desired description of objects?
Preliminary Considerations
The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus.
It is better to plan for a larger and more comprehensive system than for a smaller system that will rapidly become inadequate as the database grows.
Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists.
Development of a Thesaurus
Term Selection.
Merging and Development of Concept Classes.
Definition of Broad Subject Fields and Subfields.
Development of Classificatory Structure.
Review, Testing, Application, Revision.
Flowchart of thesaurus development (based on Soergel, pp. 327-333).
Steps: sort terms; merge identical terms; merge terms in the same concept class; revise as needed; assign notation; produce full thesaurus and check references; review and test; end.
Decision points (yes/no): All subfields of broad subject finished? All broad subjects finished?
Flowchart fragment: Consider preferred term. Would the concept be better represented by one of these terms? (yes/no)
Classification Systems
A classification system is an indexing language often based on a broad ordering of topical areas. Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics. Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms.
Clustering
Agglomerative methods: polythetic, exclusive or overlapping, unordered; clusters are order-dependent.
Rocchio's method
1. Select initial centers (i.e., seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
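The three steps above can be sketched as a single seeded pass. This is a simplified illustration rather than Rocchio's full procedure: it assumes documents are term-weight dictionaries, uses cosine similarity as the matching function, and assigns each document to exactly one cluster (the exclusive case). All function names and the sample data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def assign(docs, centers):
    """Put each document index into the cluster of its best-matching center."""
    clusters = [[] for _ in centers]
    for i, doc in enumerate(docs):
        best = max(range(len(centers)), key=lambda c: cosine(doc, centers[c]))
        clusters[best].append(i)
    return clusters


def seeded_clustering(docs, seed_indices):
    centers = [docs[i] for i in seed_indices]  # step 1: seed the space
    clusters = assign(docs, centers)           # step 2: assign to best center
    centers = [                                # step 2: compute centroids
        centroid([docs[i] for i in members]) if members else center
        for members, center in zip(clusters, centers)
    ]
    return assign(docs, centers)               # step 3: reassign to centroids


# Hypothetical term-weight vectors.
docs = [
    {"thesaurus": 2, "indexing": 1},
    {"thesaurus": 1, "indexing": 2},
    {"clustering": 2, "centroid": 1},
    {"clustering": 1, "centroid": 2},
]
print(seeded_clustering(docs, seed_indices=[0, 2]))  # [[0, 1], [2, 3]]
```

Because the result depends on which documents are chosen as seeds, a different `seed_indices` can produce different clusters, which is the order dependence noted above.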
Search Engine
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
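The four steps above can be sketched with a toy ranking function. The sketch assumes whitespace tokenization and cosine similarity over raw term counts as the "search engine"; the category names, pseudo-document texts, and the values of `n` and `threshold` are hypothetical and would need tuning against a real collection.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector (assumed tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_categories(document, pseudo_docs):
    """Steps 2 and 3: use the document as the query and rank the pseudo-documents."""
    query = vectorize(document)
    scores = [(cosine(query, vectorize(text)), name) for name, text in pseudo_docs.items()]
    return sorted(scores, reverse=True)


def classify_multi(document, pseudo_docs, n=2, threshold=0.1):
    """Step 4, first option: up to n categories scoring over the threshold."""
    ranked = rank_categories(document, pseudo_docs)
    return [name for score, name in ranked[:n] if score > threshold]


def classify_top(document, pseudo_docs):
    """Step 4, second option: the single top-ranked category."""
    return rank_categories(document, pseudo_docs)[0][1]


# Step 1: one pseudo-document per intellectually derived class (hypothetical).
pseudo_docs = {
    "Thesauri": "thesaurus descriptors broader narrower related terms vocabulary",
    "Cataloging": "names authority headings author entries catalog",
    "Clustering": "clusters centroids documents similarity agglomerative",
}

document = "broader and narrower terms in a thesaurus vocabulary"
print(classify_multi(document, pseudo_docs))
print(classify_top(document, pseudo_docs))
```

The threshold option can return several categories or none at all, while the top-ranked option always returns exactly one, even when the best score is weak.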
References
Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.
Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.
Standards:
ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri
ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri