Relevance, Standards and Usage of Metadata for Electronic Language Resources

Daan Broeder, Peter Wittenburg

MPI for Psycholinguistics
CLARIN Research Infrastructure

HT: there is not (yet) one agreed descriptive system for LRT. Let’s limit the damage!

Library History

• concept of descriptive metadata is of course very old


• library catalogues were used to easily manage and find books
stored somewhere on the shelves
• some liked the catalogues – others liked to look at the book instances
• these catalogues typically had very limited information
• finding the right book (title, author, year, etc)
• quick inspection (citation, older versions, statistics, etc)
• managing the library holding (overview, reorganization, missing, etc)
• not the place for deep characterization
• in some catalogues content classification
• genre
• subject (LCSH, IconClass, …)
• libraries the first to introduce/push electronic catalogues
and exchange formats (MARC, etc)
• Dublin Core to describe any authored web-resource was
pushed forward also by librarians.

Motivation in Language Resource Domain

• Ever more language resources of all types are being created.
• At MPI, about 500,000 digital objects have been deposited by a large group
of researchers working independently of each other, with a high annual
increase
• The sheer quantity requires new methods to prevent
Digital Chaos or Data Cemetery
1. need good and stable repositories/archives
2. need a good Descriptive Metadata infrastructure
• Several of us realized this
• Early approaches
• TEI header tags (deep descriptive intention)
were used in various projects (Dutch Spoken Corpus)
• CHILDES annotation file header tags (search, filtering etc)
• …

Initiatives for Descriptive Metadata for LRT

• Dublin Core MD initiative for all types of authored web-resources
• 1998 TEI header
• May 2000 ISLE MD White Paper (IMDI) presented at LREC in
Athens & establishment of an IMDI working group
• May 2000 LREC necessity of language classification system
(Ethnologue) now an ISO standard
• December 2000 Presentation of the OLAC initiative
• 2000 DFKI/ACL Registry of tools
• important activities in other fields
• LOM: DMD for learning objects
• MPEG7: complex integrated approach (DMD + content)
• ISO 19115: geographic information
• Indecs
• …
• social tagging as alternative for expert metadata, but usability for
our domain may be limited

Functions of Descriptive Metadata for LRT I

Differences in the approaches with respect to different interest groups
• users
• search big catalogues with a large number of descriptions
• browsing through linked hierarchies or networks of DMD
• facetted browsing as a combination
• geographic browsing based on GIS coordinates
• quick inspection of metadata to check suitability
• virtual collection building and workflow creation (process journal)
• creating relations between LRs of various sorts
• creating different views including dynamic web-sites
• research questions vs. discovery
• give me frequencies of correct usage of 3rd person plural
inflected forms for children of different age and sex
• give me lexicons for Trumai
• granularity
• let me find a specific individual object
• let me find a corpus
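
A minimal Python sketch of faceted browsing over descriptive metadata, assuming records are flat key-value dictionaries. The field names (`language`, `genre`, `type`) and the sample records are hypothetical and are not taken from IMDI or OLAC.

```python
from collections import Counter

# Hypothetical metadata records; field names and values are illustrative only.
RECORDS = [
    {"id": "r1", "language": "Trumai", "genre": "narrative", "type": "audio"},
    {"id": "r2", "language": "Trumai", "genre": "lexicon", "type": "text"},
    {"id": "r3", "language": "Dutch", "genre": "narrative", "type": "audio"},
]


def select(records, **facets):
    """Return the records whose fields match every requested facet value."""
    return [r for r in records
            if all(r.get(field) == value for field, value in facets.items())]


def facet_counts(records, field):
    """Count how many records carry each value of one facet field."""
    return Counter(r[field] for r in records if field in r)


# "Give me lexicons for Trumai": narrow the catalogue by two facets at once.
trumai_lexicons = select(RECORDS, language="Trumai", genre="lexicon")
print([r["id"] for r in trumai_lexicons])

# Remaining choices a browser could offer after one facet has been selected.
print(facet_counts(select(RECORDS, language="Trumai"), "genre"))
```

Combining exact-match selection with per-facet counts is what lets a user move between searching and browsing: each selection narrows the set, and the counts show which further refinements are still possible.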

Functions of Descriptive Metadata for LRT II

• depositors/managers
• canonical hierarchy according to linguistic criteria
and resource bundling (container building)
• for resource management (migration, moving, etc)
• for simple access and license management
• adding valuable information/knowledge about resources
• for copying parts (access, long-term archiving)

Example Scenario:
• all copied to computer centers
• only parts exchanged between MPI
and regional centers in both directions

DMD Infrastructure Components (until now)

• metadata provider <-> service provider
• one major difference: DMD Data vs. DMD Service Provider
• DMD Service Provider has no resource management task
• the DMD specification
• a schema (flat or structured), until now one of the main pillars
• a vocabulary of descriptor elements with key-value pairs
• per-element value sets (closed, open, semi-closed)
• special profiles to include new sub-disciplines
• the tools
• editor, browser(s), search engine (structured vs. unstructured)
• DBMSs (relational or XML-based)
• OAI-PMH protocol (gateway, harvesting)
• linker and virtual collection builder
• view generators
• APIs (Web services: SOAP & REST)
• ..
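
The three kinds of value sets listed above can be illustrated with a small validator. This is a sketch under assumptions: the descriptor names, the listed values, and the reading of "semi-closed" as "suggested values that may be extended" are illustrative and are not the definition used by any particular DMD specification.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """One descriptor element with its value set (names are hypothetical)."""
    name: str
    kind: str                      # "closed", "semi-closed" or "open"
    values: set = field(default_factory=set)

    def check(self, value):
        """Return 'ok', 'new' (accepted but not yet listed) or 'invalid'."""
        if self.kind == "open" or value in self.values:
            return "ok"
        if self.kind == "semi-closed":
            return "new"
        return "invalid"


SCHEMA = {
    "sex": Descriptor("sex", "closed", {"male", "female", "unknown"}),
    "genre": Descriptor("genre", "semi-closed", {"narrative", "interview"}),
    "title": Descriptor("title", "open"),
}


def validate(record):
    """Check each key-value pair; keys outside the schema are 'invalid'."""
    return {key: SCHEMA[key].check(value) if key in SCHEMA else "invalid"
            for key, value in record.items()}


print(validate({"sex": "female", "genre": "ritual song", "title": "Session 1"}))
```

Reporting unlisted values of a semi-closed set as `new` rather than rejecting them leaves room for a curator to decide later whether the value should be added to the vocabulary.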

DMD Experience

• some initiatives have done an excellent job
• IMDI, OLAC have stabilized and offer services
• DC moved from 15 broad categories to qualified concepts
• vocabularies are registered (community sites, ISO DCR)
• OAI-PMH is widely accepted for metadata exchange
• XML harvesting as a less expensive alternative is accepted
• but total coverage is not at all sufficient
• too few repositories are ready/willing to participate
• DMD usage is not at all satisfying (see IMDI usage*)
• necessity not believed despite evangelization
• DMD generation costs money, but is not budgeted
• some researchers still don’t want to share
• some researchers would like to participate, but …
• lot of legacy material – how to get that in?
• DMD is open – some have ethical/political problems
• user friendliness (what is this?) to be improved
• not all functions supported
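
As a hedged illustration of the OAI-PMH metadata exchange mentioned above, the sketch below builds a `ListRecords` request and reads Dublin Core titles from a response. The base URL and the embedded sample response are made up for the example; the `ListRecords` verb, `metadataPrefix=oai_dc` and the namespaces belong to the protocol and to Dublin Core. No network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint; replace with a real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"

# Namespace of the 15 simple Dublin Core elements.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL of an OAI-PMH ListRecords request."""
    return base_url + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix})


def titles(response_xml):
    """Extract all dc:title values from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [t.text for t in root.iterfind(".//dc:title", NS)]


# Invented, heavily shortened response used instead of a live harvest.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample session</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(BASE_URL))
print(titles(SAMPLE))
```

A real harvester would also have to follow resumption tokens and handle protocol errors, both of which are left out here.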

DMD Lessons learned

• schemas are secondary - let everyone create his/her own schema
• primary are registered and suitable vocabularies and persistent IDs
• a registry for schemas to allow re-usage and look-up

• need a flexible component based framework for DMD (similar to LMF)
• a REQUIREMENT to use registered vocabularies
• need to support localization and sub-discipline terminology
• a registry allowing re-use of existing schemas or blocks (schema
fragments)
• easy registration of new schemas (using registered vocabularies)
• full support of PIDs at all relevant levels: concepts, resources and
other metadata
• possibility to register useful relations between concepts (pragmatic
ontologies)
• next-generation tools should support such a framework
• need thorough studies of resource types / descriptor sets per type
• need to include web services so others may interact with it.
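
One way to picture the component-based framework described above: schemas are assembled from reusable components, and every element must point to a concept in a registry. The sketch below is only an illustration; the registry contents, the `example:` identifiers and all class and function names are hypothetical and do not reflect ISOcat, CLARIN, or any existing tool.

```python
from dataclasses import dataclass

# Hypothetical concept registry: persistent ID -> human-readable label.
CONCEPT_REGISTRY = {
    "example:concept/1": "language name",
    "example:concept/2": "genre",
    "example:concept/3": "actor age",
}


@dataclass(frozen=True)
class Element:
    name: str          # local name, free to choose (supports localization)
    concept_pid: str   # must resolve in the concept registry


@dataclass(frozen=True)
class Component:
    name: str
    elements: tuple


def unregistered(component):
    """Return the names of elements whose concept PID is not registered."""
    return [e.name for e in component.elements
            if e.concept_pid not in CONCEPT_REGISTRY]


def build_schema(name, components):
    """Assemble a schema from components, rejecting unregistered concepts."""
    for component in components:
        missing = unregistered(component)
        if missing:
            raise ValueError(f"{component.name}: unregistered {missing}")
    return {"name": name, "components": [c.name for c in components]}


content = Component("Content", (Element("Sprache", "example:concept/1"),
                                Element("Genre", "example:concept/2")))
actor = Component("Actor", (Element("Age", "example:concept/3"),))

# Two different schemas re-use the same registered component.
print(build_schema("FieldRecording", [content, actor]))
print(build_schema("Lexicon", [content]))
```

Because elements are tied to concept identifiers rather than to their local names, two schemas with differently named fields can still be compared or searched together, which is the sense in which schemas become secondary to vocabularies.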

Standards and other Trends

• which standards/suggestions are there


• ISO TC37/SC4: ISOcat as DCR standard on the way
• of course reuse trusted registries such as DCMI
• W3C, ISO, IETF: PID standard
• TEI ODD component framework

• which standards/suggestions are missing


• exhaustive LRT taxonomy and description per data resource type
• a feasible suggestion for WS description (UDDI, ebXML did not work)
• an accepted model for new generation DMD
• DMD is in the focus of large initiatives such as DRIVER
(European project to create a Digital Repository Infrastructure)
• will someone take care?
• in the CLARIN project all of this will be one of the main issues
• a flexible component model for MD is on the CLARIN list
• poster on Thursday P15

Thank you for your kind attention.


IMDI Usage
IMDI statistics on 27,000 records:
• many creators did not use content fields (Genre, Subject)
• difficulties with classification, laziness – why should I invest time, etc
• now after 6 years special web pages with dynamic REST-based content generation, motivation increases

[Figure: bar chart of usage in % (axis 0 to 120) per IMDI descriptor, covering the session, project, content, communication context, actor, media file and written resource fields]

they already know the data
time investment is for other users
data is considered own property and not that of the funder or research community

REST -> more IMDI ????
but it makes them more aware of the possibilities of metadata
broeder; 23.05.2008
