PreservationMetadata NCDDworkshop 2014 Titia VD Werf

Preservation Metadata: between theory and practice


Preservation Metadata Workshop (2)
The Hague, the Netherlands
19 June 2014
Titia van der Werf

adapted from:
Rebecca Guenther, “Metadata for preservation of digital objects:
background, functions, and standards” – Preservation Metadata Workshop (1),
Hilversum, The Netherlands, 4 March 2014
OUTLINE
1.  General introduction to preservation metadata
2.  The PREMIS Data Dictionary
3.  A use case: the Preservation Health Check

Introduction to preservation
metadata

metadata

Function
•  Discovery
•  Access
•  Management
•  Control intellectual property rights
•  Identification
•  Certify authenticity
•  Mark content structure
•  Indicate status
•  Describe processes
•  Etc.

Type
•  Descriptive
•  Administrative
•  Technical
•  Rights/Access
•  Structural
•  Meta-metadata
•  Etc.

digital preservation
Digital preservation is part and parcel of the “management and
preservation” tasks and responsibilities of a heritage institution.

Digital information poses its own set of challenges to preservation:


•  The overwhelming volume of digital information created daily and
the uncontrolled duplication of information;
•  The complexity of digital information (content, structure, context,
presentation, behaviour) and the evolving boundaries of the
scholarly record and the cultural record;
•  The dependency on software/hardware (incl. incompatible, obscure
or proprietary systems)
•  The rapid technological change and the danger of obsolescence
•  The ease of (accidental or malicious) content alteration
•  Doubts about the reliability and integrity of electronic records and
the need to vouch for their authenticity

digital preservation
Of the challenges above, two are singled out:

Ø  The dependency on software/hardware (incl. incompatible,
obscure or proprietary systems)
Ø  The rapid technological change and the danger of
obsolescence

preservation metadata in 2000
“We can then say that the main problem metadata
for long term preservation will help to solve is the
problem of technological obsolescence.” (p.4)

http://www.kb.nl/sites/default/files/docs/NEDLIBmetadata.pdf
preservation metadata in 2002
“Preservation metadata (…) is the information
necessary to maintain the viability,
renderability, and understandability of digital
resources over the long-term.” (p.1)
http://www.oclc.org/content/dam/research/activities/pmwg/pm_framework.pdf?urlm=161391

preservation metadata in 2005
“Preservation metadata (…) metadata supporting
the functions of maintaining viability,
renderability, understandability, authenticity,
and identity in a preservation context.” (p. ix)

http://www.loc.gov/standards/premis/

The SPOT Model for risk assessment
http://www.dlib.org/dlib/september12/vermaaten/09vermaaten.html

[Figure: the SPOT Model, showing threats to the six essential properties of
successful digital preservation: Availability, Identity, Persistence,
Renderability, Understandability, Authenticity]
metadata and preservation metadata

METADATA: “Structured information that describes, explains, locates,
or otherwise makes it easier to retrieve, use, or manage an
information resource”

PRESERVATION METADATA: “Metadata that supports and documents the
digital preservation process”

METADATA
supporting and documenting the
digital preservation process
•  Provenance:
–  The chain of custody/ownership of the digital object; info about the
depositor; etc.

•  Authenticity:
–  The documentation of changes affecting the authenticity of the digital object
during the preservation process

•  Preservation Activity:
–  The documentation of actions taken to preserve the digital object

•  Technical Environment:
–  The documentation of the dependencies on and changes in the technical
environment needed to render and use the digital object

•  Rights:
–  The documentation of the rights and permissions for carrying out
preservation activities on the digital object (duplication, migration,
transformations)
OAIS Information Model

Information Package Concepts and Relationships (Figure 2-3)


Preservation Description Information
Reference information: identifiers of the Content
Provenance information: history of the custody
Context information: relation of the Content to other objects
Fixity information: a data integrity checksum of the Content
Access Rights Information: permissions for preservation operations

[Figure: Preservation Description Information (Figure 4-16) – June 2012 version,
comprising Reference Information, Provenance Information, Context Information,
Fixity Information and Access Rights Information]


How to record and manage change

OAIS rule: if the PDI changes, the AIP version changes.

Implementation choices:
e.g. fixity information in source AIP
+ keep log of data integrity checks and their
outcomes separate from the AIP.
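
The implementation choice above can be sketched as follows. This is a minimal illustration, not a repository design: the reference checksum is stored with the AIP, while each integrity check appends its outcome to a log held outside the AIP, so routine checks do not touch the PDI. The names `aip_fixity`, `fixity_log` and `check_fixity`, and the log layout, are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def check_fixity(object_id: str, data: bytes, aip_fixity: dict, log: list) -> bool:
    """Compare current content against the digest stored in the AIP.

    The outcome is appended to `log`, which lives outside the AIP, so the
    AIP (and its PDI) is not modified by routine checks.
    """
    ok = sha256_of(data) == aip_fixity[object_id]
    log.append({
        "object": object_id,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "outcome": "pass" if ok else "fail",
    })
    return ok


# Hypothetical content and the fixity value recorded in the AIP at ingest.
content = b"example page image bytes"
aip_fixity = {"page1.tiff": sha256_of(content)}
fixity_log = []  # kept separately from the AIP

check_fixity("page1.tiff", content, aip_fixity, fixity_log)            # intact copy
check_fixity("page1.tiff", content + b"\x00", aip_fixity, fixity_log)  # altered copy
```

After both calls the AIP's fixity information is unchanged, and the log holds one passing and one failing entry.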

OAIS compliance relevant to preservation metadata

OAIS Mandatory Responsibilities:


1.  Negotiating and accepting information
2.  Obtaining sufficient control of the information to
ensure long-term preservation
3.  Determining the "designated community"
4.  Ensuring that information is independently
understandable
5.  Following documented policies and procedures
6.  Making the preserved information available
Digital repository certification
–  RLG-NARA Task Force on Digital Repository Certification
–  Various other certification initiatives (CRL, DCC, nestor,
DRAMBORA)
–  Trusted Repositories Audit & Certification (TRAC): Criteria and
Checklist (March 2007)
•  Organisational infrastructure
–  e.g., governance, organisational structures, mandates, policy
frameworks, funding systems, contracts and licenses
•  Digital Object Management (OAIS functions)
–  e.g., ingest, metadata, preservation strategies
•  Technologies, Technical Infrastructure, & Security
Functions of a trusted digital repository
relevant to preservation metadata
•  Maintains persistent, unique identifiers for all archived
objects
•  Identifies properties it will preserve
•  Verifies each submitted object during ingest
•  Creates archival package from submission package to
include technical and rights metadata
•  Has mechanisms to authenticate content and its source
•  Ensures that content information isn’t corrupted and
maintains integrity by using fixity information
•  Manages number and location of copies of all digital
objects
•  Employs documented preservation strategies

Functions of a trusted digital repository
relevant to preservation metadata
•  Maintains precise descriptions of actions necessary to ensure
that objects are preserved
•  Has mechanisms for monitoring and notification when formats
are becoming obsolete
•  Uses tools and resources such as format registries to
establish semantic and technical context
•  Has processes for storage media and/or hardware changes
•  Tracks and manages intellectual property rights and
restrictions
•  Ensures that agreements applicable to access conditions are
adhered to
•  Maintains descriptive metadata for access and retrieval and
associates it with object

PREMIS

Standards that address preservation
metadata: technical
•  PREMIS
•  Images
–  NISO Z39.87 and MIX
–  Adobe and XMP (Extensible Metadata Platform)
–  Exif (Exchangeable Image File Format)
–  IPTC (International Press Telecommunications Council)/XMP
•  Text: textMD
•  Sound
–  AES57-2011: Audio Object XML Schema
–  AES60-2011: Core Audio Metadata
–  AudioMD (Library of Congress)
Standards that address preservation
metadata: technical

•  Video
–  VideoMD
–  SMPTE RP210
–  Technical metadata in EBUCore, PBCore
–  U.S. Federal Agencies Digitization Guidelines
–  MPEG-7 and MPEG-21 for video
Standards that address preservation
metadata: Structural
§  METS
§  PREMIS
§  MPEG 21 Digital Item Declaration
§  OAI/ORE
§  Specific format types
–  MXF
–  AVI
Standards that address preservation
metadata: Rights
•  PREMIS
•  METS Rights
•  CDL Copyright schema
•  Creative commons
•  PLUS for images
•  MPEG-21 REL for moving images
•  ONIX for licensing terms
•  Full rights expression languages
–  XRML/MPEG-21
–  ODRL
PREMIS Data Dictionary
•  May 2005: Data Dictionary for Preservation
Metadata: Final Report of the PREMIS Working Group

•  March 2008: PREMIS Data Dictionary for Preservation
Metadata, version 2.0

•  Jan. 2011: version 2.1

•  April 2012: version 2.2

•  Announced in September 2013: version 3.0

•  Data Dictionary:
–  Comprehensive view of information needed to support digital preservation
•  Guidelines/recommendations to support creation, use, management
–  Based on deep pool of institutional experiences in setting up and managing operational
capacity for digital preservation
Guiding principles: “implementable,
core preservation metadata”
•  Preservation metadata: maintain viability, renderability,
understandability, authenticity, identity in a preservation
context

•  Core: What most preservation repositories need to know to
preserve digital materials over the long-term

•  Implementable: rigorously defined; supported by usage
guidelines/recommendations; emphasis on automated
workflows and metadata generation

•  Technical neutrality: no assumptions about technologies,
systems and architectures, where metadata is stored
Scope
•  What PREMIS DD is:
–  Common data model for organizing/thinking about preservation metadata
–  Guidance for local implementations
–  Standard for exchanging information packages between repositories
–  Compatible with the OAIS reference and information model

•  What PREMIS DD is not:


–  Out-of-the-box solution: need to instantiate as metadata elements in repository
system
–  All needed metadata: excludes business rules, format-specific technical
metadata, descriptive metadata for access, non-core preservation metadata
–  Lifecycle management of objects outside repository
–  Rights management: limited to permissions regarding actions taken within
repository
PREMIS Data Model

Intellectual
Entities
Rights
Statements

Objects Agents

Events
Intellectual Entities
•  Set of content that is considered a
single intellectual unit for purposes of
management and description (e.g., a
book, a photograph, a map, a
database)

•  Has one or more digital
representations

•  May include other Intellectual Entities
(e.g. a website that includes a web
page)

•  Not fully described in PREMIS DD, but
can be linked to in metadata
describing digital representation.
THIS WILL CHANGE IN 3.0

Examples:
•  The Chamber by John Grisham (an
ebook)
•  “Maggie at the beach”
(a photograph)
•  The Metropolitan New York Library
Council Website (a website)
Objects
Objects are what the repository
actually preserves

FILE: named and ordered sequence of
bytes that is known by an operating
system

REPRESENTATION: set of files, including
structural metadata, that, taken together,
constitute a complete rendering of an
Intellectual Entity

BITSTREAM: data within a file with
properties relevant for preservation
purposes (but needs additional structure
or reformatting to be a stand-alone file)

FILESTREAMS (files within files)
are considered files since they can be
rendered alone

Examples:
§  a PDF file
§  A book composed of several
XML files and many images
§  TIFF file containing a header
and 2 images
Object Example: book in two
versions

Intellectual Entity: Da Vinci Code by Dan Brown
   Representation 1: Page image version
      File 1: page1.tiff
      File 2: page2.tiff
      File N: pageN.tiff
      File N+1: METS.xml
   Representation 2: ebook version
      File 1: book.lit
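
The book example can be expressed as a small data structure. This is a sketch only: the class and field names are illustrative and are not PREMIS semantic units, and the placeholder file `pageN.tiff` is left out so that every file listed is concrete.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    name: str


@dataclass
class Representation:
    label: str
    files: list = field(default_factory=list)


@dataclass
class IntellectualEntity:
    title: str
    representations: list = field(default_factory=list)


# One Intellectual Entity with two Representations, as in the slide.
book = IntellectualEntity(
    title="Da Vinci Code by Dan Brown",
    representations=[
        Representation(
            "Page image version",
            [File("page1.tiff"), File("page2.tiff"), File("METS.xml")],
        ),
        Representation("ebook version", [File("book.lit")]),
    ],
)


def file_names(entity: IntellectualEntity) -> list:
    """List every file name across all representations of an entity."""
    return [f.name for rep in entity.representations for f in rep.files]


all_files = file_names(book)
```

The repository preserves the Files and Representations; the Intellectual Entity is only the unit they are grouped under.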
Semantic units pertaining to Objects

•  Object identifier
•  Preservation level
•  Significant characteristics
•  Object characteristics
–  fixity
–  format
–  size
–  creating application
–  inhibitors
–  object characteristics extension
•  Original name
•  Storage
•  Environment (will change in 3.0)
–  software
–  hardware
•  Digital signatures
•  Relationships
•  Linking event identifier
•  Linking rights statement identifier
Events
•  An action that involves or impacts at
least one Object or Agent associated
with or known by the preservation
repository

•  Helps document digital provenance.
Can track history of Object through the
chain of Events that occur during the
Object's lifecycle

•  Determining which Events are in scope
is up to the repository (e.g., Events
which occur before ingest, or after
de-accession)

•  Determining which Events should be
recorded, and at what level of
granularity, is up to the repository

Examples:
§  Validation Event: use JHOVE tool to
verify that chapter1.pdf is a valid PDF
file
§  Ingest Event: transform an OAIS SIP
into an AIP (one Event or multiple
Events?)
Semantic units pertaining to Events:
provenance and preservation activity

§  Event identifier


§  Event type (e.g. capture, creation, validation, migration,
fixity check, ingestion)
§  Event dateTime
§  Event detail
§  Event outcome
§  Event outcome detail
§  Linking agent identifier
§  Linking object identifier
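
As an illustration of how these semantic units might be serialized, the sketch below builds an XML record for the validation example using Python's standard library. The element names follow the semantic units listed above; the namespace, the nesting and the `local` identifier type are assumptions that should be checked against the PREMIS XML schema in use, and the identifier values are hypothetical. The output is not validated against the schema.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for PREMIS version 2; verify against the schema in use.
PREMIS_NS = "info:lc/xmlns/premis-v2"


def el(parent, name, text=None):
    """Append a namespaced child element, optionally with text content."""
    child = ET.SubElement(parent, f"{{{PREMIS_NS}}}{name}")
    if text is not None:
        child.text = text
    return child


def build_event(event_id, event_type, when, detail, outcome, agent_id, object_id):
    """Build one event record from the semantic units listed on the slide."""
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    ident = el(event, "eventIdentifier")
    el(ident, "eventIdentifierType", "local")
    el(ident, "eventIdentifierValue", event_id)
    el(event, "eventType", event_type)
    el(event, "eventDateTime", when)
    el(event, "eventDetail", detail)
    info = el(event, "eventOutcomeInformation")
    el(info, "eventOutcome", outcome)
    agent = el(event, "linkingAgentIdentifier")
    el(agent, "linkingAgentIdentifierType", "local")
    el(agent, "linkingAgentIdentifierValue", agent_id)
    obj = el(event, "linkingObjectIdentifier")
    el(obj, "linkingObjectIdentifierType", "local")
    el(obj, "linkingObjectIdentifierValue", object_id)
    return event


# Hypothetical identifiers and timestamp for the validation example.
event = build_event(
    "event-001", "validation", "2014-06-19T10:00:00Z",
    "Format validation of chapter1.pdf", "valid",
    "agent-jhove", "chapter1.pdf",
)
ET.register_namespace("premis", PREMIS_NS)
xml_text = ET.tostring(event, encoding="unicode")
```

The Event links to its Agent and Object only by identifier, which is how the data model keeps the entities separate.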
Agents
•  Person, organization, or software
program/system associated with
an Event or a Right (permission
statement)

•  Agents are associated only
indirectly to Objects through
Events or Rights

•  Not defined in detail in PREMIS
DD; not considered core
preservation metadata beyond
identification

Examples:
§  Rebecca Guenther (a person)
§  New York Public Library (an
organization)
§  JHOVE version 1.0 (a software
program)
Semantic units pertaining to Agents

•  Agent Identifier
•  Agent Name
•  Agent Type
•  Agent Note
•  Agent Extension
•  Linking Event Identifier
•  Linking Rights Identifier
Rights Statements
•  An agreement with a rights holder
that grants permission for the
repository to undertake an
action(s) associated with an
Object(s) in the repository.
•  Not a full rights expression
language; focuses exclusively on
permissions that take the form:
–  Agent X grants Permission Y
to the repository in regard to
Object Z.

Example:
§  Priscilla Caplan grants FCLA
digital repository permission to
make three copies of
metadata_fundamentals.pdf for
preservation purposes.
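
The "Agent X grants Permission Y in regard to Object Z" form lends itself to a simple lookup before the repository acts. The sketch below is illustrative: the class, its field names and the act value `replicate` are assumptions chosen for the example, not a prescribed encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RightsStatement:
    granting_agent: str    # Agent X
    act: str               # Permission Y
    object_id: str         # Object Z
    restriction: str = ""  # free-text limit on the permission


def is_permitted(statements, act, object_id):
    """True if any statement grants `act` for the given object."""
    return any(s.act == act and s.object_id == object_id for s in statements)


# The slide's example, encoded with an assumed act label.
statements = [
    RightsStatement(
        granting_agent="Priscilla Caplan",
        act="replicate",
        object_id="metadata_fundamentals.pdf",
        restriction="three copies, for preservation purposes",
    )
]

may_copy = is_permitted(statements, "replicate", "metadata_fundamentals.pdf")
may_migrate = is_permitted(statements, "migrate", "metadata_fundamentals.pdf")
```

Here copying is allowed and migration is not, because no statement grants it. Enforcing the restriction text itself is left out of the sketch.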
Semantic units pertaining to Rights

•  Rights Statement
•  Rights Statement Identifier
•  Rights Basis
•  Copyright Information
•  License Information
•  Statute Information
•  Other Rights Information
•  Rights Granted
–  act
–  restriction
–  termOfGrant
–  rightsGranted
•  Linking Object Identifier
•  Linking Agent Identifier
•  rightsExtension
Relationships

•  PREMIS Data Dictionary supports expression
of relationships between:
–  Different Objects
•  Structural: relationships between parts of a whole
•  Derivation: relationships resulting from replication or transformation of
an Object
•  New relationships in 3.0: replacement, dependency, generalization,
reference
–  Different Entities
•  Relationships are established through
reference to Identifiers of other Objects or
Entities
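
Because relationships are expressed only through identifiers, following or checking them is a lookup. A minimal sketch, assuming a hypothetical in-memory store keyed by object identifier; the identifiers and relationship labels are made up for the example.

```python
# Hypothetical store: identifier -> list of relationships, each a
# (type, subtype, target identifier) triple.
objects = {
    "rep-1": [("structural", "has part", "file-1"),
              ("structural", "has part", "file-2")],
    "file-1": [],
    "file-2": [("derivation", "has source", "file-1")],
}


def related(objects, object_id, rel_type):
    """Return identifiers linked from `object_id` by relationships of `rel_type`."""
    return [target for rtype, _subtype, target in objects[object_id]
            if rtype == rel_type]


def dangling_links(objects):
    """Return (source, target) pairs whose target identifier is unknown."""
    return [(source, target)
            for source, rels in objects.items()
            for _rtype, _subtype, target in rels
            if target not in objects]


parts = related(objects, "rep-1", "structural")
broken = dangling_links(objects)
```

A check like `dangling_links` is one way to confirm that every referenced identifier resolves to a known Object.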
PREMIS Maintenance Activity
•  Web site:
–  Permanent Web presence, hosted by
Library of Congress
–  Central destination for PREMIS-related
info, announcements, resources
–  Home of the PREMIS Implementers’ Group (PIG)
discussion list
•  PREMIS Editorial Committee:
–  Set directions/priorities for PREMIS development
–  Coordinate future revisions of Data Dictionary and XML
schema
–  Promote implementation
–  International in scope, cross domain

http://www.loc.gov/standards/premis/
Implementation resources
•  Tools:
–  XML schema
–  PREMIS-in-METS toolbox <http://pim.fcla.edu>
–  Controlled vocabularies at http://id.loc.gov
–  RDF/OWL ontology for use as Linked Data
•  Guidelines:
–  PREMIS conformance statement
–  PREMIS & METS guidelines
•  Community Working groups on special topics
•  Implementation Fairs
•  Others:
–  Understanding PREMIS (available in multiple languages)
–  PIG Forum
–  Implementation Registry
–  Tools Registry
Some implementers …
•  DAITSS (Florida)
•  Ex Libris Rosetta
•  OCLC’s Digital Archive™
•  Archivematica
•  HathiTrust
•  TIPR (Towards Interoperable Preservation
Repositories)
–  FCLA, NYU and Cornell
•  Digital libraries in Spain
–  Mandated for use in cultural heritage preservation
repositories
See: http://www.loc.gov/premis/premis-registry.html
PREMIS Conformance
•  Conformance statement issued in 2010
•  PREMIS Conformance Working Group active
now
•  Levels of conformance:
–  Level 1
A repository uses an internal metadata schema whose elements can be
mapped to PREMIS. The mapped metadata can satisfy the principles of
use at both the semantic unit and Data Dictionary levels. The repository
is able to produce documentation demonstrating such mapping for
representative samples of its holdings.
–  Level 2
A repository implements the PREMIS Data Dictionary as its internal
metadata schema in a way that satisfies the principles of use at both the
semantic unit and Data Dictionary levels and in a form that does not
require further mapping or conversion.
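
Level 1 conformance rests on a documented mapping from the repository's own schema to PREMIS. A minimal sketch of such a mapping, assuming hypothetical internal field names on the left; the names on the right are PREMIS semantic units, and the sample record is invented.

```python
# Hypothetical internal field names mapped to PREMIS semantic units.
MAPPING = {
    "pid": "objectIdentifierValue",
    "checksum": "messageDigest",
    "checksum_algo": "messageDigestAlgorithm",
    "bytes": "size",
    "mime_format": "formatName",
    "source_filename": "originalName",
}


def to_premis(record: dict):
    """Split a record into PREMIS-mapped values and unmapped field names."""
    mapped = {MAPPING[key]: value for key, value in record.items() if key in MAPPING}
    unmapped = sorted(key for key in record if key not in MAPPING)
    return mapped, unmapped


# Invented internal record; "shelf" has no PREMIS counterpart in the mapping.
record = {"pid": "obj-0001", "bytes": 1024, "shelf": "A3"}
mapped, unmapped = to_premis(record)
```

Listing the unmapped fields makes visible which local metadata falls outside the mapping documented for conformance.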
URLs, etc.
•  PREMIS Maintenance Activity:
http://www.loc.gov/standards/premis/

•  PREMIS Data Dictionary for Preservation Metadata:
http://www.loc.gov/standards/premis/v2/premis-2-1.pdf
•  PREMIS Implementation Registry
http://www.loc.gov/standards/premis/registry

•  PREMIS Implementers Group list


http://listserv.loc.gov/listarch/pig.html
A use case: the preservation
health check

What is the Preservation Health Check
Pilot?
-  Open Planets Foundation (OPF)
A community hub for digital preservation whose main goal is
to jointly manage and improve tools and research
outcomes for practical use.
-  OCLC Research
A community resource for shared R&D that addresses
challenges facing libraries and archives in a rapidly
changing information technology environment.
-  Bibliothèque nationale de France
The BnF runs a fully operational trusted digital repository
(SPAR). They volunteered to become a PHC-pilot site.
The Preservation Health Check
proposition
As part of their preservation management task, repository
managers need to be able to monitor the preservation
status of the content of their repository.
We are looking at regular “routine check-ups” that can
support this monitoring task.
–  Monitoring should be made easy (automatically
generated reports or dashboard)
–  Monitoring should be based on objective data,
generated by the repository (e.g. preservation
metadata)
The analogy
The research question
If a Preservation Health Check is a monitoring activity to be
performed on a repository with digital content:
1.  What are empirical indicators (i.e. measures) for PHCs?
2.  Are preservation metadata recorded by repositories
useful as health indicators for PHCs?

Monitoring is about tracking change ... intentional and
unintentional change.
Goal:
To develop an implementable logic (or protocol) to
support PHCs, and to test this logic against the
store of preservation metadata maintained by an
operational preservation repository.
The pilot site
The BnF runs a fully operational trusted digital repository
(SPAR). They volunteered to become a PHC-pilot site.
The empirical data consists of:
1.  A sample (200 GB) of the PREMIS data (AIP-METS
files), covering the following collections:
–  Gallica = digitised periodicals, monographs, still images and
manuscripts (TIFF + OCR-files)
–  Legal deposit Web harvests (warc files)
–  3rd party collection (Centre Pompidou)
The pilot site
The empirical data consists of (continued):
2.  All the Reference Information packages in SPAR that
contain reference information/code/specifications of
(external) tools used during INGEST (ex. JHOVE) and
of formats ingested;
3.  Per collection: SLAs defining policy agreements with
SIP suppliers concerning the preservation regime to be
applied at the INGEST and ARCHIVAL STORAGE
stages.
Mapping PREMIS on to SPOT
[Figure: semantic units of the PREMIS Data Model entities (Intellectual
Entities, Objects, Events, Rights, Agents) mapped to the threats against the
SPOT Model properties (Availability, Identity, Persistence, Renderability,
Understandability, Authenticity)]
Findings: coverage
SPOT property: # of PREMIS semantic units*
•  Availability: 16
•  Identity: 19
•  Persistence: 10
•  Renderability: 15
•  Understandability: 14
•  Authenticity: 16

*Container level only; Agents, Events, Rights considered one semantic unit
Findings: coverage
•  What does coverage in terms of “number of PREMIS
semantic units” mean?
•  More meaningful: Do the PREMIS semantic units
address the threats associated with a SPOT property?

Example of a gap between SPOT and PREMIS:


SPOT property: Understandability
We found no PREMIS semantic units that provide
information that aids in the understanding or
interpretation of the content of the archived digital object.
Findings: preservation policies
A repository usually implements a large number of explicit
and implicit policy decisions; however, PREMIS currently
makes few provisions for recording these in preservation
metadata (the semantic unit preservationLevel being a
notable exception).
Findings: explicit encoding
PREMIS conformance does not require explicit encoding of
metadata if the information applies to all objects in the
repository.
This impedes the provision of automated PHC services (by
a third-party provider) because efficient provision of this
service would likely require the information in semantic
units to be explicitly recorded, and implemented in a
standard way.
Logic for assessing Persistence
•  If storage medium information is not available in PREMIS metadata,
the PHC will need to take other information sources into account –
such as audit reports generated by storage management systems.

•  We note that there are no pre-defined events for Corruption and
Readability in PREMIS, which means that the repositories need to
define their own events. PREMIS does provide a list of
recommended event labels for the semantic unit eventType, but it is
just a “suggested starter list”.

•  The repository should have policies in place that prescribe
frequencies of fixity checks, of medium refreshment, backup policy,
etc. The PREMIS semantic unit preservationLevel does not address
such policies. The PHC flow thus needs to get the policy information
from other sources.
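
One way such a check could work, as a sketch only and not the pilot's actual logic: recorded fixity-check events are compared with a check frequency taken from a policy source outside PREMIS. The event records, the 90-day limit and the status labels below are all hypothetical.

```python
from datetime import date, timedelta

# Policy taken from a source outside PREMIS (e.g. an SLA); value is invented.
policy = {"max_days_between_fixity_checks": 90}

# Hypothetical event records extracted from preservation metadata.
events = [
    {"object": "file-1", "eventType": "fixity check", "date": date(2014, 1, 10), "outcome": "pass"},
    {"object": "file-1", "eventType": "fixity check", "date": date(2014, 5, 20), "outcome": "pass"},
    {"object": "file-2", "eventType": "fixity check", "date": date(2013, 11, 2), "outcome": "pass"},
    {"object": "file-3", "eventType": "ingestion", "date": date(2014, 2, 1), "outcome": "pass"},
]


def persistence_report(object_ids, events, policy, today):
    """Classify each object by its most recent recorded fixity check."""
    limit = timedelta(days=policy["max_days_between_fixity_checks"])
    report = {}
    for oid in object_ids:
        checks = [e for e in events
                  if e["object"] == oid and e["eventType"] == "fixity check"]
        if not checks:
            report[oid] = "no fixity check recorded"
            continue
        latest = max(checks, key=lambda e: e["date"])
        if latest["outcome"] != "pass":
            report[oid] = "last fixity check failed"
        elif today - latest["date"] > limit:
            report[oid] = "fixity check overdue"
        else:
            report[oid] = "ok"
    return report


report = persistence_report(
    ["file-1", "file-2", "file-3"], events, policy, today=date(2014, 6, 19)
)
```

With these sample records, `file-1` is within policy, `file-2` is overdue, and `file-3` has no fixity check on record. The sketch only works if events are explicitly recorded per object.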
A use case: the preservation
health check (to be continued)

Thank You!
titia.vanderwerf@oclc.org

©2014 OCLC. This work is licensed under a Creative Commons Attribution 3.0 Unported License. Suggested attribution: “This
work uses content from [presentation title] © OCLC, used under a Creative Commons Attribution license:
http://creativecommons.org/licenses/by/3.0/”
