4.2 Creation of the AIP

4.2.1 - The repository shall have for each AIP or class of AIPs preserved by the repository an associated definition that is adequate for parsing the AIP and fit for long-term preservation needs. #

Response #

SP’s Definition of AIPs explains the different components that make up an AIP within the repository. SP’s ejournal AIPs are all fairly similar in composition and are therefore treated in a consistent manner.

  • SP stores the Content Data Object separately from the preservation metadata file. The CDO is stored on a filesystem with a reference to its location contained in the preservation metadata.
  • The Representation Information, Preservation Description Information (PDI), Packaging Information and Descriptive Information of the content object are all contained within a metadata scheme based on METS (Metadata Encoding and Transmission Standard). Within this container, SP uses the PREMIS (Preservation Metadata: Implementation Strategies) vocabulary. The repository makes use of the objects, events and rights entities described in the PREMIS Data Model.

Content Note - Journals #

In the case of journals, Scholars Portal uses the NLM Journal Archiving & Interchange Tag Set to structure descriptive metadata. Scholars Portal normalizes the descriptive metadata and structural relationships submitted by Providers to version 3.0 of the NLM Journal Archiving & Interchange Tag Set.

Responsibility #

  • Digital Preservation Librarian
  • Metadata Librarian

Documents #

  1. Definition of AIP
  2. Metadata Specifications

4.2.1.1 - The repository shall be able to identify which definition applies to which AIP. #

Content Note - Journals #

In the case of journals, SP creates only one class of AIP. There is one AIP for each article ingested into the repository, and it is consistent with the SP Definition of AIP.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Definition of AIP

4.2.1.2 - The repository shall have a definition of each AIP that is adequate for long-term preservation, enabling the identification and parsing of all the required components within that AIP. #

Response #

For objects, SP designed a single class of AIP to be adequate for long-term preservation. SP implements the AIP through a combination of files and databases, linked by uniform identifiers. To support the preservation needs of the repository, the AIP includes a number of components:

  • A Content Data Object in PDF format or another well known, widely accepted format that supports long-term preservation and migration
  • Representation Information - contains information on the CDO’s file format, version, and a reference to its standard in a format registry
  • Packaging Information - SP’s version of METS holds the Descriptive Information and the Preservation Description Information
  • Descriptive Information - SP normalizes the Provider’s descriptive metadata and structural relationships to a version of the NLM Journal Archiving & Interchange Tag Set

Preservation Description Information:

  • Reference Information - URIs and other identifiers are stored for each article in the associated metadata
  • Provenance Information - uses PREMIS vocabulary to record events in the life of the AIP, at both the file and article levels, and to manage Representation Information
  • Context Information - information on how a CDO relates to other CDOs or to other conceptual entities is found primarily in the descriptive metadata contributed by a provider. SP generates some context information as a part of the preservation metadata
  • Fixity Information - checksums generated at the time of ingest

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Definition of AIP
  2. Metadata Specifications

4.2.2 - The repository shall have a description of how AIPs are constructed from SIPs. #

Response #

SP has a documented process for constructing AIPs from SIPs. Please see the Workflow Charts for an overview of the process.

The descriptive and structural metadata supplied by the Provider are normalized to a version of the NLM Journal Archiving & Interchange Tag Set. The normalization actions and metadata are included in the preservation metadata generated by SP and stored separately from the content object.

The details of transformation and normalization are specific to each Provider, though the process is the same.

Content Note - Journals #

For most journal deposits into the repository, the Content Data Object format is unchanged from the SIP to the AIP. The content object’s format would only be transformed if its current format was in danger of obsolescence.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Workflow Charts

4.2.3 - The repository shall document the final disposition of all SIPs. #

Response #

Error logs are kept that record the SIPs that were not ingested properly and the reasons for their failure. SIPs that are correctly ingested are kept indefinitely and backed up every 60 days.

Responsibility #

  • Digital Preservation Librarian
  • System and Web Development Analyst

Documents #

  1. Error logs (available on request)
  2. Workflow Charts

4.2.3.1 - The repository shall follow documented procedures if a SIP is not incorporated into an AIP or discarded and shall indicate why the SIP was not incorporated or discarded. #

Response #

SP uses loader scripts to automatically and consistently ingest SIPs and transform them into AIPs. If there is an error during this process, the loader records the error in a log and sends an email to SP staff. If staff cannot fix the error, the repository deletes the SIP and notifies the Provider. All SIPs are recorded in the FTP download log file.
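The error-handling path described above can be sketched roughly as follows. This is an illustration only, not SP's loader code: the names `IngestLogs`, `process_sip`, `load`, and `staff_can_fix` are hypothetical, and the logs and emails are modelled as in-memory lists rather than real files or messages.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class IngestLogs:
    """In-memory stand-ins for the download log, error log, and outgoing email."""
    downloaded: List[str] = field(default_factory=list)
    errors: List[Tuple[str, str]] = field(default_factory=list)
    emails: List[str] = field(default_factory=list)


def process_sip(sip_id: str,
                load: Callable[[str], None],
                staff_can_fix: Callable[[str], bool],
                logs: IngestLogs) -> str:
    """Run one SIP through a loader and return its final disposition."""
    logs.downloaded.append(sip_id)  # every SIP is recorded on receipt
    try:
        load(sip_id)  # hypothetical loader: transforms the SIP into an AIP
        return "ingested"
    except Exception as exc:
        logs.errors.append((sip_id, str(exc)))  # record the error
        logs.emails.append(f"staff: {sip_id} failed: {exc}")  # alert SP staff
    if staff_can_fix(sip_id):
        return "ingested after fix"
    # Unfixable: the SIP is deleted and the Provider is notified.
    logs.emails.append(f"provider: {sip_id} deleted")
    return "deleted"
```

Whatever the outcome, the SIP appears in the download record, so the final disposition of each SIP can be reconstructed from the logs.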

Responsibility #

  • Software Developer
  • Digital Preservation Librarian
  • System and Web Development Analyst

Documents #

  1. Provider Agreement
  2. Workflow Charts
  3. System Logs (available on request)

4.2.4 - The repository shall have and use a convention that generates persistent, unique identifiers for all AIPs. #

Response #

SP uses a systematic convention to generate unique and unambiguous identifiers for AIPs within the repository. This process creates a stable name and persistent reference for every object. Please see the URI & File Naming Plan for details. An identifier may change if the Provider replaces an article with a new version of the same article. Please see 4.2.4.1 for more information.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. URI & File Naming Plan

4.2.4.1 - The repository shall uniquely identify each AIP within the repository. #

Response #

SP’s AIPs are identified by their unique URIs.

Content Note - Journals #

For journal article AIPs, SP URIs are consistently constructed in the following manner:

  • /<ISSN>/v<volume number>i<issue number padded to four digits>/<article hash>
  • The article hash is generated by concatenating the starting page number of the article, an underscore character, the first letter of each of the first six words in the article title, and the first letter of each of the last six words in the article title. If the title has too few words to meet this specification, the first letter of every word in the title is used.
  • In the case of a collision, an underscore (_) and a sequential number beginning with one are appended to the generated URI. This number increments for each duplicate URI.
  • In the case of a replacement article, the new copy of an article will supersede the old and claim the original identifier. The old copy of the article will retain the original identifier with _old1 added onto the end. If another newer copy replaces the previous one, the previous will retain the identifier with _old2 added.

Please see URI & File Naming Plan for complete details.
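The construction rules above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation. The function names are hypothetical, and several details are assumptions inferred from the published example URI rather than stated rules: the ISSN hyphen is dropped, letters are lowercased, punctuation is ignored when taking a word's first letter, and "not enough words" is taken to mean fewer than twelve.

```python
def _initial(word: str) -> str:
    """Return the first alphanumeric character of a word, lowercased."""
    for ch in word:
        if ch.isalnum():
            return ch.lower()
    return ""


def article_hash(start_page, title: str) -> str:
    """Build <start page>_<title initials> following the rules above."""
    words = [w for w in title.split() if _initial(w)]
    # Assumption: titles with fewer than twelve words use every word.
    selected = words[:6] + words[-6:] if len(words) >= 12 else words
    return f"{start_page}_" + "".join(_initial(w) for w in selected)


def article_uri(issn: str, volume, issue, start_page, title: str) -> str:
    """Build /<ISSN>/v<volume>i<issue padded to four digits>/<article hash>."""
    issn = issn.replace("-", "")  # assumption: hyphen removed, as in the example
    return f"/{issn}/v{volume}i{int(issue):04d}/{article_hash(start_page, title)}"


def resolve_collision(uri: str, existing: set) -> str:
    """Append _1, _2, ... until the URI is not already in use."""
    if uri not in existing:
        return uri
    n = 1
    while f"{uri}_{n}" in existing:
        n += 1
    return f"{uri}_{n}"
```

Applied to the Cell article used as an example in 4.2.4.1.1 (ISSN 0092-8674, volume 137, issue 2, starting page 308), `article_uri` produces `/00928674/v137i0002/308_aceldcgibrem`, which matches the parent object URI listed there.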

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. URI & File Naming Plan

4.2.4.1.1 - The repository shall have unique identifiers. #

Response #

SP creates unique URIs for each AIP, as well as unique URIs for each file that comprises the AIP. Please refer to 4.2.4.1 and the URI & File Naming Plan for the creation of URIs for AIPs.

SP URIs for individual files and events are consistently constructed in the following manner:

  • The URI of the parent object and a hash generated from the current date/time are concatenated. In this way, there should be no chance of collisions.

Content Note #

PDF files are identified by adding “pdf_fulltext” after the parent object URI and then the current date/time hash.

XML files are identified by adding “xml_fulltext” after the parent object URI and then the current date/time hash.

Examples: #

  • Parent object: A C. elegans LSD1 Demethylase Contributes to Germline Immortality by Reprogramming Epigenetic Memory - Cell (April 2009), 137 (2), pg. 308-320
  • Parent object URI: /00928674/v137i0002/308_aceldcgibrem
  • PDF Fulltext: /00928674/v137i0002/308_aceldcgibrem/pdf_fulltext/1303399300907
  • XML Fulltext: /00928674/v137i0002/308_aceldcgibrem/xml_fulltext/130339930294
  • Other files: /00928674/v137i0002/308_aceldcgibrem/1303399302709
  • Events: /00928674/v137i0002/308_aceldcgibrem/1303399314628
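The pattern in these examples can be sketched as follows. The function name `component_uri` is hypothetical, and the sketch assumes the date/time hash is a Unix timestamp in milliseconds. The sample values above resemble such timestamps, but the source does not state how the hash is derived.

```python
import time
from typing import Optional


def component_uri(parent_uri: str,
                  kind: Optional[str] = None,
                  now_ms: Optional[int] = None) -> str:
    """Concatenate the parent URI, an optional file kind, and a time-based suffix.

    kind is "pdf_fulltext" or "xml_fulltext" for full-text files and None
    for other files and events.
    """
    if now_ms is None:
        # Assumption: millisecond Unix time stands in for the date/time hash.
        now_ms = int(time.time() * 1000)
    parts = [parent_uri.rstrip("/")]
    if kind:
        parts.append(kind)
    parts.append(str(now_ms))
    return "/".join(parts)
```

Passing `now_ms` explicitly makes the function deterministic for testing. In normal use it would be left unset so that each call draws on the current time.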

Responsibility #

  • Digital Preservation Librarian
  • Software Developer

Documents #

  1. URI & File Naming Plan

4.2.4.1.2 - The repository shall assign and maintain persistent identifiers of the AIP and its components so as to be unique within the context of the repository. #

Response #

All of the AIPs retain their persistent identifiers for their entire life cycle in order to ensure that every AIP is unique in the repository. Please see the URI & File Naming Plan for details. The only alteration made to an identifier is in the case of object replacement. Please see 4.2.4.1 for more information.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. URI & File Naming Plan

4.2.4.1.3 - Documentation shall describe any processes used for changes to such identifiers. #

Response #

SP has a systematic process for changing identifiers. Changes are rare. The only situation encountered to date involves the replacement of an object with a new version of the same object at the Provider’s request. The new copy of an object supersedes the old and claims the original identifier. The old copy retains the original identifier with _old1 appended to the end. In the case of subsequent replacements, the identifier of the replaced object has _old<X> appended, where <X> is the next available integer.
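A minimal sketch of this replacement rule is given below. It is an illustration under stated assumptions: the function name `replace_object` is hypothetical, and the repository is modelled as a plain dictionary from identifier to stored copy rather than SP's actual storage.

```python
def replace_object(registry: dict, uri: str, new_copy) -> str:
    """Give new_copy the original identifier and retire the previous copy.

    The previous copy is re-filed under <uri>_old<X>, where X is the next
    unused integer. Returns the identifier assigned to the retired copy.
    """
    n = 1
    while f"{uri}_old{n}" in registry:
        n += 1
    retired_id = f"{uri}_old{n}"
    registry[retired_id] = registry[uri]  # old copy keeps the URI plus a suffix
    registry[uri] = new_copy              # new copy claims the original identifier
    return retired_id
```

After two successive replacements of the same object, the registry holds the current copy under the original identifier, the first copy under `_old1`, and the second under `_old2`.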

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. URI & File Naming Plan

4.2.4.1.4 - The repository shall be able to provide a complete list of all such identifiers and do spot checks for duplications. #

Response #

SP can provide a complete list of all identifiers upon request. Spot checks are not necessary because the automatic identifier generation process does not allow duplication. Please see 4.2.4.1 and 4.2.4.1.1 for more information about the creation of unique identifiers.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. URI & File Naming Plan

4.2.4.1.5 - The system of identifiers shall be adequate to fit the repository’s current and foreseeable future requirements such as numbers of objects. #

Response #

In order to accommodate future growth, SP designed its URI system to be flexible and extensible. There is no foreseeable limit to the set of URIs or the length of any individual URI. Please see the URI & File Naming Plan for more information.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. URI & File Naming Plan

4.2.4.2 - The repository shall have a system of reliable linking/resolution services in order to find the uniquely identified object, regardless of its physical location. #

Response #

The AIP contains a unique and persistent link to every component file regardless of location on disk. Please refer to Metadata Specifications for details.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Metadata Specifications
  2. URI & File Naming Plan

4.2.5 - The repository shall have access to necessary tools and resources to provide authoritative Representation Information for all of the digital objects it contains. #

Response #

SP gathers all file format information for each file using authoritative and trusted tools. This information includes details about the file format, including a link to a format registry. Please see 4.2.5.1, 4.2.5.2, 4.2.5.3, and 4.2.5.4 for details.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Registry of File Formats

4.2.5.1 - The repository shall have tools or methods to identify the file type of all submitted Data Objects. #

Response #

SP uses a number of tools, run through the FITS software package, to identify and validate the file formats contained within the SIP. The primary tools used through FITS are:

  • DROID: identifies formats during the SIP ingestion process. Where possible, the file is linked to the format’s entry in PRONOM, the format registry of The National Archives of the United Kingdom.
  • JHOVE: used for further format-specific identification, validation, and characterization of the file.

The FITS package also contains a number of other applications to generate additional metadata regarding the file. For a full description of the software, please see the project page.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Registry of File Formats
  2. Definition of SIP

4.2.5.2 - The repository shall have tools or methods to determine what Representation Information is necessary to make each Data Object understandable to the Designated Community. #

Response #

SP’s definition of requisite Representation Information is directly informed by input from the Designated Community. This input comes through formal standing committees, advisory groups, and user feedback. Furthermore, SP carries out extensive usability testing, through which additional insights may come to light.

Responsibility #

  • Digital Preservation Librarian
  • Metadata Librarian

Documents #

  1. Designated Community
  2. Organizational Chart

4.2.5.3 - The repository shall have access to the requisite Representation Information. #

Response #

The identification tools that SP uses to generate the requisite Representation Information are open source and widely available. They require no inputs other than the files being ingested. Please see 4.2.5.1 for an overview of the tools.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Registry of File Formats

4.2.5.4 - The repository shall have tools or methods to ensure that the requisite Representation Information is persistently associated with the relevant Data Objects. #

Response #

Representation Information for each file is generated through SP’s use of the FITS software package. This Representation Information is stored in the SP AIP Preservation Metadata. The Preservation Metadata also contains references to the location of the data object stored in a filesystem on disk.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Metadata Specifications

4.2.6 - The repository shall have documented processes for acquiring Preservation Description Information (PDI) for its associated Content Information and acquire PDI in accordance with the documented processes. #

Response #

SP PDI comes from two sources: the Provider and the repository’s internal processes. In both cases the PDI is stored in the preservation metadata which contains a link to Content Information.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Workflow Charts
  2. Definition of AIP

4.2.6.1 - The repository shall have documented processes for acquiring PDI. #

Response #

Below are the components included in SP’s Preservation Description Information:

  • Reference Information - Identifiers are stored for each article identifying it globally (e.g. DOI) and locally (e.g. URI). Global identifiers are generally provided by the Provider while local identifiers are generated locally.
  • Provenance Information - Provenance metadata is generated locally for each object. It provides a history of preservation events in the object’s lifetime, beginning at ingest into the SP repository and referencing any preservation activities taken on the object (e.g., replacement due to corruption, format migration, etc.).
  • Context Information - Context metadata is generated locally or supplied by the Provider. This metadata describes relationships between the CDO and other CDOs in the repository. Examples of these relationships can include: a newer version of a document that supersedes an older one, or a journal article that is a part of a journal issue.
  • Fixity Information - Fixity information is generated locally at the time of ingest in order to later determine whether or not the item remains in the same state as when it was ingested. This information can be used to determine integrity of an object being copied within the system (as in the case of a change in storage location), or for periodic integrity checks.
  • Access Rights Information - Access rights information is generated locally based on Provider licensing terms as negotiated between SP and the Provider.

For more information on the procedures used to generate or gather this metadata, see Workflow Charts.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Workflow Charts

4.2.6.2 - The repository shall execute its documented processes for acquiring PDI. #

Response #

SP acquires and formats PDI automatically during ingest. For more information on the procedures used to generate or gather this metadata, see Workflow Charts.

SP runs each file through the FITS software package. The output created by FITS is then run through a style sheet that formats the file information into PREMIS.
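The kind of mapping this step performs can be sketched as follows. SP uses a style sheet for the transformation; the sketch below swaps in plain Python with the standard library's ElementTree, purely for illustration. The function name `premis_format` and its parameters are hypothetical, and the input values stand in for what a format identification tool might report. The element names follow the PREMIS semantic units for format information, with XML namespaces omitted for brevity.

```python
import xml.etree.ElementTree as ET


def premis_format(name: str, version: str,
                  registry_name: str, registry_key: str) -> ET.Element:
    """Build a PREMIS-style <format> element from identification results."""
    fmt = ET.Element("format")
    designation = ET.SubElement(fmt, "formatDesignation")
    ET.SubElement(designation, "formatName").text = name
    ET.SubElement(designation, "formatVersion").text = version
    registry = ET.SubElement(fmt, "formatRegistry")
    ET.SubElement(registry, "formatRegistryName").text = registry_name
    ET.SubElement(registry, "formatRegistryKey").text = registry_key
    return fmt


# Illustrative values: a PDF 1.4 file linked to its PRONOM entry.
element = premis_format("Portable Document Format", "1.4", "PRONOM", "fmt/18")
xml_text = ET.tostring(element, encoding="unicode")
```

The registry key is what ties the file's Representation Information back to an external format registry entry.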

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Workflow Charts

4.2.6.3 - The repository shall ensure that the PDI is persistently associated with the relevant Content Information. #

Response #

SP stores Preservation Description Information in each AIP’s Preservation Metadata. PDI is persistently associated with Content Information by a link to the location of the relevant object(s) on disk and, if necessary, a link to information stored in the MarkLogic database.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Metadata Specifications
  2. Definition of AIP

4.2.7 - The repository shall ensure that the Content Information of the AIPs is understandable for their Designated Community at the time of creation of the AIP. #

Response #

SP conceives of the “understandability” of Content Information by its Designated Community on three different levels: Intellectual Understandability, Usability, and Accessibility.

  • Intellectual Understandability - SP ingests Content Information based on demand and direction from OCUL member institutions. For this reason, SP relies on OCUL members to select content that is useful and understandable to faculty, researchers, and students. SP receives ongoing and extensive feedback from its Designated Community and will work with libraries to resolve understandability issues.
  • Usability - As explained in the Preservation Implementation Plan, SP is committed to using file formats that support long-term usability. In general, the considerations for selecting file formats include the “openness” of the file format, its level of support as a preservation format in the scholarly community, its uptake among SP’s Designated Community, and its suitability for later format migration. SP continuously monitors developments in file formats to determine if and when formats require migration (see Environmental Monitoring of Preservation Formats).
  • Accessibility - In order to understand the Content Information, the Designated Community must be able to access content. SP works to provide accessibility for the Designated Community to all of its material. SP tests its user interface for compatibility with a variety of web browsers and operating systems. The repository is currently establishing an accessibility program for people with special needs.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Designated Community Definition
  2. Preservation Implementation Plan
  3. Environmental Monitoring of Preservation Formats

4.2.7.1 - Repository shall have a documented process for testing understandability for their Designated Communities of the Content Information of the AIPs at their creation. #

Response #

SP has an active and diverse Designated Community, composed of librarians, faculty, researchers, and students, who provide feedback about the intellectual understandability of Content Information. SP receives direct feedback from the Feedback Forum located on the repository’s user interface and from a ‘contact support’ email address. SP reviews all feedback from users about the repository and, where possible, works to resolve understandability issues. SP receives indirect feedback from users through ongoing communication with OCUL member institutions.

To support usability, SP has a DTD and documentation describing what an acceptably formatted object looks like. In addition, SP processes each file to generate format identification and validation. The output created by this processing is then run through a style sheet that formats the information into PREMIS metadata. This metadata is stored with the content object and used by SP to ensure understandability of the contents in the repository.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Designated Community Definition
  2. Preservation Implementation Plan
  3. Environmental Monitoring of Preservation Formats

4.2.7.2 - The repository shall execute the testing process for each class of Content Information of the AIPs. #

Response #

Examples of feedback from the Designated Community can be found on the Feedback Forum located on the repository’s user interface as well as in user emails sent directly to SP staff.

The preservation metadata gathered within the AIP can be used to evaluate the composition of the collections in SP, in order to identify those objects which are at risk of not being usable, understandable, or accessible by the Designated Community.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Designated Community Definition
  2. Metadata Specifications

4.2.7.3 - The repository shall bring the Content Information of the AIP up to the required level of understandability if it fails the understandability testing. #

Response #

In concert with OCUL IR and various appointed committees that tell SP what the collection should look like, SP collects material in accordance with collection development best practices. In the end, the scope of SP’s collection is determined by the Designated Community through the Collection Policy.

All Provider-produced metadata is normalized to a standard acceptable to SP and the Designated Community. If a Content Information format within the repository is no longer understandable by the Designated Community, the Content Information will be migrated to an understandable format agreed upon through consultation with the Designated Community.

SP will continue working to provide accessibility for the Designated Community to all of its material in accordance with its Preservation Strategic Plan. The repository is currently establishing an accessibility program for people with special needs.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Designated Community Definition
  2. Collection Policy
  3. Preservation Strategic Plan

4.2.8 - The repository shall verify each AIP for completeness and correctness at the point it is created. #

Response #

Based on the license signed by the Provider and initial testing (including the receipt of test SIPs) to determine how a given Provider’s SIP is structured, a loader script is written for each Provider that determines the manner in which the Provider’s SIP will be transformed into a corresponding AIP. The loader program uses the determined structure of the Provider’s SIP, along with an XML schema definition of the SP AIP structure to create a properly formed AIP. In the event that this transformation fails or encounters errors, these are logged.

Responsibility #

  • Digital Preservation Librarian
  • Software Developer

Documents #

  1. Workflow Charts

4.2.9 - The repository shall provide an independent mechanism for verifying the integrity of the repository collection/content. #

Response #

SP maintains logs of the ingest process to ensure that everything that is received as part of a Provider’s dataset is ingested into the repository. See 4.2.1, 4.2.2, 4.2.3, 4.2.4, and 4.1.8.

SP provides mechanisms to ensure that the content held in the repository remains uncorrupted through the use of regular fixity checks. See Fixity Check Procedures.
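A fixity check of this kind can be sketched as below. This is a generic illustration, not the procedure documented in Fixity Check Procedures: the function names are hypothetical, and SHA-256 is an assumption, since the checksum algorithm is not named here.

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha256",
                  chunk_size: int = 1 << 20) -> str:
    """Compute a hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: str, recorded: str, algorithm: str = "sha256") -> bool:
    """Compare the current digest with the value recorded at ingest."""
    return file_checksum(path, algorithm) == recorded
```

The checksum recorded at ingest serves as the reference value. A later mismatch indicates that the stored file has changed since it was ingested.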

While SP has a record of everything it holds and can generate a list of it, the repository can only ingest the content that a Provider chooses to send. Without a complete list from the Provider to check against, it is difficult to know whether this represents the entirety of the collection. Gaps can be identified through feedback from the Designated Community. Additionally, SP is a member of the Keepers Registry, which can help identify the scope of preservation for journal content.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Fixity Check Procedures

4.2.10 - The repository shall have contemporaneous records of actions and administration processes that are relevant to AIP creation. #

Response #

During the ingest process, the repository automatically records all SIPs in a downloaded file log. Any problems that occur during ingest are recorded in an error log and a Publisher problem directory. All other events that occur to the file while in the repository are recorded in the Preservation Metadata, which also holds the direct location link to the content object. All such events are recorded at the time they occur.

Responsibility #

  • Digital Preservation Librarian

Documents #

  1. Metadata Specifications
  2. Workflow Charts