Quality Control Specifications

1. Policy Statement #

Scholars Portal is committed to maintaining the integrity of the digital objects within the repository.

1.1. Quality control standards #

  • Upon ingest, Scholars Portal requires that publishers provide their content in PDF or XML format, and descriptive metadata in XML or SGML (in a format agreed upon between SP and the publisher).
  • Every time a digital object is moved during the ingest process, a fixity check is performed. This ensures that the file has been transferred correctly and has not become corrupted in the process.
  • Any errors are recorded automatically in an error log, as well as in the publisher problem directory. A notification of error is emailed immediately to the metadata librarian, and the errors are investigated and corrected as soon as possible.
  • Descriptive metadata is normalized to an SP-specific profile of the NLM Journal Archiving and Interchange Tag Suite (JATS). The transformed metadata is validated against a DTD to ensure its compliance.
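The fixity check described above can be sketched as follows. This is a minimal illustration, not SP's actual implementation; the choice of MD5 as the checksum algorithm is an assumption, since the policy does not name one.

```python
import hashlib

def checksum(path, algorithm="md5", chunk_size=8192):
    """Compute a file checksum in fixed-size chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fixity_check(source, destination, algorithm="md5"):
    """Return True if the object survived the move intact."""
    return checksum(source, algorithm) == checksum(destination, algorithm)
```

A mismatch would trigger the error-logging and notification steps described above.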

1.2. Organizational responsibility #

Please refer to the Scholars Portal Roles and Responsibilities document for delineation of responsibilities by staff member.

Please refer to the Scholars Portal Organizational Chart for the structure of the organization.

2. Implementation Examples #

2.1. Procedures for data integrity testing #

2.1.1. Test during pull script #

  • After a new dataset is saved into the Ejournals FTP in Pillar, it is retrieved and the file size is compared to that of the original copy held in the publisher FTP server.
  • If the file size matches, the script records the file name, size, and current date in the FTP downloaded log file.
  • If the file size does not match, the script sets the error flag and increments the try count. If the try count reaches three with the error flag still set, the file is deemed corrupt and a notification is emailed to loaders@scholarsportal.info.
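The size-comparison and retry logic above can be sketched as follows. The `refetch` callback and the tab-separated log format are hypothetical stand-ins; the real pull script re-downloads the file over FTP.

```python
import os
from datetime import date

MAX_TRIES = 3  # the try-count limit described above

def record_download(path, log_path):
    """Append file name, size, and current date to the FTP downloaded log."""
    with open(log_path, "a") as log:
        log.write(f"{os.path.basename(path)}\t{os.path.getsize(path)}\t{date.today()}\n")

def verify_download(local_path, remote_size, log_path, refetch, tries=MAX_TRIES):
    """Compare local file size to the size on the publisher FTP server,
    re-fetching on mismatch; return False once the tries are exhausted."""
    for _ in range(tries):
        if os.path.getsize(local_path) == remote_size:
            record_download(local_path, log_path)
            return True
        refetch(local_path)  # hypothetical callback that re-downloads the file
    return False  # file deemed corrupt; caller emails loaders@scholarsportal.info
```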

See the Pull Script Detail diagram for more details.

2.1.2. Test during the preparation of datasets #

  • New datasets are retrieved from the Ejournals FTP in Pillar and decompressed.
  • If decompression is successful, the script checks for the file name in the publisher error log and removes the file name from the publisher problem directory if it exists.
  • If there is an error during decompression, the script writes the file name to the publisher error log and moves the error file to the publisher problem directory. The zip file information is then emailed to JIRA.

See the Prepare Datasets Detail diagram for more details.

2.1.3. Test during ejournals loader #

  • New datasets are converted from the publisher XML/SGML into NLM XML, given a URI, and inserted into the ejournals database. Any errors are recorded in a log file.
  • Either the SP programmer or the metadata librarian manually checks the log file for errors that occurred during each dataset's conversion and insertion into the ejournals database.
  • If the conversion or insertion failed, the SP programmer or metadata librarian investigates the error log file to determine where the error occurred.
  • If the error is due to a publisher problem, the publisher reloads the SIPs.
  • If the error is due to a loader problem, a reworking of the loader script is necessary.
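The manual log check above can be sketched as a simple scan. The "ERROR" marker is an assumption; the actual loader log format is not specified in this document.

```python
def collect_loader_errors(log_path, marker="ERROR"):
    """Return log lines recording a conversion or insertion failure,
    for the SP programmer or metadata librarian to investigate."""
    with open(log_path) as log:
        return [line.rstrip("\n") for line in log if marker in line]
```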

See the Manual Log File Check diagram for more details.

2.2. Control of incoming data #

  • Depending on the publisher, incoming data is either pulled or pushed from the publisher FTP into SP's ejournals FTP in Pillar.
  • If it is pushed, the publisher sends new content to a location provided by SP and notifies the SP programmers or the metadata librarian.
  • If it is pulled in by SP's pull script, the process runs daily at 11 a.m.
  • The pull script loads configuration properties of the publisher and connects to the publisher FTP server using the FTP username and password.
  • The script retrieves the file names from the publisher FTP server and compares them to names in the FTP downloaded log file to look for new file names.
  • If a new file name is found in the publisher FTP server, the script creates a new dataset and saves it in the ejournals FTP in Pillar.
  • SP ingests new files at a controlled rate.
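The new-file detection step above can be sketched as follows; the tab-separated log layout (name, size, date) is an assumption based on the record described earlier.

```python
def find_new_files(remote_names, downloaded_log):
    """Return publisher FTP file names not yet in the FTP downloaded log."""
    try:
        with open(downloaded_log) as log:
            seen = {line.split("\t")[0] for line in log if line.strip()}
    except FileNotFoundError:
        seen = set()  # first run: nothing has been downloaded yet
    return [name for name in remote_names if name not in seen]
```

Each name returned would then be fetched and saved as a new dataset in the ejournals FTP in Pillar.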

See the Pull Script Detail diagram for more details.

2.3. Explanation of repository workflow #

See the series of workflow charts.

2.4. Metadata fields with quality control information #

  • objectCharacteristics – contains technical properties of a file or bitstream that are applicable to all or most formats. It includes the following:
      • fixity – information used to verify whether an object has been altered in an undocumented or unauthorized way
      • size – size in bytes of the file or bitstream stored in the repository
      • format – identification of the format of a file or bitstream, where format is the organization of digital information according to preset specifications
  • storage – information about how and where a file is stored in the storage system
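As an illustration, a PREMIS-style fragment covering the fixity, size, and format units might look like the following; the element names follow the PREMIS data dictionary, and the values are hypothetical.

```xml
<objectCharacteristics>
  <fixity>
    <messageDigestAlgorithm>MD5</messageDigestAlgorithm>
    <messageDigest>9a0364b9e99bb480dd25e1f0284c8555</messageDigest>
  </fixity>
  <size>102400</size>
  <format>
    <formatDesignation>
      <formatName>application/pdf</formatName>
    </formatDesignation>
  </format>
</objectCharacteristics>
```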

See the Metadata Specifications for more details.

Review Cycle #

Ongoing