Overview of Digital Materials Processing

An overview of how we process and store digitized and born-digital materials

Digital Materials storage

Preservation storage for digital materials is on Lincoln at \\Lincoln\Masters\Archives.

SPE and Systems staff should have read-only access, and the railsdev server (ESPYdev.svc user) should be the only user with write access.

SIP and AIP packages

\\Lincoln\Masters\Archives contains subfolders for Submission Information Packages (SIP) and Archival Information Packages (AIP). These are modeled in spirit after the Open Archival Information System (OAIS) framework.

  • SIP packages keep materials stable after ingest and prior to processing.

  • AIP packages are stored permanently. SIPs become AIPs after processing.

→ There are some other directories in \\Lincoln\Masters\Archives, like arrangements and logs. There were attempts to automate digital records transfers from different campus offices using the transfer folders (\\Lincoln\Library\UA200), but it didn't seem worth it in the end. This will be cleaned up at some point.

SIP and AIP packages are not specified as well as they should be, but they are BagIt bags as defined in RFC 8493.

  • All SIPs and AIPs are arranged in folders named for their collection ID, and are named with the collection ID, an underscore (_), and an identifier, such as "apap014_B3k7PbpHHBznBJX8FNwsqb"

  • All packages have a data subfolder containing the package contents

  • AIPs also have a metadata subfolder that contains metadata and working documents like spreadsheets that were used during processing
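The naming and layout rules above can be sketched in a few lines of Python. This is only an illustration, not the Processing app's code: the function names are hypothetical, the "SIP" and "AIP" subfolder names are assumed (the exact folder names are not given here), and it assumes collection IDs never contain an underscore.

```python
from pathlib import PureWindowsPath

# Preservation storage root named in this article.
ARCHIVES_ROOT = PureWindowsPath(r"\\Lincoln\Masters\Archives")


def parse_package_name(name):
    """Split 'collectionID_identifier' at the first underscore.

    Assumption: collection IDs do not themselves contain underscores.
    """
    collection_id, sep, identifier = name.partition("_")
    if not sep or not collection_id or not identifier:
        raise ValueError(f"not a package name: {name!r}")
    return collection_id, identifier


def package_path(kind, name):
    """Build <root>/<kind>/<collection ID>/<package name>.

    Assumption: the SIP and AIP subfolders are literally named "SIP" and "AIP".
    """
    if kind not in ("SIP", "AIP"):
        raise ValueError(f"kind must be 'SIP' or 'AIP', got {kind!r}")
    collection_id, _ = parse_package_name(name)
    return ARCHIVES_ROOT / kind / collection_id / name


def expected_subfolders(kind):
    """Every package has data; AIPs also carry a metadata subfolder."""
    return ("data", "metadata") if kind == "AIP" else ("data",)
```

For example, parse_package_name("apap014_B3k7PbpHHBznBJX8FNwsqb") splits the name into the collection ID "apap014" and the identifier "B3k7PbpHHBznBJX8FNwsqb".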

Python code for creating and managing them is built into the Processing app here: https://github.com/UAlbanyArchives/processing/tree/main/utilities/packages
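Because the packages are BagIt bags, their payload can be checked against the bag's manifest. The following is a standard-library sketch of that idea, not the code from the repository above. The verify_manifest name is hypothetical, it assumes a manifest-sha256.txt file is present, and it ignores the percent-encoding that BagIt applies to unusual characters in file paths.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, algorithm="sha256"):
    """Return the manifest paths that are missing or fail their checksum.

    Each manifest line is '<checksum> <path relative to the bag root>'.
    """
    bag_dir = Path(bag_dir)
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        # Reads the whole file into memory; fine for a sketch,
        # but large files would be better hashed in chunks.
        digest = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if digest != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list means every file listed in the manifest exists and matches its recorded checksum.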

Digitizing materials

Collections materials on paper are scanned using either:

Other analog materials can be digitized with appropriate equipment, but should be ingested and processed using the same workflow.

Processing app

This is a Dockerized Flask app which is designed to be a web interface wrapper for automating processing workflows.

The Processing app automates processing functions such as:

Instructions on running the app are in the processing GitHub repository.

Ingesting materials

All collection materials, including all digitized materials and born-digital transfers, must be ingested as soon as feasible. This creates a second secure copy and a standard processing package so that later steps can be automated.

→ Ingesting digitized materials

Born-digital materials that are new transfers must also be documented in ArchivesSpace accession records.

→ Accessioning Born-Digital materials

Processing packages

Processing packages are copies of SIPs that live in \\Lincoln\Library\SPE_Processing\backlog

After digital records are packaged into a SIP, archivists need to process them and work directly with the files. The ingest step creates a working copy package in \\Lincoln\Library\SPE_Processing\backlog. These packages have subfolders for:

  • masters (an exact copy of the files that were ingested)

  • derivatives (an empty directory to create derivative files in)

  • metadata (an empty directory to work in)
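The working-copy layout above could be built roughly as follows. This is a sketch under assumptions, not the ingest step itself: create_processing_package is a hypothetical name, the working package is assumed to share the SIP's name, and masters is assumed to be a copy of the SIP's data subfolder.

```python
import shutil
from pathlib import Path


def create_processing_package(sip_dir, backlog_dir):
    """Create a working copy of a SIP with masters, derivatives and metadata."""
    sip_dir = Path(sip_dir)
    # Assumption: the working package reuses the SIP's folder name.
    package = Path(backlog_dir) / sip_dir.name
    # masters is an exact copy of the ingested files (assumed to be SIP/data).
    shutil.copytree(sip_dir / "data", package / "masters")
    # derivatives and metadata start out empty.
    (package / "derivatives").mkdir()
    (package / "metadata").mkdir()
    return package
```

Because shutil.copytree refuses to write into an existing destination by default, running this twice for the same SIP raises an error instead of silently overwriting a package someone is already working in.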

Processing digital records will always require manual steps, but certain actions can be automated. This setup creates directories where archivists have the access and flexibility to manipulate files manually as needed, while staying consistent enough that scripts can automate repeated processes.

→ Keep original files in the masters folder. If other versions are needed for access, put them in the derivatives folder.

Any working files for the package should go in the \metadata folder. This includes asInventory spreadsheets for describing items in ArchivesSpace, and metadata upload spreadsheets for Hyrax. The metadata directory later gets packaged into the AIP so we have a complete record of all the metadata work we did.

Creating Derivatives

A common process is creating derivatives for digital materials, and the Processing app automates common conversions. This might mean creating compressed JPGs or PDFs for access from PNG or TIFF preservation files. For born-digital records, we might create derivatives that are better for preservation. If we get WordPerfect files, for example, we might also create more modern versions of the files.

The separate masters/ and derivatives/ subfolders in the working packages are for managing this. After ingest, derivatives/ is empty, but archivists can optionally make derivative versions of files and place them there so that they'll be included in the AIP. Often the directory structure is repeated in both, so we can associate original files and derivatives.
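Repeating the directory structure in both folders makes it possible to compute where a derivative belongs from the path of its original. A minimal sketch of that mapping follows; derivative_path is a hypothetical helper, not part of the Processing app.

```python
from pathlib import Path


def derivative_path(master_file, package_dir, new_suffix):
    """Map a file under masters/ to the matching path under derivatives/.

    The subfolder structure is kept and only the file extension changes.
    """
    package_dir = Path(package_dir)
    # Raises ValueError if master_file is not inside <package>/masters.
    relative = Path(master_file).relative_to(package_dir / "masters")
    return package_dir / "derivatives" / relative.with_suffix(new_suffix)
```

For example, a master at masters/box1/scan001.tif with new_suffix ".pdf" maps to derivatives/box1/scan001.pdf inside the same package, which keeps each original and its derivative easy to pair up.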

OCR

Another common process is to run OCR and embed text in PDFs, which is also automated by the Processing app. This lets users copy text out of a PDF, and Hyrax also extracts this text during upload so that users can search and discover items using it. Without OCR, text in documents will not be added to the Hyrax search index.

This process targets PDFs in the \derivatives folder and overwrites those files with versions that contain embedded text.

Arranging and Describing materials

All digital materials should be arranged and described in ArchivesSpace. This can be done manually, or in bulk.

For digitized materials, this step is usually already done.

Bulk arrangement and description of digital materials is done in spreadsheets using asInventory. This is the same tool we use to manage file or item-level description created from data entry.

→ asInventory spreadsheets should be managed in a package's \metadata folder

Uploading to Hyrax

After digital materials are ingested, arranged in ArchivesSpace, and any derivatives created, files can be uploaded to Hyrax.

In addition to adding files to Hyrax, this requires creating digital object records in ArchivesSpace that point to the URL in Hyrax.

Uploading Single Digital Objects to Hyrax

Bulk Hyrax Upload

→ after bulk upload, there is a second process to add Hyrax URLs to ArchivesSpace

Finalizing a Package into an AIP for Preservation Storage

The final processing step requires packaging the SIP together with the Processing Package into an AIP for long term preservation.

  • There is an option here to overwrite the original ingested files with the files in the \masters directory in the Processing Package

    • This is useful for when some materials are not retained during processing

  • Soon, this will also copy the AIP into additional redundant storage automatically
