An overview of how we process and store digitized and born-digital materials

Digital Materials storage

Preservation storage for digital materials is on Lincoln at \\Lincoln\Masters\Archives.

SPE and Systems staff should have read-only access, and the railsdev server (ESPYdev.svc user) should be the only user with write access.

SIP and AIP packages

\\Lincoln\Masters\Archives contains subfolders for Submission Information Packages (SIP) and Archival Information Packages (AIP). These are modeled in spirit after the Open Archival Information System (OAIS) framework.

SIP packages keep materials stable after ingest and before processing.
AIP packages are stored permanently. SIPs become AIPs after processing.

→ There are some other directories in \\Lincoln\Masters\Archives, like arrangements and logs. There were attempts to automate digital records transfers from different campus offices using the transfer folders (\\Lincoln\Library\UA200), but it didn't seem worth it in the end. This will be cleaned up at some point.

SIP and AIP packages are not specified as well as they should be, but they are BagIt bags according to RFC 8493.

All SIPs and AIPs are arranged in folders named for their collection ID, and are named with the collection ID, an underscore (_) and an identifier, such as "apap014_B3k7PbpHHBznBJX8FNwsqb"
All packages have a data subfolder containing the package contents
AIPs also have a metadata subfolder that contains metadata and working documents like spreadsheets that were used during processing
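
The naming and layout rules above can be sketched in a few lines of Python. This is only an illustration and not the code in the Processing app: the function names are hypothetical, it assumes collection IDs never contain an underscore, and it assumes the metadata subfolder sits next to data at the top level of the package.

```python
from pathlib import Path


def split_package_id(package_id: str) -> tuple[str, str]:
    """Split a package ID like 'apap014_B3k7...' into (collection ID, identifier).

    Assumes the collection ID itself contains no underscore.
    """
    collection_id, sep, identifier = package_id.partition("_")
    if not sep or not collection_id or not identifier:
        raise ValueError(f"Not a valid package ID: {package_id!r}")
    return collection_id, identifier


def package_path(root: Path, package_id: str) -> Path:
    """Return <root>/<collection ID>/<package ID>, where root is the SIP or AIP folder."""
    collection_id, _ = split_package_id(package_id)
    return root / collection_id / package_id


def missing_subfolders(package_dir: Path, is_aip: bool) -> list[str]:
    """List the expected subfolders that are absent from a package."""
    # Every package has data/; AIPs are assumed to also have metadata/ beside it.
    required = ["data", "metadata"] if is_aip else ["data"]
    return [name for name in required if not (package_dir / name).is_dir()]
```

For example, split_package_id("apap014_B3k7PbpHHBznBJX8FNwsqb") separates the collection ID apap014 from the identifier, and missing_subfolders can flag an AIP that was written without its metadata folder.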

Python code for creating and managing them is built into the Processing app here: https://github.com/UAlbanyArchives/processing/tree/main/utilities/packages

Digitizing materials

Collections materials on paper are scanned using either:

Bookeye overhead scanner (in process)
Sheet feed scanner

Other analog materials can be digitized with appropriate equipment, but should be ingested and processed using the same workflow.

Processing app

This is a Dockerized Flask app which is designed to be a web interface wrapper for automating processing workflows.

The Processing app automates processing functions such as:

Ingesting Digital Materials
Accessioning New Born-Digital Records
Creating Derivatives such as access PDFs and JPGs
OCRing PDFs
Automating connections between Hyrax and ArchivesSpace
Packaging AIPs for preservation storage

Instructions on running the app are in the processing GitHub repository.

Ingesting materials

All collection materials, including all digitized materials and born-digital transfers, must be ingested as soon as feasible. This creates a second secure copy and a standard processing package so that later steps can be automated.

→ Ingesting digitized materials

Born-digital materials that are new transfers must also be documented in ArchivesSpace accession records.

→ Accessioning Born-Digital materials

Processing packages

Processing packages are copies of SIPs that live in \\Lincoln\Library\SPE_Processing\backlog

After digital records are packaged into a SIP, archivists need to process them and work directly with the files. The ingest step creates a working copy package in \\Lincoln\Library\SPE_Processing\backlog. These packages have subfolders for:

masters (an exact copy of the files that were ingested)
derivatives (an empty directory to create derivative files in)
metadata (an empty directory to work in)
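
As a rough sketch of what the ingest step produces, the following Python creates a working package with these three subfolders. It is not the Processing app's actual code: the function name is hypothetical, and placing the package directly under the backlog folder by its package ID is an assumption.

```python
import shutil
from pathlib import Path


def create_working_package(sip_data: Path, backlog: Path, package_id: str) -> Path:
    """Create a working copy of a SIP's files for archivists to process.

    sip_data is the folder holding the ingested files; backlog is the
    processing backlog folder. Both layouts are assumptions in this sketch.
    """
    package_dir = backlog / package_id
    # Fail rather than overwrite if a package with this ID already exists.
    package_dir.mkdir(parents=True, exist_ok=False)
    # masters/ is an exact copy of the files that were ingested.
    shutil.copytree(sip_data, package_dir / "masters")
    # derivatives/ and metadata/ start out empty.
    (package_dir / "derivatives").mkdir()
    (package_dir / "metadata").mkdir()
    return package_dir
```

Refusing to overwrite an existing package is a deliberate choice here, since a working package may already contain manual work.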

Processing digital records will always require manual steps, but certain actions can be automated. This setup creates directories where archivists have the access and flexibility to manually manipulate files as needed, but it is also consistent enough to build scripts that automate repeated processes.

→ Keep original files in the masters folder and if other versions are needed for access, they go in the derivatives folder

Any working files for the package should go in the \metadata folder. This includes asInventory spreadsheets for describing items in ArchivesSpace, and metadata upload spreadsheets for Hyrax. The metadata directory later gets packaged into the AIP so we have a complete record of all the metadata work we did.

Creating Derivatives

A common process is creating derivatives for digital materials, and the Processing app automates common conversions. This might be creating compressed JPGs or PDFs for access from PNG or TIFF preservation files. For born-digital records, we might create derivatives that are better for preservation. If we get WordPerfect files, for example, we might also create more modern versions of the files.

The separate masters/ and derivatives/ subfolders in the working packages are for managing this. After ingest, derivatives/ is empty, but archivists can optionally make derivative versions of files and place them there so that they'll be included in the AIP. Often the directory structure is repeated in both, so we can associate original files and derivatives.
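
When the directory structure is repeated in both folders, originals and derivatives can be matched by relative path. A minimal sketch, assuming a derivative keeps the original's relative path and file name and differs only in its extension (the function name is hypothetical):

```python
from pathlib import Path


def pair_derivatives(package_dir: Path) -> dict[Path, list[Path]]:
    """Map each file in masters/ to the derivatives that mirror its path.

    Paths are relative to masters/ and derivatives/ respectively. Files are
    matched on relative path with the final extension removed.
    """
    masters = package_dir / "masters"
    derivatives = package_dir / "derivatives"

    # Index derivatives by their extension-less relative path.
    by_stem: dict[Path, list[Path]] = {}
    for path in sorted(derivatives.rglob("*")):
        if path.is_file():
            rel = path.relative_to(derivatives)
            by_stem.setdefault(rel.with_suffix(""), []).append(rel)

    # Look up each master; an empty list means no derivative was made.
    pairs: dict[Path, list[Path]] = {}
    for path in sorted(masters.rglob("*")):
        if path.is_file():
            rel = path.relative_to(masters)
            pairs[rel] = by_stem.get(rel.with_suffix(""), [])
    return pairs
```

Masters that map to an empty list have no derivative yet, which can be a quick way to see what still needs converting.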

OCR

Another common process is to run OCR and embed text in PDFs, which is also automated by the processing app. This is important not only so that users can copy text out of a PDF, but Hyrax also extracts this text during upload so that users can search and discover items using it. Without OCR, text in documents will not be added to the Hyrax search index.

This process targets PDFs in the \derivatives folder and overwrites those files with PDFs that have embedded text.

Arranging and Describing materials

All digital materials should be arranged and described in ArchivesSpace. This can be done manually, or in bulk.

For digitized materials, this step is usually already done.

Bulk arrangement and description of digital materials is done in spreadsheets using asInventory. This is the same tool we use to manage file or item-level description created from data entry.

→ asInventory spreadsheets should be managed in a package's \metadata folder

Uploading to Hyrax

After digital materials are ingested, arranged in ArchivesSpace, and any derivatives created, files can be uploaded to Hyrax.

In addition to adding files to Hyrax, this requires creating digital object records in ArchivesSpace that point to the objects in Hyrax.

There is a batch process for this that uses spreadsheets, but it's always going to require a CLI input on the railsprod server, so this is going to be limited to Greg for the foreseeable future. This workflow is described in Processing Ingested Digital Files and Batch Upload to Hyrax.

Individual files, however, can be uploaded directly into Hyrax, which happens more often than getting a big batch back from a vendor.

Add DAO to ASpace

After individual files are uploaded to Hyrax, the archivist has to add the URL to the object in Hyrax to ArchivesSpace as a Digital Object record. This is a manual step.

Adding ASpace IDs to package

This script doesn't exist yet, as this setup was included in the batch Hyrax upload described in Processing Ingested Digital Files.

Yet, when files are uploaded to Hyrax individually, the Ref ID to the component in ArchivesSpace will have to be added into the package in the metadata/ directory.

→ Currently, when we create lower-quality access scans on the photocopier and upload them to Hyrax, there is a wonky script, processNewUploads.py, that creates the AIP from Hyrax. We can do it this way since there's only a single copy of the file and no TIFF/JPEG derivative copies. However, since Hyrax doesn't have an AIP, this script is hacky and I'm kind of shocked it hasn't broken and has been running without issue for 4+ years.

The resource type and license/rights statement fields are also manually added in Hyrax so it would be good to add these to the package as well.

So the form here might have fields for:

package ID
ASpace ref ID
resource type dropdown
license/rights statement dropdown

Or it might be best to just have two fields and query Hyrax for the other info so we know it's consistent.

package ID
ASpace ref ID

For consistency, we might want to recreate the CSV file that the batch upload process uses. There is code for this in processNewUploads.py we can use.
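
Since this script doesn't exist yet, the following is only a sketch of how the form's values might be written to a CSV in the package's metadata/ directory. The column names and the hyrax_upload.csv file name are hypothetical and are not the format the batch upload process or processNewUploads.py actually uses.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real batch upload CSV may differ.
FIELDS = ["package_id", "aspace_ref_id", "resource_type", "rights_statement"]


def record_upload(metadata_dir: Path, row: dict[str, str],
                  filename: str = "hyrax_upload.csv") -> Path:
    """Append one uploaded object's identifiers to a CSV in metadata/."""
    csv_path = metadata_dir / filename
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            # Write the header only when the file is first created.
            writer.writeheader()
        # Missing columns are left blank; unknown columns raise ValueError.
        writer.writerow(row)
    return csv_path
```

With the two-field version of the form, resource_type and rights_statement would be left blank here and filled in from a Hyrax query instead.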

Packaging a SIP into an AIP

The final processing step requires packaging the SIP into an AIP for long term preservation.

packageAIP.py copies the preservation files from the SIP and the derivatives/ and metadata/ directories from the Working Directory into the AIP. It runs some safety checks, like comparing the hashes from the SIP with the AIP after it's copied to make sure all the files are there, before finally deleting the SIP and the Working Directory package.

    packageAIP.py ger071_DZXPx2c6aKaV5zmdfasjJm

There are two additional options:

-u, --update : Uses the preservation files from the Working Directory package instead of the files originally ingested in the SIP. This is for when we are not keeping all the files that we originally ingested.
-n, --noderivatives : Will not include derivatives in the AIP. This is for cases where preservation copies (like PDFs) are the same as derivatives.
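
The hash comparison that packageAIP.py runs before deleting anything can be illustrated with a short sketch. This is not the script's actual implementation, which may rely on the bag manifests instead; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def file_hashes(root: Path, algorithm: str = "sha256") -> dict[str, str]:
    """Return {relative path: hex digest} for every file under root."""
    hashes = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.new(algorithm)
            with path.open("rb") as f:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            hashes[path.relative_to(root).as_posix()] = digest.hexdigest()
    return hashes


def verify_copy(source: Path, copy: Path) -> list[str]:
    """List files from source that are missing or different in copy."""
    source_hashes = file_hashes(source)
    copy_hashes = file_hashes(copy)
    return sorted(
        rel for rel, digest in source_hashes.items()
        if copy_hashes.get(rel) != digest
    )
```

Only when verify_copy returns an empty list would it be safe to delete the SIP and the working package.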

Overview of Digital Materials Processing