Overnight Export and Indexing Scripts

How ArchivesSpace, ArcLight, and Hyrax are synced overnight

GitHub repository: https://github.com/UAlbanyArchives/ArchivesSpace-ArcLight-Workflow


High-Level Overview

What Each Script Does


  • Each night, exportPublicData.py uses ArchivesSnake to query ArchivesSpace for resources updated since the last run.

  • For collections with the complete set of DACS-minimum elements, it exports EAD 2002 files. For collections with only abstracts and extents, it saves the data to pipe-delimited CSVs.

  • It also builds a CSV of local subjects and collection IDs.

  • All of this data is pushed to GitHub.
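The routing between full EAD export and pipe-delimited CSV described above can be sketched roughly as follows. This is a minimal illustration, not the logic of exportPublicData.py itself: the completeness check (REQUIRED_NOTE_TYPES and the fields tested), the function names, and the sample records are all assumptions, and the dictionaries only loosely mimic the shape of ArchivesSpace resource JSON.

```python
import csv
import io

# Assumption: a simplified stand-in for "DACS-minimum". The criteria the
# real script applies are not shown in this document.
REQUIRED_NOTE_TYPES = {"scopecontent", "accessrestrict"}


def has_dacs_minimum(resource):
    """Return True if a resource dict has a title, dates, extents, and the assumed notes."""
    note_types = {note.get("type") for note in resource.get("notes", [])}
    return (
        bool(resource.get("title"))
        and bool(resource.get("dates"))
        and bool(resource.get("extents"))
        and REQUIRED_NOTE_TYPES <= note_types
    )


def route_resources(resources):
    """Split resources into (export as EAD, write to CSV) lists."""
    ead, stubs = [], []
    for resource in resources:
        (ead if has_dacs_minimum(resource) else stubs).append(resource)
    return ead, stubs


def write_pipe_csv(rows, handle):
    """Write rows to an open text handle as pipe-delimited CSV."""
    writer = csv.writer(handle, delimiter="|", lineterminator="\n")
    writer.writerows(rows)


# Hypothetical sample records for illustration only.
full = {
    "id_0": "apap101",
    "title": "Example Papers",
    "dates": [{"expression": "1950-1960"}],
    "extents": [{"number": "2", "extent_type": "cubic ft."}],
    "notes": [{"type": "scopecontent"}, {"type": "accessrestrict"}],
}
stub = {
    "id_0": "ua200",
    "title": "Example Records",
    "extents": [{"number": "1", "extent_type": "cubic ft."}],
    "notes": [{"type": "abstract"}],
}

ead, stubs = route_resources([full, stub])
buffer = io.StringIO()
write_pipe_csv([[r["id_0"], r["title"]] for r in stubs], buffer)
```

Passing delimiter="|" to csv.writer keeps quoting and escaping in the hands of the standard library, which matters if a title ever contains a pipe character.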


Indexing Shell Scripts

  • Later, collection data is updated with git pull, and indexNewEAD.sh indexes the EAD files modified in the past day (selected with find -mtime -1) into the ArcLight Solr instance.

  • There are also additional indexing shell scripts for ad hoc updates.

    • indexAllEAD.sh reindexes all EAD files

    • indexOneEAD.sh indexes only one EAD by collection ID (./indexOneEAD.sh apap101)

    • indexOneNDPA.sh indexes one NDPA EAD file, necessary because they have the same collection ID prefixes

    • indexNewNoLog.sh indexes one EAD file, but logs to stdout instead of a log file

    • indexOneURL.sh indexes via a URL instead of from disk (not actively used)
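For readers less familiar with find, the selection that find -mtime -1 performs (files modified less than 24 hours ago) can be approximated in Python as below. This sketch covers only the file selection, not the indexing call, and the function name, the "*.xml" pattern, and the sample file names are assumptions made for illustration.

```python
import os
import tempfile
import time
from pathlib import Path

DAY_SECONDS = 24 * 60 * 60


def recently_modified(directory, pattern="*.xml", now=None):
    """Return files under directory modified less than 24 hours before now."""
    now = time.time() if now is None else now
    cutoff = now - DAY_SECONDS
    # rglob searches subdirectories too, as find does by default.
    return sorted(
        path
        for path in Path(directory).rglob(pattern)
        if path.is_file() and path.stat().st_mtime > cutoff
    )


# Demonstration with two throwaway files, one of them backdated two days.
with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "apap101.xml"
    stale = Path(tmp) / "apap102.xml"
    fresh.write_text("<ead/>")
    stale.write_text("<ead/>")
    two_days_ago = time.time() - 2 * DAY_SECONDS
    os.utime(stale, (two_days_ago, two_days_ago))
    selected = [path.name for path in recently_modified(tmp)]
```

One consequence of selecting by modification time is that the result depends on when git pull last rewrote each file, not on when the record changed in ArchivesSpace.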


  • Finally, processNewUploads.py queries the Hyrax Solr index for new uploads that are connected to ArchivesSpace ref_ids, but do not have accession numbers.

  • It downloads the new binaries and metadata and creates basic Archival Information Packages (AIPs) using bagit-python.

  • It then uses ArchivesSnake to add a new Digital Object Record in ArchivesSpace that links to the object in Hyrax.

  • Last, it adds a new accession ID in Hyrax.

  • (Also check out Noah Huffman's talk that probably does this better [Direct Link].)
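As an illustration of the Digital Object step above, the records sent to ArchivesSpace might look roughly like the following. The field names follow the ArchivesSpace digital_object, file_version, and instance JSON models, but the helper functions, the example URL, the identifier, and the repository ID are hypothetical, and processNewUploads.py may build its records differently.

```python
import json


def build_digital_object(title, hyrax_url, identifier):
    """Build a digital_object record whose file version points at the Hyrax object."""
    return {
        "jsonmodel_type": "digital_object",
        "title": title,
        "digital_object_id": identifier,
        "publish": True,
        "file_versions": [
            {
                "jsonmodel_type": "file_version",
                "file_uri": hyrax_url,
                "publish": True,
            }
        ],
    }


def build_instance(digital_object_uri):
    """Build the instance that attaches a digital object to an archival object."""
    return {
        "jsonmodel_type": "instance",
        "instance_type": "digital_object",
        "digital_object": {"ref": digital_object_uri},
    }


# Hypothetical values for illustration only.
record = build_digital_object(
    title="Example photograph",
    hyrax_url="https://example.org/concern/daos/abc123",
    identifier="abc123",
)
instance = build_instance("/repositories/2/digital_objects/1")

# With an authorized ArchivesSnake client, the record could presumably be
# created with something like:
#   client.post("repositories/2/digital_objects", json=record)
# (repository ID 2 is an assumption).
payload = json.dumps(record)
```

The instance would then be appended to the instances list of the archival object identified by the ref_id, and that archival object posted back to ArchivesSpace.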


  • A simple library that converts POSIX timestamps and ISO 8601 dates to DACS-compliant display dates.

  • exportPublicData.py uses this to make dates for the static browse pages.
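A conversion of this kind can be sketched as follows, assuming the DACS year-month-day display order (for example "1975 March 4"). The function names are hypothetical and are not the library's actual interface, and the sketch handles only single dates, not ranges.

```python
from datetime import datetime, timezone

# Spelled out explicitly so the output does not depend on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def iso_to_dacs(iso_date):
    """Convert YYYY, YYYY-MM, or YYYY-MM-DD to a DACS-style display date."""
    parts = iso_date.split("-")
    display = parts[0]
    if len(parts) > 1:
        display += " " + MONTHS[int(parts[1]) - 1]
    if len(parts) > 2:
        display += " " + str(int(parts[2]))  # int() drops the leading zero
    return display


def posix_to_dacs(timestamp):
    """Convert a POSIX timestamp (seconds since the epoch, UTC) to a display date."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return iso_to_dacs(moment.date().isoformat())


full_date = iso_to_dacs("1975-03-04")
month_only = iso_to_dacs("1975-03")
epoch = posix_to_dacs(0)
```

Interpreting timestamps in UTC is a choice made here for reproducibility; a script that wants local dates would pass a different time zone.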


Example Crontab

# get new image from Bing
0 2 * * * source /home/user/.bashrc; pyenv activate aspaceExport && python /opt/lib/ArchivesSpace-ArcLight-Workflow/image_a_day.py 1>> /media/SPE/indexing-logs/image_a_day.log 2>&1 && pyenv deactivate

# export data from ASpace
0 0 * * * source /home/user/.bashrc; pyenv activate aspaceExport && python /opt/lib/ArchivesSpace-ArcLight-Workflow/exportPublicData.py 1>> /media/SPE/indexing-logs/export.log 2>&1 && pyenv deactivate

# pull new EADs from GitHub
30 0 * * * echo "$(date) $line git pull" >> /media/SPE/indexing-logs/git.log && git --git-dir=/opt/lib/collections/.git --work-tree=/opt/lib/collections pull 1>> /media/SPE/indexing-logs/git.log 2>&1

# Index modified apap collections
5 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "apap"

# Index modified ua collections
15 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ua"

# Index modified ndpa collections
25 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ndpa"

# Index modified ger collections
35 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ger"

# Index modified mss collections
45 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "mss"

# Download new Hyrax uploads and create new ASpace digital objects
0 2 * * * source /home/user/.bashrc; pyenv activate processNewUploads && python /opt/lib/ArchivesSpace-ArcLight-Workflow/processNewUploads.py 1>> /media/SPE/indexing-logs/processNewUploads.log 2>&1 && pyenv deactivate
