How ArchivesSpace, ArcLight, and Hyrax are synced overnight
Overnight Export and Indexing Scripts
High-Level Overview
What Each Script Does
exportPublicData.py
- Each night, exportPublicData.py uses ArchivesSnake to query ArchivesSpace for resources updated since the last run.
- For collections with the complete set of DACS-minimum elements, it exports EAD 2002 files; for collections with only abstracts and extents, it saves them to pipe-delimited CSVs.
- It also builds a CSV of local subjects and collection IDs.
- All this data is pushed to GitHub.
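The export logic described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the function names, the summary field names, and the REQUIRED_FOR_EAD list are hypothetical, and `client` stands for an authorized ArchivesSnake `ASnakeClient` (any object with a requests-style `get()` works). The `all_ids` and `modified_since` parameters are the ArchivesSpace API's standard listing filters, with `modified_since` given as a Unix timestamp.

```python
import csv

# Hypothetical stand-in for the DACS-minimum check; the real script's
# list of required elements may differ.
REQUIRED_FOR_EAD = ("title", "date", "extent", "creator", "abstract",
                    "scope_content", "access", "language")


def modified_resource_ids(client, repo_id, last_run):
    """Return IDs of resources modified since last_run (Unix seconds).

    client is assumed to be an authorized asnake.client.ASnakeClient,
    or any object with a requests-style get().
    """
    response = client.get(
        f"repositories/{repo_id}/resources",
        params={"all_ids": True, "modified_since": int(last_run)},
    )
    return response.json()


def export_format(summary):
    """Choose "ead", "csv", or "skip" from a dict of descriptive fields."""
    present = {name for name, value in summary.items() if value}
    if present.issuperset(REQUIRED_FOR_EAD):
        return "ead"
    if {"abstract", "extent"} <= present:
        return "csv"
    return "skip"


def write_pipe_csv(rows, handle):
    """Write abstract-and-extent collections as pipe-delimited rows."""
    writer = csv.writer(handle, delimiter="|", lineterminator="\n")
    for row in rows:
        writer.writerow([row["id"], row["title"], row["extent"], row["abstract"]])
```

Passing the client in as an argument keeps the routing and CSV code usable without a live ArchivesSpace connection.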
staticPages.py
- When it finishes, exportPublicData.py runs staticPages.py, which builds static browse pages for all collections, including a complete A-Z list, alpha lists for each collecting area, and pages for each local subject.
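As a rough idea of how an A-Z browse list can be built, here is a minimal sketch. It assumes each collection is a dict with hypothetical "title" and "url" keys and emits a bare HTML fragment; the real staticPages.py may use a different data shape and templating approach.

```python
from collections import defaultdict
from html import escape


def group_by_letter(collections):
    """Group collection dicts by the first letter of their title."""
    groups = defaultdict(list)
    for coll in sorted(collections, key=lambda c: c["title"].lower()):
        first = coll["title"][:1].upper()
        # Titles that start with a digit or symbol go under "#".
        groups[first if first.isalpha() else "#"].append(coll)
    return dict(groups)


def render_az_list(collections):
    """Render a bare-bones A-Z browse list as an HTML fragment."""
    lines = []
    for letter, items in sorted(group_by_letter(collections).items()):
        lines.append(f"<h2>{escape(letter)}</h2>")
        lines.append("<ul>")
        for coll in items:
            lines.append(
                f'  <li><a href="{escape(coll["url"])}">{escape(coll["title"])}</a></li>'
            )
        lines.append("</ul>")
    return "\n".join(lines)
```

The same grouping function could be reused for a per-collecting-area list by filtering the input before calling it.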
Indexing Shell Scripts
- Later, collection data is updated with git pull, and indexNewEAD.sh indexes EAD files updated in the past day (selected with find -mtime -1) into the ArcLight Solr instance.
- There are also additional indexing shell scripts for ad hoc updates:
  - indexAllEAD.sh reindexes all EAD files
  - indexOneEAD.sh indexes only one EAD by collection ID (./indexOneEAD.sh apap101)
  - indexOneNDPA.sh indexes one NDPA EAD file, necessary because they have the same collection ID prefixes
  - indexNewNoLog.sh indexes one EAD file, but logs to stdout instead of a log file
  - indexOneURL.sh indexes via a URL instead of from disk (not actively used)
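The "updated in the past day" selection that find -mtime -1 performs can be expressed in Python as below. This is only a sketch of the selection logic: the function names are hypothetical, and `index_one` is a caller-supplied callable standing in for whatever command actually loads one EAD file into Solr, which is not shown in the source.

```python
import time
from pathlib import Path

DAY_SECONDS = 24 * 60 * 60


def recently_modified_ead(ead_dir, now=None, max_age=DAY_SECONDS):
    """Return *.xml files under ead_dir modified less than max_age seconds ago.

    Roughly what `find <ead_dir> -name "*.xml" -mtime -1` selects.
    """
    now = time.time() if now is None else now
    return sorted(
        path
        for path in Path(ead_dir).rglob("*.xml")
        if path.is_file() and now - path.stat().st_mtime < max_age
    )


def index_new(ead_dir, index_one):
    """Call index_one(path) for each recent file; return the paths indexed.

    index_one is a hypothetical callable that would wrap the command
    used to index a single EAD file.
    """
    indexed = []
    for path in recently_modified_ead(ead_dir):
        index_one(path)
        indexed.append(path)
    return indexed
```

Because selection depends on file modification times, a fresh clone or a checkout that rewrites files would make every file look new, so a full reindex script is the safer choice in that situation.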
processNewUploads.py
- Finally, processNewUploads.py queries the Hyrax Solr index for new uploads that are connected to ArchivesSpace ref_ids but do not have accession numbers.
- It downloads the new binaries and metadata and creates basic Archival Information Packages (AIPs) using bagit-python.
- It then uses ArchivesSnake to add a new Digital Object record in ArchivesSpace that links to the object in Hyrax.
- Last, it adds a new accession ID in Hyrax.
- (Also check out Noah Huffman's talk that probably does this better [Direct Link].)
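The ArchivesSpace step in the list above (adding a Digital Object record that points at Hyrax) might look something like this sketch. It is not the script's actual code: the function name is hypothetical, using the Hyrax URL as the digital_object_id is an assumption made for the example, and `client` is assumed to be an authorized ArchivesSnake `ASnakeClient` or any object with requests-style `get()` and `post()` methods. The endpoints used (find_by_id/archival_objects and digital_objects) are standard ArchivesSpace API routes.

```python
def link_digital_object(client, repo_id, ref_id, title, hyrax_url):
    """Create a digital object for hyrax_url and attach it to the archival
    object identified by ref_id. Returns the new digital object's URI.
    """
    # Resolve the ref_id to an archival object URI.
    found = client.get(
        f"repositories/{repo_id}/find_by_id/archival_objects",
        params={"ref_id[]": ref_id},
    ).json()
    ao_uri = found["archival_objects"][0]["ref"]

    # Using the Hyrax URL as digital_object_id is an assumption of this sketch.
    digital_object = {
        "jsonmodel_type": "digital_object",
        "title": title,
        "digital_object_id": hyrax_url,
        "publish": True,
        "file_versions": [
            {"jsonmodel_type": "file_version", "file_uri": hyrax_url, "publish": True}
        ],
    }
    created = client.post(
        f"repositories/{repo_id}/digital_objects", json=digital_object
    ).json()

    # Add a digital object instance to the archival object and save it back.
    archival_object = client.get(ao_uri).json()
    archival_object.setdefault("instances", []).append(
        {
            "jsonmodel_type": "instance",
            "instance_type": "digital_object",
            "digital_object": {"ref": created["uri"]},
        }
    )
    client.post(ao_uri, json=archival_object)
    return created["uri"]
```

A production version would also want to check each response for errors and handle a ref_id that matches no archival object.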
dacs.py
- A simple library that converts POSIX timestamps and ISO 8601 dates to DACS-compliant display dates.
- exportPublicData.py uses this to make dates for the static browse pages.
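For illustration, a conversion of this kind can be written in a few lines. The function names below are hypothetical and not necessarily those in dacs.py; the sketch assumes the year-month-day display form DACS recommends for exact dates (for example, 1975 March 12) and interprets POSIX timestamps as UTC.

```python
from datetime import datetime, timezone

# Spelled out explicitly so output does not depend on the system locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def iso_to_dacs(iso_date):
    """Convert "YYYY", "YYYY-MM", or "YYYY-MM-DD" to a DACS-style display date."""
    # Drop any time portion, then split into year, month, day.
    parts = iso_date.split("T")[0].split("-")
    display = [parts[0]]
    if len(parts) > 1:
        display.append(MONTHS[int(parts[1]) - 1])
    if len(parts) > 2:
        display.append(str(int(parts[2])))  # int() strips the leading zero
    return " ".join(display)


def posix_to_dacs(timestamp):
    """Convert a POSIX timestamp (seconds, UTC) to a DACS-style display date."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return iso_to_dacs(moment.date().isoformat())
```

For example, `iso_to_dacs("1975-03-12")` returns `"1975 March 12"` and `iso_to_dacs("1975-03")` returns `"1975 March"`.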
imageaday.py
- Queries the Bing background image API each night to display new background images for ArchivesSpace and Find-It just for fun.
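A nightly lookup like this might be sketched as follows. The source does not say which endpoint imageaday.py calls, so this assumes the widely used but unofficial HPImageArchive.aspx JSON feed; the function names are hypothetical.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: the unofficial Bing homepage image feed. The actual script
# may query a different endpoint.
BING_ENDPOINT = "https://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1"


def image_url_from_feed(feed, base="https://www.bing.com"):
    """Return the absolute URL of the first image in the parsed JSON feed."""
    # The feed lists image paths relative to the Bing site root.
    return urljoin(base, feed["images"][0]["url"])


def fetch_todays_image_url(endpoint=BING_ENDPOINT):
    """Download the feed and return today's image URL (needs network access)."""
    with urlopen(endpoint, timeout=30) as response:
        feed = json.load(response)
    return image_url_from_feed(feed)
```

Keeping the parsing separate from the download makes it possible to check the URL handling without a network connection.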