Survey and Processing Guidance for Students
For more information as to the structure of the collection folders in the backlog, please see the “Ingest” instructions.
To begin digital processing, you must first assess what is in the masters folder. These are the objects as they were arranged by the donor, and the original order (if any) should be taken into consideration. Sometimes it is an entire computer drive copied into our system; sometimes it is a more thoughtful organization of materials. There is no way of telling until you survey it!
Arrangement Options
The collection you are processing may be new and thus the arrangement scheme is completely up to you, or you may be interfiling with a pre-existing collection. And that pre-existing collection may already have digital material, or this may be the first accession to have digital objects. We still receive many paper records from the 20th century, but the number of digital objects increases each year as many of our donor organizations begin to operate largely in the digital sphere, leading to many of these collections being “hybrid” and containing both physical and digital materials.
That is all to say, it is hard to predict what kind of collection you are going to be assigned to process. And though it may seem like physical and digital materials are polar opposites of one another, how we consider them and arrange them remains largely the same. You will still want to remove duplicates and maintain as much of the original order as possible (if there is one). Your goal is still to create a contextualized experience for users that will stand the test of time.
One of the unfortunate drawbacks of processing digital materials is the inability to view the contents of an object without clicking into it. Where you could just flip through the materials in a folder in a physical collection, for digital materials you only have the title and date to work with unless you investigate further. And unfortunately you cannot control the naming conventions of the files that are donated, so you could end up trying to identify files named things like “Booth.jpg” or “WBJ Report.doc” or “4 pages up.pdf”. Names like these are vague, offer little descriptive detail, and may require knowledge of acronyms specific to the donor organization.
This leaves us with two options: 1) investigate every single file for content, dates, and possibly rename it, or 2) describe in larger, aggregate groups that describe a single common element of the files.
Aggregate Description
Previous practice had us describing each individual file, but moving forward we will be taking the second option and describing in aggregate. This aligns with Delivering Archives and Digital Objects: a Conceptual Model (DadoCM), which recommends arranging in aggregates. An aggregate is a meaningful grouping of materials, defined by an archivist based on natural relationships among records, that preserves their functional and administrative context. This could be any grouping that is useful for users.
These aggregates can be the accession of materials, like “June 20 2025 Donation”; a grouping provided by the organization or that you have determined through your survey, like “Letters” or “State Fair 2021”; the format of the materials in connection to the collection, like “Photographs” or “Newsletters”; the physical media that the objects were acquired from, like “Floppy Disk” or “Thumb drive”; or a date, like month or year. The decision should be made with regard to what most supports the user and what best highlights the context of the collection. It is a choice that you the processor make, just as it would be with the arrangement of physical materials. If you are unsure about these aggregates, ask an archivist! There are no right or wrong answers. It’s a judgement call, and only experience will help guide you to make these decisions.
After you make this decision, all of the files pertaining to that one aggregate group will be uploaded as one digital object, meaning the user will have to scroll through all the materials until reaching the one they are looking for (much like flipping through a folder of material). An example of this is the “Speak Out” Newsletter, which we have grouped by year (Ex: Speak Out, 1976-1977). The individual newsletters are organized chronologically within the digital object, but they are not separated into individual objects for each issue, even though each newsletter was a separate entity.
If there are few digital objects in the collection, you may choose to describe them all individually, especially if you wish to integrate them with a pre-existing arrangement scheme. Or you may choose to integrate them with physical materials if it is a hybrid collection, meaning a PowerPoint file would be described alongside printed presentation scripts under a series you call “Presentations”. The collection may also have no series, meaning the limited folders are described with title and date, and you can slide the digital objects in alongside them.
Aggregation is meant to lessen the time and labor spent to bring large quantities of digital materials into a collection, hybrid or born-digital. Don’t spend more time forcing groupings than you would just describing the digital objects one-by-one. Make a determination after your survey, and go from there. And always ask for help if you are unsure.
Data Entry Sheet
You can download a blank template of the asinventory sheet through the processing app by using “List Files and Empty Sheet” or “Download Empty Sheet”.
Deduplication Process
When you have a large number of files, it can be really difficult to see if there are duplicates, especially if the original file structure contains multiple subfolders. The best way to search for and remove duplicates is through the “List files” tool and Excel.
Note: this process should be done to the derivative folder, not the masters, as you don’t want to alter the original order or delete any original files.
The “List files” tool will provide a list of all the files in both the derivative and master folders, though for this process, we are only concerned with the derivative folder. The “derivatives.txt” file will list all files and their paths whereas the “derivatives-directories.txt” lists all folders, including subfolders.
You will want to update or create the list after you have arranged the derivatives folder to match the current or planned series structure (if any). You can do this by using the “List Files and Empty Sheet” tab under the “Batch Upload” tab. Enter the package ID that you are working in, and a new list will be created or your current list will be updated.
The “derivatives.txt” list will provide the file paths for all items in the derivatives folder. You may also remove duplicates as you go by not moving them over to the derivatives folder if you see them while processing.
Step 1: Copy File Paths and Add to Excel Sheet
After you have created an updated “derivatives.txt” list, copy the entire list and paste it into an Excel spreadsheet column.
Step 2A: Check for Duplicate File Paths in Excel
Highlight the column
Select the “Conditional Formatting” tool from the Menu Bar
Then under “Highlight Cells Rules”, select “Duplicate Values”
You can select what color to mark your duplicate values, or the inverse, your unique values (I would recommend only marking duplicate values but to each their own!) and then hit “OK”
Because we have not removed the file paths, this will only catch entries whose full paths match exactly, that is, duplicates that exist within the same folder.
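For anyone who would rather check the list outside Excel, the same full-path check can be sketched in Python. This is only an illustration, not part of the processing app: it assumes “derivatives.txt” holds one file path per line, and the function name and sample paths are hypothetical.

```python
from collections import Counter


def find_duplicate_paths(lines):
    """Return each path that appears more than once, with its count."""
    # Strip whitespace and skip blank lines before counting.
    paths = [line.strip() for line in lines if line.strip()]
    counts = Counter(paths)
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical sample standing in for the lines of "derivatives.txt"
# (assumed format: one file path per line).
sample = [
    "Newsletters/Speak Out 1976.pdf",
    "Newsletters/Speak Out 1976.pdf",
    "Photographs/Booth.jpg",
]
print(find_duplicate_paths(sample))
```

Like the Excel check in Step 2A, this only flags entries whose entire path is repeated.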
Step 2B: Check for Duplicate Files in Excel
Prior to running the Conditional Formatting, remove the file paths.
This can be done by using the “Find and Replace” feature (the second tab of Ctrl+F) and replacing the file path strings with nothing (leave the “Replace with” field blank)
You will have to replace each piece of the file path: the larger folder and each of the subfolders (delineated by the “/”).
Duplicate files will be highlighted in red and ready for your assessment!
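As an alternative to stripping folder names by hand, the file-name comparison can be sketched in Python. This is a minimal sketch under assumptions: the list has one path per line, folders are separated by “/” (backslashes are also accepted), and names are compared without regard to case. The function name and sample paths are hypothetical.

```python
import posixpath
from collections import defaultdict


def group_by_filename(paths):
    """Map each repeated file name to every path where it appears."""
    groups = defaultdict(list)
    for raw in paths:
        cleaned = raw.strip()
        # Normalize backslashes so both separator styles are handled.
        name = posixpath.basename(cleaned.replace("\\", "/"))
        if name:
            # Lowercasing is a choice made for this sketch.
            groups[name.lower()].append(cleaned)
    # Keep only names that occur in more than one place.
    return {name: found for name, found in groups.items() if len(found) > 1}


# Hypothetical sample paths.
sample = [
    "Photographs/Booth.jpg",
    "State Fair 2021/Booth.jpg",
    "Reports/WBJ Report.doc",
]
for name, locations in group_by_filename(sample).items():
    print(name, locations)
```

A shared name is only a candidate duplicate; the contents still need to be checked before anything is deleted.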
Step 3: Search for Files in File Explorer
Once you have a list ready for review, you can begin to assess each of the duplicate values and determine their correct location/value
Note: As you can see in the screenshot above, the “thumbs.db” file is not necessary and should be deleted wherever it is found. It is left over from the previous file system, where it helped the computer read the folders. Deleting it will not ruin your ability to access the files.
Copy the file name and search for it within the derivatives folder
Right-click each file to open the file location. Ensure that the contents of both files are exactly the same, as two files may have the same name but different contents. If that is the case, we will have to provide new descriptive titles for each file during the Data Upload Sheet process, since we want to capture both sets of data regardless of name.
Next, determine which file location makes the most sense for the arrangement choices you have made. Delete the file (from derivatives ONLY) from the locations where it does not make sense.
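When opening two files side by side is impractical, a byte-for-byte comparison is one way to confirm that two same-named files really hold the same content. The sketch below uses Python's standard filecmp module; the helper name and the throwaway demonstration files are hypothetical, and it is not part of the documented workflow.

```python
import filecmp
import tempfile
from pathlib import Path


def same_content(path_a, path_b):
    """Return True only if the two files are byte-for-byte identical."""
    # shallow=False compares the actual contents rather than trusting
    # matching size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demonstration with throwaway files; in practice you would pass the
# two file locations found in the derivatives folder.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "series1", "Booth.jpg")
    second = Path(tmp, "series2", "Booth.jpg")
    for path, data in ((first, b"photo A"), (second, b"photo B")):
        path.parent.mkdir(parents=True)
        path.write_bytes(data)
    # Same name, different contents: both files should be kept.
    print(same_content(first, second))
```

A match means the bytes are identical. It will not recognize the same content saved in two different formats, which is covered in Step 4.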
Step 4: Check for alternate file formats
As mentioned above, you may have multiple file types of the same content. This may mean that you have to search for multiple file types of the same name.
You can search and remove the different file extensions using the same Find and Replace process as described in Step 2B. The Conditional Formatting applied in previous steps should remain and highlight any newly identified duplicates.
You will need to repeat the process of Step 3 above and find and remove the files in the derivatives folder using the File Explorer.
Note: You may notice that some files have the same name but not the same content, especially if they are in different folders (the system won’t let you save two files with the same name in the same place). Make sure to click into the files to confirm that the content is identical before removing anything. Don’t just go off of the naming conventions.
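The alternate-format check can also be sketched in Python by comparing file names with their extensions removed. As before, this is an illustration under assumptions: one path per line, names compared without regard to case, and a hypothetical function name and sample paths.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_stem(paths):
    """Map each base name to its paths when it occurs in more than one format."""
    groups = defaultdict(list)
    for raw in paths:
        cleaned = raw.strip()
        parsed = PurePosixPath(cleaned.replace("\\", "/"))
        if parsed.name:
            # Store the extension next to the original path string.
            groups[parsed.stem.lower()].append((parsed.suffix.lower(), cleaned))
    result = {}
    for stem, entries in groups.items():
        extensions = {ext for ext, _ in entries}
        # Report only names saved with at least two different extensions.
        if len(extensions) > 1:
            result[stem] = [path for _, path in entries]
    return result


# Hypothetical sample paths.
sample = [
    "Reports/WBJ Report.doc",
    "Reports/PDFs/WBJ Report.pdf",
    "Photographs/Booth.jpg",
]
print(group_by_stem(sample))
```

Files flagged this way share a name only, so each pair still needs to be opened and compared before either copy is removed.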
Once you have a complete file list, with all of your duplicates removed, you can re-run the “List Files” step and use the updated list to create your Data Upload Sheet.