DOI: https://doi.org/10.59350/m5fmn-2f721
Esther Jackson, The New York Botanical Garden
Sebastian Karcher, Qualitative Data Repository
What Is Data Rescue?
The US government and its agencies collect economic and demographic data; data on soil, plants, and livestock; data on climate and astronomy; and much more. Countless scientists across disciplines rely on these data for their research. Much of this material is available only digitally and may not be preserved by the institutions traditionally tasked with federal recordkeeping, such as the National Archives or Federal Depository Libraries. A transfer of power to a new administration puts many of these data at risk. In some cases they fall victim to the inevitable changes a new administration makes to websites and portals. In other cases, different political and policy priorities mean that data are removed or databases are no longer maintained, and these invaluable resources are lost.
A growing group of researchers, librarians, programmers, and interested citizens is organizing to preserve such at-risk government data. Loosely coordinated through the Environmental Data Governance Initiative (EDGI) and the Penn Program in Environmental Humanities (PPEH Lab), “Data Rescue” events are being convened across the United States for that purpose. Rather than recounting these events and their backstory at length, we suggest you watch Kim Ede’s terrific video on Libraries and Data Refuge:
And read some of the fascinating posts on the PPEH Lab’s blog.
The JSTOR/QDR Data Rescue (aka NYC 2)
JSTOR and the Qualitative Data Repository (QDR) co-organized a Data Rescue event on March 25th at JSTOR’s offices in downtown New York City. Following the “sold out” event at NYU’s Tisch library in February, this was the second public Data Rescue in New York City, nicknamed “NYC 2.” With the help of participants from the New York Botanical Garden, the event focused on the US Department of Agriculture and its sub-agencies. And given JSTOR’s and QDR’s expertise in cataloging and preservation, we emphasized describing data over harvesting tasks.
The event was attended by a diverse group of data enthusiasts. Several attendees had participated in the NYU Data Rescue event, while others were brand new to Data Rescue tools and procedures. A range of ages and professions were represented, including contingents from JSTOR, the New York Botanical Garden, and the Pratt Institute School of Information. The event was facilitated by Erin McCabe of JSTOR, Sebastian Karcher of Syracuse University, and Brendan O'Brien of the archivers.space team. While many attendees had backgrounds in libraries, archives, or science, technical knowledge and proficiency varied widely within the group. Because the facilitators took time to explain each step of the Data Rescue workflow in detail, all attendees were able to combine their existing skillsets with newly learned tools to work with at-risk datasets.
We focused in particular on three USDA sub-agencies: the National Institute of Food and Agriculture (NIFA), the Animal and Plant Health Inspection Service (APHIS), and the National Agricultural Library (NAL). NIFA is principally a grant-making body, supporting applied science integrated with the education and participation of local communities. APHIS’s mission is to “protect America's animal and plant resources from agricultural pests and disease.” This includes not just the regulation and inspection of crops and animals domestically, but also a significant role in international agricultural trade, where the agency develops standards with other countries and coordinates emergency responses to disease outbreaks. APHIS data recently received wide media coverage when the agency removed animal welfare inspection reports from its site (the information has since been partially restored). The National Agricultural Library, one of four national libraries in the US, houses one of the world's largest collections devoted to agriculture and related sciences. You can learn more about these organizations and some of the most valuable information and data they hold by visiting the “subagency primers,” short documents that provide basic information about agencies to help structure Data Rescues: NIFA primer, APHIS primer, NAL primer.
Three Paths to Rescue Data
After attendees introduced themselves to the group, Erin began by giving attendees a high-level sense of the Data Rescue project. Attendees were directed to the website archivers.space, whose landing page lists the URLs currently in the process of being harvested. Data activists can nominate URLs for inclusion through a Chrome browser extension. In the first stage of Data Rescue, known as the surveying stage, data activists identify key programs, datasets, and documents on federal agency websites that are vulnerable to change and loss. Next, in the seeding stage, important URLs are identified and reviewed to see whether they can be easily crawled by the Internet Archive’s Wayback Machine. URLs that cannot be easily crawled move to the researching stage for further review. After descriptive information is added to the URLs during the researching stage, programmers download complex datasets using dedicated code in the harvesting stage. This is followed by the bagging stage, in which the harvested datasets are consolidated and packaged, and the describing stage, in which the metadata describing each dataset is written and reviewed. Once a dataset and its metadata have been ingested into the Data Refuge repository, it is considered complete. Each stage requires different tools and skillsets; the Data Rescue Workflow provided by Data Refuge offers excellent documentation for all of them.
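For readers curious what the seeding-stage triage can look like in practice, here is a minimal sketch that asks the Internet Archive’s public availability API whether a URL already has a Wayback Machine snapshot. This is an illustration rather than the tooling used at the event (nomination ran through the Chrome extension), and the `requests` dependency and example URLs are our assumptions.

```python
# A minimal sketch of seeding-stage triage: ask the Wayback Machine's
# public availability API whether a URL already has a snapshot.
# Not the tooling used at the event; assumes the third-party `requests` library.
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def closest_snapshot(url):
    """Return the URL of the closest Wayback snapshot, or None if there is none."""
    resp = requests.get(WAYBACK_API, params={"url": url}, timeout=30)
    resp.raise_for_status()
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

if __name__ == "__main__":
    for url in ["https://www.nifa.usda.gov", "https://www.aphis.usda.gov"]:
        snap = closest_snapshot(url)
        print(url, "->", snap or "no snapshot found; a candidate for further review")
```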
Jointly, we began our day by working on United States Department of Agriculture (USDA) URLs in the researching stage: URLs that were not crawlable and therefore could not simply be “harvested” for archiving. Attendees made recommendations about how these pages and their content might be rescued, writing detailed notes about the nature of the data on the at-risk pages: the file types present, the databases used to store and display the information, and the approximate size of the datasets. This early stage introduced attendees to many of the tools used in Data Rescue.
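To give a flavor of the notes collected during this stage, the sketch below gathers two of them, file type and approximate size, from a URL’s HTTP headers. At the event these notes were recorded by hand in archivers.space; the `requests` dependency and the example URL are assumptions for illustration.

```python
# A sketch of gathering researching-stage notes (file type, approximate size)
# from a URL's HTTP headers via a HEAD request. At the event these notes were
# recorded by hand; assumes the third-party `requests` library.
import requests

def research_notes(url):
    head = requests.head(url, allow_redirects=True, timeout=30)
    head.raise_for_status()
    return {
        "url": url,
        "file_type": head.headers.get("Content-Type", "unknown"),
        # Content-Length is optional; servers often omit it for dynamic pages.
        "size_bytes": head.headers.get("Content-Length", "not reported"),
    }

if __name__ == "__main__":
    # Hypothetical example; any direct link to a dataset file would do.
    print(research_notes("https://www.nal.usda.gov/robots.txt"))
```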
After we completed research on the USDA URLs, the group split into three “paths,” each focusing on a different task for the second portion of the event (NYC 2 used only three of the six paths described on the Data Refuge website). Path 1 attendees focused on website archiving, working in the seeding stage of the workflow. Using the subagency primers, they reviewed the website of the National Institute of Food and Agriculture (NIFA), nominating pages for archiving, or “harvesting,” via the Google Chrome extension. This work was completed quickly, as the content on the NIFA website was relatively straightforward and did not include complex, nested links.
Meanwhile, path 2 attendees focused on the describing stage for a different group of datasets. This involved checking the metadata associated with previously “bagged” datasets and uploading the content into the Data Refuge repository. Attendees performed quality assurance by reviewing metadata files written in JSON and by spot-checking and verifying the content of the datasets. Because of the datasets’ size and limited download speed (many are 10 GB or larger), ultimately only one dataset, the NASA Enterprise Directory, was uploaded to Data Refuge during the event. The Directory includes the names and contact information of NASA employees and contractors; 102,615 entries containing names, emails, and phone numbers have been shared through Data Refuge as a .csv file.
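For the curious, the quality assurance on a bagged dataset might look roughly like the sketch below, which verifies a bag’s checksums and then confirms a few required metadata fields are present. It assumes the Python `bagit` library and a hypothetical `metadata.json` layout; the actual Data Refuge metadata schema and QA checklist were more extensive.

```python
# A minimal QA sketch for a bagged dataset: verify the bag's checksums, then
# confirm a few required metadata fields are present. Assumes the `bagit`
# library (pip install bagit); the field names and file layout are
# hypothetical, not the actual Data Refuge schema.
import json
import bagit

REQUIRED_FIELDS = ["title", "description"]  # hypothetical required fields

def qa_bag(bag_path, metadata_path):
    bag = bagit.Bag(bag_path)
    bag.validate()  # recomputes checksums for every payload file
    with open(metadata_path) as f:
        metadata = json.load(f)
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError("metadata is missing fields: %s" % missing)
    print("%s: checksums OK, metadata complete" % bag_path)

if __name__ == "__main__":
    qa_bag("rescued-dataset", "rescued-dataset/metadata.json")
```

A spot check such as counting the rows of a harvested .csv against the entry count reported in its metadata would follow the same pattern.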
Later, Brendan offered path 2 attendees a preview of the next generation of the archivers.space app. The app could be viewed online [edit: no longer available], though it was still pre-beta. The new app integrates the different steps of the Data Rescue process more smoothly and structures them more closely around the concept of agency primers. It is also better designed for crowdsourced information, allowing multiple versions of the same field that can then be reconciled. The goal for archivers.space 2.0 is to lower the burden on organizers and seeders, focusing Data Rescue efforts on harvesting and describing data. Participants discussed the new interface and offered suggestions on the site’s use of controlled vocabularies, folksonomies, taxonomies, and thesauri.
Throughout the event, path 3 attendees focused on their mission of storytelling. This path involved interpreting technical content for the public and interviewing other event participants. The path 3 attendees drafted a blogpost about the event and had the opportunity to learn about new technologies. Also present was a reporter for Marketplace NYC working on a story about hacktivism.
During the last part of the day, we worked on writing new subagency primer documents to be used in future Data Rescue initiatives. These primers are a foundational aspect of Data Rescue, as they begin the process of gathering information about at-risk data. Each primer originates in the Agency Office Database, created as part of the Boston Data Rescue, and is associated with a specific office or agency within a department of the United States government. Each primer includes background information about the office, an assessment of the highest-risk data on that office’s website, and a site map or link summary of all content on the site. While aspects of primer construction are subjective (such as the question of which data are truly “high risk”), the common format allows for national and international collaboration between primer authors and the data activists who use the primers as guides in Data Rescue initiatives. NYC 2 attendees drafted primers for the Animal and Plant Health Inspection Service and the National Agricultural Library.
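As a toy illustration of the link-summary portion of a primer, the sketch below collects the outbound links of a single agency page using only the Python standard library. Real primers were compiled by hand and summarize entire sites, not single pages; the example URL is an assumption.

```python
# A toy sketch of the "link summary" portion of a primer: collect the outbound
# links of a single agency page using only the standard library. Real primers
# were compiled by hand and cover whole sites, not one page.
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Accumulate every href found in <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def link_summary(url):
    with urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    return sorted(set(collector.links))

if __name__ == "__main__":
    for link in link_summary("https://www.nal.usda.gov")[:20]:
        print(link)
```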
What Did We Learn? What’s Next?
We believe that smaller, thematically focused Data Rescue events are useful, not as a replacement for larger Data Rescue events with 100-200 participants, but as a complement that takes advantage of disciplinary knowledge and allows for a different sort of conversation in smaller groups. Not least, smaller events also place a smaller onus on organizers. Our event also emphasized understanding and explaining the entire Data Rescue process to attendees, empowering them to be guides or experts at future events or even to work on topics individually. This is critical for two reasons: first, having a large number of experts at hand matters when a Data Rescue effort requires rapid, immediate attention; second, it helps create a more sustainable model of Data Rescue that requires fewer events and can run more like other crowdsourced online projects. Finally, we found the work on the sub-agency primers quite rewarding. The primers are immensely useful documents (and may well prove useful for purposes beyond Data Rescue), and writing and researching them is a fun way to get to know the wealth of information provided by the US government. You may even hit upon a gem like Sebastian’s new favorite government website, hungrypests.com.
The attendees of this event arrived already interested in Data Rescue and data activism, and over the course of the day that interest only grew. By teaching new tools, working with meaningful datasets, and providing opportunities to network with other data professionals and enthusiasts, the facilitators of NYC 2 ultimately hosted an extremely successful capacity-building event. Judging by the enthusiasm of attendees as they left, it will be exciting to see what additional 2017 events and skillshares are planned around Data Rescue in NYC and beyond.