Web Archive Collections as Data

Olga Holownia, Gustavo Candela, Helena Byrne, Jon Carlstedt Tønnessen, Anders Klindt Myrvoll, Sophie Ham, Steven Claeyssens

GLAM (Galleries, Libraries, Archives and Museums) institutions have started to make their digital collections available in forms suitable for computational use, following the Collections as Data principles[1]. The International GLAM Labs Community[2] has explored innovative and creative ways to publish and reuse the content provided by cultural heritage institutions. As part of this work, and as a collaborative effort, the community defined a checklist[3] for publishing collections as data. The checklist provides a set of steps that can be used for creating and evaluating digital collections suitable for computational use. While web archiving institutions and initiatives have been providing access to their collections (ranging from seed lists to derivatives to “cleaned” WARC files), there is currently no standardised checklist for preparing those collections for researchers.

This workshop aims to involve researchers and web archive practitioners in evaluating whether the GLAM Labs checklist can be adapted for web archive collections. The first part of the workshop will introduce the GLAM Labs checklist, followed by four use cases that show how different web archiving teams have been working with their institutions’ Labs to prepare data packages and corpora for researchers. For the second part of the workshop, we want to issue a separate call for use cases and have researchers present examples of their use of web archive collections and discuss their workflows. In the final part, we want to involve the audience in identifying the main challenges to implementing the GLAM Labs checklist and determining which steps require modification so that it can be used successfully for web archive collections.

First use case

The UK Web Archive has recently started to publish the metadata for some of our inactive curated collections as data. This project developed new workflows that use the Datasheets for Datasets framework to provide provenance information on the individual collections published as data.

In this presentation, we will highlight how participants can:

  • Use Datasheets for Datasets to describe their collections.
  • Identify potential research uses for the data sets that were published.
  • Gain insights from the lessons learnt phase of the project.

Second use case

The National Library of Norway (NLN) has recently launched its first ‘Web News Collection’, making more than 1.5 million texts from 268 news websites openly available for computational analysis via API. The objective is to facilitate computational text analysis of news content from the web, both with computational notebooks and user-friendly web apps [4].

This presentation will:

  • Briefly explain the ‘warc2corpus’ pipeline for extracting natural language from Web ARChive (WARC) files and preparing text data for computational analysis,[5]
  • Show the metadata schema and overall statistics for the collection,
  • Demonstrate how students, scholars and others can tailor their own corpora and perform various forms of ‘distant reading’,
  • Reflect on how the ‘Web News Collection’ aligns with the FAIR principles while also taking immaterial rights into account.

Third use case

The Royal Danish Library has recently launched a service that lets anyone run free-text searches across the entire Danish web archive using the Smurf n-gram visualisation[6]. The search covers the HTML pages for each year, and the number of results found is compared to the total number of HTML pages from that year. The archive currently holds more than 20 billion HTML pages (20,000,000,000), spanning from 1995 to the present. For legal reasons, a search can contain only one or two words, and no special characters are allowed. Normally, access to the Danish web archive is granted only upon application to researchers and PhD students affiliated with a Danish research institution. The service therefore gives everyone an opportunity to gain insights and see trends in an otherwise very restricted and closed archive.
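The year-by-year comparison described above amounts to a relative frequency: hits for a query in a given year divided by the total number of HTML pages archived that year. The following is a minimal sketch of that arithmetic only, not the service's actual implementation; the function name and all counts are hypothetical and chosen purely for illustration.

```python
from typing import Dict


def relative_frequency(
    hits_by_year: Dict[int, int],
    totals_by_year: Dict[int, int],
) -> Dict[int, float]:
    """Return, per year, the share of HTML pages that matched a query.

    Both arguments map a year to a page count. Years with no known
    total (or a total of zero) are skipped to avoid division by zero.
    """
    shares = {}
    for year, hits in sorted(hits_by_year.items()):
        total = totals_by_year.get(year, 0)
        if total > 0:
            shares[year] = hits / total
    return shares


# Hypothetical counts, for illustration only (not real archive figures).
example_hits = {2005: 1_200, 2015: 45_000}
example_totals = {2005: 400_000, 2015: 9_000_000}

example_shares = relative_frequency(example_hits, example_totals)
for year, share in example_shares.items():
    print(f"{year}: {share:.4%} of HTML pages matched")
```

Normalising by the yearly total matters because the number of archived pages differs from year to year, so raw hit counts alone would mostly reflect how much was harvested rather than how common a term was.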

This presentation will:

  • Show how the n-gram search can be used for feasibility studies by researchers before they apply for access to the archive, as well as by members of the general public who would like a sneak peek into the archive.
  • Look into the needs, practicalities, and possibilities of sharing corporate data or metadata with researchers and the general public.
  • Explore particular situations concerning data availability that can be useful to extend the scope of the GLAM Labs checklist as well as provide additional examples of application.

Fourth use case

The KB, National Library of the Netherlands, has been refining its strategy for Collections as Data, with a stronger focus on implementing FAIR principles, documenting data provenance, and transparency of selection workflows. A critical question in this effort is how elements of the GLAM Labs checklist can be adapted for use at different levels: collection-specific (e.g., web collections) versus a more institution-wide, generic approach.

This presentation will:

  • Showcase the current capabilities of KB in offering web collections as data, along with their descriptions.
  • Explore the adaptability of the GLAM Labs checklist for both collection-specific applications and broader institutional use.
  • Offer a glance at our future plans concerning collections as data and invite participants to provide feedback.

TARGET AUDIENCE: researchers, web archive curators, and anyone interested in working with web archive collections

ANTICIPATED NUMBER OF PARTICIPANTS: 35

EXPECTED LEARNING OUTCOMES:

  • Understanding the steps involved in preparing digitised and born-digital collections for publication
  • Understanding the challenges involved in preparing different types of web archive collections for publication and/or sharing with researchers
  • Creating a draft checklist for publishing web archive collections as data

Bibliography

[1] Padilla, T. (2017). “On a Collections as Data Imperative”. UC Santa Barbara. pp. 1–8.

[2] https://glamlabs.io/

[3] Candela, G. et al. (2023), “A checklist to publish collections as data in GLAM institutions”, Global Knowledge, Memory and Communication. https://doi.org/10.1108/GKMC-06-2023-0195

[4] Tønnessen, J. (2024). “Web News Corpus”. National Library of Norway. https://www.nb.no/en/collection/web-archive/research/web-news-corpus/; “Apper fra DH-lab”. National Library of Norway. https://dh.nb.no/apps/.

[5] Bremnes, T., Birkenes, M., Tønnessen, J. (2024). “corpus-build”. GitHub. National Library of Norway. https://github.com/nlnwa/corpus-build; Birkenes, M., Johnsen, L., Kåsen, A. (2023). “NB DH-LAB: a corpus infrastructure for social sciences and humanities computing.” CLARIN Annual Conference Proceedings.

[6] https://www.kb.dk/en/find-materials/collections/netarkivet, linking to https://labs.statsbiblioteket.dk/netarchive/ngram/#/