SolrWayback as a search & discovery tool for researchers to work with web archive collections

Time: Tuesday, 28/May/2024: 8:30 – 16:30

Organisers: Anders Klindt Myrvoll, Royal Danish Library, Denmark & Jon Carlstedt Tønnessen, National Library of Norway, Norway

The archived web is often considered inaccessible to most digital humanities researchers, partly because of a lack of easy-to-use tools and/or a narrow methodology based on looking up single URLs. This is where SolrWayback becomes relevant.

SolrWayback is a powerful web application for searching and exploring archived web data (ARC/WARC files). It has an advanced search syntax that supports both full-text search and metadata search. It also has a replay service, similar to the Wayback Machine, with a toolbox for source inspection. In addition, SolrWayback offers tools for network analysis of domains, powerful visualization tools such as N-gram, tools for data insight and, not least, the ability to export data and derivatives from search results.
SolrWayback is open source and its user base is growing. By sharing knowledge of, and experience with, SolrWayback at workshops for scholars and anyone interested in working with web archive collections, we aim to build robust, resilient SolrWayback software and a thriving community around it.

During this workshop, participants will install and run the SolrWayback bundle to index WARC files. After indexing, they will be able to explore and analyze archival records in the web application, and learn how to export derivatives from the search results.

This workshop will

  1. Briefly explain how a web archive is typically created, covering key elements such as crawling, indexing, what a WARC file is, tools for viewing the archived web, and more.

  2. Present the SolrWayback solutions at the web archives in Norway and Denmark (search, discovery, QA, tools, visualization and more) to show how easily SolrWayback can be adapted to the needs of other institutions and users.

  3. Explain the ecosystem for SolrWayback.

  4. Perform a walkthrough of installing and running the SolrWayback bundle. Participants are expected to follow the installation guide and will be assisted whenever stuck.

  5. Leave participants with a fully working stack to index, discover and play back WARC files.

  6. End with an open discussion on findings from the workshop and key takeaways.
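
As a small illustration of the WARC format mentioned in item 1, the sketch below parses the header block of one uncompressed WARC record in Python. The sample record is hand-written for this example (it is not taken from a real harvest), and parse_warc_record is a hypothetical helper name; real WARC files are usually gzip-compressed and are better read with a dedicated library.

```python
# Minimal sketch: split one uncompressed WARC record into version line,
# named header fields and content block. The record below is hand-written
# for illustration only.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2024-05-28T08:30:00Z\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: 19\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n\r\n"   # content block, 19 bytes
    b"\r\n\r\n"                  # record terminator
)


def parse_warc_record(record: bytes):
    """Return (version, headers, content_block) for one WARC record."""
    # The header block ends at the first empty line.
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as URIs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the size of the content block in bytes.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


version, headers, content = parse_warc_record(SAMPLE_RECORD)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
```

The header fields (record type, target URI, capture date) are the kind of metadata that an indexer can turn into searchable fields.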

Prerequisites:

  • Participants should have a Linux, Mac or Windows computer with Java installed. To check whether Java is installed, type this in a terminal: java -version

  • Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.

  • Mac users with M1 processors (MacBook Pro 2021+) need to have Java 17 installed (later versions are not compatible with Solr 9, which is used for indexing).

  • An administrator account may be required on Windows computers.

  • Downloading the latest release of SolrWayback Bundle beforehand is recommended: https://github.com/netarchivesuite/solrwayback/releases

  • Having institutional WARC files available is a plus, but sample files can be downloaded from https://archive.org/download/testWARCfiles or you can make your own WARC files with https://archiveweb.page/ or similar tools.

  • A mix of WARC files from different harvests/years will best showcase SolrWayback's capabilities.

  • Installing SolrWayback in advance is optional (we aim to make instructional videos for different operating systems available beforehand).
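
For participants who want to script the Java check from the prerequisites, here is a minimal Python sketch. It assumes the usual `java -version` output of the form version "17.0.9" (or "1.8.0_381" for Java 8 and older); the function names are hypothetical and not part of SolrWayback.

```python
import re
import subprocess


def java_major_version(version_output: str):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and older report themselves as "1.8", "1.7", and so on.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version`; return the major version, or None if not found."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None  # no `java` executable on the PATH
    # `java -version` writes its report to stderr rather than stdout.
    return java_major_version(result.stderr + result.stdout)
```

Calling installed_java_major() returns None when no Java executable is found on the PATH, which is a quick way to spot a missing installation before the workshop.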

Target audience

  • Researchers

  • Cultural heritage experts, developers, curators, and archivists (not just web archivists), or simply anyone interested in working with or being inspired by these technologies and topics.

Anticipated number of participants:
20-30 (but fewer would be good as well – 1:1 help is difficult with many attendees, which is also a reason to install the SolrWayback bundle in advance if possible).

Length:
5 hours – full day workshop

Background & resources

Access to and interaction with web archive data are commonly described as a problem for researchers, and SolrWayback is a key to unlocking the potential of the data, archives and collections.

SolrWayback 5 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real-time full-text search, discovery, statistics extraction and visualisation, data export and playback of web archive material.
SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source.
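
Because the index lives in Solr, it can in principle also be queried over Solr's standard HTTP select handler. The sketch below only builds such a query URL and sends nothing. The host, port and core name (localhost:8983, netarchivebuilder) and the field names (url, title, crawl_date, crawl_year) are assumptions for illustration and should be checked against the local installation and its Solr schema; build_select_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed location of the Solr core; adjust to the local setup.
SOLR_SELECT = "http://localhost:8983/solr/netarchivebuilder/select"


def build_select_url(text, filters=(), rows=10,
                     fields=("url", "title", "crawl_date")):
    """Build a Solr select URL for a full-text query with optional filters.

    q, fq, rows, fl and wt are standard Solr query parameters; the field
    names passed in are assumptions about the index schema.
    """
    params = [
        ("q", text),                # main (full-text) query
        ("rows", str(rows)),        # maximum number of hits to return
        ("fl", ",".join(fields)),   # fields to include in each hit
        ("wt", "json"),             # response format
    ]
    # Each filter query narrows the result set, e.g. to one crawl year.
    params += [("fq", f) for f in filters]
    return SOLR_SELECT + "?" + urlencode(params)


print(build_select_url("climate", filters=["crawl_year:2020"], rows=5))
```

In everyday use the SolrWayback web interface issues the searches itself; a direct query like this is mainly useful for checking that the index has been populated.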

Documentation for SolrWayback in the Norwegian online archive:
https://nlnwa.github.io/research-services/docs/solrwayback

IIPC WAC 2022: SESSION 1 #2: SOLRWAYBACK AT THE ROYAL DANISH LIBRARY
https://www.youtube.com/watch?v=-q4a-edVP5E
See more at SolrWayback – presentations, examples & more