Catalogues as Data in Library Practice and Research: A Use Case with the British Library Printed Catalogue of Books Published in the 15th Century

Time: Wednesday, 29/May/2024: 8:30 – 12:00


The workshop will be delivered by members of the Digital Research Team at the British Library.

  • Rossitza Atanassova, Digital Curator, has worked on major digitisation projects and supports digital scholarship activities at the Library, with a focus on access to and reuse of digital collections. As the RLUK Professional Practice Fellow 2022, she led the project with the incunabula data.

  • Harry Lloyd, Research Software Engineer, has a background in Chemistry and Data Science and supports a range of digital research projects to enable enrichment of the Library’s collections. Harry maintains the code for the derived data.

Target Audience

Maximum 20 participants; laptops required.

  • Librarians and curators interested in computational methods for working with collections metadata
  • Cultural Heritage and Higher Education professionals who support digital research projects
  • Researchers interested in library professional practice around data
  • Those with subject specific interest in incunabula collections or catalogues

Session format

The workshop will demonstrate to both cultural heritage professionals and researchers how computational approaches can be used to generate bibliographic data from printed catalogues and to disseminate it for library and research use. It will cover the methodology, process and tools used to transform printed catalogue descriptions into data for computational analysis and metadata enrichment.

Catalogues occupy a central place in the work of GLAM professionals and are an essential resource for users, including digital humanities researchers. They are not only an important aid to accessing the collections but also scholarly resources in their own right, revealing institutional cataloguing and curatorial practice. We will showcase how computational analysis of catalogue data by library professionals can reveal new insights about their holdings and historical collecting practices, and how this can contribute to the efforts of cultural heritage and Higher Education institutions to improve the accessibility and inclusivity of their collections and data for diverse audiences. We will demonstrate one such approach: a method for gaining new insights into historical cataloguing and curatorial voice through linguistic analysis.

For this workshop we use the outputs from a research project funded under the Research Libraries UK and Arts and Humanities Research Council Professional Practice Fellowship scheme that enables library professionals to be active participants in research. The project digitised and extracted descriptions from volumes 1-10 of the Catalogue of books printed in the 15th century now at the British Museum (1908-1974) and prepared the data for corpus linguistics analysis and enrichment of the British Library’s online catalogue. The incunabula catalogue dataset is published on the BL Shared Research Repository.

The workshop fits several of the conference themes and emphasises the life-cycle of creating and using cultural heritage collections as data and the importance of cross-institutional collaboration. We will reflect both on our experience of working with historical printed catalogues as data and on how the project benefited from close collaboration with printed heritage curators at the Library and Digital Humanities colleagues at the University of Southampton. The corpus linguistics approach we are using was first tested by a previous research project, which co-produced training materials with input from GLAM professionals in the UK. We have shared this method and our learning with colleagues at the British Library and the Research Libraries UK community, and we want to share our practice with the DH Nordic and Baltic community of GLAM professionals and DH researchers.


The workshop includes a combination of presentations and hands-on exercises.

  • Welcome [15 min]:

    • A brief overview of the concept of Catalogues as Data.

  • Demonstration: Transforming Printed Catalogues into Data [20 min]:

    • Use case

    • Data preparation: training data, transcription (OCR) with Transkribus and code development

    • Data outputs

  • Hands-on Exercise:

    • Guided exploration of catalogue entry extraction from transcribed text using a Jupyter notebook [40 min]

Comfort Break [30 min]

  • Demonstration: Introduction to the corpus analysis tool AntConc [20 min]:

    • Set up

    • Data import

    • Tools and settings

  • Hands-on Exercise:

    • Guided exploration of the text data with AntConc [40 min]

  • Wrap up discussion [15 min]

Technical requirements

No prior knowledge is required. Participants will need to bring their laptops and have internet access.

Before the workshop participants will be sent detailed instructions on how to download the data they will work with and how to pre-install the corpus linguistics analysis tool. Help will be offered during the session.

As an optional pre-workshop activity, participants may wish to look at the existing training materials on Computational analysis of catalogue data with AntConc.

Learning Outcomes

Participants will understand:

  • An approach to working with collection catalogues as data, from transforming digitised images into text data to exploring that text with computational tools.
  • The workflow that takes digitised images, transcribes them, and processes the transcribed text into logical units ready for corpus linguistic analysis.
  • The potential for computational analysis of catalogue records and collection metadata to provide professional and scholarly insights for library professionals and researchers alike.

Participants will be able to:

  • Use code to parse an XML file containing transcribed catalogue text into catalogue entries.
  • Load corpus text data into AntConc and carry out basic linguistic analysis.
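
To give a flavour of the first outcome above, the following is a minimal sketch of splitting transcribed lines into catalogue entries. It is not the project's actual code. It assumes the transcription is exported as PAGE XML (TextLine elements holding their text in TextEquiv/Unicode), and it uses an invented rule that an entry ends at a line beginning with a shelfmark-like token such as "IB. 123". The sample text, the SHELFMARK pattern and the function names are all hypothetical and for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample in the shape of a PAGE XML export (assumption: the real
# files follow this layout; the text below is made up for illustration).
SAMPLE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>AUTHOR ONE. Example title.</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Folio. 100 leaves.</Unicode></TextEquiv></TextLine>
      <TextLine id="l3"><TextEquiv><Unicode>IB. 123.</Unicode></TextEquiv></TextLine>
      <TextLine id="l4"><TextEquiv><Unicode>AUTHOR TWO. Another title.</Unicode></TextEquiv></TextLine>
      <TextLine id="l5"><TextEquiv><Unicode>Quarto. 50 leaves.</Unicode></TextEquiv></TextLine>
      <TextLine id="l6"><TextEquiv><Unicode>IA. 456.</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

# Hypothetical end-of-entry marker: a line starting with a shelfmark-like token.
SHELFMARK = re.compile(r"^I[ABC]\.\s*\d+")


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_lines(xml_text):
    """Yield the text of each TextLine, in document order."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only look at the line-level TextEquiv, not word-level ones.
        for child in element:
            if local_name(child.tag) != "TextEquiv":
                continue
            for unicode_el in child:
                if local_name(unicode_el.tag) == "Unicode":
                    yield (unicode_el.text or "").strip()


def split_entries(lines):
    """Group lines into entries, closing an entry at each shelfmark line."""
    entries, current = [], []
    for line in lines:
        if not line:
            continue
        current.append(line)
        if SHELFMARK.match(line):
            entries.append(" ".join(current))
            current = []
    if current:  # keep any trailing lines that lack a closing shelfmark
        entries.append(" ".join(current))
    return entries


if __name__ == "__main__":
    for number, entry in enumerate(split_entries(iter_lines(SAMPLE_XML)), start=1):
        print(number, entry)
```

Each resulting entry could then be saved as its own plain-text file, which is one way to prepare a corpus for loading into AntConc.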