1. Introduction
The Nordic and Baltic region has been at the forefront of large-scale digitization. Over the last two decades, libraries, museums, and archives have worked tirelessly to make vast troves of cultural heritage material available: newspapers and journals, parish records and census lists, personal letters and diaries, folklore collections, images, artworks, and sound recordings – examples include the digitized collection of SMK Open in Copenhagen (https://open.smk.dk), the MeMo collection of nineteenth-century Danish novels (https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1), and the ENO (Enevældens Nyheder Online) historical newspaper dataset (https://huggingface.co/datasets/JohanHeinsen/ENO). The results are extraordinary. Researchers today can draw on corpora of a size and richness unimaginable to earlier generations, and entire new fields of inquiry in the digital humanities have been made possible by this abundance.
Yet this very abundance comes with a profound complication: access does not equate to quality. Anyone who has worked with digitized cultural heritage data knows the dissonance between the sleek interface of a digital repository and the stubborn flaws beneath: OCR misreadings that garble words beyond recognition, HTR models that stumble over idiosyncratic handwriting, cataloguing practices that vary wildly across collections, and metadata records whose gaps or inconsistencies obscure entire dimensions of a source. Behind every clean search box lies a complex history of omissions and distortions – missing volumes, mislabeled genres, truncated series, or cultural biases that privilege some voices while leaving others invisible.
For researchers, this noisy abundance poses urgent methodological and interpretive challenges. What is the threshold at which error distorts analysis? How do we responsibly report findings derived from corpora whose quality we cannot fully control? And how might we recalibrate our questions to the reality of the data, rather than to an imagined ideal? For GLAM institutions, the challenges are equally pressing: digitization pipelines operate under financial, technical, and ethical constraints, and decisions about quality are also decisions about priorities. What should be digitized first? How much post-correction is enough? How can collaboration with researchers guide these choices?
This workshop, Lost in Noise? Data Quality and Representativeness in Cultural Heritage Collections, proposes to make these questions the subject of direct encounter. It will bring together researchers from digital humanities projects and representatives from GLAM institutions to discuss not only the technicalities of data quality but also its interpretive stakes. Grounded in concrete cases and everyday practices, the workshop invites reflection on how data quality shapes interpretation, collaboration, and the very contours of what can be known from digitized cultural heritage, with a view toward more transparent and inclusive digital infrastructures.
2. Rationale
The DHNB 2026 theme, Lost in Abundance: Encounters with the Non-Canonical, provides an apt framework for addressing data quality. Abundance is never neutral. In practice, the ability to access millions of pages of text or thousands of images is always mediated by infrastructures, standards, and technical processes that introduce noise. The non-canonical is often hardest hit: while well-preserved canonical works are likely to be digitized in high-quality editions, marginalized or scattered materials – folk songs, pamphlets, ephemera – often enter the digital record through OCR pipelines with high error rates, incomplete metadata, or uncertain provenance. The result is a double marginalization: sources that were once overlooked in traditional scholarship remain difficult to access, hidden this time not in archives but in the distortions of the digital.
At the same time, data quality can be generative. Working with imperfect data forces creativity: researchers develop error-tolerant methods, experiment with probabilistic models, or find ways to triangulate across multiple sources. In some cases, noise even reveals hidden histories – for example, OCR mistakes that cluster around particular letter forms can shed light on typographic conventions, or gaps in metadata can point to the blind spots of earlier cataloguing practices.
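To make this concrete, the short sketch below (in Python, purely illustrative: the sample strings are invented and the one-to-one alignment is deliberately naive) shows how character-level substitutions between a ground-truth transcription and its OCR output might be tallied, so that recurring letter-form confusions, such as the long s read as f, become countable rather than anecdotal.

from collections import Counter
from difflib import SequenceMatcher

def confusion_counts(ground_truth, ocr_output):
    # Count (true character, OCR character) substitution pairs between two strings.
    counts = Counter()
    matcher = SequenceMatcher(None, ground_truth, ocr_output, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Treat only equal-length "replace" spans as one-to-one substitutions.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for true_char, ocr_char in zip(ground_truth[i1:i2], ocr_output[j1:j2]):
                counts[(true_char, ocr_char)] += 1
    return counts

# Toy page pair (hypothetical): the long s ("ſ") misread as "f" is a classic letter-form confusion.
truth = "Huuſet ved Søen var ſtort og ſtille"
ocr = "Huufet ved Sœen var ftort og ftille"
for (true_char, ocr_char), n in confusion_counts(truth, ocr).most_common():
    print(f"{true_char!r} -> {ocr_char!r}: {n}")

Aggregated over many page pairs, such tallies can expose exactly the typographic conventions and pipeline weaknesses described above.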
The workshop therefore treats data quality not only as a problem to be solved but also as a site of methodological and interpretive opportunity. Its aim is to foster dialogue across the research–GLAM divide and to articulate a shared understanding of where challenges lie, what strategies are being attempted, and how collaboration might move us forward.
3. Workshop Format
The workshop will be a half-day event divided into two parts: one focused on the research perspective, the other on the GLAM perspective. Each part will include:
• One 30-minute keynote;
• Three 10-minute presentations;
• One 30-minute Q&A session.
A short 5-minute break will follow each keynote, and a longer 30-minute break will separate the two parts. There will be two keynote speakers, one representing the research perspective and the other the GLAM perspective. For the 10-minute presentations, a call for abstracts will invite both researchers and GLAM representatives to submit projects that address challenges relevant to the workshop theme. We invite submissions with the following focal points:
• Errors (OCR or HTR quality, technical issues);
• Gaps (representativeness, unheard voices, missing data or metadata, biases);
• Workflows (How do we work around or with errors and gaps?).
The workshop organizers will select six abstracts – three from the research perspective and three from the GLAM perspective – and invite the authors to give 10-minute presentations. In addition, registration will be open to other interested participants.
Workshop Program:
Part I – Research Perspective
• Keynote 1: Nina Tahmasebi (30 min)
• Short break (5 min)
• Session I – Research Cluster: three 10-minute presentations followed by Q&A (60 min)
Coffee break and conversations (30 min)
Part II – GLAM Perspective
• Keynote 2: Søren Bitsch Christensen (30 min)
• Short break (5 min)
• Session II – GLAM Cluster: three 10-minute presentations followed by Q&A (60 min)
Plenary wrap-up and discussion (30 min)
4. Invited Speakers
The workshop features two invited keynote speakers, who will open the two parts of the program and provide complementary perspectives on data quality in cultural heritage research and digitization.
Nina Tahmasebi (University of Gothenburg) is Professor of Computational Linguistics and a leading researcher in computational approaches to language change, semantics, and historical text analysis. Her work on long-term text dynamics and diachronic semantic models provides crucial insights into the ways data quality affects linguistic interpretation and reproducibility in large-scale digital corpora. She will deliver the Research Keynote.
Søren Bitsch Christensen (Royal Library of Denmark / Aarhus University) is Deputy Director of Cultural Heritage at the Royal Library of Denmark and Associate Professor at Aarhus University. He has long experience in coordinating large-scale digitization projects and in developing digital infrastructures that connect archives and research institutions. His perspective bridges practical digitization workflows and strategic decisions around metadata and accessibility. He will deliver the GLAM Keynote.
5. Intended Audience
The workshop addresses a wide spectrum of participants. Digital humanities researchers who work with large corpora will find a forum to share experiences and strategies. Historians, folklorists, and literary scholars will encounter methodological reflections relevant to their own engagement with noisy sources. GLAM professionals will gain direct feedback on how data quality issues shape research outcomes and will have the opportunity to articulate their own constraints and priorities. Finally, students and early-career scholars will acquire a critical awareness of data quality as a constitutive element of digital research, rather than as a hidden technicality.
6. Outcomes
The immediate outcome of the workshop will be dialogue: a deeper mutual understanding of the data quality challenges faced by researchers and GLAM partners alike. Participants will leave with practical strategies, ranging from technical tools for post-correction, to methodological approaches that tolerate or even exploit noise, to interpretive frameworks that recalibrate scholarly claims. The longer-term hope is that the workshop will seed an ongoing cross-institutional network. By bringing together researchers and GLAM professionals around a shared concern, the event may stimulate new partnerships, collaborative funding applications, or joint infrastructure initiatives.
7. Importance
Why devote a half day to data quality? Because it is the invisible infrastructure on which digital humanities research rests. Abundance alone does not guarantee insight; it can even be misleading if quality issues are ignored. By confronting noise directly, we acknowledge that data is never simply given but always made, and that the processes of making leave traces that must be understood. At the same time, there is an intellectual opportunity: rather than treating noisy data as a flaw to be corrected before “real” research can begin, the workshop invites participants to consider how noise itself might be meaningful. What does it reveal about the material history of sources, the contingencies of digitization, or the biases of archival selection? To be lost in noise is not only a frustration but a provocation to think differently.
8. Organizing Team
The workshop will be coordinated by Aarhus-based scholars (Center for Humanities Computing and Center for Digital Textual Heritage) with extensive experience in both large-scale digital/computational humanities projects and collaborations with GLAM institutions. We will ensure that the roster of presenters includes both researchers and GLAM representatives, with diversity in discipline, institution, and career stage. The organizing team includes: Yuri Bizzoni (Senior Researcher, CHC); Rie Schmidt Eriksen (Doctoral Student, CHC); Pascale Feldkamp (Doctoral Student, CHC); Marta Kipke (Postdoc, CHC); Alie Lassche (Postdoc, CHC); Kristoffer L. Nielbo (Center Director, CHC); and Katrine L. Baunvig (Center Director, Center for Digital Textual Heritage).
9. Practical Details
• Duration: Half-day workshop (4 hours including breaks)
• Format: Two keynote talks, short case presentations (for which we will try to favor smaller or under-resourced GLAM institutions), and plenary discussions
• Participation: Open to both presenters and attendees; contributions selected through an open call for abstracts
• Selection: The organizing committee will review submitted abstracts and curate a balanced program representing both research and GLAM perspectives
10. Toward a Shared Understanding of Quality and Representativeness
By the end of the workshop, participants will have exchanged methods and experiences and will have begun articulating a common language for data quality across disciplines and institutions. The aim is to move from isolated efforts – researchers fixing data on their own, or GLAM professionals facing constraints in isolation – toward shared principles that guide sustainable digitization and analysis. In this sense, the workshop aims to transform the noise that complicates our work into a signal for collaboration and reflection across digital humanities communities.
11. Call for Abstracts
Summary: This half-day workshop addresses the challenges of “noisy” abundance by bringing together scholars and GLAM professionals to discuss data quality and representativeness as both technical and interpretive concerns. Participants will be invited to share concrete experiences and strategies regarding the gaps, errors, and biases of digital collections, creating a much-needed dialogue across the research–GLAM divide. The workshop’s aim is to create space to reason about robust and reproducible practices with abundant and noisy data in digital scholarship.
We invite abstract proposals for 10-minute presentations focusing on Errors (OCR quality, misidentified documents), Gaps (representativeness, missing data or metadata, biases), or Workflows (working with noise and errors in abundant collections).
Topics of interest include, but are not limited to:
• Quality or lack of quality in GLAM digital archives
• Problems related to OCR’ed source materials
• Automatic identification of noisy documents
• Representativeness of GLAM collections
• Gaps in data or metadata collections
• Fair representations in digital archives
• Workflows and pipelines to work with large-scale noisy collections
Abstract length: Max 300 words.
Submission Deadline: 25 January 2026.
Notification of Acceptance: 10 February 2026.
Submission Method: Upload your abstract through https://www.conftool.org/dhnb2026/
Contact for Inquiries: yuri.bizzoni@cas.au.dk