From Data Cleanup to Linked Open Data: Hands-on with OpenRefine and Wikidata

Alicia Fagerving, Ida Nordlander

‘A Network of Places: Open Linked (Building) Data as Research Infrastructure’ is a collaborative research and development project which aims to connect museum and archival collections using linked open data practices, initially looking at data related to the geographical areas of Stockholm and Sápmi. The project is led by ArkDes (the Swedish Centre for Architecture and Design) and involves Nationalmuseum (the National Museum of Fine Arts), Tekniska museet (the National Museum of Science and Technology), the Swedish National Heritage Board’s archive, and Wikimedia Sverige.

A challenge many cultural heritage institutions face today is the vast amount of data stored in their databases. This information often remains inaccessible to the public, locked in closed systems. The lack of access for researchers, the public and other cultural heritage institutions also makes it difficult to analyze potential overlaps across collections. The aim of the project is therefore to demonstrate how cultural heritage institutions can make their data more accessible to the public and, by doing so, show the relevance of cultural heritage data by giving it further context.

ArkDes houses one of the world’s largest architectural collections, featuring photographs, architectural drawings, models, and documents. This invaluable resource reflects the development of Swedish society through architectural material. In contrast, Nationalmuseum manages a diverse array of collections that include not only architectural materials but also a rich variety of artworks, such as paintings, sculptures, and decorative arts. Tekniska museet’s collections focus on technological advancements and explore the theme of societal development, which ties in with the goals of this project. By enhancing existing data with geographical metadata, it would be possible to gain insights into the technological evolution of specific locations over time.

The Swedish National Heritage Board is Sweden’s central administrative agency for cultural heritage. Its archive houses records, drawings, photographs, and other documents related to archaeology, ancient relics, churches and other buildings, and cultural environments. The oldest documents in the archive date back to the 17th century. In addition to the physical archive, there is also a digital one based on the OAIS model. A very important part of the archive is the topographical series, in which the documents are organized according to their geographical location, which makes it possible to track what has happened at a specific location over time.

Wikimedia Sverige, the Swedish chapter of the Wikimedia movement, has extensive experience supporting cultural heritage organizations in their work with free knowledge, including research projects of this kind carried out on the Wikimedia platforms, Wikidata among them. In the project, Wikimedia Sverige contributes a user’s perspective from outside the cultural heritage field. Wikidata has already been used as a central authority hub for cultural heritage data before this project. One example is the three-year research and development project Usable Authorities for Data-driven Cultural Heritage Research. Whereas the current project focuses on building data and geographical metadata for linked open data practices, the prior project focused mainly on authoritative person data and used Wikidata as a platform for publishing and linking cultural heritage data (Fagerving, 2023).

In this project, A Network of Places: Open Linked (Building) Data as Research Infrastructure, the focus lies on publishing geographical information as linked open data and on demonstrating how such a method can become a commonly known practice within the cultural heritage sector, while establishing that information as authority data. The core idea of applying linked open data methods, in this context, is to make cultural heritage more accessible, more usable and easier to analyze. Imagine a researcher investigating a specific topic. Instead of having to consult various sources, the researcher finds all the information collected in one place, with additional links to related materials in museum collections. It is reasonable to assume that the materials held by the different cultural heritage institutions overlap to some extent where geographic information is concerned.

For example, while there is known overlap concerning the geographic area of Sweden, the vastness of the area makes detailed analysis challenging. Narrowing it down to a specific area or coordinate location would allow for more focused analysis of overlaps in materials, such as those related to a particular building and its geospatial data. Cultural heritage institutions today face the challenge of managing vast amounts of data stored in closed databases, making this information often inaccessible to the public. This lack of access hinders researchers, the public, and other institutions from analyzing potential overlaps across collections.

Suppose four institutions together hold 100 objects related to a specific city. That alone provides valuable insights, but if 30 of these objects can be linked to the same location or building, new research opportunities open up. Linked open data lets us show how a building connects to its municipality and country, so that the objects remain relevant at every geographic level. This approach helps us identify the 30 objects associated with a particular building while still viewing them within the broader collection across Sweden, thereby enhancing our understanding of their significance.
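
The overlap analysis described above can be illustrated with a minimal Python sketch. It assumes that every object record has already been matched to a place identifier; the records, object ids and place identifiers below are invented placeholders and are not taken from the project’s data or from Wikidata.

```python
from collections import defaultdict

# Hypothetical records: (institution, object id, place identifier).
# The place identifiers are placeholders, not real Wikidata QIDs.
records = [
    ("ArkDes", "ark-001", "Q_BUILDING_A"),
    ("Nationalmuseum", "nm-017", "Q_BUILDING_A"),
    ("Tekniska museet", "tm-230", "Q_BUILDING_B"),
    ("Swedish National Heritage Board", "raa-552", "Q_BUILDING_A"),
    ("ArkDes", "ark-002", None),  # not yet linked to any place
]


def group_by_place(rows):
    """Map each place identifier to the objects linked to it."""
    groups = defaultdict(list)
    for institution, object_id, place in rows:
        if place is not None:
            groups[place].append((institution, object_id))
    return dict(groups)


def shared_places(groups):
    """Keep only places that occur in more than one institution's records."""
    return {
        place: objects
        for place, objects in groups.items()
        if len({institution for institution, _ in objects}) > 1
    }


groups = group_by_place(records)
overlaps = shared_places(groups)
for place, objects in overlaps.items():
    institutions = sorted({institution for institution, _ in objects})
    print(place, "is shared by", ", ".join(institutions))
```

Once the objects share an identifier for the place, finding the overlap is a simple grouping step; the harder work lies in the matching that produces those identifiers.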

The workshop

The workshop will focus on the methods employed by the research group for identifying and enriching collection data with Wikidata. Participants will learn how to utilize the open-source tool OpenRefine to enhance their own collection data and actively contribute to improving existing data on Wikidata. This can be achieved by uploading external identifiers that link to their collections or by updating current metadata on the platform.

The target group for the workshop is mainly those who work with cultural heritage data. However, OpenRefine is a powerful tool that can be applied in a multitude of ways for data clean-up and analysis. The workshop can also be helpful for those who have previously worked with platforms such as Wikipedia and Wikidata and who want to hone their skills in the field. Participants do not need any prior experience with Wikidata, Wikipedia or OpenRefine, but should have a basic understanding of data management.

The workshop is beginner-friendly and offers, in a practical and tangible way, basic knowledge of data analysis, cleaning and harmonization. It uses existing platforms to demonstrate the possibilities of linked open data (LOD) for institutions working with cultural heritage data. After the workshop, participants will have a solid foundation for continuing to explore the possibilities of their data in the subsequent workshop organized by Eero Hyvönen and the Semantic Computing Research Group, which explores how to create a Linked Open Data service for managing one’s own data and how to apply it in Digital Humanities research and development.

The workshop will be structured into two parts:

  1. In the first part of the workshop, participants will be introduced to the ongoing research project, useful terminology in linked open data practices, and how to use Wikidata as a hub for linking and enriching data.
  2. The second part of the workshop will be practical. Participants will get the opportunity to either bring their own data to clean, analyze and match to Wikidata through OpenRefine or use a test dataset provided by the research team (this alternative is more suitable for beginners). The expected outcome of the practical part of the workshop is for participants to gain new insights into their own data and how linked open data can be applied to accomplish this. They will also gain an understanding of the Wikimedia open knowledge ecosystem.

The expected outcomes of attending the workshop are:

  • A basic understanding of how to perform data analysis in OpenRefine.
  • An understanding of how to match collection metadata to Wikidata through OpenRefine.
  • An understanding of how to enrich your own data and metadata with Wikidata.
  • An understanding of how to upload data to Wikidata and edit existing data there through OpenRefine.
  • An understanding of how to link data to Wikidata by using external identifiers.
  • An understanding of the Wikimedia open knowledge ecosystem.
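
To give a feel for what matching to Wikidata involves, the following Python sketch builds a batch of reconciliation queries for a few place names. OpenRefine does this for you in its reconciliation dialog, so nothing here needs to be written by hand during the workshop. The sketch rests on several assumptions: the endpoint address, the `queries` parameter layout of the Reconciliation Service API, the example names, and the type identifier Q41176 (to our knowledge the Wikidata item for “building”) are illustrative and should be checked against your own setup. No request is actually sent.

```python
import json
from urllib.parse import urlencode

# Assumed address of the public Wikidata reconciliation service; check the
# address configured in your own OpenRefine installation.
RECON_ENDPOINT = "https://wikidata.reconci.link/en/api"


def normalise(name):
    """Trim and collapse whitespace so near-identical names compare equal."""
    return " ".join(name.split())


def build_queries(names, type_qid=None, limit=3):
    """Build a batch of reconciliation queries keyed q0, q1, ..."""
    queries = {}
    for index, name in enumerate(names):
        query = {"query": normalise(name), "limit": limit}
        if type_qid:
            # Restrict candidates to items of this type (assumed QID).
            query["type"] = type_qid
        queries[f"q{index}"] = query
    return queries


def encode_request(queries):
    """Form-encode the batch as the 'queries' parameter of a POST body."""
    return urlencode({"queries": json.dumps(queries)})


# Hypothetical column values as they might appear in a collection export.
names = ["  Stockholms   stadshus ", "Nationalmuseum"]
queries = build_queries(names, type_qid="Q41176")
body = encode_request(queries)
print(json.dumps(queries, indent=2, ensure_ascii=False))
```

Cleaning the names before matching, as `normalise` does here in a very small way, mirrors the clean-up step that precedes reconciliation in OpenRefine. The service would answer each query with a list of candidate items, which a person then confirms or rejects.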

To-do before the workshop:

  • Bring a laptop: You’ll need a laptop to participate in the hands-on activities.
  • Register a Wikimedia user account: If you don’t have an account yet, please take a few minutes before the workshop to sign up. This will give you access to all the tools and resources we’ll be using.
  • Download OpenRefine (optional): To enhance your understanding of the tool, feel free to download OpenRefine ahead of time. It’s a powerful data management tool that we’ll be working with during the workshop. We’ll also present an alternative way of using it on a cloud platform, which does not require you to install anything locally.
  • Bring your own data (optional): If you have specific data you’d like to work on, please bring it along! This can make the exercises more relevant and tailored to your needs.

About the Organizers

Alicia Fagerving is a developer at Wikimedia Sverige (WMSE), a Swedish non-profit working to make free and open knowledge accessible to everyone. They work with cultural heritage organizations and other content owners, supporting them in their efforts to contribute to the Wikimedia platforms Wikipedia, Wikidata and Wikimedia Commons. In this role at WMSE, Alicia has worked with the National Library of Sweden, Statistics Sweden, the Swedish National Heritage Board, UNESCO Archives and many museums, helping them understand the open knowledge ecosystem and share their content with the world.

Ida Nordlander is an Assistant Curator at the Swedish Centre for Architecture and Design, specializing in digital humanities and cultural heritage. Over the past two years, she has actively participated in research projects that investigate how linked open data can enhance accessibility for researchers and other users. Her work also emphasizes the use of digital storytelling to communicate cultural heritage and the importance of fostering digital literacy within the sector to improve accessibility, data integration, and knowledge sharing.

Bibliography

Fagerving, A. (2023). Wikidata for authority control: sharing museum knowledge with the world. In Digital Humanities in the Nordic and Baltic Countries Publications (Vol. 5, Issue 1, pp. 222–239). University of Oslo Library. https://doi.org/10.5617/dhnbpub.10665


Linked Open Data for Digital Humanities Research and Applications

On the topic of using Linked Open Data in Digital Humanities, the DHNB2025 workshop program includes two complementary, independent tutorials that can be attended in sequence or separately. The first tutorial focuses on Linked Open Data production and Wikidata using the OpenRefine tool, with hands-on exercises. The second tutorial, in the afternoon, explains how to create a Linked Open Data service for one’s own data and use it in Digital Humanities research and application development.