
Metadata creation and enrichment using artificial intelligence at meemoo

Creating descriptive metadata is time-consuming, but essential for making digital archive material findable and searchable. For several years now, Meemoo has been exploring the possibilities of artificial intelligence to do this in a (semi-)automated way. From small-scale experiments to large-scale projects: discover the applications we use, the results and the lessons we have learned from some of our projects.

Authors

  • Bart Magnus (Expertise Officer, meemoo, Belgium)
  • Matthias Priem (Archiving Manager, meemoo, Belgium)
  • Nastasia Vanderperren (Expertise Officer, meemoo, Belgium)
  • Peter Vanden Berghe (Data Science Consultant, Sopra Steria, Belgium)
  • Ellen Van Keer (Expertise Officer, meemoo, Belgium)
  • Rony Vissers (Expertise Manager, meemoo, Belgium)

Acknowledgments to David Chambaere, Kristof Cnops, Alexander Derveaux, Jonas De Vos, Giovanni Lardenoit, Sam Legrand, Kenzo Milleville, Walter Schreppers, Joren Sips, Dieter Suls, Bram Ulrichts, Alec Van den Broeck, Brecht Van de Vyvere, Henk Vanstappen and Steven Verstockt.

This article was first published in Journal of Digital Media Management, Volume 13 (Issue 2), p. 110-123 (2025), https://doi.org/10.69554/NGFF5280.

Summary

This paper discusses the use of artificial intelligence (AI) applications for the creation and enrichment of descriptive metadata at meemoo, the Flemish Institute for Archives. We begin by explaining why we use AI tools for metadata creation and enrichment, and which AI applications we specifically employ. We then describe our evolution from small-scale pilots to large-scale projects, and how meemoo aims to transition from a project-based approach to a structural operation in using AI applications for metadata creation and enrichment. We provide details on our approach, results, lessons learned and future steps, and describe how the use of AI applications poses not only technical challenges but also raises a series of legal and ethical questions. This paper highlights our journey into AI as a service organisation supporting over 180 organisations in the cultural, media and government sectors in Flanders — not in isolation but in close collaboration with our partners, leading to shared solutions.

Meemoo and the use of AI for descriptive metadata creation and enrichment

Meemoo, the Flemish Institute for Archives

Meemoo is a nonprofit service organisation based in Ghent (Belgium), funded by the Flemish government, dedicated to supporting the digital archive operations of over 180 organisations in the cultural, media and government sectors in Flanders, that is to say, our content partners. These partners manage extensive collections and archival materials, whether yet to be digitised, already digitised or born-digital. We support them in this with various services. Our content partners include around 50 performing arts organisations, 50 museums, 28 archival institutions, seven heritage libraries, 19 heritage cells, nine regional broadcasters, three national broadcasters and eight government agencies.1

We primarily assist our content partners in digitising audiovisual material from their collections and archives, but recently have also undertaken large-scale digitisation projects for newspapers2 and photographic collections — glass plates more specifically.3 We also produce high-quality digital photographs of artworks and other types of masterpieces.4 Additionally, we organise the inflow of both digitised material and archival items, and existing digital collections, from our content partners into our archive system — to preserve, manage, use and make accessible and available for reuse by third parties.5 The meemoo archive system currently contains over 7 million digital objects, amounting to more than 26 petabytes of high-quality digital archive content. We support our content partners in making their digital archive content accessible by offering application programming interfaces (APIs). We also make selected digital archive content available through our own overarching platforms and other channels.6 Our main platform focuses on educational use, serving thousands of teachers and students in Flanders every day.7 Additionally, we actively collect and share knowledge about our work domains and digital heritage processes in general.8 We also initiate various projects with partners from different sectors.9

Creating descriptive metadata with AI

When our content partners do not have sufficient descriptive metadata for their digital archive content, it is difficult to make this material easily findable and searchable for interested users — significantly limiting its usability and valorisation.

The manual creation of descriptive metadata is a time-consuming and expensive process, especially for large volumes of audiovisual and/or photographic captures. To describe the content of audiovisual recordings, the information carriers need to be played and viewed from start to end, while photographic negatives must first be converted into a positive image before they can be easily ‘read’ by the human eye. An additional challenge is that the audiovisual and/or photographic captures often do not form the core of the collections and archives of our content partners. Furthermore, archive management is not a core activity for many of them, especially performing arts organisations.

We started exploring the possibilities of using AI tools to create descriptive metadata in a (semi-)automated way about nine years ago. AI encompasses various smart applications which, with the help of machine learning, quickly perform well-defined complex tasks that previously required significant human input, including the creation of descriptive metadata.

An overarching approach to deploying AI

Not all organisations in the cultural, media and government sectors have the knowledge and resources needed to deploy AI applications in their archive operations, and this includes a large number of meemoo’s content partners. As a service organisation, we are therefore seeking a centralised solution to serve various organisations, and our content partners in particular. A centrally managed solution also avoids the need for each individual organisation to develop its own in-depth technical expertise. A large group of organisations using shared AI applications also implies a significant scaling up, which has a positive impact on the cost per processed item or hour of archive content. This applies both to purchasing AI applications and services on the commercial market and to using the AI services we develop ourselves. We can deploy and offer AI applications in various ways. The current practice is to centrally apply these applications to the archival material we store for our content partners and share the results with them. In the longer term, we could also offer these AI applications as software as a service (SaaS) to content partners or other organisations, so they can use them independently.

It is crucial that the solution offered can be used by various interested parties in an accessible and cost-effective way. This presents several challenges, not only due to the rapid evolution of technology but also in designing and implementing efficient workflows, clarifying ethical and legal issues, and managing the applications, data and user community involved.

Harnessing AI at meemoo

At meemoo, we are currently focusing on deploying some specific AI applications, ie facial recognition, speech-to-text (STT) and optical character recognition (OCR). We are also applying named entity recognition (NER) and named entity linking (NEL) to the outputs of the latter two applications.

Facial recognition and OCR are computer vision applications that use trained algorithms to search for patterns in images on a large scale. With facial recognition, we detect and identify (relevant) individuals appearing in digitised or born-digital photos and videos (Figure 1). This enables us to determine, in a (semi-)automated way, who is present in the visual material, and to distinguish between different individuals who appear together. In videos, we can even pinpoint when each person appears on screen. The solutions used by meemoo generate descriptive metadata automatically, but the results are validated manually, eg to measure their accuracy.

Figure 1: Face detection applied to a group photo of four former Belgian prime ministers (Jean-Luc Dehaene, Leo Tindemans, Wilfried Martens, and Mark Eyskens). Original photo: Michiel Hendryckx. CC BY-SA 3.0.

STT converts spoken language from audio and video recordings into machine-readable text. OCR transforms printed text from digitised newspapers into machine-readable text. NER is a natural-language processing application that detects named entities in unstructured text, eg the results from OCR or STT, and then classifies them into predefined semantic categories, such as names of people, organisations, locations or time indications. With NEL, we link these terms to corresponding terms in online keyword lists and knowledge databases, eg Wikidata.org, allowing us to uniquely identify the people, organisations and locations and semantically enrich the data about them (Figure 2).

Figure 2: NEL uniquely identifies individuals, organisations and locations and semantically enriches the data about them.
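To make the NER and NEL steps more concrete, the sketch below detects entities in a short Dutch transcript with spaCy and looks up candidate Wikidata identifiers via the public wbsearchentities API. It is an illustration only, assuming a locally installed Dutch spaCy model; meemoo's production pipeline uses a commercial service (TextRazor, discussed later) rather than this exact combination.

```python
# Minimal NER + NEL sketch: detect named entities in a transcript with spaCy and
# look up candidate Wikidata identifiers via the public wbsearchentities API.
# Illustrative only; it assumes the Dutch model has been installed with
# `python -m spacy download nl_core_news_lg` and is not meemoo's actual pipeline.
import requests
import spacy

nlp = spacy.load("nl_core_news_lg")

def link_to_wikidata(name: str, language: str = "nl") -> str | None:
    """Return the Wikidata QID of the best search hit for a name, if any."""
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": language,
            "format": "json",
            "limit": 1,
        },
        timeout=10,
    )
    hits = response.json().get("search", [])
    return hits[0]["id"] if hits else None

transcript = "Jean-Luc Dehaene sprak in Gent over de Vlaamse regering."
for ent in nlp(transcript).ents:
    # Label names depend on the model; only person/organisation/location labels are kept.
    if ent.label_ in {"PER", "PERSON", "ORG", "LOC", "GPE"}:
        print(ent.text, ent.label_, link_to_wikidata(ent.text))
```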

We also have concrete plans to deploy AI applications for voice recognition, audio classification (including noise, spoken word or music), and generating short summary descriptions. We are aiming to explore the possibilities of these additional applications between 2024 and 2026.

Meemoo’s AI projects for descriptive metadata creation and enrichment

The initial projects for creating descriptive metadata with AI

In 2015, we used OCR to convert the printed text of 270,000 digitised pages of Belgian newspapers from the First World War into machine-readable text. In 2017, we used NER to detect person names in this output. We then used NEL to match and link these names with The Names List — a database from the In Flanders Fields Museum containing data on over 500,000 casualties from the First World War. This led to the publication of nearly 122,000 links to additional information about the individuals, each with a score indicating the reliability of the match.10,11

Between 2018 and 2020, in two projects with FOMU — the Photo Museum Antwerp and MoMu — the Fashion Museum Antwerp as content partners, and Datable as a technical partner, we studied the feasibility of using image recognition services for automating the descriptive labelling of cultural heritage objects. We explored how to use existing online image recognition services such as Google Cloud Vision, Microsoft Computer Vision (now Azure AI Vision) and Clarifai to tag and/or categorise photographic content, and how to integrate the obtained results with the existing collection metadata.12,13

In 2021–2022, together with the content partners ADVN — the Archive for National Movements, the Flemish Parliament Archive, KOERS Museum of Cycle Racing, Kunstenpunt — the Flanders Arts Institute and Ghent University’s IDLab as our technical partner, we researched the application of facial recognition on a larger scale. In the FAME (FAce MEtadata) project,14 we investigated workflows and best practices for using facial recognition to identify and tag individuals with descriptive metadata in photos and videos in a (semi-)automated way. We also explored how existing descriptive metadata could enhance the accuracy of the facial recognition results. The FAME project focused on three types of public figures: politicians and activists, cyclists and performing artists (Figure 3). For this project, we initially compiled a name list and a reference set of portrait photos for about 6,000 relevant individuals. We were ultimately able to recognise more than 78,440 faces in the collections of the four participating content partners, linked to nearly 1,700 individuals from our name list.15,16,17

Figure 3: Face detection applied to a photograph of performing artists Josse De Pauw and Fumiyo Ikeda. Original photograph: Michiel Hendryckx. CC0.

The projects conducted between 2015 and 2022 clearly demonstrated that using AI tools to create and enrich descriptive metadata in digital collection registration is not only very time- and cost-efficient but also very comprehensive.

The GIVE metadata project

The GIVE metadata project18 ran from 2021 to 2023. It focused on expanding and scaling up the use of AI applications for generating descriptive metadata for digital audiovisual archives (Figure 4).

Figure 4: In the GIVE metadata project, speech recognition, entity recognition and facial recognition were all used to enrich descriptive metadata.

The initial phase of the project involved applying STT technology to approximately 170,000 hours of audio and video files. Following this, we applied NER to identify names of individuals, organisations or locations in the output text. We then used NEL, wherever possible, to link these entities with corresponding entities in existing external open data sources, such as Wikidata.org.

In the second phase, we conducted large-scale facial recognition across 124,000 hours of video content. We had two objectives here: to detect and recognise faces within the videos and to identify relevant public figures appearing in them. It is important to note that not every detected face is of a person significant enough for us to name. To give an idea of the extent of the scale-up: in the FAME project we processed just over 154,000 photos, and in the GIVE metadata project we processed 124,000 hours of video. This amount of video results in 432,000,000 frames to process, if we keep just one frame per second.

Crucially, the GIVE metadata project was not set up as a research project. Our goal was to develop workflows that integrate with the meemoo archive system, can be scaled further and remain useful beyond the conclusion of the project. The intention was not to develop new technologies or train AI models, but to achieve high-quality descriptive metadata through the integration of existing commercial or open source solutions.

Alignment and collaboration with the partners

In total, 120 content partners were involved in the GIVE metadata project. Key to the success was ensuring that these partners accepted the new approach as a viable solution for creating descriptive metadata. A significant portion of the project was therefore dedicated to alignment and collaboration with all partners. Securing buy-in from all content partners was crucial for developing a solution suitable for everyone, as implementing facial recognition changes certain processes and practices within their organisations.

Take, for example, the reference list of relevant public figures on which we want to perform face recognition. In many cases, person authority lists are currently managed by individual organisations themselves; they are not fully standardised and are only shared to a limited extent. The development process revealed the significant benefits of having an overarching reference list of names with a reference set of images for each face. Dealing with a single shared database containing all relevant public figures reduces the complexity of the solution in terms of data management. This required a shift in how organisations use their reference lists, as all organisations can edit or add public figures in the shared database. The major advantage of this approach is that every organisation benefits from the work of other organisations. This change in mindset was facilitated by the project team in collaboration with the stakeholders.

Alongside information sessions and newsletters, we also launched an appeal for participation in a project working group, which ultimately led to us involving ten content partners more closely in the project. This group was kept up to date with detailed information on the project’s progress, and played a key role in decision-making on ethical issues, model parameterisation and the development of a shared image reference set management tool.

Facial detection, facial recognition and the role of reference sets

Facial detection

Before you can recognise a person in a video, you first need to identify which elements in the video are faces. Following extensive benchmarking, meemoo opted for a combination of two open-source models: YuNet and MediaPipe. Subsequently, a unique ‘faceprint’ (also known as a vector profile) is generated for each detected face, and meemoo used MagFace for this. These unique vector profiles allow you to measure how similar two faces are — so, for example, world-famous Belgian tennis legend Kim Clijsters will not be mistaken for the Belgian actress Maaike Cafmeyer.

Linking detected faces in videos to vector profiles allows us to track specific faces across multiple frames within the same video shot. This technique, known as ‘face tracking’, aims to group appearances of the same person’s face within a shot. When a person appears in multiple shots, these vector profiles are merged into a cluster (Figure 5), so we can identify a unique individual in the same video. Due to technical and legal considerations, we keep only the three most suitable facial croppings and their corresponding vector profiles; these are subsequently used for comparison with the faces/vector profiles in the reference set. Keeping such data to a minimum is essential to prevent the dataset from becoming so huge that the process becomes infeasible on a large scale. We do, however, meticulously record detailed information about the time intervals when a person appears in the video, so that we can provide this to users.

Figure 5: In face detection, facial croppings are first created and then grouped into clusters per person.
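As a rough illustration of the detection and ‘faceprint’ step described above, the sketch below uses OpenCV’s built-in YuNet detector and its SFace recogniser as a readily available stand-in for the MagFace embeddings meemoo actually uses; the model file names are placeholders for files from the OpenCV model zoo.

```python
# Sketch of the detect-then-embed step: YuNet finds faces in a frame and a face
# recogniser turns each one into a vector profile. OpenCV's SFace model is used
# here as a stand-in for MagFace; the .onnx paths are placeholders.
import cv2
import numpy as np

DETECTOR_MODEL = "face_detection_yunet.onnx"    # placeholder path
EMBEDDER_MODEL = "face_recognition_sface.onnx"  # placeholder path

def detect_and_embed(frame: np.ndarray) -> list[np.ndarray]:
    """Return one embedding ('faceprint') per face detected in a video frame."""
    height, width = frame.shape[:2]
    detector = cv2.FaceDetectorYN.create(DETECTOR_MODEL, "", (width, height), 0.8)
    embedder = cv2.FaceRecognizerSF.create(EMBEDDER_MODEL, "")

    _, faces = detector.detect(frame)
    embeddings = []
    for face in (faces if faces is not None else []):
        aligned = embedder.alignCrop(frame, face)   # crop and align on the detected landmarks
        embeddings.append(embedder.feature(aligned).flatten())
    return embeddings

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """A small distance means the two faceprints probably belong to the same person."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```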

Facial recognition

The final step in the process involves the actual facial recognition. This involves comparing the three representative vector profiles of an individual detected in a video against those of the public figures in the reference set. This set is crucial for bridging facial detection and recognition. Only individuals in the reference set of public figures will be recognised in the 124,000 hours of video in the meemoo archive and become findable and searchable via descriptive metadata.

The three stored vector profiles from a person detected in the video are matched with the ten most similar vector profiles from the reference set. Thanks to indexing,19 we can perform this comparison very efficiently. If the distance between the vector profiles is sufficiently small, and if the vector profiles from the video consistently match those of the same figure in the reference set, it is considered a match (Figure 6). This match is then assigned a certainty score based on the distances between the vector profiles.

Figure 6: In facial recognition, the clustered facial cutouts are matched with facial croppings from the reference set.
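The matching step can be sketched as a nearest-neighbour search. The article only says that an index is used; the example below assumes FAISS over L2-normalised vectors (so that inner product equals cosine similarity) and an illustrative acceptance threshold.

```python
# Sketch of the matching step: for each of a person's three representative vector
# profiles, retrieve the ten most similar profiles in the reference set and accept
# a match only if all three consistently point to the same public figure.
# FAISS and the 0.6 threshold are assumptions made for illustration.
import faiss
import numpy as np

def build_index(reference_vectors: np.ndarray) -> faiss.IndexFlatIP:
    """Index L2-normalised reference vectors so inner product equals cosine similarity."""
    vectors = reference_vectors.astype("float32")  # copy, C-contiguous
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def match_person(person_vectors: np.ndarray, index: faiss.IndexFlatIP,
                 reference_labels: list[str], threshold: float = 0.6) -> str | None:
    """Return the identity of a reference-set figure if all profiles agree, else None."""
    queries = person_vectors.astype("float32")
    faiss.normalize_L2(queries)
    similarities, neighbours = index.search(queries, 10)

    best_hits = [(reference_labels[neighbours[i][0]], similarities[i][0])
                 for i in range(len(queries))]
    names = {name for name, _ in best_hits}
    if len(names) == 1 and all(score >= threshold for _, score in best_hits):
        return names.pop()
    return None
```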

The face of an unidentified individual detected in the video is then linked to an identified public figure in the reference list, effectively assigning an identity to the detected person. If the distance is too great, no match is made — indicating that the person in the video is not in the reference set. The process does not end there, however, because we also use a clustering algorithm to aggregate unidentified individuals across multiple videos — allowing us to present faces of the most frequently occurring, unidentified individuals in an interface, so that content partners can attempt to name the relevant public figure concerned.
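The article does not name the clustering algorithm used to group unidentified faces across videos; the sketch below shows one common approach, DBSCAN over cosine distances, ranking clusters by the number of distinct videos in which they occur.

```python
# Illustrative clustering of unidentified faceprints across videos, so that the
# most frequently occurring unnamed individuals can be surfaced for manual naming.
# DBSCAN and its parameters are example choices, not meemoo's actual algorithm.
import numpy as np
from sklearn.cluster import DBSCAN

def most_frequent_unknowns(embeddings: np.ndarray, video_ids: list[str], top_n: int = 20):
    """Group similar faceprints and rank the clusters by how many videos they appear in."""
    labels = DBSCAN(eps=0.4, min_samples=3, metric="cosine").fit_predict(embeddings)

    videos_per_cluster: dict[int, set[str]] = {}
    for label, video_id in zip(labels, video_ids):
        if label != -1:  # -1 marks noise points that did not join any cluster
            videos_per_cluster.setdefault(label, set()).add(video_id)

    ranked = sorted(videos_per_cluster.items(), key=lambda item: len(item[1]), reverse=True)
    return ranked[:top_n]
```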

Creation of the reference set

Identifying previously unidentified but relevant individuals essentially involves adding them to the reference set. The content partner adds a few photos and the name of the public figure depicted, and after the overnight matching run, the previously unidentified individual will become an identified public figure. This not only applies to all videos in the archive system of the content partner who identified and added this relevant public figure but also to all other videos in the meemoo archive, including those from other content partners — ensuring that everyone benefits from one another’s work and expertise.

We can now also automatically select high-quality images of faces from video frames to add to the reference set, avoiding the need for content partners to search for photos in other sources, and crop and upload them to the reference set (Figure 7). The decision to add an individual to the reference set of public figures remains a manual step.

Figure 7: An archivist who recognises a person may decide to add them manually to the reference set.

Meemoo’s approach for the creation of the reference set

Given the large scale of our operation, it is particularly important to consider ethical implications in the decision to add an individual to the shared reference set of public figures. We held extensive discussions about this decision-making process with stakeholders in workgroup sessions. These conversations included content partners, AI experts, archivists and individuals with public profiles who could potentially be added to our reference set — allowing us to view the process from various perspectives and establish a shared framework of agreements. These sessions were facilitated by the Knowledge Centre Data & Society (https://data-en-maatschappij.ai/en/).

Key elements to ensure a robust framework of use for the shared reference set include:

  • guidelines on who can access the reference set and who can add public figures to it;
  • a practical guide on the legal and ethical context, answering questions such as which (types of) photos may be added; and
  • a manual explaining, among other things, what technical characteristics a representative photo must have for the reference set, eg picturing only one person, how to crop while maintaining margins, and ensuring age diversity in photos.

For management purposes, we have incorporated additional functionality into the reference set tool. Like Wikidata.org, we automatically record who has made which changes to which records. This increases transparency. Content partners can declare themselves a ‘stakeholder’ for a specific person, so each organisation can work with its ‘own’ persons list — a subset of the full reference set. The complete reference set will still be used in the next matching run, but a content partner can, for instance, decide to import only the results related to persons for which it is a ‘stakeholder’ into its own collection management system at a later stage. We have also limited the data in records — keeping mainly a small set of representative photos of each person, their name, a brief description and a link to their Wikidata entry. This approach simplifies management and prevents the tool from becoming a person name authority in itself. Content partners can add links to other internal or external sources, such as their own collection management system’s authority files or other public sources.
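For illustration, a record in the shared reference set might hold roughly the fields listed above; the sketch below is a hypothetical data model, not the actual schema of meemoo’s tool.

```python
# Hypothetical sketch of a reference-set record, based on the fields described
# above (photos, name, brief description, Wikidata link, stakeholders, edit log).
# Field names are illustrative and do not reflect meemoo's actual tool.
from dataclasses import dataclass, field

@dataclass
class ReferenceSetRecord:
    name: str                                   # the public figure's full name
    description: str                            # brief description of the person
    wikidata_uri: str                           # link to the corresponding Wikidata entry
    photos: list[str] = field(default_factory=list)          # small set of representative photos
    stakeholders: list[str] = field(default_factory=list)    # content partners that declared interest
    external_links: list[str] = field(default_factory=list)  # e.g. authority files in a partner's CMS
    edit_log: list[tuple[str, str]] = field(default_factory=list)  # (editor, timestamp) per change
```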

The decisions to use and develop AI tools at meemoo

To build or to buy?

We faced several technical questions at the outset of the GIVE metadata project, and we adopted a similar approach for each.

First, we listed the functional requirements for the desired solution — distinguishing between must-haves and nice-to-haves — for speech recognition, facial recognition and named entity recognition technologies.

Following this, we organised a market survey to evaluate what commercially available solutions could offer at the time, and also to consider the capabilities of open-source models or libraries. We weighed the cost of commercial solutions against the integration costs of open source models into our workflow, and benchmarked the quality of the models to help us select the most suitable solution based on price and quality.

Figure 8: Build or buy?

To build: Facial recognition solution

For facial recognition, we divided our search into two distinct parts. On the one hand, we aimed to find and label faces of public figures from a predetermined list, and on the other, we sought to identify the most frequently occurring yet unnamed faces of other possibly relevant (public) figures in the video content. We operated under the assumption that individuals appearing more frequently are likely more relevant, and asked domain experts — including archivists and collection registrars — to name these.

We then conducted another market survey based on these requirements, but only a limited number of providers offered a suitable facial recognition solution for our purposes. Specifically, finding and clustering frequently occurring but as yet unnamed faces, labelling them and adding them to the reference list proved to be a functionality not commonly available as standard. The cost per hour of processed video was also quite high. So, given that the FAME project had demonstrated that open source models could yield high-quality results, and as the total development cost was comparable to the expenses of using a commercial solution, we opted to develop the facial recognition solution ourselves using available open source components.

To buy: STT and NER solutions

For STT, we decided to buy a solution due to the lack of sufficiently good pre-trained models at the time. Given that meemoo operates with public funds, this required a tendering process. Besides price, quality was a key focus, in particular measured by the word error rate — alongside additional features such as speaker recognition, language support and model update frequency.

To objectively compare the potential solutions based on word error rate, we manually created 240 minutes of ground-truth data — encompassing a cross-section of our archive content, primarily in Dutch, ranging from studio and street recordings to theatre performance recordings and including various dialects. This helped us understand how the software would perform on our partners’ content. The comparison itself was carried out using the EBU Benchmark STT tool, with Speechmatics emerging as the best solution. It is interesting to note that the open source model Whisper became available during the tendering process in late 2022. We included this in our benchmarking exercise and found that it achieved good results, especially on more challenging content, but was not yet at the level of the state-of-the-art models.
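The metric behind this comparison, the word error rate, can be computed in a few lines; the sketch below uses the jiwer library on an invented Dutch example, whereas the actual benchmark was run with the EBU Benchmark STT tool.

```python
# Minimal word-error-rate check against a manually created ground-truth transcript.
# The sentences are invented examples; jiwer is used here only to illustrate the
# metric, not the EBU Benchmark STT tool used in the actual comparison.
import jiwer

ground_truth = "de burgemeester opende de tentoonstelling in het stadsarchief"
hypothesis = "de burgemeester opent de tentoonstelling in stadsarchief"

error_rate = jiwer.wer(ground_truth, hypothesis)
print(f"WER: {error_rate:.2%}")  # share of substituted, deleted and inserted words
```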

We also chose a commercially available solution for NER after a similar comparison process. We again relied on our ground-truth dataset to compare the open-source models spaCy and flair against TextRazor, a commercial contender. The transcriptions were enhanced with keywords, allowing us to measure the accuracy of NER. We found that TextRazor’s performance was on par with that of the open source models, but decided to use this commercial solution for both NER and NEL because of its capability to directly link entities to Wikidata.

Lessons learned and results

Lessons learned: The FAME project

In the FAME project,20 we discovered that there was no user-friendly, ready-made technical solution for facial recognition that cultural heritage organisations in Flanders could use individually in a cost-effective manner. Many cultural heritage organisations will, at least for the time being, have to rely on overarching solutions — and these solutions require the development of a (shared) technical infrastructure. We also found that integrating the outcomes of facial recognition as new metadata into our project partners’ information management systems still poses technical challenges.

We learned that the workflow involves two particularly time-consuming tasks: creating a reference set of photos and manually validating the identifications suggested by the AI application. Research into making these processes more time-efficient is needed. Whether the reference set can be built and used collectively across different organisations should also be explored.

A significant challenge lies in the application of facial recognition technology to vast amounts of video content. In the facial recognition process, videos are first converted into sequences of still images before they can go through a workflow similar to that used for photos, but videos often consist of 25 still images (frames) per second. So, one hour of video consists of 90,000 frames. Even if we kept only one frame per second, one hour of video would still result in 3,600 frames. Vector profiles, the mathematical representations that encode the features or characteristics of a face, require immense computational power and processing time to compute and compare (in the FAME project, facial features were assembled into 512-dimensional vectors), especially as video content often features multiple individuals. It is possible to analyse only a limited number of frames from each video shot depicting one or more people, but the number of (still) images to be analysed for facial recognition in video content is still much higher than for photo material. Video content therefore generates a much higher number of vector profiles to be compared and clustered based on similarity. Another issue for further upscaling is whether and to what extent manual validation of facial recognition outcomes is necessary and feasible.
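A quick back-of-the-envelope calculation, using only the figures mentioned above (25 frames per second, sampling at one frame per second, 512-dimensional vectors), illustrates the scale; the one-face-per-frame assumption is ours, purely for illustration.

```python
# Back-of-the-envelope scale estimate based on the figures mentioned above.
# The single-face-per-frame assumption is illustrative; real footage often shows
# several people per frame, which multiplies the number of vector profiles.
FPS_FULL = 25
SECONDS_PER_HOUR = 3_600
VECTOR_DIM = 512
BYTES_PER_FLOAT32 = 4
FACES_PER_FRAME = 1  # illustrative assumption

frames_full_rate = FPS_FULL * SECONDS_PER_HOUR   # 90,000 frames per hour of video
frames_sampled = SECONDS_PER_HOUR                # 3,600 frames per hour at 1 fps
bytes_per_hour = frames_sampled * FACES_PER_FRAME * VECTOR_DIM * BYTES_PER_FLOAT32

print(frames_full_rate, "frames/hour at full frame rate")
print(frames_sampled, "frames/hour when sampling 1 fps")
print(f"{bytes_per_hour / 1_000_000:.1f} MB of raw vector profiles per hour of video")
```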

We have observed not only that facial recognition technology is evolving rapidly, but also that vector profiles are not interoperable between different models — so they need to be recreated for each new model. It is therefore not very useful to exchange vector profiles at present; however, sharing the cropped faces and portraits (ie the reference set) could lead to significant time-efficiency gains.

Facial recognition yields the best results when there are 5–10 portrait photos for each person, ideally taken from the front and showing the individuals at various ages throughout their careers.

Results: The GIVE project

The facial detection and vector calculation step is notably computation-intensive, requiring a one-time execution that can run in parallel. In the GIVE project, we opted for a cloud infrastructure due to the high computational demand but limited duration of the process. This step took approximately ten weeks to complete. We performed the matching and clustering overnight, on a daily basis. Currently, with 124,000 hours of video processed, this takes several hours. Public figures added to the reference set the previous day are recognised in these 124,000 hours of video within 24 hours.

We used facial recognition to link 2,500 public figures from the reference set to one or more videos — enriching 40,293 videos with additional descriptive metadata. The potential for growth here is immense: we are currently identifying thousands of unnamed individuals who appear frequently and could be added to our reference set of public figures if relevant. The quality of facial recognition is exceptionally high, with errors mainly occurring in cases of twins or mistakes in the reference set.

We have also used STT to transcribe more than 110,000 of the 170,000 hours of audio and video content. We did not apply STT to audio and video content where the language detection, performed just before transcription, could not determine the language with sufficient certainty or when no speech was present in the audio or video clip. The entire corpus was processed in about four months using Speechmatics’ SaaS solution and internal integration software.

Figure 9: Using STT, more than 110,000 of the 170,000 hours of audio and video material were transcribed. The application was not used for audio and video material where language detection could not determine the language with sufficient certainty or where there was no speech present in the audio or video clip.

The general quality of the transcription is good (as also seen in the benchmark during the procurement process), but it varies depending on the sound and speech quality. Segments where the language changes or where music is present result in lower-quality transcriptions.

NER on text was performed on all transcriptions — identifying 1.7 million locations, 3.7 million persons and 1.1 million organisations. The quality conclusions are the same here: generally good, but the quality of NER decreases in line with the quality of the transcription.

Ethical aspects

The use of AI applications, especially facial recognition, raises certain ethical questions (Figure 10).21,22 The technology relies on good reference sets. An algorithm compares the vector profiles of the reference images with those of faces detected in digital archive content, predicting the likelihood that they represent the individual in question (Figure 3). Reference photos may be sourced from various places, including the public web. The ethics of extensively collecting — including ‘scraping’ — and storing photos from the web without the consent of the individuals depicted and the creators pose significant concerns. The situation varies when the photos in question are in the public domain, published under a free licence or uploaded by the individuals themselves. A tension emerges between the principle and goals of open content and ethical considerations in the broad application of facial recognition technology — which raises the complex issue of whether all open data and content should be freely used in all AI applications, including for commercial purposes.

Figure 10: An overview of the moments in the implementation of the GIVE metadata project when ethical considerations had to be made.
Figure 11: The facial recognition workflow used in the FAME project.

Facial recognition applications can often reproduce and reinforce existing racial and gender inequalities. Imbalanced training datasets risk introducing bias into algorithms, leading to the better or worse recognition of individuals based on their ethnicity or gender. It is crucial not only to prevent bias where possible but also to try to mitigate its effects and make any potential bias as visible as possible.

The AI applications utilised rely on models trained by developers, and this training demands significant computational power and energy, a considerable portion of which does not come from renewable sources. Reducing adverse effects — eg through energy-efficient workflows for analysing many hours of video material — and transparency in these processes are key points for attention. Training models, fine-tuning algorithms and validating results from large datasets also require substantial human resources, raising questions about the conditions under which data labelling is performed — whether by properly paid employees, interns or volunteers, or possibly through crowdsourcing. There are also instances, however, where so-called ‘ghost workers’ or ‘click workers’ are employed in low-wage countries and in precarious working conditions.

Deploying AI applications also involves significant legal challenges, particularly concerning personal data protection and copyright.23 We conducted extensive legal research into these subjects.24

Images are personal data, and their processing must therefore comply with personal data protection legislation. Cultural heritage organisations can invoke a task of public interest, and exceptions for archiving purposes in the public interest, if they have a legal mandate to do so. This is not always straightforward, however, especially for smaller organisations. A particular challenge with facial recognition is that photos are considered biometric data when processed using software that makes it possible to uniquely identify a person. Under data protection laws,25 biometric data constitute a special category of personal data, and the (large-scale) processing of such data requires stricter protection measures, such as a data protection impact assessment (DPIA)26 (Figure 12). This imposes additional burdens on organisations, such as the need to conduct an in-depth risk analysis beforehand. In our projects, we purposely limit the storage of vector profiles to a minimum and do not share them with third parties. There is also more legal flexibility when it comes to public figures (such as politicians and activists, performing artists and cyclists), which is where we focused our efforts.

Figure 12: The implementation of the legal risk analysis in the GIVE metadata project.

Facial recognition involves cropping faces from photos and videos, saving them in a reference set, and then using AI software to identify these individuals in digital photo and video content. Additional identifications are added as metadata to the information management systems’ search indexes. As the output from purely technical applications, the new metadata are not subject to copyright protection, unlike the photos and videos used as input, which are typically still protected — and therefore require permission to be reproduced27 or publicly displayed — for 70 years after the creator’s death. Relevant exceptions exist for heritage organisations and scientific research, however, and faces cropped from photos and videos often lack original elements for copyright protection.

Conclusions and next steps

Thanks to the GIVE metadata project, we have been able to create a significant amount of additional descriptive metadata in a uniform manner, with the collaboration of over 120 content partners. We are currently storing the obtained metadata in a specially designed toolbox — giving our content partners access to the data and insights into its quality and potential. Over the coming months, we will work with them to determine which descriptive metadata they will import into their collection management systems. This will include meta-metadata to indicate that the new descriptive metadata tags have been automatically generated. We are keeping track of which AI engine generated which metadata at what time, and we aim to make all new descriptive metadata transparent to end users in our knowledge graph infrastructure, which is ideally suited for storing complex metadata.
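To make the notion of meta-metadata concrete, a single automatically generated tag might carry provenance roughly like the hypothetical record below; the keys are illustrative and do not reflect meemoo’s actual data model or knowledge graph schema.

```python
# Hypothetical example of 'meta-metadata' attached to one automatically generated
# descriptive tag: which AI application produced it, when, and with what certainty.
# Keys and values are illustrative, not meemoo's actual data model.
from datetime import datetime, timezone

enriched_tag = {
    "value": "Jean-Luc Dehaene",
    "type": "person",
    "generated_by": "facial-recognition",         # which AI application produced the tag
    "engine": "in-house face matching pipeline",  # hypothetical engine identifier
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "confidence": 0.94,                           # certainty score reported by the engine
    "validated_by_human": False,                  # set to True after manual validation
}
```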

In the future, we will continue with projects where AI enriches the descriptive metadata of digital archive content. We are initially aiming to use large language models on the written text to distinguish higher-quality from poorer-quality transcriptions, since we cannot rely solely on the confidence scores from the transcription engine to determine transcription quality. Follow-up projects will also focus on metadata corrections and additional AI techniques to further enrich and improve the findability and searchability of digital archives, including voice recognition, audio classification and facial recognition in photos on a larger scale. Finally, we have planned a project to explore how to handle corrections to automatically generated data.

We firmly believe that employing AI for descriptive metadata creation and enrichment is an essential next step following the digitisation of heritage and can significantly enhance the findability and searchability of digital archive content. Human involvement at set points in the process helps us to achieve this within a proper ethical and legal framework.

