
Process

Description

The Process stage aims to make the data ready for analysis. This includes pre-processing, cleaning, refining, reformatting, filtering, and conducting quality control of the collected data. This simultaneously creates a reusable and consistent dataset.


Describe your processing workflow.

→ Include all considerations below each task.
→ Write about what types of sources and/or data you are creating the dataset from, their creators and historical context (provenance).
→ Revise outdated metadata of original source material.
→ Define and describe your categories and choices made to create them.


Tasks

Annotate your data

When undertaking this task, what should you consider?

Accuracy

  • How accurate are your annotations?

Transparency

  • What are you basing your annotations on?

Multivocality

  • Does your annotation represent diversity adequately?

Expertise

  • Who is annotating your data? What is their expertise and how can this impact your data annotation?

What are good practices in relation to this task?

  • Document your annotation practices and further processing workflow.

    • Include what technique and strategy your research uses for annotations.

    • Include what categories and variables have been used for data fields.

    • Include an explanation of the significance of empty fields and the meaning of any special values, if applicable.

    • Include an outline of all relationships between data fields (e.g. if a dataset contains “medication” and “disease”, is that medication actually used to treat the disease? Or is it a medication that the patient is using for other reasons?). [3]

  • Make use of an inter-annotator agreement (IAA) measure.

  • Work with communities and collaborators to ensure diversity of perspectives and accuracy of annotations.
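
One common way to quantify inter-annotator agreement is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch, assuming two annotators who have labelled the same items with one category each; the function name, the category names, and the example annotations are hypothetical and not taken from any dataset listed here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: share of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: derived from how often each annotator used each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six newspaper items.
annotator_1 = ["desertion", "desertion", "other", "amok", "other", "desertion"]
annotator_2 = ["desertion", "other", "other", "amok", "other", "desertion"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A value close to 1 indicates strong agreement and a value near 0 indicates agreement no better than chance. Which threshold counts as acceptable depends on the project and should be documented together with the annotation guidelines.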


Resources

Globalise:

HUB Global Labour Conflicts:

  • van Kasteel, Teun; Aurich, Jens, 2024, “Amok Events in 19th Century Dutch Newspapers”, https://hdl.handle.net/10622/0WWVWT, IISH Data Collection, V4, UNF:6:YSefEtN1bDpqaOZ/lApgJg== [fileUNF]
  • Läuferts, Josephine; Aurich, Jens, 2024, “Desertion Events in 19th Century Dutch Newspapers”, https://hdl.handle.net/10622/D6OXZ9, IISH Data Collection, V3, UNF:6:PLqCWvjO39KGxxQhgp2J6w== [fileUNF]

Convert data into readable format [1]

When undertaking this task, what should you consider?

Harmful Language

  • Does your data contain offensive/harmful language and/or categories?

What are good practices in relation to this task?

  • Use ‘preferred’ and ‘alternative’ labels to distinguish offensive and usable terminology.

    • This flags to the computational model that certain words are not preferable, yet retains historicity.

  • Change data formats so that different datasets are compatible for integration with each other. [2]
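
The preferred/alternative label practice can be sketched as a small lookup structure. This is an illustrative sketch only: the class `Concept`, the functions `build_lookup` and `annotate`, and the placeholder terms are all hypothetical, and a real project would more likely draw its labels from an existing vocabulary such as those listed under Resources. The original wording is kept in `source_term`, so the historical record is not overwritten.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str
    preferred_label: str                      # term recommended for use
    alternative_labels: list = field(default_factory=list)  # historical or contested terms


def build_lookup(concepts):
    """Map every label (case-insensitively) to its concept."""
    lookup = {}
    for concept in concepts:
        for label in [concept.preferred_label, *concept.alternative_labels]:
            lookup[label.casefold()] = concept
    return lookup


def annotate(term, lookup):
    """Return the source term with its preferred label and a flag."""
    concept = lookup.get(term.casefold())
    if concept is None:
        return {"source_term": term, "preferred_label": None, "flagged": False}
    return {
        "source_term": term,  # original wording is retained
        "preferred_label": concept.preferred_label,
        # Flag the term when the source uses a non-preferred label.
        "flagged": term.casefold() != concept.preferred_label.casefold(),
    }


# Placeholder vocabulary; the terms are hypothetical stand-ins.
vocabulary = [
    Concept("c1", "preferred term A", ["outdated term A", "contested term A"]),
]
lookup = build_lookup(vocabulary)

result = annotate("Outdated Term A", lookup)
print(result)
```

Keeping the flag separate from the text lets downstream tools filter or display contested terms with context, without editing the source material itself.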

Resources

Previous research to help identify harmful language:

DE-BIAS:

  • Vocabulary: https://pro.europeana.eu/page/the-de-bias-vocabulary
  • Identification tool: https://pro.europeana.eu/page/the-de-bias-tool

  • Knowledge Graph of Contentious Terminology
    Nesterov, A., Hollink, L., van Erp, M., van Ossenbruggen, J. (2023). A Knowledge Graph of Contentious Terminology for Inclusive Representation of Cultural Heritage. In: Pesquita, C., et al. The Semantic Web. ESWC 2023. Lecture Notes in Computer Science, vol 13870. Springer, Cham. https://doi.org/10.1007/978-3-031-33455-9_30

Alternative vocabulary lists:


Reflect: Reflect on the categories and variables used

When undertaking this task, what should you consider?

Representation

  • Is any data unfairly represented by the categories used?

What are good practices in relation to this task?

  • Rethink or add categories, or find a new framework, so that the categories reflect your data accurately.

Resources

See Collection: Create usable categories and variables for your dataset


  1. Taken from RDMkit, Processing (accessed 21 August 2025). 

  2. Taken from RDMkit, Processing (accessed 21 August 2025). 

  3. Taken from RDMkit, Processing (accessed 21 August 2025).