Process
Description¶
The Process stage aims to make the data ready for analysis. This includes pre-processing, cleaning, refining, reformatting, filtering, and conducting quality control of the collected data. Done consistently, these steps also produce a reusable and consistent dataset.
📔 Recommended documentation¶
Describe your processing workflow.
→ Include all considerations listed under each task below.
→ Write about what types of sources and/or data you are creating the dataset from, their creators and historical context (provenance).
→ Revise outdated metadata of original source material.
→ Define and describe your categories and choices made to create them.
Tasks¶
Annotate your data¶
When undertaking this task, what should you consider?
Accuracy¶
- How accurate are your annotations?
Transparency¶
- What are you basing your annotations on?
Multivocality¶
- Does your annotation represent diversity adequately?
Expertise¶
- Who is annotating your data? What is their expertise and how can this impact your data annotation?
What are good practices in relation to this task?
- Document your annotation practices and further processing workflow.
- Include what technique and strategy your research uses for annotations.
- Include what categories and variables have been used for data fields.
- Explain the significance of empty fields and the meaning of any special values, if applicable.
- Include an outline of all relationships between data fields (e.g. if a dataset contains “medication” and “disease”, is that medication actually used to treat the disease? Or is it a medication that the patient is using for other reasons?). [3]
- Make use of an inter-annotator agreement (IAA) measure.
- Work with communities and collaborators to ensure diversity of perspectives and accuracy of annotations.
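One way to act on the IAA practice above is to compute an agreement score for two annotators who labelled the same items. The sketch below uses Cohen's kappa, which is only one of several common IAA measures; the guideline itself does not prescribe a particular one. The function name `cohen_kappa` and the category labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, invented for this example only.
annotator_a = ["strike", "strike", "desertion", "other", "strike", "desertion"]
annotator_b = ["strike", "desertion", "desertion", "other", "strike", "strike"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # prints "Cohen's kappa: 0.45"
```

A value of 1 means full agreement and a value near 0 means agreement no better than chance. Reporting the score alongside the annotation guidelines helps readers judge how far the categories can be applied consistently.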
Resources¶
Globalise:
- Background to annotation: https://globalise.huygens.knaw.nl/automatic-event-detection-in-early-modern-dutch-voc-documents-the-annotation-phase/
- Analysis of annotation: https://globalise.huygens.knaw.nl/yes-okay-but-what-were-they-doing/
HUB Global Labour Conflicts:
- van Kasteel, Teun; Aurich, Jens, 2024, “Amok Events in 19th Century Dutch Newspapers”, https://hdl.handle.net/10622/0WWVWT, IISH Data Collection, V4, UNF:6:YSefEtN1bDpqaOZ/lApgJg== [fileUNF]
- Läuferts, Josephine; Aurich, Jens, 2024, “Desertion Events in 19th Century Dutch Newspapers”, https://hdl.handle.net/10622/D6OXZ9, IISH Data Collection, V3, UNF:6:PLqCWvjO39KGxxQhgp2J6w== [fileUNF]
Convert data into readable format [1]¶
When undertaking this task, what should you consider?
Harmful Language¶
- Does your data contain offensive/harmful language and/or categories?
What are good practices in relation to this task?
- Use ‘preferred’ and ‘alternative’ labels to distinguish usable terminology from offensive terminology.
  - This flags to the computational model that certain words are not preferable, yet retains historicity.
- Make changes to data formats so that different datasets are compatible for integration with each other. [2]
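The preferred/alternative label practice can be sketched in code. The example below is loosely modelled on the preferred-label and alternative-label distinction used in SKOS-style vocabularies; the `Concept` class, the `annotate_term` function, and the sample entry are hypothetical and not taken from any of the vocabularies listed under Resources. The original wording is kept in the output so that historicity is retained, while a flag marks terms that are not preferable.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One vocabulary entry: a preferred label plus non-preferred alternatives."""
    pref_label: str
    alt_labels: list = field(default_factory=list)


def annotate_term(term, concepts):
    """Look up a source term and report its preferred label and a flag."""
    needle = term.casefold()
    for concept in concepts:
        if needle == concept.pref_label.casefold():
            return {"original": term, "preferred": concept.pref_label, "flagged": False}
        if any(needle == alt.casefold() for alt in concept.alt_labels):
            # The source wording is kept; the flag marks it as not preferable.
            return {"original": term, "preferred": concept.pref_label, "flagged": True}
    # Term is not covered by the vocabulary.
    return {"original": term, "preferred": None, "flagged": False}


# Illustrative entry only; consult an established vocabulary for real projects.
concepts = [Concept(pref_label="enslaved person", alt_labels=["slave"])]

print(annotate_term("Slave", concepts))
print(annotate_term("merchant", concepts))
```

Storing the result next to the source value, rather than overwriting it, keeps the historical term searchable and documents the editorial choice.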
Resources¶
Previous research to help identify harmful language:
DE-BIAS:
- Vocabulary: https://pro.europeana.eu/page/the-de-bias-vocabulary
- Identification tool: https://pro.europeana.eu/page/the-de-bias-tool
Knowledge Graph of Contentious Terminology:
- Nesterov, A., Hollink, L., van Erp, M., van Ossenbruggen, J. (2023). A Knowledge Graph of Contentious Terminology for Inclusive Representation of Cultural Heritage. In: Pesquita, C., et al. The Semantic Web. ESWC 2023. Lecture Notes in Computer Science, vol 13870. Springer, Cham. https://doi.org/10.1007/978-3-031-33455-9_30
Alternative vocabulary lists:
- Words Matter: https://www.materialculture.nl/en/publications/words-matter
- Transformative Arabic Language Guide: https://library.fes.de/pdf-files/bueros/beirut/22183-20250724.pdf
- Inclusive Terminology Glossary: https://culturalheritageterminology.co.uk/
- Indigenous Peoples Terminology: https://www.ictinc.ca/blog/indigenous-peoples-terminology-guidelines-for-usage
- DEI Controlled Vocabs List: https://docs.google.com/spreadsheets/d/19solOX6tQTYvlF4lr_JNz2WlcsA76CcK3bxvYZ8cHzg/edit?gid=0#gid=0
- Gender, Sex and Sexual Orientation: https://gsso.research.cchmc.org/#!/ (ontology)
- Homosaurus: https://homosaurus.org/ (linked data vocabulary)
Reflect on the categories and variables used¶
When undertaking this task, what should you consider?
Representation¶
- Is any data unfairly represented by the categories used?
What are good practices in relation to this task?
- Rethink or add categories, or find a new framework, so that the categories reflect your data accurately.
Resources¶
See Collection: Create usable categories and variables for your dataset
[1] Taken from RDMkit, Processing (accessed 21 August 2025).
[2] Taken from RDMkit, Processing (accessed 21 August 2025).
[3] Taken from RDMkit, Processing (accessed 21 August 2025).