Collection

Description

In the Collect stage, relevant data is collected and stored in the chosen organisational structure of the project, creating the content of the dataset. It builds on the Set Up stage, but also reflects and updates decisions made then.


Describe your collection process, metadata creation, and contextualisation documentation.

→ Include all considerations below each task.
→ Write about what types of sources and/or data you are creating the dataset from, their creators and historical context (provenance).
→ Revise outdated metadata of original source material.
→ Define and describe your categories and choices made to create them.


Tasks

Select and Curate Sources and/or Data

When undertaking this task, what should you consider?

Availability

  • What sources are available (e.g. only written sources, no oral sources; only governmental reports, no personal histories)?
  • In what way does a lack of availability of sources skew your dataset (e.g. do you only have data on certain communities and not others)?

Expertise

  • What skills and expertise does your research team possess? How does this impact the sources/data you are able to work with and the ways in which you process this data (e.g. knowledge of paleography and language/cultural contexts or knowledge of a historical context might be necessary to work with certain kinds of sources)?

Accessibility

  • What sources are accessible (e.g. not destroyed; held in the country where you work; written in a language the researcher understands; digitized and freely available online)?

Multivocality

  • Does your data allow for the inclusion of multiple perspectives?
  • What sources are you not collecting from - and would they provide different perspectives?
  • How can you incorporate diverse viewpoints within your dataset?

Representation

  • Which individuals, communities, or viewpoints are represented in your data?
  • Which are absent or underrepresented?
  • How might these representation patterns affect analysis and how will this be communicated to other users/visitors/researchers?

Historicity

  • Did you modify the data from its original source?
  • Are you using terms as they are in the source document (extraction/verbatim)? Or did you interpret content in the source document in order to capture the data point?
    e.g. Sex of a person can be explicitly stated in the document or interpreted by the data collector (through names, for example).
  • How are you handling ambiguous elements like implied gender or social status?
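
One possible way to keep extraction and interpretation apart is to store the verbatim wording of the source next to any value the data collector assigned, together with the basis for that assignment. The sketch below is only an illustration under that assumption: the class `DataPoint` and all its attribute names are hypothetical and are not prescribed by this guide or by any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataPoint:
    """One recorded value, keeping source wording and interpretation apart.

    All names here are hypothetical and chosen for illustration only.
    """
    variable: str                      # e.g. "sex"
    verbatim: Optional[str]            # wording exactly as in the source; None if not stated
    interpreted: Optional[str] = None  # value assigned by the data collector, if any
    basis: Optional[str] = None        # why the collector assigned that value

    @property
    def is_interpreted(self) -> bool:
        # True when the source did not state the value and the collector supplied one
        return self.verbatim is None and self.interpreted is not None


# Sex explicitly stated in the document
stated = DataPoint(variable="sex", verbatim="female")

# Sex not stated; interpreted by the data collector from a name
inferred = DataPoint(
    variable="sex",
    verbatim=None,
    interpreted="female",
    basis="inferred from given name",
)
```

Recording the `basis` alongside the value lets later users filter out or re-examine interpreted data points instead of treating them as source statements.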

Provenance

  • Where does your data come from? What is the context of its creation/inheritance?

What are good practices in relation to this task?

  • Document your collection process according to the documentation template you chose earlier, e.g. data-envelopes.

    • Include all the knowledge you possess about the data: its shortcomings and your collection practices and how you have modified the data (e.g. your data may not cover certain chronological periods; you may only have access to digitized files of certain types of archives; you may have normalised the spellings of concepts in your data which means that it cannot be used to research spelling variation of terms).
    • Include how you went about dealing with and adapting ambiguous/illegible information in the source.
  • Outline your collection/selection criteria for sources.

    • Discuss the reasons justifying the exclusion/inclusion of research data in a particular context.
    • If you have learned something about your data source that was not previously documented, be sure to note it so that you can convey it to future users of the dataset.
  • Reflect on how your sources impact the (research) questions you ask.

    • And vice versa: what questions can you ask with the available sources?
  • Critically assess your collection practices, in order to identify gaps and oversights.

  • Critically assess the skills and expertise of your team.

  • Actively pursue the inclusion of different perspectives – diversify your data.

    • Through e.g. use of different types of sources; different types of producers of sources; different provenances of sources; different interpretations
    • Involving communities or other specialists.
  • Find complementary data sources to the ones you use to offer an alternative or wider perspective on the data.

    • e.g. if you are working with VOC shipping lists, attempt to bring in shipping lists for the same period and region from other data sources to offer a fuller view of shipping data, and mitigate absences in each of the datasets.
  • Really know your sources: conduct research into their provenance, the reason they were created, their author(s), their time period and their significance. Understanding the context of your sources allows you to ask the right questions.


Resources

  • Hamed Taherdoost. Data Collection Methods and Tools for Research; A Step-by-Step Guide to Choose Data Collection Technique for Academic and Business Research Projects. International Journal of Academic Research in Management (IJARM), 2021, 10 (1), pp.10-38. https://hal.science/hal-03741847/document

  • GLAM Workbench. Finding GLAM Data.

  • OSF Support, How to Make a Data Dictionary

    • Helps to create an overview with definitions for your categories.
  • Pressing Matter project: https://pressingmatter.nl/

  • Background reading:

    • Quinn, Brian. “Collection Development and the Psychology of Bias.” The Library Quarterly 82, no. 3 (2012): 277–304. https://doi.org/10.1086/665933.
    • Sander Molenaar, Late Imperial China Special Issue, forthcoming publication.
    • Sarah Binta Alam Shoilee, Annastiina Ahola, Heikki Rantala, Eero Hyvönen, Victor de Boer, Jacco van Ossenbruggen, and Susan Legene. “Enhancing Provenance Research with Linked Data: A Visual Approach to Knowledge Discovery.” SemDH'25: Second International Workshop of Semantic Digital Humanities, June 1-2 2025, Portoroz, Slovenia. https://seco.cs.aalto.fi/publications/2025/shoilee-et-al-pm-sampo-2025.pdf

Create Metadata for the Data

Description: This task refers to the metadata of individual parts of the data as well as the combined dataset. It also refers to a reflection on the original metadata of the source.

When undertaking this task, what should you consider?

Accuracy

  • Is your metadata correct?
  • Is the metadata of your source (data) correct?

Transparency

  • What are you basing your metadata descriptions on?
  • Is it clear how the original source (data) metadata was created?

Representation

  • Are you providing consistent levels of detail across all descriptions? If not, what are the implications of this? And how will this be communicated to others?
  • Does the metadata of your source (data) do so?
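
Whether descriptions carry a consistent level of detail can be checked roughly by counting how often each metadata field is actually filled in. The following is a minimal sketch under that assumption; the function `field_coverage`, the field names and the two example records are invented for illustration and do not come from a real collection or metadata standard.

```python
from collections import Counter


def field_coverage(records, fields):
    """Return, per field, the share of records with a non-empty value."""
    filled = Counter()
    for record in records:
        for name in fields:
            value = record.get(name)
            # Treat None, empty strings and whitespace-only strings as missing
            if value is not None and str(value).strip():
                filled[name] += 1
    total = len(records)
    return {name: (filled[name] / total if total else 0.0) for name in fields}


# Invented example records (hypothetical field names and values)
records = [
    {"title": "Item 1", "creator": "Person A", "place": "Place X"},
    {"title": "Item 2", "creator": "", "place": None},
]

coverage = field_coverage(records, ["title", "creator", "place"])
for name, share in coverage.items():
    print(f"{name}: {share:.0%} of records described")
```

A low share for one field only shows that detail is uneven. Judging what that unevenness implies for particular communities or source types, and how to communicate it, remains a matter of interpretation.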

Multivocality

  • Does your metadata fairly describe different perspectives that exist?
  • Does the metadata of your source (data) do so?
  • How does your metadata handle contested or evolving terminology?

What are good practices in relation to this task?

  • For technical metadata: use a metadata standard suited to your project, and adapt it to include all the information you need. Using a metadata standard ensures consistency across research.[2]

    • Check the metadata standards of the repository you will use to store your research.
  • Work with collaborators from diverse groups to accurately describe your data with the historical complexity it has.[3]

  • When writing metadata: focus on the humanity of an individual before their identity/ies, e.g. always mention a person's name before their social status.[4]

  • Avoid the use of passive voice when describing oppressive relationships.

  • If, as a result of the above considerations, you find that the original source (data) metadata is not inclusive, not transparent, or factually incorrect (e.g. antiquated descriptions with outdated references to places and communities): consider how to improve these descriptions.

    • Convey actions taken in your documentation.
    • Share this information with the institutions that host the original data and metadata.

Resources


Create usable categories and variables for your dataset

When undertaking this task, what should you consider?

FAIR

  • Are your categories clearly defined and documented?
  • Can your categories be linked to established ontologies?
  • Do your categories support interoperability with other datasets?

Representation

  • Be mindful of how these categories affect your research: does creating the category ‘ethnic category’ perpetuate the views of the colonial governments - and is this what you want in your research? What are its value and implications?
  • How might users interpret or misinterpret your categories?

Multivocality

  • Do your categories represent complex data adequately?
  • Can your framework accommodate multiple perspectives/worldviews?

Historicity

  • Are you using categories as used by your sources? If so, consider elaborating your choice of categories to your users.
  • How do you balance historical accuracy with contemporary ethical considerations while avoiding presentism?[1]

Methodology (including algorithms)

  • How might your category choices introduce or amplify bias in ML applications?
  • Could your variables create problematic correlations or inferences when used in predictive models?
  • How might simplifications in categorization lead to algorithmic discrimination?
  • How might missing or unbalanced data within categories affect computational analysis?
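
A simple first look at missing or unbalanced data is to compute the share of each category, counting absent values as a category of their own. The sketch below assumes that missing values are encoded as `None`, an empty string or the string "unknown"; that encoding, the function name `category_shares` and the example values are assumptions made for illustration only.

```python
from collections import Counter


def category_shares(values, missing=(None, "", "unknown")):
    """Return the share of each category, grouping missing values together."""
    counts = Counter("<missing>" if v in missing else v for v in values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Invented example: six records with one categorical variable
values = ["A", "A", "A", "B", None, ""]

shares = category_shares(values)
for category, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{category}: {share:.1%}")
```

Shares like these do not say whether an imbalance reflects the historical situation or the survival and selection of sources, so they are best reported together with the provenance documentation rather than corrected silently.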

What are good practices in relation to this task?

  • Look at other vocabularies out there (see Resources).

  • Create and optimise your template for collecting data through the inclusion of usable categories and variables.

  • Describe in documentation what each category/variable means.

  • Base your categories on previously collected material and historical context. Do you know your data sources, and existing research about them, well enough to know the strengths and weaknesses of the categories used in the data sources themselves?

  • Create metadata flags for variables that require special handling in ML contexts.

  • Test your categorisation with diverse stakeholders to identify unforeseen biases.[5]
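
The practices above on describing each variable and flagging variables for ML contexts could be combined in a small machine-readable data dictionary. This is a sketch under stated assumptions, not a prescribed format: the class `VariableEntry`, the flag names and the two example entries are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VariableEntry:
    """One entry in a data dictionary (hypothetical structure)."""
    name: str
    definition: str               # what the category/variable means
    from_source: bool             # True if the category is used by the sources themselves
    ml_flags: List[str] = field(default_factory=list)  # special-handling notes for ML use


# Invented example entries
data_dictionary = [
    VariableEntry(
        name="occupation",
        definition="Occupation as stated in the source document.",
        from_source=True,
    ),
    VariableEntry(
        name="ethnic_category",
        definition="Label assigned by the colonial administration, recorded verbatim.",
        from_source=True,
        ml_flags=["sensitive", "do_not_use_as_predictor"],
    ),
]


def flagged_variables(entries: List[VariableEntry]) -> Dict[str, List[str]]:
    """Return only the variables that carry special-handling flags."""
    return {entry.name: entry.ml_flags for entry in entries if entry.ml_flags}


print(flagged_variables(data_dictionary))
```

Keeping definitions and flags in one structure means the documentation and the dataset can be versioned together, and anyone building a model can check the flags before selecting variables.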


Resources

Examples of vocabularies, thesauri, ontologies:

Existing categorisations:

  • Museum cataloguing: Government of Canada, Canadian Heritage. “Nomenclature for Museum Cataloging.” September 1, 2018. https://page.nomenclature.info/apropos-about.app?lang=en.
  • Biased gender language classification: Havens, Lucy. “Towards Gender Biased Language Classification: A Case Study with British English Archival Metadata Descriptions.” Paper presented at NAACL-HLT. Student Research Workshop, 2022.
  • Taxonomy of Labour Relations: Hofmeester, Karin; Lucassen, Jan; Lucassen, Leo; Stapel, Rombert; Zijdeman, Richard, 2016, “The Global Collaboratory on the History of Labour Relations, 1500-2000: Background, Set-Up, Taxonomy, and Applications”, https://hdl.handle.net/10622/4OGRAD, IISH Data Collection, V1

Background reading:

  • Bowker, Geoffrey C., and Susan Leigh Star. Sorting Things out: Classification and Its Consequences. MIT press, 2000.

Reflect: Assess data storage structure of DMP

When undertaking this task, what should you consider?

Privacy

  • Is the collected data stored safely?

What are good practices in relation to this task?

  • If necessary, reconsider your storage plans in the DMP and find alternatives.

Resources

See Set up: Write Data Management Plan


Reflect: Revisit your Set Up documentation

When undertaking this task, what should you consider?

Transparency

  • Does the documentation reflect your current thinking in your research?

What are good practices in relation to this task?

  • Update the living documents (DMP, Mission Statement, and Ethics Commitment) to reflect changes in research and thinking.

  • Update the versioning of the documentation.


Resources

See Set up


  1. With regard to presentist bias, consider reading Manjusha Kuruppath’s blogpost: https://combattingbias.huygens.knaw.nl/news/historicalbias/

  2. Adapted from Lib4RI, Collecting (accessed 27 August 2025). 

  3. Adapted from Archives for Black Lives, Anti-Racist Description Resources (2019), pp. 3-4. 

  4. Adapted from Archives for Black Lives, Anti-Racist Description Resources (2019), pp. 3-4. 

  5. For example, the project Unsilencing Colonial Archives benefited hugely from a critique of essentialism in the first version of its proposed taxonomy, offered by archivists and historians from the Dutch National Archives and Het Nieuwe Instituut.