De-Identification


Often – although not always – research data need to be collected, managed, used, shared, and potentially re-used in ways that protect human participants’ identities, i.e., in ways that do not allow the data to be associated with a specific person. During the process of soliciting informed consent from project participants, you must clearly and completely describe to them how you plan to manage, use, and potentially share the information they provide and confirm that they have understood your plans. When private/sensitive identifying information is collected as part of the research process and human participants request confidentiality, you may need to de-identify the data before sharing them with other scholars.

Direct and Indirect Identifiers

A person’s or an organization’s identity can be disclosed through direct and indirect identifiers. These identifiers may be found in the data or their documentation.

  • Direct identifiers
    • Information that is sufficient, on its own, to disclose the identity of a research participant or organization.
    • Examples: name, address, zip code, telephone number, voice, picture
  • Indirect identifiers
    • Demographic and contextual information that creates a risk of possible disclosure in combination with other information about a participant or organization (either collected as part of your project, or available elsewhere).
    • Examples: institutional affiliations, occupation, geographic region, unique values or characteristics (outliers)
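
One way to see how indirect identifiers create risk "in combination" is to count how many participants share each combination of values. The sketch below is illustrative only: the records, the field names (`occupation`, `region`, `age_band`), and the helper functions are hypothetical, not part of any particular tool. It flags participants whose combination of indirect identifiers is held by no one else in the data.

```python
from collections import Counter

# Hypothetical participant records; field names are illustrative only.
participants = [
    {"id": "P1", "occupation": "teacher", "region": "North", "age_band": "30s"},
    {"id": "P2", "occupation": "teacher", "region": "North", "age_band": "30s"},
    {"id": "P3", "occupation": "mayor", "region": "North", "age_band": "60s"},
    {"id": "P4", "occupation": "teacher", "region": "South", "age_band": "30s"},
]

# Assumed set of indirect identifiers for this example.
INDIRECT_IDENTIFIERS = ("occupation", "region", "age_band")


def combination_counts(records, fields):
    """Count how many records share each combination of field values."""
    return Counter(tuple(r[f] for f in fields) for r in records)


def unique_combinations(records, fields):
    """Return ids of records whose combination of values is held by no one else."""
    counts = combination_counts(records, fields)
    return [r["id"] for r in records if counts[tuple(r[f] for f in fields)] == 1]


flagged = unique_combinations(participants, INDIRECT_IDENTIFIERS)
print(flagged)
```

A participant flagged this way is not necessarily identifiable, and an unflagged one is not necessarily safe, since information available outside your project can also narrow things down. The count is only a prompt for closer review.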

With advance planning, you can minimize the collection of such identifying information (for example, by not asking a participant to state their name during a recorded interview).

De-identification

When project participants request confidentiality, you need to balance two things: the risk of inadvertently disclosing their private/sensitive information (and thereby linking everything else they told you to them personally), and the analytic utility of keeping that identifying information attached to the rest of the data. Qualitative social science researchers typically use smaller samples (i.e., involve fewer individuals in their work), and private/sensitive information often informs participant selection and might provide critical context. Even when this is the case, you can guard against disclosing participants’ identities through targeted and thoughtful de-identification – removing direct identifiers and reducing the precision of indirect identifiers.

Scholars who collected the data and are very familiar with the research context are best positioned to perform this delicate task. Even they, however, may not be able to de-identify the data so thoroughly that no one, including others who are intimately familiar with the research context, can identify respondents. Under these conditions, researchers may be best served by placing access controls on the data when they share them, in addition to performing de-identification.

De-identification Guidelines

Here are some steps you can take to de-identify text (e.g., interview or focus group transcripts):

  • Develop uniform de-identification rules at the beginning of your project and follow them consistently throughout the project; this is particularly important if working as part of a team.
  • Remove direct identifiers
  • Reduce precision/detail of direct and/or indirect identifiers through aggregation
    • Date of birth → year or decade
    • Town → region
  • Generalize meaning of detailed variables
    • Specific professional position → occupation or area of expertise 
  • Restrict upper or lower ranges to hide outliers
    • Group income or age into broader categories
    • A 72-year-old → grouped in “people in their 70s” or “senior citizens”
  • Combine variables
    • Individual place name → aggregated urban/rural location
    • Richmond → Virginia or DelMarVa or the South
  • Avoid blanking out or replacing items without any indication that you did so: mark where you used pseudonyms or replacements, for example with [brackets]
    • Mary → [Monica]
  • Avoid unnecessary de-identification, as removing/aggregating information can make data more difficult to interpret, distort them, or make them misleading or unusable.
  • Maintain a master log of all replacements, aggregations, or removals made and keep it in a secure location separate from the de-identified data files.
  • Develop a careful description of the de-identification process; this is part of your project documentation.
    • Ex: Dunning, Thad; Camp, Edwin. 2015. "0_ Camp_Anonymization Protocol.pdf". Brokers, voters, and clientelism: The puzzle of distributive politics. Qualitative Data Repository. https://doi.org/10.5064/F6Z60KZB/0NR0VZ.
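
Several of the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete de-identification tool: the function names, the ten-year bands, the top-coding threshold of 80, and the sample sentence and file name are all hypothetical choices made for the example. It shows date and age generalization, bracketed replacement, and a master log of the changes made.

```python
import re
from datetime import date


def birth_decade(dob: date) -> str:
    """Generalize a date of birth to its decade."""
    return f"{dob.year // 10 * 10}s"


def age_band(age: int, top: int = 80) -> str:
    """Group age into ten-year bands; ages at or above `top` share one
    open-ended band so that outliers are hidden (assumed threshold)."""
    if age >= top:
        return f"{top}+"
    return f"{age // 10 * 10}s"


def replace_identifiers(text: str, replacements: dict, log: list, source: str) -> str:
    """Replace each listed term with a [bracketed] substitute and record
    every replacement in `log`, which serves as the master log."""
    for original, substitute in replacements.items():
        marked = f"[{substitute}]"
        # Match whole words only, so "Mary" does not match inside "Maryland".
        pattern = re.compile(rf"\b{re.escape(original)}\b")
        text, count = pattern.subn(lambda m: marked, text)
        if count:
            log.append({"source": source, "original": original,
                        "replacement": marked, "count": count})
    return text


# Hypothetical rules, agreed on at the start of the project.
rules = {"Mary": "Monica", "Richmond": "Virginia"}
master_log = []  # store this separately from the de-identified files

transcript = "Mary moved to Richmond in 1998. Mary still lives there."
clean = replace_identifiers(transcript, rules, master_log, "interview_01.txt")

print(clean)
print(birth_decade(date(1952, 3, 14)), age_band(72), age_band(93))
print(master_log)
```

Simple string matching like this will miss misspellings, nicknames, and identifying details that only emerge from context, so it supplements a careful read-through of each transcript rather than replacing it. Because the log maps replacements back to the originals, it is itself identifying and needs the same protection as the raw data.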

Audio-visual data present additional de-identification challenges. Digital manipulation of audio and image files, such as voice alteration and image blurring (e.g., of faces), can remove personal identifiers. However, this practice is labor intensive and expensive, and it can compromise the analytic value of the data. It is often preferable to obtain consent to use and share unaltered data for research purposes. If you avoid mentioning identifying information while recording, sharing the unaltered recording carries less risk. When possible, consider sharing de-identified transcripts openly, while placing recordings under more stringent access controls.