GDPR: Pseudonymisation of personal data

There are various methods and techniques to protect (sensitive) personal data against unwanted access; pseudonymisation is one of them.

Pseudonymisation is different from anonymisation. Pseudonymisation (referred to as 'coding' in the previous Privacy Legislation) of personal data means that the personal data are processed in such a way that they can no longer be linked to a specific individual (the 'data subject', i.e. the natural person whose personal data are processed) without the use of additional data. This usually involves replacing the identifiers in the data with a pseudonym. The link between the data subject's identity and the pseudonym must be kept in a separate file (the keyfile, see below). Using the keyfile, the researcher can at all times access the original data and the identity of the data subject. Anonymisation, on the other hand, irreversibly removes the link between the data subject's data and the data subject's identity (see this research tip for more details).

The purpose of pseudonymisation is to create a more secure version of the dataset, at least in terms of privacy, but at the same time to preserve the possibility of (re)identification. Working with a pseudonymised dataset is not only safer than working with the original dataset, it also makes sharing or processing by other parties possible without compromising the privacy of the data subjects. In an opinion of the European Data Protection Board (EDPB), pseudonymisation is even put forward as an effective additional measure to protect personal data during international transfers.

 

Terms:

Direct identifiers: data that leads to the direct identification of a person. Examples are name, address, phone number, etc.

Indirect identifiers: data that in themselves do not lead to the identification of a person, but that allow persons to be (re)identified when combined with other data. Examples are age, gender, weight, a personal opinion, etc.

Keyfile: the document in which the link is made between the pseudonymised and the raw data (e.g. a list with the names of the data subjects and the codes (pseudonyms) that were used in the pseudonymised dataset).

 

When to pseudonymise?

When you process personal data, you have both an ethical and a legal obligation to ensure that the privacy of those involved is adequately protected at all times. The choice of which and how many security measures are required is made on the basis of both the nature of the personal data and an assessment of the risks involved in the processing. Thus, riskier processing (for example when sharing data with external parties) will have to be accompanied by a more extensive set of security measures. Similarly, when working with special categories of personal data (also referred to as "sensitive" personal data), more attention will need to be paid to additional security measures such as pseudonymisation.

 

Pseudonymisation of quantitative data

For quantitative data, pseudonymisation is a technique that is relatively easy to apply because there is a clear distinction between data (variables) with and without identifying features. Let’s take survey data as an example, where participants complete (online) surveys and where contact details (name, e-mail address, ...) and/or demographic data are often collected. The survey data itself, however, often does not contain (direct) identifying features (e.g. scores on Likert scales).

Exactly how pseudonymisation should be done is highly dependent on the dataset. In some simpler cases, it suffices to replace the direct identifier with a pseudonym and create a keyfile. Through this keyfile, the data can then be linked to an identifiable person.

There are several techniques available (see below).

Managing a keyfile also requires some technical and/or organisational precautions:

  • Store the keyfile separately from the pseudonymised research data
  • Encrypt the keyfile and share the password with at least one trusted person (e.g. the (co)promotor of the research)
  • Restrict access to the keyfile
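The simple case described above (replace the direct identifier with a pseudonym and create a keyfile) could be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `pseudonymise`, the field names, the `P-` code format and the sample data are all assumptions made for the example.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Split records into a pseudonymised dataset and a keyfile.

    Hypothetical helper: `id_field` is the column holding the direct identifier.
    """
    keyfile = {}    # pseudonym -> original identifier (to be stored separately)
    assigned = {}   # original identifier -> pseudonym (reuse for repeat entries)
    pseudonymised = []
    for record in records:
        identity = record[id_field]
        if identity not in assigned:
            # Random code, so the pseudonym cannot be derived from the name.
            pseudonym = "P-" + secrets.token_hex(4)
            while pseudonym in keyfile:  # extremely unlikely collision
                pseudonym = "P-" + secrets.token_hex(4)
            assigned[identity] = pseudonym
            keyfile[pseudonym] = identity
        # Copy every field except the direct identifier.
        row = {k: v for k, v in record.items() if k != id_field}
        row["pseudonym"] = assigned[identity]
        pseudonymised.append(row)
    return pseudonymised, keyfile


# Made-up survey data for illustration only.
survey = [
    {"name": "Alice Example", "q1": 4, "q2": 2},
    {"name": "Bob Example", "q1": 5, "q2": 3},
]
dataset, keyfile = pseudonymise(survey)
```

Only `dataset` would be used for analysis. The `keyfile` would then be stored separately, encrypted and with restricted access, as listed above; that storage step is not shown here.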

Pseudonymisation of a basic dataset. In this case, it is sufficient to pseudonymise the name, as there are not enough indirect identifiers to enable re-identification by means of the pseudonymised dataset. The data can only be re-linked to an individual via the keyfile.

 

With more complex datasets, pseudonymisation becomes a bit more difficult. Research often requires processing and analysing (extensive) demographic data to achieve the research goal. Simply replacing the direct identifiers (e.g. name) with a pseudonym may not be sufficient in this case, because combining demographic data (e.g. date of birth + gender + place of residence) may still make it possible to single out individuals in a dataset. There are two options here: (1) separating the demographic data (or all potential indirect identifiers) from the dataset, or (2) 'generalising' the data with identifying features.

Option 1 (separation of data) allows the researcher to process/analyse the research data in pseudonymised form, while the demographic data are kept in a secure environment (e.g. on a network drive with restricted access).

 

In this case we collected some demographic data. For the research purposes here, it is not desirable to lose any information (e.g. by generalising data). The safest option then is to expand the keyfile with all demographic data. Working with the pseudonymised dataset is safer, and all (additional) data remain available.
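Option 1 can be illustrated with a small sketch in which the demographic variables move into the keyfile entry together with the name. The helper `split_record`, the field names and the pseudonym `P-001` are hypothetical and only serve the example.

```python
def split_record(record, pseudonym, sensitive_fields):
    """Separate a record into a research row and an extended keyfile entry."""
    key_entry = {"pseudonym": pseudonym}
    research_row = {"pseudonym": pseudonym}
    for field, value in record.items():
        if field in sensitive_fields:
            key_entry[field] = value      # kept in the secure environment
        else:
            research_row[field] = value   # used for analysis
    return research_row, key_entry


# Assumed set of direct and indirect identifiers for this made-up dataset.
SENSITIVE = {"name", "date_of_birth", "gender", "residence"}

participant = {
    "name": "Alice Example",
    "date_of_birth": "1968-05-17",
    "gender": "F",
    "residence": "Ghent",
    "q1": 4,
}
row, key_entry = split_record(participant, "P-001", SENSITIVE)
```

No information is lost: the two parts can be rejoined on the `pseudonym` field by anyone who has access to the keyfile.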

 

In some cases, however, it’s not feasible to separate the demographic variables from the rest of the data because all variables are important for the analysis of the dataset. If we want to pseudonymise this dataset, it will be necessary to use the technique of generalisation. In concrete terms, this means generalising the variables of interest, to make the data less specific. For example, "date of birth" could be generalised to year of birth or age category. Or, a specific address could be generalised to a city or region. Note that this leads to a loss of details in the data, which is not always desirable. When pseudonymising, you will therefore always have to consider how far you can go without interfering with the research objectives.

 

In this example, we generalised the demographic data to make it less specific. A specific individual can no longer be identified in this dataset; for example, there are several women in the 50-60 age group who reside in Europe and who have the same level of education. As described, this technique leads to a loss of details.
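A rough sketch of generalisation, assuming a date of birth is reduced to a ten-year age category and a city to a broad region. The lookup table `CITY_TO_REGION`, the field names and the category width are illustrative assumptions; the appropriate level of detail depends on the research objectives.

```python
from datetime import date

# Hypothetical lookup table; a real project would define its own regions.
CITY_TO_REGION = {"Ghent": "Europe", "Lyon": "Europe", "Toronto": "North America"}


def age_category(date_of_birth, reference_date, width=10):
    """Return an age band such as '50-60' (lower bound inclusive)."""
    age = reference_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (date_of_birth.month, date_of_birth.day):
        age -= 1
    lower = (age // width) * width
    return f"{lower}-{lower + width}"


def generalise(record, reference_date):
    """Replace exact date of birth and city with coarser categories."""
    generalised = dict(record)
    date_of_birth = generalised.pop("date_of_birth")
    city = generalised.pop("city")
    generalised["age_category"] = age_category(date_of_birth, reference_date)
    generalised["region"] = CITY_TO_REGION.get(city, "Other")
    return generalised


participant = {
    "pseudonym": "P-001",
    "date_of_birth": date(1968, 5, 17),
    "city": "Ghent",
    "education": "Master",
}
result = generalise(participant, reference_date=date(2024, 1, 1))
```

The exact date of birth and city are dropped from the result, which is the loss of detail mentioned above.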

 

Regardless of the option you choose as a researcher, you need to double-check whether the dataset is sufficiently pseudonymised. To verify this, look at the dataset from the point of view of a participating individual (data subject). If you suspect that a participant can still recognise their own data in the pseudonymised dataset, then it has not been sufficiently pseudonymised!
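As a complement to this manual check (not a replacement for it), one could count how many records share each combination of indirect identifiers and flag combinations that occur too rarely. This is a simple group-size count in the spirit of k-anonymity, which is a different technique from the participant-perspective check described above. The function `risky_records`, the threshold and the sample rows are assumptions.

```python
from collections import Counter


def risky_records(rows, indirect_identifiers, minimum_group_size=2):
    """Return rows whose combination of indirect identifiers is too rare."""
    def combination(row):
        return tuple(row[field] for field in indirect_identifiers)

    counts = Counter(combination(row) for row in rows)
    return [row for row in rows if counts[combination(row)] < minimum_group_size]


# Made-up, already generalised rows.
rows = [
    {"pseudonym": "P-001", "age_category": "50-60", "gender": "F", "region": "Europe"},
    {"pseudonym": "P-002", "age_category": "50-60", "gender": "F", "region": "Europe"},
    {"pseudonym": "P-003", "age_category": "20-30", "gender": "M", "region": "Europe"},
]
flagged = risky_records(rows, ["age_category", "gender", "region"])
```

Here only the third row is flagged, because its combination of values is unique in the dataset. A flagged row suggests that further generalisation or separation may be needed; an empty result does not by itself prove that the dataset is sufficiently pseudonymised.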

For both options, it is relatively easy to anonymise the data afterwards, as it is usually sufficient to permanently delete the keyfile. If you have chosen to store the demographic data in the keyfile as well, you will of course lose this information.

 

Pseudonymisation of qualitative data

The pseudonymisation of qualitative data, such as transcripts of interviews, audio or video files, is generally less obvious and more labour-intensive. Even more than for quantitative data, the possibilities depend on the format of the data (image, speech, text,...).

 

Recordings (audio & video)

Interviews, focus groups, panel discussions, etc. are often recorded so that no details are lost. Pseudonymisation of these qualitative data is not straightforward. When a data subject is recognisably displayed, they are immediately identifiable. An individual's voice is also considered a direct identifier. A face or image can be blurred with video editing software and a voice can be made unrecognisable with audio editing software, but these edits require a certain amount of technical knowledge and a large investment of time. Moreover, unrecognisability is not always guaranteed: in some cases, technical filters can be undone. The specific vocabulary or dialect that is used can also be so unique that it is still possible to recognise an individual.

An additional challenge is that data subjects may share personal information during interviews or conversations (not necessarily linked to the focus of the study). This information, combined with other data, may make it possible to (re)identify the data subject. If (re)identification is reasonably possible, all this information should be eliminated (e.g. by editing a beep over the original sound).

Thus, the conclusion is that the pseudonymisation of audio-visual data is technically more complex and usually demands greater effort from the researcher.

An alternative strategy may be to keep the original audio and/or video files safe and to work with transcriptions. After all, textual data are relatively easier to pseudonymise (see below). Depending on the situation, the original recordings may be deleted.

Transcriptions

To further process and analyse audio and video data (e.g. of interviews, focus groups, etc.), the recordings are usually transcribed. This opens up possibilities for pseudonymisation. Both specific software for processing qualitative data (e.g. NVivo (only available in Dutch) & ATLAS.ti) and more generic software (e.g. MS Word) offer possibilities for finding and replacing specific words. Note that the use of this software requires you to know in advance which specific (parts of) words you should look for.

The following points are important:

  • Start the process of pseudonymisation as soon as the qualitative data has been collected, e.g. immediately at the start of the analysis of the images or the transcription;
  • When replacing personal data in a transcription, use the 'find and replace' function, but perform this process with the necessary attention and care so as not to overlook typos;
  • Searching for words with capital letters and numbers in a text can help find identifiable information such as a name, place name, date of birth, etc.;
  • As with pseudonymisation of quantitative data, you should take sufficient organisational and technical measures to secure the keyfile.
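The 'find and replace' step and the search for capitalised words and numbers could be sketched as below. Both helper functions, the placeholder format and the sample sentence are hypothetical. The candidate search is deliberately crude: it also flags harmless words (such as the first word of a sentence), and the replacement step does not catch misspelled names, so manual review remains necessary.

```python
import re


def replace_identifiers(text, replacements):
    """Replace each known identifier (whole words, any casing) with its pseudonym."""
    for original, pseudonym in replacements.items():
        pattern = re.compile(r"\b" + re.escape(original) + r"\b", re.IGNORECASE)
        text = pattern.sub(pseudonym, text)
    return text


def flag_candidates(text):
    """List capitalised words and tokens containing digits for manual review."""
    return re.findall(r"\b(?:[A-Z][a-z]+|\w*\d\w*)\b", text)


transcript = "Yesterday Anna told me she moved to Ghent in 1998."

# First pass: see which words might be identifying.
candidates = flag_candidates(transcript)

# Second pass: replace the identifiers that were confirmed by the researcher.
# This mapping is the part that belongs in the keyfile.
replacements = {"Anna": "[PERSON_A]", "Ghent": "[CITY_A]"}
cleaned = replace_identifiers(transcript, replacements)
```

In this example `candidates` contains "Yesterday", "Anna", "Ghent" and "1998"; deciding which of these are actually identifying is left to the researcher.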