As researchers or programmers, we will often want to protect our data by anonymising sensitive information like names and addresses. To do this, we can combine pieces of user data to make an ‘anonymous’ key that can be used in place of the sensitive information. Instead of referring to “Jane Smith of Drury Lane”, we could give Jane a nonsense identifier like “675AF3C”, which can be used throughout our study.
A common method for anonymising fields such as name and date of birth is to combine them with a hash function. But because hash functions are ‘deterministic’, they always produce the same identifier for the same set of input data. If we have a limited set of possible inputs, we have an equally limited set of possible outputs; if that set is too small, an attacker can run a brute-force search to identify our original inputs.
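As a rough illustration of the idea, the sketch below hashes a name and date of birth into a short identifier. The function name `make_identifier`, the `|` separator, the choice of SHA-256 and the 7-character truncation are all assumptions made for this example rather than a recommended scheme; truncating the digest is only for readability and makes collisions more likely.

```python
import hashlib


def make_identifier(name: str, date_of_birth: str, length: int = 7) -> str:
    """Hash normalised fields into a short hexadecimal identifier (illustrative only)."""
    # Normalise the inputs so trivial differences in case or spacing
    # do not produce a different identifier for the same person.
    normalised = "|".join(part.strip().lower() for part in (name, date_of_birth))
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    # Keep only the first few hex characters for a compact key (an assumption).
    return digest[:length].upper()


first = make_identifier("Jane Smith", "1980-02-29")
second = make_identifier("  jane smith ", "1980-02-29")

# Deterministic: the same inputs always give the same identifier.
print(first, first == second)
```

The determinism that makes the identifier useful throughout a study is the same property an attacker relies on: anyone who can guess the inputs can recompute the identifier.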
The security of anonymous records relies on ‘entropy’, which is just a fancy way of saying ‘disorder’ or ‘randomness’. When you have something that’s personally identifiable like a name or an age, there really isn’t much variability, so there isn’t much entropy. But what if we combine many pieces of information to increase the entropy?
Let’s do the math:
- 22 million unique names in Australia (assume 95% have a unique full name)
- 11 million postal addresses
- 45,000 possible birth dates
Multiplied together: 22,000,000 × 11,000,000 × 45,000 ≈ 10,000,000,000,000,000,000 combinations!
It sounds like a lot of combinations! But it amounts to only about 63 bits of entropy, and a determined attacker can calculate every hash combination in a few weeks. The hashed record key built from name, address and date of birth is somewhat secure against an unsophisticated attacker, but it doesn’t stand up against someone who really wants to crack our data. For an attacker wanting to commit identity theft, ‘de-anonymising’ our data may prove very lucrative.
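The arithmetic behind that estimate can be sketched in a few lines. The three counts come from the list above; the hash rate `ASSUMED_HASHES_PER_SECOND` is purely an assumption chosen for illustration, and real attacker hardware may be faster or slower.

```python
import math

# Counts taken from the estimate in the text.
names = 22_000_000
addresses = 11_000_000
birth_dates = 45_000

# Every name can pair with every address and every birth date.
combinations = names * addresses * birth_dates

# Entropy in bits is the base-2 logarithm of the number of equally likely inputs.
entropy_bits = math.log2(combinations)

# Assumed attacker throughput (hypothetical figure, not a benchmark).
ASSUMED_HASHES_PER_SECOND = 10**13
days = combinations / ASSUMED_HASHES_PER_SECOND / 86_400

print(f"{combinations:.3e} combinations")
print(f"{entropy_bits:.1f} bits of entropy")
print(f"{days:.1f} days to try every combination at the assumed rate")
```

Under these assumptions the whole input space holds a little over 63 bits of entropy and can be exhausted in under two weeks, which is consistent with the "few weeks" ballpark.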
A big concern for record keepers is someone de-anonymising an entire database. If the anonymous record keys are poorly constructed, or low on entropy, then the attacker can simply brute force all possible keys and match them to record identifiers. Once the data is de-anonymised, the attacker can search through it for anything interesting. This is a serious data breach because it opens up all database records in one hit. There is a way of mitigating this threat with a security mechanism called a ‘salt’.
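To make the one-pass attack concrete, here is a toy sketch on a deliberately tiny input space. The helper `unsalted_key`, the sample names and the year range are all made up for illustration; a real attack would enumerate a far larger space in the same way.

```python
import hashlib
from itertools import product


def unsalted_key(name: str, birth_year: int) -> str:
    """Unsalted record key: the same inputs give the same key in every database."""
    return hashlib.sha256(f"{name}|{birth_year}".encode("utf-8")).hexdigest()


# A toy 'anonymised' database mapping record keys to sensitive study data.
database = {
    unsalted_key("jane smith", 1980): "record A",
    unsalted_key("john citizen", 1975): "record B",
}

# The attacker's guesses about which inputs are possible.
candidate_names = ["jane smith", "john citizen", "mary jones"]
candidate_years = range(1970, 1990)

# One pass over the input space builds a reverse lookup table: key -> inputs.
lookup = {
    unsalted_key(name, year): (name, year)
    for name, year in product(candidate_names, candidate_years)
}

# Every record whose inputs were in the guessed space is de-anonymised at once.
recovered = {lookup[key]: data for key, data in database.items() if key in lookup}
print(recovered)
```

The lookup table is computed once and then reused against every record, which is why a low-entropy, unsalted scheme exposes the whole database at the same time.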
A ‘salt’ is an extra piece of random information that is baked into each hashed record. If every single record has its own unique salt, then an attacker can’t crack all records in one pass. Because the salt is part of the hash input, a table of hashes computed for one record is useless for the next; an attacker must brute force each record one by one, trying every combination of our inputs together with that record’s salt before moving on to the next record. Salted data is resistant to large-scale brute-force attack, but the individual records still need enough entropy to make attacking records one by one impractical.
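A minimal sketch of per-record salting might look like the following. The helpers `salted_key` and `new_record`, the 16-byte salt length and the simple salt-then-payload concatenation are assumptions for this example, not a vetted design.

```python
import hashlib
import secrets


def salted_key(name: str, date_of_birth: str, salt: bytes) -> str:
    """Hash the record fields together with a per-record salt."""
    payload = f"{name}|{date_of_birth}".encode("utf-8")
    return hashlib.sha256(salt + payload).hexdigest()


def new_record(name: str, date_of_birth: str) -> dict:
    """Create a record with a fresh random salt stored alongside its key."""
    salt = secrets.token_bytes(16)  # 16 random bytes; the length is an assumption
    return {"salt": salt.hex(), "key": salted_key(name, date_of_birth, salt)}


record_a = new_record("jane smith", "1980-02-29")
record_b = new_record("jane smith", "1980-02-29")

# Same inputs, different salts: the keys differ, so one precomputed
# table of hashes cannot match both records.
print(record_a["key"] != record_b["key"])

# Anyone holding a record's salt can still recompute its key from the inputs.
recomputed = salted_key("jane smith", "1980-02-29", bytes.fromhex(record_a["salt"]))
print(recomputed == record_a["key"])
```

Note that the salt only forces the attacker to repeat the search for each record; it does not add entropy to the inputs themselves.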
Unfortunately, salted records come with a price: because the anonymised identifier depends on the salt, it is difficult to compare participants across multiple studies. This is because both studies would need to use the exact same hash inputs and salt values to create matching identifiers.
The punchline is that if we’re handling sensitive data and want to anonymise it, we need to understand and apply security correctly. We must have enough entropy in our anonymous identifier to make brute-force attacks difficult, we need to use secure hashes, and we need to ensure records are salted. Without enough entropy, or without salted records, our data is at risk of large-scale de-anonymising attacks.