Scraping Data with Python and XPath

I decided to write a short post about how I use Python and XPath to extract web content. I do this often to build research data sets. This post was inspired by another blog post: Luciano Mammino – Extracting data from Wikipedia using curl, grep, cut and other shell commands.

Where Luciano uses a bunch of Linux command line tools to extract data from Wikipedia, I thought I’d demonstrate pulling the same data using Python and XPath. Once I discovered using XPath in Python, my online data collection for research became a whole lot easier! Continue reading

Tech Brief: Anonymising sensitive data with entropy and salt.

As researchers or programmers, we will often want to protect our data by anonymising sensitive information like names and addresses. To do this, we can combine pieces of user data to make an ’anonymous’ key that can be used in-place of the sensitive information. Instead of referring to “Jane Smith of Drury Lane”, Jane could have a nonsense identifier like “675AF3C”, which can be used throughout our study.

(Want more info? See security brief: Statistical Linkage Keys and Security)

Anonymising data with hashes and entropy

A common method for anonymising fields such as name and date of birth is to combine them with a hash function. But, because secure hash functions are ’deterministic’, they produce the same identifier for the same set of input data. If we have limited hash inputs, we will have a limited range of possible outputs; if we limit things too far, an attacker can run a brute force search to identify our original inputs. Continue reading

Security Brief: The Australian Census and Statistical Linkage Keys

There have been concerns among security professionals and privacy advocates about changes to the Australian 2016 Census. The biggest concern is how the ABS plans to combine your private data. The ABS will link your Census records across multiple products, services and share it with other government departments.

In the past, this has never been a problem because the ABS never used our individual name and address data. Consequently, people could answer uncomfortable questions honestly, with the knowledge that even if data were to leak, there would be no back to them.

The Census Data Statistical Linkage Key (SLK)

This year that has changed, the ABS revealed plans to assign Australians a unique identification number called a Statistical Linkage Key or SLK. Continue reading

Tech explained: What is a hash, what is brute force and are hashes secure?

Identifying Data

Security professionals often use hashes to represent data – think of it like a unique fingerprint or “key” for the data. While there are many ways to make data keys (we could assign them sequentially, or pick them at random) hashes provide a way to build a unique key from the data itself.

The purpose of a key is to allow us to reference a piece of data. Perhaps we need a key to identify movies; we could define a data key as:

- the first letter of each word in the title,
- directors initials
- and the year of release.

So, Indiana Jones and the Temple of Doom, by Steven Speilberg (1984) would have the key: IJATTODSS1984.

This key is pretty simple and easy to reverse. Because we know the key (IJATTODSS1984) and how it’s made, we can identify the movie by searching the Internet for releases in 1984, and directors with the initials S.S. This key is also not guaranteed to be unique, Continue reading

Security concerns and Census Statistical Linkage Keys explained

An in-depth explanation regarding the security surrounding statistical linkage keys, why they’re important and how their security can be compromised…

The security of the Australian 2016 Census has sparked much debate and consternation among privacy advocates and security professionals alike. At the core of these concerns is a move by the Australian Bureau of Statistics (the ABS) to start linking census records to other data. The mechanism proposed for linking records and data is a ‘random looking’ Statistical Linkage Key. We have been told that the linkage key is secure and will be ‘hashed’ to make it irreversible – but what exactly does that mean, and how does it secure your data?

Introducing the Statistical Linkage Key

Statistical Linkage Keys or SLKs have been used frequently by people doing data research, it provides some very basic anonymity, and a sanity check on the data while retaining a way of identifying an individual throughout a study.

The Australian Bureau of Statistics publishes a standard called the SLK581 cluster. It defines a method for turning “Jane Smith 01/01/2007 Female” in to random looking serial number like “MIHAN010120072”. Continue reading