Scraping Data with Python and XPath

I decided to write a short post about how I use Python and XPath to extract web content. I do this often to build research data sets. This post was inspired by another blog post: Luciano Mammino – Extracting data from Wikipedia using curl, grep, cut and other shell commands.

Where Luciano uses a bunch of Linux command line tools to extract data from Wikipedia, I thought I’d demonstrate pulling the same data using Python and XPath. Once I discovered using XPath in Python, my online data collection for research became a whole lot easier! Continue reading
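As a rough illustration of the idea, here is a minimal sketch that uses only Python's standard library. The markup, the table class and the `extract_rows` helper are all made up for the example, and `xml.etree.ElementTree` supports only a limited subset of XPath. Real web pages are rarely well-formed XML, so a third-party parser such as lxml is a common choice in practice; this excerpt does not say which library the full post uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <table class="results">
      <tr><th>Year</th><th>Winner</th></tr>
      <tr><td>2015</td><td>Alice</td></tr>
      <tr><td>2016</td><td>Bob</td></tr>
    </table>
  </body>
</html>
"""


def extract_rows(markup):
    """Return the text of each data row in the 'results' table."""
    root = ET.fromstring(markup)
    rows = []
    # XPath: every <tr> directly inside a <table> whose class is "results".
    for tr in root.findall(".//table[@class='results']/tr"):
        cells = [td.text for td in tr.findall("td")]
        if cells:  # the header row holds <th> cells only, so it is skipped
            rows.append(cells)
    return rows


print(extract_rows(PAGE))
```

The XPath expression does the work that a chain of grep and cut commands would otherwise do: it names the part of the document we want, and the parser finds it.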

Tech Brief: Anonymising sensitive data with entropy and salt.

As researchers or programmers, we will often want to protect our data by anonymising sensitive information like names and addresses. To do this, we can combine pieces of user data to make an “anonymous” key that can be used in place of the sensitive information. Instead of referring to “Jane Smith of Drury Lane”, Jane could have a nonsense identifier like “675AF3C”, which can be used throughout our study.

(Want more info? See security brief: Statistical Linkage Keys and Security)

Anonymising data with hashes and entropy

A common method for anonymising fields such as name and date of birth is to combine them with a hash function. But because secure hash functions are “deterministic”, they produce the same identifier for the same set of input data. If we have limited hash inputs, we will have a limited range of possible outputs; if we limit things too far, an attacker can run a brute force search to identify our original inputs. Continue reading
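One common way to make that brute force search harder is to mix a secret, random salt into the hash. The sketch below is an assumption about how this might look, not the method from the full post: it uses HMAC-SHA256 from the standard library, and the function names, the field separator and the sample date of birth are all hypothetical.

```python
import hashlib
import hmac
import secrets


def make_salt(n_bytes=16):
    """Generate a random secret salt; it must be stored securely and reused."""
    return secrets.token_bytes(n_bytes)


def anonymous_key(name, date_of_birth, salt):
    """Derive a deterministic identifier from the inputs and a secret salt."""
    # Normalise the name so trivial differences do not change the key.
    message = f"{name.strip().lower()}|{date_of_birth}".encode("utf-8")
    # HMAC-SHA256 is a keyed hash: without the salt, an attacker cannot
    # simply hash every plausible name and date of birth and compare.
    return hmac.new(salt, message, hashlib.sha256).hexdigest().upper()


salt = make_salt()
key = anonymous_key("Jane Smith", "1980-01-31", salt)  # hypothetical date
print(key)
```

The same inputs and salt always give the same key, so records can still be linked within a study. A shorter identifier can be had by truncating the digest, at the cost of a higher chance of two people sharing a key.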

Tech explained: What is a hash, what is brute force and are hashes secure?

Identifying Data

Security professionals often use hashes to represent data – think of a hash as a unique fingerprint or “key” for the data. While there are many ways to make data keys (we could assign them sequentially, or pick them at random), hashes provide a way to build a unique key from the data itself.

The purpose of a key is to allow us to reference a piece of data. Perhaps we need a key to identify movies; we could define a data key as:

- the first letter of each word in the title,
- the director’s initials,
- and the year of release.

So, Indiana Jones and the Temple of Doom, by Steven Spielberg (1984) would have the key: IJATTODSS1984.
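The key definition above can be sketched in a few lines of Python. The function name `movie_key` is just an illustrative choice, and the sketch assumes words are separated by spaces.

```python
def movie_key(title, director, year):
    """Build a key from title initials, director initials and release year."""
    title_initials = "".join(word[0] for word in title.split())
    director_initials = "".join(name[0] for name in director.split())
    return f"{title_initials}{director_initials}{year}".upper()


print(movie_key("Indiana Jones and the Temple of Doom", "Steven Spielberg", 1984))
# IJATTODSS1984
```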

This key is pretty simple and easy to reverse. Because we know the key (IJATTODSS1984) and how it’s made, we can identify the movie by searching the Internet for releases in 1984, and directors with the initials S.S. This key is also not guaranteed to be unique, Continue reading

Analysis – Email list integrity, 96% of organisations handle my email address appropriately

How are businesses and organisations handling your email? I know how they’re handling mine!

For about 10 years I’ve used “burnable email addresses”. These are email addresses that I can use and expire. They are unique to every relationship between me and another organisation, business or blog that I register with. This means I know who’s got my email and if they’ve leaked it. I know if they’ve shared it or if they’re spamming it.

I guess that makes me a living honeypot? But, unlike many automated honeypots that try to trap malicious users, the data from my email servers are based on real-world interactions between myself and others. Continue reading