Scraping Data with Python and XPath

Posted on Wed, Aug 17, 2016

I decided to write a short post about how I use Python and XPath to extract web content. I do this often to build research data sets. This post was inspired by another blog post: Luciano Mammino - Extracting data from Wikipedia using curl, grep, cut and other shell commands.

Where Luciano uses a bunch of Linux command line tools to extract data from Wikipedia, I thought I’d demonstrate pulling the same data using Python and XPath. Once I discovered using XPath in Python, my online data collection for research became a whole lot easier!

XPath to query parts of an HTML structure

XPath is a way of identifying nodes and content in an XML document structure (including HTML). You can create an XPath query to find specific tables, reference specific rows, or even find cells of a table with certain attributes. It’s a great way to slice up content on a web site.
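As a minimal sketch of those three kinds of query, the snippet below runs XPath against a tiny HTML fragment using lxml, the library used later in this post. The fragment, its class names and the people in it are invented purely for illustration.

```python
from lxml import html

# Hypothetical HTML fragment, invented for illustration only.
SAMPLE = """
<div id="content">
  <table class="results">
    <tr><th>Year</th><th>Winner</th></tr>
    <tr><td>2000</td><td class="gold"><a href="/a">Alice</a></td></tr>
    <tr><td>2004</td><td class="gold"><a href="/b">Bob</a></td></tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# A specific table, selected by its class attribute.
tables = tree.xpath('//table[@class="results"]')

# A specific row: the second <tr> of that table (XPath positions start at 1),
# and the text of its first cell.
first_year = tree.xpath('//table[@class="results"]/tr[2]/td[1]/text()')

# Cells that carry a particular attribute value, and the link text inside them.
gold_names = tree.xpath('//td[@class="gold"]/a/text()')

print(len(tables))
print(first_year)
print(gold_names)
```

Every query returns a list, so an expression that matches nothing gives an empty list rather than an error.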

We’ll start with the target URL https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo. From it, we’ll extract the HTML document elements and identify our medalists using the table structure on the page.

Use an IDE!

I highly recommend you do this in a good Integrated Development Environment (IDE). PyCharm is the best I’ve seen for Python development, and there’s a free Community Edition (scroll to the bottom to see it). If you think you’re too hardcore for that, go for it with a text editor; whatever floats your boat.

In PyCharm I set up the basic URL download, set a breakpoint and then, in debug mode, evaluate expressions until I home in on my target content.

Python to grab HTML content

The first bit of Python code just pulls in the web page as a string, and creates an XML tree out of it, so we can use the data with XPath:

import requests
from lxml import html

pageContent=requests.get(
    'https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo'
    )
tree = html.fromstring(pageContent.content)
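The download step can be made a little more defensive. The sketch below is one possible way to do it, not part of the original script: the helper names fetch_tree and parse_page are hypothetical, and the 10-second timeout is an arbitrary choice.

```python
import requests
from lxml import html

URL = 'https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo'


def parse_page(content):
    """Build an lxml element tree from raw HTML bytes or text."""
    return html.fromstring(content)


def fetch_tree(url, timeout=10):
    """Download a page and return its parsed tree.

    The timeout value is an assumption made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx reply instead of parsing an error page.
    response.raise_for_status()
    return parse_page(response.content)

# Usage (performs a real network request):
# tree = fetch_tree(URL)
```

Keeping the parsing in its own function means the XPath queries can be tried against a saved copy of the page without downloading it again.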

Using Chrome to identify elements and XPaths

Now we need to know what to extract. An easy way to work out the approximate XPath query is to open the page in the Chrome web browser, right-click an element of interest and choose “Inspect Element”. You’ll get a bunch of data on the side about the element content:

[Screenshot: Copy XPath using Chrome Inspect Element]
Copying an XPath using the Chrome browser and 'Inspect Element' gives us a good starting point.

On the HTML element of interest, right-click and select Copy -> Copy XPath. This gives you an XPath expression that references that one very specific element.
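A copied path usually needs loosening before it is useful. The sketch below uses an invented fragment and a path shaped like what the browser might produce (both are assumptions for illustration). It also shows one thing to watch for: browsers add a <tbody> element to tables in the DOM even when the page source has none, while lxml parses the raw source and never sees it.

```python
from lxml import html

# Invented fragment; note there is no <tbody> in the source.
SAMPLE = """
<div id="main">
  <table>
    <tr><td>1980</td><td><a href="/x">Ann</a></td></tr>
    <tr><td>1984</td><td><a href="/y">Ben</a></td></tr>
  </table>
</div>
"""
tree = html.fromstring(SAMPLE)

# Shape of a browser-copied path: pinned to one row, and including the
# <tbody> step that exists in the browser DOM but not in this source.
copied = '//*[@id="main"]/table/tbody/tr[1]/td[2]/a'
copied_matches = tree.xpath(copied)

# Loosened version: drop /tbody, drop the row index so every row matches,
# and ask for the link text rather than the element.
general = '//*[@id="main"]/table/tr/td[2]/a[1]/text()'
general_matches = tree.xpath(general)

print(copied_matches)
print(general_matches)
```

If the page source does contain <tbody>, the step has to stay in the query, so it is worth checking the raw HTML rather than the browser's view of it.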

Hot Tip! Code in debug break-mode

Insert a breakpoint and debug your code so you can test the copied XPath query in the ‘evaluate expression’ window. You’ll need to play around with your query to make sure you’re getting the results you want.

[Screenshot: XPath testing using ‘evaluate expression’ in PyCharm debug mode]
Using debug mode in PyCharm we can insert breakpoints and evaluate expressions. This is really handy when writing parsers and scrapers. Note the evaluated expression result includes an href to 'Thierry Rey', a judo gold medalist, so we know we're on the right track!
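Outside an IDE, the same try-it-and-look loop can be sketched with a small helper run from a Python prompt. The function name preview and the sample fragment are hypothetical, made up for this example.

```python
from lxml import html


def preview(tree, query, limit=5):
    """Return (match count, first few matches) for an XPath query."""
    matches = tree.xpath(query)
    return len(matches), matches[:limit]


# Invented fragment standing in for a downloaded page.
tree = html.fromstring(
    '<table>'
    '<tr><td>1980</td><td><a href="/wiki/Ann">Ann</a></td></tr>'
    '<tr><td>1984</td><td><a href="/wiki/Ben">Ben</a></td></tr>'
    '</table>'
)

# Try a query, look at the result, adjust, repeat.
hrefs = preview(tree, '//tr/td[2]/a[1]/@href')
names = preview(tree, '//tr/td[2]/a[1]/text()')
wrong = preview(tree, '//tr/td[9]/a[1]/text()')  # wrong column: no matches

print(hrefs)
print(names)
print(wrong)
```

A count of zero is the usual sign that one step in the path does not match the document as parsed.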

Once we’re happy that we have the correct data coming out of our XPath query, we can bang the rest out in Python. This example selects gold, silver and bronze medalists separately but, to reproduce Luciano’s results, we’ll combine them all into a single list:

goldWinners=tree.xpath(
    '//*[@id="mw-content-text"]/table/tr/td[2]/a[1]/text()')
silverWinners=tree.xpath(
    '//*[@id="mw-content-text"]/table/tr/td[3]/a[1]/text()')
#bronzeWinners: we need the cells where there's no rowspan=2 - note the not() in the XPath
bronzeWinners=tree.xpath(
    '//*[@id="mw-content-text"]/table/tr/td[not(@rowspan=2)]/a[1]/text()')

medalWinners=goldWinners+silverWinners+bronzeWinners

XPath looks a bit messy, but if you work backwards with me, it’s just saying:

  • “Get me the text() node of the first [1] anchor <a> element
    • in the second [2] <td> of every <tr> of each <table>
      - but only within the element whose id attribute [@id] has the value “mw-content-text”.”
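To see the three queries at work without downloading anything, here is a sketch run against an invented miniature table. Its layout is an assumption inferred from the queries themselves: one Games entry spans two rows, the gold and silver cells use rowspan="2", and each row holds one bronze medalist. All names in it are placeholders.

```python
from lxml import html

# Invented miniature of the layout the queries appear to assume.
SAMPLE = """
<div id="mw-content-text">
  <table>
    <tr>
      <td rowspan="2">Games A</td>
      <td rowspan="2"><a href="#">Gold A</a></td>
      <td rowspan="2"><a href="#">Silver A</a></td>
      <td><a href="#">Bronze A1</a></td>
    </tr>
    <tr>
      <td><a href="#">Bronze A2</a></td>
    </tr>
  </table>
</div>
"""
tree = html.fromstring(SAMPLE)

# Common prefix: rows of tables directly inside the element with this id.
base = '//*[@id="mw-content-text"]/table/tr'

# Second and third cells of a row hold gold and silver.
gold = tree.xpath(base + '/td[2]/a[1]/text()')
silver = tree.xpath(base + '/td[3]/a[1]/text()')
# Bronze cells are the ones without rowspan="2"; they occur on both rows.
bronze = tree.xpath(base + '/td[not(@rowspan=2)]/a[1]/text()')

print(gold)
print(silver)
print(bronze)
```

The second row has only one cell, so td[2] and td[3] find nothing there, while the not(@rowspan=2) test picks up the bronze cell on each row.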

Post process extracted data

Finally, we insert our tested XPath into our code, and the rest is straightforward Python. We can retrieve, manipulate and run calculations on any of the list content. To reproduce Luciano’s output, we’ll build a final list with total medal counts:

medalWinners=goldWinners+silverWinners+bronzeWinners

medalTotals={}
for name in medalWinners:
    # "in" does the job of the old has_key() method, which Python 3 removed
    if name in medalTotals:
        medalTotals[name]=medalTotals[name]+1
    else:
        medalTotals[name]=1

And we're done. Print the results!

for result in sorted(medalTotals.items(), key=lambda x:x[1],reverse=True):
    # each result is a (name, total) pair; print(...) runs on Python 2.7 and 3
    print('%s:%s' % result)
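The tallying and sorting can also be sketched with collections.Counter from the standard library. This is shown as an alternative, not the approach this post uses. The sample names are invented; in the real script the input would be the medalWinners list.

```python
from collections import Counter

# Invented stand-in for the combined medalWinners list.
medal_winners = ['Ann', 'Ben', 'Ann', 'Cy', 'Ann', 'Ben']

# Counter tallies how many times each name appears.
medal_totals = Counter(medal_winners)

# most_common() yields (name, total) pairs, highest total first.
for name, total in medal_totals.most_common():
    print('%s:%s' % (name, total))
```

Counter is a dict subclass, so medal_totals['Ann'] works just like a lookup in the hand-built dictionary.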

Results

It worked! We get the same results as Luciano using just over a dozen lines of Python code!

$ /usr/local/bin/python2.7 judograbber.py
 Driulis González:4
 Angelo Parisi:4
 Amarilis Savón:3
 Edith Bosch:3
 Idalys Ortiz:3
 Teddy Riner:3
 David Douillet:3
 Mark Huizinga:3
 Tadahiro Nomura:3
 Rishod Sobirov:3
 Ryoko Tamura:3
 Mayra Aguiar:2
 ...

There’s no right or wrong way to extract data. Luciano’s method used pure command line tools, and that’s pretty neat. The Python and XPath method is very portable. It helped me significantly in my data collection for research.

Python example code ‘judograbber.py’

WARNING: This is a legacy post made in Python 2.7 - which is no longer supported or even readily available.

# Warning, this is a legacy blog post. This code is written in Python 2.7 which is no longer supported
# Do not try this at home.

import requests
from lxml import html


pageContent=requests.get('https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo')
tree = html.fromstring(pageContent.content)

goldWinners=tree.xpath('//*[@id="mw-content-text"]/table/tr/td[2]/a[1]/text()')
silverWinners=tree.xpath('//*[@id="mw-content-text"]/table/tr/td[3]/a[1]/text()')
#bronzeWinners: we need the cells where there's no rowspan=2 - note the not() in the XPath
bronzeWinners=tree.xpath('//*[@id="mw-content-text"]/table/tr/td[not(@rowspan=2)]/a[1]/text()')
medalWinners=goldWinners+silverWinners+bronzeWinners

medalTotals={}
for name in medalWinners:
    # "in" does the job of the old has_key() method, which Python 3 removed
    if name in medalTotals:
        medalTotals[name]=medalTotals[name]+1
    else:
        medalTotals[name]=1

for result in sorted(
        medalTotals.items(), key=lambda x:x[1],reverse=True):
    # each result is a (name, total) pair
    print('%s:%s' % result)