I decided to write a short post about how I use Python and XPath to extract web content. I do this often to build research data sets. This post was inspired by another blog post: Luciano Mammino - Extracting data from Wikipedia using curl, grep, cut and other shell commands.
Where Luciano uses a bunch of Linux command line tools to extract data from Wikipedia, I thought I’d demonstrate pulling the same data using Python and XPath. Once I discovered using XPath in Python, my online data collection for research became a whole lot easier!
XPath to query parts of an HTML structure
XPath is a way of identifying nodes and content in an XML document structure (including HTML). You can create an XPath query to find specific tables, reference specific rows, or even find cells of a table with certain attributes. It’s a great way to slice up content on a web site.
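To make that concrete, here is a minimal sketch of the three kinds of query mentioned above: picking a table, picking a row, and picking cells by attribute. It runs on a tiny made-up snippet, not a real page, and the class names `results` and `winner` are my own placeholders. I've used the standard library's `xml.etree.ElementTree` here only so it runs with nothing installed; it understands a limited subset of XPath, whereas the rest of this post uses lxml, which supports much more.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; real pages are messier than this.
PAGE = """
<html><body>
  <table class="menu"><tr><td>Home</td></tr></table>
  <table class="results">
    <tr><th>Event</th><th>Winner</th></tr>
    <tr><td>Final A</td><td class="winner">Alice</td></tr>
    <tr><td>Final B</td><td class="winner">Bob</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# A specific table, picked by attribute value.
results = root.find(".//table[@class='results']")

# A specific row: the third <tr> of that table (XPath positions start at 1).
third_row = results.find("tr[3]")
third_row_cells = [td.text for td in third_row.findall("td")]

# Every cell carrying a given attribute value, anywhere in the document.
winners = [td.text for td in root.findall(".//td[@class='winner']")]

print(third_row_cells)
print(winners)
```

The same path expressions should carry over to lxml's `xpath()` method, which is what I use from here on.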
We’ll start with the target URL https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo. We extract the HTML document elements and identify our medalists using the table structure on the page.
Use an IDE!
I highly recommend you do this in a good Integrated Development Environment (IDE). PyCharm is the best I’ve seen for Python development, and there’s a free Community Edition (scroll to the bottom of PyCharm’s page to see it). If you think you’re too hardcore for an IDE, then go for it with a text editor, whatever floats your boat.
In PyCharm I set up the basic URL download, set a breakpoint and then, in debug mode, evaluate expressions until I home in on my target content.
Python to grab HTML content
The first bit of Python code just pulls in the web page as a string, and creates an XML tree out of it, so we can use the data with XPath:
import requests
from lxml import html
pageContent=requests.get(
    'https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo'
)
tree = html.fromstring(pageContent.content)
Using Chrome to identify elements and XPaths
Now we need to know what to extract. An easy way to work out the approximate XPath query is to use the Chrome web browser: right-click an element of interest and choose “Inspect Element”. You’ll get a bunch of data on the side about the element content.
On the HTML element of interest, right-click and select Copy -> Copy XPath. This gives you a query that references that very specific element.
Hot Tip! Code in debug break-mode
Insert a breakpoint and debug your code so you can test the copied XPath query in the ‘evaluate expression window’. You’ll need to play around with your query to make sure you’re getting the results you want.
Once we’re happy that we have the correct data coming out of our XPath query, we can bang the rest out in Python. This example selects Gold, Silver and Bronze medalists, but to simulate Luciano’s results, we’ll combine them all into a single list:
goldWinners=tree.xpath(
    '//*[@id="mw-content-text"]/table/tr/td[2]/a[1]/text()')
silverWinners=tree.xpath(
    '//*[@id="mw-content-text"]/table/tr/td[3]/a[1]/text()')
#bronzeWinner we need rows where there's no rowspan - note XPath
bronzeWinners=tree.xpath(
    '//*[@id="mw-content-text"]/table/tr/td[not(@rowspan=2)]/a[1]/text()')
medalWinners=goldWinners+silverWinners+bronzeWinners
XPath looks a bit messy, but if you work backwards with me, it’s just saying:
- “Get me the text() node of the first[1] anchor <a> element
- in the second[2] <td> of every <tr> of every <table>
- but only within elements that have an id attribute[@] with the value “mw-content-text”.”
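If you want to see those pieces working without hitting Wikipedia, here is a small sketch against a mock table. The layout is my guess at the structure the queries above assume (one event spread over two rows, because judo awards two bronzes), and the names are placeholders, not real medalists. It uses the standard library's `xml.etree.ElementTree`, which has no `text()` or `not()`, so I read `.text` from the element and do the rowspan check in plain Python. In lxml the original queries do all of this in one string.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page layout the queries assume.
PAGE = """
<html><body>
  <div id="mw-content-text">
    <table>
      <tr>
        <td rowspan="2">Games X</td>
        <td rowspan="2"><a>Gold One</a> <a>(AAA)</a></td>
        <td rowspan="2"><a>Silver One</a> <a>(BBB)</a></td>
        <td><a>Bronze One</a> <a>(CCC)</a></td>
      </tr>
      <tr>
        <td><a>Bronze Two</a> <a>(DDD)</a></td>
      </tr>
    </table>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
CELLS = ".//*[@id='mw-content-text']/table/tr/"

# First <a> in the second and third <td> of each row that has one.
gold = [a.text for a in root.findall(CELLS + "td[2]/a[1]")]
silver = [a.text for a in root.findall(CELLS + "td[3]/a[1]")]

# Bronze: cells WITHOUT rowspan="2", which is what not(@rowspan=2) selects.
bronze = []
for td in root.findall(CELLS + "td"):
    if td.get("rowspan") != "2":
        first_link = td.find("a")  # same element as a[1]
        if first_link is not None:
            bronze.append(first_link.text)

print(gold, silver, bronze)
```

Notice that the second row contributes nothing to the gold and silver lists because it only has one cell, which is exactly why the bronze query needs a different trick.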
Post process extracted data
Finally we insert our tested XPath into our code, and the rest is straightforward Python. We can retrieve, manipulate and run calculations on any of the list content. To simulate Luciano’s output, we’ll build a final list with total medal counts:
medalWinners=goldWinners+silverWinners+bronzeWinners
medalTotals={}
for name in medalWinners:
    if medalTotals.has_key(name):
        medalTotals[name]=medalTotals[name]+1
    else:
        medalTotals[name]=1
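As an aside, the counting loop above can probably be replaced by `collections.Counter` from the standard library, which does the same tallying in one call. This is just a sketch in Python 3 with placeholder names (not real results), and `medal_winners` stands in for the combined list built above.

```python
from collections import Counter

# Placeholder names standing in for the combined medalWinners list.
medal_winners = ["Ann", "Bob", "Ann", "Cy", "Ann", "Bob"]

# Counter builds the name -> count mapping that the if/else loop builds by hand.
medal_totals = Counter(medal_winners)

print(medal_totals["Ann"])
print(dict(medal_totals))
```

A `Counter` is a dict subclass, so anything that worked on the hand-built dictionary should keep working on it.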
And we're done. Print the results!
for result in sorted(medalTotals.items(), key=lambda x:x[1],reverse=True):
    print '%s:%s' % result
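One thing to be aware of: the sort above only looks at the medal count, so athletes with the same total come out in whatever order the dictionary happens to yield them. If you want a repeatable listing, one option is to sort on the count and then the name. Here is a Python 3 sketch of that, again with placeholder data; `medal_totals` is a hypothetical stand-in for the dictionary built earlier.

```python
# Placeholder totals standing in for the medalTotals dictionary.
medal_totals = {"Cy": 1, "Bob": 2, "Ann": 3, "Al": 2}

# Highest count first; ties are broken alphabetically by name.
ranked = sorted(medal_totals.items(), key=lambda item: (-item[1], item[0]))

for name, total in ranked:
    print("%s:%s" % (name, total))
```

Negating the count lets a single ascending sort handle both keys, instead of mixing `reverse=True` with an alphabetical tie-break.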
Results
It worked! We get the same results as Luciano using just over a dozen lines of Python code!
$ /usr/local/bin/python2.7 judograbber.py
Driulis González:4
Angelo Parisi:4
Amarilis Savón:3
Edith Bosch:3
Idalys Ortiz:3
Teddy Riner:3
David Douillet:3
Mark Huizinga:3
Tadahiro Nomura:3
Rishod Sobirov:3
Ryoko Tamura:3
Mayra Aguiar:2
...
There’s no right or wrong way to extract data. Luciano’s method used pure command line tools, and that’s pretty neat. The Python and XPath method is very portable. It helped me significantly in my data collection for research.
Python example code ‘judograbber.py’
WARNING: This is a legacy post written for Python 2.7, which is no longer supported or even readily available.
# Warning, this is a legacy blog post. This code is written in Python 2.7 which is no longer supported
# Do not try this at home.
import requests
from lxml import html
pageContent=requests.get('https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo')
tree = html.fromstring(pageContent.content)
goldWinners=tree.xpath('//*[@id="mw-content-text"]/table/tr/td[2]/a[1]/text()')
silverWinners=tree.xpath('//*[@id="mw-content-text"]/table/tr/td[3]/a[1]/text()')
#bronzeWinner we need rows where there's no rowspan - note XPath
bronzeWinners=tree.xpath('//*[@id="mw-content-text"]/table/tr/td[not(@rowspan=2)]/a[1]/text()')
medalWinners=goldWinners+silverWinners+bronzeWinners
medalTotals={}
for name in medalWinners:
    if medalTotals.has_key(name):
        medalTotals[name]=medalTotals[name]+1
    else:
        medalTotals[name]=1
for result in sorted(
        medalTotals.items(), key=lambda x:x[1],reverse=True):
    print '%s:%s' % result
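Since Python 2.7 is gone, here is a rough sketch of how the same script might look in Python 3. Treat it as a starting point, not a tested replacement: the XPath strings are copied unchanged from the listing above, and the page layout may have changed since this post was written, so they may well return nothing today. The function names (`fetch_names`, `count_medals`, `main`) and the 30 second timeout are my own choices. I import `requests` and `lxml` inside `fetch_names` only so the counting helper can be tried without them installed.

```python
from collections import Counter

URL = "https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo"

# Queries copied from the Python 2.7 listing; they may not match the live page.
BASE = '//*[@id="mw-content-text"]/table/tr/'
QUERIES = [
    BASE + "td[2]/a[1]/text()",                 # gold
    BASE + "td[3]/a[1]/text()",                 # silver
    BASE + "td[not(@rowspan=2)]/a[1]/text()",   # bronze
]


def fetch_names(url=URL):
    """Download the page and return every name matched by QUERIES."""
    # Third-party packages, imported lazily (assumed installed to fetch).
    import requests
    from lxml import html

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    tree = html.fromstring(response.content)
    names = []
    for query in QUERIES:
        names.extend(tree.xpath(query))
    return names


def count_medals(names):
    """Return (name, total) pairs, highest total first."""
    return Counter(names).most_common()


def main():
    for name, total in count_medals(fetch_names()):
        print("%s:%s" % (name, total))
```

Calling `main()` runs the whole thing against the live page. If it prints nothing, the first thing I would check is whether the copied XPath still matches the current table structure, using the debugger approach described earlier.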