{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# XPath\n", "\n", "XPath is short for XML Path Language which is a query language for selecting nodes in an XML document. This is very useful in webscraping because all HTML documents are a form of XML documents." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "from lxml import html" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", "

Favorite Python Librarires

\n", " \n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "\n", " \n", "

Favorite Python Librarires

\n", " \n", " \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load HTML Code\n", "Now I'll read the code from cell number 2 and store it in `html_code`. Finally we will parse that into a lxml node object." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", "

Favorite Python Librarires

\n", " \n", "\n" ] } ], "source": [ "html_code = In[2]\n", "html_code = html_code[42:-2].replace(\"\\\\n\",\"\\n\")\n", "print(html_code)\n", "\n", "doc = html.fromstring(html_code)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using xpath to find nodes in a document\n", "\n", "There many methods for fidning a node that you are interested in from a XML or HTML document. The first way is to write the whole path separated by forward slashes `/`\n", "\n", "## Reading `

` tag" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title = doc.xpath(\"/html/body/h1\")[0]\n", "title" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To read the text inside that tag you can use the text variable." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Favorite Python Librarires'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way is read the text is to use the `text()` function in xpath." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Favorite Python Librarires'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title = doc.xpath(\"/html/body/h1/text()\")[0]\n", "title" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with multiple items\n", "\n", "xpath always returns a list. If there are no matches, it will return an empty list. If there is one match it will return a list with one item." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item_list = doc.xpath(\"/html/body/ul/li\")\n", "item_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `text()` function with multiple items." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Numpy', 'Pandas', 'requests']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc = html.fromstring(html_code)\n", "item_list = doc.xpath(\"/html/body/ul/li/text()\")\n", "item_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tag selector without full path\n", "\n", "you can select any node in your document that matches a node selector without using the full path with a double forward slash `//`" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Numpy', 'Pandas', 'requests']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc = html.fromstring(html_code)\n", "item_list = doc.xpath(\"//li/text()\")\n", "item_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting one result\n", "\n", "You can select one result from a list using `[index]` after your tag selector. Make sure you use it on the tag selector and not a function selector.\n", "\n", "**Notice**: This is `index` starts from 1." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Numpy']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc = html.fromstring(html_code)\n", "item_list = doc.xpath(\"/html/body/ul/li[1]/text()\")\n", "item_list" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", "

Favorite Python Librarires

\n", " \n", "

Favorite JS Librarires

\n", " \n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "\n", " \n", "

Favorite Python Librarires

\n", " \n", "

Favorite JS Librarires

\n", " \n", "" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", "

Favorite Python Librarires

\n", " \n", "

Favorite JS Librarires

\n", " \n", "\n" ] } ], "source": [ "html_code = In[11]\n", "html_code = html_code[42:-2].replace(\"\\\\n\",\"\\n\")\n", "print(html_code)\n", "\n", "doc = html.fromstring(html_code)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Attributes selector\n", "\n", "In this example we have two `

` tags with different css classes. We can select tags based on css classes as follows:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Favorite Python Librarires'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title = doc.xpath(\"/html/body/h1[@class='text-muted']/text()\")[0]\n", "title" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `contains()` function\n", "\n", "I want to select all items in the first list. I could use the full class for selection or I could just use one of the classed only used in the first list with the `contains()` function." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Numpy', 'Pandas', 'requests']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item_list = doc.xpath(\"/html/body/ul[contains(@class,'nav-stacked')]/li/a/text()\")\n", "item_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Returning attributes\n", "\n", "What if we want to read the `href` attribute of the `` tag to get the link. This is how you do that:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['http://www.numpy.org/',\n", " 'http://pandas.pydata.org/',\n", " 'http://python-requests.org/']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item_list = doc.xpath(\"/html/body/ul[contains(@class,'nav-stacked')]/li/a/@href\")\n", "item_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Real world example\n", "\n", "Read the list of languages with 1M+ articles on http://www.wikipedia.org/" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "response = requests.get(\"http://www.wikipedia.org\")\n", "doc = html.fromstring(response.content, parser=html.HTMLParser(encoding=\"utf-8\"))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Deutsch',\n", " 'English',\n", " 'Español',\n", " 'Français',\n", " 'Italiano',\n", " 'Nederlands',\n", " 'Polski',\n", " 'Русский',\n", " 'Sinugboanong Binisaya',\n", " 'Svenska',\n", " 'Tiếng Việt',\n", " 'Winaray']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lang_list = doc.xpath(\"//div[@class='langlist langlist-large hlist'][1]/ul/li/a/text()\")\n", "lang_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }