elements \n",
"stream_date = pipe_date.output # The stream being output from the pipe\n",
"\n",
"sync_iterator = itertools.izip(stream_top, stream_date) # Lazily pair items from the two streams (Python 2's zip would exhaust both streams first)\n",
"for top_item, date_item in itertools.islice(sync_iterator, 3): # itertools.islice will allow us to get only\n",
" article = (top_item['title'], # the first n elements, which is 3 in this case\n",
" date_item['content'], \n",
" top_item['href'])\n",
" print article\n",
" print"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is awesome! Here is the output:\n",
"```python\n",
"(u'\\u0130lelebet payidar', u'30 Ekim 2016', u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/ilelebet-payidar-2-1477851/')\n",
"\n",
"(u'Cumhuriyet, mucizedir', u'29 Ekim 2016', u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/cumhuriyet-mucizedir-1475895/')\n",
"\n",
"(u'Yar\\u0131n bayram...', u'28 Ekim 2016', u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/yarin-bayram-1473877/')\n",
"```\n",
"You can go to the website and see the list elements for yourself. For a website that is mostly auto-generated (disastrously, might I say), this was relatively easy to achieve!\n",
"\n",
"Let's look at one last example: we fetch this list, follow the link in each list item, and extract the full text of each article:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Article list elements stream\n",
"xpath_conf_list = {'xpath': xpath, 'url': url} # The XPath configuration for the article list\n",
"pipe_list = SyncPipe('xpathfetchpage', conf=xpath_conf_list) # A pipe that streams the article list elements\n",
"stream_list = pipe_list.output # The stream being output from the pipe\n",
"\n",
"# The article stream\n",
"xpath_article = '/html/body/div[5]/div[6]/div[3]/div/div[2]/div[1]/div/div[2]/div[2]' # XPath of article body\n",
"\n",
"xpath_conf_article = { # The XPath configuration for the articles\n",
" 'url': {'subkey': 'href'}, # Notice how we can refer to a 'subkey' as the\n",
" 'xpath': xpath_article # URL of this configuration\n",
"}\n",
"pipe_article = pipe_list.xpathfetchpage( # A pipe that streams the articles linked to\n",
" conf=xpath_conf_article # by the list stream\n",
") # Notice how we create this pipe by chaining a\n",
" # pipe on top of the list pipe; how this one is\n",
" # 'dependent' on the list pipe\n",
"stream_article = pipe_article.output # The stream being output from the article pipe\n",
"\n",
"sync_iterator = itertools.izip(stream_list, stream_article) # Lazily pair list items with their articles (Python 2's zip would fetch every article first)\n",
"for list_item, article in itertools.islice(sync_iterator, 3): # itertools.islice will allow us to get only\n",
" # the first n elements, which is 3 in this case\n",
"    p_elements = article['{http://www.w3.org/1999/xhtml}p'] # Get the list of elements under this XPath\n",
"    article_body = [paragraph # Keep only the strings under the <p> elements\n",
"                    for paragraph in p_elements\n",
"                    if isinstance(paragraph, basestring)]\n",
"    article_body = '\\n'.join(article_body) # Join strings to create the whole article\n",
"    article = (list_item['title'], article_body) # Create the article's (title, body) tuple\n",
"    print article\n",
"    print"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a script to fetch the articles and read them easily, without needing to go to the website: \n",
"```python\n",
"(u'\\u0130lelebet payidar', u\"17 Kas\\u0131m 1938.\\n*\\nMaalesef, izdihamdan dalga ...\")\n",
"\n",
"(u'Cumhuriyet, mucizedir', u\"*\\nYanm\\u0131\\u015f bina say\\u0131s\\u0131 115 bin, ...\")\n",
"\n",
"(u'Yar\\u0131n bayram...', u\"An\\u0131tkabir'e gitti\\u011finde seni en \\xe7ok etk ...\")\n",
"```\n",
"The article body looks nicely formatted when you print it. Could this be the most effective ad blocker?\n",
"***\n",
"The power of Riko and its pipes may not be immediately obvious from parsing a single website, but as you explore its other options, you come to appreciate the control it gives you, the developer, over the mess that is HTML and the World Wide Web."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}