{ "metadata": { "name": "Extraindo New Topics from Esplanada-Geral" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": "import BeautifulSoup as bs\nfrom IPython.display import HTML\nimport urllib2\nimport re", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 59 }, { "cell_type": "code", "collapsed": false, "input": "url = 'http://pt.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:Esplanada/geral&action=history'\nheaders = { 'User-Agent' : 'Mozilla/5.0' }\nreq = urllib2.Request(url, None, headers)\nhtml = urllib2.urlopen(req).read()\nsoup = bs.BeautifulSoup(html)\ntopics = soup.findAll('li', text=re.compile(u'\\(novo t\u00f3pico:*'))\ntopics_l = []\nfor topic in topics:\n topics_l.append({})\n t = topic.findParent() \n topics_l[-1]['title'] = t.findAll('a')[1]\n topics_l[-1]['author'] = t.findParent().find('span', attrs={'class': 'history-user'}).a\n topics_l[-1]['date'] = t.findParent().find('a', attrs={'class': 'mw-changeslist-date'})\n\ndef html_new_topics(topics):\n html_list = '

{} novos t\u00f3picos

'\n return html_list\n \nHTML(html_new_topics(topics_l))", "language": "python", "metadata": {}, "outputs": [ { "html": "

35 novos t\u00f3picos

", "output_type": "pyout", "prompt_number": 80, "text": "" } ], "prompt_number": 80 }, { "cell_type": "markdown", "metadata": {}, "source": "Agora vamos tentar extrair o in\u00edcio da p\u00e1gina de cada novo t\u00f3pico. Para isso teremos que ir link por link da lista acima, fazer um novo request com urllib2 e fazer o parsing com BeautifulSoup. Bom, primeiro alguns experimentos:\n" }, { "cell_type": "code", "collapsed": false, "input": "url = topics_l[0]['title']['href']\nheaders = { 'User-Agent' : 'Mozilla/5.0' }\nreq = urllib2.Request(url, None, headers)\nhtml = urllib2.urlopen(req).read()\nsoup = bs.BeautifulSoup(html)", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": "len(cont_div.findAll(text=True))", "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 35, "text": "131" } ], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": "cont_div = soup.find('div', attrs={'id': 'mw-content-text'})\nfor i in cont_div:\n if type(i) == bs.Tag:\n if i.name != 'table' and i.name != 'dl':\n print i\n if i.name == 'dl':\n break", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "

Pelo Crit\u00e9rio de notoriedade sobre elementos de fic\u00e7\u00e3o, para cada obra de fic\u00e7\u00e3o podemos criar listas para personagens (inclusive secund\u00e1rios), lugares, esp\u00e9cies de animais, rob\u00f4s, objetos, conceitos, golpes, etc. O CDN fala para fazer a fus\u00e3o de praticamente qualquer elemento de fic\u00e7\u00e3o. Nesse caso, marca\u00e7\u00f5es de ESR (e PE tamb\u00e9m?) em elementos de fic\u00e7\u00e3o n\u00e3o deviam ser sempre impugnadas e levar para a Central de fus\u00e3o? Ou ao menos a maioria, j\u00e1 q mesmo sem fontes quase sempre tem iw com fontes mostrando q existe, e mesmo se n\u00e3o tiver \u00e9 um elemento de fic\u00e7\u00e3o. Rjclaudio msg 12h00min de 26 de janeiro de 2013 (UTC)

\n" } ], "prompt_number": 45 }, { "cell_type": "code", "collapsed": false, "input": "def html_new_topics(topics, content=False):\n html_list = '

{} novos t\u00f3picos

'\n return html_list", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 81 }, { "cell_type": "code", "collapsed": false, "input": "HTML(html_new_topics(topics_l, content=True))", "language": "python", "metadata": {}, "outputs": [ { "html": "

35 novos t\u00f3picos

", "output_type": "pyout", "prompt_number": 82, "text": "" } ], "prompt_number": 82 }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 70 } ], "metadata": {} } ] }