{ "metadata": { "name": "", "signature": "sha256:c6b4fa22286ccf4491ce700d1871d7bc0c30d5a707ec5f9339bce103129455ed" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# `TextBlob`: otro m\u00f3dulo para tareas de PLN (`NLTK` + `pattern`)\n", "\n", "[textblob](http://textblob.readthedocs.org/) es una librer\u00eda de procesamiento del texto para Python que permite realizar tareas de Procesamiento del Lenguaje Natural como an\u00e1lisis morfol\u00f3gico, extracci\u00f3n de entidades, an\u00e1lisis de opini\u00f3n, traducci\u00f3n autom\u00e1tica, etc. \n", "\n", "Est\u00e1 constru\u00edda sobre otras dos librer\u00edas que ya conoces [NLTK](http://www.nltk.org/) y [pattern](http://www.clips.ua.ac.be/pages/pattern-en) y su principal ventaja es que permite combinar el uso de las dos herramientas anteriores en un interfaz m\u00e1s simple.\n", "\n", "Vamos a apoyarnos en [este tutorial](http://textblob.readthedocs.org/en/dev/quickstart.html) para aprender a utilizar algunas de sus funcionalidades. Lo primero es importar el objeto `TextBlob` que nos permite acceder a todas las herramentas que incluye." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from textblob import TextBlob" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vamos a crear nuestro primer ejemplo de *textblob* a trav\u00e9s del objeto `TextBlob`. Piensa en estos *textblobs* como una especie de cadenas de texto de Python, analaizadas y enriquecidas con algunas caracter\u00edsticas extra. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "texto = '''In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles and San Francisco counties make an important point about the lightly regulated sharing economy. 
The consumers who participate deserve a very clear picture of the risks they're taking'''\n", "t = TextBlob(texto)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "print t.sentences" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[Sentence(\"In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles and San Francisco counties make an important point about the lightly regulated sharing economy.\"), Sentence(\"The consumers who participate deserve a very clear picture of the risks they're taking\")]\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with sentences, words and entities\n", "\n", "We can split our sample text into sentences and words simply by accessing the `.sentences` and `.words` properties. Let's print them: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# print the sentences\n", "for sentence in t.sentences:\n", "    print sentence\n", "    print \"--------------\"\n", " \n", "# and the words (compare with a plain whitespace split)\n", "print t.words\n", "print texto.split()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles and San Francisco counties make an important point about the lightly regulated sharing economy.\n", "--------------\n", "The consumers who participate deserve a very clear picture of the risks they're taking\n", "--------------\n", "['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft', 'the', 'top', 'prosecutors', 'in', 'Los', 'Angeles', 'and', 'San', 'Francisco', 'counties', 'make', 'an', 'important', 'point', 'about', 'the', 'lightly', 'regulated', 'sharing', 
'economy', 'The', 'consumers', 'who', 'participate', 'deserve', 'a', 'very', 'clear', 'picture', 'of', 'the', 'risks', 'they', \"'re\", 'taking']\n", "['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft,', 'the', 'top', 'prosecutors', 'in', 'Los', 'Angeles', 'and', 'San', 'Francisco', 'counties', 'make', 'an', 'important', 'point', 'about', 'the', 'lightly', 'regulated', 'sharing', 'economy.', 'The', 'consumers', 'who', 'participate', 'deserve', 'a', 'very', 'clear', 'picture', 'of', 'the', 'risks', \"they're\", 'taking']\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.noun_phrases` property gives us access to the list of entities (strictly speaking, noun phrases) contained in our *textblob*. Here is how it works." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"the sample text contains\", len(t.noun_phrases), \"entities\"\n", "for element in t.noun_phrases:\n", "    print \"-\", element" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "the sample text contains " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "8 entities\n", "- new lawsuits\n", "- uber\n", "- lyft\n", "- top prosecutors\n", "- los angeles\n", "- san francisco\n", "- important point\n", "- clear picture\n" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# playing with lemmas, singulars and plurals \n", "for word in t.words:\n", "    if word.endswith(\"s\"):\n", "        print word.lemmatize(), word, word.singularize()\n", "    else:\n", "        print word.lemmatize(), word, word.pluralize()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "In In Ins\n", "new new news\n", "lawsuit lawsuits lawsuit\n", "brought brought broughts\n", "against against againsts\n", "the the thes\n", "ride-sharing 
ride-sharing ride-sharings\n", "company companies company\n", "Uber Uber Ubers\n", "and and ands\n", "Lyft Lyft Lyfts\n", "the the thes\n", "top top tops\n", "prosecutor prosecutors prosecutor\n", "in in ins\n", "Los Los Lo\n", "Angeles Angeles Angele\n", "and and ands\n", "San San Sans\n", "Francisco Francisco Franciscoes\n", "county counties county\n", "make make makes\n", "an an some\n", "important important importants\n", "point point points\n", "about about abouts\n", "the the thes\n", "lightly lightly lightlies\n", "regulated regulated regulateds\n", "sharing sharing sharings\n", "economy economy economies\n", "The The Thes\n", "consumer consumers consumer\n", "who who whoes\n", "participate participate participates\n", "deserve deserve deserves\n", "a a some\n", "very very veries\n", "clear clear clears\n", "picture picture pictures\n", "of of ofs\n", "the the thes\n", "risk risks risk\n", "they they they\n", "'re 're 'res\n", "taking taking takings\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# how can we make lemmatization smarter?\n", "for element in t.tags:\n", "    # only lemmatize nouns\n", "    if element[1] == \"NN\":\n", "        print element[0], element[0].lemmatize(), element[0].pluralize() \n", "    elif element[1] == \"NNS\":\n", "        print element[0], element[0].lemmatize(), element[0].singularize() \n", " \n", "    # and verb forms\n", "    if element[1].startswith(\"VB\"):\n", "        print element[0], element[0].lemmatize(\"v\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "lawsuits lawsuit lawsuit\n", "brought bring\n", "companies company company\n", "prosecutors prosecutor prosecutor\n", "counties county county\n", "make make\n", "point point points\n", "regulated regulate\n", "sharing share\n", "economy economy economies\n", "consumers consumer consumer\n", "participate participate\n", "deserve deserve\n", "picture 
picture pictures\n", "risks risk risk\n", "re re res\n", "taking take\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Syntactic parsing\n", "\n", "Although other parsers can be used, by default the `.parse()` method calls the morphosyntactic parser from the `pattern.en` module, which you already know." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# syntactic parsing: does this look familiar from pattern?\n", "print t.parse()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "In/IN/B-PP/B-PNP new/JJ/B-NP/I-PNP lawsuits/NNS/I-NP/I-PNP brought/VBN/B-VP/I-PNP against/IN/B-PP/B-PNP the/DT/B-NP/I-PNP ride-sharing/JJ/I-NP/I-PNP companies/NNS/I-NP/I-PNP Uber/NNP/I-NP/I-PNP and/CC/O/O Lyft/NNP/B-NP/O ,/,/O/O the/DT/B-NP/O top/JJ/I-NP/O prosecutors/NNS/I-NP/O in/IN/B-PP/B-PNP Los/NNP/B-NP/I-PNP Angeles/NNP/I-NP/I-PNP and/CC/I-NP/I-PNP San/NNP/I-NP/I-PNP Francisco/NNP/I-NP/I-PNP counties/NNS/I-NP/I-PNP make/VB/B-VP/O an/DT/B-NP/O important/JJ/I-NP/O point/NN/I-NP/O about/IN/B-PP/O the/DT/O/O lightly/RB/B-VP/O regulated/VBN/I-VP/O sharing/VBG/I-VP/O economy/NN/B-NP/O ././O/O\n", "The/DT/B-NP/O consumers/NNS/I-NP/O who/WP/O/O participate/VB/B-VP/O deserve/VBP/I-VP/O a/DT/B-NP/O very/RB/I-NP/O clear/JJ/I-NP/O picture/NN/I-NP/O of/IN/B-PP/B-PNP the/DT/B-NP/I-PNP risks/NNS/I-NP/I-PNP they/PRP/I-NP/I-PNP '/POS/O/O re/NN/B-NP/O taking/VBG/B-VP/O\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine translation\n", "\n", "\n", "From any text processed with `TextBlob`, we can access a reasonably good machine translator through the `.translate` method. Note how we use it: the target language is mandatory, while the source language can be detected from the input text. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# de chino a ingl\u00e9s y espa\u00f1ol \n", "oracion_zh = u\"\u4e2d\u56fd\u63a2\u6708\u5de5\u7a0b \u4ea6\u7a31\u5ae6\u5a25\u5de5\u7a0b\uff0c\u662f\u4e2d\u56fd\u542f\u52a8\u7684\u7b2c\u4e00\u4e2a\u63a2\u6708\u5de5\u7a0b\uff0c\u4e8e2003\u5e743\u67081\u65e5\u6b63\u5f0f\u542f\u52a8\"\n", "t_zh = TextBlob(oracion_zh)\n", "print t_zh.translate(from_lang=\"zh-CN\", to=\"en\")\n", "print t_zh.translate(from_lang=\"zh-CN\", to=\"es\")\n", "\n", "print \"--------------\"\n", "t_es = TextBlob(u\"La deuda p\u00fablica ha marcado nuevos r\u00e9cords en Espa\u00f1a en el tercer trimestre\")\n", "print t_es.translate(to=\"el\")\n", "print t_es.translate(to=\"ru\")\n", "print t_es.translate(to=\"eu\")\n", "print t_es.translate(to=\"fi\")\n", "print t_es.translate(to=\"fr\")\n", "print t_es.translate(to=\"nl\")\n", "print t_es.translate(to=\"gl\")\n", "print t_es.translate(to=\"ca\")\n", "print t_es.translate(to=\"zh\")\n", "print t_es.translate(to=\"la\")\n", "\n", "# con el slang no funciona tan bien\n", "print \"--------------\"\n", "t_ita = TextBlob(\"Sono andato a Milano e mi sono divertito un bordello.\")\n", "print t_ita.translate(to=\"en\")\n", "print t_ita.translate(to=\"es\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Chinese Lunar Exploration Program , also known as Chang E project is the start of the first Chinese lunar exploration program , on March 1, 2003 officially launched\n", "Programa de Exploraci\u00f3n Lunar chino , tambi\u00e9n conocido como proyecto Chang E es el inicio del primer programa de exploraci\u00f3n lunar chino, el 01 de marzo 2003 lanz\u00f3 oficialmente" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "--------------\n", "\u03a4\u03bf \u03b4\u03b7\u03bc\u03cc\u03c3\u03b9\u03bf \u03c7\u03c1\u03ad\u03bf\u03c2 \u03ad\u03c7\u03b5\u03b9 \u03b8\u03ad\u03c3\u03b5\u03b9 \u03bd\u03ad\u03b1 
\u03c1\u03b5\u03ba\u03cc\u03c1 \u03c3\u03c4\u03b7\u03bd \u0399\u03c3\u03c0\u03b1\u03bd\u03af\u03b1 \u03ba\u03b1\u03c4\u03ac \u03c4\u03bf \u03c4\u03c1\u03af\u03c4\u03bf \u03c4\u03c1\u03af\u03bc\u03b7\u03bd\u03bf" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "\u0413\u043e\u0441\u0443\u0434\u0430\u0440\u0441\u0442\u0432\u0435\u043d\u043d\u044b\u0439 \u0434\u043e\u043b\u0433 \u0443\u0441\u0442\u0430\u043d\u043e\u0432\u0438\u0442\u044c \u043d\u043e\u0432\u044b\u0435 \u0440\u0435\u043a\u043e\u0440\u0434\u044b \u0432 \u0418\u0441\u043f\u0430\u043d\u0438\u0438 \u0432 \u0442\u0440\u0435\u0442\u044c\u0435\u043c \u043a\u0432\u0430\u0440\u0442\u0430\u043b\u0435" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Zor publikoa Espainian erregistro berriak ezarri ditu , hirugarren hiruhilekoan" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Julkinen velka on asettanut uusia enn\u00e4tyksi\u00e4 Espanjassa kolmannella nelj\u00e4nneksell\u00e4" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "La dette publique a \u00e9tabli de nouveaux records en Espagne au troisi\u00e8me trimestre" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "De overheidsschuld is nieuwe records in Spanje in het derde kwartaal" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "A d\u00e9beda p\u00fablica estableceu novos r\u00e9cords en Espa\u00f1a no terceiro trimestre" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "El deute p\u00fablic ha marcat nous r\u00e8cords a Espanya en el tercer trimestre" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "\u516c\u5171\u503a\u52a1\u5df2\u5728\u7b2c\u4e09\u5b63\u5ea6\u8bbe\u5b9a\u5728\u897f\u73ed\u7259\u7684\u65b0\u7eaa\u5f55" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Constituit novam in Hispania gestis publice in tertia quartam" ] }, { "output_type": "stream", "stream": "stdout", 
"text": [ "\n", "--------------\n", "I went to Milan and I enjoyed it a brothel ." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Fui a Mil\u00e1n y me gust\u00f3 mucho un burdel ." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## WordNet\n", "\n", "`textblob`, m\u00e1s concretamente, cualquier objeto de la clase `Word`, nos permite acceder a la informaci\u00f3n de WordNet. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# WordNet\n", "from textblob import Word\n", "from textblob.wordnet import VERB\n", "\n", "# \u00bfcu\u00e1ntos synsets tiene \"car\"\n", "word = Word(\"car\")\n", "print word.synsets\n", "\n", "# dame los synsets de la palabra \"hack\" como verbo\n", "print Word(\"hack\").get_synsets(pos=VERB)\n", "\n", "# imprime la lista de definiciones de \"car\"\n", "print Word(\"car\").definitions\n", "\n", "# recorre la jerarqu\u00eda de hiper\u00f3nimos\n", "for s in word.synsets:\n", " print s.hypernym_paths()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]\n", "[Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]\n", "[u'a motor vehicle with four wheels; usually propelled by an internal combustion engine', u'a wheeled vehicle adapted to the rails of railroad', u'the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant', u'where passengers ride up and down', u'a conveyance for passengers or freight on a cable railway']\n", "[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), 
Synset('container.n.01'), Synset('wheeled_vehicle.n.01'), Synset('self-propelled_vehicle.n.01'), Synset('motor_vehicle.n.01'), Synset('car.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('conveyance.n.03'), Synset('vehicle.n.01'), Synset('wheeled_vehicle.n.01'), Synset('self-propelled_vehicle.n.01'), Synset('motor_vehicle.n.01'), Synset('car.n.01')]]\n", "[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('container.n.01'), Synset('wheeled_vehicle.n.01'), Synset('car.n.02')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('conveyance.n.03'), Synset('vehicle.n.01'), Synset('wheeled_vehicle.n.01'), Synset('car.n.02')]]\n", "[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('car.n.03')]]\n", "[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('car.n.04')]]\n", "[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('cable_car.n.01')]]\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sentiment analysis" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# sentiment analysis\n", "opinion1 = TextBlob(\"This new restaurant is 
great. I had so much fun!! :-P\")\n", "print opinion1.sentiment\n", "\n", "opinion2 = TextBlob(\"Google News to close in Spain.\")\n", "print opinion2.sentiment\n", "\n", "\n", "print opinion1.sentiment.polarity\n", "\n", "if opinion1.sentiment.subjectivity > 0.5:\n", "    print \"Hey, this is an opinion\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Sentiment(polarity=0.5387784090909091, subjectivity=0.6011363636363636)\n", "Sentiment(polarity=0.0, subjectivity=0.0)\n", "0.538778409091\n", "Hey, this is an opinion\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other curiosities" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# spelling correction\n", "b1 = TextBlob(\"I havv goood speling!\")\n", "print b1.correct()\n", "\n", "b2 = TextBlob(\"Mi naem iz Jonh!\")\n", "print b2.correct()\n", "\n", "b3 = TextBlob(\"Boyz dont cri\")\n", "print b3.correct()\n", "\n", "b4 = TextBlob(\"psychological posesion achivemen comitment\")\n", "print b4.correct()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "I have good spelling!\n", "I name in On!" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Boy dont cry" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "psychological position achievement commitment" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## To infinity and beyond\n", "\n", "This brief overview only covers what `TextBlob` offers out of the box. If you need to customize the tools, take a look at [the advanced usage documentation](http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced). " ] } ], "metadata": {} } ] }