{"cells":[{"cell_type":"markdown","source":"# Using spaCy","metadata":{"id":"8UtDTjZ4M3hn","cell_id":"f812aa1400404df6afcbe48d6d562620","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"!python -m spacy download es_core_news_sm\n!pip install -U symspellpy\nimport nltk  # Natural Language Toolkit\nnltk.download('punkt')\nnltk.download('stopwords')  # stop-word lists for several languages\nnltk.download('wordnet')\nfrom nltk.corpus import stopwords\nimport pandas as pd\nimport numpy as np\nimport re\nimport string\nimport plotly\nimport matplotlib.pyplot as plt\nfrom nltk.stem import PorterStemmer\nimport time\nimport spacy\nimport es_core_news_sm\nfrom nltk.tokenize import word_tokenize\nfrom nltk.tokenize import sent_tokenize\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction.text import TfidfTransformer\nfrom nltk.probability import FreqDist\nfrom wordcloud import WordCloud\nimport pickle\nimport pkg_resources\nfrom symspellpy import SymSpell, Verbosity","metadata":{"id":"qHIBPJTYOBwx","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"4f24c55d9f29402b9473b5a305ae3760","outputId":"e83f5dd3-ba43-4ab1-b468-b9e3e8e7e9ed","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":21095,"user_tz":240,"timestamp":1653425193234},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\nCollecting es_core_news_sm==2.2.5\n Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.2.5/es_core_news_sm-2.2.5.tar.gz (16.2 MB)\n\u001b[K |████████████████████████████████| 16.2 MB 5.2 MB/s \n\u001b[?25hRequirement already satisfied: spacy>=2.2.2 in /usr/local/lib/python3.7/dist-packages (from 
es_core_news_sm==2.2.5) (2.2.4)\nRequirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (2.0.6)\nRequirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (3.0.6)\nRequirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (1.1.3)\nRequirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (0.9.1)\nRequirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (1.0.0)\nRequirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (1.21.6)\nRequirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (2.23.0)\nRequirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (4.64.0)\nRequirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (1.0.7)\nRequirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (1.0.5)\nRequirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (0.4.1)\nRequirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (7.4.0)\nRequirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_sm==2.2.5) (57.4.0)\nRequirement already satisfied: importlib-metadata>=0.20 in 
/usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->es_core_news_sm==2.2.5) (4.11.3)\nRequirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->es_core_news_sm==2.2.5) (3.8.0)\nRequirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->es_core_news_sm==2.2.5) (4.2.0)\nRequirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->es_core_news_sm==2.2.5) (1.24.3)\nRequirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->es_core_news_sm==2.2.5) (3.0.4)\nRequirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->es_core_news_sm==2.2.5) (2.10)\nRequirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->es_core_news_sm==2.2.5) (2022.5.18.1)\nBuilding wheels for collected packages: es-core-news-sm\n Building wheel for es-core-news-sm (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n Created wheel for es-core-news-sm: filename=es_core_news_sm-2.2.5-py3-none-any.whl size=16172933 sha256=917e7d20c5f7f523b6fd6345df341e0f78a96c11fde073a00548603124eab74b\n Stored in directory: /tmp/pip-ephem-wheel-cache-l6s3j56i/wheels/21/8d/a9/6c1a2809c55dd22cd9644ae503a52ba6206b04aa57ba83a3d8\nSuccessfully built es-core-news-sm\nInstalling collected packages: es-core-news-sm\nSuccessfully installed es-core-news-sm-2.2.5\n\u001b[38;5;2m✔ Download and installation successful\u001b[0m\nYou can now load the model via spacy.load('es_core_news_sm')\nLooking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\nCollecting symspellpy\n Downloading symspellpy-6.7.6-py3-none-any.whl (2.6 MB)\n\u001b[K |████████████████████████████████| 2.6 MB 5.3 MB/s \n\u001b[?25hCollecting editdistpy>=0.1.3\n Downloading editdistpy-0.1.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (125 kB)\n\u001b[K |████████████████████████████████| 125 kB 52.2 MB/s \n\u001b[?25hInstalling collected packages: editdistpy, symspellpy\nSuccessfully installed editdistpy-0.1.3 symspellpy-6.7.6\n[nltk_data] Downloading package punkt to /root/nltk_data...\n[nltk_data] Unzipping tokenizers/punkt.zip.\n[nltk_data] Downloading package stopwords to /root/nltk_data...\n[nltk_data] Unzipping corpora/stopwords.zip.\n[nltk_data] Downloading package wordnet to /root/nltk_data...\n[nltk_data] Unzipping corpora/wordnet.zip.\n"}],"execution_count":6},{"cell_type":"code","source":"!python -m spacy download es_core_news_md","metadata":{"id":"Pd-DOmGVOF0t","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"130b2dcffb22496984b256b90f7f671f","outputId":"a8e321c8-9935-45f0-8186-ffdd23d5a00f","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos 
Usta"},"status":"ok","elapsed":12478,"user_tz":240,"timestamp":1653230596954},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Collecting es_core_news_md==2.2.5\n Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-2.2.5/es_core_news_md-2.2.5.tar.gz (78.4 MB)\n\u001b[K |████████████████████████████████| 78.4 MB 1.1 MB/s \n\u001b[?25hRequirement already satisfied: spacy>=2.2.2 in /usr/local/lib/python3.7/dist-packages (from es_core_news_md==2.2.5) (2.2.4)\nRequirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (0.4.1)\nRequirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (4.64.0)\nRequirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (1.0.7)\nRequirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (57.4.0)\nRequirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (3.0.6)\nRequirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (7.4.0)\nRequirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (1.21.6)\nRequirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (2.23.0)\nRequirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (0.9.1)\nRequirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) 
(1.0.5)\nRequirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (1.1.3)\nRequirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (2.0.6)\nRequirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->es_core_news_md==2.2.5) (1.0.0)\nRequirement already satisfied: importlib-metadata>=0.20 in /usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->es_core_news_md==2.2.5) (4.11.3)\nRequirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->es_core_news_md==2.2.5) (3.8.0)\nRequirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->es_core_news_md==2.2.5) (4.2.0)\nRequirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->es_core_news_md==2.2.5) (2021.10.8)\nRequirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->es_core_news_md==2.2.5) (1.24.3)\nRequirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->es_core_news_md==2.2.5) (3.0.4)\nRequirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->es_core_news_md==2.2.5) (2.10)\n\u001b[38;5;2m✔ Download and installation successful\u001b[0m\nYou can now load the model via spacy.load('es_core_news_md')\n"}],"execution_count":null},{"cell_type":"code","source":"import es_core_news_md\nnlp = 
es_core_news_md.load()","metadata":{"id":"vsxa_DTEMtNC","cell_id":"23d2d20f6f154d72aa5e1533fdcc3ce8","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"# Reading a string","metadata":{"id":"2uB1-2VbOeRt","cell_id":"7479163554e64be5bbb99dd13c66b8ed","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"texto = ('Este es un tutorial acerca de Procesamiento de lenguaje usando Python con spaCy')\ndoc = nlp(texto)\n# Tokenize the text\nprint([token.text for token in doc])","metadata":{"id":"ht5U8hbEM6pA","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"d690cc6ee3c74a9fbb5fd3742f1ec177","outputId":"0e31fd03-3c5f-4fa0-8f08-20ce28ca5783","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":587,"user_tz":240,"timestamp":1653230907520},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"['Este', 'es', 'un', 'tutorial', 'acerca', 'de', 'Procesamiento', 'de', 'lenguaje', 'usando', 'Python', 'con', 'spaCy']\n"}],"execution_count":null},{"cell_type":"markdown","source":"# Sentence detection","metadata":{"id":"eB-FMmvvOh90","cell_id":"b7abf56c76714d4ba61c87523d0862d8","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"text=('Gus es un desarrollador en Python actualmente trabajando para una compañia Fintech en Londres Inglaterra. 
Se encuentra interesado en aprender NLP.')\nt = nlp(text)\noraciones = list(t.sents)\nprint(len(oraciones))\nfor x in oraciones:\n    print(x)","metadata":{"id":"teHetL6wOmGc","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"5a87d51bfdd6417687ec454fdecbd186","outputId":"32f4d97d-6355-4adc-8433-721706c7b3e0","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":416,"user_tz":240,"timestamp":1653230960361},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"2\nGus es un desarrollador en Python actualmente trabajando para una compañia Fintech en Londres Inglaterra.\nSe encuentra interesado en aprender NLP.\n"}],"execution_count":null},{"cell_type":"markdown","source":"# Tokenization","metadata":{"id":"RrNKtx8EPrr1","cell_id":"727263397ef84421a4722da4ad36ea8c","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"for token in t:\n    print(token, token.idx)","metadata":{"id":"RROu30nCPs11","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"e1ad0b4edc734566b9a6b20f88fce279","outputId":"4773ec60-bee9-459f-86be-b230e7965ea5","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":415,"user_tz":240,"timestamp":1653231007774},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Gus 0\nes 4\nun 7\ndesarrollador 10\nen 24\nPython 27\nactualmente 34\ntrabajando 46\npara 57\nuna 62\ncompañia 66\nFintech 75\nen 83\nLondres 86\nInglaterra 94\n. 104\nSe 106\nencuentra 109\ninteresado 119\nen 130\naprender 133\nNLP 142\n. 
145\n"}],"execution_count":null},{"cell_type":"code","source":"for token in t:\n print(token, token.idx, token.text_with_ws,\n token.is_alpha, token.is_punct, token.is_space,\n token.shape_, token.is_stop)","metadata":{"id":"Rhv5on_BP9Jt","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"a9c420ace2cc4b3f90253afa50ab9128","outputId":"5e73a8b8-c0a6-44b5-89dc-5432c9db9071","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":404,"user_tz":240,"timestamp":1653231066952},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Gus 0 Gus True False False Xxx False\nes 4 es True False False xx True\nun 7 un True False False xx True\ndesarrollador 10 desarrollador True False False xxxx False\nen 24 en True False False xx True\nPython 27 Python True False False Xxxxx False\nactualmente 34 actualmente True False False xxxx True\ntrabajando 46 trabajando True False False xxxx False\npara 57 para True False False xxxx True\nuna 62 una True False False xxx True\ncompañia 66 compañia True False False xxxx False\nFintech 75 Fintech True False False Xxxxx False\nen 83 en True False False xx True\nLondres 86 Londres True False False Xxxxx False\nInglaterra 94 Inglaterra True False False Xxxxx False\n. 104 . False True False . False\nSe 106 Se True False False Xx True\nencuentra 109 encuentra True False False xxxx True\ninteresado 119 interesado True False False xxxx False\nen 130 en True False False xx True\naprender 133 aprender True False False xxxx False\nNLP 142 NLP True False False XXX False\n. 145 . False True False . 
False\n"}],"execution_count":null},{"cell_type":"markdown","source":"In this example:\n\n- `text_with_ws` is the token text plus any trailing whitespace.\n- `is_alpha` tells whether the token consists only of alphabetic characters.\n- `is_punct` tells whether the token is a punctuation mark.\n- `is_space` tells whether the token consists only of whitespace.\n- `shape_` is the orthographic shape of the token (for example, `Xxx` for `Gus`).\n- `is_stop` tells whether the token is a stop word.","metadata":{"id":"v6FvUx-UQFoz","cell_id":"042720b31c574d13a3539ffcefc017b0","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Stopwords","metadata":{"id":"Lq1Cev75Qjy5","cell_id":"f45428c7ff7e4b66b7c110958ae403cd","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"import spacy\nfrom spacy.lang.es.stop_words import STOP_WORDS  # explicit import of the Spanish stop-word set\nspacy_stopwords = STOP_WORDS\nprint(len(spacy_stopwords))\nfor stop_word in list(spacy_stopwords)[:10]:\n    print(stop_word)","metadata":{"id":"_h2N2XmmQmaC","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"bd1bcf583a5446a3835b8b8a454fb372","outputId":"329adf96-1b13-4b92-8c1e-93b3d6df645d","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":432,"user_tz":240,"timestamp":1653231241658},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"551\nsea\nsiguiente\ninformo\ntampoco\ndan\nestuvo\nenseguida\nhabia\nninguno\nocho\n"}],"execution_count":null},{"cell_type":"code","source":"for token in t:\n    if not token.is_stop:\n        print(token)","metadata":{"id":"PF9yA7o7Qu-5","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"a32eb3981fb2483a82695ae4ffaf803e","outputId":"296aa79f-e135-4fd6-e33f-ea5e2b5cea2d","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos 
Usta"},"status":"ok","elapsed":516,"user_tz":240,"timestamp":1653231266950},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Gus\ndesarrollador\nPython\ntrabajando\ncompañia\nFintech\nLondres\nInglaterra\n.\ninteresado\naprender\nNLP\n.\n"}],"execution_count":null},{"cell_type":"code","source":"# Build a list of the tokens that are not stop words\ndocumento_sin_stopword = [token for token in t if not token.is_stop]\nprint(documento_sin_stopword)","metadata":{"id":"XZqDOi_qQ1td","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"a3ef3deb5dea4f2e95b6d260184220cc","outputId":"2e6b41a6-6e11-43a9-ab8b-2c7b670bdf3e","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":403,"user_tz":240,"timestamp":1653231317027},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"[Gus, desarrollador, Python, trabajando, compañia, Fintech, Londres, Inglaterra, ., interesado, aprender, NLP, .]\n"}],"execution_count":null},{"cell_type":"markdown","source":"# Lemmatization","metadata":{"id":"27Owz6clRAe2","cell_id":"253eae6ddd004a6c8a174b286075193f","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"for token in t:\n    print(token, '-', token.lemma_)","metadata":{"id":"siLYeWXjRDl6","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"71a15384d0e742dcbbb651be6f28ad8e","outputId":"c6af8fa3-0409-402b-95a1-c5e256ea694b","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":619,"user_tz":240,"timestamp":1653231365055},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Gus - Gus\nes - ser\nun - uno\ndesarrollador - desarrollador\nen - en\nPython - Python\nactualmente - actualmente\ntrabajando - trabajar\npara - parir\nuna - uno\ncompañia - compañia\nFintech - Fintech\nen - en\nLondres - Londres\nInglaterra - Inglaterra\n. 
- .\nSe - Se\nencuentra - encontrar\ninteresado - interesar\nen - en\naprender - aprender\nNLP - NLP\n. - .\n"}],"execution_count":null},{"cell_type":"markdown","source":"# Word Frequency","metadata":{"id":"WPyUzf66RL-V","cell_id":"aa286db56dbc4a09a089973a71be8feb","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"texto= '''\nLa FIFA responde así a una denuncia interpuesta por la Federación de Chile ante esa Comisión Disciplinaria, en la que presentaba alegaciones sobre la posible falsificación de los documentos que conceden la nacionalidad ecuatoriana Byron Castillo.\n\nLa selección de Ecuador se clasificó de forma directa para el Mundial, junto con las de Brasil, Argentina y Uruguay, al contrario que las de Chile y Perú. El combinado peruano, que terminó quinto por detrás del ecuatoriano, disputará una repesca.\n\nEl defensa fue alineado por el seleccionador ecuatoriano Gustavo Alfaro para los dos partidos contra Paraguay y Chile y en una ocasión ante Uruguay, Bolivia, Venezuela y Argentina, partidos clave para que el equipo lograse uno de los cupos directos para el Mundial.\n\n\"Innumerables pruebas de que nació en Colombia\"\nLa Federación de Chile denunció el pasado día 5 que hay \"innumerables pruebas de que el jugador nació en Colombia\".\n\n\"Las investigaciones realizadas en Ecuador, entre ellas, un informe jurídico de la Dirección Nacional de Registro Civil, declararon la existencia de inconsistencias en el certificado de nacimiento presentado por el jugador\", afirmó este organismo, que acusó a la Federación Ecuatoriana de tener \"total conocimiento\" de las irregularidades.\n\nUna posible sanción de la FIFA podría implicar la resta de puntos a Ecuador por los partidos que Castillo jugó, lo que alteraría la nómina de clasificados.\n\nUn informe técnico jurídico de la dirección nacional del registro civil de Ecuador afirma que la inscripción de nacimiento de Byron Castillo en la ciudad ecuatoriana de Guayas no consta en el tomo, la página y 
el acta solicitado, según un documento oficial.\n'''","metadata":{"id":"a_P6U9n1RNUo","cell_id":"f324339c05264984a0ef0c7e07d866e4","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"type(texto)","metadata":{"id":"OVFSTmTCSbUI","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"02432d1e992341a19bb81acbfc9cd47c","outputId":"d57fe1df-0380-4867-96cd-c117691fe361","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":3,"user_tz":240,"timestamp":1653231750249},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"str"},"metadata":{},"execution_count":49}],"execution_count":null},{"cell_type":"code","source":"import re\ntexto1 = re.sub('\\n', '', texto)  # remove line breaks; paragraphs are joined with no space between them\nprint(type(texto))\nstr(texto1)","metadata":{"id":"YJEqi4TuSNHZ","colab":{"height":123,"base_uri":"https://localhost:8080/"},"cell_id":"a72b0b608d7a4393930f95dc1ac1ea55","outputId":"a2e6d4b3-f4e7-4dbc-b305-aeb48c854f25","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":389,"user_tz":240,"timestamp":1653231764290},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"<class 'str'>\n"},{"output_type":"execute_result","data":{"text/plain":"'La FIFA responde así a una denuncia interpuesta por la Federación de Chile ante esa Comisión Disciplinaria, en la que presentaba alegaciones sobre la posible falsificación de los documentos que conceden la nacionalidad ecuatoriana Byron Castillo.La selección de Ecuador se clasificó de forma directa para el Mundial, junto con las de Brasil, Argentina y Uruguay, al contrario que las de Chile y Perú. 
El combinado peruano, que terminó quinto por detrás del ecuatoriano, disputará una repesca.El defensa fue alineado por el seleccionador ecuatoriano Gustavo Alfaro para los dos partidos contra Paraguay y Chile y en una ocasión ante Uruguay, Bolivia, Venezuela y Argentina, partidos clave para que el equipo lograse uno de los cupos directos para el Mundial.\"Innumerables pruebas de que nació en Colombia\"La Federación de Chile denunció el pasado día 5 que hay \"innumerables pruebas de que el jugador nació en Colombia\".\"Las investigaciones realizadas en Ecuador, entre ellas, un informe jurídico de la Dirección Nacional de Registro Civil, declararon la existencia de inconsistencias en el certificado de nacimiento presentado por el jugador\", afirmó este organismo, que acusó a la Federación Ecuatoriana de tener \"total conocimiento\" de las irregularidades.Una posible sanción de la FIFA podría implicar la resta de puntos a Ecuador por los partidos que Castillo jugó, lo que alteraría la nómina de clasificados.Un informe técnico jurídico de la dirección nacional del registro civil de Ecuador afirma que la inscripción de nacimiento de Byron Castillo en la ciudad ecuatoriana de Guayas no consta en el tomo, la página y el acta solicitado, según un documento oficial.'","application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":51}],"execution_count":null},{"cell_type":"code","source":"from collections import Counter\n\ndoc = nlp(texto1)\n# Keep the tokens that are neither stop words nor punctuation\nwords = [token.text for token in doc if not token.is_stop and not token.is_punct]\nword_freq = Counter(words)\n# Get the 5 most frequent words with their counts\ncommon_words = word_freq.most_common(5)\nprint(common_words)\n# Words that occur exactly once\nunique_words = [word for (word, freq) in word_freq.items() if freq == 
1]\nprint(unique_words)","metadata":{"id":"BaFYahOoRd6h","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"920c69b3acfe4770a74135abcfca471e","outputId":"7b82df60-ad71-4ef2-f8ad-a8d2856c8bdf","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":409,"user_tz":240,"timestamp":1653231850729},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"[('y', 6), ('Chile', 4), ('Ecuador', 4), ('a', 3), ('Federación', 3)]\n['responde', 'denuncia', 'interpuesta', 'Comisión', 'Disciplinaria', 'presentaba', 'alegaciones', 'falsificación', 'documentos', 'conceden', 'nacionalidad', 'selección', 'clasificó', 'forma', 'directa', 'Brasil', 'contrario', 'Perú', 'combinado', 'peruano', 'terminó', 'quinto', 'disputará', 'repesca', 'defensa', 'alineado', 'seleccionador', 'Gustavo', 'Alfaro', 'Paraguay', 'ocasión', 'Bolivia', 'Venezuela', 'clave', 'equipo', 'lograse', 'cupos', 'directos', '\"Innumerables', 'Colombia\"La', 'denunció', '5', 'innumerables', 'Colombia\"', '\"Las', 'investigaciones', 'realizadas', 'Dirección', 'Nacional', 'Registro', 'Civil', 'declararon', 'existencia', 'inconsistencias', 'certificado', 'presentado', 'organismo', 'acusó', 'Ecuatoriana', 'conocimiento', 'irregularidades', 'sanción', 'implicar', 'resta', 'puntos', 'jugó', 'alteraría', 'nómina', 'clasificados', 'técnico', 'dirección', 'nacional', 'registro', 'civil', 'afirma', 'inscripción', 'ciudad', 'Guayas', 'consta', 'tomo', 'página', 'acta', 'solicitado', 'documento', 'oficial']\n"}],"execution_count":null},{"cell_type":"markdown","source":"# POS Tagging","metadata":{"id":"qzhKZ2n_TEv4","cell_id":"8e43790071cc404cae6d2d56eebad3b2","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"for token in doc:\n print(token,' -', token.tag_, ' -', token.pos_,' -' 
,spacy.explain(token.tag_))","metadata":{"id":"KEF70sIAS-1d","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"5c90b241073840da952f4eb39bddd272","outputId":"a023495b-4956-47f3-e997-77899a3ecedd","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":383,"user_tz":240,"timestamp":1653231956391},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"La - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nFIFA - PROPN___ - PROPN - None\nresponde - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin - VERB - None\nasí - ADV___ - ADV - None\na - ADP__AdpType=Prep - ADP - None\nuna - DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art - DET - None\ndenuncia - NOUN__Gender=Fem|Number=Sing - NOUN - None\ninterpuesta - ADJ__Gender=Fem|Number=Sing|VerbForm=Part - ADJ - None\npor - ADP__AdpType=Prep - ADP - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nFederación - PROPN___ - PROPN - None\nde - ADP__AdpType=Prep - ADP - None\nChile - PROPN___ - PROPN - None\nante - ADP__AdpType=Prep - ADP - None\nesa - DET__Gender=Fem|Number=Sing|PronType=Dem - DET - None\nComisión - PROPN___ - PROPN - None\nDisciplinaria - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nen - ADP__AdpType=Prep - ADP - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nque - PRON__PronType=Rel - PRON - None\npresentaba - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin - VERB - None\nalegaciones - NOUN__Gender=Fem|Number=Plur - NOUN - None\nsobre - ADP__AdpType=Prep - ADP - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nposible - ADJ__Number=Sing - ADJ - None\nfalsificación - NOUN__Gender=Fem|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nlos - DET__Definite=Def|Gender=Masc|Number=Plur|PronType=Art - DET - None\ndocumentos - 
NOUN__Gender=Masc|Number=Plur - NOUN - None\nque - PRON__PronType=Rel - PRON - None\nconceden - VERB__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin - VERB - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nnacionalidad - NOUN__Gender=Fem|Number=Sing - NOUN - None\necuatoriana - ADJ__Gender=Fem|Number=Sing - ADJ - None\nByron - PROPN___ - PROPN - None\nCastillo - PROPN___ - PROPN - None\n. - PUNCT__PunctType=Peri - PUNCT - None\nLa - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nselección - NOUN__Gender=Fem|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nEcuador - PROPN___ - PROPN - None\nse - PRON__Person=3 - PRON - None\nclasificó - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - VERB - None\nde - ADP__AdpType=Prep - ADP - None\nforma - NOUN__Gender=Fem|Number=Sing - NOUN - None\ndirecta - ADJ__Gender=Fem|Number=Sing - ADJ - None\npara - ADP__AdpType=Prep - ADP - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\nMundial - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\njunto - ADJ__AdpType=Prep - ADJ - None\ncon - ADP__AdpType=Prep - ADP - None\nlas - DET__Definite=Def|Gender=Fem|Number=Plur|PronType=Art - DET - None\nde - ADP__AdpType=Prep - ADP - None\nBrasil - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nArgentina - PROPN___ - PROPN - None\ny - CCONJ___ - CONJ - None\nUruguay - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nal - ADP__AdpType=Preppron|Gender=Masc|Number=Sing - ADP - None\ncontrario - NOUN___ - NOUN - None\nque - SCONJ___ - SCONJ - None\nlas - DET__Definite=Def|Gender=Fem|Number=Plur|PronType=Art - DET - None\nde - ADP__AdpType=Prep - ADP - None\nChile - PROPN___ - PROPN - None\ny - CCONJ___ - CONJ - None\nPerú - PROPN___ - PROPN - None\n. 
- PUNCT__PunctType=Peri - PUNCT - None\nEl - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\ncombinado - NOUN__Gender=Masc|Number=Sing - NOUN - None\nperuano - ADJ__Gender=Masc|Number=Sing - ADJ - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nque - PRON__PronType=Rel - PRON - None\nterminó - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - VERB - None\nquinto - ADJ__Gender=Masc|Number=Sing|NumType=Ord - ADJ - None\npor - ADP__AdpType=Prep - ADP - None\ndetrás - ADV__AdpType=Preppron|Gender=Masc|Number=Sing - ADV - None\ndel - ADP__AdpType=Preppron|Gender=Masc|Number=Sing - ADP - None\necuatoriano - NOUN__Gender=Masc|Number=Sing - NOUN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\ndisputará - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin - VERB - None\nuna - DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art - DET - None\nrepesca - NOUN__Gender=Fem|Number=Sing - NOUN - None\n. - PUNCT__PunctType=Peri - PUNCT - None\nEl - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\ndefensa - NOUN__Number=Sing - NOUN - None\nfue - AUX__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - AUX - None\nalineado - VERB__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part - VERB - None\npor - ADP__AdpType=Prep - ADP - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\nseleccionador - NOUN__Gender=Masc|Number=Sing - NOUN - None\necuatoriano - ADJ__Gender=Masc|Number=Sing - ADJ - None\nGustavo - PROPN___ - PROPN - None\nAlfaro - PROPN___ - PROPN - None\npara - ADP__AdpType=Prep - ADP - None\nlos - DET__Definite=Def|Gender=Masc|Number=Plur|PronType=Art - DET - None\ndos - NUM__Number=Plur|NumType=Card - NUM - None\npartidos - NOUN__Gender=Masc|Number=Plur - NOUN - None\ncontra - ADP__AdpType=Prep - ADP - None\nParaguay - PROPN___ - PROPN - None\ny - CCONJ___ - CONJ - None\nChile - PROPN___ - PROPN - None\ny - CCONJ___ - CONJ - None\nen - ADP__AdpType=Prep - ADP - None\nuna - 
DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art - DET - None\nocasión - NOUN__Gender=Fem|Number=Sing - NOUN - None\nante - ADP__AdpType=Prep - ADP - None\nUruguay - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nBolivia - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nVenezuela - PROPN___ - PROPN - None\ny - CCONJ___ - CONJ - None\nArgentina - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\npartidos - NOUN__Gender=Masc|Number=Plur - NOUN - None\nclave - NOUN__Gender=Fem|Number=Sing - NOUN - None\npara - ADP__AdpType=Prep - ADP - None\nque - SCONJ___ - SCONJ - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\nequipo - NOUN__Gender=Masc|Number=Sing - NOUN - None\nlograse - VERB__Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin - VERB - None\nuno - PRON__Gender=Masc|Number=Sing|PronType=Ind - PRON - None\nde - ADP__AdpType=Prep - ADP - None\nlos - DET__Definite=Def|Gender=Masc|Number=Plur|PronType=Art - DET - None\ncupos - NOUN__Gender=Masc|Number=Plur - NOUN - None\ndirectos - ADJ__Gender=Masc|Number=Plur - ADJ - None\npara - ADP__AdpType=Prep - ADP - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\nMundial - PROPN___ - PROPN - None\n. 
- PUNCT__PunctType=Peri - PUNCT - None\n\"Innumerables - ADJ__Number=Plur - ADJ - None\npruebas - NOUN__Gender=Fem|Number=Plur - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nque - SCONJ___ - SCONJ - None\nnació - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - VERB - None\nen - ADP__AdpType=Prep - ADP - None\nColombia\"La - PROPN___ - PROPN - None\nFederación - PROPN___ - PROPN - None\nde - ADP__AdpType=Prep - ADP - None\nChile - PROPN___ - PROPN - None\ndenunció - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - VERB - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\npasado - ADJ__Gender=Masc|Number=Sing|VerbForm=Part - ADJ - None\ndía - NOUN___ - NOUN - None\n5 - NUM__NumForm=Digit|NumType=Card - NUM - None\nque - PRON__PronType=Rel - PRON - None\nhay - AUX__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin - AUX - None\n\" - PUNCT__PunctType=Quot - PUNCT - None\ninnumerables - ADJ__Number=Plur - ADJ - None\npruebas - NOUN__Gender=Fem|Number=Plur - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nque - SCONJ___ - SCONJ - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\njugador - NOUN__Gender=Masc|Number=Sing - NOUN - None\nnació - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - VERB - None\nen - ADP__AdpType=Prep - ADP - None\nColombia\" - PROPN___ - PROPN - None\n. 
- PUNCT__PunctType=Peri - PUNCT - None\n\"Las - NUM__NumForm=Digit - NUM - None\ninvestigaciones - NOUN__Gender=Fem|Number=Plur - NOUN - None\nrealizadas - ADJ__Gender=Fem|Number=Plur|VerbForm=Part - ADJ - None\nen - ADP__AdpType=Prep - ADP - None\nEcuador - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nentre - ADP__AdpType=Prep - ADP - None\nellas - PRON__Gender=Fem|Number=Plur|Person=3|PronType=Prs - PRON - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nun - DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Art - DET - None\ninforme - NOUN__Gender=Masc|Number=Sing - NOUN - None\njurídico - ADJ__Gender=Masc|Number=Sing - ADJ - None\nde - ADP__AdpType=Prep - ADP - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nDirección - PROPN___ - PROPN - None\nNacional - PROPN___ - PROPN - None\nde - ADP__AdpType=Prep - ADP - None\nRegistro - PROPN___ - PROPN - None\nCivil - PROPN___ - PROPN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\ndeclararon - VERB__Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin - VERB - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nexistencia - NOUN__Gender=Fem|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\ninconsistencias - NOUN__Gender=Fem|Number=Plur - NOUN - None\nen - ADP__AdpType=Prep - ADP - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\ncertificado - NOUN__Gender=Masc|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nnacimiento - NOUN__Gender=Masc|Number=Sing - NOUN - None\npresentado - ADJ__Gender=Masc|Number=Sing|VerbForm=Part - ADJ - None\npor - ADP__AdpType=Prep - ADP - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\njugador - NOUN__Gender=Masc|Number=Sing - NOUN - None\n\" - PUNCT__PunctType=Quot - PUNCT - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nafirmó - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - VERB - None\neste - 
DET__Gender=Masc|Number=Sing|PronType=Dem - DET - None\norganismo - NOUN__Gender=Masc|Number=Sing - NOUN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nque - PRON__PronType=Rel - PRON - None\nacusó - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - VERB - None\na - ADP__AdpType=Prep - ADP - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nFederación - PROPN___ - PROPN - None\nEcuatoriana - PROPN___ - PROPN - None\nde - ADP__AdpType=Prep - ADP - None\ntener - VERB__VerbForm=Inf - VERB - None\n\" - PUNCT__PunctType=Quot - PUNCT - None\ntotal - ADJ__Number=Sing - ADJ - None\nconocimiento - NOUN__Gender=Masc|Number=Sing - NOUN - None\n\" - PUNCT__PunctType=Quot - PUNCT - None\nde - ADP__AdpType=Prep - ADP - None\nlas - DET__Definite=Def|Gender=Fem|Number=Plur|PronType=Art - DET - None\nirregularidades - NOUN__Gender=Fem|Number=Plur - NOUN - None\n. - PUNCT__PunctType=Peri - PUNCT - None\nUna - DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art - DET - None\nposible - ADJ__Number=Sing - ADJ - None\nsanción - NOUN__Gender=Fem|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nFIFA - PROPN___ - PROPN - None\npodría - AUX__Mood=Cnd|Number=Sing|Person=3|VerbForm=Fin - AUX - None\nimplicar - VERB__VerbForm=Inf - VERB - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nresta - NOUN__Gender=Fem|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\npuntos - NOUN__Gender=Masc|Number=Plur - NOUN - None\na - ADP__AdpType=Prep - ADP - None\nEcuador - PROPN___ - PROPN - None\npor - ADP__AdpType=Prep - ADP - None\nlos - DET__Definite=Def|Gender=Masc|Number=Plur|PronType=Art - DET - None\npartidos - NOUN__Gender=Masc|Number=Plur - NOUN - None\nque - PRON__PronType=Rel - PRON - None\nCastillo - PROPN___ - PROPN - None\njugó - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin - VERB - None\n, - 
PUNCT__PunctType=Comm - PUNCT - None\nlo - DET__Definite=Def|Number=Sing|PronType=Art - DET - None\nque - PRON__PronType=Rel - PRON - None\nalteraría - VERB__Mood=Cnd|Number=Sing|Person=3|VerbForm=Fin - VERB - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nnómina - NOUN__Gender=Fem|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nclasificados - NOUN__Gender=Masc|Number=Plur - NOUN - None\n. - PUNCT__PunctType=Peri - PUNCT - None\nUn - DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Art - DET - None\ninforme - NOUN__Gender=Masc|Number=Sing - NOUN - None\ntécnico - ADJ__Gender=Masc|Number=Sing - ADJ - None\njurídico - ADJ__Gender=Masc|Number=Sing - ADJ - None\nde - ADP__AdpType=Prep - ADP - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\ndirección - PROPN___ - PROPN - None\nnacional - PROPN___ - PROPN - None\ndel - ADP__AdpType=Preppron|Gender=Masc|Number=Sing - ADP - None\nregistro - PROPN___ - PROPN - None\ncivil - ADJ__Number=Sing - ADJ - None\nde - ADP__AdpType=Prep - ADP - None\nEcuador - PROPN___ - PROPN - None\nafirma - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin - VERB - None\nque - SCONJ___ - SCONJ - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\ninscripción - NOUN__Gender=Fem|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nnacimiento - NOUN__Gender=Masc|Number=Sing - NOUN - None\nde - ADP__AdpType=Prep - ADP - None\nByron - PROPN___ - PROPN - None\nCastillo - PROPN___ - PROPN - None\nen - ADP__AdpType=Prep - ADP - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\nciudad - NOUN__Gender=Fem|Number=Sing - NOUN - None\necuatoriana - ADJ__Gender=Fem|Number=Sing - ADJ - None\nde - ADP__AdpType=Prep - ADP - None\nGuayas - PROPN___ - PROPN - None\nno - ADV__Polarity=Neg - ADV - None\nconsta - VERB__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin - VERB - None\nen - ADP__AdpType=Prep - ADP 
- None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\ntomo - NOUN__Gender=Masc|Number=Sing - NOUN - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nla - DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art - DET - None\npágina - NOUN__Gender=Fem|Number=Sing - NOUN - None\ny - CCONJ___ - CONJ - None\nel - DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art - DET - None\nacta - NOUN__Gender=Masc|Number=Sing - NOUN - None\nsolicitado - ADJ__Gender=Masc|Number=Sing|VerbForm=Part - ADJ - None\n, - PUNCT__PunctType=Comm - PUNCT - None\nsegún - ADP__AdpType=Prep - ADP - None\nun - DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Art - DET - None\ndocumento - NOUN__Gender=Masc|Number=Sing - NOUN - None\noficial - ADJ__Number=Sing - ADJ - None\n. - PUNCT__PunctType=Peri - PUNCT - None\n"}],"execution_count":null},{"cell_type":"markdown","source":"1. tag_ gives the fine-grained POS tag of a token\n2. pos_ gives the coarse-grained POS tag of a token\n3. spacy.explain returns a short description of a particular POS tag","metadata":{"id":"A9U1VToeTcGl","cell_id":"23c45601576a44b984be3beb2edd468f","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Collect the tokens tagged as nouns and as adjectives\nnouns=[]\nadjectives=[]\nfor token in doc:\n    if token.pos_ =='NOUN':\n        nouns.append(token)\n    if token.pos_ =='ADJ':\n        adjectives.append(token)\nprint(nouns)\nprint(adjectives)","metadata":{"id":"ePn4C2N_ToHG","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"d7ea3dc883974e9a9bc94b698254481b","outputId":"df093f04-ee60-4f07-f1c6-c6d61653ef4b","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":387,"user_tz":240,"timestamp":1653232061050},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"[denuncia, alegaciones, falsificación, documentos, nacionalidad, selección, forma, contrario, combinado, ecuatoriano, repesca, defensa, seleccionador, partidos, ocasión, partidos, clave, equipo, cupos, 
pruebas, día, pruebas, jugador, investigaciones, informe, existencia, inconsistencias, certificado, nacimiento, jugador, organismo, conocimiento, irregularidades, sanción, resta, puntos, partidos, nómina, clasificados, informe, inscripción, nacimiento, ciudad, tomo, página, acta, documento]\n[interpuesta, posible, ecuatoriana, directa, junto, peruano, quinto, ecuatoriano, directos, \"Innumerables, pasado, innumerables, realizadas, jurídico, presentado, total, posible, técnico, jurídico, civil, ecuatoriana, solicitado, oficial]\n"}],"execution_count":null},{"cell_type":"markdown","source":"# Visualization using spaCy","metadata":{"id":"asrthaXMT1j0","cell_id":"b25482a05f2042c2ae723fef9488cbd4","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"from spacy import displacy\ntexto = ('el se encuentra interesado en aprender Procesamiento de Lenguaje Natural')\nt = nlp(texto)\ndisplacy.render(t, style='dep',jupyter=True)","metadata":{"id":"67YSTloeT3Ou","colab":{"height":354,"base_uri":"https://localhost:8080/"},"cell_id":"770eaf8a2d7f4913b94b2cc674f677a1","outputId":"9d9d409c-4475-45d8-bbc1-654576b1678a","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":424,"user_tz":240,"timestamp":1653232166048},"deepnote_cell_type":"code"},"outputs":[{"output_type":"display_data","data":{"text/plain":"","text/html":"\n\n el\n DET\n\n\n\n se\n PRON\n\n\n\n encuentra\n AUX\n\n\n\n interesado\n ADJ\n\n\n\n en\n ADP\n\n\n\n aprender\n VERB\n\n\n\n Procesamiento\n PROPN\n\n\n\n de\n ADP\n\n\n\n Lenguaje\n NOUN\n\n\n\n Natural\n PROPN\n\n\n\n \n \n det\n \n \n\n\n\n \n \n obj\n \n \n\n\n\n \n \n cop\n \n \n\n\n\n \n \n mark\n \n \n\n\n\n \n \n xcomp\n \n \n\n\n\n \n \n obj\n \n \n\n\n\n \n \n case\n \n \n\n\n\n \n \n nmod\n \n \n\n\n\n \n \n flat\n \n \n\n"},"metadata":{}}],"execution_count":null},{"cell_type":"markdown","source":"# Applied case: How can we predict the sentiment associated with a 
customer interaction?","metadata":{"id":"Tkpc-IyM6Vv7","cell_id":"36349b8c574c4eac948fb4d016d281aa","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"import os\nfrom collections import Counter\n\nimport matplotlib.pyplot as plt\nimport nltk\nimport numpy as np\nimport pandas as pd\n\nfrom pylab import rcParams\nfrom wordcloud import WordCloud\nfrom nltk import word_tokenize\n\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.metrics import f1_score\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\nrcParams['figure.figsize'] = 30, 60\n%matplotlib inline","metadata":{"id":"eY8dg6Hc6e4m","cell_id":"1e59ea0ff566444dbd7ac0b911f0ce83","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"# Goals\n\nAfter working through this case, you should be able to apply sentiment analysis to business problems. You will understand how to use text classification techniques to build a sentiment analysis model, and you will be able to apply sentiment models to real-world data.\n\nYou will have gained experience with several vectorization and model-building techniques, using popular Python libraries such as scikit-learn and gensim. You will have used both count-based vectorizations and word embeddings, and you will understand the pros and cons of each.","metadata":{"id":"3wPhXiFl6kbd","cell_id":"930b83720cfe475ba575c4e9b717d214","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Introduction\n**Business context**. You are a data scientist at a large e-commerce company. Tens of thousands of customers write product reviews every day. 
Each review contains a text comment along with a rating of 1 to 5 stars (1 being the least satisfied and 5 the most satisfied). \n\nThe company also has a customer support team that interacts with customers through calls and messaging services. In addition, it collects feedback on customers' experience with the website after each purchase. Neither this feedback nor the messaging service carries a numeric rating. The company wants to quantify customer satisfaction from these unrated interactions to support future business decisions (for example, assessing how well its various customer service agents are performing).\n\n**Business problem**. Your task is to build models that can identify the sentiment (positive or negative) of each of these unrated interactions.\n\n**Analytical context**. The data is a set of reviews stored in a CSV file. 
We will combine what we learned about text processing and classification models to develop algorithms capable of classifying interactions by sentiment.\n\nThe case is structured as follows: 1) you will read and analyze the input text data and the corresponding response variable (the ratings); 2) you will perform basic preprocessing to prepare the data for modeling; 3) you will learn and apply several ways of turning the review text into features; and finally 4) you will build machine learning models that classify the text as showing positive or negative sentiment (1 or 0).","metadata":{"id":"KMDBHuy06rc1","cell_id":"fc54e374072049c393111162674d8f0d","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Reading the data and basic analysis","metadata":{"id":"hhEWNyxp7J9a","cell_id":"772a2b5a3af24249a12970af737c8e7b","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"from google.colab import drive\nimport os\ndrive.mount('/content/gdrive')\n# Print the current directory, then switch the working directory to Google Drive\nprint(os.getcwd())\nos.chdir(\"/content/gdrive/My Drive\")","metadata":{"id":"CQFxWy6n7Wda","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"2680cc27e94e4058a2dcc1170f0d9379","outputId":"02a015f7-e18f-4eaf-d6c3-e3eaf49c6974","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":87002,"user_tz":240,"timestamp":1653425031155},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Mounted at /content/gdrive\n/content\n"}],"execution_count":1},{"cell_type":"code","source":"import pandas as pd\namazon_reviews = pd.read_csv('Reviews.csv')\n# Optionally keep only the first 10,000 records for faster computation\n#amazon_reviews = 
amazon_reviews[:10000]\namazon_reviews.head()","metadata":{"id":"zp73tMGp6haN","colab":{"height":337,"base_uri":"https://localhost:8080/"},"cell_id":"f7afa1b88d9143d2944af84b1d9735a3","outputId":"aa85f9ba-db9d-46e3-d875-e47102e3f291","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":8839,"user_tz":240,"timestamp":1653425118618},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":" Id ProductId UserId ProfileName \\\n0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian \n1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa \n2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres \"Natalia Corres\" \n3 4 B000UA0QIQ A395BORC6FGVXV Karl \n4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham \"M. Wassir\" \n\n HelpfulnessNumerator HelpfulnessDenominator Score Time \\\n0 1 1 5 1303862400 \n1 0 0 1 1346976000 \n2 1 1 4 1219017600 \n3 3 3 2 1307923200 \n4 0 0 5 1350777600 \n\n Summary Text \n0 Good Quality Dog Food I have bought several of the Vitality canned d... \n1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... \n2 \"Delight\" says it all This is a confection that has been around a fe... \n3 Cough Medicine If you are looking for the secret ingredient i... \n4 Great taffy Great taffy at a great price. There was a wid... ","text/html":"\n
\n "},"metadata":{},"execution_count":3}],"execution_count":3},{"cell_type":"code","source":"amazon_reviews.shape","metadata":{"id":"zJSLM18R0Ytm","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"ef87198733934c188d92b15c10c15ba0","outputId":"d96039e1-e30a-4f57-8d2d-eed5afec4f1c","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":217,"user_tz":240,"timestamp":1653425152489},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"(568454, 10)"},"metadata":{},"execution_count":4}],"execution_count":4},{"cell_type":"markdown","source":"Columnas del dataset:\n\n- **ID**: valor unico para cada fila\n- **ProductId**: una referencia del producto sobre la reseña\n- **UserId**: una referencia del usuario que dejo el review\n- **HelpfulnessNumerator**: numero de lectores que han indicado que la reseña ayuda o es interesante\n- **HelpfulnessDenominator**: numero total de personas que han dado una indicacion de si la reseña ha sido util o no \n- **Score**: rating (1-5)\n- **Time**: fecha formato timestamp que indica cuando se creo la revision \n- **Summary**: El resumen escrito por el usuario de lo que trata la reseña\n- **Text**: la revisión escrita por el usuario","metadata":{"id":"IsuNvQ957m-W","cell_id":"d0b61332062740749e6f33ff047d642a","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Miremos la distribucion del numero de palabras por review","metadata":{"id":"dq6TigUN8Zxy","cell_id":"fad108dbcbe04f648329310851fb09a7","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"words_per_review = amazon_reviews.Text.apply(lambda x: len(x.split(\" \")))\nwords_per_review.hist(bins = 100)\nplt.title('Numero de palabras por 
revision')\nplt.xlabel('Palabras')\nplt.ylabel('Frecuencia')","metadata":{"id":"FRoXoAfh8cUJ","colab":{"height":312,"base_uri":"https://localhost:8080/"},"cell_id":"c066c110f5f34d0788fe3667d97fae93","outputId":"383dc4ee-a6ba-42c8-9d08-288c9c92e3e6","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":4415,"user_tz":240,"timestamp":1653425197636},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"Text(0, 0.5, 'Frecuencia')"},"metadata":{},"execution_count":7},{"output_type":"display_data","data":{"text/plain":"
","image/png":"\n"},"metadata":{"needs_background":"light"}}],"execution_count":7},{"cell_type":"code","source":"words_per_review.mean()","metadata":{"id":"7mcYXdhd8f1Z","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"4052dac4840b4499a47632487e1b3171","outputId":"d9c2e6ec-4c61-4e9d-fc56-a78f25339f7c","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":226,"user_tz":240,"timestamp":1653425215006},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"82.00552199474363"},"metadata":{},"execution_count":8}],"execution_count":8},{"cell_type":"markdown","source":"Distribution of the ratings","metadata":{"id":"rRCyuNQ38gmB","cell_id":"2001e82922244900afe3c0aab5c14eb4","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"amazon_reviews.Score.value_counts()","metadata":{"id":"CIio-PLz8iHJ","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"f6a7fb776db34abaa1103c8d48c5ce15","outputId":"50333a0b-a863-435d-d704-08ac133c0b5c","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":218,"user_tz":240,"timestamp":1653425218506},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"5 363122\n4 80655\n1 52268\n3 42640\n2 29769\nName: Score, dtype: int64"},"metadata":{},"execution_count":9}],"execution_count":9},{"cell_type":"code","source":"percent_val = 100 * amazon_reviews.Score.value_counts()/amazon_reviews.shape[0]\npercent_val","metadata":{"id":"wWAVP9ad8jSL","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"124af6c91ea34c00ae3e8bb40d00c7a3","outputId":"6b87f690-8603-4e97-96ad-a9c42eb561ff","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos 
Usta"},"status":"ok","elapsed":215,"user_tz":240,"timestamp":1653425224397},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"5 63.878871\n4 14.188483\n1 9.194763\n3 7.501047\n2 5.236835\nName: Score, dtype: float64"},"metadata":{},"execution_count":10}],"execution_count":10},{"cell_type":"code","source":"percent_val.plot.bar()\nplt.title('Revisiones por scores')\nplt.xlabel('Score')\nplt.ylabel('Porcentaje (%)')","metadata":{"id":"GUlRhwhj8lEa","colab":{"height":309,"base_uri":"https://localhost:8080/"},"cell_id":"2dddfb71d53d40fbb42d721ab890b24a","outputId":"02a602e1-029b-43c2-82ce-8b01f03e8f19","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":636,"user_tz":240,"timestamp":1653425236417},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"Text(0, 0.5, 'Porcentaje (%)')"},"metadata":{},"execution_count":11},{"output_type":"display_data","data":{"text/plain":"
","image/png":"\n"},"metadata":{"needs_background":"light"}}],"execution_count":11},{"cell_type":"markdown","source":"The distribution is skewed: 5-star ratings account for about 64% of the reviews, while ratings of 3, 2 and 1 are comparatively rare","metadata":{"id":"jJMFaoaW8oE7","cell_id":"1b27c5e9de4a47ef8072ed7c48ff9571","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Wordcloud","metadata":{"id":"z6JfoGI18uT2","cell_id":"279ac92dab214777b8387bef88a01ffa","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Concatenate all reviews into one string (no separator is used, so adjacent reviews run together)\nword_cloud_text = ''.join(amazon_reviews.Text)\nprint(len(word_cloud_text))\n\nwordcloud = WordCloud(\n    max_font_size=100,\n    max_words=100,\n    background_color=\"white\",\n    scale=10,\n    width=800,\n    height=400\n).generate(word_cloud_text)\n\nplt.figure(figsize=(12,6))\nplt.imshow(wordcloud, interpolation=\"bilinear\")\nplt.axis(\"off\")\nplt.show()","metadata":{"id":"p4T2z4H-8vat","colab":{"height":374,"base_uri":"https://localhost:8080/"},"cell_id":"4cadd105dba544c99ff4924e4c8c358b","outputId":"ae7d7034-a376-4fcf-a86f-9cdc1c9a6a14","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":126653,"user_tz":240,"timestamp":1653425441082},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"247972188\n"},{"output_type":"display_data","data":{"text/plain":"
","image/png":"\n"},"metadata":{"needs_background":"light"}}],"execution_count":12},{"cell_type":"markdown","source":"The word cloud shows that many reviews talk about food (coffee, flavors, taste, drinks); other prominent words include good, love and best","metadata":{"id":"qQK8Ykur80AZ","cell_id":"4e0ca5dc8f464a3cb18629533dacc2f4","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Standardizing the ratings for sentiment analysis","metadata":{"id":"oCIjwyvX9DnC","cell_id":"1420a9eb1e034610800e705d389e2531","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"For the purposes of sentiment analysis, we will convert all ratings into binary values using the following rules:\n- ratings of 4 or 5 are mapped to 1 (positive)\n- ratings of 1 or 2 are mapped to 0 (negative)\n- ratings of 3 are removed from the analysis","metadata":{"id":"CFpvV3IP9HPt","cell_id":"a39d97237bf547018738580e29752291","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Note: at this point 3-star reviews are still present and labeled 0; they are dropped in the next cell\namazon_reviews['Sentiment_rating'] = np.where(amazon_reviews.Score > 3, 1, 0)\namazon_reviews['Sentiment_rating'].value_counts()","metadata":{"id":"XE3s2uHi9W0t","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"0b36956a6a99483bbdc89ba85f073fc5","outputId":"374ce431-7b43-47ad-ce0e-aa252ab4a443","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":230,"user_tz":240,"timestamp":1653425658306},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"1 443777\n0 124677\nName: Sentiment_rating, dtype: int64"},"metadata":{},"execution_count":13}],"execution_count":13},{"cell_type":"code","source":"# Remove the neutral (3-star) reviews\namazon_reviews = amazon_reviews[amazon_reviews.Score != 
3]","metadata":{"id":"Ne9ciavE9Gp-","cell_id":"ee06e9db81cf4d1ca7e78e5f8b17ce4a","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":218,"user_tz":240,"timestamp":1653425668029},"deepnote_cell_type":"code"},"outputs":[],"execution_count":14},{"cell_type":"code","source":"#rcParams['figure.figsize'] = 8, 5\namazon_reviews.Sentiment_rating.value_counts().plot.bar()\nplt.title('Scores luego de estandarizacion')\nplt.xlabel('Score')\nplt.ylabel('Frecuencia')","metadata":{"id":"7EFLK3p69ajb","colab":{"height":309,"base_uri":"https://localhost:8080/"},"cell_id":"664dd4402100416fb7f2dc2ce4da1ba9","outputId":"4756d0bc-3398-4ab0-a469-fa5ed8a0a356","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":731,"user_tz":240,"timestamp":1653425680669},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"Text(0, 0.5, 'Frecuencia')"},"metadata":{},"execution_count":16},{"output_type":"display_data","data":{"text/plain":"
","image/png":"\n"},"metadata":{"needs_background":"light"}}],"execution_count":16},{"cell_type":"markdown","source":"## I am up to here!!!","metadata":{"id":"1WI1bJFu3St2","cell_id":"73ab69ea8dd541df83f094c15fb84a8c","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Preprocessing","metadata":{"id":"yQHkuA0Y9e2q","cell_id":"b0c685ca52bf42c0b56bfe728ff3f5c2","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Recall that text preprocessing and normalization are crucial before building an NLP model. Some important steps are:\n1. Converting words to lowercase\n2. Removing special characters\n3. Removing stopwords and high-frequency words\n4. Stemming and lemmatization\n\nLet's start with the first step","metadata":{"id":"DqxWLz1PAfer","cell_id":"3e7c82c1751248ce950bd8fbd8110210","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"amazon_reviews['reviews_text_new'] = amazon_reviews.Text.apply(lambda x: x.lower())","metadata":{"id":"nqPMAWzaAzWk","cell_id":"32d07a0d9e19489bb77900e8c1337113","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"Next, we compare the number of unique tokens before and after lowercasing:","metadata":{"id":"417Vf1C4A1oI","cell_id":"33f1d1e3e8f94d25a144df14916d57a2","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"from nltk import word_tokenize\nimport nltk\n\nnltk.download('punkt')\n\n# Unique tokens in the original text\ntoken_lists = [word_tokenize(each) for each in amazon_reviews.Text]\ntokens = [item for sublist in token_lists for item in sublist]\nprint(\"Numero de tokens unicos antes: \", len(set(tokens)))\n\n# Unique tokens in the lowercased text\ntoken_lists_lower = [word_tokenize(each) for each in amazon_reviews.reviews_text_new]\ntokens_lower = [item for sublist in token_lists_lower for item in sublist]\nprint(\"Numero de tokens unicos nuevos: \", 
len(set(tokens_lower)))","metadata":{"id":"zC_-ZELYA3Iz","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"434ffeb8665143a78b5497cdbdf5f1b2","outputId":"a462dd5e-1d0e-4e8c-e9f5-f2ef5843ec86","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":13629,"user_tz":240,"timestamp":1653305963072},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"[nltk_data] Downloading package punkt to /root/nltk_data...\n[nltk_data] Unzipping tokenizers/punkt.zip.\nNumero de tokens unicos antes: 27899\nNumero de tokens unicos nuevos: 22865\n"}],"execution_count":null},{"cell_type":"code","source":"(22865-27899)/27899","metadata":{"id":"2Ixj5AbQBFaN","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"c98f4396515f4086918d099070466134","outputId":"d001f26a-bb9d-4ed9-ee98-5d0435746927","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":22,"user_tz":240,"timestamp":1653305963073},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"-0.18043657478762679"},"metadata":{},"execution_count":18}],"execution_count":null},{"cell_type":"markdown","source":"Lowercasing reduced the number of unique tokens by about 18%.","metadata":{"id":"EQBzVKWlA_-0","cell_id":"48c1f1e5ebca4926ba881750ffe7411b","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Is removing special characters even a good idea? Which characters would probably be safe to remove, and which would not?","metadata":{"id":"uIH8nmc8BMav","cell_id":"09785bbc4bd141aa88971ed39dae4d16","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Removing special characters is a judgment call, especially in cases like this one. 
People often use special characters to express emotion: a negative review might read \"This product is the worst!!!\", while a positive one might read \"This product is the best. I loved it!\". Here the exclamation marks clearly say something about the underlying sentiment, so removing them may not be a good idea.\n\nOn the other hand, punctuation that carries no emotional charge, such as commas, periods, and semicolons, is probably safe to remove.\n\nFor simplicity, we will proceed by removing all special characters; keep in mind, however, that this choice is worth revisiting depending on the results we get later. The following cell lists every special character in our dataset:","metadata":{"id":"8SnPc7g6BV9N","cell_id":"d0b1b40852664b6e925e04a9a89c6720","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Select the non-alphanumeric characters that are not spaces\nspecial_chars = amazon_reviews.reviews_text_new.apply(lambda x: [each for each in list(x) if not each.isalnum() and each != ' '])\n\n# Flatten the list of lists into a single list\nflat_list = [item for sublist in special_chars for item in sublist]\n\n# Unique special characters\nprint(set(flat_list))","metadata":{"id":"33IALa4i9gB8","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"ffb7825d044e4536ade821f37e1fe39c","outputId":"eddb3dd7-38e8-4a07-be73-205eb699d350","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":662,"user_tz":240,"timestamp":1653305969133},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"{')', '`', '§', '^', '~', \"'\", '[', ';', '.', '\"', '<', '{', '(', '#', '}', '®', '&', '@', '-', ':', '>', '=', '_', '!', '%', ',', '?', ']', '+', '$', '*', '/'}\n"}],"execution_count":null},{"cell_type":"markdown","source":"Now 
let's remove the special characters from the reviews","metadata":{"id":"kJeTZCmfBrX8","cell_id":"fab80a93a4d74b0588c539e8ef2c9033","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"import re\nreview_backup = amazon_reviews.reviews_text_new.copy()\namazon_reviews.reviews_text_new = amazon_reviews.reviews_text_new.apply(\n    lambda x: re.sub('[^A-Za-z0-9 ]+', ' ', x)\n)","metadata":{"id":"tI1Zh8MiBuZw","cell_id":"1ae200e61a074468af65bbe3d236a55b","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"Let's look at one review before and after the removal","metadata":{"id":"btC5RULGBxLX","cell_id":"b32f9f1db48c4f19a13aca864c18edce","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"print(\"Review anterior:\")\nreview_backup.values[6]","metadata":{"id":"wITFs61CB0cx","colab":{"height":105,"base_uri":"https://localhost:8080/"},"cell_id":"bbb884e9f5084ca7973e0496db99f7e7","outputId":"e35044b9-2529-40b2-d67c-dbd297c43c4a","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":12,"user_tz":240,"timestamp":1653305972711},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Review anterior:\n"},{"output_type":"execute_result","data":{"text/plain":"\"this saltwater taffy had great flavors and was very soft and chewy. each candy was individually wrapped well. none of the candies were stuck together, which did happen in the expensive version, fralinger's. would highly recommend this candy! 
i served it at a beach-themed party and everyone loved it!\"","application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":21}],"execution_count":null},{"cell_type":"code","source":"print(\"Review nuevo:\")\namazon_reviews.reviews_text_new[6]","metadata":{"id":"E8K8haC9B29Y","colab":{"height":105,"base_uri":"https://localhost:8080/"},"cell_id":"6adf9bbba0ca482ca959212323cfd2fb","outputId":"4c697f4b-7fe5-4038-a660-f7611313e536","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":494,"user_tz":240,"timestamp":1653305974276},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Review nuevo:\n"},{"output_type":"execute_result","data":{"text/plain":"'this saltwater taffy had great flavors and was very soft and chewy each candy was individually wrapped well none of the candies were stuck together which did happen in the expensive version fralinger s would highly recommend this candy i served it at a beach themed party and everyone loved it '","application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":22}],"execution_count":null},{"cell_type":"markdown","source":"Number of unique tokens before and after removing special characters:","metadata":{"id":"viiMJcdXB9fA","cell_id":"cb64564df97742d5bdf22f6abbbf65f5","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"token_lists = [word_tokenize(each) for each in amazon_reviews.Text]\ntokens = [item for sublist in token_lists for item in sublist]\nprint(\"Numero de token unicos antes: \", len(set(tokens)))\n\ntoken_lists = [word_tokenize(each) for each in amazon_reviews.reviews_text_new]\ntokens = [item for sublist in token_lists for item in sublist]\nprint(\"Numero de tokens unicos despues: \", 
len(set(tokens)))","metadata":{"id":"HkwuWkWZB8oz","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"e3bc65094c5b44d7846d3258d9189480","outputId":"a494a9f9-6de7-404f-c5b8-af28dd218897","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":9742,"user_tz":240,"timestamp":1653305985492},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Numero de token unicos antes: 27899\nNumero de tokens unicos despues: 18039\n"}],"execution_count":null},{"cell_type":"markdown","source":"## Stopwords and high/low-frequency words\n\nNext we remove these words","metadata":{"id":"-EKO6f24CIbg","cell_id":"3caa616d07cf4b56b96bdcf595479137","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"import nltk\nnltk.download('stopwords')\n\nnoise_words = []\nstopwords_corpus = nltk.corpus.stopwords\neng_stop_words = stopwords_corpus.words('english')\nnoise_words.extend(eng_stop_words)\nprint(len(noise_words))\nnoise_words","metadata":{"id":"Z6uhTzzVCPB2","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"8beed5d9bb6f4755a801c9949878e5cc","outputId":"a4b74b78-0ea3-448e-8360-0df805d17a4c","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":12,"user_tz":240,"timestamp":1653305985493},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"[nltk_data] Downloading package stopwords to /root/nltk_data...\n[nltk_data] Unzipping corpora/stopwords.zip.\n179\n"},{"output_type":"execute_result","data":{"text/plain":"['i',\n 'me',\n 'my',\n 'myself',\n 'we',\n 'our',\n 'ours',\n 'ourselves',\n 'you',\n \"you're\",\n \"you've\",\n \"you'll\",\n \"you'd\",\n 'your',\n 'yours',\n 'yourself',\n 'yourselves',\n 'he',\n 'him',\n 'his',\n 'himself',\n 'she',\n \"she's\",\n 'her',\n 'hers',\n 'herself',\n 'it',\n \"it's\",\n 'its',\n 'itself',\n 'they',\n 'them',\n 
'their',\n 'theirs',\n 'themselves',\n 'what',\n 'which',\n 'who',\n 'whom',\n 'this',\n 'that',\n \"that'll\",\n 'these',\n 'those',\n 'am',\n 'is',\n 'are',\n 'was',\n 'were',\n 'be',\n 'been',\n 'being',\n 'have',\n 'has',\n 'had',\n 'having',\n 'do',\n 'does',\n 'did',\n 'doing',\n 'a',\n 'an',\n 'the',\n 'and',\n 'but',\n 'if',\n 'or',\n 'because',\n 'as',\n 'until',\n 'while',\n 'of',\n 'at',\n 'by',\n 'for',\n 'with',\n 'about',\n 'against',\n 'between',\n 'into',\n 'through',\n 'during',\n 'before',\n 'after',\n 'above',\n 'below',\n 'to',\n 'from',\n 'up',\n 'down',\n 'in',\n 'out',\n 'on',\n 'off',\n 'over',\n 'under',\n 'again',\n 'further',\n 'then',\n 'once',\n 'here',\n 'there',\n 'when',\n 'where',\n 'why',\n 'how',\n 'all',\n 'any',\n 'both',\n 'each',\n 'few',\n 'more',\n 'most',\n 'other',\n 'some',\n 'such',\n 'no',\n 'nor',\n 'not',\n 'only',\n 'own',\n 'same',\n 'so',\n 'than',\n 'too',\n 'very',\n 's',\n 't',\n 'can',\n 'will',\n 'just',\n 'don',\n \"don't\",\n 'should',\n \"should've\",\n 'now',\n 'd',\n 'll',\n 'm',\n 'o',\n 're',\n 've',\n 'y',\n 'ain',\n 'aren',\n \"aren't\",\n 'couldn',\n \"couldn't\",\n 'didn',\n \"didn't\",\n 'doesn',\n \"doesn't\",\n 'hadn',\n \"hadn't\",\n 'hasn',\n \"hasn't\",\n 'haven',\n \"haven't\",\n 'isn',\n \"isn't\",\n 'ma',\n 'mightn',\n \"mightn't\",\n 'mustn',\n \"mustn't\",\n 'needn',\n \"needn't\",\n 'shan',\n \"shan't\",\n 'shouldn',\n \"shouldn't\",\n 'wasn',\n \"wasn't\",\n 'weren',\n \"weren't\",\n 'won',\n \"won't\",\n 'wouldn',\n \"wouldn't\"]"},"metadata":{},"execution_count":24}],"execution_count":null},{"cell_type":"markdown","source":"Let's find the high- and low-frequency words, which we define as the 1% of distinct words that appear most often in the reviews and the 1% that appear least often (after normalizing case and removing special 
characters).","metadata":{"id":"oonZd40DCRnt","cell_id":"f982d1c73dba40f6a59623f0e50ed612","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"from collections import Counter\n\none_percentile = int(len(set(tokens)) * 0.01)\ntop_1_percentile = Counter(tokens).most_common(one_percentile)\ntop_1_percentile[:10]","metadata":{"id":"phySUXSpCb7N","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"cadffbb65ec44b849dbb24b87317337f","outputId":"2525fa21-3767-4d51-d9a2-407d55de420a","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":264,"user_tz":240,"timestamp":1653305995000},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"[('the', 28122),\n ('i', 25705),\n ('and', 19980),\n ('a', 18505),\n ('it', 16143),\n ('to', 15137),\n ('of', 12067),\n ('is', 11063),\n ('this', 10530),\n ('br', 9361)]"},"metadata":{},"execution_count":25}],"execution_count":null},{"cell_type":"code","source":"pd.DataFrame(top_1_percentile[:10], columns=['Palabras','Frecuencia']).set_index('Palabras').plot(kind='bar')\nplt.title('Percentil 1 de palabras mas frecuentes')\nplt.xlabel('Palabras')\nplt.ylabel('Frecuencia')","metadata":{"id":"cIogh--ct3CJ","colab":{"height":376,"base_uri":"https://localhost:8080/"},"cell_id":"f925313e71894e2bae2f0ebc5faf5cb2","outputId":"091a5833-1592-4e74-8ce8-869c308a64df","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":832,"user_tz":240,"timestamp":1653306148618},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"Text(0, 0.5, 'Frecuencia')"},"metadata":{},"execution_count":35},{"output_type":"display_data","data":{"text/plain":"
","image/png":"\n"},"metadata":{"needs_background":"light"}}],"execution_count":null},{"cell_type":"code","source":"bottom_1_percentile = Counter(tokens).most_common()[-one_percentile:]\nbottom_1_percentile[:10]","metadata":{"id":"lnFGqSCTCdU5","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"05f4902d0f9140589a08ef637600c87e","outputId":"f13d499a-643d-4cad-800b-7a5487c00a37","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":470,"user_tz":240,"timestamp":1653306224182},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"[('pruchase', 1),\n ('slick', 1),\n ('cloured', 1),\n ('innocuous', 1),\n ('espensive', 1),\n ('marketer', 1),\n ('strofoam', 1),\n ('destroyers', 1),\n ('ruth', 1),\n ('gleaning', 1)]"},"metadata":{},"execution_count":36}],"execution_count":null},{"cell_type":"code","source":"noise_words.extend([word for word,val in top_1_percentile])\nnoise_words.extend([word for word,val in bottom_1_percentile])","metadata":{"id":"g-A1HtiWCfH2","cell_id":"7f29d3cfcb7a491db01b44306d1f9a36","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"The stopwords and the high/low-frequency words have now been added to `noise_words`, which will be removed from the reviews before training the machine learning models.\n\nStopwords are unlikely to be very useful, since we expect them to appear about equally often in positive and negative reviews. 
Rare words may carry more meaning and could in theory signal the sentiment of a review, but a word that appears only once (often a typo, such as \"pruchase\" or \"espensive\" above) gives the model very little to learn from.","metadata":{"id":"n51bDGhkCiQF","cell_id":"5fb1727b662f403891305c98cda30a5a","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Stemming and Lemmatization\n\nTo go deeper into [stemming](http://www.nltk.org/api/nltk.stem.html?highlight=lemmatizer), [lemmatization](http://www.nltk.org/api/nltk.stem.html?highlight=lemmatizer#module-nltk.stem.wordnet), and other kinds of normalization, a good introduction is available at: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.","metadata":{"id":"TFmioUxSCtCO","cell_id":"f1cf83798b40484781af1c485f333a89","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"from nltk.stem import PorterStemmer, WordNetLemmatizer, LancasterStemmer\n\nnltk.download('wordnet')\n\nfrom nltk.corpus import wordnet\n\nporter = PorterStemmer()\nlancaster = LancasterStemmer()\nlemmatizer = WordNetLemmatizer()","metadata":{"id":"uWkdbiEtC8Qc","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"066514759d8447e5b604238a97e4e8f1","outputId":"7d673d45-6f69-4e6c-b741-b6e1e33d734e","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":879,"user_tz":240,"timestamp":1653261196018},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"[nltk_data] Downloading package wordnet to /root/nltk_data...\n[nltk_data] Unzipping corpora/wordnet.zip.\n"}],"execution_count":null},{"cell_type":"markdown","source":"**Stemming** algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes.\n\n**Lemmatization**, on the other hand, relies on a morphological analysis of the word. 
It takes the word's grammar into account and tries to find the actual root word (the lemma) rather than reaching a root by brute-force truncation.","metadata":{"id":"OWrUPKFfC9sa","cell_id":"1e8b9ad59445480eb33dc0ef51a82425","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"print(\"Lancaster Stemmer\")\nprint(lancaster.stem(\"trouble\"))\nprint(lancaster.stem(\"troubling\"))\nprint(lancaster.stem(\"troubled\"))\n\n# Provide each word to lemmatize together with its part of speech\nprint(\"WordNet Lemmatizer\")\nprint(lemmatizer.lemmatize(\"trouble\", wordnet.NOUN))\nprint(lemmatizer.lemmatize(\"troubling\", wordnet.VERB))\nprint(lemmatizer.lemmatize(\"troubled\", wordnet.VERB))","metadata":{"id":"CzTULvebDJgV","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"2426921def0946a4ab5ab7c5ae50623c","outputId":"e091a299-d51c-4af8-c5c4-5142c9ead3fd","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":3355,"user_tz":240,"timestamp":1653261265067},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"Lancaster Stemmer\ntroubl\ntroubl\ntroubl\nWordNet Lemmatizer\ntrouble\ntrouble\ntrouble\n"}],"execution_count":null},{"cell_type":"markdown","source":"The lemmatizer returns a meaningful root word, while the stemmer simply trims the word and keeps its leading part.","metadata":{"id":"MfnEC2oUDQc5","cell_id":"f951894c6b294543aae1936652aaa5b4","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Building a model with 
ML","metadata":{"id":"EKYImfc-DSu4","cell_id":"65531152936e49aa83e271720692034b","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"amazon_reviews[['Text','Score','Sentiment_rating']].head(5)","metadata":{"id":"fdRlAQcWDWEf","colab":{"height":206,"base_uri":"https://localhost:8080/"},"cell_id":"b511e2cc367340dfbe99a96ca6225700","outputId":"a61604b0-1910-45c1-8e1b-1327aaaea8ce","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":363,"user_tz":240,"timestamp":1653261297738},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":" Text Score Sentiment_rating\n0 I have bought several of the Vitality canned d... 5 1\n1 Product arrived labeled as Jumbo Salted Peanut... 1 0\n2 This is a confection that has been around a fe... 4 1\n3 If you are looking for the secret ingredient i... 2 0\n4 Great taffy at a great price. There was a wid... 5 1","text/html":"\n
\n "},"metadata":{},"execution_count":29}],"execution_count":null},{"cell_type":"markdown","source":"The model's independent variables, or features, are derived from the review text. Earlier we discussed how **n-grams** can be used to build features, and how bag of words is the simplest case of these n-grams: it ignores order and context entirely and looks only at frequency counts. Let's use that as a starting point.","metadata":{"id":"I-0HWHdbDYwZ","cell_id":"70e9c97f8488438aaa09615b856db41c","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Bag of words\n\n`CountVectorizer` is a scikit-learn class that automatically handles several preprocessing steps, such as stop-word removal, n-gram creation, and word tokenization. It does not stem words on its own; stemming can be added through a custom analyzer such as the one defined here:","metadata":{"id":"qENuMxTvDen5","cell_id":"aeec9334fc344c66a474c1cc3c40ec31","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Build an analyzer that stems every token\nfrom nltk.stem import PorterStemmer\n\nstemmer = PorterStemmer()\nanalyzer = CountVectorizer().build_analyzer()\n\ndef stemmed_words(doc):\n    return (stemmer.stem(w) for w in analyzer(doc))","metadata":{"id":"esyrY_TqDdst","cell_id":"93c0e92a9df642abab69f9702787acd6","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"Let's now build a bag of words from the reviews, excluding the noise words we identified earlier (this vectorizer tokenizes with `word_tokenize`; the stemming analyzer above is not applied here):","metadata":{"id":"8OHClEiJDr9_","cell_id":"fa68a4b1adc2441c9b225ac0313ce0e8","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Create a CountVectorizer object\nbow_counts = CountVectorizer(\n    tokenizer=word_tokenize,\n    stop_words=noise_words,\n    ngram_range=(1, 
1)\n)","metadata":{"id":"7wV6wAd8DupW","cell_id":"826feace271946ae96b96a19c97244c5","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"Once the bag-of-words vectorizer is defined, the dataset must be split into training and test sets. We could also split the data after vectorizing it, but it is useful to split as early as possible in the process. That way, once we have generated predictions, we can compare them more easily with the original texts, before they were preprocessed and vectorized.","metadata":{"id":"0I--fC0SDz3o","cell_id":"dbdfd8b242844732b0e7c84136b64bff","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nreviews_train, reviews_test = train_test_split(amazon_reviews, test_size=0.2, random_state=0)","metadata":{"id":"3-D0n379D5_G","cell_id":"e25ad7f80e5c4ccba5e8332e4d79e733","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"X_train_bow = bow_counts.fit_transform(reviews_train.reviews_text_new)\nX_test_bow = bow_counts.transform(reviews_test.reviews_text_new)","metadata":{"id":"3EkmtCUeD7Tu","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"3aac2465c764490aa6914a25a73407e4","outputId":"6cc420e5-bf80-4b42-d98a-e6af702a4ad5","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":6641,"user_tz":240,"timestamp":1653261456737},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stderr","text":"/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:401: UserWarning: Your stop_words may be inconsistent with your preprocessing. 
Tokenizing the stop words generated tokens [\"'d\", \"'ll\", \"'re\", \"'s\", \"'ve\", 'might', 'must', \"n't\", 'need', 'sha', 'wo'] not in stop_words.\n % sorted(inconsistent)\n"}],"execution_count":null},{"cell_type":"markdown","source":"Note that we call `fit_transform` to vectorize the training set and `transform` to vectorize the test set. This learns the vocabulary mapping from the training data only, which mirrors the constraint we would face in a real-world problem (no access to the test data at training time).\n\nAs a result, the test set may contain words that are not in the learned vocabulary; these are simply ignored.","metadata":{"id":"8q-tk94fD_Vu","cell_id":"fb6fe1363e264c71aafdf89e35a1bf18","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"y_train_bow = reviews_train['Sentiment_rating']\ny_test_bow = reviews_test['Sentiment_rating']","metadata":{"id":"mu-ClINID76n","cell_id":"18e17c065e1249278c87670f06827829","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"y_test_bow.value_counts() / y_test_bow.shape[0]","metadata":{"id":"KE-P_hwcEJzc","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"aafcf5ca68bb4fe7bcf8907b870ce65f","outputId":"88d11217-3aa5-488f-bfce-48456300cca0","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":383,"user_tz":240,"timestamp":1653261511710},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"1 0.847921\n0 0.152079\nName: Sentiment_rating, dtype: float64"},"metadata":{},"execution_count":35}],"execution_count":null},{"cell_type":"markdown","source":"About 84.8% of the reviews in the test set are positive. 
The simplest prediction model we could think of would be one that predicts \"positive\" for every input. We would call this a \"naive\" model, and it makes a useful baseline. In this case, such a model would reach about 84.8% accuracy, so we can treat that as the reference score our machine learning model must beat.","metadata":{"id":"i3RSbM62ENeq","cell_id":"1d4f7d35ee42490f932e69e8fbe6f215","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"## Logistic regression model","metadata":{"id":"b5GGWWW3EYyR","cell_id":"b83d927d224043fba9866384f7be8280","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"from sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score, f1_score\n\n# Train the model\nlr_model_all = LogisticRegression(C=1, solver=\"liblinear\")\nlr_model_all.fit(X_train_bow, y_train_bow)\n\n# Predict on the test set\ntest_pred_lr_prob = lr_model_all.predict_proba(X_test_bow)\ntest_pred_lr_all = lr_model_all.predict(X_test_bow)\n\nprint(\"F1 score: \", f1_score(y_test_bow, test_pred_lr_all))\nprint(\"Accuracy: \", accuracy_score(y_test_bow, test_pred_lr_all) * 100)","metadata":{"id":"_H_xrYMjEcBs","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"bbb2e00c459e4fb7b2cec24ec42fbe79","outputId":"148a6c87-bee1-460a-9c15-b4a781a1bdf4","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":901,"user_tz":240,"timestamp":1653261600835},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"F1 score: 0.942550505050505\nAccuracy: 90.04376367614879\n"}],"execution_count":null},{"cell_type":"code","source":"test_pred_lr_prob","metadata":{"id":"B9g9Z2AFEijY","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"a0067f829fd049f39b7dd9dd74e563fb","outputId":"00e19690-216e-4b4a-fb8d-7057e46345aa","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos 
Usta"},"status":"ok","elapsed":382,"user_tz":240,"timestamp":1653261619777},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"array([[7.97531898e-01, 2.02468102e-01],\n [2.70697065e-09, 9.99999997e-01],\n [5.15753584e-02, 9.48424642e-01],\n ...,\n [1.23713200e-04, 9.99876287e-01],\n [9.96585322e-01, 3.41467779e-03],\n [9.18040817e-02, 9.08195918e-01]])"},"metadata":{},"execution_count":38}],"execution_count":null},{"cell_type":"code","source":"probabilities = [each[1] for each in test_pred_lr_prob]","metadata":{"id":"c3pDuetkEmfq","cell_id":"370ac5caa3bb43faaa2287baefdcdbda","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"predictions = pd.DataFrame()\npredictions['Text'] = reviews_test['Text']\npredictions['Actual_Score'] = reviews_test['Score']\npredictions['Sentiment_rating'] = reviews_test['Sentiment_rating']\npredictions['Predicted_sentiment'] = test_pred_lr_all\npredictions['Predicted_probability'] = probabilities","metadata":{"id":"yhK1q_skEoqv","cell_id":"baf0e4099be94b109eb9217cbd0cac23","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"predictions.head(5)","metadata":{"id":"tk36qT2QEruF","colab":{"height":424,"base_uri":"https://localhost:8080/"},"cell_id":"5de9695d91454190bb7eb09c68ca3bb8","outputId":"327719ad-3e1f-48f7-f1a5-f47f3e37a27e","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":9,"user_tz":240,"timestamp":1653261649104},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":" Text Actual_Score \\\n7419 These truffles are not as great as they are ma... 2 \n4315 Ok--I just bought BIsquick GF and the first th... 4 \n1781 I love these chips so much... and I can't get ... 5 \n962 I purchased this because I read that it was a ... 2 \n7289 I do like the idea of these crackers for glute... 
2 \n\n Sentiment_rating Predicted_sentiment Predicted_probability \n7419 0 0 0.202468 \n4315 1 1 1.000000 \n1781 1 1 0.948425 \n962 0 0 0.363864 \n7289 0 1 0.954062 ","text/html":"\n
\n "},"metadata":{},"execution_count":41}],"execution_count":null},{"cell_type":"code","source":"accuracy_score(predictions['Sentiment_rating'], predictions['Predicted_sentiment'])","metadata":{"id":"yOp3mDlXEuJA","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"50d2db3274a24431a9cfd41b74e44cd5","outputId":"628f8ea3-bc69-403e-9b1c-f3ca2bc99a74","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":440,"user_tz":240,"timestamp":1653261658529},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"0.9004376367614879"},"metadata":{},"execution_count":42}],"execution_count":null},{"cell_type":"markdown","source":"The `Predicted_probability` column shows the model's estimated probability that a review is positive: values very close to 0 are highly confident negative predictions, and values very close to 1 are highly confident positive predictions.\n\nUse this information to find the case where the model was most confident that a review was negative when its actual sentiment was positive.\n\nRead the text and write a few sentences analyzing why you think the model got it wrong.","metadata":{"id":"OBKZYoZtF4qw","cell_id":"1adae403f3a84e9e967b5d77cb7c5659","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"predictions[\n    predictions['Predicted_sentiment'] != predictions['Sentiment_rating']\n].sort_values(by=[\"Predicted_probability\"]).head(3)","metadata":{"id":"VN0O6ei8GuOS","colab":{"height":292,"base_uri":"https://localhost:8080/"},"cell_id":"6bb93e4e466a4e3b8a32a8dd80d6a8ee","outputId":"634e4f98-9049-4b75-b963-ead16bd8e03b","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos 
Usta"},"status":"ok","elapsed":520,"user_tz":240,"timestamp":1653262183193},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":" Text Actual_Score \\\n7692 My little girl (6 months) can't stomach any fo... 5 \n4166 The first time I tried this product to make pa... 4 \n6612 When I opened the package, I only found 11 pac... 4 \n\n Sentiment_rating Predicted_sentiment Predicted_probability \n7692 1 0 0.000719 \n4166 1 0 0.002114 \n6612 1 0 0.006040 ","text/html":"\n
\n "},"metadata":{},"execution_count":43}],"execution_count":null},{"cell_type":"code","source":"predictions.loc[7692].values","metadata":{"id":"S-G3sENDGu_i","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"a8f24770efdc4fc498e128e0bbb63ba9","outputId":"e759240a-10d6-4421-8a25-7530360a980e","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":417,"user_tz":240,"timestamp":1653262189852},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"array([\"My little girl (6 months) can't stomach any formula. Even Nutramigen, at $35 per can made her colicky and sick (accompanied with the worst smelling spit up). The expensive formulas smelled bad, and tasted worse, I hated giving them to her. Goats milk was recommended to me. Her doctor's only objection was that it was low in Folic Acid and tended to dehydrate babies. Well, Meyenberg Powdered Milk is fortified with Folic Acid, so dilute it with a little extra water and you're in business! I now have the happiest infant you've ever seen. Enfamil can take that nasty chemical soup they call formula and shove it, I'll never go back.

You can get the formula Recipe at:
[...]

There have been concerns about arsenic in rice syrup lately, so I have been using Lyle's Golden Syrup, and she likes it fine (Also it's the best pancake syrup you'll ever have). I am going to try barley malt syrup next, it is chemically more similar to rice syrup.\",\n 5, 1, 0, 0.0007189017742100002], dtype=object)"},"metadata":{},"execution_count":44}],"execution_count":null},{"cell_type":"markdown","source":"We can see that the review has a very negative tone and uses words the model has probably learned to associate strongly with bad reviews, such as \"expensive\", \"bad\", \"worst\", \"hated\", \"nasty\", and \"never\". However, all of these are actually aimed at a different product. The author says that after buying this product they have the \"happiest infant\", but that is not strong enough to offset all the negative language in the review.","metadata":{"id":"3zbOLxXOGxk_","cell_id":"e7f2af358123437d843a99b528d6cca6","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Modify the model's feature set to include bigrams, trigrams, and 4-grams. Do not remove the noise words defined earlier before vectorizing. (Hint: set `ngram_range=(1,4)`.)\n\nAt the same time, experiment with hyperparameter tuning. Change the `C` value of the logistic regression classifier to 0.9.","metadata":{"id":"uq3urjaaHMV0","cell_id":"7bc3f6b91f0042d6b9a4e29d622eab78","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Changes from the previous code\n# 1. Extend the n-grams from unigrams only to 1-grams, 2-grams, 3-grams, and 4-grams\n# 2. 
Keep the stopwords (noise words) in the bag-of-words features\nbow_counts = CountVectorizer(\n    tokenizer=word_tokenize,\n    ngram_range=(1,4)\n)\n\nX_train_bow = bow_counts.fit_transform(reviews_train.reviews_text_new)\nX_test_bow = bow_counts.transform(reviews_test.reviews_text_new)","metadata":{"id":"kEqVP5IuHSKk","cell_id":"79aafb0a428f4593a1760be7b744df3b","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"# Note the growth in the number of features once n-grams and stopwords are included\nX_train_bow","metadata":{"id":"8IPlWx1OHYKg","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"d378fedcba5c4792b3ffd5183af4e07a","outputId":"aed3e843-9871-4f9c-d07f-8712cbdbe6bd","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":504,"user_tz":240,"timestamp":1653262371710},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":"<7310x1054259 sparse matrix of type ''\n\twith 2036674 stored elements in Compressed Sparse Row format>"},"metadata":{},"execution_count":46}],"execution_count":null},{"cell_type":"code","source":"# Changes to the logistic regression\n# The regularization penalty stays at the default (l2); no penalty argument is passed\n# The inverse regularization strength C is changed to 0.9\nlr_model_all_new = LogisticRegression(C=0.9, solver=\"liblinear\")","metadata":{"id":"GatU304PHdRw","cell_id":"809c87dc229d4406bfa0dacc93e9976c","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"# Train the model\nlr_model_all_new.fit(X_train_bow, y_train_bow)\n\n# Predict on the test set\ntest_pred_lr_prob = lr_model_all_new.predict_proba(X_test_bow)\ntest_pred_lr_all = lr_model_all_new.predict(X_test_bow)\n\nprint(\"F1 score: \", f1_score(y_test_bow, test_pred_lr_all))\nprint(\"Accuracy: \", accuracy_score(y_test_bow, test_pred_lr_all) * 
100)","metadata":{"id":"t-9RjVU3HjeM","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"b65406b875044b60b1b58ef77e5fd035","outputId":"9a7ed7cb-94ec-42eb-b871-2fdc46affd00","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":21013,"user_tz":240,"timestamp":1653262433168},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"F1 score: 0.9554579673776662\nAccuracy: 92.23194748358861\n"}],"execution_count":null},{"cell_type":"markdown","source":"Accuracy has jumped from 90% to 92.2%. This is an example of what simple hyperparameter tuning and changes to the input features can do for overall performance. We can even obtain interpretable features from this model, in terms of what contributed most to positive and negative sentiment.","metadata":{"id":"AU11TtV5Hp3J","cell_id":"8dc1d33fe9ef416d99e771498565dae6","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# get_feature_names_out() replaces get_feature_names(), which was\n# deprecated in scikit-learn 1.0 and removed in 1.2\nlr_weights = pd.DataFrame(\n    list(zip(bow_counts.get_feature_names_out(), lr_model_all_new.coef_[0])),\n    columns=['words','weights']\n)\n\nlr_weights.sort_values(['weights'],ascending = False)[:15]","metadata":{"id":"1zboEFrIHvC5","colab":{"height":575,"base_uri":"https://localhost:8080/"},"cell_id":"191acded7ca64ef2b7bfaffec6a46cfa","outputId":"ee875f03-e61a-4c28-e22a-7efba540a77f","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":6151,"user_tz":240,"timestamp":1653262454049},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":" words weights\n374800 great 1.382526\n256072 delicious 0.979694\n366512 good 0.864971\n677656 perfect 0.844286\n299760 excellent 0.838121\n855190 the best 0.826370\n534066 love 0.819486\n143456 best 0.813387\n593253 nice 0.780972\n536985 loves 0.660368\n1034499 wonderful 0.653160\n777859 smooth 0.642421\n309549 favorite 0.615238\n605055 not too 0.587947\n315915 find 0.585533","text/html":"\n
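<table border=\"1\" class=\"dataframe\">\n<thead>\n<tr><th></th><th>words</th><th>weights</th></tr>\n</thead>\n<tbody>\n<tr><th>374800</th><td>great</td><td>1.382526</td></tr>\n<tr><th>256072</th><td>delicious</td><td>0.979694</td></tr>\n<tr><th>366512</th><td>good</td><td>0.864971</td></tr>\n<tr><th>677656</th><td>perfect</td><td>0.844286</td></tr>\n<tr><th>299760</th><td>excellent</td><td>0.838121</td></tr>\n<tr><th>855190</th><td>the best</td><td>0.826370</td></tr>\n<tr><th>534066</th><td>love</td><td>0.819486</td></tr>\n<tr><th>143456</th><td>best</td><td>0.813387</td></tr>\n<tr><th>593253</th><td>nice</td><td>0.780972</td></tr>\n<tr><th>536985</th><td>loves</td><td>0.660368</td></tr>\n<tr><th>1034499</th><td>wonderful</td><td>0.653160</td></tr>\n<tr><th>777859</th><td>smooth</td><td>0.642421</td></tr>\n<tr><th>309549</th><td>favorite</td><td>0.615238</td></tr>\n<tr><th>605055</th><td>not too</td><td>0.587947</td></tr>\n<tr><th>315915</th><td>find</td><td>0.585533</td></tr>\n</tbody>\n</table>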
\n "},"metadata":{},"execution_count":49}],"execution_count":null},{"cell_type":"code","source":"lr_weights.sort_values(['weights'],ascending = False)[-15:]","metadata":{"id":"jFRk5csSHxyj","colab":{"height":520,"base_uri":"https://localhost:8080/"},"cell_id":"889421d6317c427e8c56323ea00f52cf","outputId":"c4d2ae4c-01ea-4e1c-fd77-ffec093df08e","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":1023,"user_tz":240,"timestamp":1653262460288},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":" words weights\n1037701 worst -0.526030\n1007881 were -0.529210\n982174 very disappointed -0.540921\n120594 away -0.546831\n730653 return -0.547383\n801742 stick -0.561657\n121444 awful -0.570031\n997363 waste -0.591713\n265649 disappointing -0.598279\n820835 t -0.654783\n124042 bad -0.663200\n1003928 weak -0.702415\n294693 even -0.782674\n599002 not -0.975730\n265330 disappointed -1.121489","text/html":"\n
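<table border=\"1\" class=\"dataframe\">\n<thead>\n<tr><th></th><th>words</th><th>weights</th></tr>\n</thead>\n<tbody>\n<tr><th>1037701</th><td>worst</td><td>-0.526030</td></tr>\n<tr><th>1007881</th><td>were</td><td>-0.529210</td></tr>\n<tr><th>982174</th><td>very disappointed</td><td>-0.540921</td></tr>\n<tr><th>120594</th><td>away</td><td>-0.546831</td></tr>\n<tr><th>730653</th><td>return</td><td>-0.547383</td></tr>\n<tr><th>801742</th><td>stick</td><td>-0.561657</td></tr>\n<tr><th>121444</th><td>awful</td><td>-0.570031</td></tr>\n<tr><th>997363</th><td>waste</td><td>-0.591713</td></tr>\n<tr><th>265649</th><td>disappointing</td><td>-0.598279</td></tr>\n<tr><th>820835</th><td>t</td><td>-0.654783</td></tr>\n<tr><th>124042</th><td>bad</td><td>-0.663200</td></tr>\n<tr><th>1003928</th><td>weak</td><td>-0.702415</td></tr>\n<tr><th>294693</th><td>even</td><td>-0.782674</td></tr>\n<tr><th>599002</th><td>not</td><td>-0.975730</td></tr>\n<tr><th>265330</th><td>disappointed</td><td>-1.121489</td></tr>\n</tbody>\n</table>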
\n "},"metadata":{},"execution_count":50}],"execution_count":null},{"cell_type":"markdown","source":"## Random forest","metadata":{"id":"u4WxBMCnHz-H","cell_id":"2fe00730b19c43798799e8d24ef4bc0c","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"rf_model_all = RandomForestClassifier(n_estimators=100)\n\n# Train\nrf_model_all.fit(X_train_bow, y_train_bow)\n\n# Predict (the variable names are reused from the logistic regression cells)\ntest_pred_lr_prob = rf_model_all.predict_proba(X_test_bow)\ntest_pred_lr_all = rf_model_all.predict(X_test_bow)","metadata":{"id":"8qAvpMnoH2zm","cell_id":"32075822458b42c1ac117742cdcad288","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"print(\"F1 score: \", f1_score(y_test_bow,test_pred_lr_all))\nprint(\"Accuracy: \", accuracy_score(y_test_bow,test_pred_lr_all)* 100)","metadata":{"id":"r2Pu1uwZH7Ew","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"2f2a3b27fcd24f5780b0131b7f39a0ee","outputId":"e267f797-86f6-4064-9df8-3c30e6906801","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":7,"user_tz":240,"timestamp":1653262557913},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"F1 score: 0.9281437125748503\nAccuracy: 86.87089715536105\n"}],"execution_count":null},{"cell_type":"markdown","source":"This is not as good as logistic regression. We can obtain the n-grams that were most important for the predictions as follows:","metadata":{"id":"jpMWiD1gH-d8","cell_id":"8ecfa8d499db40bdb8459315e75263ec","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# get_feature_names_out() replaces get_feature_names(), which was\n# deprecated in scikit-learn 1.0 and removed in 1.2\nfeature_importances = pd.DataFrame(\n    rf_model_all.feature_importances_,\n    index=bow_counts.get_feature_names_out(),\n    columns=['importance']\n)","metadata":{"id":"l-pQMcXKIA3O","cell_id":"8a715665945c4c1092c9a01f6a70fa98","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"feature_importances.sort_values(['importance'], ascending=False)[:10]","metadata":{"id":"_PP7hG-3IBXy","cell_id":"bd0e84490b8e44d5a7b1d6fca2e4c894","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"# TF-IDF model\n\nOf course, bag-of-words is not the only way to featurize text. Another method, which we briefly mentioned earlier, is TF-IDF. It evaluates how important a word is to a document within a large collection of documents (i.e. a corpus). The importance increases in proportion to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.\n\nTF-IDF is the product of two terms. The first computes the normalized term frequency (TF); that is, the number of times a word appears in a document divided by the total number of words in that document. The second term is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.","metadata":{"id":"2jl8v2WlIDHs","cell_id":"7b57dccafc024220aad5c5f969ea79b9","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Less formally, what does this mean?\n\n- If a word appears very frequently in a specific document, it is likely to be significant.\n- If a word appears very frequently in almost every document in the corpus, it is unlikely to be significant.\n- Therefore, a word that appears often in one document but rarely in the rest of the corpus deserves special attention.\n\nTF-IDF does not just count each word; it applies a weighting so that common words receive less attention and rare words receive more.\n\nLet's re-featurize our original set of reviews using TF-IDF and split the resulting features into training and test sets:","metadata":{"id":"iaDByDseIdYq","cell_id":"965957ed0b3d42f09157676bf19f2199","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Create a vectorizer: we still pass in our stop words, although\n# they matter less now, since TF-IDF would down-weight them\n# anyway.\ntfidf_counts = TfidfVectorizer(\n    tokenizer=word_tokenize,\n    stop_words=noise_words,\n    ngram_range=(1,1)\n)\n\nX_train_tfidf = tfidf_counts.fit_transform(reviews_train.reviews_text_new)\nX_test_tfidf = tfidf_counts.transform(reviews_test.reviews_text_new)","metadata":{"id":"Qy27CGjZInLg","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"07bd2108855946cd9e01b6318f669f61","outputId":"87f0fcfb-49cc-4316-c2ab-2e33c2026f30","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos 
Usta"},"status":"ok","elapsed":4591,"user_tz":240,"timestamp":1653262703595},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stderr","text":"/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:401: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens [\"'d\", \"'ll\", \"'re\", \"'s\", \"'ve\", 'might', 'must', \"n't\", 'need', 'sha', 'wo'] not in stop_words.\n % sorted(inconsistent)\n"}],"execution_count":null},{"cell_type":"markdown","source":"## Logistic regression","metadata":{"id":"2mcp31kGIuN4","cell_id":"9ec55bccb02b458eaabd29ebdca94c3f","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# Create the classifier\nlr_model_tf_idf = LogisticRegression(solver=\"liblinear\")\n\n# Train\nlr_model_tf_idf.fit(X_train_tfidf, y_train_bow)\n\n# Predict\ntest_pred_lr_prob = lr_model_tf_idf.predict_proba(X_test_tfidf)\ntest_pred_lr_all = lr_model_tf_idf.predict(X_test_tfidf)\n\n# Evaluate the model\nprint(\"F1 score: \",f1_score(y_test_bow, test_pred_lr_all))\nprint(\"Accuracy: \", accuracy_score(y_test_bow, test_pred_lr_all) * 100)","metadata":{"id":"h2fKRigVIxBs","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"2a4dca61b6ba49ccaedf9d9ba025cac8","outputId":"3372785a-1984-48f5-caff-266ec16bbdb3","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":535,"user_tz":240,"timestamp":1653262736670},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"F1 score: 0.9343419062027231\nAccuracy: 88.12910284463895\n"}],"execution_count":null},{"cell_type":"markdown","source":"Here we reach 88% accuracy with TF-IDF, compared with 90% for the 1-gram bag-of-words model. It is hard to know exactly why this more sophisticated vectorization algorithm leads to worse results, but it may be that penalizing words that are common across the whole corpus is a disadvantage for this particular dataset. TF-IDF is often useful when the test data differ substantially from the training data, since it allows words that are common only in the training set to be down-weighted.","metadata":{"id":"A4uMiNZmI4Db","cell_id":"70a48335b7b04e74b6cd6e740526a13c","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Try to improve the model's accuracy by:\n\n- setting `ngram_range=(1,4)` in the vectorizer\n- not removing the noise words beforehand in the vectorizer\n- setting `C=10` in the LogisticRegression classifier\n- setting `penalty=\"l1\"` in the LogisticRegression classifier","metadata":{"id":"SCsby16EJBGA","cell_id":"996f723d4b914c1492de8fca2db65bda","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"tfidf_counts = TfidfVectorizer(\n    tokenizer=word_tokenize,\n    ngram_range=(1,4)\n)\n\nX_train_tfidf = tfidf_counts.fit_transform(reviews_train.reviews_text_new)\nX_test_tfidf = tfidf_counts.transform(reviews_test.reviews_text_new)","metadata":{"id":"5UlnsuRcJKwQ","cell_id":"055ee761769d495ebf82d904d4278a09","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"# Define the model\nlr_model_tf_idf_new = LogisticRegression(solver=\"liblinear\", penalty='l1', C=10)\n\n# Train\nlr_model_tf_idf_new.fit(X_train_tfidf, y_train_bow)\n\n# Predict\ntest_pred_lr_prob = lr_model_tf_idf_new.predict_proba(X_test_tfidf)\ntest_pred_lr_all = lr_model_tf_idf_new.predict(X_test_tfidf)\n\n# Evaluate the model\nprint(\"F1 score: \",f1_score(y_test_bow, test_pred_lr_all))\nprint(\"Accuracy: \", accuracy_score(y_test_bow, 
test_pred_lr_all)*100)","metadata":{"id":"Ur44xFjOJNqx","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"fb1d712e52ec4809b7cc28a2fca7bcc1","outputId":"5dea35f6-3d59-4119-c32e-3355831d7291","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":5419,"user_tz":240,"timestamp":1653262866726},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"F1 score: 0.9470967741935484\nAccuracy: 91.02844638949672\n"}],"execution_count":null},{"cell_type":"markdown","source":"This is an improvement over our previous result, but we made four changes at the same time, so we do not know which of them helped or by how much.\n\nTrying different hyperparameter values to improve a model is called hyperparameter tuning, and it is a huge field in its own right. You can imagine how running this model 16 times, once for every possible combination of the settings we have tried, would be awkward to keep track of, and that is with only 4 hyperparameters and two values for each! With hundreds or thousands of hyperparameters and hundreds or thousands of values for each, the total number of combinations grows very quickly.\n\nTo help with this, scikit-learn provides so-called \"grid search\" functionality (`GridSearchCV`), where you can set up a pipeline and specify the hyperparameter ranges that you want to \"search\". Scikit-learn will try every combination, training and evaluating the model for each one, and tell you which worked best.\n\nWe can also find our most important features again, as shown below:","metadata":{"id":"VXgj1CPgJYg7","cell_id":"b482c250be2f4e9fa83733faf6071fe7","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"# get_feature_names_out() replaces get_feature_names(), which was\n# deprecated in scikit-learn 1.0 and removed in 1.2\nlr_weights = pd.DataFrame(\n    list(\n        zip(tfidf_counts.get_feature_names_out(), lr_model_tf_idf_new.coef_[0])\n    ),\n    columns=['words','weights']\n)\n\nlr_weights.sort_values(['weights'],ascending = False)[:10]","metadata":{"id":"lOnUb9NiJiEd","colab":{"height":418,"base_uri":"https://localhost:8080/"},"cell_id":"9db2346394f84314925733f083e8b7fd","outputId":"13471d7e-e549-4648-9870-603cddad4e60","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":3568,"user_tz":240,"timestamp":1653262930606},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":" words weights\n374800 great 51.682173\n677656 perfect 46.173501\n855190 the best 44.691956\n256072 delicious 39.746053\n63261 amazing 38.903587\n131810 be disappointed 38.802913\n299760 excellent 37.935929\n777859 smooth 36.007994\n685128 pleased 33.385577\n593253 nice 32.073194","text/html":"\n
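<table border=\"1\" class=\"dataframe\">\n<thead>\n<tr><th></th><th>words</th><th>weights</th></tr>\n</thead>\n<tbody>\n<tr><th>374800</th><td>great</td><td>51.682173</td></tr>\n<tr><th>677656</th><td>perfect</td><td>46.173501</td></tr>\n<tr><th>855190</th><td>the best</td><td>44.691956</td></tr>\n<tr><th>256072</th><td>delicious</td><td>39.746053</td></tr>\n<tr><th>63261</th><td>amazing</td><td>38.903587</td></tr>\n<tr><th>131810</th><td>be disappointed</td><td>38.802913</td></tr>\n<tr><th>299760</th><td>excellent</td><td>37.935929</td></tr>\n<tr><th>777859</th><td>smooth</td><td>36.007994</td></tr>\n<tr><th>685128</th><td>pleased</td><td>33.385577</td></tr>\n<tr><th>593253</th><td>nice</td><td>32.073194</td></tr>\n</tbody>\n</table>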
\n "},"metadata":{},"execution_count":57}],"execution_count":null},{"cell_type":"code","source":"lr_weights.sort_values(['weights'],ascending = False)[-10:]","metadata":{"id":"GvmARJjeJkG2","colab":{"height":363,"base_uri":"https://localhost:8080/"},"cell_id":"5d5c525c3c4f44f7940555cb0bfe6cc3","outputId":"21600646-cc73-46d8-dc66-f2eeb7cbdc26","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":349,"user_tz":240,"timestamp":1653262933269},"deepnote_cell_type":"code"},"outputs":[{"output_type":"execute_result","data":{"text/plain":" words weights\n730653 return -34.726053\n599002 not -35.237744\n603641 not recommend -35.535403\n843696 than this -36.479892\n1037701 worst -37.469519\n601259 not for -38.382928\n1053388 yuck -41.851058\n605704 not worth -44.629182\n265330 disappointed -45.957117\n265649 disappointing -49.420275","text/html":"\n
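<table border=\"1\" class=\"dataframe\">\n<thead>\n<tr><th></th><th>words</th><th>weights</th></tr>\n</thead>\n<tbody>\n<tr><th>730653</th><td>return</td><td>-34.726053</td></tr>\n<tr><th>599002</th><td>not</td><td>-35.237744</td></tr>\n<tr><th>603641</th><td>not recommend</td><td>-35.535403</td></tr>\n<tr><th>843696</th><td>than this</td><td>-36.479892</td></tr>\n<tr><th>1037701</th><td>worst</td><td>-37.469519</td></tr>\n<tr><th>601259</th><td>not for</td><td>-38.382928</td></tr>\n<tr><th>1053388</th><td>yuck</td><td>-41.851058</td></tr>\n<tr><th>605704</th><td>not worth</td><td>-44.629182</td></tr>\n<tr><th>265330</th><td>disappointed</td><td>-45.957117</td></tr>\n<tr><th>265649</th><td>disappointing</td><td>-49.420275</td></tr>\n</tbody>\n</table>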
\n "},"metadata":{},"execution_count":58}],"execution_count":null},{"cell_type":"markdown","source":"# Word embeddings model","metadata":{"id":"Pu7WHTrpJl_8","cell_id":"acf848498593443cb88500a118e6f93a","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"The final type of featurization we will cover is word embeddings. This is a type of word representation that allows words with similar meanings to have similar representations. Because they are pre-trained on external data, such as Wikipedia, word embeddings know when concepts are semantically related; for example, the vectors for \"king\" and \"queen\" would lie close to each other, even though the two words share no similarity in spelling.\n\nThis approach to representing words and documents can be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.\n\nThere are many pre-trained word embedding datasets that are freely available, or you can train your own. Some of the most important advances in this area include Word2Vec, which became a reference point for NLP, and other embedding approaches such as GloVe, ELMo and BERT.","metadata":{"id":"cvsz0rEbJrmb","cell_id":"9ad4244b64b745d5a7b2b8fdd6b9d2fe","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"There are different methods for learning word embeddings: Word2Vec, GloVe and fastText. **Word2Vec** uses a shallow neural network and comes in two variants: _CBOW_ and _Skip-gram_. **GloVe** is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations show interesting linear substructures of the word vector space. **[fastText](https://fasttext.cc/)** is a library for learning word embeddings and text classification created by Facebook's AI Research lab.","metadata":{"id":"h8N02FZJJ-VF","cell_id":"bab87570bf5d4bb38a717f2161183421","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"**Why use word embeddings instead of bag-of-words or TF-IDF?**\n\nEach word is represented by a real-valued vector, usually with tens or hundreds of dimensions. This contrasts with the thousands or millions of dimensions required for sparse word representations. Word embeddings can therefore drastically reduce the number of dimensions required to represent a text document:","metadata":{"id":"2gzgZC5UKMyS","cell_id":"41d5ccae70744d7da57f46ea4339946b","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"import os\n\nimport gensim\n\n# Load a pre-trained GloVe word embedding that was trained on a Twitter dataset\n# This embedding has 200 dimensions, meaning that each word is represented\n# by a 200-dimensional vector.\nmodel = gensim.models.KeyedVectors.load_word2vec_format(\n    os.path.join(os.getcwd(), 'glove.twitter.27B.200d_out.txt'),\n    binary=False,\n    unicode_errors='ignore'\n)","metadata":{"id":"fWaGepmSKVwl","cell_id":"76d8db05cd344d9f82b8c8e4cc6b4286","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"We had roughly 18,000 distinct tokens as 1-gram features in the bag-of-words representation, but each review will have only 200 dimensions with this word embedding. 
This is a big difference!\n\nIn addition, word embeddings capture some of the context and semantics of how words are used, since each word vector is learned from the contexts in which the word appears.\n\nBelow is the vector representation of \"food\" and \"great\":","metadata":{"id":"OlmzuwOFKjRM","cell_id":"3e024d61feaf400d903c80c468c95ec9","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"print(\"El embedding para food es\", len(model['food']), \"dimensional\")\n\nmodel['food']","metadata":{"id":"OCU3QFnmKfXr","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"f85c346491bd4be2a5df763ace217e94","outputId":"602a23a0-bcde-44a3-e0b5-d7646b6da310","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":9,"user_tz":240,"timestamp":1653263230563},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"El embedding para food es 200 dimensional\n"},{"output_type":"execute_result","data":{"text/plain":"array([-6.9175e-01, -1.4259e-01, 3.8653e-01, -2.3141e-01, -2.0408e-01,\n -2.1565e-01, 7.7839e-01, 2.2689e-03, -7.2446e-02, -6.0134e-01,\n -4.2400e-01, -5.7140e-01, -8.4249e-01, 1.5947e-01, -1.2899e-01,\n 5.9032e-01, -1.3632e-01, -6.6478e-01, -1.9557e-01, -8.2453e-01,\n -1.3177e-01, 1.3514e-01, -7.3214e-01, 4.8200e-01, 4.3505e-01,\n 1.6676e+00, -1.8275e-01, -1.0007e-01, 3.7003e-01, 1.0411e-01,\n -8.8115e-01, -9.7733e-04, -2.9459e-01, -7.3869e-02, -4.0103e-01,\n -4.6626e-01, 2.3253e-01, 2.7776e-01, 4.0754e-01, -4.5051e-02,\n -1.9468e-01, -2.9230e-01, -3.4642e-01, -4.9286e-01, 1.0467e-01,\n 7.2143e-01, 5.9596e-01, 5.3495e-01, 3.8788e-02, -1.4406e-01,\n -5.2248e-02, -6.8292e-01, -1.0080e-01, -1.2961e-01, -2.6006e-02,\n 1.4836e-01, 3.2417e-02, 1.3997e-01, 8.3943e-03, -2.3139e-01,\n -1.8000e-01, -3.1689e-01, 2.3606e-01, 1.8237e-01, 4.3933e-01,\n -3.2313e-01, -2.1512e-03, -4.4172e-01, 4.1011e-01, 1.7174e-01,\n -8.6405e-01, 
-3.9674e-01, 4.4175e-01, 5.9300e-01, 1.8982e-01,\n -2.9646e-02, -3.4041e-01, -3.3708e-02, 7.3449e-01, 4.5300e-01,\n -2.7855e-02, -1.8993e-02, 3.8107e-01, -5.6606e-02, 1.4864e-02,\n 3.1518e-01, -3.2304e-01, -2.7439e-01, 6.1900e-02, 3.2886e-01,\n 1.5138e-01, 5.3268e-01, -1.6616e-01, -2.3076e-01, -9.6515e-02,\n 4.5991e-01, -5.1475e-01, 1.0297e-01, -4.0225e-02, 5.6679e-01,\n 3.1027e-01, 1.5679e-01, -2.5897e-01, 4.6312e-01, 2.2561e-01,\n -3.9300e-01, -3.9593e-01, 4.4001e-01, 3.7176e-01, 1.4747e-02,\n -1.9193e-01, -2.2478e-01, -1.2665e-01, -3.4982e-01, 5.0847e-01,\n 3.1720e-01, 1.2942e-01, -6.2695e-01, 5.8675e-01, 4.1040e-02,\n 1.8835e-01, -2.2626e-01, -1.1744e-01, 5.1429e-03, 7.2058e-02,\n -4.9525e-01, 4.4159e-01, 8.6225e-01, 7.6765e-02, -9.7908e-02,\n 6.8383e-02, 3.0596e-01, 3.7980e-01, 1.1563e-01, -6.1020e-01,\n -6.8107e-01, 3.2723e-02, 2.5346e-01, 3.5334e-01, 2.5407e-01,\n -4.6516e-01, 4.8858e-01, 3.9032e-01, -8.1296e-01, -6.9780e-01,\n -1.2542e-01, 7.9234e-02, 1.2918e-01, -1.1048e-01, 8.9312e-03,\n 3.6999e-01, 3.0116e-01, -4.6578e+00, -4.4493e-03, 2.0313e-02,\n -5.0215e-02, -2.0646e-01, -3.7321e-02, -5.1779e-02, 6.6986e-02,\n -5.8853e-01, 7.1753e-01, 4.2784e-02, 1.6667e-03, -2.6193e-01,\n 5.8214e-01, -1.0513e+00, -3.0341e-02, 7.3892e-01, -1.8003e-01,\n -1.1104e-01, 3.0846e-01, 4.4027e-01, -8.4080e-02, -2.6251e-01,\n -3.8733e-01, -2.6630e-01, 1.9655e-01, 5.3812e-02, -2.4456e-01,\n -7.8868e-01, -7.1843e-01, 7.0593e-02, -1.9051e-01, 2.5553e-01,\n -1.3786e-01, 1.2942e-01, 4.5864e-01, 5.5462e-01, 8.2104e-01,\n -2.5049e-01, -3.3623e-01, 1.8491e-01, -4.8235e-01, 3.1425e-01,\n 2.4499e-01, -2.4404e-01, 8.0309e-02, 3.4060e-01, 7.0451e-01],\n dtype=float32)"},"metadata":{},"execution_count":61}],"execution_count":null},{"cell_type":"code","source":"print(\"El embedding para great es\", len(model['great']), 
\"dimensional\")\n\nmodel['great']","metadata":{"id":"1g4FvfwRKtp5","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"5dd2a45527e54a7e96e917adad731eeb","outputId":"5a9927cf-325c-46d1-b6cc-824796f5e00a","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":410,"user_tz":240,"timestamp":1653263239565},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"El embedding para great es 200 dimensional\n"},{"output_type":"execute_result","data":{"text/plain":"array([ 1.0751e-01, 1.5958e-01, 1.3332e-01, 1.6642e-01, -3.2737e-02,\n 1.7592e-01, 7.2395e-01, 1.1713e-01, -3.5036e-01, -4.2937e-01,\n -4.0925e-01, -2.5761e-01, -1.0264e+00, -1.0014e-01, 5.5390e-02,\n 2.0413e-01, 1.2807e-01, -2.6337e-02, -6.9719e-02, -3.6193e-02,\n -1.9917e-01, 3.9437e-02, -9.2358e-02, 2.6981e-01, -2.0951e-01,\n 1.5455e+00, -2.8123e-01, 3.2046e-01, 4.5545e-01, -3.8841e-02,\n -1.7369e-01, -2.3251e-01, -5.9551e-02, 2.3250e-01, 4.4214e-01,\n 3.3666e-01, 3.9352e-02, -1.2462e-01, -2.9317e-01, -4.8857e-02,\n 6.9021e-01, 7.1279e-02, 1.0252e-01, 1.6122e-01, -2.3536e-01,\n 6.2724e-02, 2.0222e-01, 5.0234e-02, -1.1611e-01, 2.8909e-02,\n -1.1109e-01, -5.0241e-02, -5.9063e-01, -8.8747e-02, 5.1444e-01,\n -1.3715e-01, 1.7194e-01, -8.3657e-02, 9.6333e-02, -9.7063e-02,\n 3.4003e-03, -7.0180e-02, -5.9588e-01, -2.8264e-01, 1.2529e-01,\n 2.4359e-01, -4.9082e-01, -4.2533e-02, 2.2158e-01, -2.1491e-01,\n -4.2101e-02, 2.3359e-01, 3.1978e-01, 3.5063e-01, 6.1748e-01,\n -1.0197e-01, 5.3357e-01, -3.6005e-01, -1.7212e-02, 1.6645e-01,\n 8.9432e-01, 2.7322e-02, 3.0683e-01, 1.9715e-02, 6.0516e-01,\n 4.1085e-01, 5.5945e-01, -8.4501e-02, 3.5933e-01, 1.0216e-01,\n 2.6675e-01, -6.0445e-01, -1.0513e-01, -1.9248e-01, 2.9150e-01,\n -1.0537e-01, 5.2671e-01, 2.3763e-01, -1.3640e-01, -6.1029e-02,\n 1.0081e-01, 7.4541e-02, -1.4899e-01, -2.2301e-01, -1.3653e-02,\n 4.0192e-02, 5.5821e-03, -2.9936e-02, 2.7338e-02, 5.9412e-01,\n 
-1.0302e-01, 9.0319e-02, 3.1055e-01, 6.3336e-01, 2.9762e-01,\n -8.4671e-02, -1.2552e-01, -6.3930e-01, 3.8613e-01, 6.6371e-01,\n 5.1345e-01, 2.0719e-01, 2.1100e-01, 1.4579e-01, -7.3321e-02,\n -7.0593e-01, -6.2578e-02, -2.5470e-01, 1.1986e-01, 1.6102e-01,\n 3.2958e-02, -2.4159e-01, -2.5708e-01, 3.2051e-01, -1.1569e-01,\n 6.7540e-03, -1.1688e-01, -3.6158e-02, -6.5320e-01, 4.9560e-01,\n -3.9429e-02, -1.8395e-01, 2.3295e-01, 5.4128e-01, 2.4568e-02,\n -1.9862e-01, 2.1041e-01, 9.3798e-02, 8.3096e-03, -6.1551e-02,\n 2.3262e-01, -4.2756e-02, -5.3511e+00, 3.0604e-01, 3.3578e-01,\n -3.6771e-01, 5.6225e-01, -8.2341e-02, 2.9809e-01, 2.5189e-01,\n -4.6203e-01, 1.0452e-01, -3.9540e-01, 3.6961e-01, 1.3093e-01,\n 1.6653e-01, -3.1915e-01, 1.6974e-01, 4.2575e-01, 3.6420e-01,\n 3.7175e-01, -1.9450e-01, 6.2702e-02, 4.9775e-01, 3.1842e-02,\n -6.4072e-02, 7.6183e-02, -5.9534e-01, 3.1731e-01, -2.8254e-01,\n 1.5987e-01, -9.2750e-02, -4.1426e-02, 7.5799e-02, 9.5740e-03,\n -2.1532e-01, -3.1419e-01, -1.5144e-01, -4.6584e-01, -1.1069e-01,\n -4.0130e-01, 3.9266e-02, 8.1880e-01, -4.2955e-02, 2.1698e-01,\n -6.0347e-02, 3.3431e-01, -9.9549e-02, -1.8156e-01, -8.5143e-02],\n dtype=float32)"},"metadata":{},"execution_count":62}],"execution_count":null},{"cell_type":"markdown","source":"Como discutimos, el poder de las incrustaciones de palabras es que las palabras que tienen un significado similar están más juntas en el espacio vectorial. 
Podemos demostrar esto mirando la distancia del coseno entre algunos pares de palabras de la siguiente manera:","metadata":{"id":"z6zkIX7UKzlV","cell_id":"3e7784e97dc840c99e49e217129c90ce","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"def print_similarity(word1, word2, model):\n v1 = model[word1]\n v2 = model[word2]\n similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))\n print(f\"{word1} y {word2} son {round(similarity * 100)}% similar\")\n\nprint_similarity(\"cat\", \"dog\", model)\nprint_similarity(\"good\", \"bad\", model)\nprint_similarity(\"great\", \"good\", model)\nprint_similarity(\"grass\", \"model\", model)","metadata":{"id":"1LG9xirxK2ET","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"85f03dbcf2d94897bbf486971eca1957","outputId":"51d1839e-be29-4fd0-a8a8-4645dd785a53","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos Usta"},"status":"ok","elapsed":415,"user_tz":240,"timestamp":1653263273770},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"cat y dog son 83% similar\ngood y bad son 80% similar\ngreat y good son 87% similar\ngrass y model son 11% similar\n"}],"execution_count":null},{"cell_type":"markdown","source":"Aquí, \"significado similar\" se define vagamente como \"usado en contextos similares\". Debido a que hay muchos ejemplos como \"Acaricié a mi gato\" y \"Acaricié a mi perro\" donde estas palabras se usan en contextos similares, se consideran muy similares. También hay muchas oraciones en las que \"bueno\" y \"malo\" pueden intercambiarse y la oración puede seguir siendo válida, por lo que, aunque las consideremos \"opuestas\", nuestro modelo las considerará similares. 
\"grass\" and \"model\" have almost nothing to do with each other, so they are far apart.\n\nTo find the vector for an entire review, we get the vector for each word in the review separately and take a simple average.","metadata":{"id":"NyPpFZbQK8Ra","cell_id":"6614aa2d6acd4875bbc5228ac6da2d57","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Compute the vector for each individual review in the dataset.","metadata":{"id":"Axse7IKFLC0v","cell_id":"818f13c4644b44698a7e7107526ef537","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"review_embeddings = []\n\nfor each_review in amazon_reviews.reviews_text_new:\n    review_average = np.zeros(model.vector_size)\n    count_val = 0\n\n    for each_word in word_tokenize(each_review):\n\n        # Change to \"if True\" to exclude stopwords from the embedding average\n        if False:\n            if(each_word.lower() in noise_words):\n                print(each_word.lower())\n                continue\n\n        # Only words present in the embedding vocabulary contribute to the average\n        if(each_word.lower() in model):\n            review_average += model[each_word.lower()]\n            count_val += 1\n\n    # Reviews with no in-vocabulary words produce NaN here; these are filled with 0 later\n    review_embeddings.append(list(review_average/count_val))","metadata":{"id":"RQbosshgLHXN","cell_id":"37617c382ef24e1da7822691a75a3cf8","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"Loading your own word vectors is great for understanding how they work, but there are also higher-level libraries that abstract away code like that shown above. In industry, a widely used NLP library is [spaCy](https://spacy.io/). 
This library lets you efficiently extract word embeddings from text and perform high-level operations on them.\n\nLet's convert the list of vector representations for each review into a DataFrame and split it into training and test sets:","metadata":{"id":"xeD-CD9zLRqG","cell_id":"c165449f4f40430cb2ea3cb33f2b8889","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"embedding_data = pd.DataFrame(review_embeddings)\nembedding_data = embedding_data.fillna(0)","metadata":{"id":"YFztCDMZLWbz","cell_id":"9c699427347b498fac92aed7636e9361","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"code","source":"X_train_embed, X_test_embed, y_train_embed, y_test_embed = train_test_split(\n    embedding_data,\n    amazon_reviews.Sentiment_rating,\n    test_size=0.2,\n    random_state=0\n)","metadata":{"id":"PKDP6ZYzLX_3","cell_id":"bb0836a6283e4373bac8b5f3cc29af46","deepnote_cell_type":"code"},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"Now let's apply the logistic regression model to our word embeddings:","metadata":{"id":"SPWhQqYKLZX2","cell_id":"f1eb8bbb243340159a9ea81fae0f810f","deepnote_cell_type":"markdown"}},{"cell_type":"code","source":"lr_model = LogisticRegression(penalty=\"l1\", C=10, solver=\"liblinear\")\nlr_model.fit(X_train_embed, y_train_embed)\ntest_pred_lr_prob = lr_model.predict_proba(X_test_embed)\ntest_pred_lr_all = lr_model.predict(X_test_embed)\n\nprint(\"F1 score: \", f1_score(y_test_embed, test_pred_lr_all))\nprint(\"Accuracy: \", accuracy_score(y_test_embed, test_pred_lr_all)*100)","metadata":{"id":"ecE_YyKuLc2-","colab":{"base_uri":"https://localhost:8080/"},"cell_id":"0b862de8975e4c99a7510c3c98ba9a96","outputId":"edd8451e-0640-4247-d307-692fe0410513","executionInfo":{"user":{"userId":"09471607480253994520","displayName":"David Francisco Bustos 
Usta"},"status":"ok","elapsed":14370,"user_tz":240,"timestamp":1653263441295},"deepnote_cell_type":"code"},"outputs":[{"output_type":"stream","name":"stdout","text":"F1 score: 0.9381703470031545\nAccuracy: 89.27789934354486\n"}],"execution_count":null},{"cell_type":"markdown","source":"Unfortunately, this is not as good as the bag-of-words or TF-IDF representations. Moreover, although word embeddings were very effective at reducing the total number of dimensions, they suffer from a lack of interpretability. This makes it very hard for us to even diagnose what is causing the lower performance.\n\nHowever, remember how \"good\" and \"bad\" were close together in the vector space? This is one reason word embeddings may not work as well for sentiment analysis on smaller datasets: word embeddings are good at using \"knowledge\" of the outside world (latent in the pretrained embeddings) to infer additional information about a smaller dataset, but for sentiment analysis this can do more harm than good by grouping together \"similar\" words that are actually far apart for a sentiment analysis task.\n\n**In our case, building features with TF-IDF gave us an accuracy of 92% with highly interpretable features. This is a good combination, so we consider this the best model for us here.**\n\nNote that for a real experiment, we would have split our dataset into three parts, not just two. 
When an experiment is run many times with different parameters, some results will almost certainly be better purely by chance, and it is bad science to select the best-performing model after dozens or hundreds of runs.\n\nTo avoid this problem, the data should be split into \"training\", \"validation\", and \"test\" sets. The \"test\" set should be set aside at the start of the experiment and not looked at until the very end. The model should be tuned using the \"validation\" set.\n\nOnly once the experimenter is satisfied with the model's performance on the validation set should the model be run on the test set, and those results are taken as the final results of the experiment.","metadata":{"id":"aLgmytl8LjYx","cell_id":"54762887d23c436d901c6c1a517709bb","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Conclusions","metadata":{"id":"mACgQsCSLzu9","cell_id":"23d9dc3b75a04e39a8047c400b20845b","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"In this case, we cleaned and featurized a dataset of Amazon reviews and built several classification models on these features to predict sentiment. We saw that bag-of-words and TF-IDF yielded interpretable features, whereas word embeddings did not. By expanding the set of n-grams we used from 1-grams to 4-grams, we were able to raise the accuracy of our logistic regression model to 92%.","metadata":{"id":"hPvVo2MuL04X","cell_id":"e14dfd80116844828ad355d448250ce0","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"# Takeaways","metadata":{"id":"8IUVh266MA_L","cell_id":"524495d87068404da0df83776a14d1cd","deepnote_cell_type":"markdown"}},{"cell_type":"markdown","source":"Building machine learning models on text is a very complicated discipline. 
Some important things to keep in mind are the following:\n\n1. Although many kinds of preprocessing can be applied to text data, not all of them have to be applied in every case. For example, in text messages, special characters can carry important information and need not be removed. Likewise, capital letters can signal that someone is angry and shouting, so letter case can carry valuable information. In other situations, it is more valuable to normalize it.\n\n2. Hyperparameter tuning is a very important step in machine learning models, and although the default hyperparameters work well in many cases, additional performance can often be gained by tuning them. Different sets of parameters should be tried to see which yields the best model.\n\n3. Every classification task in NLP is different, but the process to follow is similar to what we did in this case: **wrangle the data -> create features from the text -> train models -> evaluate models**.","metadata":{"id":"iN7hN-ZNMCUX","cell_id":"bbe64c90cf2f4da397bc96668991a84d","deepnote_cell_type":"markdown"}}],"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Clase 15 - Introduccion a NLP Clase 2.ipynb","provenance":[],"authorship_tag":"ABX9TyMJOIDex6+Fa4Bm/eZnYXYV","collapsed_sections":["eB-FMmvvOh90","RrNKtx8EPrr1","Lq1Cev75Qjy5","27Owz6clRAe2","WPyUzf66RL-V","qzhKZ2n_TEv4","asrthaXMT1j0"]},"deepnote":{},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"deepnote_notebook_id":"c75de597e02c486db219573ea95d9faa","deepnote_execution_queue":[]}}