{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Natural Language Processing\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Data: free choice\n", "\n", "Expectations:\n", "- Preprocess the text\n", "- Use Word2Vec (tip: experiment with the parameters)\n", "- Show the most similar words (`most_similar`) for three words that catch your attention\n", "- Answer:\n", " - Does your model give good results? Why or why not?\n", " - What problems did you run into while completing this workshop?\n", "\n", "\n", "Bonus:\n", "- Use a function we have not covered in class ([here](https://radimrehurek.com/gensim/models/word2vec.html#module-gensim.models.word2vec))\n", "- Visualize the model using PCA" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | text | pp |
|---|---|---|
| 0 | El suceso ha tenido lugar en Brasil. Un adole... | [suceso, lugar, brasil, adolescente, años, mur... |
| 1 | Estamos en la semana decisiva. Los expertos a... | [semana, decisiva, expertos, aseguran, campaña... |
| 2 | Estudios científicos hay muchos. Unos nos int... | [estudios, científicos, interesan, concreto, h... |
| 3 | Ha sucedido en la ciudad de San José de Río P... | [sucedido, ciudad, san, josé, río, preto, bras... |
| 4 | La fiesta en Sevilla por el vuelco electoral ... | [fiesta, sevilla, vuelco, electoral, alargó, c... |
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| suceso | 0.006212 | 0.031511 | 0.045481 | 0.101416 | -0.031679 | 0.150899 | -0.059783 | -0.014200 | 0.107744 | 0.176405 | ... | 0.050822 | -0.017535 | 0.042748 | -0.073751 | -0.088236 | -0.076950 | -0.062377 | -0.055633 | -0.060737 | -0.058302 |
| lugar | 0.005950 | 0.040947 | 0.059403 | 0.136897 | -0.040842 | 0.201386 | -0.075965 | -0.015665 | 0.138248 | 0.228637 | ... | 0.065123 | -0.018993 | 0.060566 | -0.094583 | -0.112451 | -0.098903 | -0.081829 | -0.071580 | -0.077870 | -0.076109 |
| brasil | 0.005165 | 0.026123 | 0.037262 | 0.089919 | -0.026611 | 0.130936 | -0.052012 | -0.013126 | 0.092490 | 0.145507 | ... | 0.043393 | -0.012164 | 0.036452 | -0.062901 | -0.074967 | -0.062631 | -0.050171 | -0.049755 | -0.051648 | -0.048988 |
| años | 0.008973 | 0.040707 | 0.061044 | 0.140177 | -0.041050 | 0.201412 | -0.080918 | -0.014251 | 0.143369 | 0.234809 | ... | 0.066531 | -0.022884 | 0.063267 | -0.096038 | -0.120391 | -0.097196 | -0.080598 | -0.076261 | -0.079407 | -0.077067 |
| después | 0.004588 | 0.037437 | 0.059342 | 0.144849 | -0.044886 | 0.210980 | -0.083985 | -0.015445 | 0.150787 | 0.239436 | ... | 0.070495 | -0.024690 | 0.062590 | -0.098726 | -0.118127 | -0.101004 | -0.084490 | -0.078070 | -0.085354 | -0.076749 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| mercadona | 0.001592 | 0.029226 | 0.041898 | 0.102322 | -0.031841 | 0.150310 | -0.054796 | -0.013188 | 0.104230 | 0.167622 | ... | 0.050685 | -0.017119 | 0.044884 | -0.071129 | -0.086456 | -0.073510 | -0.060821 | -0.052663 | -0.058199 | -0.055017 |
| marruecos | 0.005181 | 0.024268 | 0.038444 | 0.091734 | -0.027209 | 0.133286 | -0.050361 | -0.009693 | 0.092897 | 0.150116 | ... | 0.046088 | -0.014870 | 0.037095 | -0.065604 | -0.077990 | -0.062612 | -0.051981 | -0.048857 | -0.052151 | -0.049877 |
| italia | 0.003858 | 0.030171 | 0.042547 | 0.097440 | -0.031347 | 0.146198 | -0.056130 | -0.010863 | 0.105478 | 0.166581 | ... | 0.048054 | -0.015512 | 0.045370 | -0.071881 | -0.082600 | -0.072842 | -0.061193 | -0.052945 | -0.057551 | -0.053230 |
| usted | 0.003297 | 0.032344 | 0.048360 | 0.108048 | -0.034448 | 0.159390 | -0.064360 | -0.015801 | 0.112862 | 0.186113 | ... | 0.054664 | -0.019894 | 0.049417 | -0.078136 | -0.090361 | -0.079543 | -0.064764 | -0.060393 | -0.062716 | -0.057396 |
| jordi | 0.006217 | 0.030590 | 0.045220 | 0.112371 | -0.036519 | 0.166078 | -0.065310 | -0.016198 | 0.118068 | 0.187892 | ... | 0.055352 | -0.017221 | 0.049916 | -0.078550 | -0.095642 | -0.080198 | -0.067379 | -0.059102 | -0.067260 | -0.058912 |
936 rows × 200 columns
| | X | Y | Palabra |
|---|---|---|---|
| 0 | 0.137783 | 0.002006 | suceso |
| 1 | -0.185469 | 0.002270 | lugar |
| 2 | 0.297975 | -0.001699 | brasil |
| 3 | -0.208962 | -0.002381 | años |
| 4 | -0.250687 | -0.002270 | después |
| ... | ... | ... | ... |
| 931 | 0.167823 | -0.001028 | mercadona |
| 932 | 0.273000 | 0.000120 | marruecos |
| 933 | 0.175010 | 0.001343 | italia |
| 934 | 0.075135 | 0.001296 | usted |
| 935 | 0.055082 | 0.000091 | jordi |
936 rows × 3 columns
\n", "