{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"# Natural Language Processing\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Data: your choice\n",
"\n",
"Expectations:\n",
"- Preprocess the text\n",
"- Use Word2Vec (tip: experiment with the parameters)\n",
"- Show the most similar words (`most_similar`) for three words that catch your attention\n",
"- Answer:\n",
"  - Does your model give good results? Why or why not?\n",
"  - What problems did you run into while completing this workshop?\n",
"\n",
"\n",
"Bonus:\n",
"- Use a function we have not covered in class (see the [gensim Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html#module-gensim.models.word2vec))\n",
"- Visualize the model using PCA"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
|   | text | pp |
|---|---|---|
| 0 | El suceso ha tenido lugar en Brasil. Un adole... | [suceso, lugar, brasil, adolescente, años, mur... |
| 1 | Estamos en la semana decisiva. Los expertos a... | [semana, decisiva, expertos, aseguran, campaña... |
| 2 | Estudios científicos hay muchos. Unos nos int... | [estudios, científicos, interesan, concreto, h... |
| 3 | Ha sucedido en la ciudad de San José de Río P... | [sucedido, ciudad, san, josé, río, preto, bras... |
| 4 | La fiesta en Sevilla por el vuelco electoral ... | [fiesta, sevilla, vuelco, electoral, alargó, c... |

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| suceso | 0.006212 | 0.031511 | 0.045481 | 0.101416 | -0.031679 | 0.150899 | -0.059783 | -0.014200 | 0.107744 | 0.176405 | ... | 0.050822 | -0.017535 | 0.042748 | -0.073751 | -0.088236 | -0.076950 | -0.062377 | -0.055633 | -0.060737 | -0.058302 |
| lugar | 0.005950 | 0.040947 | 0.059403 | 0.136897 | -0.040842 | 0.201386 | -0.075965 | -0.015665 | 0.138248 | 0.228637 | ... | 0.065123 | -0.018993 | 0.060566 | -0.094583 | -0.112451 | -0.098903 | -0.081829 | -0.071580 | -0.077870 | -0.076109 |
| brasil | 0.005165 | 0.026123 | 0.037262 | 0.089919 | -0.026611 | 0.130936 | -0.052012 | -0.013126 | 0.092490 | 0.145507 | ... | 0.043393 | -0.012164 | 0.036452 | -0.062901 | -0.074967 | -0.062631 | -0.050171 | -0.049755 | -0.051648 | -0.048988 |
| años | 0.008973 | 0.040707 | 0.061044 | 0.140177 | -0.041050 | 0.201412 | -0.080918 | -0.014251 | 0.143369 | 0.234809 | ... | 0.066531 | -0.022884 | 0.063267 | -0.096038 | -0.120391 | -0.097196 | -0.080598 | -0.076261 | -0.079407 | -0.077067 |
| después | 0.004588 | 0.037437 | 0.059342 | 0.144849 | -0.044886 | 0.210980 | -0.083985 | -0.015445 | 0.150787 | 0.239436 | ... | 0.070495 | -0.024690 | 0.062590 | -0.098726 | -0.118127 | -0.101004 | -0.084490 | -0.078070 | -0.085354 | -0.076749 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| mercadona | 0.001592 | 0.029226 | 0.041898 | 0.102322 | -0.031841 | 0.150310 | -0.054796 | -0.013188 | 0.104230 | 0.167622 | ... | 0.050685 | -0.017119 | 0.044884 | -0.071129 | -0.086456 | -0.073510 | -0.060821 | -0.052663 | -0.058199 | -0.055017 |
| marruecos | 0.005181 | 0.024268 | 0.038444 | 0.091734 | -0.027209 | 0.133286 | -0.050361 | -0.009693 | 0.092897 | 0.150116 | ... | 0.046088 | -0.014870 | 0.037095 | -0.065604 | -0.077990 | -0.062612 | -0.051981 | -0.048857 | -0.052151 | -0.049877 |
| italia | 0.003858 | 0.030171 | 0.042547 | 0.097440 | -0.031347 | 0.146198 | -0.056130 | -0.010863 | 0.105478 | 0.166581 | ... | 0.048054 | -0.015512 | 0.045370 | -0.071881 | -0.082600 | -0.072842 | -0.061193 | -0.052945 | -0.057551 | -0.053230 |
| usted | 0.003297 | 0.032344 | 0.048360 | 0.108048 | -0.034448 | 0.159390 | -0.064360 | -0.015801 | 0.112862 | 0.186113 | ... | 0.054664 | -0.019894 | 0.049417 | -0.078136 | -0.090361 | -0.079543 | -0.064764 | -0.060393 | -0.062716 | -0.057396 |
| jordi | 0.006217 | 0.030590 | 0.045220 | 0.112371 | -0.036519 | 0.166078 | -0.065310 | -0.016198 | 0.118068 | 0.187892 | ... | 0.055352 | -0.017221 | 0.049916 | -0.078550 | -0.095642 | -0.080198 | -0.067379 | -0.059102 | -0.067260 | -0.058912 |

936 rows × 200 columns

|   | X | Y | Palabra |
|---|---|---|---|
| 0 | 0.137783 | 0.002006 | suceso |
| 1 | -0.185469 | 0.002270 | lugar |
| 2 | 0.297975 | -0.001699 | brasil |
| 3 | -0.208962 | -0.002381 | años |
| 4 | -0.250687 | -0.002270 | después |
| ... | ... | ... | ... |
| 931 | 0.167823 | -0.001028 | mercadona |
| 932 | 0.273000 | 0.000120 | marruecos |
| 933 | 0.175010 | 0.001343 | italia |
| 934 | 0.075135 | 0.001296 | usted |
| 935 | 0.055082 | 0.000091 | jordi |

936 rows × 3 columns
\n", "