{ "cells": [ { "cell_type": "markdown", "id": "5241d11d-38e4-471b-8023-aa6ccbe53104", "metadata": {}, "source": [ "# Deep Learning" ] }, { "cell_type": "markdown", "id": "7f790db7-dbbc-4024-a0ec-7c1a8cff35a5", "metadata": { "tags": [] }, "source": [ "# Diploma Program in Artificial Intelligence and Deep Learning" ] }, { "cell_type": "markdown", "id": "b2ad70c4-c7a1-455b-8ff3-9d98fd84661b", "metadata": { "tags": [] }, "source": [ "# Introduction to spaCy for NLP" ] }, { "cell_type": "markdown", "id": "9203b92c-78e7-48ba-b64c-1a4acd78dc47", "metadata": {}, "source": [ "
Source: Architecture of the spaCy library
" ] }, { "cell_type": "markdown", "id": "16ce3fb1-bfe5-4a1c-9e0e-1e86bdacfca4", "metadata": {}, "source": [ "## Profesores" ] }, { "cell_type": "markdown", "id": "f8907b73-4b4c-412c-9f15-e72b72b240ab", "metadata": {}, "source": [ "1. Alvaro Montenegro, PhD, ammontenegrod@unal.edu.co\n", "1. Camilo José Torres Jiménez, Msc, cjtorresj@unal.edu.co\n", "1. Daniel Montenegro, Msc, dextronomo@gmail.com " ] }, { "cell_type": "markdown", "id": "f6bed094-dbe3-4c54-8983-7324a2ad1c5c", "metadata": {}, "source": [ "## Asesora Medios y Marketing digital" ] }, { "cell_type": "markdown", "id": "ce257431-c9e3-4679-8258-c9195aaf3cdb", "metadata": {}, "source": [ "4. Maria del Pilar Montenegro, pmontenegro88@gmail.com\n", "5. Jessica López Mejía, jelopezme@unal.edu.co\n", "6. Venus Puertas, vpuertasg@unal.edu.co" ] }, { "cell_type": "markdown", "id": "9cd9e0df-9088-4e70-b0cc-960b4b44c3d8", "metadata": {}, "source": [ "## Jefe Jurídica" ] }, { "cell_type": "markdown", "id": "c944fb3f-707c-44bd-8f6e-dbbc0cfb3c92", "metadata": {}, "source": [ "7. Paula Andrea Guzmán, guzmancruz.paula@gmail.com" ] }, { "cell_type": "markdown", "id": "a50b138d-1ca6-40d4-8ae4-317318876de3", "metadata": {}, "source": [ "## Coordinador Jurídico" ] }, { "cell_type": "markdown", "id": "287175c1-fcfb-491b-b963-2744010720a3", "metadata": {}, "source": [ "8. David Fuentes, fuentesd065@gmail.com" ] }, { "cell_type": "markdown", "id": "ccb24a86-284b-4292-ad38-f76a59c89c62", "metadata": {}, "source": [ "## Desarrolladores Principales" ] }, { "cell_type": "markdown", "id": "38eed8e6-cd61-4a0e-b5c0-ab9275b52183", "metadata": {}, "source": [ "9. Dairo Moreno, damoralesj@unal.edu.co\n", "10. Joan Castro, jocastroc@unal.edu.co\n", "11. Bryan Riveros, briveros@unal.edu.co\n", "12. 
Rosmer Vargas, rovargasc@unal.edu.co" ] }, { "cell_type": "markdown", "id": "f448aac2-368f-425c-81c3-c07139340bbf", "metadata": {}, "source": [ "## Database Experts" ] }, { "cell_type": "markdown", "id": "0c4a7ffc-6bc9-49c1-86db-02d4fa3cf80b", "metadata": {}, "source": [ "13. Giovvani Barrera, udgiovanni@gmail.com\n", "14. Camilo Chitivo, cchitivo@unal.edu.co" ] }, { "cell_type": "markdown", "id": "7905f969-c5f5-420d-bd0c-c4a12ecbda70", "metadata": {}, "source": [ "## Sources and references\n", "\n", "This notebook is an adaptation and free translation of the guides available on the official [spaCy](https://spacy.io/) website." ] }, { "cell_type": "markdown", "id": "29aac8a5-4e5c-4878-9b9b-1c5446cfab79", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "f567537c-b192-4db9-bf83-28f00d47cbf3", "metadata": { "tags": [] }, "source": [ "### Install the spaCy package" ] }, { "cell_type": "code", "execution_count": null, "id": "db808048-4039-41ac-987b-efb94d5d1545", "metadata": {}, "outputs": [], "source": [ "# Uncomment one of the following lines to install spaCy\n", "#!pip install --quiet spacy\n", "#!conda install -c conda-forge spacy" ] }, { "cell_type": "markdown", "id": "298da6e9-600d-4de7-aa35-8352065a8f32", "metadata": {}, "source": [ "### Download a trained pipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "e1b7523d-bb95-4c48-abbc-06eac1c50b0d", "metadata": {}, "outputs": [], "source": [ "!spacy download en_core_web_md" ] }, { "cell_type": "markdown", "id": "a66fe5d0-1e03-49a7-bf33-c681ad613de7", "metadata": {}, "source": [ "### Import the package and load the pipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "011544f9-3864-4795-a814-3455cf8b3f9b", "metadata": {}, "outputs": [], "source": [ "import spacy\n", "from spacy import displacy\n", "\n", "# Load the medium English pipeline, which includes word vectors\n", "nlp = spacy.load(\"en_core_web_md\")" ] }, { "cell_type": "markdown", "id": "d3ab7dbb-a174-4146-b585-46f4be15b9bd", "metadata": {}, "source": [ "### Documents" ] }, { "cell_type": "code", 
"execution_count": null, "id": "74b3c72d-4414-48d6-b2de-8bbe878a0d4b", "metadata": {}, "outputs": [], "source": [ "# Calling nlp on a string runs the pipeline and returns a Doc\n", "doc = nlp(\"Apple is looking at buying U.K. startup for $1 billion\")\n", "\n", "text = \"\"\"In ancient Rome, some neighbors live in three adjacent houses. In the center is the house of Senex, who lives there with wife Domina, son Hero, and several slaves, including head slave Hysterium and the musical's main character Pseudolus. A slave belonging to Hero, Pseudolus wishes to buy, win, or steal his freedom. One of the neighboring houses is owned by Marcus Lycus, who is a buyer and seller of beautiful women; the other belongs to the ancient Erronius, who is abroad searching for his long-lost children (stolen in infancy by pirates). One day, Senex and Domina go on a trip and leave Pseudolus in charge of Hero. Hero confides in Pseudolus that he is in love with the lovely Philia, one of the courtesans in the House of Lycus (albeit still a virgin).\"\"\"\n", "longer_doc = nlp(text)" ] }, { "cell_type": "markdown", "id": "fea3ea49-9b49-40a0-bee3-5e67ae3fefad", "metadata": { "tags": [] }, "source": [ "## Linguistic annotations" ] }, { "cell_type": "markdown", "id": "477d0cd6-d728-4dca-ad66-a61567cb97c4", "metadata": {}, "source": [ "### Sentences" ] }, { "cell_type": "code", "execution_count": null, "id": "f1554929-9db2-41cd-9ce1-e26ffb2e8d58", "metadata": {}, "outputs": [], "source": [ "# Doc.sents yields one Span per detected sentence\n", "sentence_spans = list(longer_doc.sents)\n", "print(sentence_spans)" ] }, { "cell_type": "markdown", "id": "ff532122-ef30-4bf3-b280-4eeeabcf33cc", "metadata": {}, "source": [ "### Tokenization" ] }, { "cell_type": "code", "execution_count": null, "id": "304c0d1c-61b4-47fc-8423-d21f0fdc4bee", "metadata": {}, "outputs": [], "source": [ "for token in doc:\n", "    print(token.text)" ] }, { "cell_type": "markdown", "id": "e029aee3-3c05-41f0-99e1-53e724862fa7", "metadata": {}, "source": [ "### Part-of-speech tags and dependencies" ] }, { "cell_type": "code", 
"execution_count": null, "id": "cf24def3-92f0-4b9e-9d44-9ca625f8e12e", "metadata": {}, "outputs": [], "source": [ "print(f'{\"text\":15s}{\"lemma_\":15s}{\"pos_\":15s}{\"tag_\":15s}{\"dep_\":15s}{\"shape_\":15s}is_alpha, is_stop')\n", "print(\"-\"*110)\n", "for token in doc:\n", "    print(f'{token.text:15s}{token.lemma_:15s}{token.pos_:15s}{token.tag_:15s}{token.dep_:15s}{token.shape_:15s}{token.is_alpha}, {token.is_stop}')" ] }, { "cell_type": "markdown", "id": "1614bdb8-4f3e-4d94-ab08-61cd7234d680", "metadata": {}, "source": [ "### Morphological features" ] }, { "cell_type": "code", "execution_count": null, "id": "849eff92-40e4-48e8-9ce8-8076f645e1c9", "metadata": {}, "outputs": [], "source": [ "for token in doc:\n", "    print(token.morph)" ] }, { "cell_type": "markdown", "id": "fc13486f-76a7-4b7f-86d7-f6efafad8533", "metadata": {}, "source": [ "### Visualization" ] }, { "cell_type": "code", "execution_count": null, "id": "1d3ed09e-b3d1-4895-887a-23a837421bf0", "metadata": {}, "outputs": [], "source": [ "displacy.render(doc, style=\"dep\")" ] }, { "cell_type": "code", "execution_count": null, "id": "29da69d5-7cea-4739-a935-e38c78d5ef49", "metadata": {}, "outputs": [], "source": [ "displacy.render(sentence_spans, style=\"dep\")" ] }, { "cell_type": "markdown", "id": "41852e3f-516b-4c0a-b09d-a707b10cc36f", "metadata": {}, "source": [ "## Named entities" ] }, { "cell_type": "code", "execution_count": null, "id": "5ec5beb3-cecf-425c-9dc5-82b9889293d7", "metadata": {}, "outputs": [], "source": [ "for ent in doc.ents:\n", "    print(f'{ent.text:15s}{ent.label_:10}{ent.start_char:5d}{ent.end_char:5d}')" ] }, { "cell_type": "markdown", "id": "3a9610a4-484d-45a7-a4b1-9fb14fc7a3a1", "metadata": {}, "source": [ "### Visualization" ] }, { "cell_type": "code", "execution_count": null, "id": "69aadcc8-18cd-48fb-832e-647c55670ed4", "metadata": {}, "outputs": [], "source": [ "displacy.render(doc, style=\"ent\")" ] }, { "cell_type": "code", "execution_count": null, 
"id": "edf8986c-6e0f-4014-ac89-eda9c2906a80", "metadata": {}, "outputs": [], "source": [ "displacy.render(longer_doc, style=\"ent\")" ] }, { "cell_type": "markdown", "id": "3bc73d6d-361a-4abb-8df0-fe074cc458a0", "metadata": {}, "source": [ "## Word vectors and similarity" ] }, { "cell_type": "code", "execution_count": null, "id": "3f133b03-cb7f-4e70-99f9-541c3c4a6fd9", "metadata": {}, "outputs": [], "source": [ "for token in doc:\n", "    print(f'{token.text:15s}{token.vector_norm:10f} {token.has_vector} {token.is_oov}')" ] }, { "cell_type": "code", "execution_count": null, "id": "b83ea5c1-fea9-4613-b9d9-9b7db05d02f7", "metadata": {}, "outputs": [], "source": [ "doc1 = nlp(\"I like salty fries and hamburgers.\")\n", "doc2 = nlp(\"Fast food tastes very good.\")\n", "# Similarity of two documents\n", "print(doc1, \"<->\", doc2, \":\", doc1.similarity(doc2))" ] }, { "cell_type": "code", "execution_count": null, "id": "b0cad04e-98a8-4828-9a43-98595b0d520f", "metadata": {}, "outputs": [], "source": [ "# Similarity of tokens and spans\n", "french_fries = doc1[2:4]\n", "burgers = doc1[5]\n", "print(french_fries, \"<->\", burgers, \":\", french_fries.similarity(burgers))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }