{ "cells": [ { "cell_type": "markdown", "id": "1daa77c9", "metadata": {}, "source": [ "## Analysis of the TCE60 collection\n", "\n", "This example shows how to analyze the dataset generated from the original TEI documents in XML. The data has been exported to a CSV file that can be analyzed with the pandas library for Python.\n", "\n", "The original corpus is available at the Biblioteca Virtual Miguel de Cervantes:\n", "https://www.cervantesvirtual.com/portales/teatro_clasico_espanol/obra/canon-60-la-coleccion-esencial-del-tc12-teatro-clasico-espanol/\n", "\n", "The collection consists of the speeches of the plays included in the TCE60 corpus, organized as records (idRegistro, idAutoridad, Personaje, Texto)." ] }, { "cell_type": "markdown", "id": "128e6210", "metadata": {}, "source": [ "### Import the Python libraries" ] }, { "cell_type": "code", "execution_count": 33, "id": "558aad51", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from matplotlib import pyplot as plt" ] }, { "cell_type": "markdown", "id": "644f8c1e", "metadata": {}, "source": [ "#### Corpus_TCE60 class\n", "A Python class has been created to load and analyze the collection in CSV format. The methods available for analyzing its contents are described below. 
" ] }, { "cell_type": "code", "execution_count": 40, "id": "45870a37", "metadata": {}, "outputs": [], "source": [ "class Corpus_TCE60:\n", " def __init__(self, path_csv):\n", " self.df = pd.read_csv (path_csv, sep=';')\n", " \n", " def estructura(self) :\n", " print('#### estructura del corpus:') \n", " print(self.df.columns.tolist())\n", " print(self.df.dtypes)\n", "\n", " def num_parlamentos(self):\n", " print('#### numero de parlamentos:') \n", " print(self.df.count())\n", " \n", " def num_personajes_obra(self):\n", " print('#### Num de personajes por obra:') \n", " personajes_by_registro = self.df.groupby(\"Registro\")[\"Personaje\"].apply(lambda x: x.unique().shape[0])\n", " print(personajes_by_registro.head(10))\n", "\n", " def num_registros_por_obra(self):\n", " print('#### Num de registros por obra:') \n", " print(self.df.groupby(\"Registro\").size())\n", " \n", " def num_obras(self):\n", " print('#### num_obras en el corpus:') \n", " print(self.df['Registro'].describe())\n", " \n", " def num_autores(self):\n", " print('#### num_autores en el corpus:') \n", " print(self.df['ID Autoridad'].describe())\n", " \n", " def grafica_parlamentos_autor(self):\n", " self.df.groupby(['ID Autoridad']).count()['Registro'].plot(kind=\"barh\")\n", " plt.title(\"Parlamentos por autor\")\n", " plt.ylabel(\"Id Autoridad\")\n", " plt.xlabel(\"Núm parlamentos\")\n", " plt.show()\n", " \n", " def obras_autor(self):\n", " self.df.groupby('ID Autoridad').Registro.nunique().plot(kind=\"barh\")\n", " plt.title(\"Obras por autor\")\n", " plt.ylabel(\"Id Autoridad\")\n", " plt.xlabel(\"Núm obras\")\n", " plt.show()\n", " " ] }, { "cell_type": "markdown", "id": "a071ac40", "metadata": {}, "source": [ "### Inicializamos la clase y ejecutamos los diferentes métodos" ] }, { "cell_type": "code", "execution_count": 41, "id": "ce0f8300", "metadata": {}, "outputs": [], "source": [ "corpus = Corpus_TCE60('../data/procesado.csv')" ] }, { "cell_type": "code", "execution_count": 42, "id": 
"7ab60fde", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#### estructura del corpus:\n", "['Registro', 'ID Autoridad', 'Índice', 'Personaje', 'Contenido XML', 'Contenido texto']\n", "Registro object\n", "ID Autoridad object\n", "Índice int64\n", "Personaje object\n", "Contenido XML object\n", "Contenido texto object\n", "dtype: object\n" ] } ], "source": [ "# columnas del fichero CSV con el contenido de la colección TCE60\n", "corpus.estructura()" ] }, { "cell_type": "code", "execution_count": 38, "id": "b2f64bc0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#### num_obras en el corpus:\n", "count 47895\n", "unique 60\n", "top 682342.xml\n", "freq 1405\n", "Name: Registro, dtype: object\n" ] } ], "source": [ "# 60 obras\n", "corpus.num_obras()" ] }, { "cell_type": "code", "execution_count": 29, "id": "23821860", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#### num_autores en el corpus:\n", "count 47895\n", "unique 22\n", "top 72\n", "freq 13304\n", "Name: ID Autoridad, dtype: object\n" ] } ], "source": [ "# 22 autores\n", "corpus.num_autores()" ] }, { "cell_type": "code", "execution_count": 30, "id": "b9e8e0f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#### Num de personajes por obra:\n", "Registro\n", "681756.xml 16\n", "681765.xml 30\n", "681843.xml 9\n", "681846.xml 15\n", "681849.xml 20\n", "681855.xml 22\n", "681858.xml 19\n", "681861.xml 19\n", "681864.xml 20\n", "681868.xml 12\n", "Name: Personaje, dtype: int64\n" ] } ], "source": [ "# personajes por obra\n", "corpus.num_personajes_obra()" ] }, { "cell_type": "code", "execution_count": 31, "id": "6a243ec9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#### Num de registros por obra:\n", "Registro\n", "681756.xml 665\n", "681765.xml 774\n", "681843.xml 763\n", "681846.xml 313\n", "681849.xml 785\n", 
"681855.xml 763\n", "681858.xml 740\n", "681861.xml 640\n", "681864.xml 496\n", "681868.xml 490\n", "681873.xml 761\n", "682308.xml 351\n", "682311.xml 1150\n", "682314.xml 631\n", "682317.xml 580\n", "682320.xml 750\n", "682323.xml 1034\n", "682330.xml 1099\n", "682333.xml 347\n", "682342.xml 1405\n", "682348.xml 1131\n", "682351.xml 856\n", "682360.xml 862\n", "682363.xml 985\n", "682366.xml 875\n", "682369.xml 1244\n", "682375.xml 1233\n", "682378.xml 556\n", "682381.xml 864\n", "682384.xml 268\n", "682387.xml 598\n", "703732.xml 795\n", "703735.xml 868\n", "703738.xml 925\n", "703741.xml 240\n", "703744.xml 837\n", "703747.xml 1090\n", "703753.xml 535\n", "703759.xml 1064\n", "703765.xml 789\n", "703774.xml 859\n", "703795.xml 888\n", "703942.xml 1057\n", "703951.xml 643\n", "707437.xml 753\n", "707442.xml 755\n", "707446.xml 786\n", "707454.xml 813\n", "707658.xml 927\n", "707661.xml 1208\n", "707664.xml 762\n", "707670.xml 1150\n", "707673.xml 775\n", "707678.xml 878\n", "707685.xml 976\n", "708744.xml 676\n", "738313.xml 458\n", "738319.xml 936\n", "738323.xml 558\n", "738328.xml 885\n", "dtype: int64\n" ] } ], "source": [ "# numero de parlamentos por obra\n", "corpus.num_registros_por_obra()" ] }, { "cell_type": "code", "execution_count": 32, "id": "d256d4d6", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# identificador autoridad en la BVMC y número de parlamentos en el corpus completo\n", "corpus.grafica_parlamentos_autor()" ] }, { "cell_type": "code", "execution_count": 23, "id": "544d5c59", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# obras por autor, identificador de la BVMC (por ejemplo 72 corresponde a Lope de Vega) \n", "corpus.obras_autor()" ] }, { "cell_type": "code", "execution_count": 43, "id": "6f09af4f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#### numero de parlamentos:\n", "Registro 47895\n", "ID Autoridad 47895\n", "Índice 47895\n", "Personaje 47398\n", "Contenido XML 47895\n", "Contenido texto 47895\n", "dtype: int64\n" ] } ], "source": [ "corpus.num_parlamentos()" ] }, { "cell_type": "code", "execution_count": null, "id": "4cabd323", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }