{ "cells": [ { "cell_type": "markdown", "id": "e69ba12f-305c-4e43-aa02-54a09093c321", "metadata": {}, "source": [ "

Text Extensions for Pandas

\n", "

Interactive Dataframe Widget

\n", "The interactive dataframe widget is an application within the IBM CODAIT team's open-source Python library, Text Extensions for Pandas. The widget aims to provide data scientists with a meaningful, visual way to interpret NLP (Natural Language Processing) data." ] }, { "cell_type": "markdown", "id": "d23a275a-09ac-4de0-9e07-1a71adb78365", "metadata": {}, "source": [ "This demo will walk you through an example session of using the widget and related visualizers provided in the ```jupyter``` sub-module of Text Extensions for Pandas." ] }, { "cell_type": "code", "execution_count": 1, "id": "8a02abec-ae6b-4ad8-903b-f182418726e9", "metadata": {}, "outputs": [], "source": [ "import os\n", "import regex\n", "import sys\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# And of course we need the text_extensions_for_pandas library itself.\n", "try:\n", "    import text_extensions_for_pandas as tp\n", "except ModuleNotFoundError as e:\n", "    # If we're running from within the project source tree and the parent Python\n", "    # environment doesn't have the text_extensions_for_pandas package, use the\n", "    # version in the local source tree.\n", "    if not os.getcwd().endswith(\"notebooks\"):\n", "        raise e\n", "    if \"..\" not in sys.path:\n", "        sys.path.insert(0, \"..\")\n", "    import text_extensions_for_pandas as tp" ] }, { "cell_type": "markdown", "id": "2f101deb-5c29-4ee2-be0c-da7061d0b5c9", "metadata": {}, "source": [ "This demo will make use of the CoNLL-2003 dataset, a dataset for named entity recognition (also called named entity extraction). We will be looking at a token classification problem: analyzing the building blocks of natural language present in this dataset that we can process and feed into a machine learning algorithm.
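To make the token classification framing concrete, here is a minimal pure-Python sketch of how per-token IOB tags group into entity mentions (`B` marks the beginning of an entity, `I` a continuation, and `O` a token outside any entity). It is not part of Text Extensions for Pandas: the helper name `group_iob_tokens` is hypothetical, and the sample tags are illustrative assumptions modeled on the first tokens of the dataset.

```python
def group_iob_tokens(tokens, iob_tags, ent_types):
    # Hypothetical helper: collect consecutive B/I tokens into
    # (mention text, entity type) pairs.
    mentions = []
    current_tokens, current_type = [], None
    for token, tag, ent_type in zip(tokens, iob_tags, ent_types):
        if tag == 'I' and current_tokens:
            # Continuation of the mention that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open mention, if there is one.
        if current_tokens:
            mentions.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag in ('B', 'I'):
            # B (or a stray I with nothing open) starts a new mention.
            current_tokens, current_type = [token], ent_type
    if current_tokens:
        mentions.append((' '.join(current_tokens), current_type))
    return mentions


# Illustrative sample (assumed tags, not read from the dataset files).
tokens = ['SOCCER', '-', 'JAPAN', 'GET', 'Nadim', 'Ladki']
iob_tags = ['O', 'O', 'B', 'O', 'B', 'I']
ent_types = [None, None, 'LOC', None, 'PER', 'PER']

mentions = group_iob_tokens(tokens, iob_tags, ent_types)
print(mentions)  # [('JAPAN', 'LOC'), ('Nadim Ladki', 'PER')]
```

The sketch works on plain lists to keep the idea visible; the notebook itself relies on the library's span-based dataframes, which also track character offsets.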
The dataset contains categorical entity classifications of ```locations (LOC)```, ```persons (PER)```, ```organizations (ORG)``` and ```miscellaneous (MISC)```.\n", "\n", "Our goal is to load some data from this dataset, do some basic processing and analysis, and make corrections where necessary.\n", "\n", "We will use Text Extensions for Pandas to download the CoNLL dataset and parse it into dataframes that we can work with." ] }, { "cell_type": "code", "execution_count": 2, "id": "ac6e58cc-57fb-4d71-ba64-364fd2255d95", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'train': 'outputs/eng.train',\n", " 'dev': 'outputs/eng.testa',\n", " 'test': 'outputs/eng.testb'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Download and cache the data set.\n", "# NOTE: This data set is licensed for research use only. Be sure to adhere\n", "# to the terms of the license when using this data set!\n", "data_set_info = tp.io.conll.maybe_download_conll_data(\"outputs\")\n", "data_set_info" ] }, { "cell_type": "code", "execution_count": 3, "id": "2847afc9-6e3e-48a9-acc0-326cdf45877d", "metadata": {}, "outputs": [], "source": [ "gold_standard = tp.io.conll.conll_2003_to_dataframes(\n", "    data_set_info[\"test\"], [\"pos\", \"phrase\", \"ent\"], [False, True, True])\n", "gold_standard = [\n", "    df.drop(columns=[\"pos\", \"phrase_iob\", \"phrase_type\"])\n", "    for df in gold_standard\n", "]\n" ] }, { "cell_type": "markdown", "id": "f3dd3e7b-6535-4476-8e8a-997b2ba5e0d0", "metadata": {}, "source": [ "Once we have our dataset downloaded and parsed, we can prepare our dataframe for visualization." ] }, { "cell_type": "code", "execution_count": 4, "id": "3651acbd-18e8-45e8-bd5c-9427659e2fd3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>
<tr><th></th><th>span</th><th>ent_iob</th><th>ent_type</th><th>sentence</th><th>line_num</th></tr>
<tr><th>0</th><td>[0, 10): '-DOCSTART-'</td><td>O</td><td>None</td><td>[0, 10): '-DOCSTART-'</td><td>0</td></tr>
<tr><th>1</th><td>[11, 17): 'SOCCER'</td><td>O</td><td>None</td><td>[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...</td><td>2</td></tr>
<tr><th>2</th><td>[17, 18): '-'</td><td>O</td><td>None</td><td>[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...</td><td>3</td></tr>
<tr><th>3</th><td>[19, 24): 'JAPAN'</td><td>B</td><td>LOC</td><td>[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...</td><td>4</td></tr>
<tr><th>4</th><td>[25, 28): 'GET'</td><td>O</td><td>None</td><td>[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...</td><td>5</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>415</th><td>[2178, 2182): 'each'</td><td>O</td><td>None</td><td>[2138, 2197): 'All four teams are level with o...</td><td>437</td></tr>
<tr><th>416</th><td>[2183, 2187): 'from'</td><td>O</td><td>None</td><td>[2138, 2197): 'All four teams are level with o...</td><td>438</td></tr>
<tr><th>417</th><td>[2188, 2191): 'one'</td><td>O</td><td>None</td><td>[2138, 2197): 'All four teams are level with o...</td><td>439</td></tr>
<tr><th>418</th><td>[2192, 2196): 'game'</td><td>O</td><td>None</td><td>[2138, 2197): 'All four teams are level with o...</td><td>440</td></tr>
<tr><th>419</th><td>[2196, 2197): '.'</td><td>O</td><td>None</td><td>[2138, 2197): 'All four teams are level with o...</td><td>441</td></tr>
</table>
<p>420 rows × 5 columns</p>
" ], "text/plain": [ " span ent_iob ent_type \\\n", "0 [0, 10): '-DOCSTART-' O None \n", "1 [11, 17): 'SOCCER' O None \n", "2 [17, 18): '-' O None \n", "3 [19, 24): 'JAPAN' B LOC \n", "4 [25, 28): 'GET' O None \n", ".. ... ... ... \n", "415 [2178, 2182): 'each' O None \n", "416 [2183, 2187): 'from' O None \n", "417 [2188, 2191): 'one' O None \n", "418 [2192, 2196): 'game' O None \n", "419 [2196, 2197): '.' O None \n", "\n", " sentence line_num \n", "0 [0, 10): '-DOCSTART-' 0 \n", "1 [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... 2 \n", "2 [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... 3 \n", "3 [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... 4 \n", "4 [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... 5 \n", ".. ... ... \n", "415 [2138, 2197): 'All four teams are level with o... 437 \n", "416 [2138, 2197): 'All four teams are level with o... 438 \n", "417 [2138, 2197): 'All four teams are level with o... 439 \n", "418 [2138, 2197): 'All four teams are level with o... 440 \n", "419 [2138, 2197): 'All four teams are level with o... 441 \n", "\n", "[420 rows x 5 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = gold_standard[0]\n", "tokens" ] }, { "cell_type": "code", "execution_count": 5, "id": "d1c602a3-186a-4059-847b-753e73df685e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>
<tr><th></th><th>span</th><th>ent_type</th></tr>
<tr><th>0</th><td>[19, 24): 'JAPAN'</td><td>LOC</td></tr>
<tr><th>1</th><td>[40, 45): 'CHINA'</td><td>PER</td></tr>
<tr><th>2</th><td>[66, 77): 'Nadim Ladki'</td><td>PER</td></tr>
<tr><th>3</th><td>[78, 84): 'AL-AIN'</td><td>LOC</td></tr>
<tr><th>4</th><td>[86, 106): 'United Arab Emirates'</td><td>LOC</td></tr>
</table>
" ], "text/plain": [ " span ent_type\n", "0 [19, 24): 'JAPAN' LOC\n", "1 [40, 45): 'CHINA' PER\n", "2 [66, 77): 'Nadim Ladki' PER\n", "3 [78, 84): 'AL-AIN' LOC\n", "4 [86, 106): 'United Arab Emirates' LOC" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "entity_mentions = tp.io.conll.iob_to_spans(tokens)\n", "entity_mentions.head()" ] }, { "cell_type": "code", "execution_count": 6, "id": "2e539ae1-b4f6-4cc1-bb74-adb206e32544", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>
<tr><th></th><th>span</th><th>ent_type</th><th>sentence</th><th>sentence_id</th></tr>
<tr><th>0</th><td>[19, 24): 'JAPAN'</td><td>LOC</td><td>[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...</td><td>11</td></tr>
<tr><th>1</th><td>[40, 45): 'CHINA'</td><td>PER</td><td>[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...</td><td>11</td></tr>
<tr><th>2</th><td>[66, 77): 'Nadim Ladki'</td><td>PER</td><td>[66, 77): 'Nadim Ladki'</td><td>66</td></tr>
<tr><th>3</th><td>[78, 84): 'AL-AIN'</td><td>LOC</td><td>[78, 117): 'AL-AIN, United Arab Emirates 1996-...</td><td>78</td></tr>
<tr><th>4</th><td>[86, 106): 'United Arab Emirates'</td><td>LOC</td><td>[78, 117): 'AL-AIN, United Arab Emirates 1996-...</td><td>78</td></tr>
</table>
" ], "text/plain": [ " span ent_type \\\n", "0 [19, 24): 'JAPAN' LOC \n", "1 [40, 45): 'CHINA' PER \n", "2 [66, 77): 'Nadim Ladki' PER \n", "3 [78, 84): 'AL-AIN' LOC \n", "4 [86, 106): 'United Arab Emirates' LOC \n", "\n", " sentence sentence_id \n", "0 [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... 11 \n", "1 [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... 11 \n", "2 [66, 77): 'Nadim Ladki' 66 \n", "3 [78, 117): 'AL-AIN, United Arab Emirates 1996-... 78 \n", "4 [78, 117): 'AL-AIN, United Arab Emirates 1996-... 78 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentences = tokens[\"sentence\"].unique()\n", "entity_sentence_pairs = tp.spanner.contain_join(pd.Series(sentences), entity_mentions[\"span\"], \"sentence\", \"span\")\n", "entity_mentions = entity_mentions.merge(entity_sentence_pairs)\n", "entity_mentions[\"sentence_id\"] = entity_mentions[\"sentence\"].array.begin\n", "entity_mentions.head()" ] }, { "cell_type": "markdown", "id": "2cd6d543-df21-4774-8140-5d45d6a1b7ca", "metadata": {}, "source": [ "We can take a closer look at what the ```span``` column might look like in context by viewing the column alone as the SpanArray datatype." ] }, { "cell_type": "code", "execution_count": 7, "id": "e777948a-ffca-4a4c-9a04-31ac90a88698", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
<table>
<tr><th></th><th>begin</th><th>end</th><th>begin token</th><th>end token</th><th>context</th></tr>
<tr><th>0</th><td>11</td><td>65</td><td>1</td><td>13</td><td>SOCCER- JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT.</td></tr>
<tr><th>1</th><td>66</td><td>77</td><td>13</td><td>15</td><td>Nadim Ladki</td></tr>
<tr><th>2</th><td>78</td><td>117</td><td>15</td><td>21</td><td>AL-AIN, United Arab Emirates 1996-12-06</td></tr>
<tr><th>3</th><td>118</td><td>244</td><td>21</td><td>46</td><td>Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday.</td></tr>
<tr><th>4</th><td>245</td><td>374</td><td>46</td><td>71</td><td>But China saw their luck desert them in the second match of the group, crashing to a surprise 2-0 defeat to newcomers Uzbekistan.</td></tr>
<tr><th>5</th><td>375</td><td>617</td><td>71</td><td>113</td><td>China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net.</td></tr>
<tr><th>6</th><td>618</td><td>735</td><td>113</td><td>136</td><td>Oleg Shatskiku made sure of the win in injury time, hitting an unstoppable left foot shot from just outside the area.</td></tr>
<tr><th>7</th><td>736</td><td>821</td><td>136</td><td>153</td><td>The former Soviet republic was playing in an Asian Cup finals tie for the first time.</td></tr>
<tr><th>8</th><td>822</td><td>917</td><td>153</td><td>171</td><td>Despite winning the Asian Games title two years ago, Uzbekistan are in the finals as outsiders.</td></tr>
<tr><th>9</th><td>918</td><td>1078</td><td>171</td><td>199</td><td>Two goals from defensive errors in the last six minutes allowed Japan to come from behind and collect all three points from their opening meeting against Syria.</td></tr>
<tr><th>10</th><td>1079</td><td>1291</td><td>199</td><td>237</td><td>Takuya Takagi scored the winner in the 88th minute, rising to head a Hiroshige Yanagimoto cross towards the Syrian goal which goalkeeper Salem Bitar appeared to have covered but then allowed to slip into the net.</td></tr>
<tr><th>11</th><td>1292</td><td>1350</td><td>237</td><td>249</td><td>It was the second costly blunder by Syria in four minutes.</td></tr>
<tr><th>12</th><td>1351</td><td>1502</td><td>249</td><td>280</td><td>Defender Hassan Abbas rose to intercept a long ball into the area in the 84th minute but only managed to divert it into the top corner of Bitar's goal.</td></tr>
<tr><th>13</th><td>1503</td><td>1591</td><td>280</td><td>296</td><td>Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute.</td></tr>
<tr><th>14</th><td>1592</td><td>1701</td><td>296</td><td>317</td><td>Japan then laid siege to the Syrian penalty area for most of the game but rarely breached the Syrian defence.</td></tr>
<tr><th>15</th><td>1702</td><td>1748</td><td>317</td><td>326</td><td>Bitar pulled off fine saves whenever they did.</td></tr>
<tr><th>16</th><td>1749</td><td>1820</td><td>326</td><td>343</td><td>Japan coach Shu Kamo said: ' ' The Syrian own goal proved lucky for us.</td></tr>
<tr><th>17</th><td>1821</td><td>1925</td><td>343</td><td>363</td><td>The Syrians scored early and then played defensively and adopted long balls which made it hard for us. '</td></tr>
<tr><th>18</th><td>1928</td><td>2049</td><td>364</td><td>390</td><td>Japan, co-hosts of the World Cup in 2002 and ranked 20th in the world by FIFA, are favourites to regain their title here.</td></tr>
<tr><th>19</th><td>2050</td><td>2137</td><td>390</td><td>407</td><td>Hosts UAE play Kuwait and South Korea take on Indonesia on Saturday in Group A matches.</td></tr>
</table>
\n", "\n", "\n" ], "text/plain": [ "\n", "[ [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT.',\n", " [66, 77): 'Nadim Ladki',\n", " [78, 117): 'AL-AIN, United Arab Emirates 1996-12-06',\n", " [118, 244): 'Japan began the defence of their Asian Cup title with a lucky 2-1 win [...]',\n", " [245, 374): 'But China saw their luck desert them in the second match of the group, [...]',\n", " [375, 617): 'China controlled most of the match and saw several chances missed until [...]',\n", " [618, 735): 'Oleg Shatskiku made sure of the win in injury time, hitting an unstoppable [...]',\n", " [736, 821): 'The former Soviet republic was playing in an Asian Cup finals tie for the [...]',\n", " [822, 917): 'Despite winning the Asian Games title two years ago, Uzbekistan are in the [...]',\n", " [918, 1078): 'Two goals from defensive errors in the last six minutes allowed Japan to [...]',\n", " [1079, 1291): 'Takuya Takagi scored the winner in the 88th minute, rising to head a [...]',\n", " [1292, 1350): 'It was the second costly blunder by Syria in four minutes.',\n", " [1351, 1502): 'Defender Hassan Abbas rose to intercept a long ball into the area in the [...]',\n", " [1503, 1591): 'Nader Jokhadar had given Syria the lead with a well-struck header in the [...]',\n", " [1592, 1701): 'Japan then laid siege to the Syrian penalty area for most of the game but [...]',\n", " [1702, 1748): 'Bitar pulled off fine saves whenever they did.',\n", " [1749, 1820): 'Japan coach Shu Kamo said: ' ' The Syrian own goal proved lucky for us.',\n", " [1821, 1925): 'The Syrians scored early and then played defensively and adopted long [...]',\n", " [1928, 2049): 'Japan, co-hosts of the World Cup in 2002 and ranked 20th in the world by [...]',\n", " [2050, 2137): 'Hosts UAE play Kuwait and South Korea take on Indonesia on Saturday in [...]']\n", "Length: 20, dtype: TokenSpanDtype" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"entity_mentions[\"sentence\"].unique()" ] }, { "cell_type": "markdown", "id": "c7a34d33-2956-4e15-ba6b-93b1fee06fd7", "metadata": {}, "source": [ "We don't really want to visualize every column in our dataframe as we're only interested in viewing the entity classifications. The next step is to drop any columns we don't care about." ] }, { "cell_type": "markdown", "id": "84886242-284e-4a66-ad87-0f65191880bf", "metadata": {}, "source": [ "Now that our data is prepared for analysis, we can load it up in our widget." ] }, { "cell_type": "code", "execution_count": 8, "id": "40f6c7f2-dd12-4038-9f14-e19d40e7301e", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "480b5072a64742b2be61aaef7b151a53", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output(_dom_classes=('tep--dfwidget--output',))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "widget = tp.jupyter.DataFrameWidget(entity_mentions.drop(columns=[\"sentence\"]))\n", "widget.display()" ] }, { "cell_type": "markdown", "id": "dbe7bbe4-44aa-4a73-ac13-faf8bd741f2c", "metadata": {}, "source": [ "If we want to view this widget interactively, we can pass in the additional parameter ```interactive_columns``` with an array of column names we want to become interactive widgets.\n", "\n", "One thing you may notice in the above widgets is that the column ```ent_type``` is editable via a text box. This is fine, but there is a more appropriate way to interact with categorical data." 
] }, { "cell_type": "code", "execution_count": 9, "id": "46b65ec8-865e-43eb-9990-2c91b4f298c8", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d1d66f97fd2343948c14ce57bfe3aea1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output(_dom_classes=('tep--dfwidget--output',))" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categorical = pd.Categorical(entity_mentions[\"ent_type\"], categories=[\"PER\", \"LOC\", \"ORG\", \"MISC\"])\n", "entity_mentions[\"ent_type\"] = categorical\n", "tp.jupyter.DataFrameWidget(entity_mentions.drop(columns=[\"sentence\", \"sentence_id\"]), interactive_columns=[\"ent_type\"]).display()" ] }, { "cell_type": "code", "execution_count": 10, "id": "f8280a38-7c4d-45c6-b90d-60ceccc6b603", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>
<tr><th></th><th>span</th><th>ent_type</th><th>sentence</th><th>sentence_id</th><th>new_type</th></tr>
<tr><th>0</th><td>[19, 24): 'JAPAN'</td><td>LOC</td><td>[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...</td><td>11</td><td>LOC</td></tr>
<tr><th>1</th><td>[40, 45): 'CHINA'</td><td>PER</td><td>[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...</td><td>11</td><td>PER</td></tr>
<tr><th>2</th><td>[66, 77): 'Nadim Ladki'</td><td>PER</td><td>[66, 77): 'Nadim Ladki'</td><td>66</td><td>PER</td></tr>
<tr><th>3</th><td>[78, 84): 'AL-AIN'</td><td>LOC</td><td>[78, 117): 'AL-AIN, United Arab Emirates 1996-...</td><td>78</td><td>LOC</td></tr>
<tr><th>4</th><td>[86, 106): 'United Arab Emirates'</td><td>LOC</td><td>[78, 117): 'AL-AIN, United Arab Emirates 1996-...</td><td>78</td><td>LOC</td></tr>
<tr><th>5</th><td>[118, 123): 'Japan'</td><td>LOC</td><td>[118, 244): 'Japan began the defence of their ...</td><td>118</td><td>LOC</td></tr>
<tr><th>6</th><td>[151, 160): 'Asian Cup'</td><td>MISC</td><td>[118, 244): 'Japan began the defence of their ...</td><td>118</td><td>MISC</td></tr>
<tr><th>7</th><td>[196, 201): 'Syria'</td><td>LOC</td><td>[118, 244): 'Japan began the defence of their ...</td><td>118</td><td>LOC</td></tr>
<tr><th>8</th><td>[249, 254): 'China'</td><td>LOC</td><td>[245, 374): 'But China saw their luck desert t...</td><td>245</td><td>LOC</td></tr>
<tr><th>9</th><td>[363, 373): 'Uzbekistan'</td><td>LOC</td><td>[245, 374): 'But China saw their luck desert t...</td><td>245</td><td>LOC</td></tr>
<tr><th>10</th><td>[375, 380): 'China'</td><td>LOC</td><td>[375, 617): 'China controlled most of the matc...</td><td>375</td><td>LOC</td></tr>
<tr><th>11</th><td>[468, 473): 'Uzbek'</td><td>MISC</td><td>[375, 617): 'China controlled most of the matc...</td><td>375</td><td>MISC</td></tr>
<tr><th>12</th><td>[482, 495): 'Igor Shkvyrin'</td><td>PER</td><td>[375, 617): 'China controlled most of the matc...</td><td>375</td><td>PER</td></tr>
<tr><th>13</th><td>[580, 587): 'Chinese'</td><td>MISC</td><td>[375, 617): 'China controlled most of the matc...</td><td>375</td><td>MISC</td></tr>
<tr><th>14</th><td>[618, 632): 'Oleg Shatskiku'</td><td>PER</td><td>[618, 735): 'Oleg Shatskiku made sure of the w...</td><td>618</td><td>PER</td></tr>
<tr><th>15</th><td>[747, 753): 'Soviet'</td><td>MISC</td><td>[736, 821): 'The former Soviet republic was pl...</td><td>736</td><td>MISC</td></tr>
<tr><th>16</th><td>[781, 790): 'Asian Cup'</td><td>MISC</td><td>[736, 821): 'The former Soviet republic was pl...</td><td>736</td><td>MISC</td></tr>
<tr><th>17</th><td>[842, 853): 'Asian Games'</td><td>MISC</td><td>[822, 917): 'Despite winning the Asian Games t...</td><td>822</td><td>MISC</td></tr>
<tr><th>18</th><td>[875, 885): 'Uzbekistan'</td><td>LOC</td><td>[822, 917): 'Despite winning the Asian Games t...</td><td>822</td><td>LOC</td></tr>
<tr><th>19</th><td>[982, 987): 'Japan'</td><td>LOC</td><td>[918, 1078): 'Two goals from defensive errors ...</td><td>918</td><td>LOC</td></tr>
<tr><th>20</th><td>[1072, 1077): 'Syria'</td><td>LOC</td><td>[918, 1078): 'Two goals from defensive errors ...</td><td>918</td><td>LOC</td></tr>
<tr><th>21</th><td>[1079, 1092): 'Takuya Takagi'</td><td>PER</td><td>[1079, 1291): 'Takuya Takagi scored the winner...</td><td>1079</td><td>PER</td></tr>
<tr><th>22</th><td>[1148, 1168): 'Hiroshige Yanagimoto'</td><td>PER</td><td>[1079, 1291): 'Takuya Takagi scored the winner...</td><td>1079</td><td>PER</td></tr>
<tr><th>23</th><td>[1187, 1193): 'Syrian'</td><td>MISC</td><td>[1079, 1291): 'Takuya Takagi scored the winner...</td><td>1079</td><td>MISC</td></tr>
<tr><th>24</th><td>[1216, 1227): 'Salem Bitar'</td><td>PER</td><td>[1079, 1291): 'Takuya Takagi scored the winner...</td><td>1079</td><td>PER</td></tr>
<tr><th>25</th><td>[1328, 1333): 'Syria'</td><td>LOC</td><td>[1292, 1350): 'It was the second costly blunde...</td><td>1292</td><td>LOC</td></tr>
<tr><th>26</th><td>[1360, 1372): 'Hassan Abbas'</td><td>PER</td><td>[1351, 1502): 'Defender Hassan Abbas rose to i...</td><td>1351</td><td>PER</td></tr>
<tr><th>27</th><td>[1489, 1494): 'Bitar'</td><td>PER</td><td>[1351, 1502): 'Defender Hassan Abbas rose to i...</td><td>1351</td><td>PER</td></tr>
<tr><th>28</th><td>[1503, 1517): 'Nader Jokhadar'</td><td>PER</td><td>[1503, 1591): 'Nader Jokhadar had given Syria ...</td><td>1503</td><td>PER</td></tr>
<tr><th>29</th><td>[1528, 1533): 'Syria'</td><td>LOC</td><td>[1503, 1591): 'Nader Jokhadar had given Syria ...</td><td>1503</td><td>LOC</td></tr>
<tr><th>30</th><td>[1592, 1597): 'Japan'</td><td>LOC</td><td>[1592, 1701): 'Japan then laid siege to the Sy...</td><td>1592</td><td>LOC</td></tr>
<tr><th>31</th><td>[1621, 1627): 'Syrian'</td><td>MISC</td><td>[1592, 1701): 'Japan then laid siege to the Sy...</td><td>1592</td><td>MISC</td></tr>
<tr><th>32</th><td>[1686, 1692): 'Syrian'</td><td>MISC</td><td>[1592, 1701): 'Japan then laid siege to the Sy...</td><td>1592</td><td>MISC</td></tr>
<tr><th>33</th><td>[1702, 1707): 'Bitar'</td><td>PER</td><td>[1702, 1748): 'Bitar pulled off fine saves whe...</td><td>1702</td><td>PER</td></tr>
<tr><th>34</th><td>[1749, 1754): 'Japan'</td><td>LOC</td><td>[1749, 1820): 'Japan coach Shu Kamo said: ' ' ...</td><td>1749</td><td>LOC</td></tr>
<tr><th>35</th><td>[1761, 1769): 'Shu Kamo'</td><td>PER</td><td>[1749, 1820): 'Japan coach Shu Kamo said: ' ' ...</td><td>1749</td><td>PER</td></tr>
<tr><th>36</th><td>[1784, 1790): 'Syrian'</td><td>MISC</td><td>[1749, 1820): 'Japan coach Shu Kamo said: ' ' ...</td><td>1749</td><td>MISC</td></tr>
<tr><th>37</th><td>[1825, 1832): 'Syrians'</td><td>MISC</td><td>[1821, 1925): 'The Syrians scored early and th...</td><td>1821</td><td>MISC</td></tr>
<tr><th>38</th><td>[1928, 1933): 'Japan'</td><td>LOC</td><td>[1928, 2049): 'Japan, co-hosts of the World Cu...</td><td>1928</td><td>LOC</td></tr>
<tr><th>39</th><td>[1951, 1960): 'World Cup'</td><td>MISC</td><td>[1928, 2049): 'Japan, co-hosts of the World Cu...</td><td>1928</td><td>MISC</td></tr>
<tr><th>40</th><td>[2001, 2005): 'FIFA'</td><td>ORG</td><td>[1928, 2049): 'Japan, co-hosts of the World Cu...</td><td>1928</td><td>ORG</td></tr>
<tr><th>41</th><td>[2056, 2059): 'UAE'</td><td>LOC</td><td>[2050, 2137): 'Hosts UAE play Kuwait and South...</td><td>2050</td><td>LOC</td></tr>
<tr><th>42</th><td>[2065, 2071): 'Kuwait'</td><td>LOC</td><td>[2050, 2137): 'Hosts UAE play Kuwait and South...</td><td>2050</td><td>LOC</td></tr>
<tr><th>43</th><td>[2076, 2087): 'South Korea'</td><td>LOC</td><td>[2050, 2137): 'Hosts UAE play Kuwait and South...</td><td>2050</td><td>LOC</td></tr>
<tr><th>44</th><td>[2096, 2105): 'Indonesia'</td><td>LOC</td><td>[2050, 2137): 'Hosts UAE play Kuwait and South...</td><td>2050</td><td>LOC</td></tr>
</table>
" ], "text/plain": [ " span ent_type \\\n", "0 [19, 24): 'JAPAN' LOC \n", "1 [40, 45): 'CHINA' PER \n", "2 [66, 77): 'Nadim Ladki' PER \n", "3 [78, 84): 'AL-AIN' LOC \n", "4 [86, 106): 'United Arab Emirates' LOC \n", "5 [118, 123): 'Japan' LOC \n", "6 [151, 160): 'Asian Cup' MISC \n", "7 [196, 201): 'Syria' LOC \n", "8 [249, 254): 'China' LOC \n", "9 [363, 373): 'Uzbekistan' LOC \n", "10 [375, 380): 'China' LOC \n", "11 [468, 473): 'Uzbek' MISC \n", "12 [482, 495): 'Igor Shkvyrin' PER \n", "13 [580, 587): 'Chinese' MISC \n", "14 [618, 632): 'Oleg Shatskiku' PER \n", "15 [747, 753): 'Soviet' MISC \n", "16 [781, 790): 'Asian Cup' MISC \n", "17 [842, 853): 'Asian Games' MISC \n", "18 [875, 885): 'Uzbekistan' LOC \n", "19 [982, 987): 'Japan' LOC \n", "20 [1072, 1077): 'Syria' LOC \n", "21 [1079, 1092): 'Takuya Takagi' PER \n", "22 [1148, 1168): 'Hiroshige Yanagimoto' PER \n", "23 [1187, 1193): 'Syrian' MISC \n", "24 [1216, 1227): 'Salem Bitar' PER \n", "25 [1328, 1333): 'Syria' LOC \n", "26 [1360, 1372): 'Hassan Abbas' PER \n", "27 [1489, 1494): 'Bitar' PER \n", "28 [1503, 1517): 'Nader Jokhadar' PER \n", "29 [1528, 1533): 'Syria' LOC \n", "30 [1592, 1597): 'Japan' LOC \n", "31 [1621, 1627): 'Syrian' MISC \n", "32 [1686, 1692): 'Syrian' MISC \n", "33 [1702, 1707): 'Bitar' PER \n", "34 [1749, 1754): 'Japan' LOC \n", "35 [1761, 1769): 'Shu Kamo' PER \n", "36 [1784, 1790): 'Syrian' MISC \n", "37 [1825, 1832): 'Syrians' MISC \n", "38 [1928, 1933): 'Japan' LOC \n", "39 [1951, 1960): 'World Cup' MISC \n", "40 [2001, 2005): 'FIFA' ORG \n", "41 [2056, 2059): 'UAE' LOC \n", "42 [2065, 2071): 'Kuwait' LOC \n", "43 [2076, 2087): 'South Korea' LOC \n", "44 [2096, 2105): 'Indonesia' LOC \n", "\n", " sentence sentence_id new_type \n", "0 [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... 11 LOC \n", "1 [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... 11 PER \n", "2 [66, 77): 'Nadim Ladki' 66 PER \n", "3 [78, 117): 'AL-AIN, United Arab Emirates 1996-... 
78 LOC \n", "4 [78, 117): 'AL-AIN, United Arab Emirates 1996-... 78 LOC \n", "5 [118, 244): 'Japan began the defence of their ... 118 LOC \n", "6 [118, 244): 'Japan began the defence of their ... 118 MISC \n", "7 [118, 244): 'Japan began the defence of their ... 118 LOC \n", "8 [245, 374): 'But China saw their luck desert t... 245 LOC \n", "9 [245, 374): 'But China saw their luck desert t... 245 LOC \n", "10 [375, 617): 'China controlled most of the matc... 375 LOC \n", "11 [375, 617): 'China controlled most of the matc... 375 MISC \n", "12 [375, 617): 'China controlled most of the matc... 375 PER \n", "13 [375, 617): 'China controlled most of the matc... 375 MISC \n", "14 [618, 735): 'Oleg Shatskiku made sure of the w... 618 PER \n", "15 [736, 821): 'The former Soviet republic was pl... 736 MISC \n", "16 [736, 821): 'The former Soviet republic was pl... 736 MISC \n", "17 [822, 917): 'Despite winning the Asian Games t... 822 MISC \n", "18 [822, 917): 'Despite winning the Asian Games t... 822 LOC \n", "19 [918, 1078): 'Two goals from defensive errors ... 918 LOC \n", "20 [918, 1078): 'Two goals from defensive errors ... 918 LOC \n", "21 [1079, 1291): 'Takuya Takagi scored the winner... 1079 PER \n", "22 [1079, 1291): 'Takuya Takagi scored the winner... 1079 PER \n", "23 [1079, 1291): 'Takuya Takagi scored the winner... 1079 MISC \n", "24 [1079, 1291): 'Takuya Takagi scored the winner... 1079 PER \n", "25 [1292, 1350): 'It was the second costly blunde... 1292 LOC \n", "26 [1351, 1502): 'Defender Hassan Abbas rose to i... 1351 PER \n", "27 [1351, 1502): 'Defender Hassan Abbas rose to i... 1351 PER \n", "28 [1503, 1591): 'Nader Jokhadar had given Syria ... 1503 PER \n", "29 [1503, 1591): 'Nader Jokhadar had given Syria ... 1503 LOC \n", "30 [1592, 1701): 'Japan then laid siege to the Sy... 1592 LOC \n", "31 [1592, 1701): 'Japan then laid siege to the Sy... 1592 MISC \n", "32 [1592, 1701): 'Japan then laid siege to the Sy... 
1592 MISC \n", "33 [1702, 1748): 'Bitar pulled off fine saves whe... 1702 PER \n", "34 [1749, 1820): 'Japan coach Shu Kamo said: ' ' ... 1749 LOC \n", "35 [1749, 1820): 'Japan coach Shu Kamo said: ' ' ... 1749 PER \n", "36 [1749, 1820): 'Japan coach Shu Kamo said: ' ' ... 1749 MISC \n", "37 [1821, 1925): 'The Syrians scored early and th... 1821 MISC \n", "38 [1928, 2049): 'Japan, co-hosts of the World Cu... 1928 LOC \n", "39 [1928, 2049): 'Japan, co-hosts of the World Cu... 1928 MISC \n", "40 [1928, 2049): 'Japan, co-hosts of the World Cu... 1928 ORG \n", "41 [2050, 2137): 'Hosts UAE play Kuwait and South... 2050 LOC \n", "42 [2050, 2137): 'Hosts UAE play Kuwait and South... 2050 LOC \n", "43 [2050, 2137): 'Hosts UAE play Kuwait and South... 2050 LOC \n", "44 [2050, 2137): 'Hosts UAE play Kuwait and South... 2050 LOC " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corrected_entities = entity_mentions.copy(True)\n", "new_types = corrected_entities[\"ent_type\"].copy()\n", "new_types[widget.selected] = \"ORG\"\n", "corrected_entities[\"new_type\"] = new_types\n", "corrected_entities" ] }, { "cell_type": "code", "execution_count": null, "id": "6d826972-07ef-40dd-9d60-534fe1198c1b", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 5 }