{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "BsRfx3DyAHC5" }, "source": [ "# Exploring Unicode categories\n", "\n", "Anyone working with data will sooner or later come across Unicode. In this notebook we download Unicode data directly from the official source and explore the general categories of Unicode characters using interactive Bokeh tables." ] }, { "cell_type": "markdown", "metadata": { "id": "mRDzwN2HAHC6" }, "source": [ "The file `UnicodeData.txt`, available from [unicode.org](https://www.unicode.org), the home of Unicode, is the core of the Unicode Character Database (UCD). It is a semicolon-delimited ASCII file that serves as the foundational data source for Unicode-related functionality, including Python's [unicodedata](https://docs.python.org/3/library/unicodedata.html) module.\n", "\n", "The file contains detailed information about each Unicode character, including its properties and its general category. The structure of `UnicodeData.txt` (i.e., its columns) and the Unicode general categories are documented in [Unicode® Standard Annex #44](https://www.unicode.org/reports/tr44). The annex has dedicated sections on the [columns of UnicodeData.txt](https://www.unicode.org/reports/tr44/#UnicodeData.txt) and the [list of categories](https://www.unicode.org/reports/tr44/#GC_Values_Table)." ] }, { "cell_type": "markdown", "metadata": { "id": "cX1_7jw6AHC6" }, "source": [ "## Download the data\n", "\n", "Download the file `UnicodeData.txt` and load it into the pandas dataframe `df`. Then download the general category aliases from `PropertyValueAliases.txt` and store them in the dictionary `categories`, which maps each two-letter abbreviation to its long name." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-13T09:45:19.959122Z", "iopub.status.busy": "2024-10-13T09:45:19.958567Z", "iopub.status.idle": "2024-10-13T09:45:20.809813Z", "shell.execute_reply": "2024-10-13T09:45:20.809119Z" }, "id": "UVJYpfi7AHC7", "outputId": "30716ba4-572a-4342-c08a-0086acc70fa5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of Unicode data dataframe: 40116\n" ] } ], "source": [ "import urllib.request\n", "import pandas as pd\n", "from collections import defaultdict\n", "import re\n", "\n", "url = 'https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt' # unicode data\n", "url_cat = 'https://unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt' # categories\n", "\n", "# UnicodeData.txt is semicolon-delimited and has no header row\n", "df = pd.read_csv(url, sep=\";\", header=None)\n", "df.columns = [str(c) for c in df.columns]\n", "\n", "# Map each two-letter general category abbreviation to its long name\n", "categories = defaultdict(str)\n", "for line in urllib.request.urlopen(url_cat).readlines():\n", "    line = line.decode('utf-8')\n", "    if line.startswith('gc'):\n", "        name, desc = re.split(r'\\s*;\\s*', line.strip())[1:3]\n", "        desc = desc.split()[0]  # drop any trailing comment\n", "        if len(name)>1:  # skip the one-letter groupings (C, L, M, ...)\n", "            categories[name] = desc\n", "\n", "\n", "print(\"Length of Unicode data dataframe: {}\".format(len(df.index)))" ] }, { "cell_type": "markdown", "metadata": { "id": "9I72cVMQAHC7" }, "source": [ "## List categories\n", "\n", "`PropertyValueAliases.txt` lists 31 two-letter general category values: the 30 categories that are assigned to characters plus the grouping `LC` (Cased_Letter)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 445 }, "execution": { "iopub.execute_input": "2024-10-13T09:45:20.841426Z", "iopub.status.busy": "2024-10-13T09:45:20.840951Z", "iopub.status.idle": "2024-10-13T09:45:21.224090Z", "shell.execute_reply": "2024-10-13T09:45:21.223507Z" }, "id": "kR4pw-7tAHC8", "outputId": "5266f1c2-78d7-400f-bfb1-ca9375505d84" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " const force = true;\n", "\n", " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", " root._bokeh_onload_callbacks = [];\n", " root._bokeh_is_loading = undefined;\n", " }\n", "\n", "const JS_MIME_TYPE = 'application/javascript';\n", " const HTML_MIME_TYPE = 'text/html';\n", " const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", " const CLASS_NAME = 'output_bokeh rendered_html';\n", "\n", " /**\n", " * Render data to the DOM node\n", " */\n", " function render(props, node) {\n", " const script = document.createElement(\"script\");\n", " node.appendChild(script);\n", " }\n", "\n", " /**\n", " * Handle when an output is cleared or removed\n", " */\n", " function handleClearOutput(event, handle) {\n", " const cell = handle.cell;\n", "\n", " const id = cell.output_area._bokeh_element_id;\n", " const server_id = cell.output_area._bokeh_server_id;\n", " // Clean up Bokeh references\n", " if (id != null && id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", "\n", " if (server_id !== undefined) {\n", " // Clean up Bokeh references\n", " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", " cell.notebook.kernel.execute(cmd_clean, {\n", " iopub: {\n", " output: function(msg) {\n", " const id = msg.content.text.trim();\n", " if (id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", " }\n", " }\n", " });\n", " // Destroy server and session\n", " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", " cell.notebook.kernel.execute(cmd_destroy);\n", " }\n", " }\n", "\n", " /**\n", " * 
Handle when a new output is added\n", " */\n", " function handleAddOutput(event, handle) {\n", " const output_area = handle.output_area;\n", " const output = handle.output;\n", "\n", " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", " return\n", " }\n", "\n", " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", "\n", " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", " // store reference to embed id on output_area\n", " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", " }\n", " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", " const bk_div = document.createElement(\"div\");\n", " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", " const script_attrs = bk_div.children[0].attributes;\n", " for (let i = 0; i < script_attrs.length; i++) {\n", " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", " }\n", " // store reference to server id on output_area\n", " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", " }\n", " }\n", "\n", " function register_renderer(events, OutputArea) {\n", "\n", " function append_mime(data, metadata, element) {\n", " // create a DOM node to render to\n", " const toinsert = this.create_output_subarea(\n", " metadata,\n", " CLASS_NAME,\n", " EXEC_MIME_TYPE\n", " );\n", " this.keyboard_manager.register_events(toinsert);\n", " // Render to node\n", " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", " render(props, toinsert[toinsert.length - 1]);\n", " element.append(toinsert);\n", " return toinsert\n", " }\n", "\n", " 
/* Handle when an output is cleared or removed */\n", " events.on('clear_output.CodeCell', handleClearOutput);\n", " events.on('delete.Cell', handleClearOutput);\n", "\n", " /* Handle when a new output is added */\n", " events.on('output_added.OutputArea', handleAddOutput);\n", "\n", " /**\n", " * Register the mime type and append_mime function with output_area\n", " */\n", " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", " /* Is output safe? */\n", " safe: true,\n", " /* Index of renderer in `output_area.display_order` */\n", " index: 0\n", " });\n", " }\n", "\n", " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", " if (root.Jupyter !== undefined) {\n", " const events = require('base/js/events');\n", " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", "\n", " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", " register_renderer(events, OutputArea);\n", " }\n", " }\n", " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", " root._bokeh_timeout = Date.now() + 5000;\n", " root._bokeh_failed_load = false;\n", " }\n", "\n", " const NB_LOAD_WARNING = {'data': {'text/html':\n", " \"\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"
\\n\"+\n \"\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"
\\n\"+\n \"