{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Preliminaries" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "TMP_NOTEBOOK_ROOT = Path(\"/tmp/bridge-ds/tutorials\")" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "# DisplayEngine\n", "In the previous tutorial, we've written a DatasetProvider for the text classification dataset [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/). However, we didn't use a custom DisplayEngine, so our visualization was lacking:" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "from bridge.display.basic import SimplePrints\n", "from bridge.providers.text import LargeMovieReviewDataset\n", "\n", "provider = LargeMovieReviewDataset(root=TMP_NOTEBOOK_ROOT / \"imdb\", split=\"train\", download=True)\n", "ds = provider.build_dataset(display_engine=SimplePrints())\n", "ds.iget(0).show()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## Class Structure\n", "We can improve this \"viz\" by writing our own DisplayEngine. For starters, let's see which methods we need to implement:\n", "\n", "```python\n", "class MyDisplayEngine(DisplayEngine):\n", " def show_element(\n", " self,\n", " element,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " pass\n", "\n", " def show_sample(\n", " self,\n", " sample,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " sample_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " pass\n", "\n", " def show_dataset(\n", " self,\n", " dataset,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " sample_plot_kwargs: Dict[str, Any] | None = None,\n", " dataset_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " pass\n", "```\n", "\n", "Seems straightforward enough. the DisplayEngine object implements methods to show individual annotations, samples, and datasets. \n", "\n", "Let's build our own DisplayEngine from the bottom up, starting with text and class elements. We will use [Panel](https://panel.holoviz.org/), but sure enough you can implement your own however you'd like:" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "from typing import Any, Dict\n", "\n", "import panel as pn\n", "\n", "from bridge.display.basic import DisplayEngine\n", "\n", "pn.extension()\n", "\n", "\n", "class TextClassification(DisplayEngine):\n", " def show_element(self, element, element_plot_kwargs: Dict[str, Any] | None = None):\n", " if element.etype == \"class_label\":\n", " return pn.pane.Markdown(element.to_pd_series().to_frame().T.to_markdown())\n", " elif element.etype == \"text\":\n", " return pn.pane.Markdown(element.data)\n", " else:\n", " raise NotImplementedError()\n", "\n", " def show_sample(\n", " self,\n", " sample,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " sample_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " pass\n", "\n", " def show_dataset(\n", " self,\n", " dataset,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " sample_plot_kwargs: Dict[str, Any] | None = None,\n", " dataset_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " pass" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "To test it:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "engine = TextClassification()\n", "sample = ds.iget(0)\n", "text_element = sample.element # SingularSample exposes the text element specifically\n", "label_element = sample.annotations[\"class_label\"][0] # the class labels in this case are annotations\n", "\n", "pn.Column(engine.show_element(label_element), engine.show_element(text_element))" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "Looks good. Now, if we want to display an entire sample rather than individual elements:" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "from typing import Any, Dict\n", "\n", "import pandas as pd\n", "import panel as pn\n", "\n", "pn.extension()\n", "\n", "\n", "class TextClassification(DisplayEngine):\n", " def show_element(self, element, element_plot_kwargs: Dict[str, Any] | None = None):\n", " if element.etype == \"class_label\":\n", " return pn.pane.Markdown(element.to_pd_series().to_frame().T.to_markdown())\n", " elif element.etype == \"text\":\n", " return pn.pane.Markdown(element.data)\n", " else:\n", " raise NotImplementedError()\n", "\n", " def show_sample(\n", " self,\n", " sample,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " sample_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " annotations_md = pd.DataFrame([ann.to_pd_series() for ann in sample.annotations[\"class_label\"]]).to_markdown()\n", " text_display = pn.pane.Markdown(sample.data)\n", " return pn.Column(\"# Sample Text:\", text_display, \"# Annotations Data:\", annotations_md)\n", "\n", " def show_dataset(\n", " self,\n", " dataset,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " sample_plot_kwargs: Dict[str, Any] | None = None,\n", " dataset_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "engine = TextClassification()\n", "engine.show_sample(ds.iget(0))" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "Good. Finally, let's use the Panel DiscreteSlider widget to create an interface to browse all samples in our Dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "from typing import Any, Dict\n", "\n", "\n", "class TextClassification(DisplayEngine):\n", " def show_element(self, element, element_plot_kwargs: Dict[str, Any] | None = None):\n", " if element.etype == \"class_label\":\n", " return pn.pane.Markdown(element.to_pd_series().to_frame().T.to_markdown())\n", " elif element.etype == \"text\":\n", " return pn.pane.Markdown(element.data)\n", " else:\n", " raise NotImplementedError()\n", "\n", " def show_sample(\n", " self,\n", " sample,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " sample_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " annotations_md = pd.DataFrame([ann.to_pd_series() for ann in sample.annotations[\"class_label\"]]).to_markdown()\n", " text_display = pn.pane.Markdown(sample.data)\n", " return pn.Column(\"# Sample Text:\", text_display, \"# Annotations Data:\", annotations_md)\n", "\n", " def show_dataset(\n", " self,\n", " dataset,\n", " element_plot_kwargs: Dict[str, Any] | None = None,\n", " sample_plot_kwargs: Dict[str, Any] | None = None,\n", " dataset_plot_kwargs: Dict[str, Any] | None = None,\n", " ):\n", " sample_ids = dataset.sample_ids\n", " sample_ids_wig = pn.widgets.DiscreteSlider(name=\"Sample ID\", options=sample_ids, value=sample_ids[0])\n", "\n", " @pn.depends(sample_ids_wig.param.value)\n", " def plot_sample_by_widget(sample_id):\n", " return self.show_sample(dataset.get(sample_id), element_plot_kwargs, sample_plot_kwargs)\n", "\n", " return pn.Column(sample_ids_wig, plot_sample_by_widget)" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "engine = TextClassification()\n", "engine.show_dataset(ds)" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "Done! Now we have an operable interface to browse our dataset. \n", "\n", "We can also include our `TextClassification` engine right when the dataset is built. The following code is enough to reproduce everything we've written so far:" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "ds = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / \"imdb\", split=\"train\", download=False).build_dataset(\n", " display_engine=TextClassification()\n", ")\n", "\n", "ds = ds.select_samples(lambda samples, anns: anns[anns.data != \"unsup\"].index.get_level_values(\"sample_id\"))\n", "ds.show()" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "## In Summary\n", "1. DisplayEngines are tools used to visualize data through the methods `element.show`, `sample.show()`, `ds.show()`\n", "2. We built a DisplayEngine using Holoviz Panel, but this is not a requirement and users can implement their own DisplayEngines using whichever libraries they'd like.\n", "\n", "## Up Next\n", "\n", "So far, we've learned how to create Bridge Datasets and how to use them. In the following tutorials we will learn how to transform these Datasets into ones which are usable to train models (e.g. into PyTorch Datasets)." ] } ], "metadata": { "kernelspec": { "display_name": "python3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }