{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Combining Hugging Face datasets with dask\n", "\n", "> Using 🤗 datasets in combination with dask \n", "\n", "- toc: true \n", "- badges: false\n", "- comments: true\n", "- categories: [huggingface, huggingface-datasets, dask]\n", "- search_exclude: false\n", "- badges: true\n", "- image: images/dask_plot_example.png" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hugging Face datasets is a super useful library for loading, processing and sharing datasets with other people. \n", "\n", "For many pre-processing steps it works beautifully. The one area where it can be a bit trickier to use is EDA-style (exploratory data analysis) work. This column-wise EDA is often important as an early step in working with some data or for preparing a data card. \n", "\n", "Fortunately, combining datasets with another data library, [dask](https://www.dask.org/), works pretty smoothly. This isn't intended to be a full intro to either datasets or dask but hopefully gives you a sense of how both libraries work and how they can complement each other. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, make sure we have the required libraries. [Rich](https://rich.readthedocs.io/en/stable/) is there for a little added visual flair ✨ " ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "id": "lIYdn1woOS1n" }, "outputs": [], "source": [ "%%capture\n", "!pip install datasets toolz rich[jupyter] dask[distributed]" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "gKumkrtPnvdg" }, "outputs": [], "source": [ "%load_ext rich" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load some data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example we will use the [blbooksgenre dataset](https://huggingface.co/datasets/blbooksgenre) that contains metadata about some digitised books from the British Library. 
This collection also includes some annotations for the genre of the book which we could use to train a machine learning model. \n", "\n", "We can load a dataset hosted on the Hugging Face hub by using the `load_dataset` function." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "21uMQyUFhryg" }, "outputs": [], "source": [ "from datasets import load_dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 87, "referenced_widgets": [ "408dd31921ee40bb9c9fc8995d4b8577", "5e1150b6dfb8494195191efaeb5b7feb", "6edda0b1149541499ab485df06711944", "45fcaeca1a184768a624607172ab7d72", "ea28bfd1711148c29b494a0194d2ffbf", "b3eaa93967914d7f8977d736611b845f", "5d3ee4cdf67e44878bac361e143b828a", "bff82a2eab2a4f7d87f8037e30839050", "0c0752e8e9024977979562531ad5a4b7", "4ccb508cf3ab4001a30cea47d2448e23", "88809da60e3846dc886cd2545edbc5c3" ] }, "id": "P8C8Ljd1i1zj", "outputId": "1b619316-4f2d-4f07-862b-08747f6a715e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Reusing dataset bl_books_genre (/Users/dvanstrien/.cache/huggingface/datasets/bl_books_genre/annotated_raw/1.1.0/1e01f82403b3d9344121c3b81e5ad7c130338b250bf95dad4c6ab342c642dbe8)\n" ] } ], "source": [ "ds = load_dataset(\"blbooksgenre\", \"annotated_raw\", split=\"train\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we requested only the train split we get back a `Dataset`" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 131 }, "id": "Wx9ZlPhhjAiC", "outputId": "d7e74115-796c-4acd-9f15-aae838cffdee" }, "outputs": [ { "data": { "text/html": [ "
\n",
       "Dataset({\n",
       "    features: ['BL record ID', 'Name', 'Dates associated with name', 'Type of name', 'Role', 'All names', 'Title', 'Variant titles', 'Series title', 'Number within series', 'Country of publication', 'Place of publication', 'Publisher', 'Date of publication', 'Edition', 'Physical description', 'Dewey classification', 'BL shelfmark', 'Topics', 'Genre', 'Languages', 'Notes', 'BL record ID for physical resource', 'classification_id', 'user_id', 'subject_ids', 'annotator_date_pub', 'annotator_normalised_date_pub', 'annotator_edition_statement', 'annotator_FAST_genre_terms', 'annotator_FAST_subject_terms', 'annotator_comments', 'annotator_main_language', 'annotator_other_languages_summaries', 'annotator_summaries_language', 'annotator_translation', 'annotator_original_language', 'annotator_publisher', 'annotator_place_pub', 'annotator_country', 'annotator_title', 'Link to digitised book', 'annotated', 'Type of resource', 'created_at', 'annotator_genre'],\n",
       "    num_rows: 4398\n",
       "})\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1;35mDataset\u001b[0m\u001b[1m(\u001b[0m\u001b[1m{\u001b[0m\n", " features: \u001b[1m[\u001b[0m\u001b[32m'BL record ID'\u001b[0m, \u001b[32m'Name'\u001b[0m, \u001b[32m'Dates associated with name'\u001b[0m, \u001b[32m'Type of name'\u001b[0m, \u001b[32m'Role'\u001b[0m, \u001b[32m'All names'\u001b[0m, \u001b[32m'Title'\u001b[0m, \u001b[32m'Variant titles'\u001b[0m, \u001b[32m'Series title'\u001b[0m, \u001b[32m'Number within series'\u001b[0m, \u001b[32m'Country of publication'\u001b[0m, \u001b[32m'Place of publication'\u001b[0m, \u001b[32m'Publisher'\u001b[0m, \u001b[32m'Date of publication'\u001b[0m, \u001b[32m'Edition'\u001b[0m, \u001b[32m'Physical description'\u001b[0m, \u001b[32m'Dewey classification'\u001b[0m, \u001b[32m'BL shelfmark'\u001b[0m, \u001b[32m'Topics'\u001b[0m, \u001b[32m'Genre'\u001b[0m, \u001b[32m'Languages'\u001b[0m, \u001b[32m'Notes'\u001b[0m, \u001b[32m'BL record ID for physical resource'\u001b[0m, \u001b[32m'classification_id'\u001b[0m, \u001b[32m'user_id'\u001b[0m, \u001b[32m'subject_ids'\u001b[0m, \u001b[32m'annotator_date_pub'\u001b[0m, \u001b[32m'annotator_normalised_date_pub'\u001b[0m, \u001b[32m'annotator_edition_statement'\u001b[0m, \u001b[32m'annotator_FAST_genre_terms'\u001b[0m, \u001b[32m'annotator_FAST_subject_terms'\u001b[0m, \u001b[32m'annotator_comments'\u001b[0m, \u001b[32m'annotator_main_language'\u001b[0m, \u001b[32m'annotator_other_languages_summaries'\u001b[0m, \u001b[32m'annotator_summaries_language'\u001b[0m, \u001b[32m'annotator_translation'\u001b[0m, \u001b[32m'annotator_original_language'\u001b[0m, \u001b[32m'annotator_publisher'\u001b[0m, \u001b[32m'annotator_place_pub'\u001b[0m, \u001b[32m'annotator_country'\u001b[0m, \u001b[32m'annotator_title'\u001b[0m, \u001b[32m'Link to digitised book'\u001b[0m, \u001b[32m'annotated'\u001b[0m, \u001b[32m'Type of resource'\u001b[0m, \u001b[32m'created_at'\u001b[0m, \u001b[32m'annotator_genre'\u001b[0m\u001b[1m]\u001b[0m,\n", " num_rows: 
\u001b[1;36m4398\u001b[0m\n", "\u001b[1m}\u001b[0m\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see this has a bunch of columns. One that is of interest is the `Date of publication` column. Since we could use this dataset to train some type of classifier, we may want to check whether we have enough examples across different time periods in the dataset. " ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "Iz1qpri2jA7F", "outputId": "6927a531-4649-4ec8-b40e-0d467f735a8d" }, "outputs": [ { "data": { "text/html": [ "
'1879'\n",
       "
\n" ], "text/plain": [ "\u001b[32m'1879'\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ds[0][\"Date of publication\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using toolz to calculate frequencies for a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One quick way we can get the frequency count for a column is using the wonderful [toolz](https://toolz.readthedocs.io/en/latest/index.html) library.\n", "\n", "If our data fits in memory, we can simply pass a column containing categorical values to the toolz `frequencies` function to get a count for each value. " ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "id": "wipOCP-wjB3Y" }, "outputs": [], "source": [ "from toolz import frequencies, topk" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "iy8JiqQNjFUw" }, "outputs": [], "source": [ "dates = ds[\"Date of publication\"]" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "s1WXFnFljx5B", "outputId": "9d4e5792-30e1-4279-9534-4574a39d321b" }, "outputs": [ { "data": { "text/html": [ "
\n",
       "{\n",
       "    '1879': 99,\n",
       "    '1774': 5,\n",
       "    '1765': 5,\n",
       "    '1877': 69,\n",
       "    '1893': 222,\n",
       "    '1891': 148,\n",
       "    '1827': 29,\n",
       "    '1868': 42,\n",
       "    '1878': 72,\n",
       "    '1895': 189,\n",
       "    '1897': 120,\n",
       "    '1899': 104,\n",
       "    '1896': 174,\n",
       "    '1876': 48,\n",
       "    '1812': 13,\n",
       "    '1799': 8,\n",
       "    '1830': 32,\n",
       "    '1870': 42,\n",
       "    '1894': 155,\n",
       "    '1864': 28,\n",
       "    '1855': 42,\n",
       "    '1871': 42,\n",
       "    '1836': 37,\n",
       "    '1883': 51,\n",
       "    '1880': 111,\n",
       "    '1884': 69,\n",
       "    '1822': 16,\n",
       "    '1856': 38,\n",
       "    '1872': 42,\n",
       "    '1875': 57,\n",
       "    '1844': 35,\n",
       "    '1890': 134,\n",
       "    '1886': 43,\n",
       "    '1840': 15,\n",
       "    '1888': 109,\n",
       "    '1858': 43,\n",
       "    '1867': 53,\n",
       "    '1826': 24,\n",
       "    '1800': 3,\n",
       "    '1851': 43,\n",
       "    '1838': 14,\n",
       "    '1824': 20,\n",
       "    '1887': 58,\n",
       "    '1874': 42,\n",
       "    '1857': 44,\n",
       "    '1873': 34,\n",
       "    '1837': 16,\n",
       "    '1846': 32,\n",
       "    '1881': 55,\n",
       "    '1898': 104,\n",
       "    '1906': 4,\n",
       "    '1892': 134,\n",
       "    '1869': 25,\n",
       "    '1885': 69,\n",
       "    '1882': 71,\n",
       "    '1863': 55,\n",
       "    '1865': 53,\n",
       "    '1635': 3,\n",
       "    '1859': 39,\n",
       "    '1818': 17,\n",
       "    '1845': 28,\n",
       "    '1852': 43,\n",
       "    '1841': 23,\n",
       "    '1842': 29,\n",
       "    '1848': 28,\n",
       "    '1828': 23,\n",
       "    '1850': 38,\n",
       "    '1860': 45,\n",
       "    '1889': 140,\n",
       "    '1815': 5,\n",
       "    '1861': 28,\n",
       "    '1814': 13,\n",
       "    '1843': 28,\n",
       "    '1817': 12,\n",
       "    '1819': 16,\n",
       "    '1853': 34,\n",
       "    '1833': 5,\n",
       "    '1854': 36,\n",
       "    '1839': 33,\n",
       "    '1803': 7,\n",
       "    '1835': 14,\n",
       "    '1813': 8,\n",
       "    '1695': 4,\n",
       "    '1809-1811': 5,\n",
       "    '1832': 9,\n",
       "    '1823': 17,\n",
       "    '1847': 28,\n",
       "    '1816': 8,\n",
       "    '1806': 5,\n",
       "    '1866': 26,\n",
       "    '1829': 13,\n",
       "    '1791': 5,\n",
       "    '1637': 5,\n",
       "    '1821': 4,\n",
       "    '1807': 14,\n",
       "    '1862': 22,\n",
       "    '1795': 5,\n",
       "    '1834': 12,\n",
       "    '1831': 10,\n",
       "    '1849': 13,\n",
       "    '1811': 1,\n",
       "    '1825': 1,\n",
       "    '1809': 3,\n",
       "    '1905': 1,\n",
       "    '1808': 1,\n",
       "    '1900': 5,\n",
       "    '1892-1912': 1,\n",
       "    '1804': 4,\n",
       "    '1769': 5,\n",
       "    '1910': 1,\n",
       "    '1805': 5,\n",
       "    '1802': 3,\n",
       "    '1871-': 1,\n",
       "    '1901': 5,\n",
       "    '1884-1909': 1,\n",
       "    '1873-1887': 1,\n",
       "    '1979': 1,\n",
       "    '1852-1941': 1,\n",
       "    '1903': 1,\n",
       "    '1871-1873': 1,\n",
       "    '1810': 3,\n",
       "    '1907': 1,\n",
       "    '1820': 5,\n",
       "    '1789': 5\n",
       "}\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1m{\u001b[0m\n", " \u001b[32m'1879'\u001b[0m: \u001b[1;36m99\u001b[0m,\n", " \u001b[32m'1774'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1765'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1877'\u001b[0m: \u001b[1;36m69\u001b[0m,\n", " \u001b[32m'1893'\u001b[0m: \u001b[1;36m222\u001b[0m,\n", " \u001b[32m'1891'\u001b[0m: \u001b[1;36m148\u001b[0m,\n", " \u001b[32m'1827'\u001b[0m: \u001b[1;36m29\u001b[0m,\n", " \u001b[32m'1868'\u001b[0m: \u001b[1;36m42\u001b[0m,\n", " \u001b[32m'1878'\u001b[0m: \u001b[1;36m72\u001b[0m,\n", " \u001b[32m'1895'\u001b[0m: \u001b[1;36m189\u001b[0m,\n", " \u001b[32m'1897'\u001b[0m: \u001b[1;36m120\u001b[0m,\n", " \u001b[32m'1899'\u001b[0m: \u001b[1;36m104\u001b[0m,\n", " \u001b[32m'1896'\u001b[0m: \u001b[1;36m174\u001b[0m,\n", " \u001b[32m'1876'\u001b[0m: \u001b[1;36m48\u001b[0m,\n", " \u001b[32m'1812'\u001b[0m: \u001b[1;36m13\u001b[0m,\n", " \u001b[32m'1799'\u001b[0m: \u001b[1;36m8\u001b[0m,\n", " \u001b[32m'1830'\u001b[0m: \u001b[1;36m32\u001b[0m,\n", " \u001b[32m'1870'\u001b[0m: \u001b[1;36m42\u001b[0m,\n", " \u001b[32m'1894'\u001b[0m: \u001b[1;36m155\u001b[0m,\n", " \u001b[32m'1864'\u001b[0m: \u001b[1;36m28\u001b[0m,\n", " \u001b[32m'1855'\u001b[0m: \u001b[1;36m42\u001b[0m,\n", " \u001b[32m'1871'\u001b[0m: \u001b[1;36m42\u001b[0m,\n", " \u001b[32m'1836'\u001b[0m: \u001b[1;36m37\u001b[0m,\n", " \u001b[32m'1883'\u001b[0m: \u001b[1;36m51\u001b[0m,\n", " \u001b[32m'1880'\u001b[0m: \u001b[1;36m111\u001b[0m,\n", " \u001b[32m'1884'\u001b[0m: \u001b[1;36m69\u001b[0m,\n", " \u001b[32m'1822'\u001b[0m: \u001b[1;36m16\u001b[0m,\n", " \u001b[32m'1856'\u001b[0m: \u001b[1;36m38\u001b[0m,\n", " \u001b[32m'1872'\u001b[0m: \u001b[1;36m42\u001b[0m,\n", " \u001b[32m'1875'\u001b[0m: \u001b[1;36m57\u001b[0m,\n", " \u001b[32m'1844'\u001b[0m: \u001b[1;36m35\u001b[0m,\n", " \u001b[32m'1890'\u001b[0m: \u001b[1;36m134\u001b[0m,\n", " \u001b[32m'1886'\u001b[0m: \u001b[1;36m43\u001b[0m,\n", " 
\u001b[32m'1840'\u001b[0m: \u001b[1;36m15\u001b[0m,\n", " \u001b[32m'1888'\u001b[0m: \u001b[1;36m109\u001b[0m,\n", " \u001b[32m'1858'\u001b[0m: \u001b[1;36m43\u001b[0m,\n", " \u001b[32m'1867'\u001b[0m: \u001b[1;36m53\u001b[0m,\n", " \u001b[32m'1826'\u001b[0m: \u001b[1;36m24\u001b[0m,\n", " \u001b[32m'1800'\u001b[0m: \u001b[1;36m3\u001b[0m,\n", " \u001b[32m'1851'\u001b[0m: \u001b[1;36m43\u001b[0m,\n", " \u001b[32m'1838'\u001b[0m: \u001b[1;36m14\u001b[0m,\n", " \u001b[32m'1824'\u001b[0m: \u001b[1;36m20\u001b[0m,\n", " \u001b[32m'1887'\u001b[0m: \u001b[1;36m58\u001b[0m,\n", " \u001b[32m'1874'\u001b[0m: \u001b[1;36m42\u001b[0m,\n", " \u001b[32m'1857'\u001b[0m: \u001b[1;36m44\u001b[0m,\n", " \u001b[32m'1873'\u001b[0m: \u001b[1;36m34\u001b[0m,\n", " \u001b[32m'1837'\u001b[0m: \u001b[1;36m16\u001b[0m,\n", " \u001b[32m'1846'\u001b[0m: \u001b[1;36m32\u001b[0m,\n", " \u001b[32m'1881'\u001b[0m: \u001b[1;36m55\u001b[0m,\n", " \u001b[32m'1898'\u001b[0m: \u001b[1;36m104\u001b[0m,\n", " \u001b[32m'1906'\u001b[0m: \u001b[1;36m4\u001b[0m,\n", " \u001b[32m'1892'\u001b[0m: \u001b[1;36m134\u001b[0m,\n", " \u001b[32m'1869'\u001b[0m: \u001b[1;36m25\u001b[0m,\n", " \u001b[32m'1885'\u001b[0m: \u001b[1;36m69\u001b[0m,\n", " \u001b[32m'1882'\u001b[0m: \u001b[1;36m71\u001b[0m,\n", " \u001b[32m'1863'\u001b[0m: \u001b[1;36m55\u001b[0m,\n", " \u001b[32m'1865'\u001b[0m: \u001b[1;36m53\u001b[0m,\n", " \u001b[32m'1635'\u001b[0m: \u001b[1;36m3\u001b[0m,\n", " \u001b[32m'1859'\u001b[0m: \u001b[1;36m39\u001b[0m,\n", " \u001b[32m'1818'\u001b[0m: \u001b[1;36m17\u001b[0m,\n", " \u001b[32m'1845'\u001b[0m: \u001b[1;36m28\u001b[0m,\n", " \u001b[32m'1852'\u001b[0m: \u001b[1;36m43\u001b[0m,\n", " \u001b[32m'1841'\u001b[0m: \u001b[1;36m23\u001b[0m,\n", " \u001b[32m'1842'\u001b[0m: \u001b[1;36m29\u001b[0m,\n", " \u001b[32m'1848'\u001b[0m: \u001b[1;36m28\u001b[0m,\n", " \u001b[32m'1828'\u001b[0m: \u001b[1;36m23\u001b[0m,\n", " \u001b[32m'1850'\u001b[0m: \u001b[1;36m38\u001b[0m,\n", " \u001b[32m'1860'\u001b[0m: 
\u001b[1;36m45\u001b[0m,\n", " \u001b[32m'1889'\u001b[0m: \u001b[1;36m140\u001b[0m,\n", " \u001b[32m'1815'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1861'\u001b[0m: \u001b[1;36m28\u001b[0m,\n", " \u001b[32m'1814'\u001b[0m: \u001b[1;36m13\u001b[0m,\n", " \u001b[32m'1843'\u001b[0m: \u001b[1;36m28\u001b[0m,\n", " \u001b[32m'1817'\u001b[0m: \u001b[1;36m12\u001b[0m,\n", " \u001b[32m'1819'\u001b[0m: \u001b[1;36m16\u001b[0m,\n", " \u001b[32m'1853'\u001b[0m: \u001b[1;36m34\u001b[0m,\n", " \u001b[32m'1833'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1854'\u001b[0m: \u001b[1;36m36\u001b[0m,\n", " \u001b[32m'1839'\u001b[0m: \u001b[1;36m33\u001b[0m,\n", " \u001b[32m'1803'\u001b[0m: \u001b[1;36m7\u001b[0m,\n", " \u001b[32m'1835'\u001b[0m: \u001b[1;36m14\u001b[0m,\n", " \u001b[32m'1813'\u001b[0m: \u001b[1;36m8\u001b[0m,\n", " \u001b[32m'1695'\u001b[0m: \u001b[1;36m4\u001b[0m,\n", " \u001b[32m'1809-1811'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1832'\u001b[0m: \u001b[1;36m9\u001b[0m,\n", " \u001b[32m'1823'\u001b[0m: \u001b[1;36m17\u001b[0m,\n", " \u001b[32m'1847'\u001b[0m: \u001b[1;36m28\u001b[0m,\n", " \u001b[32m'1816'\u001b[0m: \u001b[1;36m8\u001b[0m,\n", " \u001b[32m'1806'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1866'\u001b[0m: \u001b[1;36m26\u001b[0m,\n", " \u001b[32m'1829'\u001b[0m: \u001b[1;36m13\u001b[0m,\n", " \u001b[32m'1791'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1637'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1821'\u001b[0m: \u001b[1;36m4\u001b[0m,\n", " \u001b[32m'1807'\u001b[0m: \u001b[1;36m14\u001b[0m,\n", " \u001b[32m'1862'\u001b[0m: \u001b[1;36m22\u001b[0m,\n", " \u001b[32m'1795'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1834'\u001b[0m: \u001b[1;36m12\u001b[0m,\n", " \u001b[32m'1831'\u001b[0m: \u001b[1;36m10\u001b[0m,\n", " \u001b[32m'1849'\u001b[0m: \u001b[1;36m13\u001b[0m,\n", " \u001b[32m'1811'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1825'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " 
\u001b[32m'1809'\u001b[0m: \u001b[1;36m3\u001b[0m,\n", " \u001b[32m'1905'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1808'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1900'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1892-1912'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1804'\u001b[0m: \u001b[1;36m4\u001b[0m,\n", " \u001b[32m'1769'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1910'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1805'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1802'\u001b[0m: \u001b[1;36m3\u001b[0m,\n", " \u001b[32m'1871-'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1901'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1884-1909'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1873-1887'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1979'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1852-1941'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1903'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1871-1873'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1810'\u001b[0m: \u001b[1;36m3\u001b[0m,\n", " \u001b[32m'1907'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", " \u001b[32m'1820'\u001b[0m: \u001b[1;36m5\u001b[0m,\n", " \u001b[32m'1789'\u001b[0m: \u001b[1;36m5\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# collapse_hide\n", "\n", "frequencies(dates)" ] }, { "cell_type": "markdown", "metadata": { "id": "ko_frygKn1go" }, "source": [ "## Make it parallel!\n", "\n", "If our data doesn't fit in memory or we want to do things in parallel we might want to use a slightly different approach. This is where dask can play a role. \n", "\n", "Dask offers a number of different collection abstractions that make it easier to do things in parallel. 
One of these is dask bag.\n", "\n", "First we'll create a dask client. I won't dig into the details here, but you can get a good overview in the [getting started](https://www.dask.org/get-started) pages. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from distributed import Client" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "client = Client()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we don't want to load all of our data into memory, we can create a generator that will yield one row at a time. In this case we'll start by exploring the `Title` column. " ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "id": "modIC-hIj0Vs" }, "outputs": [], "source": [ "def yield_titles():\n", "    for row in ds:\n", "        yield row[\"Title\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that this returns a generator " ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qDFTGtUuj6xk", "outputId": "0fe3d283-19ec-4c47-e851-dca5292759c5" }, "outputs": [ { "data": { "text/html": [ "
<generator object yield_titles at 0x7ffc28fdc040>\n",
       "
\n" ], "text/plain": [ "\u001b[1m<\u001b[0m\u001b[1;95mgenerator\u001b[0m\u001b[39m object yield_titles at \u001b[0m\u001b[1;36m0x7ffc28fdc040\u001b[0m\u001b[1m>\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "yield_titles()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "y3R8YVXykUQX", "outputId": "91bdfcae-1b6e-4c6b-a586-7375b00ee9bc" }, "outputs": [ { "data": { "text/html": [ "
'The Canadian farmer. A missionary incident [Signed: W. J. H. Y, i.e. William J. H. Yates.]'\n",
       "
\n" ], "text/plain": [ "\u001b[32m'The Canadian farmer. A missionary incident \u001b[0m\u001b[32m[\u001b[0m\u001b[32mSigned: W. J. H. Y, i.e. William J. H. Yates.\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "next(iter(yield_titles()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can store this in a titles variable. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "HVbIYOEwkVt-" }, "outputs": [], "source": [ "titles = yield_titles()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll now import dask bag. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d44WSkiDkhEo" }, "outputs": [], "source": [ "import dask.bag as db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a dask bag object using the `from_sequence` function. " ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "id": "UPKp665fkjmn" }, "outputs": [], "source": [ "bag = db.from_sequence(titles)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
dask.bag<from_sequence, npartitions=1>\n",
       "
\n" ], "text/plain": [ "dask.bag\u001b[1m<\u001b[0m\u001b[1;95mfrom_sequence\u001b[0m\u001b[39m, \u001b[0m\u001b[33mnpartitions\u001b[0m\u001b[39m=\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m>\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bag" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at an example using the `take` method" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uSL2Den-kl8O", "outputId": "dacee933-d4c0-4fc1-b7ce-2d4b8ce800b7" }, "outputs": [ { "data": { "text/html": [ "
\n",
       "(\n",
       "    [\n",
       "        'The',\n",
       "        'Canadian',\n",
       "        'farmer.',\n",
       "        'A',\n",
       "        'missionary',\n",
       "        'incident',\n",
       "        '[Signed:',\n",
       "        'W.',\n",
       "        'J.',\n",
       "        'H.',\n",
       "        'Y,',\n",
       "        'i.e.',\n",
       "        'William',\n",
       "        'J.',\n",
       "        'H.',\n",
       "        'Yates.]'\n",
       "    ],\n",
       ")\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1m(\u001b[0m\n", " \u001b[1m[\u001b[0m\n", " \u001b[32m'The'\u001b[0m,\n", " \u001b[32m'Canadian'\u001b[0m,\n", " \u001b[32m'farmer.'\u001b[0m,\n", " \u001b[32m'A'\u001b[0m,\n", " \u001b[32m'missionary'\u001b[0m,\n", " \u001b[32m'incident'\u001b[0m,\n", " \u001b[32m'\u001b[0m\u001b[32m[\u001b[0m\u001b[32mSigned:'\u001b[0m,\n", " \u001b[32m'W.'\u001b[0m,\n", " \u001b[32m'J.'\u001b[0m,\n", " \u001b[32m'H.'\u001b[0m,\n", " \u001b[32m'Y,'\u001b[0m,\n", " \u001b[32m'i.e.'\u001b[0m,\n", " \u001b[32m'William'\u001b[0m,\n", " \u001b[32m'J.'\u001b[0m,\n", " \u001b[32m'H.'\u001b[0m,\n", " \u001b[32m'Yates.\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\n", " \u001b[1m]\u001b[0m,\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bag.take(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dask bag has a bunch of handy methods for processing data (some of these we could also do in 🤗 datasets but others are not available as specific methods in datasets). \n", "\n", "For example we can make sure we only have unique titles using the `distinct` method. " ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "id": "afOe5tr7kowe" }, "outputs": [], "source": [ "unique_titles = bag.distinct()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hWVI4HZykxuy", "outputId": "4392f098-1f1d-4950-daeb-679de217d111" }, "outputs": [ { "data": { "text/html": [ "
\n",
       "(\n",
       "    'The Canadian farmer. A missionary incident [Signed: W. J. H. Y, i.e. William J. H. Yates.]',\n",
       "    'A new musical Interlude, called the Election [By M. P. Andrews.]',\n",
       "    'An Elegy written among the ruins of an Abbey. By the author of the Nun [E. Jerningham]',\n",
       "    \"The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P\"\n",
       ")\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1m(\u001b[0m\n", " \u001b[32m'The Canadian farmer. A missionary incident \u001b[0m\u001b[32m[\u001b[0m\u001b[32mSigned: W. J. H. Y, i.e. William J. H. Yates.\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m,\n", " \u001b[32m'A new musical Interlude, called the Election \u001b[0m\u001b[32m[\u001b[0m\u001b[32mBy M. P. Andrews.\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m,\n", " \u001b[32m'An Elegy written among the ruins of an Abbey. By the author of the Nun \u001b[0m\u001b[32m[\u001b[0m\u001b[32mE. Jerningham\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m,\n", " \u001b[32m\"The Baron's Daughter. A ballad by the author of Poetical Recreations \u001b[0m\u001b[32m[\u001b[0m\u001b[32mi.e. William C. Hazlitt\u001b[0m\u001b[32m]\u001b[0m\u001b[32m . F.P\"\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "unique_titles.take(4)" ] }, { "cell_type": "markdown", "metadata": { "id": "YCZP1Zh2kzGB" }, "source": [ "Similar to 🤗 datasets we have a map method that we can use to apply a function to all of our examples. In this case we split the title text into individual words. \n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "id": "1vrcwsMXlAi6" }, "outputs": [], "source": [ "title_words_split = unique_titles.map(lambda x: x.split(\" \"))" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "CIrlsZonlEmu", "outputId": "51037d54-2f8c-4f67-903c-9408c7fc5b4c" }, "outputs": [ { "data": { "text/html": [ "
\n",
       "(\n",
       "    [\n",
       "        'The',\n",
       "        'Canadian',\n",
       "        'farmer.',\n",
       "        'A',\n",
       "        'missionary',\n",
       "        'incident',\n",
       "        '[Signed:',\n",
       "        'W.',\n",
       "        'J.',\n",
       "        'H.',\n",
       "        'Y,',\n",
       "        'i.e.',\n",
       "        'William',\n",
       "        'J.',\n",
       "        'H.',\n",
       "        'Yates.]'\n",
       "    ],\n",
       "    [\n",
       "        'A',\n",
       "        'new',\n",
       "        'musical',\n",
       "        'Interlude,',\n",
       "        'called',\n",
       "        'the',\n",
       "        'Election',\n",
       "        '[By',\n",
       "        'M.',\n",
       "        'P.',\n",
       "        'Andrews.]'\n",
       "    ]\n",
       ")\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1m(\u001b[0m\n", " \u001b[1m[\u001b[0m\n", " \u001b[32m'The'\u001b[0m,\n", " \u001b[32m'Canadian'\u001b[0m,\n", " \u001b[32m'farmer.'\u001b[0m,\n", " \u001b[32m'A'\u001b[0m,\n", " \u001b[32m'missionary'\u001b[0m,\n", " \u001b[32m'incident'\u001b[0m,\n", " \u001b[32m'\u001b[0m\u001b[32m[\u001b[0m\u001b[32mSigned:'\u001b[0m,\n", " \u001b[32m'W.'\u001b[0m,\n", " \u001b[32m'J.'\u001b[0m,\n", " \u001b[32m'H.'\u001b[0m,\n", " \u001b[32m'Y,'\u001b[0m,\n", " \u001b[32m'i.e.'\u001b[0m,\n", " \u001b[32m'William'\u001b[0m,\n", " \u001b[32m'J.'\u001b[0m,\n", " \u001b[32m'H.'\u001b[0m,\n", " \u001b[32m'Yates.\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\n", " \u001b[1m]\u001b[0m,\n", " \u001b[1m[\u001b[0m\n", " \u001b[32m'A'\u001b[0m,\n", " \u001b[32m'new'\u001b[0m,\n", " \u001b[32m'musical'\u001b[0m,\n", " \u001b[32m'Interlude,'\u001b[0m,\n", " \u001b[32m'called'\u001b[0m,\n", " \u001b[32m'the'\u001b[0m,\n", " \u001b[32m'Election'\u001b[0m,\n", " \u001b[32m'\u001b[0m\u001b[32m[\u001b[0m\u001b[32mBy'\u001b[0m,\n", " \u001b[32m'M.'\u001b[0m,\n", " \u001b[32m'P.'\u001b[0m,\n", " \u001b[32m'Andrews.\u001b[0m\u001b[32m]\u001b[0m\u001b[32m'\u001b[0m\n", " \u001b[1m]\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "title_words_split.take(2)" ] }, { "cell_type": "markdown", "metadata": { "id": "D0LQxfYEk9id" }, "source": [ "We can see we now have all our words in a list. Helpfully dask bag has a `flatten` method. This will consume our lists and put all the words in a single sequence. " ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "id": "JDBCPDP2lFd2" }, "outputs": [], "source": [ "flattend_title_words = title_words_split.flatten()" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UMiAV4sjlKz4", "outputId": "9c473358-4520-42f0-cb0e-3e6696e86283" }, "outputs": [ { "data": { "text/html": [ "
('The', 'Canadian')\n",
       "
\n" ], "text/plain": [ "\u001b[1m(\u001b[0m\u001b[32m'The'\u001b[0m, \u001b[32m'Canadian'\u001b[0m\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "flattend_title_words.take(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could now use the `frequencies` method to get the top words. " ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4NIeaL0klZhI", "outputId": "7d7a32e3-6a82-4437-996d-ba325dc18758" }, "outputs": [], "source": [ "freqs = flattend_title_words.frequencies(sort=True)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
dask.bag<sorted, npartitions=1>\n",
       "
\n" ], "text/plain": [ "dask.bag\u001b[1m<\u001b[0m\u001b[1;95msorted\u001b[0m\u001b[39m, \u001b[0m\u001b[33mnpartitions\u001b[0m\u001b[39m=\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m>\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "freqs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since dask bag methods are lazy by default, nothing has actually been calculated yet. We can ask for just the top 10 words. " ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "top_10_words = freqs.topk(10, key=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we want the actual results, we call `compute`, which runs all of the chained methods on our bag. \n" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
       "[\n",
       "    ('of', 808),\n",
       "    ('the', 674),\n",
       "    ('and', 550),\n",
       "    ('...', 518),\n",
       "    ('in', 402),\n",
       "    ('van', 306),\n",
       "    ('etc', 301),\n",
       "    ('de', 258),\n",
       "    ('en', 258),\n",
       "    ('a', 231)\n",
       "]\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1m[\u001b[0m\n", " \u001b[1m(\u001b[0m\u001b[32m'of'\u001b[0m, \u001b[1;36m808\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'the'\u001b[0m, \u001b[1;36m674\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'and'\u001b[0m, \u001b[1;36m550\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'...'\u001b[0m, \u001b[1;36m518\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'in'\u001b[0m, \u001b[1;36m402\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'van'\u001b[0m, \u001b[1;36m306\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'etc'\u001b[0m, \u001b[1;36m301\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'de'\u001b[0m, \u001b[1;36m258\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'en'\u001b[0m, \u001b[1;36m258\u001b[0m\u001b[1m)\u001b[0m,\n", " \u001b[1m(\u001b[0m\u001b[32m'a'\u001b[0m, \u001b[1;36m231\u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[1m]\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "top_10_words.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could also do the same with a lowercased version of the words " ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "lowered_title_words = flattend_title_words.map(lambda x: x.lower())" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "freqs = lowered_title_words.frequencies(sort=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `visualize` method gives you some insight into how dask manages the computation. 
" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d985cef20e334ec49f640e264f51b494", "version_major": 2, "version_minor": 0 }, "text/plain": [ "CytoscapeWidget(cytoscape_layout={'name': 'dagre', 'rankDir': 'BT', 'nodeSep': 10, 'edgeSep': 10, 'spacingFact…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "freqs.visualize(engine=\"cytoscape\", optimize_graph=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Moving from datasets to a dask dataframe \n", "\n", "For some operations, dask bag is super easy to use. Sometimes though you will hurt your brain trying to crow bar your problem into the dask bag API 😵‍💫 This is where dask dataframes come in! Using parquet, we can easily save our 🤗 dataset as a parquet file. " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3583138" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.to_parquet(\"genre.parquet\")" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and load from this file" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "ddf = dd.read_parquet(\"genre.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As dask dataframe works quite similar to a pandas dataframe. It is lazy by default so if we just print it out" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BL record IDNameDates associated with nameType of nameRoleAll namesTitleVariant titlesSeries titleNumber within seriesCountry of publicationPlace of publicationPublisherDate of publicationEditionPhysical descriptionDewey classificationBL shelfmarkTopicsGenreLanguagesNotesBL record ID for physical resourceclassification_iduser_idsubject_idsannotator_date_pubannotator_normalised_date_pubannotator_edition_statementannotator_FAST_genre_termsannotator_FAST_subject_termsannotator_commentsannotator_main_languageannotator_other_languages_summariesannotator_summaries_languageannotator_translationannotator_original_languageannotator_publisherannotator_place_pubannotator_countryannotator_titleLink to digitised bookannotatedType of resourcecreated_atannotator_genre
npartitions=1
objectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectobjectboolint64datetime64[ns]int64
..........................................................................................................................................
\n", "
\n", "
Dask Name: read-parquet, 1 tasks
" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll see we don't actually get back any data. If we use head we get the number of examples we ask for. " ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BL record IDNameDates associated with nameType of nameRoleAll namesTitleVariant titlesSeries titleNumber within series...annotator_original_languageannotator_publisherannotator_place_pubannotator_countryannotator_titleLink to digitised bookannotatedType of resourcecreated_atannotator_genre
0014603046Yates, William Joseph H.person[Yates, William Joseph H. [person] , Y, W. J....The Canadian farmer. A missionary incident [Si......NONELondonenkThe Canadian farmer. A missionary incident [Si...http://access.bl.uk/item/viewer/ark:/81055/vdc...True02020-08-11 14:30:330
1014603046Yates, William Joseph H.person[Yates, William Joseph H. [person] , Y, W. J....The Canadian farmer. A missionary incident [Si......NONELondonenkThe Canadian farmer. A missionary incident [Si...http://access.bl.uk/item/viewer/ark:/81055/vdc...True02021-04-15 09:53:230
2014603046Yates, William Joseph H.person[Yates, William Joseph H. [person] , Y, W. J....The Canadian farmer. A missionary incident [Si......NONELondonenkThe Canadian farmer. A missionary incident [Si...http://access.bl.uk/item/viewer/ark:/81055/vdc...True02020-09-24 14:27:540
\n", "

3 rows × 46 columns

\n", "
" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have some familiar methods from pandas available to us" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "ddf = ddf.drop_duplicates(subset=\"Title\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an example of something that would be a bit tricky in datasets, we can see how to groupby the mean title length by year of publication. First we create a new column for title length" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "ddf[\"title_len\"] = ddf[\"Title\"].map(lambda x: len(x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then groupby the date of publication " ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "grouped = ddf.groupby(\"Date of publication\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and then calculate the mean `title_len` " ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "mean_title_len = grouped[\"title_len\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To actually compute this value we call the `compute` method " ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
       "Date of publication\n",
       "1635    248.0\n",
       "1637     67.0\n",
       "1695     63.0\n",
       "1765     86.0\n",
       "1769     20.0\n",
       "        ...  \n",
       "1905    141.0\n",
       "1906    225.0\n",
       "1907    142.0\n",
       "1910     65.0\n",
       "1979     43.0\n",
       "Name: title_len, Length: 124, dtype: float64\n",
       "
\n" ], "text/plain": [ "\n", "Date of publication\n", "\u001b[1;36m1635\u001b[0m \u001b[1;36m248.0\u001b[0m\n", "\u001b[1;36m1637\u001b[0m \u001b[1;36m67.0\u001b[0m\n", "\u001b[1;36m1695\u001b[0m \u001b[1;36m63.0\u001b[0m\n", "\u001b[1;36m1765\u001b[0m \u001b[1;36m86.0\u001b[0m\n", "\u001b[1;36m1769\u001b[0m \u001b[1;36m20.0\u001b[0m\n", " \u001b[33m...\u001b[0m \n", "\u001b[1;36m1905\u001b[0m \u001b[1;36m141.0\u001b[0m\n", "\u001b[1;36m1906\u001b[0m \u001b[1;36m225.0\u001b[0m\n", "\u001b[1;36m1907\u001b[0m \u001b[1;36m142.0\u001b[0m\n", "\u001b[1;36m1910\u001b[0m \u001b[1;36m65.0\u001b[0m\n", "\u001b[1;36m1979\u001b[0m \u001b[1;36m43.0\u001b[0m\n", "Name: title_len, Length: \u001b[1;36m124\u001b[0m, dtype: float64\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mean_title_len.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create a plot in the usual way " ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<AxesSubplot:xlabel='Date of publication'>\n",
       "
\n" ], "text/plain": [ "\u001b[1m<\u001b[0m\u001b[1;95mAxesSubplot:\u001b[0m\u001b[1;33mxlabel\u001b[0m\u001b[39m=\u001b[0m\u001b[32m'Date of publication'\u001b[0m\u001b[1m>\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<Figure size 432x288 with 1 Axes>\n",
       "
\n" ], "text/plain": [ "\u001b[1m<\u001b[0m\u001b[1;95mFigure\u001b[0m\u001b[39m size 432x288 with \u001b[0m\u001b[1;36m1\u001b[0m\u001b[39m Axes\u001b[0m\u001b[1m>\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n" }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "mean_title_len.compute().plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This was a very quick overview. The [dask docs](https://www.dask.org/get-started) go into much more detail as do the Hugging Face [datasets docs](https://huggingface.co/docs/datasets/). \n" ] } ], "metadata": { "colab": { "name": "scratchpad", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "0c0752e8e9024977979562531ad5a4b7": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "408dd31921ee40bb9c9fc8995d4b8577": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ 
"IPY_MODEL_5e1150b6dfb8494195191efaeb5b7feb", "IPY_MODEL_6edda0b1149541499ab485df06711944", "IPY_MODEL_45fcaeca1a184768a624607172ab7d72" ], "layout": "IPY_MODEL_ea28bfd1711148c29b494a0194d2ffbf" } }, "45fcaeca1a184768a624607172ab7d72": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_4ccb508cf3ab4001a30cea47d2448e23", "placeholder": "​", "style": "IPY_MODEL_88809da60e3846dc886cd2545edbc5c3", "value": " 1/1 [00:00<00:00, 21.14it/s]" } }, "4ccb508cf3ab4001a30cea47d2448e23": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, 
"5d3ee4cdf67e44878bac361e143b828a": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "5e1150b6dfb8494195191efaeb5b7feb": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_b3eaa93967914d7f8977d736611b845f", "placeholder": "​", "style": "IPY_MODEL_5d3ee4cdf67e44878bac361e143b828a", "value": "100%" } }, "6edda0b1149541499ab485df06711944": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_bff82a2eab2a4f7d87f8037e30839050", "max": 1, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_0c0752e8e9024977979562531ad5a4b7", "value": 1 } }, "88809da60e3846dc886cd2545edbc5c3": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": 
"DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "b3eaa93967914d7f8977d736611b845f": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "bff82a2eab2a4f7d87f8037e30839050": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, 
"grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "ea28bfd1711148c29b494a0194d2ffbf": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } } } } }, "nbformat": 4, "nbformat_minor": 4 }