{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# GLAM CSV Explorer\n",
    "\n",
    "[View code at GitHub](https://github.com/GLAM-Workbench/csv-explorer/blob/master/csv-explorer.ipynb) &middot; [View details in GLAM Workbench](https://glam-workbench.net/csv-explorer/)\n",
    "\n",
    "Cultural institutions are making collection data available as machine readable downloads. But how can researchers explore the shape and meaning of this data? How do they know what types of questions they can ask?\n",
    "\n",
    "This notebook provides a quick overview of CSV-formatted data files, particularly those created by GLAM institutions (galleries, libraries, archives, and museums). The list of CSV files from Australian GLAM insitutions provided below [was harvested from state and national government data portals](https://glam-workbench.github.io/glam-data-portals/). You can select a file from the list or upload your own.\n",
    "\n",
    "The CSV Explorer looks at each column in the selected CSV file and tries to identify the type of data inside. It then attempts to tell you something useful about it. There's some more details about the process below.\n",
    "\n",
    "Given all the possible variations in recording and formatting data, there will be errors. But hopefully this will provide you with a useful starting point for further exploration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This notebook is designed to run in Voila as an app (with the code hidden).\n",
    "# To launch this notebook in Voila, just select 'View > Open with Voila in New Browser Tab'\n",
    "# Your browser might ask for permission to open the new tab as a popup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "import pandas as pd\n",
    "from pandas.errors import ParserError\n",
    "import statistics\n",
    "import time\n",
    "import os\n",
    "import io\n",
    "from urllib.parse import urlparse\n",
    "from urllib.error import HTTPError\n",
    "import ipywidgets as widgets\n",
    "from IPython.display import display, HTML, clear_output\n",
    "import altair as alt\n",
    "from wordcloud import WordCloud\n",
    "from slugify import slugify\n",
    "# alt.renderers.enable('notebook')\n",
    "#alt.data_transformers.enable('json', urlpath='files')\n",
    "alt.data_transformers.enable('data_server')\n",
    "#alt.data_transformers.enable('data_server_proxied', urlpath='.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#This is where the results go...\n",
    "results = widgets.Output()\n",
    "status = widgets.Output()\n",
    "pd.set_option('display.max_columns', 10)\n",
    "\n",
    "def read_csv(url, header=0, encoding=0):\n",
    "    '''\n",
    "    Loop through some encoding/parsing options to see if we can get the CSV to open properly.\n",
    "    '''\n",
    "    encodings = ['ISO-8859-1', 'latin-1']\n",
    "    headers = [None]\n",
    "    try:\n",
    "        if encoding > 0 and header > 0:\n",
    "            df = pd.read_csv(url, sep=None, engine='python', na_values=['-', ' '], encoding=encodings[encoding-1], header=headers[header-1])\n",
    "        elif encoding > 0:\n",
    "            df = pd.read_csv(url, sep=None, engine='python', na_values=['-', ' '], encoding=encodings[encoding-1])\n",
    "        elif header > 0:\n",
    "            df = pd.read_csv(url, sep=None, engine='python', na_values=['-', ' '], header=headers[header-1])\n",
    "        else:\n",
    "            df = pd.read_csv(url, sep=None, engine='python', na_values=['-', ' '])\n",
    "    except UnicodeDecodeError:\n",
    "        if encoding == len(encodings):\n",
    "            raise\n",
    "        else:\n",
    "            return read_csv(url=url, header=header, encoding=encoding+1)\n",
    "    except ParserError:\n",
    "        if header == len(headers):\n",
    "            raise\n",
    "        else:\n",
    "            return read_csv(url=url, header=header+1, encoding=encoding)\n",
    "    else:\n",
    "        return df\n",
    "\n",
    "def analyse_csv(b):\n",
    "    '''\n",
    "    Try to open the CSV file, and start the analysis.\n",
    "    '''\n",
    "    results.clear_output()\n",
    "    status.clear_output()\n",
    "    error = ''\n",
    "    if tab.selected_index == 0:\n",
    "        row = csvs.loc[select_csv.value]\n",
    "        url = row['download_url']\n",
    "        title = row['file_title']\n",
    "    elif tab.selected_index == 1:\n",
    "        url = csv_url.value\n",
    "        parsed_url = urlparse(url)\n",
    "        title = os.path.basename(parsed_url.path)\n",
    "    elif tab.selected_index == 2:\n",
    "        # This will change in ipywidgets 8!\n",
    "        title, csv_content = list(csv_upload.value.items())[0]\n",
    "        url = io.BytesIO(csv_content['content'])\n",
    "    with results:\n",
    "        html = f'<hr><h2>{title}</h2>'\n",
    "        if tab.selected_index in [0, 1]:\n",
    "            html += f'<h4>Source</h4><p><a href=\"{url}\">{url}</a> ({row[\"file_size\"]})</p>'\n",
    "        display(HTML(html))\n",
    "        # display(status)\n",
    "        status.append_stdout('Downloading data...')\n",
    "        try:\n",
    "            df = read_csv(url)\n",
    "        except UnicodeDecodeError:\n",
    "            error = 'Unicode error: unable to read the CSV!'\n",
    "        except ParserError:\n",
    "            error = 'Parser error: unable to read the CSV!'\n",
    "        except HTTPError:\n",
    "            error = 'File not found!'\n",
    "        except:\n",
    "            error = 'Unable to read the CSV!'\n",
    "        status.clear_output()\n",
    "        if error:\n",
    "            display(HTML(f'<p class=\"alert alert-danger\">{error}</p>'))\n",
    "        else:\n",
    "            rows, cols = df.shape\n",
    "            size = '<h4>Size</h4><ul>'\n",
    "            size += '<li>{} rows</li>'.format(rows)\n",
    "            size += '<li>{} columns</li></ul>'.format(cols)\n",
    "            cols = \"<h4>Columns</h4><ol>\"\n",
    "            for col in df.columns:\n",
    "                cols += '<li><a style=\"font-family: monospace\" href=\"#{}\" target=\"_self\">{}</a></li>'.format(slugify(col), col)\n",
    "            cols += '</ol>'\n",
    "            display(HTML(size))\n",
    "            display(HTML(cols))\n",
    "            display(HTML('<h4>Sample</h4>'))\n",
    "            display(df.head())\n",
    "            analyse_columns(df)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "k, v = list({'foo': 'bar'}.items())[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "date_cutoff = 0.8\n",
    "cutoff = 0.8\n",
    "unique_cutoff = 0.1\n",
    "category_count = 30\n",
    "\n",
    "def display_dates(df, col):\n",
    "    # Better to group years first, so that the altair data isn't huge\n",
    "    # Get counts by year\n",
    "    counts = df[col].groupby([df[col].dt.year]).agg('count').to_frame()\n",
    "    # Get the full range of years\n",
    "    years = pd.Index([y for y in range(int(counts.index[0]), int(counts.index[-1]) + 1)])\n",
    "    # Set missing years to zero\n",
    "    counts = counts.reindex(years, fill_value=0)\n",
    "    counts = counts.reset_index()\n",
    "    counts.columns = [col, 'count']\n",
    "    chart = alt.Chart(counts).mark_area().encode(\n",
    "        x=alt.X(f'{col}:Q', axis=alt.Axis(format='c', title='Year', tickMinStep=1)),\n",
    "        y='count:Q',\n",
    "        tooltip=[alt.Tooltip('{}:O'.format(col), title='Year'), alt.Tooltip('count:Q', title='Count', format=',')],\n",
    "        color=alt.value('#5254a3')\n",
    "    ).properties(\n",
    "        width=800\n",
    "    )\n",
    "    display(chart)\n",
    "    \n",
    "def display_categories(df, col):\n",
    "    counts = df[col].value_counts()\n",
    "    if counts.size > category_count:\n",
    "        counts = counts[:category_count].to_frame()\n",
    "    else:\n",
    "        counts = counts.to_frame()\n",
    "    counts = counts.reset_index()\n",
    "    counts.columns = [col, 'count']\n",
    "    chart = alt.Chart(counts).mark_bar().encode(\n",
    "        x='count:Q',\n",
    "        y=alt.Y('{}:N'.format(col),  sort=alt.EncodingSortField(field='count', op='count', order='ascending')),\n",
    "        tooltip=[alt.Tooltip('{}:N'.format(col), title='Category'), alt.Tooltip('count:Q', title='Count', format=',')],\n",
    "        color=alt.value('#8ca252')\n",
    "    )\n",
    "    display(chart)\n",
    "\n",
    "def display_wordcloud(df, col, collocates=True):\n",
    "    # Make a word cloud!\n",
    "    # The word cloud software splits the string into individual words and calculates their frquency\n",
    "    words = df[col].str.cat(sep=' ')\n",
    "    wordcloud = WordCloud(width=800, height=300, collocations=collocates).generate(words)\n",
    "    display(wordcloud.to_image())\n",
    "\n",
    "def display_numbers(df, col, unique_count):\n",
    "    #display(df[col])\n",
    "    if unique_count <= 20:\n",
    "        # df[col].replace('0', np.NaN)\n",
    "        counts = df[col].value_counts().to_frame()\n",
    "        counts = counts.reset_index()\n",
    "        counts.columns = [col, 'count']\n",
    "        #display(counts)\n",
    "        chart = alt.Chart(counts).mark_bar().encode(\n",
    "            alt.X('{}:Q'.format(col)),\n",
    "            y='count',\n",
    "            tooltip=[alt.Tooltip('{}:Q'.format(col)), alt.Tooltip('count:Q', title='Count', format=',')],\n",
    "            color=alt.value('#ad494a')\n",
    "        )\n",
    "    else:\n",
    "        chart = alt.Chart(df).mark_bar().encode(\n",
    "            alt.X('{}:Q'.format(col), bin=alt.Bin(maxbins=10, nice=True)),\n",
    "            y='count()',\n",
    "            tooltip=[alt.Tooltip('{}:Q'.format(col), bin=alt.Bin(maxbins=10, nice=True), title='Range'), alt.Tooltip('count():Q', title='Count', format=',')],\n",
    "            color=alt.value('#ad494a')\n",
    "        )\n",
    "    display(chart)\n",
    "\n",
    "def text_field(df, col, value_count, word_counts, details):\n",
    "    html = 'This looks like a text field.'\n",
    "    display(HTML(html))\n",
    "    median_word_count = statistics.median(word_counts)\n",
    "    collocates = True if median_word_count > 1 else False\n",
    "    details['Total number of words'] = word_counts.sum()\n",
    "    details['Highest number of words'] = word_counts.max()\n",
    "    details['Median number of words'] = median_word_count\n",
    "    details['Number of empty records'] = df[col].shape[0] - value_count\n",
    "    display_details(details)\n",
    "    wordcloud = display_wordcloud(df, col, collocates)\n",
    "    display(wordcloud.to_image())\n",
    "    #image_file = 'images/{}_cloud_{}.png'.format(slugify(col), int(time.time()))\n",
    "    #try:\n",
    "    #    image.save(image_file)\n",
    "    #except FileNotFoundError:\n",
    "    #    os.makedirs('images')\n",
    "    #display(HTML('<a href=\"{0}\"><img src=\"{0}\"></a>'.format(image_file)))\n",
    "    \n",
    "def textplus_field(df, col, value_count, unique_count, unique_ratio, word_counts, has_year, details, html):\n",
    "    median_word_count = statistics.median(word_counts)\n",
    "    collocates = True if median_word_count > 1 else False\n",
    "    mixed = False\n",
    "    details['Total number of words'] = word_counts.sum()\n",
    "    details['Highest number of words'] = word_counts.max()\n",
    "    details['Median number of words'] = median_word_count\n",
    "    details['Number of empty records'] = df[col].shape[0] - value_count\n",
    "    display_details(details)\n",
    "    has_mixed = df[col].str.contains(r'(?=\\S*[a-zA-Z\\/])(?=\\S*[0-9])', regex=True)\n",
    "    if has_mixed.sum() / value_count > cutoff and median_word_count <= 2:\n",
    "        mixed = True\n",
    "        html = '<p>This columns contains a small number of words that combine letters and numbers. They\\'re probably collection identifiers. Here\\'s some examples:</p><ul>'\n",
    "        samples = df.loc[df[col].notna()][col].sample(5).to_list()\n",
    "        for sample in samples:\n",
    "            html += '<li>{}</li>'.format(sample)\n",
    "        html += '</ul>'\n",
    "        display(HTML(html))\n",
    "    elif unique_count <= category_count:\n",
    "        display(HTML(f'<p>This look like it contains categories. Let\\'s look at the {category_count} most common.</p>'))\n",
    "        display_categories(df, col)\n",
    "    else:\n",
    "        try:\n",
    "            display(HTML('<p>This look like it contains text.</p>'))\n",
    "            wordcloud = display_wordcloud(df, col, collocates)\n",
    "        except ValueError:\n",
    "            pass\n",
    "        if unique_ratio < unique_cutoff:\n",
    "            display(HTML(f'<p>Less than {unique_cutoff:.2%} of the values are unique, let\\'s look at the {category_count} most common.</p>'))\n",
    "            display_categories(df, col)\n",
    "    has_number = df[col].str.contains(r'\\b\\d+\\b', regex=True)\n",
    "    # Check for dates\n",
    "    if has_year.sum() / value_count > cutoff and mixed is False:\n",
    "        html = '<p>Most of the values in this column include a number that looks like a year. It might be useful to convert them to dates.</p>'\n",
    "        df['{}_years_extracted'.format(col)] = df[col].str.extract(r'\\b(1[7-9]{1}\\d{2}|20[0-1]{1}\\d{1})\\b')\n",
    "        if df['{}_years_extracted'.format(col)].nunique(dropna=True) > 1:\n",
    "            df['{}_date_converted'.format(col)] = pd.to_datetime(df['{}_years_extracted'.format(col)], format='%Y', utc=True)\n",
    "            html += '<p>{:,} of {:,} values in this column were successfully parsed as dates.</p>'.format(df['{}_date_converted'.format(col)].dropna().size, value_count)\n",
    "            details = {}\n",
    "            details['Earliest date'] = df['{}_date_converted'.format(col)].min().strftime('%Y-%m-%d')\n",
    "            details['Latest date'] = df['{}_date_converted'.format(col)].max().strftime('%Y-%m-%d')\n",
    "            display(HTML(html))\n",
    "            display_details(details)\n",
    "            display_dates(df, '{}_date_converted'.format(col))\n",
    "    # Check for numbers\n",
    "    elif has_number.sum() / value_count > cutoff and mixed is False:\n",
    "        html = '<p>Most of the values in this column include a number. It might be useful to extract the values.</p>'\n",
    "        df['{}_numbers_extracted'.format(col)] = df[col].str.extract(r'\\b(\\d+)\\b')\n",
    "        if df['{}_numbers_extracted'.format(col)].nunique(dropna=True) > 2:\n",
    "            df['{}_numbers_extracted'.format(col)] = pd.to_numeric(df['{}_numbers_extracted'.format(col)], errors='coerce', downcast='integer')\n",
    "            details = {}\n",
    "            details['Highest value'] = df['{}_numbers_extracted'.format(col)].max()\n",
    "            details['Lowest value'] = df['{}_numbers_extracted'.format(col)].dropna().min()\n",
    "            display(HTML(html))\n",
    "            display_details(details)\n",
    "            display_numbers(df, '{}_numbers_extracted'.format(col), unique_count)\n",
    "        \n",
    "    \n",
    "def date_field(df, col, value_count, year_count, details, html):\n",
    "    default_dates = pd.to_datetime(df[col], infer_datetime_format=True, errors='coerce', utc=True)\n",
    "    default_dates_count = default_dates.dropna().size\n",
    "    dayfirst_dates = pd.to_datetime(df[col], infer_datetime_format=True, errors='coerce', dayfirst=True, yearfirst=True, utc=True)\n",
    "    dayfirst_dates_count = dayfirst_dates.dropna().size\n",
    "    if (default_dates_count / value_count > date_cutoff) and (default_dates_count >= dayfirst_dates_count):\n",
    "        df['{}_date_converted'.format(col)] = default_dates\n",
    "    elif (dayfirst_dates_count / value_count > date_cutoff) and (dayfirst_dates_count >= default_dates_count):\n",
    "        df['{}_date_converted'.format(col)] = dayfirst_dates\n",
    "    else:\n",
    "        # It's not a known date format, so let's just get the years\n",
    "        df['{}_years_extracted'.format(col)] = df[col].str.extract(r'\\b(1[7-9]{1}\\d{2}|20[0-1]{1}\\d{1})\\b')\n",
    "        df['{}_date_converted'.format(col)] = pd.to_datetime(df['{}_years_extracted'.format(col)], format='%Y', utc=True)\n",
    "    html += '<p>This looks like it contains dates.</p>'\n",
    "    html += '<p>{:,} of {:,} values in this column were successfully parsed as dates.</p>'.format(df['{}_date_converted'.format(col)].dropna().size, value_count)\n",
    "    details['Earliest date'] = df['{}_date_converted'.format(col)].min().strftime('%Y-%m-%d')\n",
    "    details['Latest date'] = df['{}_date_converted'.format(col)].max().strftime('%Y-%m-%d')\n",
    "    display(HTML(html))\n",
    "    display_details(details)\n",
    "    display_dates(df, '{}_date_converted'.format(col))\n",
    "\n",
    "def url_field(df, col, details, html):\n",
    "    display_details(details)\n",
    "    html += '<p>It looks like this column contains urls. Here are some examples:</p><ul>'\n",
    "    samples = df.loc[df[col].notna()][col].sample(5).to_list()\n",
    "    for sample in samples:\n",
    "        html += '<li><a href=\"{0}\">{0}</a></li>'.format(sample)\n",
    "    html += '</ul>'\n",
    "    display(HTML(html))\n",
    "    \n",
    "def unique_field(df, col, details, html):\n",
    "    display_details(details)\n",
    "    html += '<p>This column only contains one value:</p>'\n",
    "    html += '<blockquote>{}</blockquote>'.format(df[col].loc[df[col].first_valid_index()])\n",
    "    display(HTML(html))\n",
    "    \n",
    "def number_field(df, col, value_count, unique_count, unique_ratio, details, html):\n",
    "    has_year = df.loc[(df[col] >= 1700) & (df[col] <= 2019)]\n",
    "    if (has_year.size / value_count) > date_cutoff:\n",
    "        df['{}_date_converted'.format(col)] = pd.to_datetime(df[col], format='%Y', utc=True, errors='coerce')\n",
    "        html += '<p>This looks like it contains dates.</p>'\n",
    "        html += '<p>{:,} of {:,} values in this column were successfully parsed as dates.</p>'.format(df['{}_date_converted'.format(col)].dropna().size, value_count)\n",
    "        details['Earliest date'] = df['{}_date_converted'.format(col)].dropna().min().strftime('%Y-%m-%d')\n",
    "        details['Latest date'] = df['{}_date_converted'.format(col)].dropna().max().strftime('%Y-%m-%d')\n",
    "        display(HTML(html))\n",
    "        display_details(details)\n",
    "        display_dates(df, '{}_date_converted'.format(col))\n",
    "    else:\n",
    "        details['Highest value'] = df[col].max()\n",
    "        details['Lowest value'] = df[col].dropna().min()\n",
    "        display_details(details)\n",
    "        if unique_ratio > cutoff:\n",
    "            html = '{:.2%} of the values in this column are unique, so it\\'s probably some sort of identifier.'.format(unique_ratio)\n",
    "            display(HTML(html))\n",
    "        if unique_count <= 20:\n",
    "            display_categories(df, col)\n",
    "        else:\n",
    "            display_numbers(df, col, unique_count)\n",
    "        #Check for geocoordinates?\n",
    "\n",
    "def display_details(details):\n",
    "    details_df = pd.DataFrame.from_dict(details, orient='index', columns=[' '])\n",
    "    details_df.rename_axis('Summary', axis='columns', inplace=True)\n",
    "    details_df = details_df.style.set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])\n",
    "    display(details_df)\n",
    "\n",
    "def analyse_columns(df):\n",
    "    enriched_df = df.copy()\n",
    "    #out = widgets.Output()\n",
    "    outputs = {}\n",
    "    for index, col in enumerate(enriched_df.columns):\n",
    "        display(HTML('<hr><h3 id=\"{}\">{}. <code>{}</code></h3>'.format(slugify(col), index+1, col)))\n",
    "        details = {}\n",
    "        html = ''\n",
    "        # Are there any values in this column\n",
    "        value_count = enriched_df[col].dropna().size\n",
    "        details['Number of (non empty) values'] = '{:,} ({:.2%} of rows)'.format(value_count, (value_count / enriched_df[col].size))\n",
    "        if value_count:\n",
    "            # How many unique values are there in this column?\n",
    "            unique_count = enriched_df[col].nunique(dropna=True)\n",
    "            # What proportion of the values are unique?\n",
    "            unique_ratio = unique_count / value_count\n",
    "            details['Number of unique values'] = '{:,} ({:.2%} of non-empty values)'.format(unique_count, unique_ratio)\n",
    "            if unique_ratio == 1:\n",
    "                html += '<p>All the values in this column are unique, perhaps it''s some form of identifier.</p>'\n",
    "            if unique_count == 1:\n",
    "                unique_field(enriched_df, col, details, html)\n",
    "            # Check it's a string field\n",
    "            elif enriched_df[col].dtype == 'object':\n",
    "                word_counts = enriched_df[col].dropna().str.split().str.len().fillna(0)\n",
    "                median_word_count = statistics.median(word_counts)\n",
    "                # Check for the presence of years\n",
    "                # year_count = enriched_df[col].str.count(r'\\b1[7-9]{1}\\d{2}\\b|\\b20[0-1]{1}\\d{1}\\b').sum()\n",
    "                if enriched_df[col].str.startswith('http', na=False).sum() > 1:\n",
    "                    url_field(enriched_df, col, details, html)\n",
    "                #elif median_word_count <= 4:\n",
    "                    # How many have words that combine letters and numbers?\n",
    "                else:\n",
    "                    # How many start with words (and no numbers in the first two words)?\n",
    "                    starts_with_words = enriched_df[col].str.contains(r'^[a-zA-Z]+$|^(?:\\b[a-zA-Z]{2,}\\b\\W*){2}', regex=True)\n",
    "                    # How many have patterns that look like years?\n",
    "                    has_year = enriched_df[col].str.contains(r'\\b1[7-9]{1}\\d{2}|20[0-1]{1}\\d{1}\\b', regex=True)\n",
    "                    # If most don't start with words...\n",
    "                    # This filters out titles or names that might include dates.\n",
    "                    if (value_count - starts_with_words.sum()) / value_count > date_cutoff:\n",
    "                        # If most contain years...\n",
    "                        if (has_year.sum() / value_count) > date_cutoff:\n",
    "                            date_field(enriched_df, col, value_count, has_year.sum(), details, html)\n",
    "                        else:\n",
    "                            textplus_field(enriched_df, col, value_count, unique_count, unique_ratio, word_counts, has_year, details, html)\n",
    "                    else:\n",
    "                        textplus_field(enriched_df, col, value_count, unique_count, unique_ratio, word_counts, has_year, details, html)\n",
    "            elif enriched_df[col].dtype in ['int64', 'float64']:\n",
    "                number_field(enriched_df, col, value_count, unique_count, unique_ratio, details, html)\n",
    "        else:\n",
    "            html = 'This column is empty.'\n",
    "            display(HTML(html))\n",
    "              "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f006a5da19ae4f0299da09bce14227ff",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VBox(children=(Tab(children=(VBox(children=(HTML(value='Select a CSV file:'), Dropdown(layout=Layout(width='10…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "csvs = pd.read_csv('https://raw.githubusercontent.com/GLAM-Workbench/ozglam-data/master/glam-datasets-from-gov-portals-csvs.csv', parse_dates=['file_created'])\n",
    "orgs = ['All'] + sorted(csvs['publisher'].unique().tolist())\n",
    "datasets = ['All'] + sorted(csvs['dataset_title'].unique().tolist())\n",
    "csvs.sort_values(by=['file_title', 'file_created'], inplace=True)\n",
    "files = []\n",
    "trigger = None\n",
    "for row in csvs.itertuples():\n",
    "    files.append((f'{row.file_title} ({row.publisher}, {row.file_created.year})', row.Index))\n",
    "\n",
    "def filter_files(field, value):\n",
    "    filtered_csvs = csvs.loc[csvs[field] == value]\n",
    "    filtered_files = []\n",
    "    for row in filtered_csvs.itertuples():\n",
    "        filtered_files.append((f'{row.file_title} ({row.publisher}, {row.file_created.year})', row.Index))\n",
    "    select_csv.options = filtered_files\n",
    "    \n",
    "def reset_options():\n",
    "    select_org.options = orgs\n",
    "    select_dataset.options = datasets\n",
    "    select_csv.options = files\n",
    "    select_org.value = orgs[0]\n",
    "    select_dataset.value = datasets[0]\n",
    "    #select_csv.value = files[0][1]\n",
    "\n",
    "    \n",
    "def filter_by_org(*args):\n",
    "    '''\n",
    "    Update the list of files in the selection dropdown based on the selected organisation.\n",
    "    '''\n",
    "    if select_org.value == 'All':\n",
    "        select_dataset.options = datasets\n",
    "        select_dataset.value = datasets[0]\n",
    "        select_csv.options = files\n",
    "    else:\n",
    "        filter_files('publisher', select_org.value)\n",
    "        if select_dataset.value != 'All':\n",
    "            selected_org = sorted(csvs.loc[csvs['dataset_title'] == select_dataset.value]['publisher'].unique().tolist())[0]\n",
    "            if selected_org != select_org.value:\n",
    "                filtered_datasets = ['All'] + sorted(csvs.loc[csvs['publisher'] == select_org.value]['dataset_title'].unique().tolist())\n",
    "                select_dataset.value = 'All'\n",
    "                select_dataset.options = filtered_datasets\n",
    "        else:\n",
    "            filtered_datasets = ['All'] + sorted(csvs.loc[csvs['publisher'] == select_org.value]['dataset_title'].unique().tolist())\n",
    "            select_dataset.value = 'All'\n",
    "            select_dataset.options = filtered_datasets            \n",
    "\n",
    "def filter_by_dataset(*args):\n",
    "    '''\n",
    "    Update the list of files in the selection dropdown based on the selected organisation.\n",
    "    '''\n",
    "    if select_dataset.value == 'All':\n",
    "        if select_org.value != 'All':\n",
    "            filter_files('publisher', select_org.value)\n",
    "    else:\n",
    "        filter_files('dataset_title', select_dataset.value)\n",
    "        selected_org = sorted(csvs.loc[csvs['dataset_title'] == select_dataset.value]['publisher'].unique().tolist())[0]\n",
    "        #select_org.options = filtered_orgs\n",
    "        if selected_org != select_org.value:\n",
    "            select_org.value = selected_org\n",
    "\n",
    "def clear_all(b):\n",
    "    reset_options()\n",
    "    csv_url.value = ''\n",
    "    results.clear_output()\n",
    "    \n",
    "select_org = widgets.Dropdown(\n",
    "        options=orgs,\n",
    "        description='',\n",
    "        disabled=False,\n",
    "        layout=widgets.Layout(width='100%')\n",
    "    )\n",
    "\n",
    "select_dataset = widgets.Dropdown(\n",
    "        options=datasets,\n",
    "        description='',\n",
    "        disabled=False,\n",
    "        layout=widgets.Layout(width='100%')\n",
    "    )\n",
    "\n",
    "select_csv = widgets.Dropdown(\n",
    "        options=files,\n",
    "        description='',\n",
    "        disabled=False,\n",
    "        layout=widgets.Layout(width='100%')\n",
    "    )\n",
    "\n",
    "csv_url = widgets.Text(\n",
    "        placeholder='Enter the url of a CSV file',\n",
    "        description='Url:',\n",
    "        disabled=False,\n",
    "        layout=widgets.Layout(width='100%')\n",
    "    )\n",
    "\n",
    "csv_upload = widgets.FileUpload(\n",
    "    accept='.csv',\n",
    "    multiple=False\n",
    ")\n",
    "\n",
    "\n",
    "clear_button = widgets.Button(\n",
    "        description='Clear',\n",
    "        disabled=False,\n",
    "        button_style='', # 'success', 'info', 'warning', 'danger' or ''\n",
    "        tooltip='Clear current data',\n",
    "        icon=''\n",
    "    )\n",
    "\n",
    "analyse_button = widgets.Button(\n",
    "        description='Analyse CSV',\n",
    "        disabled=False,\n",
    "        button_style='primary', # 'success', 'info', 'warning', 'danger' or ''\n",
    "        tooltip='Analyse CSV',\n",
    "        icon=''\n",
    "    )\n",
    "\n",
    "# Update the file list when you select an org\n",
    "select_org.observe(filter_by_org)\n",
    "\n",
    "# Update the file list when you select an org\n",
    "select_dataset.observe(filter_by_dataset)\n",
    "\n",
    "clear_button.on_click(clear_all)\n",
    "analyse_button.on_click(analyse_csv)\n",
    "select_org_note = widgets.HTML('Filter by organisation:')\n",
    "select_dataset_note = widgets.HTML('Filter by dataset:')\n",
    "select_note = widgets.HTML('Select a CSV file:')\n",
    "select_tab = widgets.VBox([select_note, select_csv, select_org_note, select_org, select_dataset_note, select_dataset])\n",
    "tab = widgets.Tab(children=[select_tab, csv_url, csv_upload])\n",
    "tab.set_title(0, 'Select CSV')\n",
    "tab.set_title(1, 'Enter CSV url')\n",
    "tab.set_title(2, 'Upload CSV')\n",
    "display(widgets.VBox([tab, widgets.HBox([analyse_button, clear_button]), results, status]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "## More information\n",
    "\n",
    "The GLAM CSV Explorer is a [Jupyter](http://jupyter.org/) notebook, combining live Python code with text and widgets in a form that's easy to hack and build upon. The app makes heavy use of [Pandas](https://pandas.pydata.org/), the all-purpose toolkit for working with tabular data. Pandas is quick and powerful, but has so many options it can be difficult to know where to start. You might like to poke around in the code for ideas.\n",
    "\n",
    "To analyse a CSV, the explorer looks at things like the datatype of a column, and the number of unique values it holds. It also applies a variety of regular expressions to look for dates and numbers. Depending on what it finds, it extracts some summary information, and tries to visualise the results using [WordCloud](https://github.com/amueller/word_cloud) and [Altair](https://altair-viz.github.io/index.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n",
    "\n",
    "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}