{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "26e50a28",
   "metadata": {},
   "source": [
    "# Exploratory analysis\n",
    "\n",
    "The purpose of exploratory analysis is to understand your data and any idiosyncrasies which may be relevant to the task of data linking.\n",
    "\n",
    "Splink includes functionality to visualise and summarise your data, helping you identify the characteristics most salient to data linking.\n",
    "\n",
    "In this notebook we perform some basic exploratory analysis, and interpret the results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96a3d08d",
   "metadata": {},
   "source": [
    "### Read in the data\n",
    "\n",
    "For the purpose of this tutorial we will use a 1,000 row synthetic dataset that contains duplicates.\n",
    "\n",
    "The first five rows of this dataset are printed below.\n",
    "\n",
    "Note that the `cluster` column represents the 'ground truth': it tells us which rows refer to the same person. In most real linkage scenarios we wouldn't have this column; it is what Splink is trying to estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ffceed65",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unique_id</th>\n",
       "      <th>first_name</th>\n",
       "      <th>surname</th>\n",
       "      <th>dob</th>\n",
       "      <th>city</th>\n",
       "      <th>email</th>\n",
       "      <th>cluster</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Alan</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>robert255@smith.net</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Allen</td>\n",
       "      <td>1971-05-24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>roberta25@smith.net</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Rob</td>\n",
       "      <td>Allen</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>London</td>\n",
       "      <td>roberta25@smith.net</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Robert</td>\n",
       "      <td>Alen</td>\n",
       "      <td>1971-06-24</td>\n",
       "      <td>Lonon</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Grace</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1997-04-26</td>\n",
       "      <td>Hull</td>\n",
       "      <td>grace.kelly52@jones.com</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   unique_id first_name surname         dob    city                    email  \\\n",
       "0          0     Robert    Alan  1971-06-24     NaN      robert255@smith.net   \n",
       "1          1     Robert   Allen  1971-05-24     NaN      roberta25@smith.net   \n",
       "2          2        Rob   Allen  1971-06-24  London      roberta25@smith.net   \n",
       "3          3     Robert    Alen  1971-06-24   Lonon                      NaN   \n",
       "4          4      Grace     NaN  1997-04-26    Hull  grace.kelly52@jones.com   \n",
       "\n",
       "   cluster  \n",
       "0        0  \n",
       "1        0  \n",
       "2        0  \n",
       "3        0  \n",
       "4        1  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import altair as alt\n",
    "alt.renderers.enable('default')\n",
    "\n",
    "\n",
    "df = pd.read_csv(\"./data/fake_1000.csv\")\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d53e50b",
   "metadata": {},
   "source": [
    "### Instantiate the linker\n",
    "\n",
    "Most of Splink's core functionality can be accessed as methods on a linker object.  For example, to make predictions, you would call `linker.predict()`.\n",
    "\n",
    "We therefore begin by instantiating the linker, passing in the data we wish to deduplicate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "8a1aa029",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialise the linker, passing in the input dataset(s)\n",
    "from splink.duckdb.duckdb_linker import DuckDBLinker\n",
    "linker = DuckDBLinker(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d982939",
   "metadata": {},
   "source": [
    "## Analyse missingness"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfab11e8",
   "metadata": {},
   "source": [
    "It's important to understand the level of missingness in your data, because columns with higher levels of missingness are less useful for data linking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "6dae307c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div id=\"altair-viz-11bdf32a6194402faed4bc35c31ee975\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-11bdf32a6194402faed4bc35c31ee975\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-11bdf32a6194402faed4bc35c31ee975\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm//vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm//vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm//vega-lite@4.17.0?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm//vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"4.17.0\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 400, \"continuousHeight\": 300, \"width\": 400}, \"axis\": {\"labelFontSize\": 11}}, \"title\": \"\", \"layer\": [{\"mark\": \"bar\", \"encoding\": {\"color\": {\"type\": \"quantitative\", \"field\": \"null_proportion\", \"legend\": {\"format\": \".0%\"}, \"scale\": {\"range\": \"heatmap\"}, \"title\": \"Missingness\"}, \"tooltip\": [{\"type\": \"nominal\", \"field\": \"column_name\", \"title\": \"Column\"}, {\"type\": \"quantitative\", \"field\": \"null_count\", \"format\": \",.0f\", \"title\": \"Count of nulls\"}, {\"type\": \"quantitative\", \"field\": \"null_proportion\", \"format\": \".2%\", \"title\": \"Percentage of nulls\"}, {\"type\": \"quantitative\", \"field\": \"total_record_count\", \"format\": \",.0f\", \"title\": \"Total record count\"}], \"x\": {\"type\": \"quantitative\", \"axis\": {\"format\": \"%\", \"title\": \"Percentage of nulls\"}, \"field\": \"null_proportion\"}, \"y\": {\"type\": \"nominal\", \"axis\": {\"title\": \"\"}, \"field\": \"column_name\", \"sort\": \"-x\"}}, \"title\": \"Missingness per column out of 1,000 records\"}], \"data\": {\"values\": [{\"null_proportion\": 0.0, \"null_count\": 0, \"total_record_count\": 1000, \"column_name\": \"source_dataset\"}, {\"null_proportion\": 0.0, \"null_count\": 0, \"total_record_count\": 1000, \"column_name\": \"unique_id\"}, {\"null_proportion\": 0.1690000295639038, \"null_count\": 169, \"total_record_count\": 1000, \"column_name\": \"first_name\"}, {\"null_proportion\": 0.1809999942779541, \"null_count\": 181, \"total_record_count\": 1000, \"column_name\": \"surname\"}, {\"null_proportion\": 0.0, \"null_count\": 0, \"total_record_count\": 1000, \"column_name\": \"dob\"}, {\"null_proportion\": 0.18699997663497925, \"null_count\": 187, \"total_record_count\": 1000, \"column_name\": \"city\"}, {\"null_proportion\": 0.21100002527236938, \"null_count\": 211, \"total_record_count\": 1000, \"column_name\": \"email\"}, {\"null_proportion\": 
0.0, \"null_count\": 0, \"total_record_count\": 1000, \"column_name\": \"cluster\"}], \"name\": \"data-0e7bce5a1d2f132e282789d6ef7780fe\"}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.8.1.json\"}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "<splink.charts.VegaliteNoValidate at 0x12d482c40>"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "linker.missingness_chart()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c35a4e0",
   "metadata": {},
   "source": [
    "The above summary chart shows that in this dataset, the `email`, `city`, `surname` and `first_name` columns contain nulls, but the level of missingness is relatively low (roughly 17% to 21% per column). The remaining columns contain no nulls."
   ]
  },
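  {
   "cell_type": "markdown",
   "id": "7b3e91d4",
   "metadata": {},
   "source": [
    "The chart's figures can be cross-checked directly in pandas. The following is a minimal sketch, not Splink's own implementation: `toy` is a hypothetical stand-in DataFrame, and the share of nulls per column is computed with `isna().mean()`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data standing in for the tutorial dataset\n",
    "toy = pd.DataFrame({\n",
    "    'first_name': ['Robert', 'Rob', None, 'Grace'],\n",
    "    'city': [None, 'London', None, 'Hull'],\n",
    "    'dob': ['1971-06-24', '1971-06-24', '1971-06-24', '1997-04-26'],\n",
    "})\n",
    "\n",
    "# isna() marks nulls as True; the mean of a boolean column is its share of nulls\n",
    "null_proportion = toy.isna().mean().sort_values(ascending=False)\n",
    "print(null_proportion)\n",
    "```\n",
    "\n",
    "Running `df.isna().mean()` on the tutorial data should give proportions in line with the chart, for example about 0.21 for `email`."
   ]
  },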
  {
   "cell_type": "markdown",
   "id": "f11cc6b6",
   "metadata": {},
   "source": [
    "## Analyse the distribution of values in your data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "973cb505",
   "metadata": {},
   "source": [
    "The distribution of values in your data is important for two main reasons:\n",
    "\n",
    "1. Columns with higher cardinality (number of distinct values) are usually more useful for data linking.  For instance, date of birth is a much stronger linkage variable than gender.\n",
    "\n",
    "2. The skew of values is important.  If you have a `city` column that has 1,000 distinct values, but 75% of the records are `London`, this is much less useful for linkage than if the records were spread evenly across the 1,000 values.\n",
    "\n",
    "The `linker.profile_columns()` method creates summary charts to help you understand these aspects of your data. \n",
    "\n",
    "You may input column names (e.g. `first_name`), or arbitrary SQL expressions like `concat(first_name, surname)`."
   ]
  },
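  {
   "cell_type": "markdown",
   "id": "5a2f08c6",
   "metadata": {},
   "source": [
    "Both properties can be approximated for a single column with pandas. This is an illustrative sketch rather than what `profile_columns()` does internally; `toy_city` is a hypothetical series invented for the example.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy column: one dominant value, two rare values and a null\n",
    "toy_city = pd.Series(['London', 'London', 'London', 'Hull', 'Leeds', None], name='city')\n",
    "\n",
    "non_null = toy_city.dropna()\n",
    "\n",
    "# Cardinality: the number of distinct non-null values\n",
    "distinct_values = non_null.nunique()\n",
    "\n",
    "# Skew: the share of non-null rows taken up by the most common value\n",
    "value_counts = non_null.value_counts()  # sorted from most to least common\n",
    "top_share = value_counts.iloc[0] / len(non_null)\n",
    "\n",
    "print(distinct_values, value_counts.index[0], top_share)\n",
    "```\n",
    "\n",
    "A high `top_share` alongside a high distinct count signals the kind of skew described above."
   ]
  },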
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "897d183c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div id=\"altair-viz-60d35faf68104a408445f055465f5b13\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-60d35faf68104a408445f055465f5b13\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-60d35faf68104a408445f055465f5b13\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm//vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm//vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm//vega-lite@4.17.0?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm//vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"4.17.0\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 400, \"continuousHeight\": 300}}, \"vconcat\": [{\"hconcat\": [{\"data\": {\"values\": [{\"percentile_ex_nulls\": 0.966305673122406, \"percentile_inc_nulls\": 0.972000002861023, \"value_count\": 28, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.9470517635345459, \"percentile_inc_nulls\": 0.9559999704360962, \"value_count\": 16, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 16.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.9290012121200562, \"percentile_inc_nulls\": 0.9409999847412109, \"value_count\": 15, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 15.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.9121540188789368, \"percentile_inc_nulls\": 0.9269999861717224, \"value_count\": 14, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 14.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.8965102434158325, \"percentile_inc_nulls\": 0.9139999747276306, \"value_count\": 13, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 13.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.8820698261260986, \"percentile_inc_nulls\": 0.9020000100135803, \"value_count\": 12, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 12.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.8423585891723633, \"percentile_inc_nulls\": 0.8690000176429749, \"value_count\": 11, \"group_name\": \"first_name\", 
\"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 33.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.818291187286377, \"percentile_inc_nulls\": 0.8489999771118164, \"value_count\": 10, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 20.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.7641395926475525, \"percentile_inc_nulls\": 0.8040000200271606, \"value_count\": 9, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 45.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.7352586984634399, \"percentile_inc_nulls\": 0.7799999713897705, \"value_count\": 8, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 24.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.701564371585846, \"percentile_inc_nulls\": 0.7519999742507935, \"value_count\": 7, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.6293622255325317, \"percentile_inc_nulls\": 0.6920000314712524, \"value_count\": 6, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 60.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.539109468460083, \"percentile_inc_nulls\": 0.6169999837875366, \"value_count\": 5, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 75.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.42358607053756714, \"percentile_inc_nulls\": 0.5210000276565552, \"value_count\": 4, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, 
\"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 96.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.33694344758987427, \"percentile_inc_nulls\": 0.4490000009536743, \"value_count\": 3, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 72.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.23826712369918823, \"percentile_inc_nulls\": 0.3669999837875366, \"value_count\": 2, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 82.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.1690000295639038, \"value_count\": 1, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 198.0, \"distinct_value_count\": 335}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 28, \"group_name\": \"first_name\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 335}]}, \"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"encoding\": {\"x\": {\"type\": \"quantitative\", \"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\"}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Count of values\"}, \"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": {\"text\": \"Distribution of counts of values in column first_name\", \"subtitle\": \"In this col, 169 values (16.9%) are null and there are 335 distinct 
values\"}}, {\"data\": {\"values\": [{\"value_count\": 28, \"group_name\": \"first_name\", \"value\": \"Oliver\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 16, \"group_name\": \"first_name\", \"value\": \"Jacob\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 15, \"group_name\": \"first_name\", \"value\": \"Freddie\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 14, \"group_name\": \"first_name\", \"value\": \"Olivia\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 13, \"group_name\": \"first_name\", \"value\": \"James\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 12, \"group_name\": \"first_name\", \"value\": \"George\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 11, \"group_name\": \"first_name\", \"value\": \"Elizabeth\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 11, \"group_name\": \"first_name\", \"value\": \"Alfie\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 11, \"group_name\": \"first_name\", \"value\": \"Jessica\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 10, \"group_name\": \"first_name\", \"value\": \"Logan\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\"}, \"tooltip\": [{\"field\": 
\"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Top 10 values by value count\"}, {\"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"first_name\", \"value\": \"Rob\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 1, \"group_name\": \"first_name\", \"value\": \"Hall\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 1, \"group_name\": \"first_name\", \"value\": \"Lucas\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 1, \"group_name\": \"first_name\", \"value\": \"Luas\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}, {\"value_count\": 1, \"group_name\": \"first_name\", \"value\": \"Rowe\", \"total_non_null_rows\": 831, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 335}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\", \"scale\": {\"domain\": [0, 28]}}, \"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Bottom 5 values by value count\"}]}, {\"hconcat\": [{\"data\": {\"values\": [{\"percentile_ex_nulls\": 0.787207841873169, \"percentile_inc_nulls\": 0.8270000219345093, \"value_count\": 173, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 173.0, 
\"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.7380073666572571, \"percentile_inc_nulls\": 0.7870000004768372, \"value_count\": 40, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 40.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.6974169611930847, \"percentile_inc_nulls\": 0.7540000081062317, \"value_count\": 33, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 33.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.6715867519378662, \"percentile_inc_nulls\": 0.7330000400543213, \"value_count\": 21, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 21.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.6494464874267578, \"percentile_inc_nulls\": 0.7150000333786011, \"value_count\": 18, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 18.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.6076260805130005, \"percentile_inc_nulls\": 0.6809999942779541, \"value_count\": 17, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 34.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.5682656764984131, \"percentile_inc_nulls\": 0.6489999890327454, \"value_count\": 16, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 32.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.5166051387786865, \"percentile_inc_nulls\": 0.6069999933242798, \"value_count\": 14, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 42.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 
0.48462486267089844, \"percentile_inc_nulls\": 0.5809999704360962, \"value_count\": 13, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 26.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.4403443932533264, \"percentile_inc_nulls\": 0.5449999570846558, \"value_count\": 12, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 36.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.41574418544769287, \"percentile_inc_nulls\": 0.5249999761581421, \"value_count\": 10, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 20.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.3936039209365845, \"percentile_inc_nulls\": 0.5069999694824219, \"value_count\": 9, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 18.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.3640836477279663, \"percentile_inc_nulls\": 0.4829999804496765, \"value_count\": 8, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 24.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.31242311000823975, \"percentile_inc_nulls\": 0.44099998474121094, \"value_count\": 7, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 42.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.2829028367996216, \"percentile_inc_nulls\": 0.4169999957084656, \"value_count\": 6, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 24.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.25830256938934326, \"percentile_inc_nulls\": 0.3970000147819519, 
\"value_count\": 5, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 20.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.24354243278503418, \"percentile_inc_nulls\": 0.38499999046325684, \"value_count\": 4, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 12.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.2287822961807251, \"percentile_inc_nulls\": 0.37300002574920654, \"value_count\": 3, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 12.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.19680196046829224, \"percentile_inc_nulls\": 0.34700000286102295, \"value_count\": 2, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 26.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.18699997663497925, \"value_count\": 1, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 160.0, \"distinct_value_count\": 218}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 173, \"group_name\": \"city\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 173.0, \"distinct_value_count\": 218}]}, \"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"encoding\": {\"x\": {\"type\": \"quantitative\", \"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\"}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Count of values\"}, \"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": 
\"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": {\"text\": \"Distribution of counts of values in column city\", \"subtitle\": \"In this col, 187 values (18.7%) are null and there are 218 distinct values\"}}, {\"data\": {\"values\": [{\"value_count\": 173, \"group_name\": \"city\", \"value\": \"London\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 40, \"group_name\": \"city\", \"value\": \"Birmingham\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 33, \"group_name\": \"city\", \"value\": \"Liverpool\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 21, \"group_name\": \"city\", \"value\": \"Coventry\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 18, \"group_name\": \"city\", \"value\": \"Newcastle-upon-Tyne\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 17, \"group_name\": \"city\", \"value\": \"Leeds\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 17, \"group_name\": \"city\", \"value\": \"Manchester\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 16, \"group_name\": \"city\", \"value\": \"Bristol\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 16, \"group_name\": \"city\", \"value\": \"Aberdeen\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 14, \"group_name\": \"city\", \"value\": \"Portsmouth\", \"total_non_null_rows\": 
813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\"}, \"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Top 10 values by value count\"}, {\"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"city\", \"value\": \"Hull\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 1, \"group_name\": \"city\", \"value\": \"Pootsmruth\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 1, \"group_name\": \"city\", \"value\": \"Lunton\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 1, \"group_name\": \"city\", \"value\": \"Lvpreool\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}, {\"value_count\": 1, \"group_name\": \"city\", \"value\": \"Loodon\", \"total_non_null_rows\": 813, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 218}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\", \"scale\": {\"domain\": [0, 173]}}, \"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Bottom 5 values by value count\"}]}, 
{\"hconcat\": [{\"data\": {\"values\": [{\"percentile_ex_nulls\": 0.9719169735908508, \"percentile_inc_nulls\": 0.9769999980926514, \"value_count\": 23, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 23.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.9377289414405823, \"percentile_inc_nulls\": 0.9490000009536743, \"value_count\": 14, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.9059829115867615, \"percentile_inc_nulls\": 0.9229999780654907, \"value_count\": 13, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 26.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.891330897808075, \"percentile_inc_nulls\": 0.9110000133514404, \"value_count\": 12, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 12.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.8778998851776123, \"percentile_inc_nulls\": 0.8999999761581421, \"value_count\": 11, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 11.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.8534798622131348, \"percentile_inc_nulls\": 0.8799999952316284, \"value_count\": 10, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 20.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.831501841545105, \"percentile_inc_nulls\": 0.8619999885559082, \"value_count\": 9, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 18.0, \"distinct_value_count\": 371}, 
{\"percentile_ex_nulls\": 0.7728937864303589, \"percentile_inc_nulls\": 0.8140000104904175, \"value_count\": 8, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 48.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.721611738204956, \"percentile_inc_nulls\": 0.7720000147819519, \"value_count\": 7, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 42.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.66300368309021, \"percentile_inc_nulls\": 0.7239999771118164, \"value_count\": 6, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 48.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.6080585718154907, \"percentile_inc_nulls\": 0.6790000200271606, \"value_count\": 5, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 45.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.5054944753646851, \"percentile_inc_nulls\": 0.5950000286102295, \"value_count\": 4, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 84.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.39560437202453613, \"percentile_inc_nulls\": 0.5049999952316284, \"value_count\": 3, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 90.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.28815627098083496, \"percentile_inc_nulls\": 0.4169999957084656, \"value_count\": 2, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 88.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 0.0, 
\"percentile_inc_nulls\": 0.1809999942779541, \"value_count\": 1, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 236.0, \"distinct_value_count\": 371}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 23, \"group_name\": \"surname\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 23.0, \"distinct_value_count\": 371}]}, \"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"encoding\": {\"x\": {\"type\": \"quantitative\", \"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\"}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Count of values\"}, \"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": {\"text\": \"Distribution of counts of values in column surname\", \"subtitle\": \"In this col, 181 values (18.1%) are null and there are 371 distinct values\"}}, {\"data\": {\"values\": [{\"value_count\": 23, \"group_name\": \"surname\", \"value\": \"Jones\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 14, \"group_name\": \"surname\", \"value\": \"Davies\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 14, \"group_name\": \"surname\", \"value\": \"Taylor\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 13, \"group_name\": \"surname\", \"value\": \"Hall\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, 
{\"value_count\": 13, \"group_name\": \"surname\", \"value\": \"Campbell\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 12, \"group_name\": \"surname\", \"value\": \"Morgan\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 11, \"group_name\": \"surname\", \"value\": \"Smith\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 10, \"group_name\": \"surname\", \"value\": \"King\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 10, \"group_name\": \"surname\", \"value\": \"Russell\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 9, \"group_name\": \"surname\", \"value\": \"Wright\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\"}, \"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Top 10 values by value count\"}, {\"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"surname\", \"value\": \"Alan\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 1, \"group_name\": \"surname\", \"value\": \"Alen\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 1, \"group_name\": \"surname\", \"value\": \"pMurphy\", \"total_non_null_rows\": 819, 
\"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 1, \"group_name\": \"surname\", \"value\": \"Isla\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}, {\"value_count\": 1, \"group_name\": \"surname\", \"value\": \"Coox\", \"total_non_null_rows\": 819, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 371}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\", \"scale\": {\"domain\": [0, 23]}}, \"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Bottom 5 values by value count\"}]}, {\"hconcat\": [{\"data\": {\"values\": [{\"percentile_ex_nulls\": 0.9619771838188171, \"percentile_inc_nulls\": 0.9700000286102295, \"value_count\": 6, \"group_name\": \"email\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 30.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.8479087352752686, \"percentile_inc_nulls\": 0.8799999952316284, \"value_count\": 5, \"group_name\": \"email\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 90.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.6653992533683777, \"percentile_inc_nulls\": 0.7360000014305115, \"value_count\": 4, \"group_name\": \"email\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 144.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.47148287296295166, \"percentile_inc_nulls\": 0.5830000042915344, \"value_count\": 3, \"group_name\": \"email\", \"total_non_null_rows\": 789, 
\"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 153.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.32446134090423584, \"percentile_inc_nulls\": 0.46700000762939453, \"value_count\": 2, \"group_name\": \"email\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 116.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.21100002527236938, \"value_count\": 1, \"group_name\": \"email\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 256.0, \"distinct_value_count\": 424}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 6, \"group_name\": \"email\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 30.0, \"distinct_value_count\": 424}]}, \"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"encoding\": {\"x\": {\"type\": \"quantitative\", \"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\"}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Count of values\"}, \"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": {\"text\": \"Distribution of counts of values in column email\", \"subtitle\": \"In this col, 211 values (21.1%) are null and there are 424 distinct values\"}}, {\"data\": {\"values\": [{\"value_count\": 6, \"group_name\": \"email\", \"value\": \"omoore64@randall.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 6, \"group_name\": \"email\", \"value\": 
\"j.williams@levine-johnson.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 6, \"group_name\": \"email\", \"value\": \"iwilkinson@bush.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 6, \"group_name\": \"email\", \"value\": \"fb@nelson.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 6, \"group_name\": \"email\", \"value\": \"jessica.miller@johnson.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"email\", \"value\": \"t.m39@brooks-sawyer.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"email\", \"value\": \"r.cole1@ramirez-anthony.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"email\", \"value\": \"oliver.atkinson@moran-smith.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"email\", \"value\": \"hollythomson3@levine-jones.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 5, \"group_name\": \"email\", \"value\": \"leahrussell@charles.net\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\"}, \"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, 
{\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Top 10 values by value count\"}, {\"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"email\", \"value\": \"robert255@smith.net\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 1, \"group_name\": \"email\", \"value\": \"evihd56@earris-bailey.net\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 1, \"group_name\": \"email\", \"value\": \"o.griffiths90@reyes-coleman.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 1, \"group_name\": \"email\", \"value\": \"muhammadsmith@brooks.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}, {\"value_count\": 1, \"group_name\": \"email\", \"value\": \"l.c91@perez-gonzalez.com\", \"total_non_null_rows\": 789, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 424}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\", \"scale\": {\"domain\": [0, 6]}}, \"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Bottom 5 values by value count\"}]}, {\"hconcat\": [{\"data\": {\"values\": [{\"percentile_ex_nulls\": 0.9399999976158142, \"percentile_inc_nulls\": 0.9399999976158142, \"value_count\": 30, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 60.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.9110000133514404, 
\"percentile_inc_nulls\": 0.9110000133514404, \"value_count\": 29, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 29.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.8830000162124634, \"percentile_inc_nulls\": 0.8830000162124634, \"value_count\": 28, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 28.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.8289999961853027, \"percentile_inc_nulls\": 0.8289999961853027, \"value_count\": 27, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 54.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.7789999842643738, \"percentile_inc_nulls\": 0.7789999842643738, \"value_count\": 25, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 50.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.7070000171661377, \"percentile_inc_nulls\": 0.7070000171661377, \"value_count\": 24, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 72.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.5690000057220459, \"percentile_inc_nulls\": 0.5690000057220459, \"value_count\": 23, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 138.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.48100000619888306, \"percentile_inc_nulls\": 0.48100000619888306, \"value_count\": 22, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 88.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 
0.4599999785423279, \"percentile_inc_nulls\": 0.4599999785423279, \"value_count\": 21, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 21.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.3799999952316284, \"percentile_inc_nulls\": 0.3799999952316284, \"value_count\": 20, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 80.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.3230000138282776, \"percentile_inc_nulls\": 0.3230000138282776, \"value_count\": 19, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 57.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.26899999380111694, \"percentile_inc_nulls\": 0.26899999380111694, \"value_count\": 18, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 54.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.23500001430511475, \"percentile_inc_nulls\": 0.23500001430511475, \"value_count\": 17, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 34.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.203000009059906, \"percentile_inc_nulls\": 0.203000009059906, \"value_count\": 16, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 32.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.15799999237060547, \"percentile_inc_nulls\": 0.15799999237060547, \"value_count\": 15, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 45.0, \"distinct_value_count\": 61}, 
{\"percentile_ex_nulls\": 0.14399999380111694, \"percentile_inc_nulls\": 0.14399999380111694, \"value_count\": 14, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 14.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.07899999618530273, \"percentile_inc_nulls\": 0.07899999618530273, \"value_count\": 13, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 65.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.05699998140335083, \"percentile_inc_nulls\": 0.05699998140335083, \"value_count\": 11, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 22.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.04799997806549072, \"percentile_inc_nulls\": 0.04799997806549072, \"value_count\": 9, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 9.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.04000002145767212, \"percentile_inc_nulls\": 0.04000002145767212, \"value_count\": 8, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 8.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.03299999237060547, \"percentile_inc_nulls\": 0.03299999237060547, \"value_count\": 7, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 7.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.027000010013580322, \"percentile_inc_nulls\": 0.027000010013580322, \"value_count\": 6, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 6.0, 
\"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.017000019550323486, \"percentile_inc_nulls\": 0.017000019550323486, \"value_count\": 5, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 10.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.013000011444091797, \"percentile_inc_nulls\": 0.013000011444091797, \"value_count\": 4, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 4.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.009999990463256836, \"percentile_inc_nulls\": 0.009999990463256836, \"value_count\": 3, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 3.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.001999974250793457, \"percentile_inc_nulls\": 0.001999974250793457, \"value_count\": 2, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 8.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.0, \"value_count\": 1, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 2.0, \"distinct_value_count\": 61}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 30, \"group_name\": \"substr_dob_1_4_\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 60.0, \"distinct_value_count\": 61}]}, \"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"encoding\": {\"x\": {\"type\": \"quantitative\", \"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\"}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Count of 
values\"}, \"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": {\"text\": \"Distribution of counts of values in column substr(dob, 1,4)\", \"subtitle\": \"In this col, 0 values (0.0%) are null and there are 61 distinct values\"}}, {\"data\": {\"values\": [{\"value_count\": 30, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2011\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 30, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2000\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 29, \"group_name\": \"substr_dob_1_4_\", \"value\": \"1983\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 28, \"group_name\": \"substr_dob_1_4_\", \"value\": \"1984\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 27, \"group_name\": \"substr_dob_1_4_\", \"value\": \"1991\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 27, \"group_name\": \"substr_dob_1_4_\", \"value\": \"1972\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 25, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2017\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 25, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2002\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 24, \"group_name\": 
\"substr_dob_1_4_\", \"value\": \"1974\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 24, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2010\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\"}, \"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Top 10 values by value count\"}, {\"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2031\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 1, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2027\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 2, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2026\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 2, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2030\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}, {\"value_count\": 2, \"group_name\": \"substr_dob_1_4_\", \"value\": \"2028\", \"total_non_null_rows\": 1000, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 61}]}, \"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"value\", \"sort\": \"-y\", \"title\": null}, \"y\": {\"type\": \"quantitative\", \"field\": \"value_count\", \"title\": \"Value count\", \"scale\": {\"domain\": [0, 30]}}, \"tooltip\": [{\"field\": 
\"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}]}, \"title\": \"Bottom 5 values by value count\"}]}], \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.8.1.json\"}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "<splink.charts.VegaliteNoValidate at 0x12c849430>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "linker.profile_columns([\"first_name\", \"city\", \"surname\", \"email\", \"substr(dob, 1,4)\"], top_n=10, bottom_n=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c5d3a2d",
   "metadata": {},
   "source": [
    "This chart is very information-dense, but here are some key takehomes relevant to our linkage:\n",
    "\n",
    "- There is strong skew in the `city` field with around 20% of the values being `London`.  We therefore will probably want to use `term_frequency_adjustments` in our linkage model, so that it can weight a match on London differently to a match on, say, `Norwich`.\n",
    "\n",
    "- Looking at the \"Bottom 5 values by value count\", we can see typos in the data in most fields.  This tells us this information was possibly entered by hand, or using Optical Character Recognition, giving us an insight into the type of data entry errors we may see.\n",
    "\n",
    "- Email is a much more uniquely-identifying field than any others, with a maximum value count of 6.  It's likely to be a strong linking variable."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f37cb1e",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "At this point, we have begin to develop a strong understanding of our data.  It's time to move on to estimating a linkage model\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be935e71",
   "metadata": {},
   "source": [
    "## Further reading\n",
    "\n",
    "You can find the documentation for the exploratory analysis tools in Splink [here](https://moj-analytical-services.github.io/splink/linkerexp.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "splink_demos",
   "language": "python",
   "name": "splink_demos"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "vscode": {
   "interpreter": {
    "hash": "83cd1825940a26b927f4456d916a72166c792cbca23141876bf335b1893d7d4c"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}